feat: add V2 protocol simulator and enginev2 sender/session prototype
Adds sw-block/ directory with:
- distsim: protocol correctness simulator (96 tests)
  - cluster model with epoch fencing, barrier semantics, commit modes
  - endpoint identity, control-plane flow, candidate eligibility
  - timeout events, timer races, same-tick ordering
  - session ownership tracking with ID-based stale fencing
- enginev2: standalone V2 sender/session implementation (63 tests)
  - per-replica Sender with identity-preserving reconciliation
  - RecoverySession with FSM phase transitions and session ID
  - execution APIs: BeginConnect, RecordHandshake, BeginCatchUp, RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated
  - recovery outcome branching: zero-gap, catch-up, needs-rebuild
  - assignment-intent orchestration with epoch fencing
- design docs: acceptance criteria, open questions, first-slice spec, protocol development process
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Branch: feature/sw-block
90 changed files with 19363 additions and 0 deletions
4     sw-block/.gocache_v2/README
1     sw-block/.gocache_v2/trim.txt
27    sw-block/.private/README.md
36    sw-block/.private/phase/README.md
97    sw-block/.private/phase/phase-01-decisions.md
67    sw-block/.private/phase/phase-01-log.md
11    sw-block/.private/phase/phase-01-v2-scenarios.md
164   sw-block/.private/phase/phase-01.md
51    sw-block/.private/phase/phase-02-decisions.md
93    sw-block/.private/phase/phase-02-log.md
191   sw-block/.private/phase/phase-02.md
97    sw-block/.private/phase/phase-03-decisions.md
36    sw-block/.private/phase/phase-03-log.md
193   sw-block/.private/phase/phase-03.md
97    sw-block/.private/phase/phase-04-decisions.md
46    sw-block/.private/phase/phase-04-log.md
153   sw-block/.private/phase/phase-04.md
49    sw-block/.private/phase/phase-04a-decisions.md
22    sw-block/.private/phase/phase-04a-log.md
113   sw-block/.private/phase/phase-04a.md
18    sw-block/README.md
26    sw-block/design/README.md
288   sw-block/design/protocol-development-process.md
252   sw-block/design/protocol-version-simulation.md
314   sw-block/design/v1-v15-v2-comparison.md
281   sw-block/design/v1-v15-v2-simulator-goals.md
280   sw-block/design/v2-acceptance-criteria.md
234   sw-block/design/v2-dist-fsm.md
159   sw-block/design/v2-first-slice-sender-ownership.md
193   sw-block/design/v2-first-slice-session-ownership.md
161   sw-block/design/v2-open-questions.md
239   sw-block/design/v2-prototype-roadmap-and-gates.md
249   sw-block/design/v2-scenario-sources-from-v1.md
638   sw-block/design/v2_scenarios.md
359   sw-block/design/wal-replication-v2-orchestrator.md
632   sw-block/design/wal-replication-v2-state-machine.md
401   sw-block/design/wal-replication-v2.md
349   sw-block/design/wal-v1-to-v2-mapping.md
277   sw-block/design/wal-v2-tiny-prototype.md
14    sw-block/private/README.md
23    sw-block/prototype/README.md
1120  sw-block/prototype/distsim/cluster.go
1004  sw-block/prototype/distsim/cluster_test.go
BIN   sw-block/prototype/distsim/distsim.test.exe
266   sw-block/prototype/distsim/eventsim.go
213   sw-block/prototype/distsim/phase02_advanced_test.go
445   sw-block/prototype/distsim/phase02_candidate_test.go
371   sw-block/prototype/distsim/phase02_network_test.go
359   sw-block/prototype/distsim/phase02_test.go
434   sw-block/prototype/distsim/phase02_v1_failures_test.go
287   sw-block/prototype/distsim/phase03_p2_race_test.go
281   sw-block/prototype/distsim/phase03_race_test.go
333   sw-block/prototype/distsim/phase03_timeout_test.go
243   sw-block/prototype/distsim/phase04a_ownership_test.go
102   sw-block/prototype/distsim/protocol.go
84    sw-block/prototype/distsim/protocol_test.go
256   sw-block/prototype/distsim/random.go
43    sw-block/prototype/distsim/random_test.go
95    sw-block/prototype/distsim/reference.go
66    sw-block/prototype/distsim/reference_test.go
581   sw-block/prototype/distsim/simulator.go
285   sw-block/prototype/distsim/simulator_test.go
129   sw-block/prototype/distsim/storage.go
64    sw-block/prototype/enginev2/assignment.go
420   sw-block/prototype/enginev2/execution_test.go
3     sw-block/prototype/enginev2/go.mod
39    sw-block/prototype/enginev2/outcome.go
482   sw-block/prototype/enginev2/p2_test.go
347   sw-block/prototype/enginev2/sender.go
119   sw-block/prototype/enginev2/sender_group.go
203   sw-block/prototype/enginev2/sender_group_test.go
407   sw-block/prototype/enginev2/sender_test.go
151   sw-block/prototype/enginev2/session.go
162   sw-block/prototype/fsmv2/apply.go
37    sw-block/prototype/fsmv2/events.go
73    sw-block/prototype/fsmv2/fsm.go
95    sw-block/prototype/fsmv2/fsm_test.go
BIN   sw-block/prototype/fsmv2/fsmv2.test.exe
37    sw-block/prototype/run-tests.ps1
148   sw-block/prototype/volumefsm/events.go
38    sw-block/prototype/volumefsm/format.go
142   sw-block/prototype/volumefsm/model.go
421   sw-block/prototype/volumefsm/model_test.go
70    sw-block/prototype/volumefsm/recovery.go
61    sw-block/prototype/volumefsm/scenario.go
BIN   sw-block/prototype/volumefsm/volumefsm.test.exe
17    sw-block/test/README.md
1675  sw-block/test/test_db.md
105   sw-block/test/test_db_v2.md
115   sw-block/test/v2_selected.md
@@ -0,0 +1,4 @@
This directory holds cached build artifacts from the Go build system.
Run "go clean -cache" if the directory is getting too large.
Run "go clean -fuzzcache" to delete the fuzz cache.
See go.dev to learn more about Go.
@@ -0,0 +1 @@
1774577367
@@ -0,0 +1,27 @@
# .private

Private working area for `sw-block`.

Use this for:
- phase development notes
- roadmap/progress tracking
- draft handoff notes
- temporary design comparisons
- prototype scratch work not ready for `design/` or `prototype/`

Recommended layout:
- `.private/phase/`: phase-by-phase development notes
- `.private/roadmap/`: short-term and medium-term execution notes
- `.private/handoff/`: notes for `sw`, `qa`, or future sessions

Phase protocol:
- each phase should normally have:
  - `phase-xx.md`
  - `phase-xx-log.md`
  - `phase-xx-decisions.md`
- details are defined in `.private/phase/README.md`

Promotion rules:
- stable vision/design docs go to `../design/`
- real prototype code stays in `../prototype/`
- `.private/` is for working material, not source of truth
@@ -0,0 +1,36 @@
# Phase Dev

Use this directory for private phase development notes.

## Phase Protocol

Each phase should use this file set:

- `phase-01.md`
  - plan
  - scope
  - progress
  - active tasks
  - exit criteria
- `phase-01-log.md`
  - dated development log
  - experiments
  - test runs
  - failures and findings
- `phase-01-decisions.md`
  - key algorithm decisions
  - tradeoffs
  - rejected alternatives

Suggested naming pattern:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`
- `phase-02.md`
- `phase-02-log.md`
- `phase-02-decisions.md`

Rules of use:
1. if it describes what we are doing -> `phase-xx.md`
2. if it describes what happened -> `phase-xx-log.md`
3. if it explains why we chose something -> `phase-xx-decisions.md`
@@ -0,0 +1,97 @@
# Phase 01 Decisions

Date: 2026-03-26
Status: active

## Purpose

Capture the key design decisions made during Phase 01 simulator work.

## Initial Decisions

### 1. `design/` vs `.private/phase/`

Decision:
- `sw-block/design/` holds shared design truth
- `sw-block/.private/phase/` holds execution planning and progress

Reason:
- the design backlog and the execution checklist should not be mixed

### 2. Scenario source of truth

Decision:
- `sw-block/design/v2_scenarios.md` is the scenario backlog and coverage matrix

Reason:
- all contributors need one visible scenario list

### 3. Phase 01 priority

Decision:
- close first:
  - `S19`
  - `S20`

Reason:
- they are the biggest remaining distributed lineage/partition scenarios

### 4. Current simulator scope

Decision:
- use the simulator as a V2 design-validation tool, not a product/perf harness

Reason:
- the current goal is correctness and protocol coverage, not productization

### 5. Phase execution format

Decision:
- keep phase execution in three files:
  - `phase-xx.md`
  - `phase-xx-log.md`
  - `phase-xx-decisions.md`

Reason:
- separates plan, evidence, and reasoning
- reduces drift between roadmap and findings

### 6. Design backlog vs execution plan

Decision:
- `sw-block/design/v2_scenarios.md` remains the source of truth for the scenario backlog and coverage
- `.private/phase/phase-01.md` is the execution layer for `sw`

Reason:
- design truth should be stable and shareable
- execution tasks should be easy to edit without polluting design docs

### 7. Immediate Phase 01 priorities

Decision:
- prioritize:
  - `S19` chain of custody across multiple promotions
  - `S20` live partition with competing writes

Reason:
- these are the biggest remaining distributed-lineage gaps after the current simulator milestone

### 8. Coverage status should be conservative

Decision:
- mark scenarios as `partial` unless the test actually exercises the core protocol obligation, not just a simplified happy path

Reason:
- avoids overstating simulator coverage
- keeps the backlog honest for follow-up strengthening

### 9. Protocol-version comparison belongs in the simulator

Decision:
- compare `V1`, `V1.5`, and `V2` using the same scenario set where possible

Reason:
- this is the clearest way to show:
  - where V1 breaks
  - where V1.5 improves but still strains
  - why V2 is architecturally cleaner
@@ -0,0 +1,67 @@
# Phase 01 Log

Date: 2026-03-26
Status: active

## Log Protocol

Use dated entries like:

## 2026-03-26
- work completed
- tests run
- failures found
- seeds/traces worth keeping
- follow-up items

## Initial State

- Phase 01 created from the earlier `phase-01-v2-scenarios.md` working note
- scenario source of truth remains:
  - `sw-block/design/v2_scenarios.md`
- current active asks for `sw`:
  - `S19`
  - `S20`

## 2026-03-26

- created Phase 01 file set:
  - `phase-01.md`
  - `phase-01-log.md`
  - `phase-01-decisions.md`
- promoted scenario execution checklist into `phase-01.md`
- kept `sw-block/design/v2_scenarios.md` as the shared backlog and coverage matrix
- current simulator milestone:
  - `fsmv2` passing
  - `volumefsm` passing
  - `distsim` passing
  - randomized `distsim` seeds passing
  - event/interleaving simulator work present in `sw-block/prototype/distsim/simulator.go`
- current immediate development priority for `sw`:
  - implement `S19`
  - implement `S20`
- `sw` added Phase 01 P0/P1 scenario tests in `distsim`:
  - `S19`
  - `S20`
  - `S5`
  - `S6`
  - `S18`
  - stronger `S12`
- review result:
  - `S19` looks solid
  - stronger `S12` now looks solid
  - `S20`, `S5`, `S6`, `S18` are better classified as `partial` than fully closed
- updated `v2_scenarios.md` coverage matrix to reflect actual status
- next development focus:
  - P2 scenarios
  - stronger versions of current partial scenarios
- added protocol-version comparison design:
  - `sw-block/design/protocol-version-simulation.md`
- added minimal protocol policy prototype in `distsim`:
  - `ProtocolV1`
  - `ProtocolV15`
  - `ProtocolV2`
- focused on:
  - catch-up policy
  - tail-chasing outcome policy
  - restart/rejoin policy
@@ -0,0 +1,11 @@
# Deprecated

This file is deprecated.

Use instead:
- `phase-01.md`
- `phase-01-log.md`
- `phase-01-decisions.md`

The scenario source of truth remains:
- `sw-block/design/v2_scenarios.md`
@@ -0,0 +1,164 @@
# Phase 01

Date: 2026-03-26
Status: completed
Purpose: drive V2 simulator development by closing the scenario backlog in `sw-block/design/v2_scenarios.md`

## Goal

Make the V2 simulator cover the important protocol scenarios as explicitly as possible.

This phase is about:
- simulator fidelity
- scenario coverage
- invariant quality

This phase is not about:
- product integration
- SPDK
- raw allocator
- production transport

## Source Of Truth

Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`

Prototype code:
- `sw-block/prototype/fsmv2/`
- `sw-block/prototype/volumefsm/`
- `sw-block/prototype/distsim/`

## Assigned Tasks For `sw`

### P0

1. `S19` chain of custody across multiple promotions
   - add fixed test(s)
   - verify committed data survives `A -> B -> C`
   - update the coverage matrix

2. `S20` live partition with competing writes
   - add fixed test(s)
   - the stale side must not advance the committed lineage
   - update the coverage matrix

### P1

3. `S5` flapping replica stays recoverable
   - repeated disconnect/reconnect
   - no unnecessary rebuild while recovery remains possible

4. `S6` tail-chasing under load
   - primary keeps writing while the replica catches up
   - explicit outcome:
     - converge and promote
     - or abort to rebuild

5. `S18` primary restart without failover
   - same-lineage restart behavior
   - no stale session assumptions

6. stronger `S12`
   - more than one promotion candidate
   - choose the valid lineage, not merely the highest apparent LSN

### P2

7. protocol-version comparison support
   - model:
     - `V1`
     - `V1.5`
     - `V2`
   - use the same scenario set to show:
     - V1 breaks
     - V1.5 improves but still strains
     - V2 handles recovery more explicitly

8. richer Smart WAL scenarios
   - time-varying `ExtentReferenced` availability
   - recoverable-then-unrecoverable transitions

9. delayed/drop network scenarios beyond simple disconnect

10. multi-node reservation expiry / rebuild timeout cases

## Invariants To Preserve

After every scenario or random run, preserve:

1. committed data is durable per policy
2. uncommitted data is not revived as committed
3. stale epoch traffic does not mutate the current lineage
4. the recovered/promoted node matches the reference state at the target `LSN`
5. the committed prefix remains contiguous
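The invariants above are mechanical enough to check after every run. A minimal sketch of such a checker in Go; the `Node` type, its fields, and the reference shape are hypothetical names for illustration, not the actual `distsim` types:

```go
package main

// Hypothetical post-run invariant checks; names do not match the real
// distsim types, they only illustrate invariants 4 and 5 above.

type Node struct {
	Committed []int64 // LSNs in committed order
}

// committedPrefixContiguous checks invariant 5: the committed prefix
// has no gaps in its LSN sequence.
func committedPrefixContiguous(n Node) bool {
	for i := 1; i < len(n.Committed); i++ {
		if n.Committed[i] != n.Committed[i-1]+1 {
			return false
		}
	}
	return true
}

// matchesReferenceAt checks invariant 4: the node's committed entries
// up to targetLSN equal the reference oracle's entries.
func matchesReferenceAt(n Node, reference []int64, targetLSN int64) bool {
	for i, lsn := range n.Committed {
		if lsn > targetLSN {
			return true
		}
		if i >= len(reference) || reference[i] != lsn {
			return false
		}
	}
	return true
}
```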

## Required Updates Per Task

For each completed scenario:

1. add or update test(s)
2. update `sw-block/design/v2_scenarios.md`
   - package
   - test name
   - status
3. note any missing simulator capability

## Current Progress

Already in place before this phase:
- `fsmv2` local FSM prototype
- `volumefsm` orchestrator prototype
- `distsim` distributed simulator
- randomized `distsim` runs
- first event/interleaving simulator work in `distsim/simulator.go`

Open focus:
- `S19` covered in `distsim`
- `S20` partially covered in `distsim`
- `S5` partially covered in `distsim`
- `S6` partially covered in `distsim`
- `S18` partially covered in `distsim`
- stronger `S12` covered in `distsim`
- protocol-version comparison design added in:
  - `sw-block/design/protocol-version-simulation.md`
- remaining focus is now P2 plus stronger versions of the partial scenarios

## Phase Status

### P0

- `S19` chain of custody across multiple promotions: done
- `S20` live partition with competing writes: partial

### P1

- `S5` flapping replica stays recoverable: partial
- `S6` tail-chasing under load: partial
- `S18` primary restart without failover: partial
- stronger `S12`: done

### P2

- active next step:
  - protocol-version comparison support
  - stronger versions of the current partial scenarios

## Exit Criteria

Phase 01 is done when:

1. `S19` and `S20` are covered
2. `S5`, `S6`, `S18`, and stronger `S12` are at least partially covered
3. the coverage matrix in `v2_scenarios.md` is current
4. random simulation still passes after the added scenarios

## Completion Note

Phase 01 completed with:
- `S19` covered
- stronger `S12` covered
- `S20`, `S5`, `S6`, `S18` strengthened but correctly left as `partial`

Next execution phase:
- `sw-block/.private/phase/phase-02.md`
@@ -0,0 +1,51 @@
# Phase 02 Decisions

Date: 2026-03-26
Status: active

## Decision 1: Extend `distsim` Instead Of Forking A New Protocol Simulator

Reason:
- current `distsim` already has:
  - node/storage model
  - coordinator/epoch model
  - reference oracle
  - randomized runs
- the missing layer is protocol-state fidelity, not a new simulation foundation

Implication:
- add lightweight per-node replication state and protocol decisions to `distsim`
- do not build a separate fourth simulator yet

## Decision 2: Keep Coverage Status Conservative

Reason:
- `S20`, `S6`, and `S18` currently prove important safety properties
- but they do not yet fully assert message-level or explicit state-transition behavior

Implication:
- leave them `partial` until the model can assert protocol behavior directly

## Decision 3: Use Versioned Scenario Comparison To Justify V2

Reason:
- the simulator should not only say "V2 works"
- it should show:
  - where `V1` fails
  - where `V1.5` improves but still strains
  - why `V2` is worth the complexity

Implication:
- Phase 02 includes explicit `V1` / `V1.5` / `V2` scenario comparison work

## Decision 4: V2 Must Not Be Described As "Always Catch-Up"

Reason:
- that wording is too optimistic and hides the real V2 design rule
- V2 is better because it makes recoverability explicit, not because it retries forever

Implication:
- describe V2 as:
  - catch-up if explicitly recoverable
  - otherwise explicit rebuild
- keep this wording consistent in tests and docs
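The wording above, together with the zero-gap outcome named in the commit summary, reduces to a small decision function. A hedged sketch in Go; the function name, parameters, and the WAL-tail recoverability test are illustrative assumptions, not the prototype's real API:

```go
package main

// Illustrative recovery outcome; the real enginev2 code may name these
// differently. The three branches mirror "zero-gap, catch-up,
// needs-rebuild" from the commit summary.
type Outcome int

const (
	ZeroGap Outcome = iota
	CatchUp
	NeedsRebuild
)

// decideRecovery encodes the V2 rule: catch up only when the gap is
// explicitly recoverable (here, every missing entry is still in the
// primary's retained WAL tail); otherwise an explicit rebuild. There
// is no "retry forever" branch.
func decideRecovery(replicaLSN, primaryLSN, walTailStart int64) Outcome {
	switch {
	case replicaLSN == primaryLSN:
		return ZeroGap
	case replicaLSN+1 >= walTailStart:
		return CatchUp
	default:
		return NeedsRebuild
	}
}
```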
@@ -0,0 +1,93 @@
# Phase 02 Log

Date: 2026-03-26
Status: active

## 2026-03-26

- Phase 02 created to move `distsim` from final-state safety validation toward explicit protocol-state simulation.
- Initial focus:
  - close `S20`, `S6`, and `S18` at the protocol level
  - compare `V1`, `V1.5`, and `V2` on the same scenarios
- Known model gap at phase start:
  - current `distsim` is strong at final-state safety invariants
  - current `distsim` is weaker at mid-flow protocol assertions and message-level rejection reasons
- Phase 02 progress now in place:
  - delivery accept/reject tracking
  - protocol-level stale-epoch rejection assertions
  - explicit non-convergent catch-up state transition assertions
  - initial version-comparison tests for disconnect, tail-chasing, and restart/rejoin policy
- Next simulator target:
  - reproduce real `V1.5` address-instability and control-plane-recovery failures as named scenarios
- Immediate coding asks for `sw`:
  - changed-address restart failure in `V1.5`
  - same-address transient outage comparison across `V1` / `V1.5` / `V2`
  - slow control-plane reassignment scenario derived from `CP13-8 T4b`
- Local housekeeping done:
  - corrected V2 wording from "always catch-up" to "catch-up if explicitly recoverable; otherwise rebuild"
  - added explicit brief-disconnect and changed-address restart policy helpers
  - verified the `distsim` test suite still passes with the Windows-safe runner
- Scenario status update:
  - `S20` now covered via protocol-level stale-traffic rejection + committed-prefix stability
  - `S6` now covered via explicit `CatchingUp -> NeedsRebuild` assertions
  - `S18` now covered via explicit stale `MsgBarrierAck` rejection + prefix stability
- Next asks for `sw` after this closure:
  - changed-address restart scenario tied directly to `CP13-8 T4b`
  - same-address transient outage comparison across `V1` / `V1.5` / `V2`
  - slow control-plane reassignment scenario
  - Smart WAL recoverable -> unrecoverable transition scenarios
- Additional closure completed:
  - `S5` now covered with both:
    - repeated recoverable flapping
    - budget-exceeded escalation to `NeedsRebuild`
  - Smart WAL transitions now exercised with:
    - recoverable -> unrecoverable during active recovery
    - mixed `WALInline` + `ExtentReferenced` success
    - time-varying payload availability
- Updated next asks for `sw`:
  - changed-address restart scenario tied directly to `CP13-8 T4b`
  - same-address transient outage comparison across `V1` / `V1.5` / `V2`
  - slow control-plane reassignment scenario
  - delayed/drop network beyond simple disconnect
  - multi-node reservation expiry / rebuild timeout cases
- Additional Phase 02 coverage delivered:
  - delayed stale messages after promote/failover
  - delayed stale barrier ack rejection
  - selective write-drop with barrier delivery under `sync_all`
  - multi-node mixed reservation expiry outcome
  - multi-node `NeedsRebuild` / snapshot rebuild recovery
  - partial rebuild timeout / retry completion
- Remaining asks are now narrower:
  - changed-address restart scenario tied directly to `CP13-8 T4b`
  - same-address transient outage comparison across `V1` / `V1.5` / `V2`
  - slow control-plane reassignment scenario
  - stronger coordinator candidate-selection scenarios
- Additional closure after review:
  - the safe default promotion selector now refuses `NeedsRebuild` candidates
  - an explicit desperate-promotion API is separated from safe selection
  - changed-address and slow-control-plane comparison tests now prove actual data divergence / healing, not only policy shape
- New next-step assignment:
  - strengthen model depth around endpoint identity and control-plane reassignment
  - replace abstract repair helpers with more explicit event flow where practical
  - reduce direct recovery state injection in comparison tests
  - extend candidate selection from ranking into validity rules

## 2026-03-27

- Phase 02 core simulator hardening is effectively complete.
- Delivered since the previous checkpoint:
  - endpoint identity / endpoint-version modeling
  - stale-endpoint rejection in the delivery path
  - heartbeat -> coordinator detect -> assignment-update control-plane flow
  - recovery-session trigger API for `V1.5` and `V2`
  - explicit candidate eligibility checks:
    - running
    - epoch alignment
    - state eligibility
    - committed-prefix sufficiency
  - safe default promotion now rejects candidates without the committed prefix
- Current `distsim` status at latest review:
  - 73 tests passing
- Manager bookkeeping decision:
  - keep Phase 02 active only for doc maintenance / wrap-up
  - treat further simulator depth as likely Phase 03 work, not unbounded Phase 02 scope creep
@@ -0,0 +1,191 @@
# Phase 02

Date: 2026-03-27
Status: active
Purpose: extend the V2 simulator from final-state safety checking into protocol-state simulation that can reproduce `V1`, `V1.5`, and `V2` behavior on the same scenarios

## Goal

Make the simulator model enough node-local replication state and message-level behavior to:

1. reproduce `V1` / `V1.5` failure modes
2. show why those failures are structural
3. close the current `partial` V2 scenarios with stronger protocol assertions

This phase is about:
- protocol-version comparison
- per-node replication state
- message-level fencing / accept / reject behavior
- explicit catch-up abort / rebuild transitions

This phase is not about:
- product integration
- production transport
- SPDK
- raw allocator

## Source Of Truth

Design/source-of-truth:
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/protocol-version-simulation.md`
- `sw-block/design/v1-v15-v2-simulator-goals.md`

Prototype code:
- `sw-block/prototype/distsim/`

## Assigned Tasks For `sw`

### P0

1. Add per-node replication state to `distsim`
   - minimum states:
     - `InSync`
     - `Lagging`
     - `CatchingUp`
     - `NeedsRebuild`
     - `Rebuilding`
   - keep state lightweight; do not clone full `fsmv2` into `distsim`
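The minimum state set in task 1 can be sketched as a small enum plus an explicit transition table. The transitions below are a plausible guess for illustration, not the actual `distsim` rules:

```go
package main

// Sketch of lightweight per-node replication state. The transition
// table is illustrative; for example, a non-convergent catch-up
// escalates to NeedsRebuild rather than looping.

type ReplState int

const (
	InSync ReplState = iota
	Lagging
	CatchingUp
	NeedsRebuild
	Rebuilding
)

// legalTransitions is a minimal guess at the allowed moves.
var legalTransitions = map[ReplState][]ReplState{
	InSync:       {Lagging},
	Lagging:      {CatchingUp, NeedsRebuild},
	CatchingUp:   {InSync, NeedsRebuild},
	NeedsRebuild: {Rebuilding},
	Rebuilding:   {InSync, NeedsRebuild},
}

// canTransition reports whether moving from one state to another is
// allowed by the table above.
func canTransition(from, to ReplState) bool {
	for _, next := range legalTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```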

2. Add message-level protocol decisions
   - stale-epoch write / ship / barrier traffic must be explicitly rejected
   - record whether a message was:
     - accepted
     - rejected by epoch
     - rejected by state
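Task 2's accept/reject record can be sketched as an explicit verdict type. The types, field names, and the check ordering here are assumptions for illustration only:

```go
package main

// Illustrative delivery decision; the real distsim types and rejection
// reasons may be named differently.

type Verdict int

const (
	Accepted Verdict = iota
	RejectedByEpoch
	RejectedByState
)

type Msg struct {
	Epoch int64
}

type Replica struct {
	Epoch     int64
	CanAccept bool // e.g. false while in NeedsRebuild
}

// classifyDelivery applies epoch fencing first, then state
// eligibility, and returns an explicit verdict so tests can assert
// the rejection reason rather than infer it from final state.
func classifyDelivery(r Replica, m Msg) Verdict {
	if m.Epoch < r.Epoch {
		return RejectedByEpoch // stale-epoch traffic is explicitly fenced
	}
	if !r.CanAccept {
		return RejectedByState
	}
	return Accepted
}
```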

3. Add explicit catch-up abort / rebuild entry
   - non-convergent catch-up must move to an explicit modeled failure:
     - `NeedsRebuild`
     - or an equivalent abort outcome

### P1

4. Re-close `S20` at the protocol level
   - stale-side writes must go through the protocol delivery path
   - prove stale-side traffic cannot advance the committed lineage

5. Re-close `S6` at the protocol level
   - assert explicit abort/escalation on non-convergence
   - not only final-state safety

6. Re-close `S18` at the protocol level
   - assert committed-prefix behavior around delayed old-ack / restart races
   - not only final-state oracle checks

### P2

7. Expand protocol-version comparison
   - run selected scenarios under:
     - `V1`
     - `V1.5`
     - `V2`
   - at minimum:
     - brief disconnect
     - restart with changed address
     - tail-chasing

8. Add V1.5-derived failure scenarios
   - replica restart with changed receiver address
   - same-address transient outage
   - slow control-plane recovery vs fast local reconnect

9. Prepare richer recovery modeling
   - time-varying recoverability
   - reservation loss during active catch-up
   - rebuild timeout / retry in a mixed-state cluster

## Invariants To Preserve

After every scenario or random run, preserve:

1. committed data is durable per policy
2. uncommitted data is not revived as committed
3. stale epoch traffic does not mutate the current lineage
4. the recovered/promoted node matches the reference state at the target `LSN`
5. the committed prefix remains contiguous
6. protocol-state transitions are explicit, not inferred from final data only

## Required Updates Per Task

For each completed task:

1. add or update test(s)
2. update `sw-block/design/v2_scenarios.md`
   - package
   - test name
   - status
   - source, if a new scenario was derived from V1/V1.5 behavior
3. add a short note to:
   - `sw-block/.private/phase/phase-02-log.md`
4. if a design choice changed, record it in:
   - `sw-block/.private/phase/phase-02-decisions.md`

## Current Progress

Already in place before this phase:
- `distsim` final-state safety invariants
- randomized simulation
- event/interleaving simulator work
- initial `ProtocolVersion` / policy scaffold
- `S19` covered
- stronger `S12` covered

Known partials to close in this phase:
- none in the current named backlog slice

Delivered in this phase so far:
- delivery accept/reject tracking added
- protocol-level rejection assertions added
- explicit `CatchingUp -> NeedsRebuild` state transition tested
- selected protocol-version comparison tests added
- `S20`, `S6`, and `S18` moved from `partial` to `covered`
- Smart WAL transition scenarios added
- `S5` moved from `partial` to `covered`
- endpoint identity / endpoint-version modeling added
- explicit heartbeat -> detect -> assignment-update control-plane flow added for changed-address restart
- explicit recovery-session triggers added for `V1.5` and `V2`
- promotion selection now uses explicit eligibility, including committed-prefix gating
- safe and desperate promotion paths are separated
- full `distsim` suite at latest review: 73 tests passing
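The eligibility checks delivered above (running, epoch alignment, state eligibility, committed-prefix sufficiency) compose into a single gate. A hedged sketch in Go; `Candidate` and its fields are invented names, not the `distsim` API:

```go
package main

// Illustrative candidate-eligibility gate. A safe default selector
// would only consider candidates for which eligible returns true,
// leaving everything else to a separate, explicitly "desperate"
// promotion path.

type Candidate struct {
	Running      bool
	Epoch        int64
	NeedsRebuild bool
	CommittedLSN int64
}

// eligible applies every gate in turn and returns false on the first
// failed check.
func eligible(c Candidate, clusterEpoch, requiredCommittedLSN int64) bool {
	if !c.Running {
		return false
	}
	if c.Epoch != clusterEpoch {
		return false // epoch alignment
	}
	if c.NeedsRebuild {
		return false // state eligibility
	}
	return c.CommittedLSN >= requiredCommittedLSN // committed-prefix sufficiency
}
```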

Remaining focus for `sw`:
- Phase 02 core scope is now largely delivered
- remaining work should be treated as future strengthening, not baseline closure
- if more simulator depth is needed next, it should likely start as Phase 03:
  - timeout semantics
  - timer races
  - richer event/interleaving behavior
  - stronger endpoint/control-plane realism beyond the current abstract model

## Immediate Next Tasks For `sw`

1. Add a documented compare artifact for new scenarios
   - for each new `V1` / `V1.5` / `V2` comparison, record:
     - the scenario name
     - what fails in `V1`
     - what improves in `V1.5`
     - what is explicit in `V2`
   - keep `sw-block/design/v1-v15-v2-comparison.md` updated

2. Keep the coverage matrix honest
   - do not mark a scenario `covered` unless the test asserts protocol behavior directly
   - final-state oracle checks alone are not enough

3. Prepare a Phase 03 proposal instead of broadening ad hoc
   - if more depth is needed, define it cleanly first:
     - timers / timeout events
     - event ordering races
     - richer endpoint lifecycle
     - recovery-session uniqueness across competing triggers

## Exit Criteria

Phase 02 is done when:

1. `S5`, `S6`, `S18`, and `S20` are covered at the protocol level
2. `distsim` can reproduce at least one `V1` failure, one `V1.5` failure, and the corresponding `V2` behavior on the same named scenario
3. protocol-level rejection/accept behavior is asserted in tests, not only inferred from final-state oracle checks
4. the coverage matrix in `v2_scenarios.md` is current
5. changed-address and reconnect scenarios are modeled through explicit endpoint / control-plane behavior rather than helper-only abstraction
6. promotion selection uses explicit eligibility, including committed-prefix safety
@@ -0,0 +1,97 @@
# Phase 03 Decisions

Date: 2026-03-27
Status: initial

## Why Phase 03 Exists

Phase 02 already covered the main protocol-state story:

- V1 / V1.5 / V2 comparison
- stale traffic rejection
- catch-up vs rebuild
- changed-address restart control-plane flow
- committed-prefix-safe promotion eligibility

The next simulator problems are different:

- timer semantics
- timeout races
- event ordering under contention

That deserves a separate phase so the model boundary stays clear.

## Initial Boundary

### `distsim`

Keep for:

- protocol correctness
- reference-state validation
- recoverability logic
- promotion / lineage rules

### `eventsim`

Grow for:

- explicit event queue behavior
- timeout events
- equal-time scheduling choices
- race exploration

## Working Rule

Do not move all scenarios into `eventsim`.

Only move or duplicate scenarios when:

- timer or event ordering is the real bug surface
- the `distsim` abstraction hides the important behavior

## Accepted Phase 03 Decisions

### Same-tick rule

Within one tick:

- data/message delivery is evaluated before timeout firing

Meaning:

- if an ack arrives in the same tick as a timeout deadline, the ack wins and may cancel the timeout

This is now an explicit simulator rule, not accidental behavior.
|||
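The same-tick rule can be sketched as a two-pass tick loop — deliveries first, then surviving timeouts. This is a hedged sketch; `Event`, `runTick`, and the cancellation map are illustrative names, not the actual distsim API:

```go
package main

import "fmt"

// Event is either a delivery (data/ack) or a timeout deadline.
type Event struct {
	Tick      int
	IsTimeout bool
	Target    string
}

// runTick applies the same-tick rule: all deliveries for this tick are
// processed before any timeout firing, so an ack arriving at the deadline
// tick can still cancel its timeout.
func runTick(tick int, events []Event, cancelled map[string]bool) []string {
	var fired []string
	// Pass 1: deliveries first; an ack cancels its pending timeout.
	for _, e := range events {
		if e.Tick == tick && !e.IsTimeout {
			cancelled[e.Target] = true
		}
	}
	// Pass 2: timeouts fire only if not cancelled in pass 1.
	for _, e := range events {
		if e.Tick == tick && e.IsTimeout && !cancelled[e.Target] {
			fired = append(fired, e.Target)
		}
	}
	return fired
}

func main() {
	events := []Event{
		{Tick: 5, IsTimeout: false, Target: "barrier-1"}, // ack at deadline tick
		{Tick: 5, IsTimeout: true, Target: "barrier-1"},
		{Tick: 5, IsTimeout: true, Target: "barrier-2"}, // no ack: fires
	}
	fired := runTick(5, events, map[string]bool{})
	fmt.Println(fired) // only barrier-2 fires; barrier-1's ack won the tick
}
```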
|
|||
### Timeout authority |
|||
|
|||
Not every timeout that reaches its deadline still has authority to mutate state. |
|||
|
|||
So we now distinguish: |
|||
|
|||
- `FiredTimeouts` |
|||
- timeout had authority and changed the model |
|||
- `IgnoredTimeouts` |
|||
- timeout reached deadline but was stale and ignored |
|||
|
|||
This keeps replay/debug output honest. |
|||
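The authority split can be expressed as a classification step at deadline time, rather than letting every expired timer mutate state. A minimal sketch assuming epoch-armed timeouts (field and type names are illustrative):

```go
package main

import "fmt"

type Timeout struct {
	ID    string
	Epoch int // epoch the timeout was armed in
}

// Model records which deadline-reaching timeouts actually had authority.
type Model struct {
	Epoch           int
	FiredTimeouts   []string // had authority, changed the model
	IgnoredTimeouts []string // reached deadline but were stale
}

// Expire classifies a deadline-reaching timeout: one armed in an older
// epoch is stale and must not mutate state.
func (m *Model) Expire(t Timeout) bool {
	if t.Epoch != m.Epoch {
		m.IgnoredTimeouts = append(m.IgnoredTimeouts, t.ID)
		return false
	}
	m.FiredTimeouts = append(m.FiredTimeouts, t.ID)
	return true // caller applies the state change only on true
}

func main() {
	m := &Model{Epoch: 3}
	m.Expire(Timeout{ID: "catchup-old", Epoch: 2}) // armed before epoch bump
	m.Expire(Timeout{ID: "barrier-cur", Epoch: 3})
	fmt.Println(m.FiredTimeouts, m.IgnoredTimeouts)
}
```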
|
|||
### Late barrier ack rule |
|||
|
|||
Once a barrier instance times out: |
|||
|
|||
- it is marked expired |
|||
- late ack for that barrier instance is rejected |
|||
|
|||
That prevents a stale ack from reviving old durability state. |
|||
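The expired-barrier tracking could look like the following sketch: once an instance times out it is marked expired, and any later ack for that instance is refused (names are assumptions, not the real simulator types):

```go
package main

import "fmt"

// BarrierTracker rejects acks for barrier instances that already timed
// out, so a stale ack cannot revive old durability state.
type BarrierTracker struct {
	expired map[int]bool // barrier instance ID -> timed out
	acked   map[int]bool
}

func NewBarrierTracker() *BarrierTracker {
	return &BarrierTracker{expired: map[int]bool{}, acked: map[int]bool{}}
}

// Timeout marks a barrier instance as expired.
func (b *BarrierTracker) Timeout(instance int) { b.expired[instance] = true }

// Ack returns false for a late ack on an expired barrier instance.
func (b *BarrierTracker) Ack(instance int) bool {
	if b.expired[instance] {
		return false
	}
	b.acked[instance] = true
	return true
}

func main() {
	b := NewBarrierTracker()
	b.Timeout(7)
	fmt.Println(b.Ack(7)) // false: late ack rejected
	fmt.Println(b.Ack(8)) // true: live barrier instance
}
```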
|
|||
### Review gate rule for timer work |
|||
|
|||
Timer/race work is easy to get subtly wrong while still having green tests. |
|||
|
|||
So timer-related work is not accepted until: |
|||
|
|||
- code path is reviewed |
|||
- tests assert the real protocol obligation |
|||
- stale and authoritative timer behavior are clearly distinguished |
|||
@ -0,0 +1,36 @@ |
|||
# Phase 03 Log |
|||
|
|||
Date: 2026-03-27 |
|||
Status: active |
|||
|
|||
## 2026-03-27 |
|||
|
|||
- Phase 03 created after Phase 02 core scope was effectively delivered. |
|||
- Reason for new phase: |
|||
- remaining simulator work is about timer semantics and race behavior, not basic protocol-state coverage |
|||
- Initial target: |
|||
- define `distsim` vs `eventsim` split more clearly |
|||
- add explicit timeout semantics |
|||
- add timer-race scenarios without bloating `distsim` ad hoc |
|||
- P0 delivered: |
|||
- timeout model added for barrier / catch-up / reservation |
|||
- timeout-backed scenarios added |
|||
- same-tick ordering rule defined as data-before-timers |
|||
- First review result: |
|||
- timeout semantics accepted only after making cancellation model-driven |
|||
- late barrier ack after timeout required explicit rejection |
|||
- P0 hardening delivered: |
|||
- recovery timeout cancellation moved into model logic |
|||
- stale late barrier ack rejected via expired-barrier tracking |
|||
- stale vs authoritative timeout distinction added: |
|||
- `FiredTimeouts` |
|||
- `IgnoredTimeouts` |
|||
- P1 delivered and reviewed: |
|||
- promotion vs stale timeout race |
|||
- rebuild completion vs epoch bump race |
|||
- trace builder moved into reusable code |
|||
- Current suite state at latest accepted review: |
|||
- 86 `distsim` tests passing |
|||
- Manager decision: |
|||
- Phase 03 P0/P1 are accepted |
|||
- next work should move to deliberate P2 selection rather than broadening the phase ad hoc |
|||
@ -0,0 +1,193 @@ |
|||
# Phase 03 |
|||
|
|||
Date: 2026-03-27 |
|||
Status: active |
|||
Purpose: define the next simulator tier after Phase 02, focused on timeout semantics, timer races, and a cleaner split between protocol simulation and event/interleaving simulation |
|||
|
|||
## Goal |
|||
|
|||
Phase 03 exists to cover behavior that current `distsim` still abstracts away: |
|||
|
|||
1. timeout semantics |
|||
2. timer races |
|||
3. event ordering under competing triggers |
|||
4. clearer separation between: |
|||
- protocol / lineage simulation |
|||
- event / race simulation |
|||
|
|||
This phase should not reopen already-closed Phase 02 protocol scope unless a clear bug is found. |
|||
|
|||
## Why A New Phase |
|||
|
|||
Phase 02 already delivered: |
|||
|
|||
- protocol-state assertions |
|||
- V1 / V1.5 / V2 comparison scenarios |
|||
- endpoint identity modeling |
|||
- control-plane assignment-update flow |
|||
- committed-prefix-aware promotion eligibility |
|||
|
|||
What remains is different in character: |
|||
|
|||
- timers |
|||
- delayed events racing with each other |
|||
- timeout-triggered state changes |
|||
- more explicit event scheduling |
|||
|
|||
That deserves a new phase boundary. |
|||
|
|||
## Source Of Truth |
|||
|
|||
Design/source-of-truth: |
|||
- `sw-block/design/v2_scenarios.md` |
|||
- `sw-block/design/v2-dist-fsm.md` |
|||
- `sw-block/design/v2-scenario-sources-from-v1.md` |
|||
- `sw-block/design/v1-v15-v2-comparison.md` |
|||
|
|||
Current prototype base: |
|||
- `sw-block/prototype/distsim/` |
|||
- `sw-block/prototype/distsim/simulator.go` |
|||
|
|||
## Scope |
|||
|
|||
### In scope |
|||
|
|||
1. timeout semantics |
|||
- barrier timeout |
|||
- catch-up timeout |
|||
- reservation expiry timeout |
|||
- rebuild timeout |
|||
|
|||
2. timer races |
|||
- delayed ack vs timeout |
|||
- timeout vs promotion |
|||
- reconnect vs timeout |
|||
- catch-up completion vs expiry |
|||
- rebuild completion vs epoch bump |
|||
|
|||
3. simulator split clarification |
|||
- `distsim` keeps: |
|||
- protocol correctness |
|||
- lineage |
|||
- recoverability |
|||
- reference-state checking |
|||
- `eventsim` grows into: |
|||
- event scheduling |
|||
- timer firing |
|||
- same-time interleavings |
|||
- race exploration |
|||
|
|||
### Out of scope |
|||
|
|||
- production integration |
|||
- real transport |
|||
- real disk timings |
|||
- SPDK |
|||
- raw allocator |
|||
|
|||
## Assigned Tasks For `sw` |
|||
|
|||
### P0 |
|||
|
|||
1. Write a concrete `eventsim` scope note in code/docs |
|||
- define what stays in `distsim` |
|||
- define what moves to `eventsim` |
|||
- avoid overlap and duplicated semantics |
|||
|
|||
2. Add minimal timeout event model |
|||
- first-class timeout event type(s) |
|||
- at minimum: |
|||
- barrier timeout |
|||
- catch-up timeout |
|||
- reservation expiry |
|||
|
|||
3. Add timeout-backed scenarios |
|||
- stale delayed ack vs timeout |
|||
- catch-up timeout before convergence |
|||
- reservation expiry during active recovery |
|||
|
|||
### P1 |
|||
|
|||
4. Add race-focused tests |
|||
- promotion vs delayed stale ack |
|||
- rebuild completion vs epoch bump |
|||
- reconnect success vs timeout firing |
|||
|
|||
5. Keep traces debuggable |
|||
- failing runs must dump: |
|||
- seed |
|||
- event order |
|||
- timer events |
|||
- node states |
|||
- committed prefix |
|||
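The required dump fields could be bundled into one replayable trace value, so every failing run emits the same shape. A sketch only; field names are assumptions:

```go
package main

import "fmt"

// FailureTrace bundles everything a failing race/timeout run should dump
// so the run can be replayed and debugged.
type FailureTrace struct {
	Seed            int64
	EventOrder      []string
	TimerEvents     []string
	NodeStates      map[string]string
	CommittedPrefix int
}

// Dump renders the trace in one line suitable for test-failure output.
func (t FailureTrace) Dump() string {
	return fmt.Sprintf("seed=%d events=%v timers=%v nodes=%v committed=%d",
		t.Seed, t.EventOrder, t.TimerEvents, t.NodeStates, t.CommittedPrefix)
}

func main() {
	t := FailureTrace{
		Seed:            42,
		EventOrder:      []string{"write", "ack", "timeout"},
		TimerEvents:     []string{"barrier-timeout@t5"},
		NodeStates:      map[string]string{"n1": "Active", "n2": "CatchUp"},
		CommittedPrefix: 17,
	}
	fmt.Println(t.Dump())
}
```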
|
|||
### P2 |
|||
|
|||
6. Decide whether selected `distsim` scenarios should also exist in `eventsim` |
|||
- only when timer/event ordering is the real point |
|||
- do not duplicate every scenario blindly |
|||
|
|||
## Current Progress |
|||
|
|||
Delivered in this phase so far: |
|||
|
|||
- `eventsim` scope note added in code |
|||
- explicit timeout model added: |
|||
- barrier timeout |
|||
- catch-up timeout |
|||
- reservation timeout |
|||
- timeout-backed scenarios added and reviewed |
|||
- same-tick rule made explicit: |
|||
- data before timers |
|||
- recovery timeout cancellation is now model-driven, not test-driven |
|||
- stale barrier ack after timeout is explicitly rejected |
|||
- stale timeouts are separated from authoritative timeouts: |
|||
- `FiredTimeouts` |
|||
- `IgnoredTimeouts` |
|||
- race-focused scenarios added and reviewed: |
|||
- promotion vs stale catch-up timeout |
|||
- promotion vs stale barrier timeout |
|||
- rebuild completion vs epoch bump |
|||
- epoch bump vs stale catch-up timeout |
|||
- reusable trace builder added for replay/debug support |
|||
- current `distsim` suite at latest review: |
|||
- 86 tests passing |
|||
|
|||
Remaining focus for `sw`: |
|||
|
|||
- Phase 03 P0 and P1 are effectively complete |
|||
- Phase 03 P2 is also effectively complete after review |
|||
- any further simulator work should now be narrow and evidence-driven |
|||
- recommended next simulator additions only: |
|||
- control-plane latency parameter |
|||
- sustained-write convergence / tail-chasing load test |
|||
- one multi-promotion lineage extension |
|||
|
|||
## Invariants To Preserve |
|||
|
|||
1. committed data remains durable per policy |
|||
2. uncommitted data is never revived as committed |
|||
3. stale epoch traffic never mutates current lineage |
|||
4. committed prefix remains contiguous |
|||
5. timeout-triggered transitions are explicit and explainable |
|||
6. races do not silently bypass fencing or rebuild boundaries |
|||
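Invariant 4 (contiguous committed prefix) is the kind of oracle check a simulator can run after every step. A minimal sketch, assuming committed entries are tracked as a sequence-number set:

```go
package main

import "fmt"

// committedPrefixContiguous checks invariant 4: the committed sequence
// numbers must form a gap-free prefix 1..n.
func committedPrefixContiguous(committed map[int]bool) bool {
	n := len(committed)
	for seq := 1; seq <= n; seq++ {
		if !committed[seq] {
			return false // a gap inside the committed range
		}
	}
	return true
}

func main() {
	ok := map[int]bool{1: true, 2: true, 3: true}
	gap := map[int]bool{1: true, 3: true} // 2 missing: not a prefix
	fmt.Println(committedPrefixContiguous(ok), committedPrefixContiguous(gap))
}
```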
|
|||
## Required Updates Per Task |
|||
|
|||
For each completed task: |
|||
|
|||
1. add or update tests |
|||
2. update `sw-block/design/v2_scenarios.md` if scenario coverage changed |
|||
3. add a short note to: |
|||
- `sw-block/.private/phase/phase-03-log.md` |
|||
4. if the simulator boundary changed, record it in: |
|||
- `sw-block/.private/phase/phase-03-decisions.md` |
|||
|
|||
## Exit Criteria |
|||
|
|||
Phase 03 is done when: |
|||
|
|||
1. timeout semantics exist as explicit simulator behavior |
|||
2. at least three important timer-race scenarios are modeled and tested |
|||
3. `distsim` vs `eventsim` responsibilities are clearly separated |
|||
4. failure traces from race/timeout scenarios are replayable enough to debug |
|||
@ -0,0 +1,97 @@ |
|||
# Phase 04 Decisions |
|||
|
|||
Date: 2026-03-27 |
|||
Status: initial |
|||
|
|||
## First Slice Decision |
|||
|
|||
The first standalone V2 implementation slice is: |
|||
|
|||
- per-replica sender ownership |
|||
- one active recovery session per replica per epoch |
|||
|
|||
## Why Not Start In V1 |
|||
|
|||
V1/V1.5 remains: |
|||
|
|||
- production line |
|||
- maintenance/fix line |
|||
|
|||
It should not be the place where V2 architecture is first implemented. |
|||
|
|||
## Why This Slice |
|||
|
|||
This slice: |
|||
|
|||
- directly addresses the clearest V1.5 structural pain |
|||
- maps cleanly to the V2-boundary tests |
|||
- is narrow enough to implement without dragging in the entire future architecture |
|||
|
|||
## Accepted P0 Refinements |
|||
|
|||
### Sender epoch coherence |
|||
|
|||
Sender-owned epoch is real state, not decoration. |
|||
|
|||
So: |
|||
|
|||
- reconcile/update paths must refresh sender epoch |
|||
- stale active session must be invalidated on epoch advance |
|||
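A sketch of the epoch-coherence rule: reconcile refreshes the sender-owned epoch, and an epoch advance strips authority from a session attached under the old epoch. Type and field names are illustrative, not the enginev2 API:

```go
package main

import "fmt"

type RecoverySession struct {
	ID    int
	Epoch int
}

type Sender struct {
	Replica       string
	Epoch         int
	ActiveSession *RecoverySession
}

// Reconcile refreshes the sender-owned epoch and invalidates a session
// attached under an older epoch: epoch is real state, not decoration.
func (s *Sender) Reconcile(newEpoch int) {
	if newEpoch <= s.Epoch {
		return
	}
	s.Epoch = newEpoch
	if s.ActiveSession != nil && s.ActiveSession.Epoch < newEpoch {
		s.ActiveSession = nil // stale session loses all authority
	}
}

func main() {
	s := &Sender{Replica: "r1", Epoch: 3,
		ActiveSession: &RecoverySession{ID: 1, Epoch: 3}}
	s.Reconcile(4)
	fmt.Println(s.Epoch, s.ActiveSession == nil) // 4 true
}
```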
|
|||
### Session lifecycle |
|||
|
|||
The first slice should not ship as a loose, unenforced lifecycle shell. |
|||
|
|||
So: |
|||
|
|||
- session phase changes now follow an explicit transition map |
|||
- invalid jumps are rejected |
|||
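An explicit transition map can be as small as a lookup table; anything not listed is an invalid jump. The phase names here are assumptions for illustration, not the actual enginev2 phases:

```go
package main

import "fmt"

type Phase string

const (
	PhaseIdle      Phase = "idle"
	PhaseConnect   Phase = "connect"
	PhaseHandshake Phase = "handshake"
	PhaseCatchUp   Phase = "catchup"
	PhaseDone      Phase = "done"
)

// transitions is the explicit map; any jump not listed is rejected.
var transitions = map[Phase][]Phase{
	PhaseIdle:      {PhaseConnect},
	PhaseConnect:   {PhaseHandshake},
	PhaseHandshake: {PhaseCatchUp, PhaseDone}, // zero-gap fast path allowed
	PhaseCatchUp:   {PhaseDone},
}

func validTransition(from, to Phase) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(validTransition(PhaseHandshake, PhaseDone)) // true: zero-gap
	fmt.Println(validTransition(PhaseIdle, PhaseDone))      // false: invalid jump
}
```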
|
|||
### Session attach rule |
|||
|
|||
Attaching a session at the wrong epoch is invalid. |
|||
|
|||
So: |
|||
|
|||
- `AttachSession(epoch, kind)` must reject epoch mismatch with the owning sender |
|||
|
|||
## Accepted P1 Refinements |
|||
|
|||
### Session identity fencing |
|||
|
|||
The standalone V2 slice must reject stale completion by explicit session identity. |
|||
|
|||
So: |
|||
|
|||
- `RecoverySession` has stable unique identity |
|||
- sender completion must be by session ID, not by "current pointer" |
|||
- stale session results are rejected at the sender authority boundary |
|||
|
|||
### Ownership vs execution |
|||
|
|||
Ownership creation is not the same as execution start. |
|||
|
|||
So: |
|||
|
|||
- `AttachSession()` and `SupersedeSession()` establish ownership only |
|||
- `BeginConnect()` is the first execution-state mutation |
|||
|
|||
### Completion authority |
|||
|
|||
An ID match alone is not enough to complete recovery. |
|||
|
|||
So: |
|||
|
|||
- completion must require a valid completion-ready phase |
|||
- normal completion requires converged catch-up |
|||
- zero-gap fast completion is allowed explicitly from handshake |
|||
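Identity fencing and completion authority compose into one check at the sender boundary: the session ID must match the current session, and the session must be at a completion-ready point. A hedged sketch (the real `CompleteSessionByID` signature and phase names may differ):

```go
package main

import (
	"errors"
	"fmt"
)

type Phase string

const (
	PhaseHandshake Phase = "handshake"
	PhaseCatchUp   Phase = "catchup"
	PhaseDone      Phase = "done"
)

type Session struct {
	ID        int
	Phase     Phase
	Gap       int // entries behind at handshake
	Converged bool
}

type Sender struct{ Active *Session }

// CompleteSessionByID enforces both rules: stale IDs are rejected at the
// sender authority boundary, and an ID match alone is not enough — the
// session must be at a valid completion point.
func (s *Sender) CompleteSessionByID(id int) error {
	if s.Active == nil || s.Active.ID != id {
		return errors.New("stale session: completion rejected")
	}
	switch {
	case s.Active.Phase == PhaseCatchUp && s.Active.Converged:
		// normal completion: catch-up has converged
	case s.Active.Phase == PhaseHandshake && s.Active.Gap == 0:
		// explicit zero-gap fast path from handshake
	default:
		return errors.New("not at a valid completion point")
	}
	s.Active.Phase = PhaseDone
	return nil
}

func main() {
	s := &Sender{Active: &Session{ID: 2, Phase: PhaseHandshake, Gap: 0}}
	fmt.Println(s.CompleteSessionByID(1)) // stale ID: rejected
	fmt.Println(s.CompleteSessionByID(2)) // zero-gap fast path: nil
}
```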
|
|||
## P2 Direction |
|||
|
|||
The next prototype step is not broader simulation. |
|||
|
|||
It is: |
|||
|
|||
- recovery outcome branching |
|||
- assignment-intent orchestration |
|||
- prototype-level end-to-end recovery flow |
|||
@ -0,0 +1,46 @@ |
|||
# Phase 04 Log |
|||
|
|||
Date: 2026-03-27 |
|||
Status: active |
|||
|
|||
## 2026-03-27 |
|||
|
|||
- Phase 04 created to start the first standalone V2 implementation slice. |
|||
- Decision: |
|||
- do not begin in `weed/storage/blockvol/` |
|||
- begin under `sw-block/` |
|||
- first slice chosen: |
|||
- per-replica sender ownership |
|||
- explicit recovery-session ownership |
|||
- Initial slice delivered under `sw-block/prototype/enginev2/`: |
|||
- sender |
|||
- recovery session |
|||
- sender group |
|||
- First review found: |
|||
- sender/session epoch coherence gap |
|||
- session lifecycle was shell-only, not enforcing real transitions |
|||
- attach-session epoch mismatch was not rejected |
|||
- Follow-up delivered and accepted: |
|||
- reconcile updates preserved sender epoch |
|||
- epoch bump invalidates stale session |
|||
- session transition map enforced |
|||
- attach-session rejects epoch mismatch |
|||
- enginev2 tests increased to 26 passing |
|||
- Phase 04a created to close the ownership-validation gap: |
|||
- explicit session identity in `distsim` |
|||
- bridge tests into `enginev2` |
|||
- Phase 04a closed the ownership-validation gap: |
|||
- stale completion rejected by session ID |
|||
- endpoint invalidation includes `CtrlAddr` |
|||
- boundary doc aligned with real simulator/prototype evidence |
|||
- Phase 04 P1 delivered and accepted: |
|||
- sender-owned execution APIs added |
|||
- all execution APIs fence on `sessionID` |
|||
- completion now requires valid completion point |
|||
- attach/supersede now establish ownership only |
|||
- handshake range validation added |
|||
- enginev2 tests increased to 46 passing |
|||
- Next phase focus narrowed to P2: |
|||
- recovery outcome branching |
|||
- assignment-intent orchestration |
|||
- prototype end-to-end recovery flow |
|||
@ -0,0 +1,153 @@ |
|||
# Phase 04 |
|||
|
|||
Date: 2026-03-27 |
|||
Status: active |
|||
Purpose: start the first standalone V2 implementation slice under `sw-block/`, centered on per-replica sender ownership and explicit recovery-session ownership |
|||
|
|||
## Goal |
|||
|
|||
Build the first real V2 implementation slice without destabilizing V1. |
|||
|
|||
This slice should prove: |
|||
|
|||
1. per-replica sender identity |
|||
2. explicit one-session-per-replica recovery ownership |
|||
3. endpoint/assignment-driven recovery updates |
|||
4. clean handoff between normal sender and recovery session |
|||
|
|||
## Why This Phase Exists |
|||
|
|||
The simulator and design work are now strong enough to support a narrow implementation slice. |
|||
|
|||
We should not start with: |
|||
|
|||
- Smart WAL |
|||
- new storage engine |
|||
- frontend integration |
|||
|
|||
We should start with the ownership problem that most clearly separates V2 from V1.5. |
|||
|
|||
## Source Of Truth |
|||
|
|||
Design: |
|||
- `sw-block/design/v2-first-slice-session-ownership.md` |
|||
- `sw-block/design/v2-acceptance-criteria.md` |
|||
- `sw-block/design/v2-open-questions.md` |
|||
|
|||
Simulator reference: |
|||
- `sw-block/prototype/distsim/` |
|||
|
|||
## Scope |
|||
|
|||
### In scope |
|||
|
|||
1. per-replica sender owner object |
|||
2. explicit recovery session object |
|||
3. session lifecycle rules |
|||
4. endpoint update handling |
|||
5. basic tests for sender/session ownership |
|||
|
|||
### Out of scope |
|||
|
|||
- Smart WAL in production code |
|||
- real block backend redesign |
|||
- V1 integration |
|||
- frontend publication |
|||
|
|||
## Assigned Tasks For `sw` |
|||
|
|||
### P0 |
|||
|
|||
1. create standalone V2 implementation area under `sw-block/` |
|||
- recommended: |
|||
- `sw-block/prototype/enginev2/` |
|||
|
|||
2. define sender/session types |
|||
- sender owner per replica |
|||
- recovery session per replica per epoch |
|||
|
|||
3. implement basic lifecycle |
|||
- create sender |
|||
- attach session |
|||
- supersede stale session |
|||
- close session on success / invalidation |
|||
|
|||
## Current Progress |
|||
|
|||
Delivered in this phase so far: |
|||
|
|||
- standalone V2 area created under: |
|||
- `sw-block/prototype/enginev2/` |
|||
- core types added: |
|||
- `Sender` |
|||
- `RecoverySession` |
|||
- `SenderGroup` |
|||
- sender/session lifecycle shell implemented |
|||
- per-replica ownership implemented |
|||
- endpoint-change invalidation implemented |
|||
- sender epoch coherence implemented |
|||
- session epoch attach validation implemented |
|||
- session phase transitions now enforce a real transition map |
|||
- session identity fencing implemented |
|||
- stale completion rejected by session ID |
|||
- execution APIs implemented: |
|||
- `BeginConnect` |
|||
- `RecordHandshake` |
|||
- `BeginCatchUp` |
|||
- `RecordCatchUpProgress` |
|||
- `CompleteSessionByID` |
|||
- completion authority tightened: |
|||
- catch-up must converge |
|||
- zero-gap handshake fast path allowed |
|||
- attach/supersede now establish ownership only |
|||
- sender-group orchestration tests added |
|||
- current `enginev2` test state at latest review: |
|||
- 46 tests passing |
|||
|
|||
Next focus for `sw`: |
|||
|
|||
- continue Phase 04 beyond execution gating: |
|||
- recovery outcome branching |
|||
- sender-group orchestration from assignment intent |
|||
- prototype-level end-to-end recovery flow |
|||
- do not integrate into V1 production tree yet |
|||
|
|||
### P1 |
|||
|
|||
4. implement endpoint update handling |
|||
- changed-address update must refresh the right sender owner |
|||
|
|||
5. implement epoch invalidation |
|||
- stale session must stop after epoch bump |
|||
|
|||
6. add tests matching the slice acceptance |
|||
|
|||
### P2 |
|||
|
|||
7. add recovery outcome branching |
|||
- distinguish: |
|||
- zero-gap fast completion |
|||
- positive-gap catch-up completion |
|||
- unrecoverable gap / `NeedsRebuild` |
|||
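The three-way branch could reduce to comparing the replica's durable position against the sender's retained log tail. A sketch under assumed semantics (`retainedFrom` is the oldest entry still replayable; names are illustrative):

```go
package main

import "fmt"

type Outcome int

const (
	ZeroGapComplete Outcome = iota
	CatchUpComplete
	NeedsRebuild
)

// classifyRecovery branches on the gap between the replica's durable
// position and the sender's log tail.
func classifyRecovery(replicaPos, retainedFrom, tail int) Outcome {
	gap := tail - replicaPos
	switch {
	case gap == 0:
		return ZeroGapComplete
	case replicaPos+1 >= retainedFrom:
		return CatchUpComplete // every missing entry is still replayable
	default:
		return NeedsRebuild // gap predates the retained log: full rebuild
	}
}

func main() {
	fmt.Println(classifyRecovery(100, 50, 100)) // zero-gap
	fmt.Println(classifyRecovery(80, 50, 100))  // catch-up
	fmt.Println(classifyRecovery(30, 50, 100))  // needs rebuild
}
```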
|
|||
8. add assignment-intent driven orchestration |
|||
- move beyond raw reconcile-only tests |
|||
- make sender-group react to explicit recovery intent |
|||
|
|||
9. add prototype-level end-to-end flow tests |
|||
- assignment/update |
|||
- session creation |
|||
- execution |
|||
- completion / invalidation |
|||
- rebuild escalation |
|||
|
|||
## Exit Criteria |
|||
|
|||
Phase 04 is done when: |
|||
|
|||
1. standalone V2 sender/session slice exists under `sw-block/` |
|||
2. sender ownership is per replica, not global to the replica set |
|||
3. one active recovery session per replica per epoch is enforced |
|||
4. endpoint update and epoch invalidation are tested |
|||
5. sender-owned execution flow is validated |
|||
6. recovery outcome branching exists at prototype level |
|||
@ -0,0 +1,49 @@ |
|||
# Phase 04a Decisions |
|||
|
|||
Date: 2026-03-27 |
|||
Status: initial |
|||
|
|||
## Core Decision |
|||
|
|||
The next must-fix validation problem is: |
|||
|
|||
- sender/session ownership semantics |
|||
|
|||
This outranks: |
|||
|
|||
- more timing realism |
|||
- more WAL detail |
|||
- broader scenario growth |
|||
|
|||
## Why |
|||
|
|||
V2's core claim over V1.5 is not only: |
|||
|
|||
- better recovery policy |
|||
|
|||
It is also: |
|||
|
|||
- stable per-replica sender identity |
|||
- one active recovery owner |
|||
- stale work cannot mutate current state |
|||
|
|||
If those ownership rules are not validated, the simulator can overstate confidence. |
|||
|
|||
## Validation Rule |
|||
|
|||
For this phase, a scenario is only complete when it is expressed at two levels: |
|||
|
|||
1. simulator ownership model (`distsim`) |
|||
2. standalone implementation slice (`enginev2`) |
|||
|
|||
Real `weed/` adversarial tests remain the system-level gate. |
|||
|
|||
## Scope Discipline |
|||
|
|||
Do not expand this phase into: |
|||
|
|||
- generic simulator feature growth |
|||
- Smart WAL design growth |
|||
- V1 integration work |
|||
|
|||
Keep it focused on the ownership model. |
|||
@ -0,0 +1,22 @@ |
|||
# Phase 04a Log |
|||
|
|||
Date: 2026-03-27 |
|||
Status: active |
|||
|
|||
## 2026-03-27 |
|||
|
|||
- Phase 04a created as a narrow validation phase. |
|||
- Reason: |
|||
- the biggest remaining V2 validation gap is ownership semantics |
|||
- not general scenario count |
|||
- not more timer realism |
|||
- not more WAL detail |
|||
- Scope chosen: |
|||
- sender identity |
|||
- recovery session identity |
|||
- supersede / invalidate rules |
|||
- stale completion rejection |
|||
- `distsim` to `enginev2` bridge tests |
|||
- This phase is intentionally separate from broad Phase 04 implementation growth. |
|||
- Goal: |
|||
- gain confidence that V2 is validated as owned session/sender protocol state, not only as policy |
|||
@ -0,0 +1,113 @@ |
|||
# Phase 04a |
|||
|
|||
Date: 2026-03-27 |
|||
Status: active |
|||
Purpose: close the critical V2 ownership-validation gap by making sender/session ownership explicit in both simulation and the standalone `enginev2` slice |
|||
|
|||
## Goal |
|||
|
|||
Validate the core V2 claim more deeply: |
|||
|
|||
1. one stable sender identity per replica |
|||
2. one active recovery session per replica |
|||
3. endpoint change, epoch bump, and supersede rules invalidate stale work |
|||
4. stale late results from old sessions cannot mutate current state |
|||
|
|||
This phase is not about adding broad new simulator surface. |
|||
It is about proving the ownership model that is supposed to make V2 better than V1.5. |
|||
|
|||
## Why This Phase Exists |
|||
|
|||
Current simulation is already strong on: |
|||
|
|||
- quorum / commit rules |
|||
- stale epoch rejection |
|||
- catch-up vs rebuild |
|||
- timeout / race ordering |
|||
- changed-address recovery at the policy level |
|||
|
|||
The remaining critical risk is narrower: |
|||
|
|||
- the simulator still validates V2 strongly as policy |
|||
- but not yet strongly enough as owned sender/session protocol state |
|||
|
|||
That is the highest-value validation gap to close before trusting V2 too much. |
|||
|
|||
## Source Of Truth |
|||
|
|||
Design: |
|||
- `sw-block/design/v2-first-slice-session-ownership.md` |
|||
- `sw-block/design/v2-acceptance-criteria.md` |
|||
- `sw-block/design/v2-open-questions.md` |
|||
- `sw-block/design/protocol-development-process.md` |
|||
|
|||
Simulator / prototype: |
|||
- `sw-block/prototype/distsim/` |
|||
- `sw-block/prototype/enginev2/` |
|||
|
|||
Historical / review context: |
|||
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` |
|||
- `sw-block/design/v2-scenario-sources-from-v1.md` |
|||
|
|||
## Scope |
|||
|
|||
### In scope |
|||
|
|||
1. explicit sender/session identity validation in `distsim` |
|||
2. explicit stale-session invalidation rules |
|||
3. bridge tests from `distsim` scenarios to `enginev2` sender/session invariants |
|||
4. doc cleanup so V2-boundary tests point to real simulator and `enginev2` coverage |
|||
|
|||
### Out of scope |
|||
|
|||
- Smart WAL expansion |
|||
- broad new timing realism |
|||
- TCP / disk realism |
|||
- V1 production integration |
|||
- new backend/storage engine work |
|||
|
|||
## Critical Questions To Close |
|||
|
|||
1. can an old session completion mutate state after a new session supersedes it? |
|||
2. does endpoint change invalidate or supersede the active session cleanly? |
|||
3. does epoch bump remove all authority from prior sessions? |
|||
4. can duplicate recovery triggers create overlapping active sessions? |
|||
|
|||
## Assigned Tasks For `sw` |
|||
|
|||
### P0 |
|||
|
|||
1. add explicit session identity to `distsim` |
|||
- model session ID or equivalent ownership token |
|||
- make stale session results rejectable by identity, not just by coarse state |
|||
|
|||
2. add ownership scenarios to `distsim` |
|||
- endpoint change during active catch-up |
|||
- epoch bump during active catch-up |
|||
- stale late completion from old session |
|||
- duplicate recovery trigger while a session is already active |
|||
|
|||
3. add bridge tests in `enginev2` |
|||
- same-address reconnect preserves sender identity |
|||
- endpoint bump supersedes or invalidates active session |
|||
- epoch bump rejects stale completion |
|||
- only one active session per sender |
|||
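The supersede/stale-completion bridge test reduces to a very small check. This sketch assumes a sender that tracks one active session ID; the types and method names are illustrative, not the real enginev2 API:

```go
package main

import "fmt"

// Sender tracks the single active recovery session for one replica.
type Sender struct{ activeID int }

// Supersede replaces the active session; the old ID loses authority.
func (s *Sender) Supersede(newID int) { s.activeID = newID }

// CompleteByID succeeds only for the current session's ID.
func (s *Sender) CompleteByID(id int) bool { return id == s.activeID }

func main() {
	s := &Sender{activeID: 1}
	s.Supersede(2) // duplicate trigger supersedes the first session
	fmt.Println(s.CompleteByID(1)) // false: stale completion rejected
	fmt.Println(s.CompleteByID(2)) // true: only the current session completes
}
```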
|
|||
### P1 |
|||
|
|||
4. tighten `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` |
|||
- point to actual `distsim` scenarios |
|||
- point to actual `enginev2` bridge tests |
|||
- state what remains real-engine-only |
|||
|
|||
5. only add simulator mechanics if a bridge test exposes a real ownership gap |
|||
|
|||
## Exit Criteria |
|||
|
|||
Phase 04a is done when: |
|||
|
|||
1. `distsim` explicitly validates sender/session ownership invariants |
|||
2. `enginev2` has bridge tests for the same invariants |
|||
3. stale session work is shown unable to mutate current sender state |
|||
4. V2-boundary doc no longer has stale simulator references |
|||
5. we can say with confidence that V2 ownership semantics, not just V2 policy, are validated at prototype level |
|||
@ -0,0 +1,18 @@ |
|||
# sw-block |
|||
|
|||
Private WAL V2 and standalone block-service workspace. |
|||
|
|||
Purpose: |
|||
- keep WAL V2 design/prototype work isolated from WAL V1 production code in `weed/storage/blockvol` |
|||
- allow private design notes and experiments to evolve without polluting V1 delivery paths |
|||
- keep the future standalone `sw-block` product structure clean enough to split into a separate repo later if needed |
|||
|
|||
Suggested layout: |
|||
- `design/`: shared V2 design docs |
|||
- `prototype/`: code prototypes and experiments |
|||
- `.private/`: private notes, phase development, roadmap, and non-public working material |
|||
|
|||
Repository direction: |
|||
- current state: `sw-block/` is an isolated workspace inside `seaweedfs` |
|||
- likely future state: `sw-block` becomes a standalone sibling repo/product |
|||
- design and prototype structure should therefore stay product-oriented and not depend on SeaweedFS-specific paths |
|||
@ -0,0 +1,26 @@ |
|||
# V2 Design |
|||
|
|||
Current WAL V2 design set: |
|||
- `wal-replication-v2.md` |
|||
- `wal-replication-v2-state-machine.md` |
|||
- `wal-replication-v2-orchestrator.md` |
|||
- `wal-v2-tiny-prototype.md` |
|||
- `wal-v1-to-v2-mapping.md` |
|||
- `v2-dist-fsm.md` |
|||
- `v2_scenarios.md` |
|||
- `v1-v15-v2-comparison.md` |
|||
- `v2-scenario-sources-from-v1.md` |
|||
- `protocol-development-process.md` |
|||
- `v2-acceptance-criteria.md` |
|||
- `v2-open-questions.md` |
|||
- `v2-first-slice-session-ownership.md` |
|||
- `v2-prototype-roadmap-and-gates.md` |
|||
|
|||
These documents are the working design home for the V2 line. |
|||
|
|||
The original project-level copies under `learn/projects/sw-block/design/` remain as shared references for now. |
|||
|
|||
Execution note: |
|||
- active development tracking for the current simulator phase lives under: |
|||
- `../.private/phase/phase-01.md` |
|||
- `../.private/phase/phase-02.md` |
|||
@ -0,0 +1,288 @@ |
|||
# Protocol Development Process |
|||
|
|||
Date: 2026-03-27 |
|||
|
|||
## Purpose |
|||
|
|||
This document defines how `sw-block` protocol work should be developed. |
|||
|
|||
The process is meant to work for: |
|||
|
|||
- V2 |
|||
- future V3 |
|||
- or a later block algorithm that is not WAL-based |
|||
|
|||
The point is to make protocol work systematic rather than reactive. |
|||
|
|||
## Core Philosophy |
|||
|
|||
### 1. Design before implementation |
|||
|
|||
Do not start with production code and hope the protocol becomes clear later. |
|||
|
|||
Start with: |
|||
|
|||
1. system contract |
|||
2. invariants |
|||
3. state model |
|||
4. scenario backlog |
|||
|
|||
Only then move to implementation. |
|||
|
|||
### 2. Real failures are inputs, not just bugs |
|||
|
|||
When V1 or V1.5 fails in real testing, treat that as: |
|||
|
|||
- a design requirement |
|||
- a scenario source |
|||
- a simulator input |
|||
|
|||
Do not patch and forget. |
|||
|
|||
### 3. Simulator is part of the protocol, not a side tool |
|||
|
|||
The simulator exists to answer: |
|||
|
|||
- what should happen |
|||
- what must never happen |
|||
- which old designs fail |
|||
- why the new design is better |
|||
|
|||
It is not a replacement for real testing. |
|||
It is the design-validation layer before production implementation. |
|||
|
|||
### 4. Passing tests are not enough |
|||
|
|||
Green tests are necessary, not sufficient. |
|||
|
|||
We also require: |
|||
|
|||
- explicit invariants |
|||
- explicit scenario intent |
|||
- clear state transitions |
|||
- review of assumptions and abstraction boundaries |
|||
|
|||
### 5. Keep hot-path and recovery-path reasoning separate |
|||
|
|||
Healthy steady-state behavior and degraded recovery behavior are different problems. |
|||
|
|||
Both must be designed explicitly. |
|||
|
|||
## Development Ladder |
|||
|
|||
Every major protocol feature should move through these steps: |
|||
|
|||
1. **Problem statement** |
|||
- what real bug, limit, or product goal is driving the work |
|||
|
|||
2. **Contract** |
|||
- what the protocol guarantees |
|||
- what it does not guarantee |
|||
|
|||
3. **State model** |
|||
- node state |
|||
- coordinator state |
|||
- recovery state |
|||
- role / epoch / lineage rules |
|||
|
|||
4. **Scenario backlog** |
|||
- named scenarios |
|||
- source: |
|||
- real failure |
|||
- design obligation |
|||
- adversarial distributed case |
|||
|
|||
5. **Prototype / simulator** |
|||
- reduced but explicit model |
|||
- invariant checks |
|||
- V1 / V1.5 / V2 comparison where relevant |
|||
|
|||
6. **Implementation** |
|||
- production code only after the protocol shape is clear enough |
|||
|
|||
7. **Real validation** |
|||
- unit |
|||
- component |
|||
- integration |
|||
- real hardware where needed |
|||
|
|||
8. **Feedback loop** |
|||
- turn new failures back into scenario/design inputs |
|||
|
|||
## Required Artifacts |
|||
|
|||
For protocol work to be considered real progress, we usually want: |
|||
|
|||
### Design |
|||
|
|||
- design doc |
|||
- scenario doc |
|||
- comparison doc when replacing an older approach |
|||
|
|||
### Prototype |
|||
|
|||
- simulator or prototype code |
|||
- tests that assert protocol behavior |
|||
|
|||
### Implementation |
|||
|
|||
- production patch |
|||
- production tests |
|||
- docs updated to match the actual algorithm |
|||
|
|||
### Review |
|||
|
|||
- implementation gate |
|||
- design/protocol gate |
|||
|
|||
## Two-Gate Rule

We use two acceptance gates.

### Gate 1: implementation

Owned by the coding side.

Questions:

- does it build?
- do tests pass?
- does it behave as intended in code?

### Gate 2: protocol/design

Owned by the design/review side.

Questions:

- is the logic actually sound?
- do the tests prove the intended thing?
- are assumptions explicit?
- is the abstraction boundary honest?

A task is not accepted until both gates pass.
## Layering Rule

Keep simulation layers separate.

### `distsim`

Use for:

- protocol correctness
- state transitions
- fencing
- recoverability
- promotion / lineage
- reference-state checking

### `eventsim`

Use for:

- timeout behavior
- timer races
- event ordering
- same-tick / delayed event interactions

Do not duplicate scenarios blindly across both layers.
## Test Selection Rule

Do not choose simulator inputs only from failing tests.

Review all relevant tests and classify them by:

- protocol significance
- simulator value
- implementation specificity

Good simulator candidates often come from:

- barrier truth
- catch-up vs rebuild
- stale message rejection
- failover / promotion safety
- changed-address restart
- mode semantics

Keep real-only tests for:

- wire format
- OS timing
- exact WAL file behavior
- frontend transport specifics
## Version Comparison Rule

When designing a successor protocol:

- keep the old version visible
- reproduce the old failure or limitation
- show the improved behavior in the new version

For `sw-block`, that means `V1`, `V1.5`, and `V2` should be compared explicitly where possible.
## Documentation Rule

The docs must track three different things:

### `learn/projects/sw-block/`

Use for:

- project history
- V1/V1.5 algorithm records
- phase records
- real test history

### `sw-block/design/`

Use for:

- active design truth
- V2 and later protocol docs
- scenario backlog
- comparison docs

### `sw-block/.private/phase/`

Use for:

- active execution plan
- log
- decisions
## What Good Progress Looks Like

A good protocol iteration usually has this pattern:

1. real failure or design pressure identified
2. scenario named and written down
3. simulator reproduces the bad case
4. new protocol handles it explicitly
5. implementation follows
6. real tests validate it

If any of those steps is missing, confidence is weaker.

## Bottom Line

The process is:

1. design the contract
2. model the state
3. define the scenarios
4. simulate the protocol
5. implement carefully
6. validate in real tests
7. feed failures back into design

That is the process we should keep using for V2 and any later protocol line.
# Protocol Version Simulation

Date: 2026-03-26
Status: design proposal
Purpose: define how the simulator should model WAL V1, WAL V1.5 (Phase 13), and WAL V2 on the same scenario set

## Why This Exists

The simulator is more valuable if the same scenario can answer:

1. how WAL V1 behaves
2. how WAL V1.5 behaves
3. how WAL V2 should behave

That turns the simulator into:

- a regression tool for V1/V1.5
- a justification tool for V2
- a comparison framework across protocol generations
## Principle

Do not fork three separate simulators.

Instead:

- keep one simulator core
- add protocol-version behavior modes
- run the same named scenario under different modes
## Proposed Versions

### `ProtocolV1`

Intent:

- represent pre-Phase-13 behavior

Behavior shape:

- WAL is streamed optimistically
- a lagging replica is degraded/excluded quickly
- no real short-gap catch-up contract
- no retention-backed recovery window
- a replica usually falls toward rebuild rather than incremental recovery

What scenarios should expose:

- a short outage still causes unnecessary degrade/rebuild
- transient jitter may be over-penalized
- poor graceful rejoin story

### `ProtocolV15`

Intent:

- represent Phase-13 WAL V1.5 behavior

Behavior shape:

- a reconnect handshake exists
- WAL catch-up exists
- the primary may retain WAL longer for a lagging replica
- recovery still depends heavily on address stability and control-plane timing
- catch-up may still tail-chase or stall operationally

What scenarios should expose:

- transient disconnects may recover
- restart with a new receiver address may still fail practical recovery
- tail-chasing / retention pressure remain structural risks

### `ProtocolV2`

Intent:

- represent the target design

Behavior shape:

- explicit recovery reservation
- explicit catch-up vs rebuild boundary
- lineage-first promotion
- version-correct recovery sources
- explicit abort/rebuild path on non-convergence or lost recoverability

What scenarios should show:

- a short gap recovers cleanly
- impossible catch-up fails cleanly
- rebuild is explicit, not accidental
## Behavior Axes To Toggle

The simulator does not need completely different code paths.
It needs protocol-version-sensitive policy on these axes:

### 1. Lagging replica treatment

`V1`:

- degrade quickly
- no meaningful WAL catch-up window

`V1.5`:

- allow WAL catch-up while history remains available

`V2`:

- allow catch-up only with explicit recoverability / reservation

### 2. WAL retention / recoverability

`V1`:

- little or no retention for lagging-replica recovery

`V1.5`:

- retention-based recovery window
- but no strong reservation contract

`V2`:

- recoverability check plus reservation

### 3. Restart / address stability

`V1`:

- generally poor rejoin path

`V1.5`:

- reconnect may work only if the replica address is stable

`V2`:

- address/identity assumptions should be explicit in the model

### 4. Tail-chasing behavior

`V1`:

- usually degrades rather than catches up

`V1.5`:

- catch-up may be attempted but may never converge

`V2`:

- non-convergence should explicitly abort/escalate

### 5. Promotion policy

`V1`:

- weaker lineage reasoning

`V1.5`:

- improved epoch/LSN handling

`V2`:

- lineage-first promotion is a first-class rule
## Recommended Simulator API

Add a version enum, for example:

```go
type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)
```

Attach it to the simulator or cluster:

```go
type Cluster struct {
	Protocol ProtocolVersion
	// ... existing fields
}
```
## Policy Hooks

Rather than branching everywhere, centralize the differences in a few hooks:

1. `CanAttemptCatchup(...)`
2. `CatchupConvergencePolicy(...)`
3. `RecoverabilityPolicy(...)`
4. `RestartRejoinPolicy(...)`
5. `PromotionPolicy(...)`

That keeps the simulator readable.
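One way these hooks could be shaped — the interface name, parameter list, and policy details below are illustrative assumptions, not the real simulator API — is a small per-version policy object behind a common interface, so the core never branches on the version directly:

```go
package main

import "fmt"

type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// VersionPolicy centralizes version-sensitive decisions. Only
// CanAttemptCatchup is sketched here; the other hooks would follow
// the same pattern.
type VersionPolicy interface {
	// CanAttemptCatchup reports whether a lagging replica may enter
	// catch-up instead of being degraded immediately.
	CanAttemptCatchup(historyAvailable, reservationHeld bool) bool
}

type v1Policy struct{}
type v15Policy struct{}
type v2Policy struct{}

// V1: no meaningful catch-up window, always degrade.
func (v1Policy) CanAttemptCatchup(_, _ bool) bool { return false }

// V1.5: catch up while retained WAL history is still available.
func (v15Policy) CanAttemptCatchup(historyAvailable, _ bool) bool {
	return historyAvailable
}

// V2: additionally require an explicit recoverability reservation.
func (v2Policy) CanAttemptCatchup(historyAvailable, reservationHeld bool) bool {
	return historyAvailable && reservationHeld
}

// PolicyFor selects the policy object for a protocol version.
func PolicyFor(v ProtocolVersion) VersionPolicy {
	switch v {
	case ProtocolV15:
		return v15Policy{}
	case ProtocolV2:
		return v2Policy{}
	default:
		return v1Policy{}
	}
}

func main() {
	for _, v := range []ProtocolVersion{ProtocolV1, ProtocolV15, ProtocolV2} {
		fmt.Printf("%s: catch-up with history but no reservation -> %v\n",
			v, PolicyFor(v).CanAttemptCatchup(true, false))
	}
}
```

The same named scenario then runs three times, once per version, with only `PolicyFor` changing the outcome.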
## Example Scenario Comparisons

### Scenario: brief disconnect

`V1`:

- likely degrade / no efficient catch-up

`V1.5`:

- catch-up may succeed if address/history remain stable

`V2`:

- explicit recoverability + reservation
- catch-up only if the missing window is still recoverable
- otherwise explicit rebuild

### Scenario: replica restart with new receiver port

`V1`:

- poor recovery path

`V1.5`:

- background reconnect fails if it retries a stale address

`V2`:

- the identity/address model must make this explicit
- direct reconnect is not assumed
- use explicit reassignment plus catch-up if recoverable, otherwise rebuild cleanly

### Scenario: primary writes faster than catch-up

`V1`:

- replica degrades

`V1.5`:

- may tail-chase indefinitely or pin WAL too long

`V2`:

- explicit non-convergence detection -> abort / rebuild
## What To Measure

For each scenario, compare:

1. does committed data remain safe?
2. does uncommitted data stay out of committed lineage?
3. does recovery complete or stall?
4. does the protocol choose catch-up or rebuild?
5. is the outcome explicit or accidental?
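These five questions can be captured per run as a small result record — the field names below are illustrative, not an existing simulator type — so the same scenario's outcomes under V1, V1.5, and V2 can be diffed directly:

```go
package main

import "fmt"

// ScenarioOutcome records the five comparison questions for one run
// of one scenario under one protocol version.
type ScenarioOutcome struct {
	CommittedSafe     bool   // 1. committed data remained safe
	UncommittedFenced bool   // 2. uncommitted data stayed out of committed lineage
	RecoveryCompleted bool   // 3. recovery completed rather than stalled
	Path              string // 4. "catch-up" or "rebuild"
	ExplicitOutcome   bool   // 5. outcome was an explicit decision, not accidental
}

func main() {
	// Hypothetical results of a brief-disconnect scenario per version.
	results := map[string]ScenarioOutcome{
		"v1":   {CommittedSafe: true, UncommittedFenced: true, RecoveryCompleted: true, Path: "rebuild", ExplicitOutcome: false},
		"v1_5": {CommittedSafe: true, UncommittedFenced: true, RecoveryCompleted: true, Path: "catch-up", ExplicitOutcome: false},
		"v2":   {CommittedSafe: true, UncommittedFenced: true, RecoveryCompleted: true, Path: "catch-up", ExplicitOutcome: true},
	}
	for _, v := range []string{"v1", "v1_5", "v2"} {
		r := results[v]
		fmt.Printf("%s: path=%s explicit=%v\n", v, r.Path, r.ExplicitOutcome)
	}
}
```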
## Immediate Next Step

Start with a minimal versioned policy layer:

1. add `ProtocolVersion`
2. implement one or two version-sensitive hooks:
   - `CanAttemptCatchup`
   - `CatchupConvergencePolicy`
3. run existing scenarios under:
   - `ProtocolV1`
   - `ProtocolV15`
   - `ProtocolV2`

That is enough to begin proving:

- V1 breaks
- V1.5 improves but still strains
- V2 handles the same scenario more cleanly
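A `CatchupConvergencePolicy` hook could be as simple as a trend check on successive lag samples. The verdict type, sample window, and thresholds here are assumptions for illustration; the point is that V2 turns "never converges" into an explicit escalation:

```go
package main

import "fmt"

// ConvergenceVerdict is what a CatchupConvergencePolicy hook could return
// after observing successive lag samples.
type ConvergenceVerdict int

const (
	KeepGoing ConvergenceVerdict = iota
	Converged
	AbortToRebuild
)

// judgeCatchup inspects the replica's gap (entries behind the primary)
// over successive ticks. A gap that stops shrinking under ongoing writes
// is tail-chasing and must escalate explicitly rather than loop forever.
func judgeCatchup(gaps []uint64) ConvergenceVerdict {
	if len(gaps) == 0 {
		return KeepGoing
	}
	last := gaps[len(gaps)-1]
	if last == 0 {
		return Converged
	}
	// Need a few samples before judging the trend.
	if len(gaps) < 3 {
		return KeepGoing
	}
	// Non-shrinking gap over the observation window -> tail-chasing.
	if last >= gaps[len(gaps)-3] {
		return AbortToRebuild
	}
	return KeepGoing
}

func main() {
	fmt.Println(judgeCatchup([]uint64{100, 60, 20, 0})) // gap closed
	fmt.Println(judgeCatchup([]uint64{100, 110, 130}))  // gap growing: tail-chasing
}
```

Under a V1.5-style policy the tail-chasing verdict might only log; under V2 it drives the explicit abort/rebuild transition.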
## Bottom Line

The same scenario set should become a comparison harness across protocol generations.

That is one of the strongest uses of the simulator:

- not only "does V2 work?"
- but "why is V2 better than V1 and V1.5?"
# V1, V1.5, and V2 Comparison

Date: 2026-03-27

## Purpose

This document compares:

- `V1`: the original replicated WAL shipping model
- `V1.5`: Phase 13 catch-up-first improvements on top of V1
- `V2`: the explicit FSM / orchestrator / recoverability-driven design under `sw-block/`

It is a design comparison, not a marketing document.

## 1. One-line summary

- `V1` is simple but weak on short-gap recovery.
- `V1.5` materially improves recovery, but still relies on assumptions and incremental control-plane fixes.
- `V2` is structurally cleaner, more explicit, and easier to validate, but is not yet a production engine.
## 2. Steady-State Hot Path

In the healthy case, all three versions can look similar:

1. the primary appends ordered WAL
2. the primary ships entries to replicas
3. replicas apply in order
4. the durability barrier determines when a client-visible commit completes

### V1

- simplest replication path
- a lagging replica typically degrades quickly
- little explicit recovery structure

### V1.5

- same basic hot path as V1
- WAL retention and reconnect/catch-up improve short-outage handling
- extra logic exists, but much of it is off the hot path

### V2

- can keep a similar hot path if implemented carefully
- extra complexity is mainly in:
  - the recovery planner
  - the replica state machine
  - the coordinator/orchestrator
  - recoverability checks

### Performance expectation

In a normal healthy cluster:

- `V2` should not be much heavier than `V1.5`
- most V2 complexity sits in failure/recovery/control paths
- there is no proof yet that V2 has better steady-state throughput or latency
## 3. Recovery Behavior

### V1

Recovery is weakly structured:

- a lagging replica tends to degrade
- a short outage often becomes a rebuild or a long degraded state
- little explicit catch-up boundary

### V1.5

Recovery is improved:

- a short outage can recover by retained-WAL catch-up
- background reconnect closes the `sync_all` dead-loop
- catch-up is preferred before rebuild

But the model is still partly implicit:

- reconnect depends on endpoint stability unless the control plane refreshes the assignment
- the recoverability boundary is not as explicit as in V2
- tail-chasing and retention pressure still need policy care

### V2

Recovery is explicit by design, with named replica states:

- `InSync`
- `Lagging`
- `CatchingUp`
- `NeedsRebuild`
- `Rebuilding`

And explicit decisions exist for:

- catch-up vs rebuild
- stale-epoch rejection
- promotion candidate choice
- recoverable vs unrecoverable gap
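The state set above can be sketched as an explicit transition table. The allowed-transition relation below is an assumption inferred from this document, not the definitive V2 FSM; the point is that anything absent from the table is a protocol bug, not an implicit fallback:

```go
package main

import "fmt"

// ReplicaState is the explicit V2 per-replica recovery state.
type ReplicaState string

const (
	InSync       ReplicaState = "InSync"
	Lagging      ReplicaState = "Lagging"
	CatchingUp   ReplicaState = "CatchingUp"
	NeedsRebuild ReplicaState = "NeedsRebuild"
	Rebuilding   ReplicaState = "Rebuilding"
)

// allowed encodes which transitions the protocol permits.
var allowed = map[ReplicaState][]ReplicaState{
	InSync:       {Lagging},
	Lagging:      {CatchingUp, NeedsRebuild},
	CatchingUp:   {InSync, NeedsRebuild}, // explicit abort path, no silent retry
	NeedsRebuild: {Rebuilding},
	Rebuilding:   {InSync, NeedsRebuild},
}

// CanTransition reports whether from -> to is a legal protocol move.
func CanTransition(from, to ReplicaState) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(CatchingUp, NeedsRebuild)) // the explicit escalation
	fmt.Println(CanTransition(NeedsRebuild, InSync))     // illegal: must go through Rebuilding
}
```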
## 4. Real V1.5 Lessons

The main V2 requirements come from real V1.5 behavior.

### 4.1 Changed-address restart

Observed in `CP13-8 T4b`:

- the replica restarted
- its endpoint changed
- the primary shipper held the stale address
- direct reconnect could not succeed until the control plane refreshed the assignment

V1.5 fix:

- the saved address is used only as a hint
- the heartbeat-reported address becomes the source of truth
- the master refreshes the primary's assignment

Lesson for V2:

- endpoint is not identity
- reassignment must be explicit

### 4.2 Reconnect race

Observed in the Phase 13 review:

- the barrier path and the background reconnect path could both trigger reconnect

V1.5 fix:

- `reconnectMu` serializes reconnect / catch-up

Lesson for V2:

- one active recovery session per replica should be a protocol rule, not just a local mutex trick

### 4.3 Tail-chasing

Even with retained WAL:

- the primary may write faster than a lagging replica can recover
- catch-up may not converge

Lesson for V2:

- explicit abort / `NeedsRebuild`
- do not pretend catch-up will always work

### 4.4 Control-plane recovery latency

V1.5 can be correct but still operationally slow if recovery waits on slower management cycles.

Lesson for V2:

- keep authority in the coordinator
- but make recovery decisions explicit and fast when possible
## 5. V2 Structural Improvements

V2 is better primarily because it is easier to reason about and validate.

### 5.1 Better state model

Instead of implicit recovery behavior, V2 has:

- a per-replica FSM
- a volume/orchestrator model
- a distributed simulator with scenario coverage

### 5.2 Better validation

V2 has:

- a named scenario backlog
- protocol-state assertions
- randomized simulation
- V1/V1.5/V2 comparison tests

This is a major difference from V1/V1.5, where many fixes were discovered through implementation and hardware testing first.

### 5.3 Better correctness boundaries

V2 makes these explicit:

- recoverable gap vs rebuild
- stale traffic rejection
- promotion lineage safety
- reservation or payload-availability transitions
## 6. Stability Comparison

### Current judgment

- `V1`: least stable under failure/recovery stress
- `V1.5`: meaningfully better, and now functionally validated in real tests
- `V2`: best protocol structure and best simulator confidence

### Important limit

`V2` is not yet proven more stable in production because:

- it is not a production engine yet
- confidence comes from simulator/design work, not real block-workload deployment

So the accurate statement is:

- `V2` is more stable **architecturally**
- `V1.5` is more stable **operationally today** because it is implemented and tested on real hardware
## 7. Performance Comparison

### What is likely true

`V2` should perform better than rebuild-heavy recovery approaches when:

- the outage is short
- the gap is recoverable
- catch-up avoids a full rebuild

It should also behave better under:

- flapping replicas
- stale delayed messages
- mixed-state replica sets

### What is not yet proven

We do not yet know whether `V2` has:

- better steady-state throughput
- lower p99 latency
- lower CPU overhead
- lower memory overhead

than `V1.5`.

That requires real implementation and benchmarking.
## 8. Smart WAL Fit

### Why Smart WAL is awkward in V1/V1.5

V1/V1.5 do not naturally model:

- payload classes
- recoverability reservations
- historical payload resolution
- an explicit recoverable/unrecoverable transition

So Smart WAL would be harder to add cleanly there.

### Why Smart WAL fits V2 better

V2 already has the right conceptual slots:

- `RecoveryClass`
- `WALInline`
- `ExtentReferenced`
- the recoverability planner
- the catch-up vs rebuild decision point
- a simulator for payload-availability transitions

### Important rule

Smart WAL must not mean:

- "read the current extent for an old LSN"

That is incorrect.

Historical correctness requires:

- WAL inline payload
- or pinned snapshot/versioned extent state
- not current live extent contents
## 9. What Is Proven Today

### Proven

- `V1.5` significantly improves V1 recovery behavior
- real `CP13-8` testing validated the V1.5 data path and `sync_all` behavior
- the V2 simulator covers:
  - stale traffic rejection
  - tail-chasing
  - flapping replicas
  - multi-promotion lineage
  - changed-address restart comparison
  - same-address transient outage comparison
  - Smart WAL availability transitions

### Not yet proven

- V2 production implementation quality
- V2 steady-state performance advantage
- V2 real-hardware recovery performance
## 10. Bottom Line

If choosing based on current evidence:

- use `V1.5` as the production line today
- use `V2` as the better long-term architecture

If choosing based on protocol quality:

- `V2` is clearly better structured
- `V1.5` is still more ad hoc, even after successful fixes

If choosing based on current real-world proof:

- `V1.5` has the stronger operational evidence today
- `V2` has the stronger design and simulation evidence today
# V1 / V1.5 / V2 Simulator Goals

Date: 2026-03-26
Status: working design note
Purpose: define how the simulator should be used against WAL V1, Phase-13 V1.5, and WAL V2

## Why This Exists

The simulator is not only for validating V2.

It should also be used to:

1. break WAL V1
2. stress WAL V1.5 / Phase 13
3. justify why WAL V2 is needed

This note defines what failures we want the simulator to find in each protocol generation.
## What The Simulator Can And Cannot Do

### What it is good at

The simulator is good at:

1. finding concrete counterexamples
2. exposing bad protocol assumptions
3. checking commit / failover / fencing invariants
4. checking historical data correctness at a target `LSN`

### What it is not

The simulator is not a full proof unless promoted to formal model checking.

So the right claim is:

- "no issue found under these modeled runs"

not:

- "protocol proven correct in all implementations"
## Protocol Targets

### WAL V1

Core shape:

- the primary ships WAL out
- a lagging replica degrades quickly
- no real recoverability contract
- no strong short-gap catch-up window

Primary risk:

- a briefly lagging replica gets downgraded too early and forced into rebuild

### WAL V1.5 / Phase 13

Core shape:

- the primary retains WAL longer for lagging replicas
- reconnect / catch-up exists
- a rebuild fallback exists
- the primary may wait before releasing WAL

Primary risks:

- WAL pinning
- tail-chasing
- slow availability recovery
- recoverability assumptions that do not hold long enough

### WAL V2

Core shape:

- explicit state machine
- explicit recoverability / reservation
- the catch-up vs rebuild boundary is formalized
- eventual support for `WALInline` vs `ExtentReferenced`

Primary goals:

- no committed data loss
- no false recovery
- cheaper and clearer short-gap recovery
## What To Find In WAL V1

The simulator should try to find scenarios where V1 fails operationally or structurally.

### V1-F1. Short Disconnect Still Forces Rebuild

Sequence:

1. replica disconnects briefly
2. primary continues writing
3. replica returns quickly

Expected ideal behavior:

- short-gap catch-up

What V1 may do:

- downgrade the replica too early
- no usable catch-up path
- rebuild required unnecessarily

### V1-F2. Jitter Causes Avoidable Degrade

Sequence:

1. replica is alive but sees delayed/reordered delivery
2. primary interprets this as lag/failure

Failure signal:

- unnecessary downgrade or exclusion

### V1-F3. Repeated Brief Flaps Cause Thrash

Sequence:

1. repeated short disconnect/reconnect
2. primary repeatedly degrades the replica

Failure signal:

- poor availability
- excessive rebuild churn

### V1-F4. No Efficient Path Back To Healthy State

Sequence:

1. replica becomes degraded
2. network recovers

Failure signal:

- the control plane or protocol provides no clean short recovery path
## What To Find In WAL V1.5 / Phase 13

The simulator should stress whether retention-based catch-up is actually enough.

### V15-F1. Tail Chasing Under Ongoing Writes

Sequence:

1. replica reconnects behind
2. primary keeps writing
3. catch-up tries to close the gap

Failure signal:

- replica never converges
- stays forever behind
- no clean escalation path

### V15-F2. WAL Pinning Harms System Progress

Sequence:

1. replica lags
2. primary retains WAL to help recovery
3. lag persists

Failure signal:

- the WAL window remains pinned too long
- reclaim stalls
- system availability or throughput suffers

### V15-F3. Catch-Up Window Expires Mid-Recovery

Sequence:

1. catch-up begins
2. primary continues advancing
3. required recoverability disappears before completion

Failure signal:

- the protocol still claims success
- or lacks a clean abort-to-rebuild path

### V15-F4. Restart Recovery Too Slow

Sequence:

1. replica restarts
2. primary blocks writes correctly under `sync_all`
3. service recovery takes too long

Failure signal:

- correctness preserved
- but availability recovery is operationally unacceptable

### V15-F5. Multiple Lagging Replicas Poison Progress

Sequence:

1. more than one replica lags
2. retention and recovery obligations interact

Failure signal:

- one slow replica or mixed states poisons the entire volume behavior
## What WAL V2 Should Survive

V2 should not merely avoid V1/V1.5 failures.
It should make them explicit and manageable.

### V2-S1. Short Gap Recovers Cheaply

Expected:

- brief disconnect -> catch-up -> promote
- no rebuild

### V2-S2. Impossible Catch-Up Fails Cleanly

Expected:

- not fully recoverable -> `NeedsRebuild`
- no pretend success

### V2-S3. Reservation Loss Forces Correct Abort

Expected:

- once recoverability is lost, catch-up aborts
- the rebuild path takes over

### V2-S4. Promotion Is Lineage-First

Expected:

- the new primary is chosen from valid lineage
- not simply the highest apparent `LSN`

### V2-S5. Historical Data Correctness Is Preserved

Expected:

- no rebuild from the current extent pretending to be old state
- correct snapshot/base + replay behavior
## Simulation Strategy By Version

### For V1

Use the simulator to:

- break it
- demonstrate avoidable rebuilds and downgrade behavior

The simulator is mainly a diagnostic and justification tool here.

### For V1.5

Use the simulator to:

- stress retention-based catch-up
- find operational limits
- expose where retention alone is not enough

The simulator is a stress and tradeoff tool here.

### For V2

Use the simulator to:

- validate named protocol scenarios
- validate random/adversarial runs
- confirm state + data correctness under failover/recovery

The simulator is a design-validation tool here.
## Practical Outcome

If the simulator finds:

### On V1

- short outages still lead to rebuild

Then the conclusion is:

- V1 lacks a real short-gap recovery story

### On V1.5

- retention helps but can still tail-chase or pin WAL too long

Then the conclusion is:

- V1.5 is a useful bridge, but not the final architecture

### On V2

- the catch-up/rebuild boundary is explicit and safe

Then the conclusion is:

- V2 solves the protocol problem more cleanly
## Bottom Line

Use the simulator differently for each generation:

1. WAL V1: find where it breaks
2. WAL V1.5: find where it strains
3. WAL V2: validate that it behaves correctly and more cleanly

That is how the simulator justifies the architectural move from V1 to V2.
# V2 Acceptance Criteria

Date: 2026-03-27

## Purpose

This document defines the minimum protocol-validation bar for V2.

It is not the full scenario backlog.

It is the smaller acceptance set that should be true before we claim:

- the V2 protocol shape is validated enough to guide implementation
## Scope

This acceptance set is about:

- protocol correctness
- recovery correctness
- lineage / fencing correctness
- data correctness at a target `LSN`

This acceptance set is not yet about:

- production performance
- frontend integration
- the wire protocol
- disk implementation details
## Acceptance Rule

A V2 acceptance item should satisfy all of:

1. a named scenario
2. explicit expected behavior
3. simulator coverage
4. a clear invariant or pass condition
5. a mapped reason why it matters
## Acceptance Set

### A1. Committed Data Survives Failover

Must prove:

- acknowledged data is not lost after primary failure and promotion

Evidence:

- `S1`
- distributed simulator pass

Pass condition:

- the promoted node matches the reference state at the committed `LSN`

### A2. Uncommitted Data Is Not Revived

Must prove:

- non-acknowledged writes do not become committed after failover

Evidence:

- `S2`

Pass condition:

- the committed prefix remains at the previous valid boundary

### A3. Stale Epoch Traffic Is Fenced

Must prove:

- old-primary / stale-sender traffic cannot mutate current lineage

Evidence:

- `S3`
- stale write / stale barrier / stale delayed ack scenarios

Pass condition:

- stale traffic is rejected
- the committed prefix does not change
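A minimal sketch of the A3 fencing rule — the message and replica shapes are illustrative, not the real wire types: the receiver tracks its current epoch, and anything from an older epoch is dropped before it can touch the committed prefix:

```go
package main

import "fmt"

// Write is an illustrative replication message carrying the sender's epoch.
type Write struct {
	Epoch uint64
	LSN   uint64
}

// Replica tracks the fencing state relevant to A3.
type Replica struct {
	Epoch        uint64
	CommittedLSN uint64
}

// Apply enforces the fencing rule: a write from an older epoch is
// rejected outright, so the committed prefix cannot move.
func (r *Replica) Apply(w Write) bool {
	if w.Epoch < r.Epoch {
		return false // stale sender, fenced
	}
	if w.LSN > r.CommittedLSN {
		r.CommittedLSN = w.LSN
	}
	return true
}

func main() {
	r := &Replica{Epoch: 3, CommittedLSN: 40}
	fmt.Println(r.Apply(Write{Epoch: 2, LSN: 99}), r.CommittedLSN) // stale: rejected, prefix unchanged
	fmt.Println(r.Apply(Write{Epoch: 3, LSN: 41}), r.CommittedLSN) // current epoch: applied
}
```

The pass condition above is exactly the second half of this check: after a stale message, `CommittedLSN` must be unchanged.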
### A4. Short-Gap Catch-Up Works

Must prove:

- a brief outage with a recoverable gap returns via catch-up, not rebuild

Evidence:

- `S4`
- same-address transient outage comparison

Pass condition:

- the recovered replica returns to `InSync`
- the final state matches the reference

### A5. Non-Convergent Catch-Up Escalates Explicitly

Must prove:

- tail-chasing or failed catch-up does not pretend success

Evidence:

- `S6`

Pass condition:

- explicit `CatchingUp -> NeedsRebuild`

### A6. Recoverability Boundary Is Explicit

Must prove:

- a recoverable vs unrecoverable gap is decided explicitly

Evidence:

- `S7`
- Smart WAL availability transition scenarios

Pass condition:

- recovery aborts when reservation/payload availability is lost
- rebuild becomes the explicit fallback

### A7. Historical Data Correctness Holds

Must prove:

- recovered data for a target `LSN` is historically correct
- the current extent cannot fake old history

Evidence:

- `S8`
- `S9`

Pass condition:

- snapshot + tail rebuild matches the reference state
- current-extent reconstruction of an old `LSN` fails correctness

### A8. Durability Mode Semantics Are Correct

Must prove:

- `best_effort`, `sync_all`, and `sync_quorum` behave as intended under mixed replica states

Evidence:

- `S10`
- `S11`
- timeout-backed quorum/all race tests

Pass condition:

- `sync_all` remains strict
- `sync_quorum` commits only with a true durable quorum
- invalid `sync_quorum` topology assumptions are rejected
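The three modes can be sketched as a single commit predicate. The exact quorum arithmetic below (majority of all members, primary counted among the durable set, two-member minimum for `sync_quorum`) is an assumption for illustration, not the definitive A8 semantics:

```go
package main

import "fmt"

// CommitReached reports whether a write at the barrier may be
// acknowledged, given how many of the total members (primary included)
// have made it durable.
func CommitReached(mode string, durable, total int) bool {
	switch mode {
	case "best_effort":
		return durable >= 1 // primary durability alone is enough
	case "sync_all":
		return durable == total // strict: every member must be durable
	case "sync_quorum":
		// True majority; reject a degenerate single-member "quorum".
		return total >= 2 && durable >= total/2+1
	default:
		return false
	}
}

func main() {
	// 3 members, only the primary and one replica durable.
	fmt.Println(CommitReached("sync_all", 2, 3))    // strict mode: not yet committed
	fmt.Println(CommitReached("sync_quorum", 2, 3)) // majority reached
}
```

The mixed-replica-state scenarios then become table-driven checks over this predicate, one row per (mode, durable, total) combination.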
### A9. Promotion Uses Safe Candidate Eligibility

Must prove:

- promotion requires:
  - a running node
  - epoch alignment
  - state eligibility
  - committed-prefix sufficiency

Evidence:

- a stronger `S12`
- candidate eligibility tests

Pass condition:

- unsafe candidates are rejected by default
- desperate promotion, if any, is explicit and separate

### A10. Changed-Address Restart Is Explicitly Recoverable

Must prove:

- endpoint is not identity
- changed-address restart does not rely on stale endpoint reuse

Evidence:

- V1 / V1.5 / V2 changed-address comparison
- endpoint-version / assignment-update simulator flow

Pass condition:

- the stale endpoint is rejected
- a control-plane update refreshes the primary's view
- recovery proceeds only after the explicit update

### A11. Timeout Semantics Are Explicit

Must prove:

- barrier, catch-up, and reservation timeouts are first-class protocol behavior

Evidence:

- Phase 03 P0 timeout tests

Pass condition:

- timeout effects are explicit
- stale timeouts do not regress recovered state
- a late barrier ack after timeout is rejected

### A12. Timer Races Are Stable

Must prove:

- timer/event ordering does not silently break protocol guarantees

Evidence:

- Phase 03 P1/P2 race tests

Pass condition:

- same-tick ordering is explicit
- promotion / epoch bump / timeout interactions preserve invariants
- traces are debuggable
## Compare Requirement |
|||
|
|||
Where meaningful, V2 acceptance should include comparison against: |
|||
|
|||
- `V1` |
|||
- `V1.5` |
|||
|
|||
Especially for: |
|||
|
|||
- changed-address restart |
|||
- same-address transient outage |
|||
- tail-chasing |
|||
- slow control-plane recovery |
|||
|
|||
## Required Evidence |
|||
|
|||
Before calling V2 protocol validation “good enough”, we want: |
|||
|
|||
1. scenario coverage in `v2_scenarios.md` |
|||
2. selected simulator tests in `distsim` |
|||
3. timing/race tests in `eventsim` |
|||
4. V1 / V1.5 / V2 comparison where relevant |
|||
5. review sign-off that the tests prove the right thing |
|||
|
|||
## What This Does Not Prove |
|||
|
|||
Even if all acceptance items pass, this still does not prove: |
|||
|
|||
- production implementation quality |
|||
- wire protocol correctness |
|||
- real performance |
|||
- disk-level behavior |
|||
|
|||
Those require later implementation and real-system validation. |
|||
|
|||
## Bottom Line |
|||
|
|||
If A1 through A12 are satisfied, V2 is validated enough at the protocol/design level to justify: |
|||
|
|||
1. implementation slicing |
|||
2. Smart WAL design refinement |
|||
3. later real-engine integration |
|||
@ -0,0 +1,234 @@ |
|||
# WAL V2 Distributed Simulator |
|||
|
|||
Date: 2026-03-26 |
|||
Status: design proposal |
|||
Purpose: define the next prototype layer above `ReplicaFSM` and `VolumeModel` so WAL V2 can be validated as a distributed state machine rather than only a local state machine |
|||
|
|||
## Why This Exists |
|||
|
|||
The current V2 prototype already has: |
|||
|
|||
- `ReplicaFSM` |
|||
- `VolumeModel` |
|||
- `RecoveryPlanner` |
|||
- scenario tracing |
|||
|
|||
That is enough to reason about local recovery logic and volume-level admission. |
|||
|
|||
It is not enough to prove the distributed safety claim. |
|||
|
|||
The real system question is: |
|||
|
|||
- when time moves forward, nodes start/stop/disconnect/reconnect, and the coordinator changes epoch, |
|||
- do all acknowledged writes remain recoverable according to the configured durability policy? |
|||
|
|||
That requires a distributed simulator. |
|||
|
|||
## Core Idea |
|||
|
|||
Model the system as: |
|||
|
|||
1. node-local state machines |
|||
2. a coordinator state machine |
|||
3. a time-driven message simulator |
|||
4. a reference data model used as the correctness oracle |
|||
|
|||
## Layers |
|||
|
|||
### 1. `NodeModel` |
|||
|
|||
Each node has: |
|||
|
|||
- role |
|||
- epoch seen |
|||
- local WAL state |
|||
- head |
|||
- tail |
|||
- `receivedLSN` |
|||
- `flushedLSN` |
|||
- checkpoint/snapshot state |
|||
- `cpLSN` |
|||
- local extent state |
|||
- local connectivity state |
|||
- local `ReplicaFSM` for each remote relationship as needed |
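A minimal Go sketch of such a node record (field names are illustrative assumptions, not the prototype's declarations) might look like:

```go
package main

import "fmt"

// WALState captures the node-local WAL positions from the list above.
type WALState struct {
	HeadLSN     uint64
	TailLSN     uint64
	ReceivedLSN uint64
	FlushedLSN  uint64
}

// NodeModel is one simulated node: role, epoch view, WAL positions,
// checkpoint position, extent contents, and connectivity.
type NodeModel struct {
	ID        string
	Role      string // "primary" or "replica"
	EpochSeen uint64
	WAL       WALState
	CpLSN     uint64            // checkpoint/snapshot position
	Extent    map[uint64]uint64 // block ID -> value (synthetic 4K writes)
	Connected bool
}

func main() {
	n := NodeModel{ID: "n1", Role: "replica", Extent: map[uint64]uint64{}}
	fmt.Println(n.ID, n.Role)
}
```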

### 2. `CoordinatorModel`

The coordinator owns:

- current epoch
- primary assignment
- membership
- durability policy
- rebuild assignments
- promotion decisions

### 3. `Network/Time Simulator`

The simulator owns:

- logical time ticks
- message delivery queues
- delay, drop, and disconnect events
- node start/stop/restart

### 4. `Reference Model`

The reference model is the correctness oracle.

It applies the committed write history to an idealized block map.
At any target `LSN = X`, it can answer:

- what value should each block contain at `X`?

## Data Correctness Model

### Synthetic 4K writes

For simulation, each 4K write should be represented as:

- block ID
- value

A simple deterministic choice is:
- `value = LSN`

Example:
- `LSN 10`: write block 7 = 10
- `LSN 11`: write block 2 = 11
- `LSN 12`: write block 7 = 12

This makes correctness checks trivial.

### Why this matters

This catches the exact extent-recovery trap:

1. `LSN 10`: block 7 = 10
2. `LSN 12`: block 7 = 12

If recovery claims to rebuild state at `LSN 10` using current extent and returns block 7 = 12, the simulator detects the bug immediately.
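The trap above can be made mechanically checkable with a tiny reference-model sketch. Type and method names here are illustrative, not the prototype's:

```go
package main

import "fmt"

// Write is one synthetic 4K write: value = LSN by convention.
type Write struct {
	LSN   uint64
	Block uint64
	Value uint64
}

// ReferenceModel replays the committed write history onto an idealized
// block map and answers "what should block B contain at LSN X?".
type ReferenceModel struct{ history []Write }

func (r *ReferenceModel) Apply(w Write) { r.history = append(r.history, w) }

// StateAt returns the block map as it should look at the target LSN.
func (r *ReferenceModel) StateAt(lsn uint64) map[uint64]uint64 {
	m := map[uint64]uint64{}
	for _, w := range r.history {
		if w.LSN <= lsn {
			m[w.Block] = w.Value
		}
	}
	return m
}

func main() {
	ref := &ReferenceModel{}
	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})

	// Recovery claims state at LSN 10 but used the current extent (block 7 = 12).
	claimed := map[uint64]uint64{7: 12}
	want := ref.StateAt(10)
	fmt.Println(claimed[7] == want[7]) // false: the simulator flags the bug
}
```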

## Golden Invariant

For any node declared recovered to target `LSN = T`:

- node extent state must equal the reference model's state at `T`

Not:
- equal to current latest state
- equal to any valid-looking value

Exactly:
- the reference state at target `LSN`

## Recovery Correctness Rules

### WAL replay correctness

For `(startLSN, endLSN]` replay to be valid:

- every record in the interval must exist
- every payload must be the correct historical version for its LSN
- no replay gaps are allowed
- no stale-epoch records are allowed
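These rules can also be checked mechanically. A hedged sketch, assuming the synthetic `value = LSN` convention from above (`validReplay` is a hypothetical name):

```go
package main

import "fmt"

type Record struct {
	LSN     uint64
	Epoch   uint64
	Payload uint64 // synthetic: the correct historical payload equals the LSN
}

// validReplay checks the (startLSN, endLSN] interval: no gaps, no
// stale-epoch records, and every payload is the correct historical
// version for its LSN (trivial under the value = LSN convention).
func validReplay(recs []Record, startLSN, endLSN, minEpoch uint64) bool {
	next := startLSN + 1
	for _, r := range recs {
		if r.LSN != next || r.Epoch < minEpoch || r.Payload != r.LSN {
			return false
		}
		next++
	}
	return next == endLSN+1 // interval fully covered, no trailing gap
}

func main() {
	good := []Record{{LSN: 11, Epoch: 3, Payload: 11}, {LSN: 12, Epoch: 3, Payload: 12}}
	gap := []Record{{LSN: 11, Epoch: 3, Payload: 11}} // record 12 missing
	fmt.Println(validReplay(good, 10, 12, 3), validReplay(gap, 10, 12, 3)) // true false
}
```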

### Extent/snapshot correctness

Extent-based recovery is valid only if the data source is version-correct.

Allowed examples:
- immutable snapshot at `cpLSN`
- pinned copy-on-write generation
- pinned payload object referenced by a recovery record

Not allowed:
- current live extent used as if it were historical state at old `cpLSN`

## Suggested Prototype Package

Prototype location:
- `sw-block/prototype/distsim/`

Suggested files:
- `types.go`
- `node.go`
- `coordinator.go`
- `network.go`
- `reference.go`
- `scenario.go`
- `sim_test.go`

## Minimal First Milestone

Do not try to simulate the whole product first.

First milestone:

1. one primary
2. one replica
3. time ticks
4. synthetic 4K writes with deterministic values
5. canonical reference model
6. simple recovery check:
   - WAL replay recovers correct value
   - current extent alone does not recover old `LSN`
   - snapshot/base image at `cpLSN` does recover correct value

If that milestone is solid, then add:
- failover
- quorum
- multi-replica
- coordinator promotion rules

## Test Cases To Add Early

### 1. WAL replay preserves historical values
- write block 7 = 10
- write block 7 = 12
- replay only to `LSN 10`
- expect block 7 = 10

### 2. Current extent cannot reconstruct old `LSN`
- same write sequence
- try rebuilding `LSN 10` from latest extent
- expect mismatch/error

### 3. Snapshot at `cpLSN` works
- snapshot at `LSN 10`
- later overwrite block 7 at `LSN 12`
- rebuild from snapshot `LSN 10`
- expect block 7 = 10

### 4. Reservation expiration invalidates recovery
- recovery window initially valid
- time advances
- reservation expires
- recovery must abort rather than return partial or wrong state
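Test case 4 can be sketched in Go as follows. The `Reservation` shape is an assumption; the real reservation protocol is still an open design question (see `v2-open-questions.md`):

```go
package main

import "fmt"

// Reservation: recovery may proceed only while the reservation covering
// its WAL range is unexpired. Field names are illustrative.
type Reservation struct {
	StartLSN, EndLSN uint64
	ExpiresAtTick    uint64
}

func (r Reservation) Valid(nowTick uint64) bool { return nowTick < r.ExpiresAtTick }

// recoverStep aborts (rather than returning partial or wrong state)
// once logical time passes the reservation deadline.
func recoverStep(res Reservation, nowTick uint64) (string, error) {
	if !res.Valid(nowTick) {
		return "", fmt.Errorf("reservation expired at tick %d", res.ExpiresAtTick)
	}
	return "replaying", nil
}

func main() {
	res := Reservation{StartLSN: 10, EndLSN: 20, ExpiresAtTick: 5}
	_, err1 := recoverStep(res, 3) // still valid
	_, err2 := recoverStep(res, 7) // expired: must abort
	fmt.Println(err1 == nil, err2 != nil) // true true
}
```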

## Relationship To Existing Prototype

This simulator should reuse existing prototype concepts where possible:

- `fsmv2` for node-local recovery lifecycle
- `volumefsm` ideas for mode semantics and admission
- `RecoveryPlanner` for recoverability decisions

The simulator is the next proof layer:
- not just whether transitions are legal
- but whether data remains correct under those transitions

## Bottom Line

WAL V2 correctness is not only a state problem.
It is also a data-version problem.

The distributed simulator should therefore prove two things together:

1. state-machine safety
2. data correctness at target `LSN`

That is the right next prototype layer if the goal is to prove:
- quorum commit safety
- no committed data loss
- no incorrect recovery from later extent state

@@ -0,0 +1,159 @@

# V2 First Slice: Per-Replica Sender/Session Ownership

Date: 2026-03-27
Status: implementation-ready
Depends-on: Q1 (recovery session), Q6 (orchestrator scope), Q7 (first slice)

## Problem

`SetReplicaAddrs()` replaces the entire `ShipperGroup` atomically. This causes:

1. **State loss on topology change.** All shippers are destroyed and recreated.
   Recovery state (`replicaFlushedLSN`, `lastContactTime`, catch-up progress) is lost.
   After a changed-address restart, the new shipper starts from scratch.

2. **No per-replica identity.** Shippers are identified by array index. The master
   cannot target a specific replica for rebuild/catch-up — it must re-issue the
   entire address set.

3. **Background reconnect races.** A reconnect cycle may be in progress when
   `SetReplicaAddrs` replaces the group. The in-progress reconnect's connection
   objects become orphaned.

## Design

### Per-replica sender identity

`ShipperGroup` changes from `[]*WALShipper` to `map[string]*WALShipper`, keyed by
the replica's canonical data address. Each shipper stores its own `ReplicaID`.

```go
type WALShipper struct {
    ReplicaID string // canonical data address — identity across reconnects
    // ... existing fields
}

type ShipperGroup struct {
    mu       sync.RWMutex
    shippers map[string]*WALShipper // keyed by ReplicaID
}
```

### ReconcileReplicas replaces SetReplicaAddrs

Instead of replacing the entire group, `ReconcileReplicas` diffs old vs new:

```
ReconcileReplicas(newAddrs []ReplicaAddr):
    for each existing shipper:
        if NOT in newAddrs → Stop and remove
    for each newAddr:
        if matching shipper exists → keep (preserve state)
        if no match → create new shipper
```

This preserves `replicaFlushedLSN`, `lastContactTime`, catch-up progress, and
background reconnect goroutines for replicas that stay in the set.
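A compact runnable sketch of the diff logic (simplified: no locking, and a local `Shipper` type stands in for `WALShipper`):

```go
package main

import "fmt"

// Shipper stands in for WALShipper; only identity and the state we want
// preserved are modeled here.
type Shipper struct {
	ReplicaID         string
	ReplicaFlushedLSN uint64
	stopped           bool
}

func (s *Shipper) Stop() { s.stopped = true }

type ShipperGroup struct {
	shippers map[string]*Shipper
}

// ReconcileReplicas diffs the desired set against the current one:
// stale shippers are stopped and removed, surviving shippers keep their
// state, and new addresses get fresh shippers from the factory.
func (g *ShipperGroup) ReconcileReplicas(addrs []string, factory func(string) *Shipper) {
	want := map[string]bool{}
	for _, a := range addrs {
		want[a] = true
	}
	for id, s := range g.shippers {
		if !want[id] {
			s.Stop()
			delete(g.shippers, id)
		}
	}
	for _, a := range addrs {
		if _, ok := g.shippers[a]; !ok {
			g.shippers[a] = factory(a)
		}
	}
}

func main() {
	g := &ShipperGroup{shippers: map[string]*Shipper{
		"r1:9001": {ReplicaID: "r1:9001", ReplicaFlushedLSN: 42},
	}}
	g.ReconcileReplicas([]string{"r1:9001", "r2:9002"},
		func(a string) *Shipper { return &Shipper{ReplicaID: a} })
	// r1 kept its progress; r2 starts fresh.
	fmt.Println(g.shippers["r1:9001"].ReplicaFlushedLSN, len(g.shippers)) // 42 2
}
```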

`SetReplicaAddrs` becomes a wrapper:
```go
func (v *BlockVol) SetReplicaAddrs(addrs []ReplicaAddr) {
    if v.shipperGroup == nil {
        v.shipperGroup = NewShipperGroup(nil)
    }
    v.shipperGroup.ReconcileReplicas(addrs, v.makeShipperFactory())
}
```

### Changed-address restart flow

1. Replica restarts on new port. Heartbeat reports new address.
2. Master detects endpoint change (address differs, same volume).
3. Master sends assignment update to primary with new replica address.
4. Primary's `ReconcileReplicas` receives `[oldAddr1, newAddr2]`.
5. Old shipper for the changed replica is stopped (old address gone from set).
6. New shipper created with new address — but this is a fresh shipper.
7. New shipper bootstraps: Disconnected → Connecting → CatchingUp → InSync.

The improvement over V1.5: the **other** replicas in the set are NOT disturbed.
Only the changed replica gets a fresh shipper. Recovery state for stable replicas
is preserved.

### Recovery session

Each WALShipper already contains the recovery state machine:
- `state` (Disconnected → Connecting → CatchingUp → InSync → Degraded → NeedsRebuild)
- `replicaFlushedLSN` (authoritative progress)
- `lastContactTime` (retention budget)
- `catchupFailures` (escalation counter)
- background reconnect goroutine

No separate `RecoverySession` object is needed. The WALShipper IS the per-replica
recovery session. The state machine already tracks the session lifecycle.

What changes: the session is no longer destroyed on topology change (unless the
replica itself is removed from the set).

### Coordinator vs primary responsibilities

| Responsibility | Owner |
|---------------|-------|
| Endpoint truth (canonical address) | Coordinator (master) |
| Assignment updates (add/remove replicas) | Coordinator |
| Epoch authority | Coordinator |
| Session creation trigger | Coordinator (via assignment) |
| Session execution (reconnect, catch-up, barrier) | Primary (via WALShipper) |
| Timeout enforcement | Primary |
| Ordered receive/apply | Replica |
| Barrier ack | Replica |
| Heartbeat reporting | Replica |

### Migration from current code

| Current | V2 |
|---------|-----|
| `ShipperGroup.shippers []*WALShipper` | `ShipperGroup.shippers map[string]*WALShipper` |
| `SetReplicaAddrs()` creates all new | `ReconcileReplicas()` diffs and preserves |
| `StopAll()` in demote | `StopAll()` unchanged (stops all) |
| `ShipAll(entry)` iterates slice | `ShipAll(entry)` iterates map values |
| `BarrierAll(lsn)` parallel slice | `BarrierAll(lsn)` parallel map values |
| `MinReplicaFlushedLSN()` iterates slice | Same, iterates map values |
| `ShipperStates()` iterates slice | Same, iterates map values |
| No per-shipper identity | `WALShipper.ReplicaID` = canonical data addr |

### Files changed

| File | Change |
|------|--------|
| `wal_shipper.go` | Add `ReplicaID` field, pass in constructor |
| `shipper_group.go` | `map[string]*WALShipper`, `ReconcileReplicas`, update iterators |
| `blockvol.go` | `SetReplicaAddrs` calls `ReconcileReplicas`, shipper factory |
| `promotion.go` | No change (StopAll unchanged) |
| `dist_group_commit.go` | No change (uses ShipperGroup API) |
| `block_heartbeat.go` | No change (uses ShipperStates) |

### Acceptance bar

The following existing tests must continue to pass:
- All CP13-1 through CP13-7 protocol tests (sync_all_protocol_test.go)
- All adversarial tests (sync_all_adversarial_test.go)
- All baseline tests (sync_all_bug_test.go)
- All rebuild tests (rebuild_v1_test.go)

The following CP13-8 tests validate the V2 improvement:
- `TestCP13_SyncAll_ReplicaRestart_Rejoin` — changed-address recovery
- `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` — V2 reconnect protocol
- `TestAdversarial_CatchupMultipleDisconnects` — state preservation across reconnects

New tests to add:
- `TestReconcileReplicas_PreservesExistingShipper` — stable replica keeps state
- `TestReconcileReplicas_RemovesStaleShipper` — removed replica stopped
- `TestReconcileReplicas_AddsNewShipper` — new replica bootstraps
- `TestReconcileReplicas_MixedUpdate` — one kept, one removed, one added

## Non-goals for this slice

- Smart WAL payload classes
- Recovery reservation protocol
- Full coordinator orchestration
- New transport layer

@@ -0,0 +1,193 @@

# V2 First Slice: Per-Replica Sender and Recovery Session Ownership

Date: 2026-03-27

## Purpose

This document defines the first real V2 implementation slice.

The slice is intentionally narrow:

- per-replica sender ownership
- explicit recovery session ownership
- clear coordinator vs primary responsibility

This is the first step toward a standalone V2 block engine under `sw-block/`.

## Why This Slice First

It directly addresses the clearest V1.5 structural limits:

- sender identity loss when replica sets are refreshed
- changed-address restart recovery complexity
- repeated reconnect cycles without stable per-replica ownership
- adversarial Phase 13 boundary tests that V1.5 cannot cleanly satisfy

It also avoids jumping too early into:

- Smart WAL
- new backend storage layout
- full production transport redesign

## Core Decision

Use:

- **one sender owner per replica**
- **at most one active recovery session per replica per epoch**

Healthy replicas may need only their steady sender object.

Degraded / reconnecting replicas gain an explicit recovery session owned by the primary.

## Ownership Split

### Coordinator

Owns:

- replica identity / endpoint truth
- assignment updates
- epoch authority
- session creation / destruction intent

Does not own:

- byte-by-byte catch-up execution
- local sender loop scheduling

### Primary

Owns:

- per-replica sender objects
- per-replica recovery session execution
- reconnect / catch-up progress
- timeout enforcement for the active session
- transition from:
  - normal sender
  - to recovery session
  - back to normal sender

### Replica

Owns:

- receive/apply path
- barrier ack
- heartbeat/reporting

The replica remains passive from the recovery-orchestration point of view.

## Data Model

### Sender Owner

Per replica, maintain a stable sender owner with:

- replica logical ID
- current endpoint
- current epoch view
- steady-state health/status
- optional active recovery session reference

### Recovery Session

Per replica, per epoch:

- `ReplicaID`
- `Epoch`
- `EndpointVersion` or equivalent endpoint truth
- `State`
  - `connecting`
  - `catching_up`
  - `in_sync`
  - `needs_rebuild`
- `StartLSN`
- `TargetLSN`
- timeout / deadline metadata
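As a Go sketch of the two records (all field names are assumptions layered on the lists above, not the enginev2 declarations):

```go
package main

import "fmt"

type SessionState int

const (
	Connecting SessionState = iota
	CatchingUp
	InSync
	NeedsRebuild
)

// RecoverySession: at most one active instance per replica per epoch.
type RecoverySession struct {
	ReplicaID       string
	Epoch           uint64
	EndpointVersion uint64
	State           SessionState
	StartLSN        uint64
	TargetLSN       uint64
	DeadlineTick    uint64 // timeout / deadline metadata
}

// Sender is the stable per-replica owner; the session reference is
// non-nil only while the replica is degraded or reconnecting.
type Sender struct {
	ReplicaID string
	Endpoint  string
	Epoch     uint64
	Healthy   bool
	Session   *RecoverySession
}

func main() {
	s := Sender{ReplicaID: "r1", Endpoint: "r1:9001", Epoch: 5}
	s.Session = &RecoverySession{ReplicaID: "r1", Epoch: 5, State: Connecting}
	fmt.Println(s.Session.State == Connecting) // true
}
```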

## Session Rules

1. only one active session per replica per epoch
2. new assignment for same replica:
   - supersedes old session only if epoch/session generation is newer
3. stale session must not continue after:
   - epoch bump
   - endpoint truth change
   - explicit coordinator replacement
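A hedged sketch of rule 2's supersede check; the `(epoch, generation)` comparison is an assumption about how "newer" is defined:

```go
package main

import "fmt"

type Session struct {
	ID    uint64 // monotonically increasing session generation
	Epoch uint64
}

// trySupersede enforces the rules above: a new assignment replaces the
// active session only if its (epoch, generation) is strictly newer;
// anything stale is rejected outright, never joined or resumed.
func trySupersede(active *Session, next Session) (*Session, bool) {
	if active == nil {
		return &next, true
	}
	if next.Epoch > active.Epoch ||
		(next.Epoch == active.Epoch && next.ID > active.ID) {
		return &next, true // old session is abandoned, not resumed
	}
	return active, false
}

func main() {
	active := &Session{ID: 1, Epoch: 5}
	active, ok1 := trySupersede(active, Session{ID: 2, Epoch: 5}) // newer generation
	_, ok2 := trySupersede(active, Session{ID: 1, Epoch: 4})      // stale epoch
	fmt.Println(ok1, ok2, active.ID) // true false 2
}
```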

## Minimal State Transitions

### Healthy path

1. replica sender exists
2. sender ships normally
3. replica remains `InSync`

### Recovery path

1. sender detects or is told replica is not healthy
2. coordinator provides valid assignment/endpoint truth
3. primary creates recovery session
4. session connects
5. session catches up if recoverable
6. on success:
   - session closes
   - steady sender resumes normal state

### Rebuild path

1. session determines catch-up is not sufficient
2. session transitions to `needs_rebuild`
3. higher-layer rebuild flow takes over

## What This Slice Does Not Include

Not in the first slice:

- Smart WAL payload classes in production
- snapshot pinning / GC logic
- new on-disk engine
- frontend publication changes
- full production event scheduler

## Proposed V2 Workspace Target

Do this under `sw-block/`, not `weed/storage/blockvol/`.

Suggested area:

- `sw-block/prototype/enginev2/`

Suggested first files:

- `sw-block/prototype/enginev2/session.go`
- `sw-block/prototype/enginev2/sender.go`
- `sw-block/prototype/enginev2/group.go`
- `sw-block/prototype/enginev2/session_test.go`

The first code does not need full storage I/O.
It should prove ownership and transition shape first.

## Acceptance For This Slice

The slice is good enough when:

1. sender identity is stable per replica
2. changed-address reassignment updates the right sender owner
3. multiple reconnect cycles do not lose recovery ownership
4. stale session does not survive epoch bump
5. the 4 Phase 13 V2-boundary tests have a clear path to become satisfiable

## Relationship To Existing Simulator

This slice should align with:

- `v2-acceptance-criteria.md`
- `v2-open-questions.md`
- `v1-v15-v2-comparison.md`
- `distsim` / `eventsim` behavior

The simulator remains the design oracle.
The first implementation slice should not contradict it.

@@ -0,0 +1,161 @@

# V2 Open Questions

Date: 2026-03-27

## Purpose

This document records what is still algorithmically open in V2.

These are not bugs.

They are design questions that should be closed deliberately before or during implementation slicing.

## 1. Recovery Session Ownership

Open question:

- what is the exact ownership model for one active recovery session per replica?

Need to decide:

- session identity fields
- supersede vs reject vs join behavior
- how epoch/session invalidates old recovery work

Why it matters:

- V1.5 needed local reconnect serialization
- V2 should make this a protocol rule

## 2. Promotion Threshold Strictness

Open question:

- must a promotion candidate always have `FlushedLSN >= CommittedLSN`, or is there any narrower safe exception?

Current prototype:

- uses committed-prefix sufficiency as the safety gate

Why it matters:

- determines how strict real failover behavior should be

## 3. Recovery Reservation Shape

Open question:

- what exactly is reserved during catch-up?

Need to decide:

- WAL range only?
- payload pins?
- snapshot pin?
- expiry semantics?

Why it matters:

- recoverability must be explicit, not hopeful

## 4. Smart WAL Payload Classes

Open question:

- which payload classes are allowed in V2 first?

Current model has:

- `WALInline`
- `ExtentReferenced`

Need to decide:

- whether the first real implementation includes both
- whether `ExtentReferenced` requires pinned snapshot/versioned extent only

## 5. Smart WAL Garbage Collection Boundary

Open question:

- when can a referenced payload stop being recoverable?

Need to decide:

- GC interaction
- timeout interaction
- recovery session pinning

Why it matters:

- this is the line between catch-up and rebuild

## 6. Exact Orchestrator Scope

Open question:

- how much of the final V2 control logic belongs in:
  - local node state
  - coordinator
  - transport/session manager

Why it matters:

- avoid V1-style scattered state ownership

## 7. First Real Implementation Slice

Open question:

- what is the first production slice of V2?

Candidates:

1. per-replica sender/session ownership
2. explicit recovery-session management
3. catch-up/rebuild decision plumbing

Recommended default:

- per-replica sender/session ownership

## 8. Steady-State Overhead Budget

Open question:

- what overhead is acceptable in the normal healthy case?

Need to decide:

- metadata checks on hot path
- extra state bookkeeping
- what stays off the hot path

Why it matters:

- V2 should be structurally better without becoming needlessly heavy

## 9. Smart WAL First-Phase Goal

Open question:

- is the first Smart WAL goal:
  - lower recovery cost
  - lower steady-state WAL volume
  - or just proof of the historical correctness model?

Recommended answer:

- first prove the correctness model, then optimize

## 10. End Condition For Simulator Work

Open question:

- when do we stop adding simulator depth and start implementation?

Suggested answer:

- once acceptance criteria are satisfied
- and the first implementation slice is clear
- and remaining simulator additions are no longer changing core protocol decisions

@@ -0,0 +1,239 @@

# V2 Prototype Roadmap And Gates

Date: 2026-03-27
Status: active
Purpose: define the remaining prototype roadmap, the validation gates between stages, and the decision point between real V2 engine work and possible V2.5 redesign

## Current Position

V2 design/FSM/simulator work is sufficiently closed for serious prototyping, but not frozen against later `V2.5` adjustments.

Current state:

- design proof: high
- execution proof: medium
- data/recovery proof: low
- prototype end-to-end proof: low

Rough prototype progress:

- `25%` to `35%`

This is an early executable prototype, not an engine-ready prototype.

## Roadmap Goal

Answer this question with prototype evidence:

- can V2 become a real engine path?
- or should it become `V2.5` before real implementation begins?

## Step 1: Execution Authority Closure

Purpose:

- finish the sender / recovery-session authority model so stale work is unambiguously rejected

Scope:

1. ownership-only `AttachSession()` / `SupersedeSession()`
2. execution begins only through execution APIs
3. stale handshake / progress / completion fenced by `sessionID`
4. endpoint bump / epoch bump invalidate execution authority
5. sender-group preserve-or-kill behavior is explicit

Done when:

1. all execution APIs are sender-gated and reject stale `sessionID`
2. session creation is separated from execution start
3. phase ordering is enforced
4. endpoint bump / epoch bump invalidate execution authority correctly
5. mixed add/remove/update reconciliation preserves or kills state exactly as intended
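A minimal sketch of what "sender-gated" means in practice. The method name echoes the execution APIs listed for enginev2; the body is hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

var errStaleSession = errors.New("stale session ID")

type Sender struct {
	activeSessionID uint64
}

// RecordCatchUpProgress is one execution API in the Step 1 style: every
// call is fenced on the sender's current session ID, so results from a
// superseded recovery attempt cannot mutate current sender state.
func (s *Sender) RecordCatchUpProgress(sessionID, lsn uint64) error {
	if sessionID != s.activeSessionID {
		return errStaleSession
	}
	// ... apply progress for the authoritative session only ...
	return nil
}

func main() {
	s := &Sender{activeSessionID: 7}
	fmt.Println(s.RecordCatchUpProgress(7, 100)) // accepted: current session
	fmt.Println(s.RecordCatchUpProgress(6, 100)) // rejected: stale session
}
```

An epoch bump or endpoint bump would simply advance `activeSessionID`, so all in-flight calls from the old attempt fail the same check.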

Main files:

- `sw-block/prototype/enginev2/`
- `sw-block/prototype/distsim/`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`

Key gate:

- old recovery work cannot mutate current sender state at any execution stage

## Step 2: Orchestrated Recovery Prototype

Purpose:

- move from good local sender APIs to an actual prototype recovery flow driven by assignment/update intent

Scope:

1. assignment/update intent creates or supersedes recovery attempts
2. reconnect / reassignment / catch-up / rebuild decision path
3. sender-group becomes the orchestration entry point
4. explicit outcome branching:
   - zero-gap fast completion
   - positive-gap catch-up
   - unrecoverable gap -> `NeedsRebuild`
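The three outcomes can be sketched as one decision function. The LSN-window semantics here are an assumption, a sketch rather than the prototype's actual rule:

```go
package main

import "fmt"

type Outcome int

const (
	ZeroGap      Outcome = iota // fast completion, nothing to replay
	CatchUp                     // positive gap inside the retained window
	NeedsRebuild                // gap not recoverable from retained WAL
)

// decideOutcome branches as Step 2 describes, using the replica's
// durable position, the primary's tail, and the start of the retained
// WAL prefix. The next record the replica needs is replicaFlushed+1.
func decideOutcome(replicaFlushed, primaryTail, retainedFrom uint64) Outcome {
	switch {
	case replicaFlushed >= primaryTail:
		return ZeroGap
	case replicaFlushed+1 >= retainedFrom:
		return CatchUp // the whole missing range is still retained
	default:
		return NeedsRebuild
	}
}

func main() {
	fmt.Println(decideOutcome(100, 100, 50)) // zero-gap
	fmt.Println(decideOutcome(80, 100, 50))  // catch-up
	fmt.Println(decideOutcome(30, 100, 50))  // needs rebuild
}
```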
|||
|
|||
Done when: |
|||
|
|||
1. the prototype expresses a realistic recovery flow from topology/control intent |
|||
2. sender-group drives recovery creation, not only unit helpers |
|||
3. recovery outcomes are explicit and testable |
|||
4. orchestrator responsibility is clear enough to narrow `v2-open-questions.md` item 6 |
|||
|
|||
Key gate: |
|||
|
|||
- recovery control is no longer scattered across helper calls; it has one clear orchestration path |
|||
|
|||
## Step 3: Minimal Historical Data Prototype |
|||
|
|||
Purpose: |
|||
|
|||
- prove the recovery model against real data-history assumptions, not only control logic |
|||
|
|||
Scope: |
|||
|
|||
1. minimal WAL/history model, not full engine |
|||
2. enough to exercise: |
|||
- catch-up range |
|||
- retained prefix/window |
|||
- rebuild fallback |
|||
- historical correctness at target LSN |
|||
3. enough reservation/recoverability state to make recovery explicit |
|||
|
|||
Done when: |
|||
|
|||
1. the prototype can prove why a gap is recoverable or unrecoverable |
|||
2. catch-up and rebuild decisions are backed by minimal data/history state |
|||
3. `v2-open-questions.md` items 3, 4, 5 are closed or sharply narrowed |
|||
4. prototype evidence strengthens acceptance criteria `A5`, `A6`, and `A7` |
|||
|
|||
Key gate: |
|||
|
|||
- the prototype must explain why recovery is allowed, not just that policy says it is |
|||
|
|||
## Step 4: Prototype Scenario Closure |
|||
|
|||
Purpose: |
|||
|
|||
- make the prototype itself demonstrate the V2 story end-to-end |
|||
|
|||
Scope: |
|||
|
|||
1. map key V2 scenarios onto the prototype |
|||
2. express the 4 V2-boundary cases against prototype behavior |
|||
3. add one small end-to-end harness inside `sw-block/prototype/` |
|||
4. align prototype evidence with acceptance criteria |
|||
|
|||
Done when: |
|||
|
|||
1. prototype behavior can be reviewed scenario-by-scenario |
|||
2. key V1/V1.5 failures have prototype equivalents |
|||
3. prototype outcomes match intended V2 design claims |
|||
4. remaining gaps are clearly real-engine gaps, not protocol/prototype ambiguity

Key gate:

- a reviewer can trace:
  - acceptance criteria -> scenario -> prototype behavior
  without hand-waving

## Gates

### Gate 1: Design Closed Enough

Status:

- mostly passed

Meaning:

1. acceptance criteria exist
2. core simulator exists
3. ownership gap from V1.5 is understood

### Gate 2: Execution Authority Closed

Passes after Step 1.

Meaning:

- stale execution results cannot mutate current authority

### Gate 3: Orchestrated Recovery Closed

Passes after Step 2.

Meaning:

- recovery flow is controlled by one coherent orchestration model

### Gate 4: Historical Data Model Closed

Passes after Step 3.

Meaning:

- catch-up vs rebuild is backed by executable data-history logic

### Gate 5: Prototype Convincing

Passes after Step 4.

Meaning:

- enough evidence exists to choose:
  - real V2 engine path
  - or `V2.5` redesign

## Decision Gate After Step 4

### Path A: Real V2 Engine Planning

Choose this if:

1. prototype control logic is coherent
2. recovery boundary is explicit
3. boundary cases are convincing
4. no major structural flaw remains

Outputs:

1. real engine slicing plan
2. migration/integration plan into future standalone `sw-block`
3. explicit non-goals for first production version

### Path B: V2.5 Redesign

Choose this if the prototype reveals:

1. ownership/orchestration still too fragile
2. recovery boundary still too implicit
3. historical correctness model too costly or too unclear
4. too much complexity leaks into the hot path

Output:

- write `V2.5` as a design/prototype correction before engine work

## What Not To Do Yet

1. no Smart WAL expansion beyond what Step 3 minimally needs
2. no backend/storage-engine redesign
3. no V1 production integration
4. no frontend/wire protocol work
5. no performance optimization as a primary goal

## Practical Summary

Current sequence:

1. finish execution authority
2. build orchestrated recovery
3. add minimal historical-data proof
4. close key scenarios against the prototype
5. decide:
   - V2 engine
   - or `V2.5`
# V2 Scenario Sources From V1 and V1.5

Date: 2026-03-27

## Purpose

This document distills V1 / V1.5 real-test material into V2 scenario inputs.

Sources:

- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`

This is not the active scenario backlog.

Use:

- `v2_scenarios.md` for the active V2 scenario set
- this file for historical source and rationale

## How To Use This File

For each item below:

1. keep the real V1/V1.5 test as implementation evidence
2. create or maintain a V2 simulator scenario for the protocol core
3. define the expected V2 behavior explicitly

## Source Buckets

### 1. Core protocol behavior

These are the highest-value simulator inputs.

- barrier durability truth
- reconnect + catch-up
- non-convergent catch-up -> rebuild
- rebuild fallback
- failover / promotion safety
- WAL retention / tail-chasing
- durability mode semantics

Recommended V2 treatment:

- `sim_core`

### 2. Supporting invariants

These matter, but usually as reduced simulator checks.

- canonical address handling
- replica role/epoch gating
- committed-prefix rules
- rebuild publication cleanup
- assignment refresh behavior

Recommended V2 treatment:

- `sim_reduced`

### 3. Real-only implementation behavior

These should usually stay in real-engine tests.

- actual wire encoding / decode bugs
- real disk / `fdatasync` timing
- NVMe / iSCSI frontend behavior
- Go concurrency artifacts tied to concrete implementation

Recommended V2 treatment:

- `real_only`

### 4. V2 boundary items

These are especially important.

They should remain visible as:

- current V1/V1.5 limitation
- explicit V2 acceptance target

Recommended V2 treatment:

- `v2_boundary`

## Distilled Scenario Inputs

### A. Barrier truth uses durable replica progress

Real source:

- Phase 13 barrier / `replicaFlushedLSN` tests

Why it matters:

- commit must follow durable replica progress, not send progress

V2 target:

- barrier completion counted only from explicit durable progress state

### B. Same-address transient outage

Real source:

- Phase 13 reconnect / catch-up tests
- `CP13-8` short outage recovery

Why it matters:

- proves cheap short-gap recovery path

V2 target:

- explicit recoverability check
- catch-up if recoverable
- rebuild otherwise

### C. Changed-address restart

Real source:

- `CP13-8 T4b`
- changed-address refresh fixes

Why it matters:

- endpoint is not identity
- stale endpoint must not remain authoritative

V2 target:

- heartbeat/control-plane learns new endpoint
- reassignment updates sender target
- recovery session starts only after endpoint truth is updated

### D. Non-convergent catch-up / tail-chasing

Real source:

- Phase 13 retention + catch-up + rebuild fallback line

Why it matters:

- “catch-up exists” is not enough
- must know when to stop and rebuild

V2 target:

- explicit `CatchingUp -> NeedsRebuild`
- no fake success

### E. Slow control-plane recovery

Real source:

- `CP13-8 T4b` hardware behavior before fix

Why it matters:

- safety can be correct while availability recovery is poor

V2 target:

- explicit fast recovery path when possible
- explicit fallback when only control-plane repair can help

### F. Stale message / delayed ack fencing

Real source:

- Phase 13 epoch/fencing tests
- V2 scenario work already mirrors this

Why it matters:

- old lineage must not mutate committed prefix

V2 target:

- stale message rejection is explicit and testable

### G. Promotion candidate safety

Real source:

- failover / promotion gating tests
- V2 candidate-selection work

Why it matters:

- wrong promotion loses committed lineage

V2 target:

- candidate must satisfy:
  - running
  - epoch aligned
  - state eligible
  - committed-prefix sufficient

### H. Rebuild boundary after failed catch-up

Real source:

- Phase 13 rebuild fallback behavior

Why it matters:

- rebuild is required when retained WAL cannot safely close the gap

V2 target:

- rebuild is explicit fallback, not ad hoc recovery

## Immediate Feed Into `v2_scenarios.md`

These are the most important V1/V1.5-derived V2 scenarios:

1. same-address transient outage
2. changed-address restart
3. non-convergent catch-up / tail-chasing
4. stale delayed message / barrier ack rejection
5. committed-prefix-safe promotion
6. control-plane-latency recovery shape

## What Should Not Be Copied Blindly

Do not clone every real-engine test into the simulator.

Do not use the simulator for:

- exact OS timing
- exact socket/wire bugs
- exact block frontend behavior
- implementation-specific lock races

Instead:

- extract the protocol invariant
- model the reduced scenario if the protocol value is high

## Bottom Line

V1 / V1.5 tests should feed V2 in two ways:

1. as historical evidence of what failed or mattered in real life
2. as scenario seeds for the V2 simulator and acceptance backlog
# WAL V2 Scenarios

Date: 2026-03-26
Status: working scenario backlog
Purpose: define the scenario set that proves why WAL V2 exists, what it must do better than WAL V1, and what it should handle better than rebuild-heavy systems

Execution note:
- active implementation planning for these scenarios lives under `../.private/phase/`
- `design/` is the design/source-of-truth view
- `.private/phase/` is the execution/checklist view for `sw`

## Why This File Exists

V2 should not grow by adding random simulations.

Each new scenario should prove one of these claims:

1. committed data is never lost
2. uncommitted data is never falsely revived
3. epoch and promotion lineage are safe
4. short-gap recovery is cheaper and cleaner than rebuild
5. catch-up vs rebuild boundary is explicit and correct
6. historical data correctness is preserved

## Scenario Sources

The backlog draws scenarios from three sources:

1. **V1 / V1.5 real failures**
   - real bugs and real-hardware gaps observed during Phase 12 / Phase 13
   - these are the highest-value scenarios because they came from actual system behavior

2. **V2 design obligations**
   - scenarios required by the intended V2 protocol shape
   - examples:
     - reservations
     - lineage-first promotion
     - explicit catch-up vs rebuild boundary

3. **Distributed-systems adversarial cases**
   - scenarios not yet seen in production, but known to be dangerous
   - examples:
     - zombie primary
     - partitions
     - message reordering
     - multi-promotion lineage chains

This file is the shared backlog for anyone extending:

- `sw-block/prototype/fsmv2/`
- `sw-block/prototype/volumefsm/`
- `sw-block/prototype/distsim/`

For active development sequencing, see:
- `sw-block/.private/phase/phase-01.md`
- `sw-block/.private/phase/phase-02.md`
- `sw-block/design/v2-scenario-sources-from-v1.md`

Current simulator note:
- current `distsim` coverage already includes:
  - changed-address restart comparison across `V1` / `V1.5` / `V2`
  - same-address transient outage comparison
  - slow control-plane recovery comparison
  - stale-endpoint rejection
  - committed-prefix-aware promotion eligibility

## V2 Goals

Compared with WAL V1, V2 should improve:

1. state clarity
2. recovery boundary clarity
3. fencing and promotion correctness
4. testability of distributed behavior
5. proof of data correctness at a target `LSN`

Compared with rebuild-heavy systems, V2 should improve:

1. short-gap recovery cost
2. explicit progress semantics
3. catch-up vs rebuild decision quality

## Scenario Format

Each scenario should eventually define:

1. setup
2. event sequence
3. expected commit/ack behavior
4. expected promotion/fencing behavior
5. expected final data state at target `LSN`

Where possible, use synthetic 4K writes with:

- `value = LSN`

That makes correctness assertions trivial.

## Priority 1: Commit Safety

These scenarios prove the most important distributed claim:

- if the system ACKed a write under the configured policy, that write is not lost

### S1. ACK Then Primary Crash

Goal:
- prove a quorum-acknowledged write survives failover

Sequence:
1. primary commits a write
2. replicas durable-ACK enough nodes for policy
3. primary crashes immediately
4. coordinator promotes a valid replica

Expect:
- promoted node contains the committed `LSN`
- final state matches reference model at committed `LSN`

### S2. Non-Quorum Write Then Primary Crash

Goal:
- prove uncommitted data is not revived after failover

Sequence:
1. primary accepts a write locally
2. quorum durability is not reached
3. primary crashes
4. coordinator promotes another node

Expect:
- promoted node does not expose the uncommitted write
- committed `LSN` stays at previous value

### S3. Zombie Old Primary Is Fenced

Goal:
- prove old-epoch traffic cannot corrupt new lineage

Sequence:
1. primary loses lease
2. coordinator bumps epoch and promotes new primary
3. old primary continues trying to send writes / barriers

Expect:
- all old-epoch traffic is rejected
- no stale write becomes committed under the new epoch

## Priority 2: Short-Gap Recovery

These scenarios justify V2 over rebuild-heavy designs.

### S4. Brief Disconnect, WAL Catch-Up Only

Goal:
- prove a short outage recovers via WAL catch-up, not rebuild

Sequence:
1. replica disconnects briefly
2. primary continues writing
3. gap stays inside recoverable window
4. replica reconnects and catches up

Expect:
- `CatchingUp -> PromotionHold -> InSync`
- no rebuild required
- final state matches reference at target `LSN`

### S5. Flapping Replica Stays Recoverable

Goal:
- prove transient disconnects do not force unnecessary rebuild

Sequence:
1. replica disconnects and reconnects repeatedly
2. gaps stay within reserved recoverable windows

Expect:
- replica may move between `Lagging`, `CatchingUp`, and `PromotionHold`
- replica does not enter `NeedsRebuild` unless recoverability is actually lost

### S6. Tail-Chasing Under Load

Goal:
- prove behavior when primary writes faster than catch-up rate

Sequence:
1. replica reconnects behind
2. primary continues writing quickly
3. catch-up target may be reached or may fall behind again

Expect:
- explicit result:
  - converge and promote
  - or abort to `NeedsRebuild`
- never silently pretend the replica is current

## Priority 3: Catch-Up vs Rebuild Boundary

These scenarios justify the V2 recoverability model.

### S7. Recovery Initially Possible, Then Reservation Expires

Goal:
- prove `check -> reserve -> recover` is enforced

Sequence:
1. primary grants a recoverability reservation
2. catch-up starts
3. reservation expires or is revoked before completion

Expect:
- catch-up aborts
- replica transitions to `NeedsRebuild`
- no partial recovery is treated as success

### S8. Current Extent Cannot Recover Old LSN

Goal:
- prove the historical correctness trap

Sequence:
1. write block `B = 10` at `LSN 10`
2. later write block `B = 12` at `LSN 12`
3. attempt to recover state at `LSN 10` from current extent

Expect:
- mismatch detected
- scenario must fail correctness check

### S9. Snapshot + Tail Rebuild Works

Goal:
- prove correct long-gap reconstruction

Sequence:
1. take snapshot at `cpLSN`
2. later writes extend head
3. lagging replica rebuilds from snapshot
4. replay trailing WAL tail

Expect:
- final state matches reference at target `LSN`

## Priority 4: Quorum and Mixed Replica States

These scenarios justify V2 mode clarity.

### S10. Mixed States Under `sync_quorum`

Goal:
- prove `sync_quorum` remains available with mixed replica states

Sequence:
1. one replica `InSync`
2. one replica `CatchingUp`
3. one replica `Rebuilding`

Expect:
- writes may continue if durable quorum exists
- ACK gating follows quorum rules exactly

### S11. Mixed States Under `sync_all`

Goal:
- prove `sync_all` remains strict

Sequence:
1. same mixed-state setup as above

Expect:
- writes/acks block or fail according to `sync_all`
- no silent downgrade to quorum or best effort

### S12. Promotion Chooses Best Valid Lineage

Goal:
- prove promotion is correctness-first, not “highest apparent LSN wins”

Sequence:
1. candidate nodes have different:
   - flushed LSN
   - rebuild state
   - epoch lineage
2. coordinator chooses a new primary

Expect:
- only a valid-lineage node is promotable
- stale or inconsistent node is rejected

## Priority 5: Smart WAL / Recovery Classes

These scenarios justify V2’s future adaptive write path.

### S13. `WALInline` Window Is Recoverable

Goal:
- prove inline WAL payload replay works directly

Sequence:
1. missing range consists of `WALInline` records
2. planner grants reservation

Expect:
- catch-up allowed
- final state correct

### S14. `ExtentReferenced` Payload Still Resolvable

Goal:
- prove direct-extent records can still support catch-up when pinned

Sequence:
1. missing range includes `ExtentReferenced` records
2. payload objects / generations are still resolvable
3. reservation pins those dependencies

Expect:
- catch-up allowed
- final state correct

### S15. `ExtentReferenced` Payload Lost

Goal:
- prove metadata alone is not enough

Sequence:
1. missing range includes `ExtentReferenced` records
2. metadata still exists
3. payload object / version is no longer resolvable

Expect:
- planner returns `NeedsRebuild`
- catch-up is forbidden

## Priority 6: Restart and Rebuild Robustness

These scenarios justify operational resilience.

### S16. Replica Restarts During Catch-Up

Goal:
- prove restart does not corrupt catch-up state

Sequence:
1. replica is catching up
2. replica restarts
3. reconnect and recover again

Expect:
- no false promotion
- resume or restart recovery cleanly

### S17. Replica Restarts During Rebuild

Goal:
- prove rebuild interruption is safe

Sequence:
1. replica is rebuilding from snapshot
2. replica restarts mid-copy

Expect:
- rebuild aborts or restarts safely
- no partial base image is treated as valid

### S18. Primary Restarts Without Failover

Goal:
- prove restart with same lineage is handled explicitly

Sequence:
1. primary stops and restarts
2. coordinator either preserves or changes epoch depending on policy

Expect:
- replicas react consistently
- no stale assumptions about previous sender sessions

### S19. Chain Of Custody Across Multiple Promotions

Goal:
- prove committed data survives more than one failover lineage step

Sequence:
1. primary `A` commits writes
2. fail over to `B`
3. `B` commits additional writes
4. fail over to `C`

Expect:
- `C` contains all writes committed by `A` and `B`
- no committed data disappears across multiple promotions
- final state matches reference model at committed `LSN`

### S20. Network Partition With Concurrent Write Attempts

Goal:
- prove epoch fencing prevents split-brain writes during partition

Sequence:
1. cluster partitions into two live sides
2. old primary side continues trying to write
3. coordinator promotes a new primary on the surviving side
4. both sides attempt to send control/data traffic

Expect:
- only the current-epoch side can advance committed state
- stale-side writes are rejected or ignored
- no conflicting committed lineage appears

## Suggested Implementation Order

Implement in this order:

1. `S1` ACK then primary crash
2. `S2` non-quorum write then primary crash
3. `S3` zombie old primary fenced
4. `S4` brief disconnect with WAL catch-up
5. `S7` reservation expiry aborts catch-up
6. `S10` mixed-state quorum policy
7. `S9` long-lag rebuild from snapshot + tail
8. `S13-S15` Smart WAL recoverability

## Coverage Matrix

Status values:
- `covered`
- `partial`
- `not_started`
- `needs_richer_model`

| Scenario | Package | Test / Artifact | Status | Notes |
|---|---|---|---|---|
| `S1` ACK then primary crash | `distsim` | `TestQuorumCommitSurvivesPrimaryFailover` | `covered` | quorum commit survives failover |
| `S2` non-quorum write then primary crash | `distsim` | `TestUncommittedWriteNotPreservedAfterPrimaryLoss` | `covered` | no false revival |
| `S3` zombie old primary fenced | `distsim` | `TestZombieOldPrimaryWritesAreFenced` | `covered` | stale epoch traffic ignored |
| `S4` brief disconnect, WAL catch-up only | `distsim` | `TestReplicaCatchupFromPrimaryWAL` | `covered` | short-gap recovery |
| `S5` flapping replica stays recoverable | `distsim` | `TestS5_FlappingReplica_NoUnnecessaryRebuild`, `TestS5_FlappingWithStateTracking`, `TestS5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `covered` | both recoverable flapping and explicit budget-exceeded escalation are now asserted |
| `S6` tail-chasing under load | `distsim` | `TestS6_TailChasing_ConvergesOrAborts`, `TestS6_TailChasing_NonConvergent_Aborts`, `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild`, `TestP02_S6_NonConvergent_ExplicitStateTransition` | `covered` | explicit non-convergent `CatchingUp -> NeedsRebuild` path now asserted |
| `S7` reservation expiry aborts catch-up | `fsmv2`, `volumefsm`, `distsim` | `TestFSMReservationLostNeedsRebuild`, `TestModelReservationLostDuringCatchupAfterRebuild`, `TestReservationExpiryAbortsCatchup` | `covered` | present at 3 layers |
| `S8` current extent cannot recover old LSN | `distsim` | `TestCurrentExtentCannotRecoverOldLSN` | `covered` | historical correctness trap |
| `S9` snapshot + tail rebuild works | `distsim` | `TestReplicaRebuildFromSnapshotAndTail`, `TestSnapshotPlusTrailingReplayReachesTargetLSN` | `covered` | long-gap reconstruction |
| `S10` mixed states under `sync_quorum` | `volumefsm`, `distsim` | `TestModelSyncQuorumWithThreeReplicasMixedStates`, `TestSyncQuorumWithMixedReplicaStates` | `covered` | quorum stays available |
| `S11` mixed states under `sync_all` | `distsim` | `TestSyncAllBlocksWithMixedReplicaStates` | `covered` | strict `sync_all` behavior |
| `S12` promotion chooses best valid lineage | `distsim` | `TestPromotionUsesValidLineageNode`, `TestS12_PromotionChoosesBestLineage_NotHighestLSN`, `TestS12_PromotionRejectsRebuildingCandidate` | `covered` | lineage-first promotion now exercised beyond simple LSN comparison |
| `S13` `WALInline` window recoverable | `distsim` | `TestWALInlineRecordsAreRecoverable` | `covered` | inline payload recoverability |
| `S14` `ExtentReferenced` payload resolvable | `distsim` | `TestExtentReferencedResolvableRecordsAreRecoverable`, `TestMixedClassRecovery_FullSuccess` | `covered` | recoverable direct-extent and mixed-class recovery case |
| `S15` `ExtentReferenced` payload lost | `distsim` | `TestExtentReferencedUnresolvableForcesRebuild`, `TestRecoverableThenUnrecoverable`, `TestTimeVaryingAvailability` | `covered` | metadata alone not enough; active recovery can transition from recoverable to unrecoverable |
| `S16` replica restarts during catch-up | `distsim` | `TestReplicaRestartDuringCatchupRestartsSafely` | `covered` | safe recovery restart |
| `S17` replica restarts during rebuild | `distsim` | `TestReplicaRestartDuringRebuildRestartsSafely` | `covered` | rebuild interruption safe |
| `S18` primary restarts without failover | `distsim` | `TestS18_PrimaryRestart_SameLineage`, `TestS18_PrimaryRestart_ReplicasRejectOldEpoch`, `TestS18_PrimaryRestart_DelayedOldAck_DoesNotAdvancePrefix`, `TestS18_PrimaryRestart_InFlightBarrierDropped`, `TestP02_S18_DelayedAck_ExplicitRejection` | `covered` | delayed stale ack rejection and committed-prefix stability are now asserted directly |
| `S19` chain of custody across promotions | `distsim` | `TestS19_ChainOfCustody_MultiplePromotions`, `TestS19_ChainOfCustody_ThreePromotions` | `covered` | multi-promotion lineage continuity covered |
| `S20` live partition with competing writes | `distsim` | `TestS20_LivePartition_StaleWritesNotCommitted`, `TestS20_LivePartition_HealRecovers`, `TestS20_StalePartition_ProtocolRejectsStaleWrites`, `TestP02_S20_StaleTraffic_CommittedPrefixUnchanged` | `covered` | stale-side protocol traffic is explicitly rejected and committed prefix remains unchanged |

## Ownership Notes

When adding a scenario:

1. add or extend the relevant prototype test:
   - `fsmv2`
   - `volumefsm`
   - `distsim`
2. update this file with:
   - status
   - package location
3. keep correctness checks tied to:
   - committed `LSN`
   - reference model state

## Current Coverage Snapshot

Already covered in some form:

- quorum commit survives primary failover
- uncommitted write not preserved after primary loss
- zombie old primary fenced by epoch
- lagging replica catch-up from primary WAL
- reservation expiry aborts catch-up in distributed sim
- `sync_quorum` continues with one lagging replica
- `sync_all` blocks with one lagging replica
- `sync_quorum` with mixed replica states
- `sync_all` with mixed replica states
- rebuild from snapshot + tail
- promotion uses valid lineage node
- flapping recoverable vs budget-exceeded rebuild path
- tail-chasing explicit escalation to rebuild
- restart during catch-up recovers safely
- restart during rebuild recovers safely
- primary restart delayed stale ack rejection
- `WALInline` recoverability
- `ExtentReferenced` resolvable vs unresolvable boundary
- mixed-class Smart WAL recovery and time-varying payload availability
- delayed stale messages and selective drop behavior
- multi-node reservation expiry and rebuild-timeout behavior
- current extent cannot reconstruct old `LSN`

Still important to add:

- explicit coordinator-driven candidate selection among competing valid/invalid lineages
- control-plane latency scenarios derived from `CP13-8 T4b`
- explicit V1 / V1.5 / V2 comparison scenarios for:
  - changed-address restart
  - same-address transient outage
  - slow reassignment recovery

## V1.5 Lessons To Add Or Strengthen

These come directly from WAL V1.5 / Phase 13 behavior and should be treated as high-priority scenario drivers.

### L1. Replica Restart With New Receiver Port

Observed:
- replica VS restarts
- receiver comes back on a new random port
- primary background reconnect retries old address and fails

Implication:
- direct reconnect only works if replica address is stable

Backlog impact:
- strengthen `S18`
- add a restart/address-change sub-scenario under `S20` or a future network/control-plane recovery scenario

### L2. Slow Control-Plane Reassignment Dominates Recovery

Observed:
- sync correctness preserved
- write availability recovery waits for heartbeat/reassignment cycle

Implication:
- "recoverable in theory" is not enough
- recovery latency is part of protocol quality

Backlog impact:
- `S5` is now covered at current simulator level
- strengthen `S18`
- add long-running restart/rejoin timing scenarios

### L3. Background Reconnect Helps Only Same-Address Recovery

Observed:
- background reconnect is useful for transient network failure
- not sufficient for process restart with address change

Implication:
- scenarios must distinguish:
  - transient disconnect
  - process restart
  - address change

Backlog impact:
- keep `S4` as transient disconnect
- strengthen `S18` with restart/address-stability cases

### L4. Tail-Chasing And Retention Pressure Are Structural Risks

Observed:
- Phase 13 reasoning repeatedly exposed:
  - lagging replica may pin WAL
  - catch-up may not converge while primary keeps advancing

Implication:
- V2 must explicitly model convergence, abort, and rebuild boundaries

Backlog impact:
- strengthen `S6`
- add multi-node retention / timeout variants

### L5. Current Extent Is Not Historical State

Observed:
- using current extent to reconstruct old `LSN` can return later values

Implication:
- V2 must require version-correct base images or resolvable historical payloads

Backlog impact:
- already covered by `S8`
- should remain a permanent regression scenario

## Randomized Simulation

In addition to fixed scenarios, V2 should keep a randomized simulator suite.

Purpose:

1. discover paths that were not explicitly written as named scenarios
2. stress promotion, restart, and recovery ordering
3. check invariants after each random step

Current prototype:

- `sw-block/prototype/distsim/random.go`
- `sw-block/prototype/distsim/random_test.go`

Current invariants checked:

1. current committed `LSN` remains a committed prefix
2. promotable nodes match reference state at committed `LSN`
3. current primary, if valid/running, matches reference state at committed `LSN`

This does not replace named scenarios.
It complements them.

## Scenario Summary

When reviewing or adding scenarios, always record the source:

1. from real V1/V1.5 behavior
2. from explicit V2 design obligation
3. from adversarial distributed-systems reasoning

The best scenarios are the ones that come from real failures first, then are generalized into V2 requirements.

## Development Phases

Execution detail is tracked in:
- `sw-block/.private/phase/phase-01.md`
- `sw-block/.private/phase/phase-02.md`

High-level phase order:

1. close explicit scenario backlog
   - `S19`
   - `S20`
2. strengthen missing lifecycle scenarios
   - `S5`
   - `S6`
   - `S18`
   - stronger `S12`
3. extend protocol-state simulation and version comparison
   - `V1`
   - `V1.5`
   - `V2`
   - stronger closure of current `partial` scenarios
4. strengthen random/adversarial simulation
5. add timeout-based scenarios only when the execution path is modeled
# WAL Replication V2 Orchestrator |
|||
|
|||
Date: 2026-03-26 |
|||
Status: design proposal |
|||
Purpose: define the volume-level orchestration model that sits above the per-replica WAL V2 FSM |
|||
|
|||
## Why This Document Exists |
|||
|
|||
`ReplicaFSM` alone is not enough. |
|||
|
|||
It can describe one replica relative to the current primary, but it cannot by itself model: |
|||
|
|||
- primary head continuing to advance |
|||
- multiple replicas in different states |
|||
- durability mode semantics |
|||
- primary lease loss and epoch change |
|||
- primary failover and replica promotion |
|||
- fencing of old recovery sessions |
|||
|
|||
So WAL V2 needs a second layer: |
|||
- per-replica `ReplicaFSM` |
|||
- volume-level `Orchestrator` |
|||
|
|||
## Scope |
|||
|
|||
This document defines the volume-level logic only. |
|||
|
|||
It does not define: |
|||
- exact network protocol |
|||
- exact master RPCs |
|||
- exact storage backend internals |
|||
|
|||
It assumes the per-replica state machine from: |
|||
- `wal-replication-v2-state-machine.md` |
|||
|
|||
## Core Model |
|||
|
|||
The orchestrator owns: |
|||
|
|||
1. current primary lineage |
|||
- `epoch` |
|||
- lease/authority state |
|||
|
|||
2. volume durability mode |
|||
- `best_effort` |
|||
- `sync_all` |
|||
- `sync_quorum` |
|||
|
|||
3. moving primary progress |
|||
- `headLSN` |
|||
- checkpoint/snapshot anchors |
|||
|
|||
4. replica set |
|||
- one `ReplicaFSM` per replica |
|||
- per-replica role in the current volume topology |
|||
|
|||
5. volume-level admission decision |
|||
- can writes proceed? |
|||
- can sync requests complete? |
|||
- must promotion/failover occur? |
|||
|
|||
## Two FSM Layers |
|||
|
|||
### Layer A: `ReplicaFSM` |
|||
|
|||
Owns per-replica state such as: |
|||
- `Bootstrapping` |
|||
- `InSync` |
|||
- `Lagging` |
|||
- `CatchingUp` |
|||
- `PromotionHold` |
|||
- `NeedsRebuild` |
|||
- `Rebuilding` |
|||
- `CatchUpAfterRebuild` |
|||
- `Failed` |
|||
|
|||
### Layer B: `VolumeOrchestrator` |
|||
|
|||
Owns system-wide state such as: |
|||
- current `epoch` |
|||
- current primary identity |
|||
- durability mode |
|||
- set of required replicas |
|||
- current `headLSN` |
|||
- whether writes or promotions are allowed |
|||
|
|||
The orchestrator does not replace `ReplicaFSM`. |
|||
It drives it. |
|||
|
|||
## Volume State |
|||
|
|||
The orchestrator should track at least: |
|||
|
|||
```go
type VolumeMode string

type PrimaryState string

const (
	PrimaryServing  PrimaryState = "Serving"
	PrimaryDraining PrimaryState = "Draining"
	PrimaryLost     PrimaryState = "Lost"
)

type VolumeModel struct {
	Epoch        uint64
	PrimaryID    string
	PrimaryState PrimaryState
	Mode         VolumeMode

	HeadLSN       uint64
	CheckpointLSN uint64

	RequiredReplicaIDs []string
	Replicas           map[string]*ReplicaFSM
}
```
|||
|
|||
This is a model shape, not a required production struct. |
|||
|
|||
## Orchestrator Responsibilities |
|||
|
|||
### 1. Advance primary head |
|||
|
|||
When primary commits a new write: |
|||
- increment `headLSN` |
|||
- enqueue/send to replica sender loops |
|||
- evaluate whether the current mode still allows ACK |
|||
|
|||
### 2. Evaluate sync eligibility |
|||
|
|||
The orchestrator computes volume-level durability from replica states. |
|||
|
|||
Derived rule: |
|||
- only `ReplicaFSM.IsSyncEligible()` counts |
|||
|
|||
### 3. Drive recovery entry |
|||
|
|||
When a replica disconnects or falls behind: |
|||
- feed disconnect/lag events into that replica FSM |
|||
- decide whether to try catch-up or rebuild |
|||
- acquire recovery reservation if required |
|||
|
|||
### 4. Handle primary authority changes |
|||
|
|||
When lease is lost or a new primary is chosen: |
|||
- increment epoch |
|||
- abort stale recovery sessions |
|||
- reevaluate all replica relationships from the new primary's perspective |
|||
|
|||
### 5. Drive promotion / failover |
|||
|
|||
When current primary is lost: |
|||
- choose promotion candidate |
|||
- assign new epoch |
|||
- move old primary to stale/lost |
|||
- convert the promoted replica into the new serving primary |
|||
- reclassify remaining replicas relative to the new primary |
|||
|
|||
## Required Volume-Level Events |
|||
|
|||
The orchestrator should be able to simulate at least these events. |
|||
|
|||
### Write/progress events |
|||
- `WriteCommitted(lsn)` |
|||
- `CheckpointAdvanced(lsn)` |
|||
- `BarrierCompleted(replicaID, flushedLSN)` |
|||
|
|||
### Replica health events |
|||
- `ReplicaDisconnected(replicaID)` |
|||
- `ReplicaReconnect(replicaID, flushedLSN)` |
|||
- `ReplicaReservationLost(replicaID)` |
|||
- `ReplicaCatchupTimeout(replicaID)` |
|||
- `ReplicaRebuildTooSlow(replicaID)` |
|||
|
|||
### Topology/control events |
|||
- `PrimaryLeaseLost()` |
|||
- `EpochChanged(newEpoch)` |
|||
- `PromoteReplica(replicaID)` |
|||
- `ReplicaAssigned(replicaID)` |
|||
- `ReplicaRemoved(replicaID)` |
|||
|
|||
## Mode Semantics |
|||
|
|||
### `best_effort` |
|||
|
|||
Rules: |
|||
- ACK after primary local durability |
|||
- replicas may be `Lagging`, `CatchingUp`, `NeedsRebuild`, or `Rebuilding` |
|||
- background recovery continues |
|||
|
|||
Volume implication: |
|||
- primary can keep serving while replicas recover |
|||
|
|||
### `sync_all` |
|||
|
|||
Rules: |
|||
- ACK only when all required replicas are `InSync` and durable through target LSN |
|||
- bounded retry only |
|||
- no silent downgrade |
|||
|
|||
Volume implication: |
|||
- one lagging required replica can block sync completion |
|||
- the orchestrator may fail requests, but must not silently reinterpret the policy
|||
|
|||
### `sync_quorum` |
|||
|
|||
Rules: |
|||
- ACK when quorum of required nodes are durable through target LSN |
|||
- lagging replicas may recover in background as long as quorum remains |
|||
|
|||
Volume implication: |
|||
- orchestrator must count eligible replicas, not just healthy sockets |
|||
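The mode rules above can be reduced to one counting function. The sketch below is illustrative, not the production API: `canAckSync`, the `quorum` parameter, and the three-state enum are assumptions chosen to show the rule that only `InSync` replicas count, never merely healthy sockets.

```go
package main

import "fmt"

type ReplicaState string

const (
	InSync     ReplicaState = "InSync"
	Lagging    ReplicaState = "Lagging"
	CatchingUp ReplicaState = "CatchingUp"
)

type VolumeMode string

const (
	BestEffort VolumeMode = "best_effort"
	SyncAll    VolumeMode = "sync_all"
	SyncQuorum VolumeMode = "sync_quorum"
)

// canAckSync reports whether a sync write may complete, given the states of
// the required replicas. Only InSync replicas count; a replica that is
// connected but recovering does not.
func canAckSync(mode VolumeMode, required []ReplicaState, quorum int) bool {
	eligible := 0
	for _, s := range required {
		if s == InSync {
			eligible++
		}
	}
	switch mode {
	case BestEffort:
		return true // primary local durability is enough
	case SyncAll:
		return eligible == len(required)
	case SyncQuorum:
		return eligible >= quorum
	default:
		return false
	}
}

func main() {
	states := []ReplicaState{InSync, CatchingUp, InSync}
	fmt.Println(canAckSync(SyncAll, states, 2))    // false: one replica recovering
	fmt.Println(canAckSync(SyncQuorum, states, 2)) // true: quorum of 2 InSync
}
```

Note that `best_effort` ignores replica state entirely for ACK purposes, which is exactly why recovery in that mode stays in the background.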
|
|||
## Primary-Head Simulation Rules |
|||
|
|||
The orchestrator must explicitly model that the primary keeps moving. |
|||
|
|||
### Rule 1: head moves independently of replica recovery |
|||
|
|||
A replica entering `CatchingUp` does not freeze `headLSN`. |
|||
|
|||
### Rule 2: each recovery attempt uses explicit targets |
|||
|
|||
For a replica in recovery, orchestrator chooses: |
|||
- `catchupTargetLSN = H0` |
|||
- or `snapshotCpLSN = C` and replay target `H0` |
|||
|
|||
### Rule 3: promotion is explicit |
|||
|
|||
A replica is not restored to `InSync` just because it reaches `H0`. |
|||
|
|||
It must still pass: |
|||
- barrier confirmation |
|||
- `PromotionHold` |
|||
|
|||
## Failover / Promotion Model |
|||
|
|||
The orchestrator must be able to simulate: |
|||
|
|||
1. old primary loses lease |
|||
2. old primary is fenced by epoch change |
|||
3. one replica is promoted |
|||
4. promoted replica becomes new primary under a higher epoch |
|||
5. all old recovery sessions from the old primary are invalidated |
|||
6. remaining replicas are reevaluated relative to the new primary's head and retained history |
|||
|
|||
Important consequence: |
|||
- failover is not a `ReplicaFSM` transition only |
|||
- it is a volume-level re-rooting of all replica relationships |
|||
|
|||
## Suggested Promotion Rules |
|||
|
|||
Promotion candidate should prefer: |
|||
1. highest valid durable progress |
|||
2. current epoch-consistent history |
|||
3. healthiest replica among tied candidates |
|||
|
|||
After promotion: |
|||
- `PrimaryID` changes |
|||
- `Epoch` increments |
|||
- all replica reservations from the previous primary are void |
|||
- all non-primary replicas must renegotiate recovery against the new primary |
|||
|
|||
## Multi-Replica Examples |
|||
|
|||
### Example 1: `sync_all` |
|||
|
|||
- replica A = `InSync` |
|||
- replica B = `Lagging` |
|||
- replica C = `InSync` |
|||
|
|||
If A and B are required replicas in RF=3 `sync_all`:
- writes needing sync durability fail or wait on B
- even though the other replicas are healthy
|||
|
|||
### Example 2: `sync_quorum` |
|||
|
|||
- replica A = `InSync` |
|||
- replica B = `CatchingUp` |
|||
- replica C = `InSync` |
|||
|
|||
If quorum is 2: |
|||
- volume can continue serving sync requests |
|||
- B recovers in background |
|||
|
|||
### Example 3: failover |
|||
|
|||
- old primary lost |
|||
- replica A promoted |
|||
- replica B was previously `CatchingUp` under old epoch |
|||
|
|||
After promotion: |
|||
- B's old session is aborted |
|||
- B re-enters evaluation against A's history |
|||
|
|||
## What The Tiny Prototype Should Simulate |
|||
|
|||
The V2 prototype should be able to drive at least these scenarios: |
|||
|
|||
1. steady state keep-up |
|||
- primary head advances |
|||
- all required replicas remain `InSync` |
|||
|
|||
2. short outage |
|||
- one replica disconnects |
|||
- primary keeps writing |
|||
- reconnect succeeds within recoverable window |
|||
- replica returns via `PromotionHold` |
|||
|
|||
3. long outage |
|||
- one replica disconnects too long |
|||
- recoverability expires |
|||
- replica goes `NeedsRebuild` |
|||
- rebuild and trailing replay complete |
|||
|
|||
4. tail chasing |
|||
- replica catch-up speed is below primary ingest speed |
|||
- orchestrator chooses fail, throttle, or rebuild path depending on mode |
|||
|
|||
5. failover |
|||
- primary lease lost |
|||
- new epoch assigned |
|||
- replica promoted |
|||
- old recovery sessions fenced |
|||
|
|||
6. mixed-state quorum |
|||
- different replicas in different states |
|||
- orchestrator computes correct `sync_all` / `sync_quorum` result |
|||
|
|||
## Relationship To WAL V1 |
|||
|
|||
WAL V1 already contains pieces of this logic, but they are scattered across: |
|||
- shipper state |
|||
- barrier code |
|||
- retention code |
|||
- assignment/promotion code |
|||
- rebuild code |
|||
- heartbeat/master logic |
|||
|
|||
V2 should separate these into: |
|||
- per-replica recovery FSM |
|||
- volume-level orchestrator |
|||
|
|||
## Bottom Line |
|||
|
|||
The next step after `ReplicaFSM` is not `Smart WAL`. |
|||
|
|||
The next step is the volume-level orchestrator model. |
|||
|
|||
Why: |
|||
- primary keeps moving |
|||
- durability mode is volume-scoped |
|||
- failover/promotion is volume-scoped |
|||
- replica recovery must be evaluated in the context of the whole volume |
|||
|
|||
So V2 needs: |
|||
- `ReplicaFSM` for one replica |
|||
- `VolumeOrchestrator` for the moving multi-replica system |
|||
|||
# WAL Replication V2 State Machine |
|||
|
|||
Date: 2026-03-26 |
|||
Status: design proposal |
|||
Purpose: define the V2 replication state machine for a moving-head primary where replicas may transition between keep-up, catch-up, and reconstruction while the primary continues accepting writes |
|||
|
|||
## Why This Document Exists |
|||
|
|||
The hard part of V2 is not the existence of three modes: |
|||
|
|||
- keep-up |
|||
- catch-up |
|||
- reconstruction |
|||
|
|||
The hard part is that the primary head continues advancing while replicas move between those modes. |
|||
|
|||
So V2 must be specified as a real state machine: |
|||
|
|||
- state definitions |
|||
- state-owned LSN anchors |
|||
- allowed transitions |
|||
- retention obligations |
|||
- abort rules |
|||
|
|||
This document treats edge cases as state-transition cases. |
|||
|
|||
## Scope |
|||
|
|||
This is a protocol/state-machine design. |
|||
|
|||
It does not yet define: |
|||
- exact RPC payloads |
|||
- exact snapshot storage format |
|||
- exact implementation package boundaries |
|||
|
|||
Those can follow after the state model is stable. |
|||
|
|||
## Core Terms |
|||
|
|||
### `headLSN` |
|||
|
|||
The primary's current highest WAL LSN. |
|||
|
|||
### `replicaFlushedLSN` |
|||
|
|||
The highest LSN durably persisted on the replica. |
|||
|
|||
### `cpLSN` |
|||
|
|||
A checkpoint/snapshot base point. A snapshot at `cpLSN` represents the block state exactly at that LSN. |
|||
|
|||
### `promotionBarrierLSN` |
|||
|
|||
The LSN a replica must durably reach before it can re-enter `InSync`. |
|||
|
|||
### `Recovery Feasibility` |
|||
|
|||
Whether `(startLSN, endLSN]` can be reconstructed completely, in order, under the current epoch. |
|||
|
|||
This is not a static fact. It changes over time as WAL is reclaimed, payload generations are garbage-collected, or snapshots are released. |
|||
|
|||
### `Recovery Reservation` |
|||
|
|||
A bounded primary-side reservation proving a recovery window is recoverable and pinning all dependencies needed to finish the current catch-up or rebuild-tail replay. |
|||
|
|||
A transition into recovery is valid only after the reservation is granted. |
|||
|
|||
## State Set |
|||
|
|||
Replica may be in one of these states: |
|||
|
|||
1. `Bootstrapping` |
|||
2. `InSync` |
|||
3. `Lagging` |
|||
4. `CatchingUp` |
|||
5. `PromotionHold` |
|||
6. `NeedsRebuild` |
|||
7. `Rebuilding` |
|||
8. `CatchUpAfterRebuild` |
|||
9. `Failed` |
|||
|
|||
Only `InSync` replicas count for sync durability. |
|||
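Since every mode derives from the same eligibility predicate, the state set can be sketched as an enum with a single method. This is a minimal sketch; the enum values mirror the list above, and `IsSyncEligible` matches the method name referenced in the orchestrator document.

```go
package main

import "fmt"

// ReplicaState enumerates the nine V2 replica states.
type ReplicaState int

const (
	Bootstrapping ReplicaState = iota
	InSync
	Lagging
	CatchingUp
	PromotionHold
	NeedsRebuild
	Rebuilding
	CatchUpAfterRebuild
	Failed
)

// IsSyncEligible is the single predicate that all durability modes derive
// from: exactly one state counts toward sync durability.
func (s ReplicaState) IsSyncEligible() bool {
	return s == InSync
}

func main() {
	fmt.Println(InSync.IsSyncEligible())        // true
	fmt.Println(PromotionHold.IsSyncEligible()) // false
}
```

Keeping the predicate this narrow is the point: a replica in `PromotionHold` has already reached its barrier LSN, yet still does not count until promotion completes.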
|
|||
## State Semantics |
|||
|
|||
### 1. `Bootstrapping` |
|||
|
|||
Replica has not yet earned sync eligibility and does not yet have trusted reconnect progress. |
|||
|
|||
Properties: |
|||
- fresh replica identity or newly assigned replica |
|||
- may receive initial baseline/live stream |
|||
- not yet eligible for `sync_all` |
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: background/bootstrap only |
|||
|
|||
Owned anchors: |
|||
- current assignment epoch |
|||
|
|||
### 2. `InSync` |
|||
|
|||
Replica is eligible for sync durability. |
|||
|
|||
Properties: |
|||
- receiving live ordered stream |
|||
- `replicaFlushedLSN` is near the primary head |
|||
- normal barrier protocol is valid |
|||
|
|||
Counts for: |
|||
- `sync_all`: yes |
|||
- `sync_quorum`: yes |
|||
- `best_effort`: yes, but not required for ACK |
|||
|
|||
Owned anchors: |
|||
- `replicaFlushedLSN` |
|||
|
|||
### 3. `Lagging` |
|||
|
|||
Replica has fallen out of the normal live-stream envelope but recovery path is not yet chosen. |
|||
|
|||
Properties: |
|||
- primary no longer treats it as sync-eligible |
|||
- replica may still be recoverable from WAL or extent-backed recovery records |
|||
- or may require rebuild |
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: background recovery only |
|||
|
|||
Owned anchors: |
|||
- last known `replicaFlushedLSN` |
|||
|
|||
### 4. `CatchingUp` |
|||
|
|||
Replica is replaying from its own durable point toward a chosen target. |
|||
|
|||
Properties: |
|||
- short-gap recovery mode |
|||
- primary must reserve and pin the required recovery window |
|||
- primary head continues to move |
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: background recovery only |
|||
|
|||
Owned anchors: |
|||
- `catchupStartLSN = replicaFlushedLSN` |
|||
- `catchupTargetLSN` |
|||
- `promotionBarrierLSN` |
|||
- `recoveryReservationID` |
|||
- `reservationExpiry` |
|||
|
|||
### 5. `PromotionHold` |
|||
|
|||
Replica has reached the chosen promotion point but must demonstrate short stability before re-entering `InSync`. |
|||
|
|||
Properties: |
|||
- prevents immediate flapping back into sync eligibility |
|||
- replica has already reached `promotionBarrierLSN` |
|||
- promotion requires stable barriers or elapsed hold time |
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: stabilization only |
|||
|
|||
Owned anchors: |
|||
- `promotionBarrierLSN` |
|||
- `promotionHoldUntil` or equivalent hold criterion |
|||
|
|||
### 6. `NeedsRebuild` |
|||
|
|||
Replica cannot recover from retained recovery records alone. |
|||
|
|||
Properties: |
|||
- catch-up window is insufficient or no longer provable |
|||
- replica must not count toward sync durability |
|||
- replica no longer pins old catch-up history |
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: background repair candidate only |
|||
|
|||
Owned anchors: |
|||
- last known `replicaFlushedLSN` |
|||
|
|||
### 7. `Rebuilding` |
|||
|
|||
Replica is fetching and installing a checkpoint/snapshot base image. |
|||
|
|||
Properties: |
|||
- primary must preserve the chosen snapshot/base |
|||
- primary must preserve the required WAL or recovery tail after `cpLSN` |
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: background rebuild only |
|||
|
|||
Owned anchors: |
|||
- `snapshotID` |
|||
- `snapshotCpLSN` |
|||
- `tailReplayStartLSN = snapshotCpLSN + 1` |
|||
- `recoveryReservationID` |
|||
- `reservationExpiry` |
|||
|
|||
### 8. `CatchUpAfterRebuild` |
|||
|
|||
Replica has installed the base image and is replaying trailing history after it. |
|||
|
|||
Properties: |
|||
- semantically similar to `CatchingUp` |
|||
- the base point is the checkpoint/snapshot, not the replica's own prior durable state
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: background recovery only |
|||
|
|||
Owned anchors: |
|||
- `snapshotCpLSN` |
|||
- `catchupTargetLSN` |
|||
- `promotionBarrierLSN` |
|||
- `recoveryReservationID` |
|||
- `reservationExpiry` |
|||
|
|||
### 9. `Failed` |
|||
|
|||
Replica recovery failed in a way that needs operator/control-plane action beyond normal retry. |
|||
|
|||
Properties: |
|||
- terminal or semi-terminal fault state |
|||
- may require delete/recreate/manual intervention |
|||
|
|||
Counts for: |
|||
- `sync_all`: no |
|||
- `sync_quorum`: no |
|||
- `best_effort`: no direct role |
|||
|
|||
## Transition Rules |
|||
|
|||
### `Bootstrapping -> InSync` |
|||
|
|||
Trigger: |
|||
- initial bootstrap completes |
|||
- barrier confirms durable progress under the current epoch |
|||
|
|||
Action: |
|||
- establish trusted `replicaFlushedLSN` |
|||
- grant sync eligibility for the first time |
|||
|
|||
### `InSync -> Lagging` |
|||
|
|||
Trigger: |
|||
- disconnect |
|||
- barrier timeout |
|||
- barrier fsync failure |
|||
- stream error |
|||
|
|||
Action: |
|||
- remove sync eligibility immediately |
|||
|
|||
### `Lagging -> CatchingUp` |
|||
|
|||
Trigger: |
|||
- reconnect succeeds |
|||
- primary grants a recovery reservation proving `(replicaFlushedLSN, catchupTargetLSN]` is recoverable for a bounded window |
|||
|
|||
Action: |
|||
- choose `catchupTargetLSN` |
|||
- pin required recovery dependencies for the reservation lifetime |
|||
|
|||
### `Lagging -> NeedsRebuild` |
|||
|
|||
Trigger: |
|||
- required recovery window is not recoverable |
|||
- impossible progress reported |
|||
- epoch mismatch invalidates direct catch-up |
|||
- background janitor determines the replica is outside recoverable budget |
|||
|
|||
Action: |
|||
- stop treating replica as a catch-up candidate |
|||
|
|||
### `CatchingUp -> PromotionHold` |
|||
|
|||
Trigger: |
|||
- replica replays to `catchupTargetLSN` |
|||
- barrier confirms `promotionBarrierLSN` |
|||
|
|||
Action: |
|||
- start promotion debounce window |
|||
|
|||
### `PromotionHold -> InSync` |
|||
|
|||
Trigger: |
|||
- promotion hold criteria satisfied |
|||
- stable barrier successes |
|||
- or elapsed hold time |
|||
|
|||
Action: |
|||
- restore sync eligibility |
|||
- clear promotion anchors |
|||
|
|||
### `PromotionHold -> Lagging` |
|||
|
|||
Trigger: |
|||
- disconnect |
|||
- failed barrier |
|||
- failed live stream health check |
|||
|
|||
Action: |
|||
- cancel promotion attempt |
|||
- remove sync eligibility |
|||
|
|||
### `CatchingUp -> NeedsRebuild` |
|||
|
|||
Trigger: |
|||
- catch-up cannot converge |
|||
- recovery reservation is lost |
|||
- catch-up timeout policy exceeded |
|||
- epoch changes |
|||
|
|||
Action: |
|||
- abandon WAL-only catch-up |
|||
- move to reconstruction path |
|||
|
|||
### `NeedsRebuild -> Rebuilding` |
|||
|
|||
Trigger: |
|||
- control plane or primary chooses reconstruction base |
|||
- snapshot/base image transfer starts |
|||
- primary grants a rebuild reservation |
|||
|
|||
Action: |
|||
- bind replica to `snapshotID` and `snapshotCpLSN` |
|||
|
|||
### `Rebuilding -> CatchUpAfterRebuild` |
|||
|
|||
Trigger: |
|||
- snapshot/base image installed successfully |
|||
- trailing recovery reservation is still valid |
|||
|
|||
Action: |
|||
- replay trailing history after `snapshotCpLSN` |
|||
|
|||
### `Rebuilding -> NeedsRebuild` |
|||
|
|||
Trigger: |
|||
- rebuild copy fails |
|||
- rebuild reservation is lost |
|||
- rebuild WAL-tail budget is exceeded |
|||
- epoch changes |
|||
|
|||
Action: |
|||
- abort current rebuild session |
|||
- remain excluded from sync durability |
|||
|
|||
### `CatchUpAfterRebuild -> PromotionHold` |
|||
|
|||
Trigger: |
|||
- trailing replay reaches target |
|||
- barrier confirms durable replay through `promotionBarrierLSN` |
|||
|
|||
Action: |
|||
- start promotion debounce |
|||
|
|||
### `CatchUpAfterRebuild -> NeedsRebuild` |
|||
|
|||
Trigger: |
|||
- reservation is lost |
|||
- replay cannot converge |
|||
- epoch changes |
|||
|
|||
Action: |
|||
- abandon current attempt |
|||
- require a fresh rebuild plan |
|||
|
|||
### Any state -> `Failed` |
|||
|
|||
Trigger examples: |
|||
- unrecoverable protocol inconsistency |
|||
- repeated rebuild failure beyond retry policy |
|||
- snapshot corruption |
|||
- local replica storage failure |
|||
|
|||
## Retention Obligations By State |
|||
|
|||
The key V2 rule is: |
|||
|
|||
- recoverability is not a static fact |
|||
- it is a bounded promise the primary must honor once it admits a replica into recovery |
|||
|
|||
### `InSync` |
|||
|
|||
Primary must retain: |
|||
- recent WAL under normal retention policy |
|||
|
|||
Primary does not need: |
|||
- snapshot pin purely for this replica |
|||
|
|||
### `Lagging` |
|||
|
|||
Primary must retain: |
|||
- enough recent information to evaluate recoverability or intentionally declare `NeedsRebuild` |
|||
|
|||
This state should be short-lived. |
|||
|
|||
### `CatchingUp` |
|||
|
|||
Primary must retain for the reservation lifetime: |
|||
- recovery metadata for `(catchupStartLSN, promotionBarrierLSN]` |
|||
- every payload referenced by that recovery window |
|||
- current epoch lineage for the session |
|||
|
|||
### `PromotionHold` |
|||
|
|||
Primary must retain: |
|||
- whatever live-stream and barrier state is required to validate promotion |
|||
|
|||
This state should be brief and must not pin long-lived history. |
|||
|
|||
### `NeedsRebuild` |
|||
|
|||
Primary retains: |
|||
- no special old recovery window for this replica |
|||
|
|||
This state explicitly releases the old catch-up hold. |
|||
|
|||
### `Rebuilding` |
|||
|
|||
Primary must retain for the reservation lifetime: |
|||
- chosen `snapshotID` |
|||
- any base-image dependencies |
|||
- trailing history after `snapshotCpLSN` |
|||
|
|||
### `CatchUpAfterRebuild` |
|||
|
|||
Primary must retain for the reservation lifetime: |
|||
- recovery metadata for `(snapshotCpLSN, promotionBarrierLSN]` |
|||
- every payload referenced by that trailing window |
|||
|
|||
## Moving-Head Rules |
|||
|
|||
The primary head continues advancing during: |
|||
- `CatchingUp` |
|||
- `Rebuilding` |
|||
- `CatchUpAfterRebuild` |
|||
|
|||
Therefore, transitions must never use the head's position at finish time as an implicit target.
|||
|
|||
Instead, each transition must select explicit targets. |
|||
|
|||
### Catch-up target |
|||
|
|||
When catch-up starts, choose: |
|||
- `catchupTargetLSN = H0` |
|||
|
|||
Replica first chases to `H0`, not to an infinite moving head. |
|||
|
|||
Then: |
|||
- either enter `PromotionHold` and promote |
|||
- or begin another bounded cycle |
|||
- or abort to rebuild |
|||
|
|||
### Rebuild target |
|||
|
|||
When rebuild starts, choose: |
|||
- `snapshotCpLSN = C` |
|||
- trailing replay target `H0` |
|||
|
|||
Replica installs the snapshot at `C`, then replays `(C, H0]`, then enters `PromotionHold`. |
|||
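The two target rules can be sketched as a small plan object that captures `H0` once at admission time. The names (`CatchUpPlan`, `lagBudget`) are illustrative assumptions, not part of the protocol; the point is that the target is fixed when recovery starts, and a moving head triggers another bounded cycle rather than an unbounded chase.

```go
package main

import "fmt"

// CatchUpPlan pins the explicit targets chosen when recovery starts; the
// head keeps moving, so the plan never references "current head at finish".
type CatchUpPlan struct {
	StartLSN  uint64 // replicaFlushedLSN at admission time
	TargetLSN uint64 // H0: head captured when catch-up was admitted
}

// planCatchUp captures H0 exactly once, at admission.
func planCatchUp(replicaFlushedLSN, headLSN uint64) CatchUpPlan {
	return CatchUpPlan{StartLSN: replicaFlushedLSN, TargetLSN: headLSN}
}

// needsAnotherCycle reports whether the replica, having reached the plan's
// target, must run one more bounded cycle before entering PromotionHold.
func needsAnotherCycle(p CatchUpPlan, currentHeadLSN, lagBudget uint64) bool {
	return currentHeadLSN > p.TargetLSN+lagBudget
}

func main() {
	p := planCatchUp(500, 800) // H0 = 800, fixed at admission
	fmt.Println(needsAnotherCycle(p, 805, 10)) // false: within budget, promote
	fmt.Println(needsAnotherCycle(p, 900, 10)) // true: run another bounded cycle
}
```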
|
|||
## Tail-Chasing Rule |
|||
|
|||
Replica may fail to converge if: |
|||
- catch-up speed < primary ingest speed |
|||
|
|||
V2 must define bounded behavior: |
|||
|
|||
1. bounded catch-up window |
|||
2. bounded catch-up time |
|||
3. policy after failure to converge: |
|||
- for `sync_all`: bounded retry, then fail requests |
|||
- for `best_effort`: keep serving and continue background recovery or escalate to rebuild |
|||
|
|||
No silent downgrade of `sync_all` is allowed. |
|||
|
|||
## Recovery Feasibility |
|||
|
|||
The primary must not admit a replica into catch-up based on a best-effort guess. |
|||
|
|||
It must prove the requested recovery window is recoverable and then reserve it. |
|||
|
|||
Recommended abstraction: |
|||
|
|||
- `CheckRecoveryFeasibility(startLSN, endLSN) -> fully recoverable | needs rebuild` |
|||
- `ReserveRecoveryWindow(startLSN, endLSN) -> reservation` |
|||
|
|||
Only a successful reservation may drive: |
|||
- `Lagging -> CatchingUp` |
|||
- `NeedsRebuild -> Rebuilding` |
|||
- `Rebuilding -> CatchUpAfterRebuild` |
|||
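The feasibility/reservation pair might take a shape like the following. The interface names mirror the abstraction above; `walPlanner`, `admitCatchUp`, and the retention-cutoff logic are hypothetical stand-ins for whatever the real primary tracks.

```go
package main

import "fmt"

// Reservation pins the dependencies of a recovery window until released.
type Reservation struct {
	ID       uint64
	StartLSN uint64
	EndLSN   uint64
}

// RecoveryPlanner is a sketch of the two recommended calls.
type RecoveryPlanner interface {
	CheckRecoveryFeasibility(startLSN, endLSN uint64) bool
	ReserveRecoveryWindow(startLSN, endLSN uint64) (Reservation, bool)
}

// admitCatchUp shows the admission rule: Lagging -> CatchingUp is driven
// only by a granted reservation, never by a feasibility guess.
func admitCatchUp(p RecoveryPlanner, flushedLSN, targetLSN uint64) (Reservation, bool) {
	if !p.CheckRecoveryFeasibility(flushedLSN, targetLSN) {
		return Reservation{}, false // caller routes the replica to NeedsRebuild
	}
	return p.ReserveRecoveryWindow(flushedLSN, targetLSN)
}

// walPlanner is a toy implementation: a window is recoverable only if its
// first needed record has not been reclaimed.
type walPlanner struct {
	oldestRetainedLSN uint64
	nextResID         uint64
}

func (w *walPlanner) CheckRecoveryFeasibility(start, end uint64) bool {
	return start+1 >= w.oldestRetainedLSN && start < end
}

func (w *walPlanner) ReserveRecoveryWindow(start, end uint64) (Reservation, bool) {
	if !w.CheckRecoveryFeasibility(start, end) {
		return Reservation{}, false
	}
	w.nextResID++
	return Reservation{ID: w.nextResID, StartLSN: start, EndLSN: end}, true
}

func main() {
	p := &walPlanner{oldestRetainedLSN: 100}
	_, ok := admitCatchUp(p, 150, 400)
	fmt.Println(ok) // true: window still retained
	_, ok = admitCatchUp(p, 50, 400)
	fmt.Println(ok) // false: history reclaimed, replica needs rebuild
}
```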
|
|||
## Recovery Classes |
|||
|
|||
V2 must support more than one local record type without leaking that detail into replica state. |
|||
|
|||
### `WALInline` |
|||
|
|||
Properties: |
|||
- payload lives directly in WAL |
|||
- recoverable while WAL is retained |
|||
|
|||
### `ExtentReferenced` |
|||
|
|||
Properties: |
|||
- recovery metadata points at payload outside WAL |
|||
- payload must be resolved from extent/snapshot generation state |
|||
|
|||
The FSM does not care how payload is stored. |
|||
|
|||
It only cares whether the requested window is fully recoverable for the lifetime of the reservation. |
|||
|
|||
The engine-level rule is: |
|||
|
|||
- every record in `(startLSN, endLSN]` must be payload-resolvable |
|||
- the resolved version must correspond to that record's historical state |
|||
- the payload must stay pinned until the reservation ends |
|||
|
|||
If any required payload is not resolvable: |
|||
- the window is not recoverable |
|||
- the replica must go to `NeedsRebuild` |
|||
|
|||
## Snapshot Rule |
|||
|
|||
Rebuild must use a real checkpoint/snapshot base image. |
|||
|
|||
Valid: |
|||
- immutable snapshot at `cpLSN` |
|||
- copy-on-write checkpoint image |
|||
- frozen base image with exact `cpLSN` |
|||
|
|||
Invalid: |
|||
- current extent treated as historical `cpLSN` |
|||
|
|||
## Epoch / Fencing Rule |
|||
|
|||
Every transition is epoch-bound. |
|||
|
|||
If epoch changes during: |
|||
- `Bootstrapping` |
|||
- `Lagging` |
|||
- `CatchingUp` |
|||
- `PromotionHold` |
|||
- `Rebuilding` |
|||
- `CatchUpAfterRebuild` |
|||
|
|||
Then: |
|||
- abort current transition |
|||
- discard old sender assumptions |
|||
- restart negotiation under the new epoch |
|||
|
|||
This prevents stale-primary recovery traffic from being accepted. |
|||
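A minimal sketch of epoch fencing on a recovery session, assuming a session object that carries the epoch it was negotiated under (the type and method names here are hypothetical): any progress report tagged with a different epoch aborts the session instead of being accepted.

```go
package main

import (
	"errors"
	"fmt"
)

// RecoverySession is an epoch-bound recovery attempt.
type RecoverySession struct {
	Epoch  uint64
	Active bool
}

var ErrStaleEpoch = errors.New("epoch mismatch: abort session and renegotiate")

// Observe applies a progress report tagged with the epoch it was produced
// under. An older epoch means a stale sender; a newer epoch means a new
// primary took over. Either way the session is fenced, never patched.
func (s *RecoverySession) Observe(reportEpoch uint64) error {
	if reportEpoch != s.Epoch {
		s.Active = false
		return ErrStaleEpoch
	}
	return nil
}

func main() {
	s := &RecoverySession{Epoch: 7, Active: true}
	fmt.Println(s.Observe(7)) // <nil>: same epoch, accepted
	fmt.Println(s.Observe(6)) // error: stale report, session fenced
}
```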
|
|||
## Multi-Replica Volume Rules |
|||
|
|||
Different replicas may be in different states simultaneously. |
|||
|
|||
Example: |
|||
- replica A = `InSync` |
|||
- replica B = `CatchingUp` |
|||
- replica C = `Rebuilding` |
|||
|
|||
Volume-level durability policy is computed per mode. |
|||
|
|||
### `sync_all` |
|||
- all required replicas must be `InSync` |
|||
|
|||
### `sync_quorum` |
|||
- enough replicas must be `InSync` |
|||
|
|||
### `best_effort` |
|||
- primary local durability only |
|||
- replicas recover in background |
|||
|
|||
## Illegal or Suspicious Conditions |
|||
|
|||
These should force rejection or abort: |
|||
|
|||
1. replica reports `replicaFlushedLSN > headLSN` |
|||
2. replica progress belongs to wrong epoch |
|||
3. requested recovery window is not recoverable |
|||
4. recovery reservation cannot be granted |
|||
5. snapshot base does not match claimed `cpLSN` |
|||
6. replay stream shows impossible gap/ordering after reconstruction |
|||
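The first two conditions are pure arithmetic on reported progress and can be rejected at the door. This sketch checks only those two; the remaining conditions need feasibility and snapshot context not modeled here, and `validateProgress` is an illustrative name.

```go
package main

import (
	"errors"
	"fmt"
)

// Progress is a replica's reported recovery progress.
type Progress struct {
	FlushedLSN uint64
	Epoch      uint64
}

// validateProgress rejects reports that are impossible (ahead of the head)
// or fenced (wrong epoch) before any state transition is considered.
func validateProgress(p Progress, headLSN, currentEpoch uint64) error {
	if p.FlushedLSN > headLSN {
		return errors.New("impossible progress: replicaFlushedLSN > headLSN")
	}
	if p.Epoch != currentEpoch {
		return errors.New("progress belongs to wrong epoch")
	}
	return nil
}

func main() {
	fmt.Println(validateProgress(Progress{FlushedLSN: 90, Epoch: 3}, 100, 3))  // <nil>
	fmt.Println(validateProgress(Progress{FlushedLSN: 120, Epoch: 3}, 100, 3)) // error
}
```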
|
|||
## Design Guidance |
|||
|
|||
V2 should be implemented so that: |
|||
|
|||
1. state owns recovery semantics |
|||
2. anchors make transitions explicit |
|||
3. retention obligations are derived from state |
|||
4. catch-up admission requires reservation, not guesswork |
|||
5. mode semantics are derived from `InSync` eligibility |
|||
|
|||
This is better than burying recovery behavior across many ad hoc code paths. |
|||
|
|||
## Bottom Line |
|||
|
|||
V2 is fundamentally a state machine problem. |
|||
|
|||
The correct abstraction is not: |
|||
- some edge cases around WAL replay |
|||
|
|||
It is: |
|||
- replicas move through explicit states while the primary head continues advancing and recovery windows must be provable and reserved |
|||
|
|||
So V2 must be designed around: |
|||
- state definitions |
|||
- anchor LSNs |
|||
- transition rules |
|||
- retention obligations |
|||
- recoverability checks |
|||
- recovery reservations |
|||
- abort conditions |
|||
|||
# WAL Replication V2 |
|||
|
|||
Date: 2026-03-26 |
|||
Status: design proposal |
|||
Purpose: redesign WAL-based block replication around explicit short-gap catch-up and long-gap reconstruction |
|||
|
|||
## Goal |
|||
|
|||
Provide a replication architecture that: |
|||
|
|||
- keeps the primary write path fast |
|||
- supports correct synchronous durability semantics |
|||
- supports short-gap reconnect catch-up using WAL |
|||
- avoids paying unbounded WAL retention tax for long-lag replicas |
|||
- uses reconstruction from a real checkpoint/snapshot base for larger lag |
|||
|
|||
This design replaces a "WAL does everything" mindset with a 3-tier recovery model. |
|||
|
|||
## Core Principle |
|||
|
|||
WAL is excellent for: |
|||
- recent ordered delta |
|||
- local crash recovery |
|||
- short-gap replica catch-up |
|||
|
|||
WAL is not the right long-range recovery mechanism for lagging block replicas. |
|||
|
|||
Long-gap recovery should use: |
|||
- a real checkpoint/snapshot base image |
|||
- plus WAL tail replay after that base point |
|||
|
|||
## Correctness Boundary |
|||
|
|||
Never reconstruct old state from current extent alone. |
|||
|
|||
Example: |
|||
|
|||
1. `LSN 100`: block `A = foo` |
|||
2. `LSN 120`: block `A = bar` |
|||
|
|||
If a replica needs state at `LSN 100`, current extent contains `bar`, not `foo`. |
|||
|
|||
Therefore: |
|||
- current extent is latest state |
|||
- not historical state |
|||
|
|||
So long-gap recovery must use a base image that is known to represent a real checkpoint/snapshot `cpLSN`. |
|||
|
|||
## 3-Tier Replication Model |
|||
|
|||
### Tier A: Keep-up |
|||
|
|||
Replica is close enough to the primary that normal ordered streaming keeps it current. |
|||
|
|||
Properties: |
|||
- normal steady-state mode |
|||
- no special recovery path |
|||
- replica stays `InSync` |
|||
|
|||
### Tier B: Lagging Catch-up |
|||
|
|||
Replica fell behind, but the primary still has enough recoverable history covering the missing range. |
|||
|
|||
Properties: |
|||
- reconnect handshake determines the replica durable point |
|||
- primary proves and reserves a bounded recovery window |
|||
- primary replays missing history |
|||
- replica returns to `InSync` only after replay, barrier confirmation, and promotion hold |
|||
|
|||
### Tier C: Reconstruction |
|||
|
|||
Replica is too far behind for direct replay. |
|||
|
|||
Properties: |
|||
- replica must rebuild from a real checkpoint/snapshot base |
|||
- after base image install, primary replays trailing history after `cpLSN` |
|||
- replica only re-enters `InSync` after durable catch-up completes |
|||
|
|||
## Architecture |
|||
|
|||
### Primary Artifacts |
|||
|
|||
The primary owns three forms of state: |
|||
|
|||
1. `Active WAL` |
|||
- recent ordered metadata/delta stream |
|||
- bounded by retention policy |
|||
|
|||
2. `Checkpoint Snapshot` |
|||
- immutable point-in-time base image at `cpLSN` |
|||
- used for long-gap reconstruction |
|||
|
|||
3. `Current Extent` |
|||
- latest live block state |
|||
- not a substitute for historical checkpoint state |
|||
|
|||
### Replica Artifacts |
|||
|
|||
Replica maintains: |
|||
|
|||
1. local WAL or equivalent recovery log |
|||
2. replica `receivedLSN` |
|||
3. replica `flushedLSN` |
|||
4. local extent state |
|||
|
|||
## Sender Model |
|||
|
|||
Do not ship recovery data inline from foreground write goroutines. |
|||
|
|||
Per replica, use: |
|||
- one ordered send queue |
|||
- one sender loop |
|||
|
|||
The sender loop owns: |
|||
- live stream shipping |
|||
- reconnect handling |
|||
- short-gap catch-up |
|||
- reconstruction tail replay |
|||
|
|||
This guarantees: |
|||
- strict LSN order per replica |
|||
- clean transport state ownership |
|||
- no inline shipping races in the primary write path |
|||
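The sender model above can be sketched as one buffered channel drained by one goroutine. This is a toy, not the real transport: `Sender`, the channel capacity, and the `sent` slice standing in for the network are all assumptions. What it demonstrates is the invariant: because only the loop goroutine touches the connection, per-replica LSN order is strict by construction.

```go
package main

import (
	"fmt"
	"sync"
)

// Record is one replicated WAL record.
type Record struct{ LSN uint64 }

// Sender is a per-replica sender: one ordered queue drained by one loop, so
// all shipping (live, catch-up, rebuild tail) happens in LSN order on a
// single goroutine.
type Sender struct {
	queue chan Record
	done  sync.WaitGroup
	sent  []uint64 // stands in for writes to the replica connection
}

func NewSender() *Sender {
	s := &Sender{queue: make(chan Record, 128)}
	s.done.Add(1)
	go s.loop()
	return s
}

func (s *Sender) loop() {
	defer s.done.Done()
	for rec := range s.queue {
		// Real transport writes happen here; ordering is guaranteed
		// because only this loop touches the connection.
		s.sent = append(s.sent, rec.LSN)
	}
}

// Enqueue is what the foreground write path calls; it never ships inline.
func (s *Sender) Enqueue(rec Record) { s.queue <- rec }

// Close drains the queue and waits for the loop to exit.
func (s *Sender) Close() { close(s.queue); s.done.Wait() }

func main() {
	s := NewSender()
	for lsn := uint64(1); lsn <= 5; lsn++ {
		s.Enqueue(Record{LSN: lsn})
	}
	s.Close()
	fmt.Println(s.sent) // [1 2 3 4 5]
}
```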
|
|||
## Write Path |
|||
|
|||
Primary write path: |
|||
|
|||
1. allocate monotonic `LSN` |
|||
2. append recovery metadata to local WAL or journal |
|||
3. enqueue the record to each replica sender queue |
|||
4. return according to durability mode semantics |
|||
|
|||
Flusher later: |
|||
- flushes dirty data to extent |
|||
- manages checkpoints |
|||
- manages bounded retention of WAL and other recovery dependencies |
|||
|
|||
## Recovery Classes |
|||
|
|||
V2 supports more than one local record type. |
|||
|
|||
### `WALInline` |
|||
|
|||
Properties: |
|||
- payload lives directly in WAL |
|||
- recoverable while WAL is retained |
|||
|
|||
### `ExtentReferenced` |
|||
|
|||
Properties: |
|||
- journal entry contains metadata only |
|||
- payload is resolved from extent/snapshot generation state |
|||
- direct-extent writes and future smart-WAL paths fall into this class |
|||
|
|||
Replica state does not encode these classes. |
|||
|
|||
Instead, the primary must answer a stricter question for reconnect: |
|||
- is `(startLSN, endLSN]` fully recoverable under the current epoch, and can it be reserved for the duration of recovery? |
|||
|
|||
## Replica Progress Model |
|||
|
|||
Each replica reports progress explicitly. |
|||
|
|||
### `receivedLSN` |
|||
- highest LSN received and appended locally |
|||
- not yet a durability guarantee |
|||
|
|||
### `flushedLSN` |
|||
- highest LSN durably persisted on the replica |
|||
- authoritative sync durability signal |
|||
|
|||
Only `flushedLSN` counts for: |
|||
- `sync_all` |
|||
- `sync_quorum` |
|||
|
|||
## Replica States |
|||
|
|||
Replica state is defined by `wal-replication-v2-state-machine.md`. |
|||
|
|||
Important highlights: |
|||
- `Bootstrapping` |
|||
- `InSync` |
|||
- `Lagging` |
|||
- `CatchingUp` |
|||
- `PromotionHold` |
|||
- `NeedsRebuild` |
|||
- `Rebuilding` |
|||
- `CatchUpAfterRebuild` |
|||
- `Failed` |
|||
|
|||
Only `InSync` replicas count toward sync durability. |
|||
|
|||
## Protocol |
|||
|
|||
### 1. Normal Streaming |
|||
|
|||
Primary sender loop: |
|||
- sends ordered replicated write records |
|||
|
|||
Replica: |
|||
1. validates ordering |
|||
2. appends locally |
|||
3. advances `receivedLSN` |
|||
|
|||
### 2. Barrier / Sync |
|||
|
|||
Primary sends: |
|||
- `BarrierReq{LSN, Epoch}` |
|||
|
|||
Replica: |
|||
1. wait until `receivedLSN >= LSN` |
|||
2. flush durable local state |
|||
3. set `flushedLSN = LSN` |
|||
4. reply `BarrierResp{Status, FlushedLSN}` |
|||
|
|||
Primary uses this to evaluate mode policy. |
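The replica-side barrier steps can be sketched as below. One simplification is labeled up front: instead of waiting until `receivedLSN >= LSN`, this sketch replies not-ready immediately; the durable flush itself and epoch checking are elided, and all type names are hypothetical.

```go
package main

type barrierReq struct {
	LSN   uint64
	Epoch uint64
}

type barrierResp struct {
	OK         bool
	FlushedLSN uint64
}

type replica struct {
	receivedLSN uint64 // highest LSN received and appended locally
	flushedLSN  uint64 // highest LSN durably persisted
}

func (r *replica) handleBarrier(req barrierReq) barrierResp {
	// 1. real code waits until receivedLSN >= req.LSN; this sketch replies not-ready
	if r.receivedLSN < req.LSN {
		return barrierResp{OK: false, FlushedLSN: r.flushedLSN}
	}
	// 2. flush durable local state (elided), then 3. advance flushedLSN
	r.flushedLSN = req.LSN
	return barrierResp{OK: true, FlushedLSN: r.flushedLSN} // 4. reply
}
```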
|||
|
|||
### 3. Reconnect Handshake |
|||
|
|||
On reconnect, primary obtains: |
|||
- current epoch |
|||
- primary head |
|||
- replica durable `flushedLSN` |
|||
|
|||
Then primary evaluates recovery feasibility. |
|||
|
|||
Possible outcomes: |
|||
|
|||
1. replica already caught up |
|||
- state -> `PromotionHold` or `InSync` depending on policy |
|||
|
|||
2. bounded catch-up possible |
|||
- reserve recovery window |
|||
- state -> `CatchingUp` |
|||
|
|||
3. direct replay not possible |
|||
- state -> `NeedsRebuild` |
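The three-way outcome branching can be sketched as a pure classification function. This is an assumption: a single `retainedFloor` stands in for the full feasibility-plus-reservation logic described below, and all names are hypothetical.

```go
package main

type reconnectOutcome int

const (
	outcomeCaughtUp     reconnectOutcome = iota // -> PromotionHold or InSync per policy
	outcomeCatchUp                              // -> reserve window, CatchingUp
	outcomeNeedsRebuild                         // -> NeedsRebuild
)

// classifyReconnect assumes (retainedFloor, head] is the directly replayable range.
func classifyReconnect(replicaFlushed, retainedFloor, head uint64) reconnectOutcome {
	switch {
	case replicaFlushed >= head:
		return outcomeCaughtUp
	case replicaFlushed >= retainedFloor:
		return outcomeCatchUp
	default:
		return outcomeNeedsRebuild
	}
}
```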
|||
|
|||
## Recovery Feasibility and Reservation |
|||
|
|||
The key V2 rule is: |
|||
- `fully recoverable` is not enough |
|||
- the primary must also reserve the recovery window |
|||
|
|||
Recommended engine-side flow: |
|||
|
|||
1. `CheckRecoveryFeasibility(startLSN, endLSN)` |
|||
2. if feasible, `ReserveRecoveryWindow(startLSN, endLSN)` |
|||
3. only then start `CatchingUp` or `CatchUpAfterRebuild` |
|||
|
|||
A recovery reservation pins: |
|||
- recovery metadata |
|||
- referenced payload generations |
|||
- required snapshots/base images |
|||
- current epoch lineage for the session |
|||
|
|||
If the reservation is lost during recovery: |
|||
- abort the current attempt |
|||
- fall back to `NeedsRebuild` |
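The check-then-reserve flow can be sketched with a trivial in-memory reservation table. All names here are hypothetical, and only the WAL retention floor is modeled; a real reservation would also pin payload generations, snapshots, and epoch lineage as listed above.

```go
package main

import "errors"

type reservation struct {
	StartLSN, EndLSN uint64
	Epoch            uint64
}

type recoveryManager struct {
	retainedFloor uint64 // lowest LSN still directly recoverable
	epoch         uint64
	reservations  []reservation
}

// reserve performs feasibility check and reservation as one admission step.
func (m *recoveryManager) reserve(startLSN, endLSN uint64) (*reservation, error) {
	// 1. feasibility: the whole (startLSN, endLSN] range must still be recoverable
	if startLSN < m.retainedFloor {
		return nil, errors.New("range no longer recoverable: fall back to NeedsRebuild")
	}
	// 2. reserve: pin the window under the current epoch before any replay starts;
	//    if this reservation is later lost, the recovery attempt must abort
	r := reservation{StartLSN: startLSN, EndLSN: endLSN, Epoch: m.epoch}
	m.reservations = append(m.reservations, r)
	return &r, nil
}
```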
|||
|
|||
## Tier B: Lagging Catch-up Algorithm |
|||
|
|||
When a replica is behind but within a recoverable retained window: |
|||
|
|||
1. choose a bounded target `H0` |
|||
2. reserve `(ReplicaFlushedLSN, H0]` |
|||
3. replay the missing range |
|||
4. barrier confirms durable `flushedLSN >= H0` |
|||
5. enter `PromotionHold` |
|||
6. only then restore `InSync` |
|||
|
|||
### Tail-chasing problem |
|||
|
|||
If the primary is writing faster than the replica can catch up, the replica may never converge. |
|||
|
|||
To handle this: |
|||
|
|||
1. define a bounded catch-up window |
|||
2. if catch-up rate is slower than ingest rate for too long: |
|||
- either temporarily throttle primary admission for strict `sync_all` |
|||
- or fail `sync_all` requests and let control-plane policy react |
|||
- or abort to rebuild |
|||
3. do not let a replica remain in unbounded perpetual `CatchingUp` |
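The abort rule above reduces to a small convergence check. This sketch assumes `gapSamples` holds periodic samples of `primaryHead - replicaFlushedLSN` taken over the bounded catch-up window (oldest first); the name and sampling policy are hypothetical.

```go
package main

// shouldAbortCatchUp reports whether a catch-up session failed to converge:
// the replica's lag did not shrink over the observation window. The caller
// then throttles admission, fails sync_all requests, or aborts to rebuild,
// as an explicit policy decision.
func shouldAbortCatchUp(gapSamples []uint64) bool {
	if len(gapSamples) < 2 {
		return false // not enough history to judge convergence
	}
	return gapSamples[len(gapSamples)-1] >= gapSamples[0]
}
```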
|||
|
|||
### Important rule |
|||
|
|||
For `sync_all`, the data path must not silently downgrade to `best_effort`. |
|||
|
|||
Correct behavior: |
|||
- bounded retry |
|||
- then fail |
|||
|
|||
Any mode change must be explicit policy, not silent transport behavior. |
|||
|
|||
## Tier C: Reconstruction Algorithm |
|||
|
|||
When a replica is too far behind for direct replay: |
|||
|
|||
1. mark replica `NeedsRebuild` |
|||
2. choose a real checkpoint/snapshot base at `cpLSN` |
|||
3. create a rebuild reservation |
|||
4. replica enters `Rebuilding` |
|||
5. replica pulls immutable checkpoint/snapshot image |
|||
6. replica installs that base image and sets base progress to `cpLSN` |
|||
7. primary replays trailing history `(cpLSN, H0]` |
|||
8. barrier confirms durable replay |
|||
9. replica enters `PromotionHold` |
|||
10. replica returns to `InSync` |
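The LSN bookkeeping behind steps 5-7 can be sketched as explicit phases. Image transfer, barriers, and state transitions are elided; `rebuildSession` and its fields are hypothetical names for illustration only.

```go
package main

type rebuildSession struct {
	cpLSN      uint64 // checkpoint LSN the base image represents exactly
	targetH0   uint64 // bounded catch-up target chosen for the tail replay
	baseDone   bool
	replayedTo uint64
}

// installBase corresponds to steps 5-6: after installing the immutable base
// image, replica progress is set to exactly cpLSN, never to "whatever the
// current extent happened to contain".
func (s *rebuildSession) installBase() {
	s.baseDone = true
	s.replayedTo = s.cpLSN
}

// replayTail corresponds to step 7: replay the trailing history (cpLSN, H0].
// It must be refused before the base image is in place.
func (s *rebuildSession) replayTail() bool {
	if !s.baseDone {
		return false
	}
	s.replayedTo = s.targetH0
	return true
}
```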
|||
|
|||
### Why snapshot/base image must be real |
|||
|
|||
If the replica needs state at `cpLSN`, the base image must represent exactly that checkpoint. |
|||
|
|||
Invalid: |
|||
- current extent copied at some later time and treated as historical `cpLSN` |
|||
|
|||
Valid: |
|||
- immutable snapshot |
|||
- copy-on-write checkpoint image |
|||
- frozen base image |
|||
|
|||
## Retention and Budget |
|||
|
|||
V2 retention is bounded. |
|||
|
|||
### WAL / recovery metadata retention |
|||
|
|||
Primary keeps only a bounded recent recovery window: |
|||
- `max_retained_wal_bytes` |
|||
- optionally `max_retained_wal_time` |
|||
|
|||
### Recovery reservation budget |
|||
|
|||
Reservations are also bounded: |
|||
- timeout |
|||
- bytes pinned |
|||
- snapshot dependency lifetime |
|||
|
|||
If a catch-up or rebuild session exceeds its reservation budget: |
|||
- primary aborts the session |
|||
- replica falls back to `NeedsRebuild` |
|||
- a newer rebuild plan may be chosen later |
|||
|
|||
## Sync Modes |
|||
|
|||
### `best_effort` |
|||
- ACK after primary local durability |
|||
- replicas may lag |
|||
- background catch-up or rebuild allowed |
|||
|
|||
### `sync_all` |
|||
- ACK only when all required replicas are `InSync` and durably at target LSN |
|||
- bounded retry only |
|||
- no silent downgrade |
|||
|
|||
### `sync_quorum` |
|||
- ACK when enough replicas are `InSync` and durably at target LSN |
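The three modes can be sketched as one ACK-eligibility check over per-replica views. Names are hypothetical; the key invariant from the sections above is that only `InSync` replicas durably at or past the target LSN count.

```go
package main

type replicaView struct {
	InSync     bool
	FlushedLSN uint64 // authoritative durable progress, not receivedLSN
}

// ackOK evaluates whether a write at target LSN may be acknowledged under
// the given mode. best_effort acks after primary local durability only.
func ackOK(mode string, target uint64, replicas []replicaView, quorum int) bool {
	n := 0
	for _, r := range replicas {
		if r.InSync && r.FlushedLSN >= target {
			n++
		}
	}
	switch mode {
	case "best_effort":
		return true
	case "sync_all":
		return n == len(replicas)
	case "sync_quorum":
		return n >= quorum
	}
	return false
}
```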
|||
|
|||
## Why This Direction |
|||
|
|||
V2 separates three different concerns cleanly: |
|||
|
|||
1. fast steady-state replication |
|||
2. short-gap replay |
|||
3. long-gap reconstruction |
|||
|
|||
This avoids forcing WAL alone to solve all recovery cases. |
|||
|
|||
## Implementation Order |
|||
|
|||
Recommended order: |
|||
|
|||
1. pure FSM |
|||
2. ordered sender loop |
|||
3. bounded direct replay |
|||
4. checkpoint/snapshot reconstruction |
|||
5. smarter local write path and recovery classes |
|||
6. policy and control-plane integration |
|||
|
|||
## Phase 13 current direction |
|||
|
|||
Current Phase 13 / WAL V1 is still: |
|||
- fixing correctness of WAL-centered sync replication |
|||
- focused mainly on bounded WAL replay and rebuild fallback |
|||
|
|||
That is the right bridge. |
|||
|
|||
V2 should follow after WAL V1 closes. |
|||
|
|||
## Bottom Line |
|||
|
|||
V2 is not "more WAL features." |
|||
|
|||
It is: |
|||
- explicit recovery feasibility |
|||
- explicit recovery reservations |
|||
- ordered sender loops |
|||
- short-gap replay for recent lag |
|||
- checkpoint/snapshot reconstruction for long lag |
|||
- promotion back to `InSync` only after durable proof |
|||
|||
# WAL V1 To V2 Mapping |
|||
|
|||
Date: 2026-03-26 |
|||
Status: working note |
|||
Purpose: map the current WAL V1 scattered state across `sw-block` into the proposed WAL V2 FSM vocabulary |
|||
|
|||
## Why This Note Exists |
|||
|
|||
Current WAL V1 correctness logic is spread across: |
|||
|
|||
- `wal_shipper.go` |
|||
- `replica_apply.go` |
|||
- `dist_group_commit.go` |
|||
- `blockvol.go` |
|||
- `promotion.go` |
|||
- `rebuild.go` |
|||
- heartbeat/master reporting |
|||
|
|||
This note does not propose immediate code changes. |
|||
|
|||
It exists to answer two questions: |
|||
|
|||
1. what state already exists in WAL V1 today? |
|||
2. how does that state map into the cleaner WAL V2 FSM model? |
|||
|
|||
## Current V1 State Owners |
|||
|
|||
### 1. Shipper state |
|||
|
|||
Primary-side per-replica transport and recovery state lives mainly in: |
|||
- `weed/storage/blockvol/wal_shipper.go` |
|||
|
|||
Current V1 shipper states: |
|||
- `ReplicaDisconnected` |
|||
- `ReplicaConnecting` |
|||
- `ReplicaCatchingUp` |
|||
- `ReplicaInSync` |
|||
- `ReplicaDegraded` |
|||
- `ReplicaNeedsRebuild` |
|||
|
|||
Other shipper-owned flags/anchors: |
|||
- `replicaFlushedLSN` |
|||
- `hasFlushedProgress` |
|||
- `catchupFailures` |
|||
- `lastContactTime` |
|||
|
|||
### 2. Replica receiver progress |
|||
|
|||
Replica-side receive/apply progress lives mainly in: |
|||
- `weed/storage/blockvol/replica_apply.go` |
|||
|
|||
Current V1 replica progress: |
|||
- `receivedLSN` |
|||
- `flushedLSN` |
|||
- duplicate/gap handling in `applyEntry()` |
|||
|
|||
### 3. Volume-level durability policy |
|||
|
|||
Volume-level sync semantics live mainly in: |
|||
- `weed/storage/blockvol/dist_group_commit.go` |
|||
|
|||
Current V1 policy uses: |
|||
- local WAL sync result |
|||
- per-shipper barrier results |
|||
- `DurabilityBestEffort` |
|||
- `DurabilitySyncAll` |
|||
- `DurabilitySyncQuorum` |
|||
|
|||
### 4. Volume-level retention/checkpoint state |
|||
|
|||
Primary-side local checkpoint and WAL retention state lives mainly in: |
|||
- `weed/storage/blockvol/blockvol.go` |
|||
- `weed/storage/blockvol/flusher.go` |
|||
|
|||
Current V1 anchors: |
|||
- `nextLSN` |
|||
- `CheckpointLSN()` |
|||
- WAL retained range |
|||
- retention-floor callbacks from `ShipperGroup` |
|||
|
|||
### 5. Role/assignment state |
|||
|
|||
Master-driven volume role state lives mainly in: |
|||
- `weed/storage/blockvol/promotion.go` |
|||
- `weed/storage/blockvol/blockvol.go` |
|||
- `weed/server/volume_server_block.go` |
|||
|
|||
Current V1 roles: |
|||
- `RolePrimary` |
|||
- `RoleReplica` |
|||
- `RoleStale` |
|||
- `RoleRebuilding` |
|||
- `RoleDraining` |
|||
|
|||
### 6. Rebuild state |
|||
|
|||
Existing V1 rebuild transport/process lives mainly in: |
|||
- `weed/storage/blockvol/rebuild.go` |
|||
|
|||
Current V1 rebuild phases: |
|||
- WAL catch-up attempt |
|||
- full extent copy |
|||
- trailing WAL catch-up |
|||
- rejoin via assignment + fresh shipper bootstrap |
|||
|
|||
### 7. Heartbeat/master-visible replication state |
|||
|
|||
Master-visible state lives mainly in: |
|||
- `weed/storage/blockvol/block_heartbeat.go` |
|||
- `weed/storage/blockvol/blockvol.go` |
|||
- server-side registry/master handling |
|||
|
|||
Current V1 visible fields include: |
|||
- `ReplicaDegraded` |
|||
- `ReplicaShipperStates []ReplicaShipperStatus` |
|||
- role/epoch/checkpoint/head state |
|||
|
|||
## V1 To V2 Mapping |
|||
|
|||
### Shipper state mapping |
|||
|
|||
| WAL V1 shipper state | Proposed WAL V2 FSM state | Notes | |
|||
| --- | --- | --- | |
|||
| `ReplicaDisconnected` | `Bootstrapping` or `Lagging` | Fresh shipper with no durable progress maps to `Bootstrapping`; previously-synced disconnected replica maps to `Lagging`. | |
|||
| `ReplicaConnecting` | transitional part of `Lagging -> CatchingUp` | V2 should model this as an event/session phase, not a durable steady state. | |
|||
| `ReplicaCatchingUp` | `CatchingUp` | Direct mapping for short-gap replay. | |
|||
| `ReplicaInSync` | `InSync` | Direct mapping. | |
|||
| `ReplicaDegraded` | `Lagging` | V1 transport failure state becomes the cleaner V2 recovery-needed state. | |
|||
| `ReplicaNeedsRebuild` | `NeedsRebuild` | Direct mapping. | |
|||
|
|||
Main V1 cleanup opportunity: |
|||
- V1 mixes transport/session detail (`Connecting`) with recovery lifecycle state. |
|||
- V2 should keep the long-lived FSM smaller and push connection mechanics into sender-loop/session logic. |
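The mapping table above can be expressed directly as a function, which also makes the one ambiguous row explicit: `ReplicaDisconnected` splits on whether the replica ever had durable progress. This is a sketch of the proposed mapping, not existing code.

```go
package main

// mapShipperState translates a WAL V1 shipper state into the proposed WAL V2
// FSM state. hasDurableProgress disambiguates ReplicaDisconnected per the
// table: fresh shippers bootstrap, previously-synced ones are merely lagging.
func mapShipperState(v1 string, hasDurableProgress bool) string {
	switch v1 {
	case "ReplicaDisconnected":
		if hasDurableProgress {
			return "Lagging"
		}
		return "Bootstrapping"
	case "ReplicaConnecting":
		return "Lagging" // transitional; V2 models connecting as a session phase
	case "ReplicaCatchingUp":
		return "CatchingUp"
	case "ReplicaInSync":
		return "InSync"
	case "ReplicaDegraded":
		return "Lagging" // transport failure becomes recovery-needed
	case "ReplicaNeedsRebuild":
		return "NeedsRebuild"
	}
	return "Failed"
}
```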
|||
|
|||
### Replica receiver progress mapping |
|||
|
|||
| WAL V1 field | WAL V2 concept | Notes | |
|||
| --- | --- | --- | |
|||
| `receivedLSN` | `receivedLSN` | Keep as transport/apply progress only. | |
|||
| `flushedLSN` | `replicaFlushedLSN` | Keep as authoritative durability anchor. | |
|||
| duplicate/gap rules | replay validity rules | These become part of the V2 replay contract, not ad hoc receiver behavior. | |
|||
|
|||
Main V1 cleanup opportunity: |
|||
- V1 receiver progress is already conceptually sound. |
|||
- V2 should keep it but drive it from explicit FSM transitions and replay reservations. |
|||
|
|||
### Volume durability policy mapping |
|||
|
|||
| WAL V1 behavior | WAL V2 concept | Notes | |
|||
| --- | --- | --- | |
|||
| `BarrierAll` against current shippers | promotion and sync gate | V2 should keep barrier-based durability truth. | |
|||
| `sync_all` requires all barriers | `InSync` eligibility gate | Same rule, but V2 eligibility should come from FSM state rather than scattered checks. | |
|||
| `best_effort` ignores barrier failures | background recovery mode | Same high-level policy. | |
|||
| `sync_quorum` counts successful barriers | quorum over `InSync` replicas | Same direction, but should be derived from explicit FSM state. | |
|||
|
|||
Main V1 cleanup opportunity: |
|||
- durability mode logic should depend on `IsSyncEligible()`-style state, not raw shipper state enums spread across code. |
|||
|
|||
### Retention/checkpoint mapping |
|||
|
|||
| WAL V1 concept | WAL V2 concept | Notes | |
|||
| --- | --- | --- | |
|||
| `CheckpointLSN()` | checkpoint/base anchor | Keep, but V2 also adds explicit `cpLSN` snapshot semantics. | |
|||
| retention floor from recoverable replicas | recoverability budget | Keep the idea, but V2 turns this into explicit reservation management. | |
|||
| timeout-based `NeedsRebuild` | janitor-driven `Lagging -> NeedsRebuild` | Keep as background control logic, not hot-path mutation. | |
|||
|
|||
Main V1 cleanup opportunity: |
|||
- V1 retains data because replicas might need it. |
|||
- V2 should reserve specific recovery windows, not rely only on ambient retention conditions. |
|||
|
|||
### Role/assignment mapping |
|||
|
|||
| WAL V1 role state | WAL V2 meaning | Notes | |
|||
| --- | --- | --- | |
|||
| `RolePrimary` | primary ownership / epoch authority | Not a replica FSM state; remains volume/control-plane state. | |
|||
| `RoleReplica` | replica service role | Orthogonal to replication FSM state. A replica volume may be `RoleReplica` while its sender-facing state is `Bootstrapping`, `Lagging`, or `InSync`. | |
|||
| `RoleStale` | pre-rebuild/non-serving | Closest to `NeedsRebuild` preparation on the volume role side. | |
|||
| `RoleRebuilding` | rebuild session role | Maps to volume-wide orchestration around V2 `Rebuilding`. | |
|||
| `RoleDraining` | assignment/failover coordination | Outside replica FSM; remains a volume transition role. | |
|||
|
|||
Main V1 cleanup opportunity: |
|||
- role state and replication FSM state are different dimensions. |
|||
- V1 sometimes implicitly blends them. |
|||
- V2 should keep them separate: |
|||
- control-plane role FSM |
|||
- per-replica replication FSM |
|||
|
|||
### Rebuild flow mapping |
|||
|
|||
| WAL V1 rebuild phase | WAL V2 FSM phase | Notes | |
|||
| --- | --- | --- | |
|||
| WAL catch-up pre-pass | `Lagging -> CatchingUp` if feasible | Same idea, but V2 requires recoverability proof and reservation. | |
|||
| full extent copy | `NeedsRebuild -> Rebuilding` | Same high-level phase. | |
|||
| trailing WAL catch-up | `CatchUpAfterRebuild` | Direct conceptual mapping. | |
|||
| fresh shipper bootstrap after reassignment | `Bootstrapping` then promotion | V1 does this through assignment refresh; V2 may eventually do it with cleaner local transitions. | |
|||
|
|||
Main V1 cleanup opportunity: |
|||
- V1 rebuild success is currently rejoined indirectly through control-plane reassignment. |
|||
- V2 should eventually make rebuild completion and promotion explicit FSM transitions. |
|||
|
|||
### Heartbeat/master state mapping |
|||
|
|||
| WAL V1 visible state | WAL V2 meaning | Notes | |
|||
| --- | --- | --- | |
|||
| `ReplicaShipperStatus{DataAddr, State, FlushedLSN}` | control-plane view of per-replica FSM | Good starting shape. | |
|||
| `ReplicaDegraded` | derived summary only | Too coarse for V2 decision-making; keep only as convenience/compat field. | |
|||
| role/epoch/head/checkpoint | role FSM + replication anchors | Continue reporting; V2 may need richer recovery reservation visibility later. | |
|||
|
|||
Main V1 cleanup opportunity: |
|||
- master-facing replication state should be per replica, not summarized as one degraded bit. |
|||
|
|||
## Current V1 Event Sources vs V2 Events |
|||
|
|||
### V1 event source: `Barrier()` outcome |
|||
|
|||
Current effects: |
|||
- mark `InSync` |
|||
- update `replicaFlushedLSN` |
|||
- mark degraded on error |
|||
|
|||
V2 event mapping: |
|||
- `BarrierSuccess` |
|||
- `BarrierFailure` |
|||
- `PromotionHealthy` |
|||
|
|||
### V1 event source: reconnect handshake |
|||
|
|||
Current effects: |
|||
- `Connecting` |
|||
- choose `InSync`, `CatchingUp`, or `NeedsRebuild` |
|||
|
|||
V2 event mapping: |
|||
- `ReconnectObserved` |
|||
- `RecoveryFeasible` |
|||
- `RecoveryReservationGranted` |
|||
- `ReconnectNeedsRebuild` |
|||
|
|||
### V1 event source: retention budget evaluation |
|||
|
|||
Current effects: |
|||
- stale replica becomes `NeedsRebuild` |
|||
|
|||
V2 event mapping: |
|||
- `RecoverabilityExpired` |
|||
- `BackgroundJanitorNeedsRebuild` |
|||
|
|||
### V1 event source: rebuild assignment and `StartRebuild` |
|||
|
|||
Current effects: |
|||
- role becomes `RoleRebuilding` |
|||
- run baseline + trailing catch-up |
|||
- rejoin later via reassignment |
|||
|
|||
V2 event mapping: |
|||
- `StartRebuild` |
|||
- `RebuildBaseApplied` |
|||
- `RebuildReservationLost` |
|||
- `RebuildCompleteReadyForPromotion` |
|||
|
|||
## Main Gaps Between V1 And V2 |
|||
|
|||
### 1. V1 has shipper state, but not a pure FSM |
|||
|
|||
Current V1 state is embedded in: |
|||
- transport logic |
|||
- barrier logic |
|||
- retention logic |
|||
- rebuild orchestration |
|||
|
|||
V2 goal: |
|||
- one pure FSM that owns state and anchors |
|||
- transport/session code only executes actions |
|||
|
|||
### 2. V1 does not model reservation explicitly |
|||
|
|||
Current V1 asks, roughly: |
|||
- is WAL still retained? |
|||
|
|||
V2 must ask: |
|||
- is `(startLSN, endLSN]` fully recoverable? |
|||
- can the primary reserve that window until recovery completes? |
|||
|
|||
### 3. V1 has no explicit promotion debounce state |
|||
|
|||
Current V1 goes effectively: |
|||
- caught up -> `InSync` |
|||
|
|||
V2 adds: |
|||
- `PromotionHold` |
|||
|
|||
### 4. V1 rebuild completion is control-plane indirect |
|||
|
|||
Current V1: |
|||
- old `NeedsRebuild` shipper stays stuck |
|||
- master reassigns |
|||
- fresh shipper bootstraps |
|||
|
|||
V2 likely wants: |
|||
- cleaner local FSM transitions, even if control plane still participates |
|||
|
|||
### 5. V1 does not yet encode recovery classes |
|||
|
|||
Current V1 is mostly WAL-centric. |
|||
|
|||
V2 should support: |
|||
- `WALInline` |
|||
- `ExtentReferenced` |
|||
|
|||
without leaking storage details into replica state. |
|||
|
|||
## What Should Stay From V1 |
|||
|
|||
These V1 ideas are solid and should be preserved: |
|||
|
|||
1. `replicaFlushedLSN` as sync truth |
|||
2. barrier-driven durability confirmation |
|||
3. explicit `NeedsRebuild` |
|||
4. per-replica status reporting to master |
|||
5. retention budgets eventually forcing rebuild |
|||
6. rebuild as a separate path from normal catch-up |
|||
|
|||
## What Should Move In V2 |
|||
|
|||
These are the main redesign items: |
|||
|
|||
1. move scattered shipper/recovery state into one pure FSM |
|||
2. separate transport/session phases from durable FSM state |
|||
3. add `Bootstrapping` and `PromotionHold` |
|||
4. add recoverability proof and reservation as first-class concepts |
|||
5. make replay/rebuild admission depend on reservation, not just present-time checks |
|||
6. cleanly separate: |
|||
- control-plane role FSM |
|||
- per-replica replication FSM |
|||
|
|||
## Bottom Line |
|||
|
|||
WAL V1 already contains most of the important primitives: |
|||
|
|||
- durable progress |
|||
- barrier truth |
|||
- catch-up |
|||
- rebuild detection |
|||
- master-visible per-replica state |
|||
|
|||
What V2 changes is not the existence of these ideas. |
|||
|
|||
It changes their organization: |
|||
- from scattered transport/rebuild logic |
|||
- to one explicit, testable FSM with recovery reservations and cleaner state boundaries |
|||
|||
# WAL V2 Tiny Prototype |
|||
|
|||
Date: 2026-03-26 |
|||
Status: design/prototyping plan |
|||
Purpose: validate the core V2 replication logic before committing to a broader redesign |
|||
|
|||
## Goal |
|||
|
|||
Build a small, non-production prototype that proves the core V2 ideas: |
|||
|
|||
1. `ExtentBackend` abstraction |
|||
2. 3-tier replication FSM |
|||
3. async ordered sender loop |
|||
4. barrier-driven durability tracking |
|||
5. short-gap catch-up vs long-gap rebuild boundary |
|||
6. recovery feasibility and reservation semantics |
|||
|
|||
This prototype is for discovering: |
|||
- state complexity |
|||
- recovery correctness |
|||
- sender-loop behavior |
|||
- performance shape |
|||
|
|||
It is not for shipping. |
|||
|
|||
## Prototype Scope |
|||
|
|||
### 1. Extent backend isolation layer |
|||
|
|||
Define a clean backend interface for extent reads/writes. |
|||
|
|||
Initial implementation: |
|||
- `FileBackend` |
|||
- normal Linux file |
|||
- `pread` |
|||
- `pwrite` |
|||
- optional `fallocate` |
|||
|
|||
Do not start with raw-device allocation. |
|||
|
|||
The point is to stabilize: |
|||
- extent semantics |
|||
- base-image import/export assumptions |
|||
- checkpoint/snapshot integration points |
|||
|
|||
### 2. V2 asynchronous replication FSM |
|||
|
|||
Build a pure in-memory FSM for one replica. |
|||
|
|||
FSM owns: |
|||
- state |
|||
- anchor LSNs |
|||
- transition legality |
|||
- sync eligibility |
|||
- action suggestions |
|||
- recovery reservation metadata |
|||
|
|||
Target state set: |
|||
- `Bootstrapping` |
|||
- `InSync` |
|||
- `Lagging` |
|||
- `CatchingUp` |
|||
- `PromotionHold` |
|||
- `NeedsRebuild` |
|||
- `Rebuilding` |
|||
- `CatchUpAfterRebuild` |
|||
- `Failed` |
|||
|
|||
The FSM must not do: |
|||
- network I/O |
|||
- disk I/O |
|||
- goroutine management |
|||
|
|||
### 3. Sender loop + barrier primitive |
|||
|
|||
For each replica: |
|||
- one ordered sender goroutine |
|||
- one non-blocking enqueue path from primary write path |
|||
- one barrier/progress path |
|||
|
|||
Primary write path: |
|||
1. allocate `LSN` |
|||
2. append local WAL/journal metadata |
|||
3. enqueue to sender loop |
|||
4. return according to durability mode |
|||
|
|||
The sender loop is responsible for: |
|||
- live ordered send |
|||
- reconnect handling |
|||
- catch-up replay |
|||
- rebuild-tail replay |
|||
|
|||
## Explicit Non-Goals |
|||
|
|||
These are intentionally excluded from the tiny prototype: |
|||
|
|||
- raw allocator |
|||
- garbage collection |
|||
- `NVMe-oF` |
|||
- `ublk` |
|||
- chain replication |
|||
- CSI / control plane |
|||
- multi-replica quorum |
|||
- encryption |
|||
- real snapshot storage optimization |
|||
|
|||
These are extension layers, not the core logic being validated here. |
|||
|
|||
## Design Principle |
|||
|
|||
Those excluded items are not being rejected. |
|||
|
|||
They are treated as: |
|||
- extensions of the core logic |
|||
|
|||
The prototype should be designed so they can later plug in without rewriting the state machine. |
|||
|
|||
## Suggested Layout |
|||
|
|||
One reasonable layout: |
|||
|
|||
- `weed/storage/blockvol/fsmv2/` |
|||
- `fsm.go` |
|||
- `events.go` |
|||
- `actions.go` |
|||
- `fsm_test.go` |
|||
- `weed/storage/blockvol/prototypev2/` |
|||
- `backend.go` |
|||
- `file_backend.go` |
|||
- `sender_loop.go` |
|||
- `barrier.go` |
|||
- `prototype_test.go` |
|||
|
|||
Preferred direction: |
|||
- keep it close enough to production packages that later reuse is easy |
|||
- but clearly marked experimental |
|||
|
|||
## Core Interfaces |
|||
|
|||
### Extent backend |
|||
|
|||
Example direction: |
|||
|
|||
```go |
|||
type ExtentBackend interface { |
|||
ReadAt(p []byte, off int64) (int, error) |
|||
WriteAt(p []byte, off int64) (int, error) |
|||
Sync() error |
|||
Size() uint64 |
|||
} |
|||
``` |
|||
|
|||
### FSM |
|||
|
|||
Example direction: |
|||
|
|||
```go |
|||
type ReplicaFSM struct { |
|||
// state |
|||
// epoch |
|||
// anchor LSNs |
|||
// reservation metadata |
|||
} |
|||
|
|||
func (f *ReplicaFSM) Apply(evt ReplicaEvent) ([]ReplicaAction, error) |
|||
``` |
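To make the `Apply` direction concrete, here is one legal transition (`Lagging -> CatchingUp` on a granted reservation) with illegal ones rejected. State and event names follow the target state set above; the struct shape and error-based rejection are illustrative assumptions, not the final API.

```go
package main

import "fmt"

type ReplicaState string

const (
	StateLagging    ReplicaState = "Lagging"
	StateCatchingUp ReplicaState = "CatchingUp"
)

type ReplicaEvent string

const EvtReservationGranted ReplicaEvent = "RecoveryReservationGranted"

// ReplicaFSM is pure: no network I/O, no disk I/O, no goroutines.
type ReplicaFSM struct {
	State ReplicaState
	Epoch uint64
}

// Apply validates transition legality and mutates state; in the full design
// it would also return actions for the sender loop to execute.
func (f *ReplicaFSM) Apply(evt ReplicaEvent) error {
	switch {
	case f.State == StateLagging && evt == EvtReservationGranted:
		f.State = StateCatchingUp
		return nil
	default:
		return fmt.Errorf("illegal transition: %s in state %s", evt, f.State)
	}
}
```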
|||
|
|||
### Sender loop |
|||
|
|||
Example direction: |
|||
|
|||
```go |
|||
type SenderLoop struct { |
|||
// input queue |
|||
// FSM |
|||
// transport mock/adapter |
|||
} |
|||
``` |
|||
|
|||
## What The Prototype Must Prove |
|||
|
|||
### A. FSM correctness |
|||
|
|||
The FSM must show that the state set is sufficient and coherent. |
|||
|
|||
Key scenarios: |
|||
|
|||
1. `Bootstrapping -> InSync` |
|||
2. `InSync -> Lagging -> CatchingUp -> PromotionHold -> InSync` |
|||
3. `Lagging -> NeedsRebuild -> Rebuilding -> CatchUpAfterRebuild -> PromotionHold -> InSync` |
|||
4. epoch change aborts catch-up |
|||
5. epoch change aborts rebuild |
|||
6. reservation-lost aborts catch-up |
|||
7. rebuild-too-slow aborts reconstruction |
|||
8. flapping replica does not instantly re-enter `InSync` |
|||
|
|||
### B. Sender ordering |
|||
|
|||
The sender loop must prove: |
|||
- strict LSN order per replica |
|||
- no inline ship races from concurrent writes |
|||
- decoupled foreground write path |
|||
|
|||
### C. Barrier semantics |
|||
|
|||
Barrier must prove: |
|||
- it waits on replica progress |
|||
- it uses `flushedLSN`, not transport guesses |
|||
- it can drive promotion eligibility cleanly |
|||
|
|||
### D. Recovery boundary |
|||
|
|||
Prototype must make the handoff explicit: |
|||
- recent lag -> reserved replay window |
|||
- long lag -> rebuild from base image + trailing replay |
|||
|
|||
### E. Recovery reservation |
|||
|
|||
Prototype must make this explicit: |
|||
- a window is not enough |
|||
- it must be provable and then reserved |
|||
- losing the reservation must abort recovery cleanly |
|||
|
|||
## Performance Questions The Prototype Should Answer |
|||
|
|||
Not benchmark headlines. |
|||
|
|||
Instead: |
|||
|
|||
1. how much contention disappears from the hot write path after removing inline ship |
|||
2. how queue depth grows under slow replicas |
|||
3. when catch-up stops converging |
|||
4. how expensive promotion hold is |
|||
5. how much complexity is added by rebuild-tail replay |
|||
6. how much complexity is added by reservation management |
|||
|
|||
## Success Criteria |
|||
|
|||
The tiny prototype is successful if it gives clear answers to: |
|||
|
|||
1. can the V2 FSM be made explicit and testable? |
|||
2. does sender-loop ordering materially simplify the replication path? |
|||
3. is the catch-up vs rebuild boundary coherent under a moving primary head? |
|||
4. does reservation-based recoverability make the design safer and clearer? |
|||
5. does the architecture look simpler than extending WAL V1 forever? |
|||
|
|||
## Failure Criteria |
|||
|
|||
The prototype should be considered unsuccessful if: |
|||
|
|||
1. state count explodes and remains hard to reason about |
|||
2. sender loop does not materially simplify ordering/recovery |
|||
3. promotion and recovery rules remain too coupled to ad hoc timers and network callbacks |
|||
4. rebuild-from-base + trailing replay is still ambiguous even in a controlled prototype |
|||
5. reservation handling turns into unbounded complexity |
|||
|
|||
## Relationship To WAL V1 |
|||
|
|||
WAL V1 remains the current delivery line. |
|||
|
|||
This prototype is not a replacement for: |
|||
- `CP13-6` |
|||
- `CP13-7` |
|||
- `CP13-8` |
|||
- `CP13-9` |
|||
|
|||
It exists to inform what should move into WAL V2 after WAL V1 closes. |
|||
|
|||
## Bottom Line |
|||
|
|||
The tiny prototype should validate the core logic only: |
|||
|
|||
- clean backend boundary |
|||
- explicit FSM |
|||
- ordered async sender |
|||
- recoverability as a proof-plus-reservation problem |
|||
- rebuild as a separate recovery mode, not a WAL accident |
|||
|||
# private |
|||
|
|||
Deprecated in favor of `../.private/`. |
|||
|
|||
Private working area for: |
|||
- design sketches |
|||
- draft notes |
|||
- temporary comparison docs |
|||
- prototype experiments not ready to move into shared design docs |
|||
|
|||
Keep production-independent work here until it is ready to be promoted into: |
|||
- `../design/` |
|||
- `../prototype/` |
|||
- or the main repo docs under `learn/projects/sw-block/` |
|||
|||
# V2 Prototype |
|||
|
|||
Experimental WAL V2 prototype code lives here. |
|||
|
|||
Current prototype: |
|||
- `fsmv2/`: pure in-memory replication FSM prototype |
|||
- `volumefsm/`: volume-level orchestrator prototype above `fsmv2` |
|||
- `distsim/`: early distributed/data-correctness simulator with synthetic 4K block values |
|||
|
|||
Rules: |
|||
- do not wire this directly into WAL V1 production code |
|||
- keep interfaces and tests focused on architecture learning |
|||
- promote pieces into production only after V2 design stabilizes |
|||
|
|||
## Windows test workflow |
|||
|
|||
Because normal `go test` may be blocked by Windows Defender when it executes temporary test binaries from `%TEMP%`, use: |
|||
|
|||
```powershell |
|||
powershell -ExecutionPolicy Bypass -File .\sw-block\prototype\run-tests.ps1 |
|||
``` |
|||
|
|||
This builds test binaries into the workspace and runs them directly. |
|||
sw-block/prototype/distsim/cluster.go: 1120 lines (diff suppressed because it is too large)
sw-block/prototype/distsim/cluster_test.go: 1004 lines (diff suppressed because it is too large)
|||
// eventsim.go — timeout events and timer-race infrastructure.
|
|||
//
|
|||
// This file implements the eventsim layer within the distsim package.
|
|||
// The two conceptual layers share the Cluster model but serve different purposes:
|
|||
//
|
|||
// distsim (protocol layer — cluster.go, protocol.go):
|
|||
// - Protocol correctness: epoch fencing, barrier semantics, commit rules
|
|||
// - Reference-state validation: AssertCommittedRecoverable
|
|||
// - Recoverability logic: catch-up, rebuild, reservation
|
|||
// - Promotion/lineage: candidate eligibility, ranking
|
|||
// - Endpoint identity: address versioning, stale endpoint rejection
|
|||
// - Control-plane flow: heartbeat → detect → assignment
|
|||
//
|
|||
// eventsim (timing/race layer — this file):
|
|||
// - Explicit timeout events: barrier, catch-up, reservation
|
|||
// - Timer-triggered state transitions
|
|||
// - Same-tick race resolution: data events process before timeouts
|
|||
// - Timeout cancellation on successful ack/convergence
|
|||
//
|
|||
// Boundary rule:
|
|||
// - A scenario belongs in distsim tests if the bug is protocol-level
|
|||
// (wrong state, wrong commit, wrong rejection).
|
|||
// - A scenario belongs in eventsim tests if the bug is timing-level
|
|||
// (race between ack and timeout, ordering of concurrent events).
|
|||
// - Do not duplicate scenarios across both layers unless
|
|||
// timer/event ordering is the actual bug surface.
|
|||
|
|||
package distsim |
|||
|
|||
import "fmt" |
|||
|
|||
// TimeoutKind identifies the type of timeout event.
|
|||
type TimeoutKind string |
|||
|
|||
const ( |
|||
TimeoutBarrier TimeoutKind = "barrier" |
|||
TimeoutCatchup TimeoutKind = "catchup" |
|||
TimeoutReservation TimeoutKind = "reservation" |
|||
) |
|||
|
|||
// PendingTimeout represents a registered timeout that has not yet fired or been cancelled.
|
|||
type PendingTimeout struct { |
|||
Kind TimeoutKind |
|||
ReplicaID string |
|||
LSN uint64 // for barrier timeouts: which LSN's barrier
|
|||
DeadlineAt uint64 // absolute tick when timeout fires
|
|||
Cancelled bool |
|||
} |
|||
|
|||
// FiredTimeout records a timeout that actually fired (was not cancelled in time).
|
|||
type FiredTimeout struct { |
|||
PendingTimeout |
|||
FiredAt uint64 |
|||
} |
|||
|
|||
// barrierExpiredKey uniquely identifies a timed-out barrier instance.
|
|||
type barrierExpiredKey struct { |
|||
ReplicaID string |
|||
LSN uint64 |
|||
} |
|||
|
|||
// RegisterTimeout adds a pending timeout to the cluster.
|
|||
func (c *Cluster) RegisterTimeout(kind TimeoutKind, replicaID string, lsn uint64, deadline uint64) { |
|||
c.Timeouts = append(c.Timeouts, PendingTimeout{ |
|||
Kind: kind, |
|||
ReplicaID: replicaID, |
|||
LSN: lsn, |
|||
DeadlineAt: deadline, |
|||
}) |
|||
} |
|||
|
|||
// CancelTimeout cancels a pending timeout matching the given kind, replica, and LSN.
|
|||
// For catch-up/reservation timeouts, LSN is ignored (matched by kind+replica only).
|
|||
func (c *Cluster) CancelTimeout(kind TimeoutKind, replicaID string, lsn uint64) { |
|||
for i := range c.Timeouts { |
|||
t := &c.Timeouts[i] |
|||
if t.Cancelled { |
|||
continue |
|||
} |
|||
if t.Kind != kind || t.ReplicaID != replicaID { |
|||
continue |
|||
} |
|||
if kind == TimeoutBarrier && t.LSN != lsn { |
|||
continue |
|||
} |
|||
t.Cancelled = true |
|||
c.logEvent(EventTimeoutCancelled, fmt.Sprintf("%s replica=%s lsn=%d", kind, replicaID, t.LSN)) |
|||
} |
|||
} |
|||
|
|||
// fireTimeouts checks all pending timeouts against the current tick.
|
|||
// Called by Tick() AFTER message delivery, so data events (acks) get
|
|||
// a chance to cancel timeouts before they fire. This is the same-tick
|
|||
// race resolution rule: data before timers.
|
|||
//
|
|||
// State-guard rules (prevent stale timeout from mutating post-success state):
|
|||
// - CatchupTimeout only fires if replica is still CatchingUp
|
|||
// - ReservationTimeout only fires if replica is still CatchingUp
|
|||
// - BarrierTimeout marks the barrier instance as expired (late acks rejected)
|
|||
func (c *Cluster) fireTimeouts() { |
|||
var remaining []PendingTimeout |
|||
for i := range c.Timeouts { |
|||
t := c.Timeouts[i] |
|||
if t.Cancelled { |
|||
continue |
|||
} |
|||
if c.Now < t.DeadlineAt { |
|||
remaining = append(remaining, t) |
|||
continue |
|||
} |
|||
// Check whether the timeout still has authority to mutate state.
|
|||
stale := false |
|||
switch t.Kind { |
|||
case TimeoutBarrier: |
|||
// Barrier timeouts always apply — they mark the instance as expired.
|
|||
case TimeoutCatchup, TimeoutReservation: |
|||
// Only valid if replica is still CatchingUp. If already recovered
|
|||
// or escalated, the timeout is stale and has no authority.
|
|||
if n := c.Nodes[t.ReplicaID]; n == nil || n.ReplicaState != NodeStateCatchingUp { |
|||
stale = true |
|||
} |
|||
} |
|||
|
|||
if stale { |
|||
c.IgnoredTimeouts = append(c.IgnoredTimeouts, FiredTimeout{ |
|||
PendingTimeout: t, |
|||
FiredAt: c.Now, |
|||
}) |
|||
c.logEvent(EventTimeoutIgnored, fmt.Sprintf("%s replica=%s lsn=%d (stale)", t.Kind, t.ReplicaID, t.LSN)) |
|||
continue |
|||
} |
|||
|
|||
// Timeout fires with authority.
|
|||
c.FiredTimeouts = append(c.FiredTimeouts, FiredTimeout{ |
|||
PendingTimeout: t, |
|||
FiredAt: c.Now, |
|||
}) |
|||
c.logEvent(EventTimeoutFired, fmt.Sprintf("%s replica=%s lsn=%d", t.Kind, t.ReplicaID, t.LSN)) |
|||
switch t.Kind { |
|||
case TimeoutBarrier: |
|||
c.removeQueuedBarrier(t.ReplicaID, t.LSN) |
|||
c.ExpiredBarriers[barrierExpiredKey{t.ReplicaID, t.LSN}] = true |
|||
case TimeoutCatchup: |
|||
c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild |
|||
case TimeoutReservation: |
|||
c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild |
|||
} |
|||
} |
|||
c.Timeouts = remaining |
|||
} |
|||
|
|||
// removeQueuedBarrier removes a re-queuing barrier from the message queue
|
|||
// after its timeout fires. Without this, the barrier would re-queue indefinitely.
|
|||
func (c *Cluster) removeQueuedBarrier(replicaID string, lsn uint64) { |
|||
var kept []inFlightMessage |
|||
for _, item := range c.Queue { |
|||
if item.msg.Kind == MsgBarrier && item.msg.To == replicaID && item.msg.TargetLSN == lsn { |
|||
continue |
|||
} |
|||
kept = append(kept, item) |
|||
} |
|||
c.Queue = kept |
|||
} |
|||
|
|||
// cancelRecoveryTimeouts cancels all catch-up and reservation timeouts for a replica.
|
|||
// Called automatically by CatchUpWithEscalation on convergence or escalation,
|
|||
// so stale timeouts cannot regress a replica that already recovered or failed.
|
|||
func (c *Cluster) cancelRecoveryTimeouts(replicaID string) { |
|||
c.CancelTimeout(TimeoutCatchup, replicaID, 0) |
|||
c.CancelTimeout(TimeoutReservation, replicaID, 0) |
|||
} |
|||
|
|||
// === Tick event log ===
|
|||
|
|||
// TickEventKind identifies the type of event within a tick.
|
|||
type TickEventKind string |
|||
|
|||
const ( |
|||
EventDeliveryAccepted TickEventKind = "delivery_accepted" |
|||
EventDeliveryRejected TickEventKind = "delivery_rejected" |
|||
EventTimeoutFired TickEventKind = "timeout_fired" |
|||
EventTimeoutIgnored TickEventKind = "timeout_ignored" |
|||
EventTimeoutCancelled TickEventKind = "timeout_cancelled" |
|||
) |
|||
|
|||
// TickEvent records a single event within a tick, in processing order.
|
|||
type TickEvent struct { |
|||
Tick uint64 |
|||
Kind TickEventKind |
|||
Detail string |
|||
} |
|||
|
|||
// logEvent appends a tick event to the cluster's event log.
|
|||
func (c *Cluster) logEvent(kind TickEventKind, detail string) { |
|||
c.TickLog = append(c.TickLog, TickEvent{Tick: c.Now, Kind: kind, Detail: detail}) |
|||
} |
|||
|
|||
// TickEventsAt returns all events recorded at a specific tick.
|
|||
func (c *Cluster) TickEventsAt(tick uint64) []TickEvent { |
|||
var events []TickEvent |
|||
for _, e := range c.TickLog { |
|||
if e.Tick == tick { |
|||
events = append(events, e) |
|||
} |
|||
} |
|||
return events |
|||
} |
|||
|
|||
// === Trace infrastructure ===
|
|||
|
|||
// Trace captures a snapshot of cluster state for debugging failed scenarios.
|
|||
// Reusable across test files and future replay/debug tooling.
|
|||
type Trace struct { |
|||
Tick uint64 |
|||
CommittedLSN uint64 |
|||
PrimaryID string |
|||
Epoch uint64 |
|||
NodeStates map[string]string |
|||
FiredTimeouts []string |
|||
IgnoredTimeouts []string |
|||
TickEvents []TickEvent // full ordered event log
|
|||
Deliveries int |
|||
Rejections int |
|||
QueueDepth int |
|||
} |
|||
|
|||
// BuildTrace captures the current cluster state as a debuggable trace.
|
|||
func BuildTrace(c *Cluster) Trace { |
|||
tr := Trace{ |
|||
Tick: c.Now, |
|||
CommittedLSN: c.Coordinator.CommittedLSN, |
|||
PrimaryID: c.Coordinator.PrimaryID, |
|||
Epoch: c.Coordinator.Epoch, |
|||
NodeStates: map[string]string{}, |
|||
TickEvents: c.TickLog, |
|||
Deliveries: len(c.Deliveries), |
|||
Rejections: len(c.Rejected), |
|||
QueueDepth: len(c.Queue), |
|||
} |
|||
for id, n := range c.Nodes { |
|||
tr.NodeStates[id] = fmt.Sprintf("role=%s state=%s epoch=%d flushed=%d running=%v", |
|||
n.Role, n.ReplicaState, n.Epoch, n.Storage.FlushedLSN, n.Running) |
|||
} |
|||
for _, ft := range c.FiredTimeouts { |
|||
tr.FiredTimeouts = append(tr.FiredTimeouts, |
|||
fmt.Sprintf("%s replica=%s lsn=%d fired_at=%d", ft.Kind, ft.ReplicaID, ft.LSN, ft.FiredAt)) |
|||
} |
|||
for _, it := range c.IgnoredTimeouts { |
|||
tr.IgnoredTimeouts = append(tr.IgnoredTimeouts, |
|||
fmt.Sprintf("%s replica=%s lsn=%d stale_at=%d", it.Kind, it.ReplicaID, it.LSN, it.FiredAt)) |
|||
} |
|||
return tr |
|||
} |
|||
|
|||
// === Query helpers ===
|
|||
|
|||
// FiredTimeoutsByKind returns the count of fired timeouts of a specific kind.
|
|||
func (c *Cluster) FiredTimeoutsByKind(kind TimeoutKind) int { |
|||
count := 0 |
|||
for _, ft := range c.FiredTimeouts { |
|||
if ft.Kind == kind { |
|||
count++ |
|||
} |
|||
} |
|||
return count |
|||
} |
|||
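The "data before timers" rule in `fireTimeouts` — acks delivered in the same tick get to cancel a timeout before its deadline is checked — can be shown in a minimal standalone sketch. All names here (`pending`, `tick`) are illustrative stand-ins, not the distsim API:

```go
package main

import "fmt"

// pending is a simplified stand-in for a registered timeout.
type pending struct {
	replica   string
	deadline  uint64
	cancelled bool
}

// tick processes one simulated tick: data events (acks) first, then timers.
// An ack arriving at the same tick as a deadline therefore cancels the
// timeout before it can fire — the same-tick race always resolves to data.
func tick(now uint64, acks []string, timeouts []pending) (fired []string, remaining []pending) {
	for _, a := range acks { // 1. deliver data events first
		for i := range timeouts {
			if timeouts[i].replica == a {
				timeouts[i].cancelled = true
			}
		}
	}
	for _, t := range timeouts { // 2. only then check deadlines
		switch {
		case t.cancelled:
			// dropped: cancelled timeouts never fire
		case now >= t.deadline:
			fired = append(fired, t.replica)
		default:
			remaining = append(remaining, t)
		}
	}
	return fired, remaining
}

func main() {
	timeouts := []pending{{replica: "r1", deadline: 5}, {replica: "r2", deadline: 5}}
	// r1's ack lands on the same tick its timeout would fire: the ack wins.
	fired, remaining := tick(5, []string{"r1"}, timeouts)
	fmt.Println(fired, len(remaining)) // [r2] 0
}
```

Processing timers before data instead would flip this outcome and regress r1 despite a successful ack, which is exactly the race class the eventsim layer exists to catch.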
@@ -0,0 +1,213 @@
package distsim

import (
	"testing"
)

// ============================================================
// Phase 02: Item 4 — Smart WAL recovery-class transitions
// ============================================================

// Test: recovery starts with resolvable ExtentReferenced records,
// then a payload becomes unresolvable during active recovery.
// Protocol must detect the transition and abort to NeedsRebuild.

func TestP02_SmartWAL_RecoverableThenUnrecoverable(t *testing.T) {
	// Build recovery records: first 3 WALInline, then 2 ExtentReferenced.
	records := []RecoveryRecord{
		{Write: Write{LSN: 1, Block: 1, Value: 1}, Class: RecoveryClassWALInline},
		{Write: Write{LSN: 2, Block: 2, Value: 2}, Class: RecoveryClassWALInline},
		{Write: Write{LSN: 3, Block: 3, Value: 3}, Class: RecoveryClassWALInline},
		{Write: Write{LSN: 4, Block: 4, Value: 4}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
		{Write: Write{LSN: 5, Block: 5, Value: 5}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
	}

	// Initially fully recoverable.
	if !FullyRecoverable(records) {
		t.Fatal("initial records should be fully recoverable")
	}

	// Simulate payload becoming unresolvable (e.g., extent generation GC'd).
	records[4].PayloadResolvable = false

	// Now NOT recoverable — must detect and abort.
	if FullyRecoverable(records) {
		t.Fatal("after payload loss, records should NOT be recoverable")
	}

	// Apply only the recoverable prefix.
	state := ApplyRecoveryRecords(records[:4], 0, 4) // only first 4
	if state[4] != 4 {
		t.Fatalf("partial apply: block 4 should be 4, got %d", state[4])
	}
	if _, has5 := state[5]; has5 {
		t.Fatal("block 5 should NOT be in partial state — payload was lost")
	}
}

func TestP02_SmartWAL_MixedClassRecovery_FullSuccess(t *testing.T) {
	records := []RecoveryRecord{
		{Write: Write{LSN: 1, Block: 0, Value: 10}, Class: RecoveryClassWALInline},
		{Write: Write{LSN: 2, Block: 1, Value: 20}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
		{Write: Write{LSN: 3, Block: 0, Value: 30}, Class: RecoveryClassWALInline},
		{Write: Write{LSN: 4, Block: 2, Value: 40}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
	}

	if !FullyRecoverable(records) {
		t.Fatal("all resolvable — should be recoverable")
	}

	state := ApplyRecoveryRecords(records, 0, 4)
	// Block 0 overwritten: 10 then 30.
	if state[0] != 30 {
		t.Fatalf("block 0: got %d, want 30", state[0])
	}
	if state[1] != 20 {
		t.Fatalf("block 1: got %d, want 20", state[1])
	}
	if state[2] != 40 {
		t.Fatalf("block 2: got %d, want 40", state[2])
	}
}

func TestP02_SmartWAL_TimeVaryingAvailability(t *testing.T) {
	// Simulate time-varying payload availability:
	// At time T1, all records are recoverable.
	// At time T2, one becomes unrecoverable.
	// At time T3, it becomes recoverable again (re-pinned).

	records := []RecoveryRecord{
		{Write: Write{LSN: 1, Block: 0, Value: 1}, Class: RecoveryClassWALInline},
		{Write: Write{LSN: 2, Block: 1, Value: 2}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
		{Write: Write{LSN: 3, Block: 2, Value: 3}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true},
	}

	// T1: all recoverable.
	if !FullyRecoverable(records) {
		t.Fatal("T1: should be recoverable")
	}

	// T2: payload for LSN 2 lost.
	records[1].PayloadResolvable = false
	if FullyRecoverable(records) {
		t.Fatal("T2: should NOT be recoverable after payload loss")
	}

	// T3: payload re-pinned (e.g., operator restores snapshot).
	records[1].PayloadResolvable = true
	if !FullyRecoverable(records) {
		t.Fatal("T3: should be recoverable after re-pin")
	}
}

// ============================================================
// Phase 02: Item 5 — Strengthen S5 (flapping replica)
// ============================================================

// S5 strengthened: repeated disconnect/reconnect with catch-up
// state tracking. If flapping exceeds budget, escalate to NeedsRebuild.

func TestP02_S5_FlappingWithStateTracking(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.MaxCatchupAttempts = 10 // generous for flapping

	// Initial writes.
	c.CommitWrite(1)
	c.CommitWrite(2)
	c.TickN(5)

	r1 := c.Nodes["r1"]

	// 5 flapping cycles — each creates a small gap then catches up.
	for cycle := 0; cycle < 5; cycle++ {
		c.Disconnect("p", "r1")
		c.Disconnect("r1", "p")

		c.CommitWrite(uint64(3 + cycle*2))
		c.CommitWrite(uint64(4 + cycle*2))
		c.TickN(3)

		c.Connect("p", "r1")
		c.Connect("r1", "p")

		r1.ReplicaState = NodeStateCatchingUp
		converged := c.CatchUpWithEscalation("r1", 100)
		if !converged {
			t.Fatalf("cycle %d: catch-up should converge for small gap", cycle)
		}
		if r1.ReplicaState != NodeStateInSync {
			t.Fatalf("cycle %d: expected InSync, got %s", cycle, r1.ReplicaState)
		}
	}

	// After 5 successful flaps, CatchupAttempts should be 0 (reset on success).
	if r1.CatchupAttempts != 0 {
		t.Fatalf("CatchupAttempts should be 0 after successful catch-ups, got %d", r1.CatchupAttempts)
	}

	// No unnecessary rebuild — r1 should NOT have a base snapshot.
	if r1.Storage.BaseSnapshot != nil {
		t.Fatal("flapping replica should not have been rebuilt — only WAL catch-up")
	}

	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatal(err)
	}
}

func TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.MaxCatchupAttempts = 3 // tight budget

	c.CommitWrite(1)
	c.TickN(5)

	r1 := c.Nodes["r1"]

	// Each flap creates a gap, but primary writes a LOT during disconnect.
	// Catch-up recovers only 1 entry per attempt. After MaxCatchupAttempts
	// non-convergent attempts, escalate.
	for cycle := 0; cycle < 5; cycle++ {
		c.Disconnect("p", "r1")
		c.Disconnect("r1", "p")

		// Large writes during disconnect.
		for w := 0; w < 30; w++ {
			c.CommitWrite(uint64(cycle*30+w+2) % 8)
		}
		c.TickN(3)

		c.Connect("p", "r1")
		c.Connect("r1", "p")

		r1.ReplicaState = NodeStateCatchingUp

		// Try catch-up with small batch — will not converge.
		for attempt := 0; attempt < 5; attempt++ {
			c.Disconnect("p", "r1")
			c.Disconnect("r1", "p")
			for w := 0; w < 10; w++ {
				c.CommitWrite(uint64(200+cycle*50+attempt*10+w) % 8)
			}
			c.TickN(2)
			c.Connect("p", "r1")
			c.Connect("r1", "p")

			c.CatchUpWithEscalation("r1", 1)

			if r1.ReplicaState == NodeStateNeedsRebuild {
				t.Logf("flapping escalated to NeedsRebuild at cycle %d, attempt %d", cycle, attempt)
				// Verify: NeedsRebuild is sticky.
				c.CatchUpWithEscalation("r1", 100)
				if r1.ReplicaState != NodeStateNeedsRebuild {
					t.Fatal("NeedsRebuild should be sticky — catch-up should not reset it")
				}
				return
			}
		}
	}

	// If we got here, the budget wasn't reached. That's wrong.
	t.Fatalf("expected NeedsRebuild escalation, but state is %s with %d attempts",
		r1.ReplicaState, r1.CatchupAttempts)
}
@@ -0,0 +1,445 @@
package distsim

import (
	"testing"
)

// ============================================================
// Phase 02: Coordinator candidate-selection tests
// Verifies promotion ranking under mixed replica states.
// ============================================================

func TestP02_CandidateSelection_AllEqual_AlphabeticalTieBreak(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
	c.CommitWrite(1)
	c.TickN(5)

	// All replicas InSync with same FlushedLSN → alphabetical tie-break.
	best := c.BestPromotionCandidate()
	if best != "r1" {
		t.Fatalf("all equal: expected r1 (alphabetical), got %q", best)
	}

	candidates := c.PromotionCandidates()
	if len(candidates) != 3 {
		t.Fatalf("expected 3 candidates, got %d", len(candidates))
	}
	for i, exp := range []string{"r1", "r2", "r3"} {
		if candidates[i].ID != exp {
			t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp)
		}
	}
}

func TestP02_CandidateSelection_HigherLSN_Wins(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")

	// Directly set FlushedLSN to simulate different progress.
	// All InSync — higher LSN wins.
	for _, id := range []string{"r1", "r2", "r3"} {
		c.Nodes[id].ReplicaState = NodeStateInSync
	}
	c.Nodes["r1"].Storage.FlushedLSN = 10
	c.Nodes["r2"].Storage.FlushedLSN = 20
	c.Nodes["r3"].Storage.FlushedLSN = 15

	best := c.BestPromotionCandidate()
	if best != "r2" {
		t.Fatalf("higher LSN: expected r2, got %q", best)
	}

	candidates := c.PromotionCandidates()
	if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" {
		t.Fatalf("order: got [%s, %s, %s], want [r2, r3, r1]",
			candidates[0].ID, candidates[1].ID, candidates[2].ID)
	}
}

func TestP02_CandidateSelection_StoppedNode_Excluded(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.Nodes["r1"].Storage.FlushedLSN = 100
	c.Nodes["r2"].Storage.FlushedLSN = 50
	c.StopNode("r1") // highest LSN but stopped

	best := c.BestPromotionCandidate()
	if best != "r2" {
		t.Fatalf("stopped excluded: expected r2, got %q", best)
	}

	// r1 should be last in ranking (not running).
	candidates := c.PromotionCandidates()
	if candidates[0].ID != "r2" {
		t.Fatalf("first candidate should be r2, got %s", candidates[0].ID)
	}
	if candidates[1].Running {
		t.Fatal("r1 should be marked not running")
	}
}

func TestP02_CandidateSelection_InSync_Beats_CatchingUp(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")

	// r1: CatchingUp with highest LSN.
	c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
	c.Nodes["r1"].Storage.FlushedLSN = 100

	// r2: InSync with lower LSN.
	c.Nodes["r2"].ReplicaState = NodeStateInSync
	c.Nodes["r2"].Storage.FlushedLSN = 50

	// r3: InSync with even lower LSN.
	c.Nodes["r3"].ReplicaState = NodeStateInSync
	c.Nodes["r3"].Storage.FlushedLSN = 40

	// InSync with lower LSN beats CatchingUp with higher LSN.
	best := c.BestPromotionCandidate()
	if best != "r2" {
		t.Fatalf("InSync beats CatchingUp: expected r2, got %q", best)
	}

	candidates := c.PromotionCandidates()
	// r2 (InSync, 50), r3 (InSync, 40), r1 (CatchingUp, 100)
	if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" {
		t.Fatalf("order: got [%s, %s, %s]", candidates[0].ID, candidates[1].ID, candidates[2].ID)
	}
}

func TestP02_CandidateSelection_AllCatchingUp_HighestLSN_Wins(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")

	for _, id := range []string{"r1", "r2", "r3"} {
		c.Nodes[id].ReplicaState = NodeStateCatchingUp
	}
	c.Nodes["r1"].Storage.FlushedLSN = 30
	c.Nodes["r2"].Storage.FlushedLSN = 80
	c.Nodes["r3"].Storage.FlushedLSN = 50

	best := c.BestPromotionCandidate()
	if best != "r2" {
		t.Fatalf("all CatchingUp: expected r2 (highest LSN), got %q", best)
	}
}

func TestP02_CandidateSelection_NeedsRebuild_Skipped(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")

	// r1: NeedsRebuild with highest LSN.
	c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild
	c.Nodes["r1"].Storage.FlushedLSN = 100

	// r2: InSync with moderate LSN.
	c.Nodes["r2"].ReplicaState = NodeStateInSync
	c.Nodes["r2"].Storage.FlushedLSN = 50

	// r3: CatchingUp with low LSN.
	c.Nodes["r3"].ReplicaState = NodeStateCatchingUp
	c.Nodes["r3"].Storage.FlushedLSN = 20

	best := c.BestPromotionCandidate()
	if best != "r2" {
		t.Fatalf("NeedsRebuild skipped: expected r2, got %q", best)
	}

	candidates := c.PromotionCandidates()
	// r2 (InSync, 50), r3 (CatchingUp, 20), r1 (NeedsRebuild, 100)
	if candidates[0].ID != "r2" {
		t.Fatalf("first should be r2, got %s", candidates[0].ID)
	}
	if candidates[2].ID != "r1" {
		t.Fatalf("last should be r1 (NeedsRebuild), got %s", candidates[2].ID)
	}
}

func TestP02_CandidateSelection_NoRunning_ReturnsEmpty(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.StopNode("r1")
	c.StopNode("r2")

	best := c.BestPromotionCandidate()
	if best != "" {
		t.Fatalf("no running: expected empty, got %q", best)
	}
}

func TestP02_CandidateSelection_AfterPartition_RankingUpdates(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")

	c.CommitWrite(1)
	c.CommitWrite(2)
	c.TickN(5)

	// All InSync at FlushedLSN=2. Best = r1 (alphabetical).
	if best := c.BestPromotionCandidate(); best != "r1" {
		t.Fatalf("before partition: expected r1, got %q", best)
	}

	// Partition r1. Write more via p+r2+r3.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	// With 4 members (p, r1, r2, r3), quorum = 4/2+1 = 3, and p+r2+r3 = 3,
	// so commits still succeed — but with no slack.
	c.CommitWrite(3)
	c.CommitWrite(4)
	c.CommitWrite(5)
	c.TickN(5)

	// r1 lagging, r2/r3 ahead.
	c.Nodes["r1"].ReplicaState = NodeStateCatchingUp

	// Now r2 or r3 should win (both InSync with higher LSN).
	best := c.BestPromotionCandidate()
	if best == "r1" {
		t.Fatal("after partition: r1 should not be best (CatchingUp)")
	}
	if best != "r2" {
		t.Fatalf("after partition: expected r2 (InSync, alphabetical tie-break), got %q", best)
	}
	t.Logf("after partition: best=%s", best)
}

func TestP02_CandidateSelection_MixedStates_FullRanking(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4", "r5")

	// Set up a diverse state mix:
	// r1: InSync, LSN=50
	// r2: InSync, LSN=60 (highest InSync)
	// r3: CatchingUp, LSN=80 (highest overall but CatchingUp)
	// r4: NeedsRebuild, LSN=90 (highest but NeedsRebuild)
	// r5: stopped, LSN=100 (highest but not running)
	c.Nodes["r1"].ReplicaState = NodeStateInSync
	c.Nodes["r1"].Storage.FlushedLSN = 50
	c.Nodes["r2"].ReplicaState = NodeStateInSync
	c.Nodes["r2"].Storage.FlushedLSN = 60
	c.Nodes["r3"].ReplicaState = NodeStateCatchingUp
	c.Nodes["r3"].Storage.FlushedLSN = 80
	c.Nodes["r4"].ReplicaState = NodeStateNeedsRebuild
	c.Nodes["r4"].Storage.FlushedLSN = 90
	c.Nodes["r5"].Storage.FlushedLSN = 100
	c.StopNode("r5")

	best := c.BestPromotionCandidate()
	if best != "r2" {
		t.Fatalf("mixed states: expected r2 (InSync+highest among InSync), got %q", best)
	}

	candidates := c.PromotionCandidates()
	// Expected order: r2(InSync,60), r1(InSync,50), r3(CatchingUp,80),
	// r4(NeedsRebuild,90), r5(stopped,100)
	expected := []string{"r2", "r1", "r3", "r4", "r5"}
	for i, exp := range expected {
		if candidates[i].ID != exp {
			t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp)
		}
	}
	t.Logf("full ranking: %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d)",
		candidates[0].ID, candidates[0].State, candidates[0].FlushedLSN,
		candidates[1].ID, candidates[1].State, candidates[1].FlushedLSN,
		candidates[2].ID, candidates[2].State, candidates[2].FlushedLSN,
		candidates[3].ID, candidates[3].State, candidates[3].FlushedLSN,
		candidates[4].ID, candidates[4].State, candidates[4].FlushedLSN)
}

func TestP02_CandidateSelection_AllNeedsRebuild_SafeDefaultEmpty(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild
	c.Nodes["r1"].Storage.FlushedLSN = 50
	c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild
	c.Nodes["r2"].Storage.FlushedLSN = 80

	// Safe default: refuses NeedsRebuild candidates.
	safe := c.BestPromotionCandidate()
	if safe != "" {
		t.Fatalf("safe default should return empty for all-NeedsRebuild, got %q", safe)
	}
}

func TestP02_CandidateSelection_DesperationPromotion_ExplicitAPI(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")
	for _, id := range []string{"r1", "r2", "r3"} {
		c.Nodes[id].ReplicaState = NodeStateNeedsRebuild
	}
	c.Nodes["r1"].Storage.FlushedLSN = 10
	c.Nodes["r2"].Storage.FlushedLSN = 30
	c.Nodes["r3"].Storage.FlushedLSN = 20

	safe := c.BestPromotionCandidate()
	if safe != "" {
		t.Fatalf("safe default should return empty, got %q", safe)
	}

	desperate := c.BestPromotionCandidateDesperate()
	if desperate != "r2" {
		t.Fatalf("desperation: expected r2 (highest LSN), got %q", desperate)
	}
}

// === Candidate eligibility tests ===

func TestP02_CandidateEligibility_Running(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.CommitWrite(1)
	c.TickN(5)

	e := c.EvaluateCandidateEligibility("r1")
	if !e.Eligible {
		t.Fatalf("running InSync replica should be eligible, reasons: %v", e.Reasons)
	}

	c.StopNode("r1")
	e = c.EvaluateCandidateEligibility("r1")
	if e.Eligible {
		t.Fatal("stopped replica should not be eligible")
	}
	if e.Reasons[0] != "not_running" {
		t.Fatalf("expected not_running reason, got %v", e.Reasons)
	}
}

func TestP02_CandidateEligibility_EpochAlignment(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.CommitWrite(1)
	c.TickN(5)

	// Manually desync r1's epoch.
	c.Nodes["r1"].Epoch = c.Coordinator.Epoch - 1

	e := c.EvaluateCandidateEligibility("r1")
	if e.Eligible {
		t.Fatal("epoch-misaligned replica should not be eligible")
	}
	found := false
	for _, r := range e.Reasons {
		if r == "epoch_misaligned" {
			found = true
		}
	}
	if !found {
		t.Fatalf("expected epoch_misaligned reason, got %v", e.Reasons)
	}
}

func TestP02_CandidateEligibility_StateIneligible(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.CommitWrite(1)
	c.TickN(5)

	for _, state := range []ReplicaNodeState{NodeStateNeedsRebuild, NodeStateRebuilding} {
		c.Nodes["r1"].ReplicaState = state
		e := c.EvaluateCandidateEligibility("r1")
		if e.Eligible {
			t.Fatalf("%s should not be eligible", state)
		}
	}

	// CatchingUp IS eligible (data may be mostly current).
	c.Nodes["r1"].ReplicaState = NodeStateCatchingUp
	e := c.EvaluateCandidateEligibility("r1")
	if !e.Eligible {
		t.Fatalf("CatchingUp should be eligible, reasons: %v", e.Reasons)
	}
}

func TestP02_CandidateEligibility_InsufficientCommittedPrefix(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.CommitWrite(1)
	c.TickN(5)

	// r1 has FlushedLSN=1, CommittedLSN=1 → eligible.
	e := c.EvaluateCandidateEligibility("r1")
	if !e.Eligible {
		t.Fatalf("r1 at committed prefix should be eligible, reasons: %v", e.Reasons)
	}

	// Manually set r1 behind committed prefix.
	c.Nodes["r1"].Storage.FlushedLSN = 0
	e = c.EvaluateCandidateEligibility("r1")
	if e.Eligible {
		t.Fatal("FlushedLSN=0 with CommittedLSN=1 should not be eligible")
	}
	found := false
	for _, r := range e.Reasons {
		if r == "insufficient_committed_prefix" {
			found = true
		}
	}
	if !found {
		t.Fatalf("expected insufficient_committed_prefix reason, got %v", e.Reasons)
	}
}

func TestP02_CandidateEligibility_InSyncButLagging_Rejected(t *testing.T) {
	// Scenario from finding: r1 is InSync with correct epoch but FlushedLSN << CommittedLSN.
	// r2 is CatchingUp but has the committed prefix. r2 should be selected over r1.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3")

	// Set committed prefix high.
	c.Coordinator.CommittedLSN = 100

	// r1: InSync, correct epoch, but FlushedLSN=1. Ineligible.
	c.Nodes["r1"].ReplicaState = NodeStateInSync
	c.Nodes["r1"].Storage.FlushedLSN = 1

	// r2: CatchingUp, correct epoch, FlushedLSN=100. Eligible.
	c.Nodes["r2"].ReplicaState = NodeStateCatchingUp
	c.Nodes["r2"].Storage.FlushedLSN = 100

	// r3: InSync, correct epoch, FlushedLSN=100. Eligible.
	c.Nodes["r3"].ReplicaState = NodeStateInSync
	c.Nodes["r3"].Storage.FlushedLSN = 100

	// r1 is ineligible despite being InSync.
	e1 := c.EvaluateCandidateEligibility("r1")
	if e1.Eligible {
		t.Fatal("r1 (InSync, FlushedLSN=1, CommittedLSN=100) should be ineligible")
	}

	// r2 and r3 are eligible.
	e2 := c.EvaluateCandidateEligibility("r2")
	if !e2.Eligible {
		t.Fatalf("r2 should be eligible, reasons: %v", e2.Reasons)
	}

	// BestPromotionCandidate should pick r3 (InSync with prefix) over r2 (CatchingUp).
	best := c.BestPromotionCandidate()
	if best != "r3" {
		t.Fatalf("expected r3 (InSync+prefix), got %q", best)
	}

	// r1 must NOT be in the eligible list at all.
	eligible := c.EligiblePromotionCandidates()
	for _, pc := range eligible {
		if pc.ID == "r1" {
			t.Fatal("r1 should not appear in eligible candidates")
		}
	}
	t.Logf("committed-prefix gate: r1(InSync/flushed=1) rejected, r3(InSync/flushed=100) selected")
}

func TestP02_CandidateEligibility_EligiblePromotionCandidates(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
	c.CommitWrite(1)
	c.TickN(5)

	// r1: InSync, eligible
	// r2: NeedsRebuild, ineligible
	c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild
	// r3: stopped, ineligible
	c.StopNode("r3")
	// r4: epoch misaligned, ineligible
	c.Nodes["r4"].Epoch = 0

	eligible := c.EligiblePromotionCandidates()
	if len(eligible) != 1 {
		t.Fatalf("expected 1 eligible candidate, got %d", len(eligible))
	}
	if eligible[0].ID != "r1" {
		t.Fatalf("expected r1 as only eligible, got %s", eligible[0].ID)
	}

	// BestPromotionCandidate uses eligibility.
	best := c.BestPromotionCandidate()
	if best != "r1" {
		t.Fatalf("BestPromotionCandidate should return r1, got %q", best)
	}
}
@ -0,0 +1,371 @@
package distsim

import (
	"testing"
)

// ============================================================
// Phase 02: Delayed/drop network + multi-node reservation expiry
// ============================================================

// --- Item 4: Stale delayed messages after heal/promote ---

// Scenario: messages from old primary are in-flight when partition heals
// and a new primary is promoted. The stale messages arrive AFTER the
// promotion. They must be rejected by epoch fencing.

func TestP02_DelayedStaleMessages_AfterPromote(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "A", "B", "C")

	// Phase 1: A writes, ships to B and C.
	c.CommitWrite(1)
	c.CommitWrite(2)
	c.TickN(5)

	// Phase 2: A writes more, but we manually enqueue delayed delivery
	// to simulate in-flight messages when partition happens.
	c.CommitWrite(3) // LSN 3 ships normally
	// Don't tick yet — messages are in the queue.

	// Phase 3: Partition A from everyone, promote B.
	c.Disconnect("A", "B")
	c.Disconnect("B", "A")
	c.Disconnect("A", "C")
	c.Disconnect("C", "A")
	c.StopNode("A")
	c.Promote("B")

	// Phase 4: Manually inject stale messages as if they were delayed in the network.
	// These represent A's write(3) + barrier(3) that were in-flight when A crashed.
	staleEpoch := c.Coordinator.Epoch - 1
	c.InjectMessage(Message{
		Kind: MsgWrite, From: "A", To: "B", Epoch: staleEpoch,
		Write: Write{LSN: 3, Block: 3, Value: 3},
	}, c.Now+1)
	c.InjectMessage(Message{
		Kind: MsgBarrier, From: "A", To: "B", Epoch: staleEpoch,
		TargetLSN: 3,
	}, c.Now+2)
	c.InjectMessage(Message{
		Kind: MsgWrite, From: "A", To: "C", Epoch: staleEpoch,
		Write: Write{LSN: 3, Block: 3, Value: 3},
	}, c.Now+1)

	// Phase 5: Tick to deliver stale messages.
	committedBefore := c.Coordinator.CommittedLSN
	c.TickN(5)

	// All stale messages must be rejected — either by epoch fencing or node-down.
	epochRejects := c.RejectedByReason(RejectEpochMismatch)
	nodeDownRejects := c.RejectedByReason(RejectNodeDown)
	totalRejects := epochRejects + nodeDownRejects
	if totalRejects == 0 {
		t.Fatal("stale delayed messages were not rejected")
	}

	// Committed prefix must not change from stale messages.
	if c.Coordinator.CommittedLSN != committedBefore {
		t.Fatalf("stale delayed messages changed committed prefix: before=%d after=%d",
			committedBefore, c.Coordinator.CommittedLSN)
	}

	// Data correct on new primary.
	if err := c.AssertCommittedRecoverable("B"); err != nil {
		t.Fatalf("data incorrect after stale delayed messages: %v", err)
	}
	t.Logf("stale delayed messages: %d rejected by epoch_mismatch, %d by node_down",
		epochRejects, nodeDownRejects)
}

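The rejection order the test above relies on can be sketched in isolation. This is a minimal, self-contained sketch with hypothetical types and a hypothetical `fenceDecision` function, not the simulator's actual code: a message to a stopped node is dropped as node_down; otherwise a message carrying an epoch older than the receiver's is dropped as epoch_mismatch.

```go
package main

import "fmt"

// msg is a hypothetical stand-in for the fields the fencing check inspects.
type msg struct {
	epoch uint64
	toUp  bool // is the destination node running?
}

// fenceDecision mirrors the two rejection reasons asserted on above:
// node_down takes precedence, then epoch fencing, then acceptance.
func fenceDecision(m msg, nodeEpoch uint64) string {
	if !m.toUp {
		return "reject:node_down"
	}
	if m.epoch < nodeEpoch {
		return "reject:epoch_mismatch"
	}
	return "accept"
}

func main() {
	fmt.Println(fenceDecision(msg{epoch: 1, toUp: true}, 2))  // stale epoch → reject:epoch_mismatch
	fmt.Println(fenceDecision(msg{epoch: 2, toUp: false}, 2)) // dead node → reject:node_down
	fmt.Println(fenceDecision(msg{epoch: 2, toUp: true}, 2))  // current epoch → accept
}
```

Either rejection reason keeps the committed prefix safe; the tests accept both because a crashed old primary and a fenced old epoch usually coincide.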
// Scenario: old barrier ACK arrives after promotion with long delay.
// This is different from S18 — the delay is network-level, not restart-level.

func TestP02_DelayedBarrierAck_LongNetworkDelay(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r1")

	c.CommitWrite(1)
	c.TickN(5)

	// Write 2 — barrier sent to r1.
	c.CommitWrite(2)
	c.TickN(2) // barrier in flight

	// Promote r1 (simulate primary failure + promotion).
	c.StopNode("p")
	c.Promote("r1")

	committedBefore := c.Coordinator.CommittedLSN

	// Long-delayed barrier ack from r1 → dead primary p.
	c.InjectMessage(Message{
		Kind: MsgBarrierAck, From: "r1", To: "p",
		Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2,
	}, c.Now+10)

	c.TickN(15)

	// Must be rejected — p is dead and epoch is stale.
	nodeDownRejects := c.RejectedByReason(RejectNodeDown)
	epochRejects := c.RejectedByReason(RejectEpochMismatch)
	if nodeDownRejects == 0 && epochRejects == 0 {
		t.Fatal("delayed barrier ack should be rejected (node down or epoch mismatch)")
	}

	// Stale ack must not advance committed prefix.
	if c.Coordinator.CommittedLSN != committedBefore {
		t.Fatalf("stale ack changed committed prefix: before=%d after=%d",
			committedBefore, c.Coordinator.CommittedLSN)
	}
}

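The `InjectMessage(msg, c.Now+10)` calls above imply deliver-at-tick semantics: an injected message sits in the network queue until the cluster clock reaches its delivery time. A minimal sketch of that mechanic, with hypothetical names (`inFlight`, `net`), not the simulator's actual queue:

```go
package main

import "fmt"

// inFlight pairs a payload with the tick at which it becomes deliverable.
type inFlight struct {
	payload   string
	deliverAt int
}

// net is a toy deliver-at-tick network: tick() advances the clock and
// releases every message whose delivery time has arrived.
type net struct {
	now   int
	queue []inFlight
}

func (n *net) tick() []string {
	n.now++
	var delivered []string
	var kept []inFlight
	for _, m := range n.queue {
		if m.deliverAt <= n.now {
			delivered = append(delivered, m.payload)
		} else {
			kept = append(kept, m)
		}
	}
	n.queue = kept
	return delivered
}

func main() {
	n := &net{}
	n.queue = append(n.queue, inFlight{"stale-ack", n.now + 2})
	fmt.Println(n.tick()) // too early: nothing delivered
	fmt.Println(n.tick()) // clock reaches deliverAt: stale-ack released
}
```

This is why the test ticks 15 times after injecting at `c.Now+10`: the stale ack must actually be released and rejected, not merely left undelivered.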
// Scenario: write ships to replica, network drops the write but delivers
// the barrier. Barrier should time out or detect missing data.

func TestP02_DroppedWrite_BarrierDelivered_Stalls(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r1")

	c.CommitWrite(1)
	c.TickN(5)

	// Write 2 — but drop the write message to r1 (link down for data only).
	// We simulate by writing but not ticking, then dropping queued writes.
	c.CommitWrite(2) // enqueues write(2) + barrier(2) to r1

	// Remove only the write message from the queue (simulate selective drop).
	var kept []inFlightMessage
	for _, item := range c.Queue {
		if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 2 {
			continue // drop this write
		}
		kept = append(kept, item)
	}
	c.Queue = kept

	// Tick — barrier arrives at r1 but r1 doesn't have LSN 2.
	// Barrier should re-queue (waiting for data).
	c.TickN(10)

	// Assert 1: sync_all blocked — CommittedLSN stuck at 1.
	if c.Coordinator.CommittedLSN != 1 {
		t.Fatalf("sync_all should be blocked at LSN 1, got committed=%d", c.Coordinator.CommittedLSN)
	}

	// Assert 2: LSN 2 is pending but NOT committed.
	p2 := c.Pending[2]
	if p2 == nil {
		t.Fatal("LSN 2 should be pending")
	}
	if p2.Committed {
		t.Fatal("LSN 2 committed under sync_all but r1 never received the write — safety violation")
	}

	// Assert 3: barrier still re-queuing — stall proven positively.
	barrierRequeued := false
	for _, item := range c.Queue {
		if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 {
			barrierRequeued = true
			break
		}
	}
	if !barrierRequeued {
		t.Fatal("barrier for LSN 2 should still be re-queuing — stall not proven")
	}
	t.Logf("dropped write stall proven: committed=%d, pending[2].committed=%v, barrier re-queuing=%v",
		c.Coordinator.CommittedLSN, p2.Committed, barrierRequeued)
}

// --- Item 5: Multi-node reservation expiry / rebuild timeout ---

// Scenario: RF=3 cluster. Two replicas need catch-up. One's reservation
// expires during recovery. Must handle correctly: one rebuilds, one catches up.

func TestP02_MultiNode_ReservationExpiry_MixedOutcome(t *testing.T) {
	// 5 nodes: p+r3+r4 provide quorum (3 of 5) while r1+r2 are disconnected.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")

	// Write initial data.
	for i := uint64(1); i <= 10; i++ {
		c.CommitWrite(i % 4)
	}
	c.TickN(5)

	// Take snapshot for rebuild.
	c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN)

	// r1+r2 disconnect. r3+r4 stay for quorum (p+r3+r4 = 3 of 5).
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.Disconnect("p", "r2")
	c.Disconnect("r2", "p")

	// Write more during disconnect — committed via p+r3+r4 quorum.
	for i := uint64(11); i <= 30; i++ {
		c.CommitWrite(i % 4)
	}
	c.TickN(5)

	// Reconnect both.
	c.Connect("p", "r1")
	c.Connect("r1", "p")
	c.Connect("p", "r2")
	c.Connect("r2", "p")

	// r1: reserved catch-up with tight expiry — MUST expire.
	// 20 entries to replay, but only 2 ticks of budget.
	r1 := c.Nodes["r1"]
	shortExpiry := c.Now + 2
	err := c.RecoverReplicaFromPrimaryReserved("r1", r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN, shortExpiry)
	if err == nil {
		t.Fatal("r1 reservation must expire — 20 entries with 2-tick budget")
	}
	r1.ReplicaState = NodeStateNeedsRebuild
	t.Logf("r1 reservation expired: %v", err)

	// r2: full catch-up (no reservation pressure).
	r2 := c.Nodes["r2"]
	if err := c.RecoverReplicaFromPrimary("r2", r2.Storage.FlushedLSN, c.Coordinator.CommittedLSN); err != nil {
		t.Fatalf("r2 full catch-up failed: %v", err)
	}
	r2.ReplicaState = NodeStateInSync

	// Deterministic mixed outcome: r1=NeedsRebuild, r2=InSync.
	if r1.ReplicaState != NodeStateNeedsRebuild {
		t.Fatalf("r1 should be NeedsRebuild, got %s", r1.ReplicaState)
	}
	if r2.ReplicaState != NodeStateInSync {
		t.Fatalf("r2 should be InSync, got %s", r2.ReplicaState)
	}

	// r2 data correct.
	if err := c.AssertCommittedRecoverable("r2"); err != nil {
		t.Fatalf("r2 data incorrect: %v", err)
	}

	// r1 rebuild from snapshot.
	c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN)
	r1.ReplicaState = NodeStateInSync
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatalf("r1 data incorrect after rebuild: %v", err)
	}

	t.Logf("mixed outcome proven: r1=NeedsRebuild→rebuilt, r2=InSync")
}

// Scenario: all replicas need rebuild but only one snapshot exists.
// First replica rebuilds from snapshot, second must wait or use first
// replica as rebuild source.

func TestP02_MultiNode_AllNeedRebuild(t *testing.T) {
	// Use 5 nodes so quorum (3 of 5) can be met with p+r3+r4 while r1+r2 are down.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4")
	c.MaxCatchupAttempts = 2

	for i := uint64(1); i <= 5; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)

	c.Primary().Storage.TakeSnapshot("snap-all", c.Coordinator.CommittedLSN)

	// r1 and r2 disconnect. r3+r4 stay connected so quorum (p+r3+r4) can commit.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.Disconnect("p", "r2")
	c.Disconnect("r2", "p")
	for i := uint64(6); i <= 100; i++ {
		c.CommitWrite(i % 8)
	}
	c.TickN(5)

	// Try catch-up for r1 and r2 — both will escalate.
	// Pattern: write while target disconnected, then try partial catch-up.
	for _, id := range []string{"r1", "r2"} {
		n := c.Nodes[id]
		n.ReplicaState = NodeStateCatchingUp
		for attempt := 0; attempt < 5; attempt++ {
			// Write MORE while target is still disconnected (r3+r4 provide quorum).
			for w := 0; w < 20; w++ {
				c.CommitWrite(uint64(101+attempt*20+w) % 8)
			}
			c.TickN(3) // 3 ticks: deliver writes, barriers, then acks
			// Now try catch-up (partial, batch=1). Target stays disconnected —
			// RecoverReplicaFromPrimaryPartial reads directly from primary WAL.
			c.CatchUpWithEscalation(id, 1)
			if n.ReplicaState == NodeStateNeedsRebuild {
				break
			}
		}
	}

	// Reconnect all for rebuild.
	c.Connect("p", "r1")
	c.Connect("r1", "p")
	c.Connect("p", "r2")
	c.Connect("r2", "p")

	// Both should be NeedsRebuild.
	if c.Nodes["r1"].ReplicaState != NodeStateNeedsRebuild {
		t.Fatalf("r1: expected NeedsRebuild, got %s", c.Nodes["r1"].ReplicaState)
	}
	if c.Nodes["r2"].ReplicaState != NodeStateNeedsRebuild {
		t.Fatalf("r2: expected NeedsRebuild, got %s", c.Nodes["r2"].ReplicaState)
	}

	// Rebuild both from snapshot.
	c.RebuildReplicaFromSnapshot("r1", "snap-all", c.Coordinator.CommittedLSN)
	c.RebuildReplicaFromSnapshot("r2", "snap-all", c.Coordinator.CommittedLSN)
	c.Nodes["r1"].ReplicaState = NodeStateInSync
	c.Nodes["r2"].ReplicaState = NodeStateInSync

	// Both correct.
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatal(err)
	}
	if err := c.AssertCommittedRecoverable("r2"); err != nil {
		t.Fatal(err)
	}
	t.Logf("multi-node rebuild complete: both replicas recovered from snapshot")
}

// Scenario: rebuild timeout — rebuild takes too long, coordinator
// should be able to abort and retry or fail explicitly.

func TestP02_RebuildTimeout_PartialRebuildAborts(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	for i := uint64(1); i <= 20; i++ {
		c.CommitWrite(i % 4)
	}
	c.TickN(5)

	c.Primary().Storage.TakeSnapshot("snap-timeout", c.Coordinator.CommittedLSN)

	// Write much more.
	for i := uint64(21); i <= 100; i++ {
		c.CommitWrite(i % 4)
	}
	c.TickN(5)

	// r1 needs rebuild — use partial rebuild with small max.
	lastRecovered, err := c.RebuildReplicaFromSnapshotPartial("r1", "snap-timeout", c.Coordinator.CommittedLSN, 5)
	if err != nil {
		t.Fatalf("partial rebuild: %v", err)
	}

	// Partial rebuild: not complete.
	if lastRecovered >= c.Coordinator.CommittedLSN {
		t.Fatal("expected partial rebuild, not complete")
	}

	// Mark r1 as Rebuilding; a partial rebuild must not grant InSync.
	c.Nodes["r1"].ReplicaState = NodeStateRebuilding
	if c.Nodes["r1"].ReplicaState == NodeStateInSync {
		t.Fatal("partial rebuild should not grant InSync")
	}

	// Full rebuild to complete.
	c.RebuildReplicaFromSnapshot("r1", "snap-timeout", c.Coordinator.CommittedLSN)
	c.Nodes["r1"].ReplicaState = NodeStateInSync
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatal(err)
	}
}
@ -0,0 +1,359 @@
package distsim

import (
	"testing"
)

// ============================================================
// Phase 02: Protocol-state assertions + version comparison
// ============================================================

// --- P0: Protocol-level rejection assertions ---

func TestP02_EpochFencing_AllStaleTrafficRejected(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	// Partition + promote.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.Disconnect("p", "r2")
	c.Disconnect("r2", "p")
	c.Promote("r1")
	staleEpoch := c.Coordinator.Epoch - 1
	c.Nodes["p"].Epoch = staleEpoch

	// Stale writes through protocol.
	delivered := c.StaleWrite("p", staleEpoch, 99)

	// Protocol-level assertion: zero accepted, all rejected by epoch.
	if delivered > 0 {
		t.Fatalf("stale traffic accepted: %d messages passed fencing", delivered)
	}
	epochRejects := c.RejectedByReason(RejectEpochMismatch)
	if epochRejects == 0 {
		t.Fatal("no epoch rejections recorded — fencing not tracked")
	}

	// Delivery log must show explicit rejections (protocol behavior, not just final state).
	totalRejected := 0
	for _, d := range c.Deliveries {
		if !d.Accepted {
			totalRejected++
		}
	}
	if totalRejected == 0 {
		t.Fatal("delivery log has no rejections — protocol behavior not recorded")
	}
	t.Logf("protocol-level: %d rejected, %d epoch_mismatch", totalRejected, epochRejects)
}

func TestP02_AcceptedDeliveries_Tracked(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	// Should have accepted write + barrier deliveries.
	accepted := c.AcceptedCount()
	if accepted == 0 {
		t.Fatal("no accepted deliveries recorded")
	}
	acceptedWrites := c.AcceptedByKind(MsgWrite)
	if acceptedWrites == 0 {
		t.Fatal("no accepted write deliveries")
	}
	t.Logf("after 1 write: %d accepted total, %d writes", accepted, acceptedWrites)
}

// --- P1: S20 protocol-level closure ---

func TestP02_S20_StaleTraffic_CommittedPrefixUnchanged(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "A", "B", "C")

	c.CommitWrite(1)
	c.CommitWrite(2)
	c.TickN(5)

	// Partition A, promote B.
	c.Disconnect("A", "B")
	c.Disconnect("B", "A")
	c.Disconnect("A", "C")
	c.Disconnect("C", "A")
	c.Promote("B")
	c.Nodes["A"].Epoch = c.Coordinator.Epoch - 1

	// B writes (new epoch).
	c.CommitWrite(3)
	c.TickN(5)
	committedBefore := c.Coordinator.CommittedLSN

	// A stale writes through protocol.
	c.StaleWrite("A", c.Nodes["A"].Epoch, 99)

	// Protocol assertion: committed prefix unchanged by stale traffic.
	committedAfter := c.Coordinator.CommittedLSN
	if committedAfter != committedBefore {
		t.Fatalf("stale traffic changed committed prefix: before=%d after=%d", committedBefore, committedAfter)
	}

	// All stale messages rejected by epoch.
	if c.RejectedByReason(RejectEpochMismatch) == 0 {
		t.Fatal("no epoch rejections for stale traffic")
	}
}

// --- P1: S6 protocol-level closure ---

func TestP02_S6_NonConvergent_ExplicitStateTransition(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.MaxCatchupAttempts = 3

	for i := uint64(1); i <= 5; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	for i := uint64(6); i <= 100; i++ {
		c.CommitWrite(i % 8)
	}
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	r1 := c.Nodes["r1"]
	r1.ReplicaState = NodeStateCatchingUp

	// Protocol assertion: state transitions are explicit.
	// Track the state at each step.
	var stateTrace []ReplicaNodeState
	for attempt := 0; attempt < 10; attempt++ {
		c.Disconnect("p", "r1")
		c.Disconnect("r1", "p")
		for w := 0; w < 20; w++ {
			c.CommitWrite(uint64(101+attempt*20+w) % 8)
		}
		c.TickN(2)
		c.Connect("p", "r1")
		c.Connect("r1", "p")

		c.CatchUpWithEscalation("r1", 1)
		stateTrace = append(stateTrace, r1.ReplicaState)

		if r1.ReplicaState == NodeStateNeedsRebuild {
			break
		}
	}

	// Must have explicit state transitions: CatchingUp → ... → NeedsRebuild.
	if r1.ReplicaState != NodeStateNeedsRebuild {
		t.Fatalf("expected NeedsRebuild, got %s", r1.ReplicaState)
	}
	// Trace must show CatchingUp before NeedsRebuild.
	hasCatchingUp := false
	for _, s := range stateTrace {
		if s == NodeStateCatchingUp {
			hasCatchingUp = true
		}
	}
	if !hasCatchingUp {
		t.Fatal("state trace should include CatchingUp before NeedsRebuild")
	}
	t.Logf("state trace: %v", stateTrace)
}

// --- P1: S18 protocol-level closure ---

func TestP02_S18_DelayedAck_ExplicitRejection(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	// Write 2 without r1 ack.
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.TickN(3)

	// Restart primary with epoch bump.
	c.StopNode("p")
	c.Coordinator.Epoch++
	for _, n := range c.Nodes {
		if n.Running {
			n.Epoch = c.Coordinator.Epoch
		}
	}
	c.StartNode("p")

	committedBefore := c.Coordinator.CommittedLSN
	deliveriesBefore := len(c.Deliveries)

	// Reconnect r1, deliver stale ack.
	c.Connect("r1", "p")
	c.Connect("p", "r1")
	oldAck := Message{
		Kind: MsgBarrierAck, From: "r1", To: "p",
		Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2,
	}
	c.deliver(oldAck)
	c.refreshCommits()

	// Protocol assertion 1: committed prefix unchanged.
	if c.Coordinator.CommittedLSN > committedBefore {
		t.Fatalf("stale ack advanced prefix: %d → %d", committedBefore, c.Coordinator.CommittedLSN)
	}

	// Protocol assertion 2: the delivery was explicitly recorded as rejected.
	newDeliveries := c.Deliveries[deliveriesBefore:]
	found := false
	for _, d := range newDeliveries {
		if !d.Accepted && d.Reason == RejectEpochMismatch && d.Msg.Kind == MsgBarrierAck {
			found = true
		}
	}
	if !found {
		t.Fatal("stale ack not recorded as epoch_mismatch rejection in delivery log")
	}
}

// --- P2: Version comparison ---

func TestP02_VersionComparison_BriefDisconnect(t *testing.T) {
	// Same scenario under V1, V1.5, V2 — different expected outcomes.
	for _, tc := range []struct {
		version       ProtocolVersion
		expectCatchup bool
		expectRebuild bool
	}{
		{ProtocolV1, false, false}, // V1: no catch-up, stays degraded
		{ProtocolV15, true, false}, // V1.5: catch-up possible if address stable
		{ProtocolV2, true, false},  // V2: catch-up allowed for this recoverable short-gap case
	} {
		t.Run(string(tc.version), func(t *testing.T) {
			c := NewClusterWithProtocol(CommitSyncQuorum, tc.version, "p", "r1", "r2")
			c.MaxCatchupAttempts = 5

			c.CommitWrite(1)
			c.CommitWrite(2)
			c.TickN(5)

			// Brief disconnect.
			c.Disconnect("p", "r1")
			c.Disconnect("r1", "p")
			c.CommitWrite(3)
			c.CommitWrite(4)
			c.TickN(5)
			c.Connect("p", "r1")
			c.Connect("r1", "p")

			canCatchup := c.Protocol.CanAttemptCatchup(true)
			if canCatchup != tc.expectCatchup {
				t.Fatalf("CanAttemptCatchup: got %v, want %v", canCatchup, tc.expectCatchup)
			}

			if canCatchup {
				// Catch up r1.
				r1 := c.Nodes["r1"]
				r1.ReplicaState = NodeStateCatchingUp
				converged := c.CatchUpWithEscalation("r1", 100)
				if !converged {
					t.Fatal("expected catch-up to converge for short gap")
				}
				if r1.ReplicaState != NodeStateInSync {
					t.Fatalf("expected InSync after catch-up, got %s", r1.ReplicaState)
				}
			}
		})
	}
}

func TestP02_VersionComparison_BriefDisconnectActions(t *testing.T) {
	for _, tc := range []struct {
		version      ProtocolVersion
		addrStable   bool
		recoverable  bool
		expectAction string
	}{
		{ProtocolV1, true, true, "degrade_or_rebuild"},
		{ProtocolV15, true, true, "catchup_if_history_survives"},
		{ProtocolV15, false, true, "stall_or_control_plane_recovery"},
		{ProtocolV2, true, true, "reserved_catchup"},
		{ProtocolV2, false, false, "explicit_rebuild"},
	} {
		t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable)+"_recoverable="+boolStr(tc.recoverable), func(t *testing.T) {
			policy := ProtocolPolicy{Version: tc.version}
			action := policy.BriefDisconnectAction(tc.addrStable, tc.recoverable)
			if action != tc.expectAction {
				t.Fatalf("BriefDisconnectAction(%v,%v): got %q, want %q", tc.addrStable, tc.recoverable, action, tc.expectAction)
			}
		})
	}
}

func TestP02_VersionComparison_TailChasing(t *testing.T) {
	for _, tc := range []struct {
		version      ProtocolVersion
		expectAction string
	}{
		{ProtocolV1, "degrade"},
		{ProtocolV15, "stall_or_rebuild"},
		{ProtocolV2, "abort_to_rebuild"},
	} {
		t.Run(string(tc.version), func(t *testing.T) {
			policy := ProtocolPolicy{Version: tc.version}
			action := policy.TailChasingAction(false) // non-convergent
			if action != tc.expectAction {
				t.Fatalf("TailChasingAction(false): got %q, want %q", action, tc.expectAction)
			}
		})
	}
}

func TestP02_VersionComparison_RestartRejoin(t *testing.T) {
	for _, tc := range []struct {
		version      ProtocolVersion
		addrStable   bool
		expectAction string
	}{
		{ProtocolV1, true, "control_plane_only"},
		{ProtocolV1, false, "control_plane_only"},
		{ProtocolV15, true, "background_reconnect_or_control_plane"},
		{ProtocolV15, false, "control_plane_only"},
		{ProtocolV2, true, "direct_reconnect_or_control_plane"},
		{ProtocolV2, false, "explicit_reassignment_or_rebuild"},
	} {
		t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable), func(t *testing.T) {
			policy := ProtocolPolicy{Version: tc.version}
			action := policy.RestartRejoinAction(tc.addrStable)
			if action != tc.expectAction {
				t.Fatalf("RestartRejoinAction(%v): got %q, want %q", tc.addrStable, action, tc.expectAction)
			}
		})
	}
}

func TestP02_VersionComparison_V15RestartAddressInstability(t *testing.T) {
	v15 := ProtocolPolicy{Version: ProtocolV15}
	v2 := ProtocolPolicy{Version: ProtocolV2}

	if got := v15.RestartRejoinAction(false); got != "control_plane_only" {
		t.Fatalf("v1.5 changed-address restart should fall back to control plane, got %q", got)
	}
	if got := v2.ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" {
		t.Fatalf("v2 changed-address recoverable restart should use explicit reassignment + catch-up, got %q", got)
	}
	if got := v2.ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" {
		t.Fatalf("v2 changed-address unrecoverable restart should go to explicit reassignment/rebuild, got %q", got)
	}
}

func boolStr(b bool) string {
	if b {
		return "true"
	}
	return "false"
}
@ -0,0 +1,434 @@ |
|||
package distsim |
|||
|
|||
import ( |
|||
"testing" |
|||
) |
|||
|
|||
// ============================================================
|
|||
// Phase 02 P2: Real V1/V1.5 failure reproductions
|
|||
// Source: actual Phase 13 hardware behavior and CP13-8 findings
|
|||
// ============================================================
|
|||
|
|||
// --- Scenario: Changed-address restart (CP13-8 T4b) ---
|
|||
// Real bug: replica restarts on a different port. V1.5 shipper retries
|
|||
// the old address forever. Catch-up never succeeds because the old
|
|||
// address is dead.
|
|||
|
|||
func TestP02_V1_ChangedAddressRestart_NeverRecovers(t *testing.T) { |
|||
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2") |
|||
|
|||
c.CommitWrite(1) |
|||
c.CommitWrite(2) |
|||
c.TickN(5) |
|||
|
|||
// r1 restarts with changed address — endpoint version bumps.
|
|||
c.StopNode("r1") |
|||
c.Coordinator.Epoch++ |
|||
for _, n := range c.Nodes { |
|||
if n.Running { |
|||
n.Epoch = c.Coordinator.Epoch |
|||
} |
|||
} |
|||
c.RestartNodeWithNewAddress("r1") |
|||
|
|||
// Messages from primary to r1 now rejected: stale endpoint.
|
|||
staleRejects := c.RejectedByReason(RejectStaleEndpoint) |
|||
|
|||
// Writes accumulate — r1 can't receive (endpoint mismatch).
|
|||
for i := uint64(3); i <= 12; i++ { |
|||
c.CommitWrite(i) |
|||
} |
|||
c.TickN(5) |
|||
|
|||
// Verify: messages rejected by stale endpoint, not just link down.
|
|||
newStaleRejects := c.RejectedByReason(RejectStaleEndpoint) - staleRejects |
|||
if newStaleRejects == 0 { |
|||
t.Fatal("V1: writes to r1 should be rejected by stale_endpoint") |
|||
} |
|||
|
|||
// V1: no recovery trigger available.
|
|||
trigger, _, ok := c.TriggerRecoverySession("r1") |
|||
if ok { |
|||
t.Fatalf("V1 should not trigger recovery, got %s", trigger) |
|||
} |
|||
|
|||
// Gap confirmed.
|
|||
r1 := c.Nodes["r1"] |
|||
if err := c.AssertCommittedRecoverable("r1"); err == nil { |
|||
t.Fatal("V1: r1 should have data inconsistency") |
|||
} |
|||
t.Logf("V1: gap=%d, %d stale_endpoint rejections, no recovery path", |
|||
c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN, newStaleRejects) |
|||
} |
|||
|
|||
func TestP02_V15_ChangedAddressRestart_RetriesToStaleAddress(t *testing.T) { |
|||
c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2") |
|||
|
|||
c.CommitWrite(1) |
|||
c.CommitWrite(2) |
|||
c.TickN(5) |
|||
|
|||
// r1 restarts with changed address — endpoint version bumps.
|
|||
c.StopNode("r1") |
|||
c.Coordinator.Epoch++ |
|||
for _, n := range c.Nodes { |
|||
if n.Running { |
|||
n.Epoch = c.Coordinator.Epoch |
|||
} |
|||
} |
|||
c.RestartNodeWithNewAddress("r1") |
|||
|
|||
// Writes accumulate — rejected by stale endpoint.
|
|||
for i := uint64(3); i <= 12; i++ { |
|||
c.CommitWrite(i) |
|||
} |
|||
c.TickN(5) |
|||
|
|||
// V1.5: recovery trigger fails — address mismatch detected.
|
|||
trigger, _, ok := c.TriggerRecoverySession("r1") |
|||
if ok { |
|||
t.Fatalf("V1.5 should not trigger recovery with changed address, got %s", trigger) |
|||
} |
|||
|
|||
// Heartbeat reveals new endpoint, but V1.5 can only do control_plane_only.
|
|||
report := c.ReportHeartbeat("r1") |
|||
update := c.CoordinatorDetectEndpointChange(report) |
|||
if update == nil { |
|||
t.Fatal("coordinator should detect endpoint change") |
|||
} |
|||
// V1.5: does NOT apply assignment update — no mechanism to update primary.
|
|||
if got := c.Protocol.ChangedAddressRestartAction(true); got != "control_plane_only" { |
|||
t.Fatalf("V1.5: got %q, want control_plane_only", got) |
|||
} |
|||
|
|||
// Gap persists, data inconsistency.
|
|||
r1 := c.Nodes["r1"] |
|||
if err := c.AssertCommittedRecoverable("r1"); err == nil { |
|||
t.Fatal("V1.5: r1 should have data inconsistency") |
|||
} |
|||
t.Logf("V1.5: gap=%d, stale endpoint blocks recovery — control_plane_only", |
|||
c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN) |
|||
} |
|||
|
|||
func TestP02_V2_ChangedAddressRestart_ExplicitReassignment(t *testing.T) {
	c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
	c.MaxCatchupAttempts = 5

	c.CommitWrite(1)
	c.CommitWrite(2)
	c.TickN(5)

	// r1 restarts with changed address — endpoint version bumps.
	c.StopNode("r1")
	c.Coordinator.Epoch++
	for _, n := range c.Nodes {
		if n.Running {
			n.Epoch = c.Coordinator.Epoch
		}
	}
	c.RestartNodeWithNewAddress("r1")

	// Writes accumulate — rejected by stale endpoint.
	for i := uint64(3); i <= 12; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)

	// Before control-plane flow: recovery trigger fails (stale endpoint).
	trigger, _, ok := c.TriggerRecoverySession("r1")
	if ok {
		t.Fatalf("V2: recovery should fail before assignment update, got %s", trigger)
	}

	// Step 1: heartbeat discovers new endpoint.
	report := c.ReportHeartbeat("r1")
	update := c.CoordinatorDetectEndpointChange(report)
	if update == nil {
		t.Fatal("coordinator should detect endpoint change")
	}

	// Step 2: coordinator applies assignment — primary learns new address.
	c.ApplyAssignmentUpdate(*update)

	// Step 3: recovery trigger now succeeds (endpoint matches).
	trigger, _, ok = c.TriggerRecoverySession("r1")
	if !ok || trigger != TriggerReassignment {
		t.Fatalf("V2: expected reassignment trigger after update, got %s/%v", trigger, ok)
	}

	// Step 4: catch-up via protocol.
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("V2: catch-up should converge after reassignment")
	}

	// Data correct after full control-plane flow.
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatalf("V2: data incorrect after reassignment+catchup: %v", err)
	}
	t.Logf("V2: recovered via heartbeat→detect→assignment→trigger→catchup")
}
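The V1.5/V2 contrast above hinges on endpoint-identity fencing: a recovery trigger stays blocked while the primary's assignment still records the replica's pre-restart endpoint, and unblocks only once the control-plane flow applies the update. A minimal sketch of that check, with illustrative names (`Endpoint`, `CanTriggerRecovery`) that are assumptions, not the simulator's actual API:

```go
package main

import "fmt"

// Endpoint identifies a replica's network identity. Version is bumped
// whenever the replica restarts with a new address.
type Endpoint struct {
	Addr    string
	Version uint64
}

// CanTriggerRecovery succeeds only when the primary's assignment records
// the same endpoint (address and version) the replica currently advertises.
func CanTriggerRecovery(assigned, reported Endpoint) bool {
	return assigned.Version == reported.Version && assigned.Addr == reported.Addr
}

func main() {
	assigned := Endpoint{Addr: "10.0.0.5:7000", Version: 1}
	// r1 restarts with a new address: its endpoint version bumps to 2.
	reported := Endpoint{Addr: "10.0.0.9:7000", Version: 2}

	fmt.Println(CanTriggerRecovery(assigned, reported)) // stale assignment: false

	// Control-plane flow: heartbeat -> detect -> apply assignment update.
	assigned = reported
	fmt.Println(CanTriggerRecovery(assigned, reported)) // now true
}
```

This mirrors the test's step ordering: the trigger fails before `ApplyAssignmentUpdate` and succeeds after it.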
// --- Scenario: Same-address transient outage ---
// Common case: brief network hiccup, same ports.
func TestP02_V1_TransientOutage_Degrades(t *testing.T) {
	c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	// Brief partition.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.CommitWrite(3)
	c.TickN(5)

	// Heal.
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// V1: no catch-up. r1 stays at flushed=1.
	if c.Protocol.CanAttemptCatchup(true) {
		t.Fatal("V1 should not catch-up even with stable address")
	}

	c.TickN(5)
	r1 := c.Nodes["r1"]
	if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN {
		// V1 doesn't catch up — unless messages from BEFORE the disconnect are
		// still delivering. In our model, messages enqueued before a disconnect
		// may still arrive. That's a V1 "accident", not protocol behavior.
	}
	t.Logf("V1 transient outage: flushed=%d committed=%d action=%s",
		r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN,
		c.Protocol.BriefDisconnectAction(true, true))
}
func TestP02_V15_TransientOutage_CatchesUp(t *testing.T) {
	c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
	c.MaxCatchupAttempts = 5

	c.CommitWrite(1)
	c.TickN(5)

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.CommitWrite(3)
	c.TickN(5)

	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// V1.5: catch-up works if address stable.
	if !c.Protocol.CanAttemptCatchup(true) {
		t.Fatal("V1.5 should catch-up with stable address")
	}

	r1 := c.Nodes["r1"]
	r1.ReplicaState = NodeStateCatchingUp
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("V1.5: should converge for short gap with stable address")
	}
	if r1.ReplicaState != NodeStateInSync {
		t.Fatalf("V1.5: expected InSync, got %s", r1.ReplicaState)
	}
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatal(err)
	}
	t.Logf("V1.5 transient outage: recovered via catch-up, flushed=%d", r1.Storage.FlushedLSN)
}
func TestP02_V2_TransientOutage_ReservedCatchup(t *testing.T) {
	c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
	c.MaxCatchupAttempts = 5

	c.CommitWrite(1)
	c.TickN(5)

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.CommitWrite(3)
	c.TickN(5)

	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// V2: reserved catch-up — explicit recoverability check.
	action := c.Protocol.BriefDisconnectAction(true, true)
	if action != "reserved_catchup" {
		t.Fatalf("V2 brief disconnect: got %q, want reserved_catchup", action)
	}

	r1 := c.Nodes["r1"]
	r1.ReplicaState = NodeStateCatchingUp
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("V2: should converge for short gap")
	}
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatal(err)
	}
	t.Logf("V2 transient outage: reserved catch-up succeeded")
}
// --- Scenario: Slow control-plane recovery ---
// Source: real Phase 13 hardware behavior.
// Data path recovers fast. Control plane (master) is slow to re-issue
// assignments. During this window, V1/V1.5 behavior differs from V2.
func TestP02_SlowControlPlane_V1_WaitsForMaster(t *testing.T) {
	c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	// r1 disconnects. Stays disconnected through outage + control-plane delay.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")

	// Writes accumulate: outage write + delay-window writes. r1 misses all.
	for i := uint64(2); i <= 10; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)

	// Data path heals — but V1 has no catch-up protocol.
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// V1: no recovery trigger even with address stable.
	trigger, _, ok := c.TriggerRecoverySession("r1")
	if ok {
		t.Fatalf("V1 should not trigger recovery, got %s", trigger)
	}

	// r1 is behind: FlushedLSN=1, CommittedLSN=10. Gap = 9.
	r1 := c.Nodes["r1"]
	gap := c.Coordinator.CommittedLSN - r1.Storage.FlushedLSN
	if gap < 9 {
		t.Fatalf("V1: expected gap >= 9, got %d", gap)
	}

	// V1 data inconsistency: r1 missed writes 2-10. No self-heal mechanism.
	err := c.AssertCommittedRecoverable("r1")
	if err == nil {
		t.Fatal("V1: r1 should have data inconsistency — no catch-up mechanism")
	}
	t.Logf("V1 slow control-plane: gap=%d, data inconsistency — %v", gap, err)
}
func TestP02_SlowControlPlane_V15_BackgroundReconnect(t *testing.T) {
	c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
	c.MaxCatchupAttempts = 5

	c.CommitWrite(1)
	c.TickN(5)

	// r1 disconnects. Stays disconnected through outage + delay window.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")

	// Writes accumulate while r1 is disconnected.
	for i := uint64(2); i <= 10; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)

	// Data path heals.
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// Before catch-up: r1 is behind (FlushedLSN=1, CommittedLSN=10).
	r1 := c.Nodes["r1"]
	if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN {
		t.Fatal("V1.5: r1 should be behind before catch-up")
	}
	if err := c.AssertCommittedRecoverable("r1"); err == nil {
		t.Fatal("V1.5: r1 should have data gap before catch-up")
	}

	// V1.5 policy: background reconnect if address stable.
	if c.Protocol.RestartRejoinAction(true) != "background_reconnect_or_control_plane" {
		t.Fatal("V1.5 stable-address should be background_reconnect_or_control_plane")
	}

	// V1.5 recovery trigger: background reconnect (address stable → endpoint matches).
	trigger, _, ok := c.TriggerRecoverySession("r1")
	if !ok || trigger != TriggerBackgroundReconnect {
		t.Fatalf("V1.5: expected background_reconnect trigger, got %s/%v", trigger, ok)
	}
	// r1.ReplicaState is now CatchingUp (set by TriggerRecoverySession).
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("V1.5: should catch up with stable address")
	}

	// After catch-up: data correct.
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatalf("V1.5: data should be correct after catch-up — %v", err)
	}

	// V1.5 changed-address: falls back to control plane.
	if c.Protocol.RestartRejoinAction(false) != "control_plane_only" {
		t.Fatal("V1.5 changed-address should fall back to control_plane_only")
	}
	t.Logf("V1.5 slow control-plane: caught up %d entries via background reconnect",
		c.Coordinator.CommittedLSN-1)
}
func TestP02_SlowControlPlane_V2_DirectReconnect(t *testing.T) {
	c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2")
	c.MaxCatchupAttempts = 5

	c.CommitWrite(1)
	c.TickN(5)

	// r1 disconnects. Stays disconnected through outage + delay window.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")

	// Writes accumulate while r1 is disconnected.
	for i := uint64(2); i <= 10; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)

	// Data path heals.
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// Before catch-up: r1 is behind.
	r1 := c.Nodes["r1"]
	if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN {
		t.Fatal("V2: r1 should be behind before direct reconnect")
	}
	if err := c.AssertCommittedRecoverable("r1"); err == nil {
		t.Fatal("V2: r1 should have data gap before direct reconnect")
	}

	// V2 policy: direct reconnect, doesn't wait for the master.
	if c.Protocol.RestartRejoinAction(true) != "direct_reconnect_or_control_plane" {
		t.Fatal("V2 should be direct_reconnect_or_control_plane")
	}

	// V2 recovery trigger: reassignment (address stable → endpoint matches).
	trigger, _, ok := c.TriggerRecoverySession("r1")
	if !ok || trigger != TriggerReassignment {
		t.Fatalf("V2: expected reassignment trigger, got %s/%v", trigger, ok)
	}
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("V2: should catch up directly without master intervention")
	}

	// After catch-up: data correct.
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatalf("V2: data should be correct after direct reconnect — %v", err)
	}
	t.Logf("V2 slow control-plane: caught up %d entries immediately via direct reconnect",
		c.Coordinator.CommittedLSN-1)
}
@@ -0,0 +1,287 @@
package distsim

import (
	"testing"
)

// ============================================================
// Phase 03 P2: Timer-ordering races
// ============================================================

// --- Race 1: Concurrent barrier timeouts under sync_quorum ---
func TestP03_P2_ConcurrentBarrierTimeout_QuorumEdge(t *testing.T) {
	// RF=3 (p, r1, r2). sync_quorum (quorum=2).
	// Both r1 and r2 have barrier timeouts. r1's ack arrives in the same tick
	// as r2's timeout fires. The "data before timers" rule means:
	// r1 ack processed → cancels r1 timeout → r2 timeout fires → quorum = p+r1 = 2 → committed.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.BarrierTimeoutTicks = 5

	// r2 disconnected — barrier will time out. r1 connected — will ack.
	c.Disconnect("p", "r2")
	c.Disconnect("r2", "p")

	c.CommitWrite(1) // barrier to r1 at Now+2, barrier to r2 at Now+2
	// Barrier timeout for both at Now+5.

	c.TickN(10)

	// r1 ack arrived → cancelled r1 timeout.
	// r2 barrier timed out (link down, no ack).
	firedBarriers := c.FiredTimeoutsByKind(TimeoutBarrier)
	if firedBarriers != 1 {
		t.Fatalf("expected 1 barrier timeout (r2), got %d", firedBarriers)
	}

	// Event log: r1's barrier timeout was cancelled (ack arrived earlier).
	// r2's barrier timeout fired. Verify both are in the TickLog.
	var cancelCount, fireCount int
	for _, e := range c.TickLog {
		if e.Kind == EventTimeoutCancelled {
			cancelCount++
		}
		if e.Kind == EventTimeoutFired {
			fireCount++
		}
	}
	if cancelCount != 1 {
		t.Fatalf("expected 1 timeout cancel (r1 ack), got %d", cancelCount)
	}
	if fireCount != 1 {
		t.Fatalf("expected 1 timeout fire (r2), got %d", fireCount)
	}

	// Quorum: p + r1 = 2 of 3 → committed.
	if c.Coordinator.CommittedLSN != 1 {
		t.Fatalf("LSN 1 should commit via quorum (p+r1), committed=%d", c.Coordinator.CommittedLSN)
	}

	// DurableOn: p=true (self-ack), r1=true (ack), r2 NOT set (timed out).
	p1 := c.Pending[1]
	if !p1.DurableOn["p"] || !p1.DurableOn["r1"] {
		t.Fatal("DurableOn should have p and r1")
	}
	if p1.DurableOn["r2"] {
		t.Fatal("DurableOn should NOT have r2 (timed out)")
	}

	t.Logf("concurrent timeout: r1 acked, r2 timed out, quorum met, committed=%d", c.Coordinator.CommittedLSN)
}
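The "data before timers" rule this test relies on can be sketched as a tick loop that drains same-tick deliveries (which may cancel timeouts) before firing any timeout due at that tick. Types and names below are illustrative assumptions, not the simulator's internals:

```go
package main

import "fmt"

// timeout is a pending deadline that a delivery may cancel before it fires.
type timeout struct {
	id        string
	deadline  int
	cancelled bool
}

// delivery is a message arriving at a given tick; cancelsID names the
// timeout it cancels (e.g. a barrier ack cancelling its barrier timeout).
type delivery struct {
	at        int
	cancelsID string
}

// runTick processes same-tick deliveries first, then fires surviving
// timeouts due at this tick. Returns the IDs of timeouts that fired.
func runTick(now int, deliveries []delivery, timeouts []*timeout) (fired []string) {
	for _, d := range deliveries {
		if d.at != now {
			continue
		}
		for _, to := range timeouts {
			if to.id == d.cancelsID {
				to.cancelled = true // ack cancels its timeout before timers run
			}
		}
	}
	for _, to := range timeouts {
		if to.deadline == now && !to.cancelled {
			fired = append(fired, to.id)
		}
	}
	return fired
}

func main() {
	timeouts := []*timeout{
		{id: "barrier/r1", deadline: 5},
		{id: "barrier/r2", deadline: 5},
	}
	// r1's ack arrives in the same tick both timeouts are due:
	// only r2's timeout fires.
	fired := runTick(5, []delivery{{at: 5, cancelsID: "barrier/r1"}}, timeouts)
	fmt.Println(fired)
}
```

Reversing the two phases would fire r1's timeout spuriously even though its ack was already in hand for that tick, which is exactly the race the test pins down.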
func TestP03_P2_ConcurrentBarrierTimeout_BothTimeout_NoQuorum(t *testing.T) {
	// Both r1 and r2 disconnected. Both timeouts fire. Quorum = p alone = 1 < 2.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.BarrierTimeoutTicks = 5

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.Disconnect("p", "r2")
	c.Disconnect("r2", "p")

	c.CommitWrite(1)
	c.TickN(10)

	// Both barriers timed out.
	if c.FiredTimeoutsByKind(TimeoutBarrier) != 2 {
		t.Fatalf("expected 2 barrier timeouts, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
	}

	// No quorum — uncommitted.
	if c.Coordinator.CommittedLSN != 0 {
		t.Fatalf("LSN 1 should not commit without quorum, committed=%d", c.Coordinator.CommittedLSN)
	}

	// Neither r1 nor r2 in DurableOn.
	p1 := c.Pending[1]
	if p1.DurableOn["r1"] || p1.DurableOn["r2"] {
		t.Fatal("DurableOn should not have r1 or r2")
	}
	t.Logf("both timeouts: no quorum, LSN 1 uncommitted")
}
func TestP03_P2_ConcurrentBarrierTimeout_SameTick_AckAndTimeout(t *testing.T) {
	// The precise same-tick race: r1's ack arrives at exactly the tick when r2's
	// timeout fires. Verify data-before-timers ordering in the event log.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.BarrierTimeoutTicks = 4 // timeout at Now+4

	c.Disconnect("p", "r2")
	c.Disconnect("r2", "p")

	c.CommitWrite(1)
	// Write at Now+1, barrier at Now+2, ack back at Now+3.
	// Timeout for r1 at Now+4, timeout for r2 at Now+4.

	// Tick to barrier ack arrival (tick 3): r1 ack delivered, cancels r1 timeout.
	// Tick 4: r2 timeout fires. r1 timeout already cancelled.
	c.TickN(6)

	// Check event ordering at the timeout tick.
	timeoutTick := uint64(0)
	for _, ft := range c.FiredTimeouts {
		timeoutTick = ft.FiredAt
	}
	events := c.TickEventsAt(timeoutTick)

	// At the timeout tick, we should see: r2 timeout fired (r1 was cancelled earlier).
	var firedDetails []string
	for _, e := range events {
		if e.Kind == EventTimeoutFired {
			firedDetails = append(firedDetails, e.Detail)
		}
	}
	if len(firedDetails) != 1 {
		t.Fatalf("expected 1 timeout fire at tick %d, got %d: %v", timeoutTick, len(firedDetails), firedDetails)
	}

	// Committed via quorum.
	if c.Coordinator.CommittedLSN != 1 {
		t.Fatalf("committed=%d, want 1", c.Coordinator.CommittedLSN)
	}
	t.Logf("same-tick race: r1 ack cancelled at tick 3, r2 timeout fired at tick %d, committed=1", timeoutTick)
}

// --- Race 2: Epoch bump during active barrier timeout window ---
func TestP03_P2_EpochBumpDuringBarrierTimeout_CrossSurface(t *testing.T) {
	// Three cleanup mechanisms interact for the same barrier:
	//   1. Epoch fencing in deliver() rejects old-epoch messages.
	//   2. Barrier timeout in fireTimeouts() removes queued barriers + marks expired.
	//   3. ExpiredBarriers in deliver() rejects late acks.
	//
	// Scenario: barrier re-queues (r1 missing data), epoch bumps, then timeout fires.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	c.BarrierTimeoutTicks = 10

	c.CommitWrite(1) // write+barrier to r1, r2

	// Drop the write to r1 so its barrier keeps re-queuing.
	var kept []inFlightMessage
	for _, item := range c.Queue {
		if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 1 {
			continue
		}
		kept = append(kept, item)
	}
	c.Queue = kept

	// Ticks 1-3: r1's barrier delivers but r1 doesn't have the data → re-queues.
	// r2 gets write+barrier normally → acks.
	c.TickN(3)

	// Epoch bump: promote r2 (p stays running as a demoted replica).
	// This ensures the old-epoch barrier hits epoch fencing, not node_down.
	if err := c.Promote("r2"); err != nil {
		t.Fatal(err)
	}

	// Record state before the timeout window.
	epochRejectsBefore := c.RejectedByReason(RejectEpochMismatch)

	// Ticks 4-5: the old-epoch barrier (p→r1) is in the queue. deliver()
	// rejects it with epoch_mismatch (msg epoch=1 vs coordinator epoch=2).
	c.TickN(2)

	// Old barrier rejected by epoch fencing.
	epochRejectsAfter := c.RejectedByReason(RejectEpochMismatch)
	newEpochRejects := epochRejectsAfter - epochRejectsBefore
	if newEpochRejects == 0 {
		t.Fatal("old-epoch barrier should be rejected by epoch fencing")
	}

	// Tick past the barrier timeout deadline.
	c.TickN(10)

	// Barrier timeout fires for r1/LSN 1 (removes any remaining queued copies).
	if c.FiredTimeoutsByKind(TimeoutBarrier) == 0 {
		t.Fatal("barrier timeout should fire for r1/LSN 1")
	}

	// Expired barrier marked.
	if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] {
		t.Fatal("r1/LSN 1 should be in ExpiredBarriers")
	}

	// Inject a late ack from r1 for LSN 1 at the current epoch (to new primary r2).
	// The barrier is expired — the ack should be rejected as barrier_expired.
	deliveriesBefore := len(c.Deliveries)
	c.InjectMessage(Message{
		Kind: MsgBarrierAck, From: "r1", To: "r2",
		Epoch: c.Coordinator.Epoch, TargetLSN: 1,
	}, c.Now+1)
	c.TickN(2)

	// Late ack rejected by barrier_expired.
	lateRejected := false
	for _, d := range c.Deliveries[deliveriesBefore:] {
		if d.Msg.Kind == MsgBarrierAck && d.Msg.From == "r1" && d.Msg.TargetLSN == 1 {
			if !d.Accepted && d.Reason == RejectBarrierExpired {
				lateRejected = true
			}
		}
	}
	if !lateRejected {
		t.Fatal("late ack for expired barrier should be rejected as barrier_expired")
	}

	// Verify the event log shows the cross-surface interaction.
	var epochRejectEvents, timeoutFireEvents int
	for _, e := range c.TickLog {
		if e.Kind == EventDeliveryRejected {
			epochRejectEvents++
		}
		if e.Kind == EventTimeoutFired {
			timeoutFireEvents++
		}
	}
	if epochRejectEvents == 0 || timeoutFireEvents == 0 {
		t.Fatalf("event log should show both epoch rejections (%d) and timeout fires (%d)",
			epochRejectEvents, timeoutFireEvents)
	}

	t.Logf("cross-surface: epoch_rejects=%d, timeout_fires=%d, expired_barrier=true, late_ack_rejected=true",
		newEpochRejects, c.FiredTimeoutsByKind(TimeoutBarrier))
}
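The layering the cross-surface test exercises (epoch fencing evaluated before the expired-barrier check) can be sketched as a single classification function over incoming acks. Names here are illustrative assumptions, not the simulator's deliver() API:

```go
package main

import "fmt"

// ack is a barrier acknowledgement as it arrives at the coordinator.
type ack struct {
	epoch uint64
	node  string
	lsn   uint64
}

// barrierKey identifies a (replica, LSN) barrier for expiry bookkeeping.
type barrierKey struct {
	node string
	lsn  uint64
}

// classifyAck applies the checks in a fixed order: epoch fencing first,
// then the expired-barrier check, so an old-epoch ack never reaches
// barrier bookkeeping at all.
func classifyAck(a ack, currentEpoch uint64, expired map[barrierKey]bool) string {
	if a.epoch != currentEpoch {
		return "epoch_mismatch"
	}
	if expired[barrierKey{a.node, a.lsn}] {
		return "barrier_expired"
	}
	return "accepted"
}

func main() {
	expired := map[barrierKey]bool{{"r1", 1}: true}
	fmt.Println(classifyAck(ack{epoch: 1, node: "r1", lsn: 1}, 2, expired)) // epoch_mismatch
	fmt.Println(classifyAck(ack{epoch: 2, node: "r1", lsn: 1}, 2, expired)) // barrier_expired
	fmt.Println(classifyAck(ack{epoch: 2, node: "r2", lsn: 1}, 2, expired)) // accepted
}
```

The fixed ordering matters: if expiry were checked first, a stale-epoch ack could be misattributed to the expired-barrier surface and the event log would no longer distinguish the two cleanup paths.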
// --- TickEvents trace verification ---

func TestP03_P2_TickEvents_OrderingVerifiable(t *testing.T) {
	// Verify that TickEvents captures delivery → timeout ordering within a tick.
	c := NewCluster(CommitSyncAll, "p", "r1")
	c.BarrierTimeoutTicks = 5

	c.CommitWrite(1)
	c.TickN(10) // normal flow: ack cancels timeout

	// TickLog should have events.
	if len(c.TickLog) == 0 {
		t.Fatal("TickLog should record events")
	}

	// Count delivery events and timeout-cancel events.
	var deliveries, cancels int
	for _, e := range c.TickLog {
		switch e.Kind {
		case EventDeliveryAccepted:
			deliveries++
		case EventTimeoutCancelled:
			cancels++
		}
	}
	if deliveries == 0 {
		t.Fatal("should have delivery events")
	}
	if cancels == 0 {
		t.Fatal("should have timeout cancel events (ack cancelled barrier timeout)")
	}

	// BuildTrace includes TickEvents.
	trace := BuildTrace(c)
	if len(trace.TickEvents) == 0 {
		t.Fatal("BuildTrace should include TickEvents")
	}

	t.Logf("tick events: %d deliveries, %d cancels, %d total events",
		deliveries, cancels, len(c.TickLog))
}
@@ -0,0 +1,281 @@
package distsim

import (
	"testing"
)

// ============================================================
// Phase 03 P1: Race-focused tests with trace quality
// ============================================================

// --- Race 1: Promotion vs delayed catch-up timeout ---
func TestP03_Race_PromotionThenStaleCatchupTimeout(t *testing.T) {
	// r1 is CatchingUp with a catch-up timeout registered.
	// Before the timeout fires, the primary crashes and r1 is promoted.
	// The stale catch-up timeout must not regress r1 (now primary) to NeedsRebuild.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.CommitWrite(2)
	c.TickN(5)

	// r1 falls behind, starts catching up.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(3)
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	r1 := c.Nodes["r1"]
	r1.ReplicaState = NodeStateCatchingUp
	c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)

	// r1 catches up successfully.
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("r1 should converge before promotion")
	}

	// Primary crashes. Promote r1.
	c.StopNode("p")
	if err := c.Promote("r1"); err != nil {
		t.Fatal(err)
	}

	// Tick past the catch-up timeout deadline.
	c.TickN(15)

	// Stale timeout must not fire (was auto-cancelled on convergence).
	if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
		t.Fatal("stale catch-up timeout must not fire after promotion")
	}
	// r1 must remain primary and running.
	if r1.Role != RolePrimary {
		t.Fatalf("r1 should be primary, got %s", r1.Role)
	}
	if r1.ReplicaState == NodeStateNeedsRebuild {
		t.Fatal("stale timeout regressed promoted r1 to NeedsRebuild")
	}
	t.Logf("promotion vs timeout: stale catch-up timeout suppressed, r1 is primary")
}
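The stale-timeout suppression asserted above amounts to a state guard at fire time: a catch-up timeout acts only if its node is still CatchingUp when it fires; after promotion or convergence it is ignored rather than regressing the node. A minimal sketch under that assumption (state names mirror the tests; the fire function is illustrative):

```go
package main

import "fmt"

// nodeState mirrors the replica states used by the tests.
type nodeState string

const (
	stateCatchingUp   nodeState = "CatchingUp"
	stateInSync       nodeState = "InSync"
	stateNeedsRebuild nodeState = "NeedsRebuild"
)

// fireCatchupTimeout returns the resulting state and whether the timeout
// was acted on. A timeout registered while the node was CatchingUp is
// treated as stale if the node has since moved to any other state.
func fireCatchupTimeout(current nodeState) (nodeState, bool) {
	if current != stateCatchingUp {
		return current, false // stale: node was promoted or already converged
	}
	return stateNeedsRebuild, true // genuinely stuck: escalate to rebuild
}

func main() {
	// Node was promoted (InSync) before the timeout fired: ignored.
	st, acted := fireCatchupTimeout(stateInSync)
	fmt.Println(st, acted) // InSync false

	// Node is still stuck catching up: timeout escalates to rebuild.
	st, acted = fireCatchupTimeout(stateCatchingUp)
	fmt.Println(st, acted) // NeedsRebuild true
}
```

An alternative design is cancelling the timeout eagerly at promotion time; the guard-at-fire approach also covers timeouts that were registered but never explicitly cancelled, which is the race these tests target.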
func TestP03_Race_PromotionThenStaleBarrierTimeout(t *testing.T) {
	// Barrier timeout registered for r1 at the old epoch.
	// Promotion bumps the epoch. The stale barrier timeout fires but must not
	// affect the new epoch's commit state.
	c := NewCluster(CommitSyncAll, "p", "r1")
	c.BarrierTimeoutTicks = 8

	// Write 1 — barrier to r1. Disconnect r1 so the barrier can't ack.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(1)

	// Tick 2 — barrier timeout registered at Now+8.
	c.TickN(2)

	// Primary crashes; promote r1 (even though it doesn't have write 1).
	c.StopNode("p")
	c.StartNode("r1")
	if err := c.Promote("r1"); err != nil {
		t.Fatal(err)
	}

	// Snapshot the committed prefix before the stale timeout window.
	committedBefore := c.Coordinator.CommittedLSN

	// r1 is now primary at the new epoch. Write new data.
	c.CommitWrite(10)
	c.TickN(10) // well past the barrier timeout deadline

	// Stale barrier timeout fires (from old epoch, old primary "p" → old replica "r1").
	barriersFired := c.FiredTimeoutsByKind(TimeoutBarrier)

	// Assert 1: the timed-out old barrier did not change the committed prefix
	// unexpectedly. CommittedLSN may advance from r1's new-epoch writes, but it
	// must not regress or be influenced by the stale barrier timeout.
	if c.Coordinator.CommittedLSN < committedBefore {
		t.Fatalf("committed prefix regressed: before=%d after=%d",
			committedBefore, c.Coordinator.CommittedLSN)
	}

	// Assert 2: the old-epoch barrier did not set DurableOn for new-epoch writes.
	// LSN 1 was written by old primary "p". Under the new epoch, DurableOn
	// should not have been modified by the stale barrier's timeout path.
	if p1 := c.Pending[1]; p1 != nil {
		if p1.DurableOn["r1"] {
			t.Fatal("stale barrier timeout should not set DurableOn[r1] for old-epoch LSN 1")
		}
	}

	// Assert 3: the old-epoch LSN 1 barrier is marked expired (stale timeout fired correctly).
	if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] {
		t.Fatal("old-epoch barrier for r1/LSN 1 should be in ExpiredBarriers")
	}

	t.Logf("promotion vs barrier timeout: committed=%d, fired=%d, DurableOn[r1]=%v, expired[r1/1]=%v",
		c.Coordinator.CommittedLSN, barriersFired,
		c.Pending[1] != nil && c.Pending[1].DurableOn["r1"],
		c.ExpiredBarriers[barrierExpiredKey{"r1", 1}])
}
// --- Race 2: Rebuild completion vs epoch bump ---

func TestP03_Race_RebuildCompletes_ThenEpochBumps(t *testing.T) {
	// r1 needs rebuild. The rebuild completes, but before r1 can rejoin,
	// the epoch bumps (another failover). The rebuild result is valid but
	// the replica must re-validate against the new epoch before rejoining.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	for i := uint64(1); i <= 10; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)
	c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN)

	// r1 needs rebuild.
	r1 := c.Nodes["r1"]
	r1.ReplicaState = NodeStateNeedsRebuild

	// Rebuild from snapshot — succeeds.
	c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN)
	r1.ReplicaState = NodeStateRebuilding // transitional

	// Before r1 can rejoin: the epoch bumps (simulate another failure/promotion).
	epochBefore := c.Coordinator.Epoch
	c.StopNode("p")
	if err := c.Promote("r2"); err != nil {
		t.Fatal(err)
	}
	epochAfter := c.Coordinator.Epoch

	if epochAfter <= epochBefore {
		t.Fatal("epoch should have bumped")
	}

	// Promote() sets every running node's epoch to the new coordinator epoch.
	// r1 is still running, so r1.Epoch == epochAfter, but r1.Role is still
	// RoleReplica and its rebuilt data predates the bump.

	// The rebuild data is from the OLD epoch's committed prefix.
	// Under the new primary (r2), the committed prefix may differ.
	// r1 must NOT be promoted to InSync until validated against the new epoch.

	// Eligibility check: r1 is Rebuilding — ineligible for promotion.
	e := c.EvaluateCandidateEligibility("r1")
	if e.Eligible {
		t.Fatal("r1 in Rebuilding state should not be eligible")
	}

	// r1 should NOT be InSync until it completes catch-up from the new primary.
	if r1.ReplicaState == NodeStateInSync {
		t.Fatal("r1 should not be InSync after epoch bump during rebuild")
	}

	// After catch-up from the new primary (r2), r1 can rejoin.
	r1.ReplicaState = NodeStateCatchingUp
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("r1 should converge from new primary")
	}
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatalf("r1 data incorrect after post-epoch-bump catch-up: %v", err)
	}

	t.Logf("rebuild vs epoch bump: r1 rebuilt at epoch %d, bumped to %d, caught up from r2",
		epochBefore, epochAfter)
}
func TestP03_Race_EpochBumpsDuringCatchupTimeout(t *testing.T) {
	// Catch-up timeout registered. The epoch bumps before the timeout fires.
	// The timeout is now stale (different epoch context) and must not mutate
	// state under the new epoch.
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	r1 := c.Nodes["r1"]
	r1.ReplicaState = NodeStateCatchingUp
	c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)

	// Epoch bumps (promotion) before the timeout.
	c.StopNode("p")
	if err := c.Promote("r1"); err != nil {
		t.Fatal(err)
	}
	// r1 is now primary. In production, promotion sets the role and resets the
	// replica state; the model requires setting it explicitly, so mark r1
	// InSync as the new primary.
	r1.ReplicaState = NodeStateInSync

	// Tick past the timeout deadline.
	c.TickN(15)

	// The timeout should be ignored (r1 is InSync, not CatchingUp).
	if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
		t.Fatal("catch-up timeout should not fire after epoch bump + promotion")
	}
	if len(c.IgnoredTimeouts) != 1 {
		t.Fatalf("expected 1 ignored (stale) timeout, got %d", len(c.IgnoredTimeouts))
	}
	if r1.ReplicaState != NodeStateInSync {
		t.Fatalf("r1 should remain InSync, got %s", r1.ReplicaState)
	}
	t.Logf("epoch bump vs timeout: stale catch-up timeout correctly ignored")
}
|||
// --- Trace quality: dump state on failure ---
|
|||
|
|||
func TestP03_TraceQuality_FailingScenarioDumpsState(t *testing.T) { |
|||
// Verify that the timeout model produces debuggable traces.
|
|||
// This test does NOT intentionally fail — it verifies that trace
|
|||
// information is available for inspection.
|
|||
c := NewCluster(CommitSyncAll, "p", "r1") |
|||
c.BarrierTimeoutTicks = 5 |
|||
|
|||
c.CommitWrite(1) |
|||
c.TickN(3) |
|||
|
|||
c.Disconnect("p", "r1") |
|||
c.Disconnect("r1", "p") |
|||
c.CommitWrite(2) |
|||
c.TickN(10) |
|||
|
|||
// Build trace.
|
|||
trace := BuildTrace(c) |
|||
|
|||
// Trace must contain key debugging information.
|
|||
if trace.Tick == 0 { |
|||
t.Fatal("trace should have non-zero tick") |
|||
} |
|||
if trace.CommittedLSN == 0 && len(c.Pending) == 0 { |
|||
t.Fatal("trace should reflect cluster state") |
|||
} |
|||
if len(trace.FiredTimeouts) == 0 { |
|||
t.Fatal("trace should include fired timeouts") |
|||
} |
|||
if len(trace.NodeStates) < 2 { |
|||
t.Fatal("trace should include all node states") |
|||
} |
|||
if trace.Deliveries == 0 { |
|||
t.Fatal("trace should include deliveries") |
|||
} |
|||
|
|||
t.Logf("trace: tick=%d committed=%d fired_timeouts=%d deliveries=%d nodes=%v", |
|||
trace.Tick, trace.CommittedLSN, len(trace.FiredTimeouts), |
|||
trace.Deliveries, trace.NodeStates) |
|||
} |
|||
|
|||
// Trace infrastructure lives in eventsim.go (BuildTrace / Trace type).
|
|||
@@ -0,0 +1,333 @@
package distsim

import (
	"testing"
)

// ============================================================
// Phase 03 P0: Timeout-backed scenarios
// ============================================================

// --- Barrier timeout ---
func TestP03_BarrierTimeout_SyncAllBlocked(t *testing.T) {
	// Barrier sent to replica, link goes down, ack never arrives.
	// Barrier timeout fires → barrier removed from queue.
	// sync_all: the write stays uncommitted.
	c := NewCluster(CommitSyncAll, "p", "r1")
	c.BarrierTimeoutTicks = 5

	c.CommitWrite(1)
	c.TickN(10) // enough for a barrier timeout to fire and for normal commit

	// LSN 1: p self-acks, r1 acks. sync_all: both must ack. Should commit.
	if c.Coordinator.CommittedLSN != 1 {
		t.Fatalf("LSN 1 should commit normally, got committed=%d", c.Coordinator.CommittedLSN)
	}
	// No timeouts fired for LSN 1 (ack arrived in time).
	if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 {
		t.Fatal("no barrier timeouts should have fired for LSN 1")
	}

	// Now disconnect r1. Write LSN 2. The barrier can't be acked.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.TickN(10) // barrier timeout fires after 5 ticks

	// The barrier timeout should have fired for r1/LSN 2.
	if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 {
		t.Fatalf("expected 1 barrier timeout, got %d", c.FiredTimeoutsByKind(TimeoutBarrier))
	}

	// sync_all: LSN 2 NOT committed (r1 never acked).
	if c.Coordinator.CommittedLSN != 1 {
		t.Fatalf("LSN 2 should not commit under sync_all without r1 ack, committed=%d",
			c.Coordinator.CommittedLSN)
	}

	// Barrier removed from queue (no indefinite re-queuing).
	for _, item := range c.Queue {
		if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 {
			t.Fatal("timed-out barrier should be removed from queue")
		}
	}
	t.Logf("barrier timeout: LSN 2 uncommitted, barrier cleaned from queue")
}
func TestP03_BarrierTimeout_SyncQuorum_StillCommits(t *testing.T) { |
|||
// RF=3 sync_quorum: r1 times out, but r2 acks → quorum met → commits.
|
|||
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") |
|||
c.BarrierTimeoutTicks = 5 |
|||
|
|||
// Disconnect r1 only. r2 stays connected.
|
|||
c.Disconnect("p", "r1") |
|||
c.Disconnect("r1", "p") |
|||
|
|||
c.CommitWrite(1) |
|||
c.TickN(10) |
|||
|
|||
// r1 barrier times out, but r2 acked. quorum = p + r2 = 2 of 3.
|
|||
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 { |
|||
t.Fatalf("expected 1 barrier timeout (r1), got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) |
|||
} |
|||
if c.Coordinator.CommittedLSN != 1 { |
|||
t.Fatalf("LSN 1 should commit via quorum (p+r2), committed=%d", c.Coordinator.CommittedLSN) |
|||
} |
|||
t.Logf("barrier timeout: r1 timed out, LSN 1 committed via quorum") |
|||
} |
|||
|
|||
// --- Catch-up timeout ---
|
|||
|
|||
func TestP03_CatchupTimeout_EscalatesToNeedsRebuild(t *testing.T) { |
|||
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") |
|||
|
|||
c.CommitWrite(1) |
|||
c.TickN(5) |
|||
|
|||
// r1 disconnects, primary writes more.
|
|||
c.Disconnect("p", "r1") |
|||
c.Disconnect("r1", "p") |
|||
for i := uint64(2); i <= 20; i++ { |
|||
c.CommitWrite(i) |
|||
} |
|||
c.TickN(5) |
|||
c.Connect("p", "r1") |
|||
c.Connect("r1", "p") |
|||
|
|||
// Register catch-up timeout: 3 ticks from now.
|
|||
r1 := c.Nodes["r1"] |
|||
r1.ReplicaState = NodeStateCatchingUp |
|||
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+3) |
|||
|
|||
// Tick 3 times — timeout fires before catch-up completes.
|
|||
c.TickN(3) |
|||
|
|||
if r1.ReplicaState != NodeStateNeedsRebuild { |
|||
t.Fatalf("catch-up timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState) |
|||
} |
|||
if c.FiredTimeoutsByKind(TimeoutCatchup) != 1 { |
|||
t.Fatalf("expected 1 catchup timeout, got %d", c.FiredTimeoutsByKind(TimeoutCatchup)) |
|||
} |
|||
t.Logf("catch-up timeout: escalated to NeedsRebuild after 3 ticks") |
|||
} |
|||
|
|||
// --- Reservation expiry as timeout event ---
|
|||
|
|||
func TestP03_ReservationTimeout_AbortsCatchup(t *testing.T) { |
|||
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") |
|||
|
|||
for i := uint64(1); i <= 10; i++ { |
|||
c.CommitWrite(i) |
|||
} |
|||
c.TickN(5) |
|||
|
|||
// r1 disconnects, more writes.
|
|||
c.Disconnect("p", "r1") |
|||
c.Disconnect("r1", "p") |
|||
for i := uint64(11); i <= 30; i++ { |
|||
c.CommitWrite(i) |
|||
} |
|||
c.TickN(5) |
|||
c.Connect("p", "r1") |
|||
c.Connect("r1", "p") |
|||
|
|||
// Register reservation timeout: 2 ticks.
|
|||
r1 := c.Nodes["r1"] |
|||
r1.ReplicaState = NodeStateCatchingUp |
|||
c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+2) |
|||
|
|||
c.TickN(2) |
|||
|
|||
if r1.ReplicaState != NodeStateNeedsRebuild { |
|||
t.Fatalf("reservation timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState) |
|||
} |
|||
if c.FiredTimeoutsByKind(TimeoutReservation) != 1 { |
|||
t.Fatalf("expected 1 reservation timeout, got %d", c.FiredTimeoutsByKind(TimeoutReservation)) |
|||
} |
|||
} |
|||
|
|||
// --- Timer-race scenarios: same-tick resolution ---
|
|||
|
|||
func TestP03_Race_AckArrivesBeforeTimeout_Cancels(t *testing.T) { |
|||
// Barrier ack arrives in the same tick as the timeout deadline.
|
|||
// Rule: data events (ack) process before timeouts → timeout is cancelled.
|
|||
c := NewCluster(CommitSyncAll, "p", "r1") |
|||
c.BarrierTimeoutTicks = 4 // timeout at Now+4
|
|||
|
|||
c.CommitWrite(1) // barrier enqueued at Now+2, ack back at Now+3
|
|||
// Barrier timeout registered at Now+4.
|
|||
|
|||
// Tick 1: write delivered.
|
|||
// Tick 2: barrier delivered, ack enqueued at Now+1 = tick 3.
|
|||
// Tick 3: ack delivered → cancels timeout.
|
|||
// Tick 4: timeout deadline reached — but already cancelled.
|
|||
c.TickN(5) |
|||
|
|||
// Ack arrived first → timeout cancelled → LSN 1 committed.
|
|||
if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 { |
|||
t.Fatal("barrier timeout should be cancelled by ack arriving first") |
|||
} |
|||
if c.Coordinator.CommittedLSN != 1 { |
|||
t.Fatalf("LSN 1 should commit (ack arrived before timeout), committed=%d", |
|||
c.Coordinator.CommittedLSN) |
|||
} |
|||
t.Logf("race resolved: ack cancelled timeout, LSN 1 committed") |
|||
} |
|||
|
|||
func TestP03_Race_TimeoutBeforeAck_Fires(t *testing.T) { |
|||
// Timeout fires before barrier can deliver (timeout < barrier delivery time).
|
|||
// CommitWrite enqueues barrier at Now+2. Timeout at Now+1 fires first.
|
|||
c := NewCluster(CommitSyncAll, "p", "r1") |
|||
c.BarrierTimeoutTicks = 1 // timeout at Now+1 — before barrier delivers at Now+2
|
|||
|
|||
c.CommitWrite(1) |
|||
c.TickN(5) |
|||
|
|||
// Timeout fires at tick 1. Barrier would deliver at tick 2, but timeout
|
|||
// removes it from queue first.
|
|||
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 { |
|||
t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) |
|||
} |
|||
// sync_all: uncommitted (r1 never acked).
|
|||
if c.Coordinator.CommittedLSN != 0 { |
|||
t.Fatalf("LSN 1 should not commit (timeout before barrier delivery), committed=%d", |
|||
c.Coordinator.CommittedLSN) |
|||
} |
|||
t.Logf("race resolved: timeout fired before barrier delivery, LSN 1 uncommitted") |
|||
} |
|||
|
|||
func TestP03_Race_CatchupConverges_CancelsTimeout(t *testing.T) { |
|||
// Catch-up completes before the timeout fires.
|
|||
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") |
|||
|
|||
c.CommitWrite(1) |
|||
c.TickN(5) |
|||
|
|||
c.Disconnect("p", "r1") |
|||
c.Disconnect("r1", "p") |
|||
c.CommitWrite(2) |
|||
c.CommitWrite(3) |
|||
c.TickN(5) |
|||
c.Connect("p", "r1") |
|||
c.Connect("r1", "p") |
|||
|
|||
// Register catch-up timeout: 10 ticks (generous).
|
|||
r1 := c.Nodes["r1"] |
|||
r1.ReplicaState = NodeStateCatchingUp |
|||
c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10) |
|||
|
|||
// Catch-up completes immediately (small gap).
|
|||
// CatchUpWithEscalation auto-cancels recovery timeouts on convergence.
|
|||
converged := c.CatchUpWithEscalation("r1", 100) |
|||
if !converged { |
|||
t.Fatal("catch-up should converge for small gap") |
|||
} |
|||
|
|||
// Tick past deadline — timeout should already be cancelled.
|
|||
c.TickN(15) |
|||
|
|||
// Timeout should NOT have fired (was cancelled).
|
|||
if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 { |
|||
t.Fatal("catch-up timeout should be cancelled on convergence") |
|||
} |
|||
if r1.ReplicaState != NodeStateInSync { |
|||
t.Fatalf("r1 should be InSync after convergence, got %s", r1.ReplicaState) |
|||
} |
|||
t.Logf("race resolved: catch-up converged, timeout auto-cancelled") |
|||
} |
|||
|
|||
// --- Stale timeout hardening ---
|
|||
|
|||
func TestP03_StaleReservationTimeout_AfterRecoverySuccess(t *testing.T) { |
|||
// Reservation timeout registered, but recovery completes before deadline.
|
|||
// The stale timeout must NOT regress state from InSync back to NeedsRebuild.
|
|||
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") |
|||
|
|||
c.CommitWrite(1) |
|||
c.CommitWrite(2) |
|||
c.TickN(5) |
|||
|
|||
c.Disconnect("p", "r1") |
|||
c.Disconnect("r1", "p") |
|||
c.CommitWrite(3) |
|||
c.TickN(5) |
|||
c.Connect("p", "r1") |
|||
c.Connect("r1", "p") |
|||
|
|||
// Register reservation timeout: 10 ticks.
|
|||
r1 := c.Nodes["r1"] |
|||
r1.ReplicaState = NodeStateCatchingUp |
|||
c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+10) |
|||
|
|||
// Catch-up succeeds immediately — auto-cancels reservation timeout.
|
|||
converged := c.CatchUpWithEscalation("r1", 100) |
|||
if !converged { |
|||
t.Fatal("catch-up should converge") |
|||
} |
|||
if r1.ReplicaState != NodeStateInSync { |
|||
t.Fatalf("expected InSync after convergence, got %s", r1.ReplicaState) |
|||
} |
|||
|
|||
// Tick well past the deadline.
|
|||
c.TickN(20) |
|||
|
|||
// Stale reservation timeout must NOT fire (cancelled by convergence).
|
|||
if c.FiredTimeoutsByKind(TimeoutReservation) != 0 { |
|||
t.Fatal("stale reservation timeout should not fire after recovery success") |
|||
} |
|||
if r1.ReplicaState != NodeStateInSync { |
|||
t.Fatalf("stale timeout regressed state: expected InSync, got %s", r1.ReplicaState) |
|||
} |
|||
t.Logf("stale reservation timeout correctly suppressed after recovery") |
|||
} |
|||
|
|||
func TestP03_LateBarrierAck_AfterTimeout_Rejected(t *testing.T) { |
|||
// Barrier times out, then a late ack arrives. The late ack must be
|
|||
// rejected — it must not count toward DurableOn.
|
|||
c := NewCluster(CommitSyncAll, "p", "r1") |
|||
c.BarrierTimeoutTicks = 1 // timeout at Now+1
|
|||
|
|||
c.CommitWrite(1) |
|||
|
|||
// Tick 1: write delivered, timeout fires (barrier at Now+2 not yet delivered).
|
|||
c.TickN(1) |
|||
|
|||
if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 { |
|||
t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) |
|||
} |
|||
|
|||
// LSN 1 should NOT be committed.
|
|||
if c.Coordinator.CommittedLSN != 0 { |
|||
t.Fatalf("LSN 1 should not be committed after timeout, got %d", c.Coordinator.CommittedLSN) |
|||
} |
|||
|
|||
// Now inject a late barrier ack (as if the network delayed it massively).
|
|||
c.InjectMessage(Message{ |
|||
Kind: MsgBarrierAck, |
|||
From: "r1", |
|||
To: "p", |
|||
Epoch: c.Coordinator.Epoch, |
|||
TargetLSN: 1, |
|||
}, c.Now+1) |
|||
|
|||
c.TickN(5) |
|||
|
|||
// Late ack must be rejected with barrier_expired reason.
|
|||
expiredRejects := c.RejectedByReason(RejectBarrierExpired) |
|||
if expiredRejects == 0 { |
|||
t.Fatal("late barrier ack should be rejected as barrier_expired") |
|||
} |
|||
|
|||
// LSN 1 must still be uncommitted (late ack did not count).
|
|||
if c.Coordinator.CommittedLSN != 0 { |
|||
t.Fatalf("late ack should not commit LSN 1, got committed=%d", c.Coordinator.CommittedLSN) |
|||
} |
|||
|
|||
// DurableOn should NOT include r1.
|
|||
p1 := c.Pending[1] |
|||
if p1 != nil && p1.DurableOn["r1"] { |
|||
t.Fatal("late ack should not set DurableOn for r1") |
|||
} |
|||
t.Logf("late barrier ack: rejected as barrier_expired, LSN 1 stays uncommitted") |
|||
} |
|||
@ -0,0 +1,243 @@
package distsim

import "testing"

// ============================================================
// Phase 04a: Session ownership validation in distsim
// ============================================================

// --- Scenario 1: Endpoint change during active catch-up ---

func TestP04a_EndpointChangeDuringCatchup_InvalidatesSession(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	// Start catch-up session for r1.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	trigger, sessID, ok := c.TriggerRecoverySession("r1")
	if !ok || trigger != TriggerReassignment {
		t.Fatalf("should trigger reassignment, got %s/%v", trigger, ok)
	}

	// Session is active.
	sess := c.Sessions["r1"]
	if !sess.Active {
		t.Fatal("session should be active")
	}

	// Endpoint changes (replica restarts on new address).
	c.StopNode("r1")
	c.RestartNodeWithNewAddress("r1")

	// Session invalidated by endpoint change.
	if sess.Active {
		t.Fatal("session should be invalidated after endpoint change")
	}
	if sess.Reason != "endpoint_changed" {
		t.Fatalf("invalidation reason: got %q, want endpoint_changed", sess.Reason)
	}

	// Stale completion from old session is rejected.
	if c.CompleteRecoverySession("r1", sessID) {
		t.Fatal("stale session completion should be rejected")
	}
	t.Logf("endpoint change: session %d invalidated, stale completion rejected", sessID)
}

// --- Scenario 2: Epoch bump during active catch-up ---

func TestP04a_EpochBumpDuringCatchup_InvalidatesSession(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	_, sessID, ok := c.TriggerRecoverySession("r1")
	if !ok {
		t.Fatal("trigger should succeed")
	}
	sess := c.Sessions["r1"]

	// Epoch bumps (promotion).
	c.StopNode("p")
	c.Promote("r2")

	// Session invalidated by epoch bump.
	if sess.Active {
		t.Fatal("session should be invalidated after epoch bump")
	}
	if sess.Reason != "epoch_bump_promotion" {
		t.Fatalf("reason: got %q", sess.Reason)
	}

	// Stale completion rejected.
	if c.CompleteRecoverySession("r1", sessID) {
		t.Fatal("stale completion after epoch bump should be rejected")
	}
	t.Logf("epoch bump: session %d invalidated, completion rejected", sessID)
}

// --- Scenario 3: Stale late completion from old session ---

func TestP04a_StaleCompletion_AfterSupersede_Rejected(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// First session.
	_, oldSessID, _ := c.TriggerRecoverySession("r1")
	oldSess := c.Sessions["r1"]

	// Invalidate old session manually (simulate timeout or abort).
	c.InvalidateReplicaSession("r1", "timeout")
	if oldSess.Active {
		t.Fatal("old session should be invalidated")
	}

	// New session triggered.
	c.Nodes["r1"].ReplicaState = NodeStateLagging // reset state to allow retrigger
	_, newSessID, ok := c.TriggerRecoverySession("r1")
	if !ok {
		t.Fatal("second trigger should succeed after invalidation")
	}
	newSess := c.Sessions["r1"]

	// Old session completion attempt — must be rejected by ID mismatch.
	if c.CompleteRecoverySession("r1", oldSessID) {
		t.Fatal("old session completion must be rejected")
	}
	// New session still active.
	if !newSess.Active {
		t.Fatal("new session should still be active")
	}

	// New session completion succeeds.
	if !c.CompleteRecoverySession("r1", newSessID) {
		t.Fatal("new session completion should succeed")
	}
	if c.Nodes["r1"].ReplicaState != NodeStateInSync {
		t.Fatalf("r1 should be InSync after new session completes, got %s", c.Nodes["r1"].ReplicaState)
	}
	t.Logf("stale completion: old=%d rejected, new=%d accepted", oldSessID, newSessID)
}

// --- Scenario 4: Duplicate recovery trigger while session active ---

func TestP04a_DuplicateTrigger_WhileActive_Rejected(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.TickN(5)

	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	c.CommitWrite(2)
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// First trigger succeeds.
	_, _, ok := c.TriggerRecoverySession("r1")
	if !ok {
		t.Fatal("first trigger should succeed")
	}

	// Duplicate trigger while session active — rejected.
	_, _, ok = c.TriggerRecoverySession("r1")
	if ok {
		t.Fatal("duplicate trigger should be rejected while session active")
	}

	// Session count: only one in history.
	sessCount := 0
	for _, s := range c.SessionHistory {
		if s.ReplicaID == "r1" {
			sessCount++
		}
	}
	if sessCount != 1 {
		t.Fatalf("should have exactly 1 session in history, got %d", sessCount)
	}
	t.Logf("duplicate trigger correctly rejected")
}

// --- Scenario 5: Session tracking through full lifecycle ---

func TestP04a_FullLifecycle_SessionTracking(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")

	c.CommitWrite(1)
	c.CommitWrite(2)
	c.TickN(5)

	// Disconnect, write, reconnect.
	c.Disconnect("p", "r1")
	c.Disconnect("r1", "p")
	for i := uint64(3); i <= 10; i++ {
		c.CommitWrite(i)
	}
	c.TickN(5)
	c.Connect("p", "r1")
	c.Connect("r1", "p")

	// Trigger session.
	trigger, sessID, ok := c.TriggerRecoverySession("r1")
	if !ok {
		t.Fatal("trigger failed")
	}
	if trigger != TriggerReassignment {
		t.Fatalf("expected reassignment, got %s", trigger)
	}

	// Catch up.
	converged := c.CatchUpWithEscalation("r1", 100)
	if !converged {
		t.Fatal("catch-up should converge")
	}

	// Complete session.
	if !c.CompleteRecoverySession("r1", sessID) {
		t.Fatal("completion should succeed")
	}

	// Verify final state.
	if c.Nodes["r1"].ReplicaState != NodeStateInSync {
		t.Fatalf("r1 should be InSync, got %s", c.Nodes["r1"].ReplicaState)
	}
	if err := c.AssertCommittedRecoverable("r1"); err != nil {
		t.Fatalf("data incorrect: %v", err)
	}

	// Session in history, not active.
	sess := c.Sessions["r1"]
	if sess.Active {
		t.Fatal("session should not be active after completion")
	}
	if len(c.SessionHistory) != 1 {
		t.Fatalf("expected 1 session in history, got %d", len(c.SessionHistory))
	}
	t.Logf("full lifecycle: trigger=%s session=%d → catch-up → complete → InSync", trigger, sessID)
}
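All five scenarios above reduce to one mechanism: every recovery session gets a fresh, monotonically increasing ID, and completion is honoured only for the current ID of a live session. A minimal sketch under those assumptions (`sessionTracker` and its methods are hypothetical names, not distsim's API):

```go
package main

import "fmt"

// sessionTracker is a toy model of ID-based stale fencing: one live session
// per replica, identified by a monotonically increasing ID.
type sessionTracker struct {
	nextID  uint64 // monotonic ID source
	current uint64 // ID of the live session, if any
	active  bool
}

// trigger starts a new session; duplicate triggers while one is live fail.
func (s *sessionTracker) trigger() (uint64, bool) {
	if s.active {
		return 0, false
	}
	s.nextID++
	s.current = s.nextID
	s.active = true
	return s.current, true
}

// invalidate ends the live session (endpoint change, epoch bump, timeout).
func (s *sessionTracker) invalidate() { s.active = false }

// complete succeeds only when the caller holds the current live session ID,
// so a late completion from a superseded session is rejected.
func (s *sessionTracker) complete(id uint64) bool {
	if !s.active || id != s.current {
		return false
	}
	s.active = false
	return true
}

func main() {
	var s sessionTracker
	old, _ := s.trigger()
	s.invalidate()          // e.g. endpoint changed mid catch-up
	fresh, _ := s.trigger() // new session supersedes the old one
	fmt.Println(s.complete(old), s.complete(fresh)) // false true
}
```

The ID comparison is what makes the fencing robust: it rejects stale completions even when the old and new sessions belong to the same replica.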
@ -0,0 +1,102 @@
package distsim

type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

type ProtocolPolicy struct {
	Version ProtocolVersion
}

func (p ProtocolPolicy) CanAttemptCatchup(addressStable bool) bool {
	switch p.Version {
	case ProtocolV1:
		return false
	case ProtocolV15:
		return addressStable
	case ProtocolV2:
		return true
	default:
		return false
	}
}

func (p ProtocolPolicy) BriefDisconnectAction(addressStable, recoverable bool) string {
	switch p.Version {
	case ProtocolV1:
		return "degrade_or_rebuild"
	case ProtocolV15:
		if addressStable && recoverable {
			return "catchup_if_history_survives"
		}
		return "stall_or_control_plane_recovery"
	case ProtocolV2:
		if recoverable {
			return "reserved_catchup"
		}
		return "explicit_rebuild"
	default:
		return "unknown"
	}
}

func (p ProtocolPolicy) TailChasingAction(converged bool) string {
	switch p.Version {
	case ProtocolV1:
		if converged {
			return "unexpected_catchup"
		}
		return "degrade"
	case ProtocolV15:
		if converged {
			return "catchup"
		}
		return "stall_or_rebuild"
	case ProtocolV2:
		if converged {
			return "catchup"
		}
		return "abort_to_rebuild"
	default:
		return "unknown"
	}
}

func (p ProtocolPolicy) RestartRejoinAction(addressStable bool) string {
	switch p.Version {
	case ProtocolV1:
		return "control_plane_only"
	case ProtocolV15:
		if addressStable {
			return "background_reconnect_or_control_plane"
		}
		return "control_plane_only"
	case ProtocolV2:
		if addressStable {
			return "direct_reconnect_or_control_plane"
		}
		return "explicit_reassignment_or_rebuild"
	default:
		return "unknown"
	}
}

func (p ProtocolPolicy) ChangedAddressRestartAction(recoverable bool) string {
	switch p.Version {
	case ProtocolV1:
		return "control_plane_only"
	case ProtocolV15:
		return "control_plane_only"
	case ProtocolV2:
		if recoverable {
			return "explicit_reassignment_then_catchup"
		}
		return "explicit_reassignment_or_rebuild"
	default:
		return "unknown"
	}
}
@ -0,0 +1,84 @@
package distsim

import "testing"

func TestProtocolV1CannotAttemptCatchup(t *testing.T) {
	p := ProtocolPolicy{Version: ProtocolV1}
	if p.CanAttemptCatchup(true) {
		t.Fatal("v1 should not expose meaningful catch-up path")
	}
}

func TestProtocolV15CatchupDependsOnStableAddress(t *testing.T) {
	p := ProtocolPolicy{Version: ProtocolV15}
	if !p.CanAttemptCatchup(true) {
		t.Fatal("v1.5 should allow catch-up when address is stable")
	}
	if p.CanAttemptCatchup(false) {
		t.Fatal("v1.5 should not assume reconnect with changed address")
	}
}

func TestProtocolV2AllowsCatchupByPolicy(t *testing.T) {
	p := ProtocolPolicy{Version: ProtocolV2}
	if !p.CanAttemptCatchup(true) || !p.CanAttemptCatchup(false) {
		t.Fatal("v2 policy should allow catch-up attempt subject to explicit recoverability checks")
	}
}

func TestProtocolBriefDisconnectActions(t *testing.T) {
	if got := (ProtocolPolicy{Version: ProtocolV1}).BriefDisconnectAction(true, true); got != "degrade_or_rebuild" {
		t.Fatalf("v1 brief-disconnect action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV15}).BriefDisconnectAction(true, true); got != "catchup_if_history_survives" {
		t.Fatalf("v1.5 brief-disconnect action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV15}).BriefDisconnectAction(false, true); got != "stall_or_control_plane_recovery" {
		t.Fatalf("v1.5 changed-address brief-disconnect action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(true, false); got != "explicit_rebuild" {
		t.Fatalf("v2 unrecoverable brief-disconnect action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(false, true); got != "reserved_catchup" {
		t.Fatalf("v2 recoverable brief-disconnect action = %s", got)
	}
}

func TestProtocolTailChasingActions(t *testing.T) {
	if got := (ProtocolPolicy{Version: ProtocolV1}).TailChasingAction(false); got != "degrade" {
		t.Fatalf("v1 tail-chasing action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV15}).TailChasingAction(false); got != "stall_or_rebuild" {
		t.Fatalf("v1.5 tail-chasing action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV2}).TailChasingAction(false); got != "abort_to_rebuild" {
		t.Fatalf("v2 tail-chasing action = %s", got)
	}
}

func TestProtocolRestartRejoinActions(t *testing.T) {
	if got := (ProtocolPolicy{Version: ProtocolV1}).RestartRejoinAction(true); got != "control_plane_only" {
		t.Fatalf("v1 restart action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV15}).RestartRejoinAction(false); got != "control_plane_only" {
		t.Fatalf("v1.5 changed-address restart action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV2}).RestartRejoinAction(false); got != "explicit_reassignment_or_rebuild" {
		t.Fatalf("v2 changed-address restart action = %s", got)
	}
}

func TestProtocolChangedAddressRestartActions(t *testing.T) {
	if got := (ProtocolPolicy{Version: ProtocolV1}).ChangedAddressRestartAction(true); got != "control_plane_only" {
		t.Fatalf("v1 changed-address restart action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV15}).ChangedAddressRestartAction(true); got != "control_plane_only" {
		t.Fatalf("v1.5 changed-address restart action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" {
		t.Fatalf("v2 recoverable changed-address restart action = %s", got)
	}
	if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" {
		t.Fatalf("v2 unrecoverable changed-address restart action = %s", got)
	}
}
@ -0,0 +1,256 @@
package distsim

import (
	"fmt"
	"math/rand"
	"sort"
)

type RandomEvent string

const (
	RandomCommitWrite  RandomEvent = "commit_write"
	RandomTick         RandomEvent = "tick"
	RandomDisconnect   RandomEvent = "disconnect"
	RandomReconnect    RandomEvent = "reconnect"
	RandomStopNode     RandomEvent = "stop_node"
	RandomStartNode    RandomEvent = "start_node"
	RandomPromote      RandomEvent = "promote"
	RandomTakeSnapshot RandomEvent = "take_snapshot"
	RandomCatchup      RandomEvent = "catchup"
	RandomRebuild      RandomEvent = "rebuild"
)

type RandomStep struct {
	Step   int
	Event  RandomEvent
	Detail string
}

type RandomResult struct {
	Seed      int64
	Steps     []RandomStep
	Cluster   *Cluster
	Snapshots []string
}

func RunRandomScenario(seed int64, steps int) (*RandomResult, error) {
	rng := rand.New(rand.NewSource(seed))
	cluster := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	result := &RandomResult{
		Seed:    seed,
		Cluster: cluster,
	}

	for i := 0; i < steps; i++ {
		step, err := runRandomStep(cluster, rng, i)
		if err != nil {
			result.Steps = append(result.Steps, step)
			return result, err
		}
		result.Steps = append(result.Steps, step)
		if err := assertClusterInvariants(cluster); err != nil {
			return result, fmt.Errorf("seed=%d step=%d event=%s detail=%s: %w", seed, i, step.Event, step.Detail, err)
		}
	}
	return result, assertClusterInvariants(cluster)
}

func runRandomStep(c *Cluster, rng *rand.Rand, step int) (RandomStep, error) {
	events := []RandomEvent{
		RandomCommitWrite,
		RandomTick,
		RandomDisconnect,
		RandomReconnect,
		RandomStopNode,
		RandomStartNode,
		RandomPromote,
		RandomTakeSnapshot,
		RandomCatchup,
		RandomRebuild,
	}
	ev := events[rng.Intn(len(events))]
	rs := RandomStep{Step: step, Event: ev}

	switch ev {
	case RandomCommitWrite:
		block := uint64(rng.Intn(8) + 1)
		lsn := c.CommitWrite(block)
		rs.Detail = fmt.Sprintf("block=%d lsn=%d", block, lsn)
	case RandomTick:
		n := rng.Intn(3) + 1
		c.TickN(n)
		rs.Detail = fmt.Sprintf("ticks=%d", n)
	case RandomDisconnect:
		from, to := randomPair(c, rng)
		c.Disconnect(from, to)
		rs.Detail = fmt.Sprintf("%s->%s", from, to)
	case RandomReconnect:
		from, to := randomPair(c, rng)
		c.Connect(from, to)
		rs.Detail = fmt.Sprintf("%s->%s", from, to)
	case RandomStopNode:
		id := randomNodeID(c, rng)
		c.StopNode(id)
		rs.Detail = id
	case RandomStartNode:
		id := randomNodeID(c, rng)
		c.StartNode(id)
		rs.Detail = id
	case RandomPromote:
		if primary := c.Primary(); primary != nil && primary.Running {
			rs.Detail = "primary_still_running"
			return rs, nil
		}
		candidates := promotableNodes(c)
		if len(candidates) == 0 {
			rs.Detail = "no_candidate"
			return rs, nil
		}
		id := candidates[rng.Intn(len(candidates))]
		rs.Detail = id
		if err := c.Promote(id); err != nil {
			return rs, err
		}
	case RandomTakeSnapshot:
		primary := c.Primary()
		if primary == nil || !primary.Running {
			rs.Detail = "no_primary"
			return rs, nil
		}
		lsn := c.Coordinator.CommittedLSN
		id := fmt.Sprintf("snap-%s-%d", primary.ID, lsn)
		primary.Storage.TakeSnapshot(id, lsn)
		rs.Detail = fmt.Sprintf("%s@%d", id, lsn)
	case RandomCatchup:
		id := randomReplicaID(c, rng)
		if id == "" {
			rs.Detail = "no_replica"
			return rs, nil
		}
		node := c.Nodes[id]
		if node == nil || !node.Running {
			rs.Detail = id + ":down"
			return rs, nil
		}
		start := node.Storage.FlushedLSN
		end := c.Coordinator.CommittedLSN
		if end <= start {
			rs.Detail = fmt.Sprintf("%s:no_gap", id)
			return rs, nil
		}
		rs.Detail = fmt.Sprintf("%s:%d..%d", id, start+1, end)
		if err := c.RecoverReplicaFromPrimary(id, start, end); err != nil {
			return rs, err
		}
	case RandomRebuild:
		id := randomReplicaID(c, rng)
		if id == "" {
			rs.Detail = "no_replica"
			return rs, nil
		}
		primary := c.Primary()
		node := c.Nodes[id]
		if primary == nil || node == nil || !primary.Running || !node.Running {
			rs.Detail = id + ":unavailable"
			return rs, nil
		}
		snapshotIDs := make([]string, 0, len(primary.Storage.Snapshots))
		for snapID := range primary.Storage.Snapshots {
			snapshotIDs = append(snapshotIDs, snapID)
		}
		if len(snapshotIDs) == 0 {
			rs.Detail = id + ":no_snapshot"
			return rs, nil
		}
		sort.Strings(snapshotIDs)
		snapID := snapshotIDs[rng.Intn(len(snapshotIDs))]
		rs.Detail = fmt.Sprintf("%s:%s->%d", id, snapID, c.Coordinator.CommittedLSN)
		if err := c.RebuildReplicaFromSnapshot(id, snapID, c.Coordinator.CommittedLSN); err != nil {
			return rs, err
		}
	default:
		return rs, fmt.Errorf("unknown random event %s", ev)
	}

	return rs, nil
}

func randomNodeID(c *Cluster, rng *rand.Rand) string {
	ids := append([]string(nil), c.Coordinator.Members...)
	sort.Strings(ids)
	if len(ids) == 0 {
		return ""
	}
	return ids[rng.Intn(len(ids))]
}

func randomReplicaID(c *Cluster, rng *rand.Rand) string {
	ids := c.replicaIDs()
	if len(ids) == 0 {
		return ""
	}
	return ids[rng.Intn(len(ids))]
}

func randomPair(c *Cluster, rng *rand.Rand) (string, string) {
	from := randomNodeID(c, rng)
	to := randomNodeID(c, rng)
	if from == to {
		ids := append([]string(nil), c.Coordinator.Members...)
		sort.Strings(ids)
		for _, id := range ids {
			if id != from {
				to = id
				break
			}
		}
	}
	return from, to
}

func promotableNodes(c *Cluster) []string {
	out := make([]string, 0)
	want := c.Reference.StateAt(c.Coordinator.CommittedLSN)
	for _, id := range c.Coordinator.Members {
		n := c.Nodes[id]
		if n == nil || !n.Running || n.Storage.FlushedLSN < c.Coordinator.CommittedLSN {
			continue
		}
		if !EqualState(n.Storage.StateAt(c.Coordinator.CommittedLSN), want) {
			continue
		}
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

func assertClusterInvariants(c *Cluster) error {
	committed := c.Coordinator.CommittedLSN
	want := c.Reference.StateAt(committed)

	for lsn, p := range c.Pending {
		if p.Committed && lsn > committed {
			return fmt.Errorf("pending lsn %d marked committed above coordinator committed lsn %d", lsn, committed)
		}
	}

	for _, id := range promotableNodes(c) {
		n := c.Nodes[id]
		got := n.Storage.StateAt(committed)
		if !EqualState(got, want) {
			return fmt.Errorf("promotable node %s mismatch at committed lsn %d: got=%v want=%v", id, committed, got, want)
		}
	}

	primary := c.Primary()
	if primary != nil && primary.Running && primary.Epoch == c.Coordinator.Epoch {
		got := primary.Storage.StateAt(committed)
		if !EqualState(got, want) {
			return fmt.Errorf("primary %s mismatch at committed lsn %d: got=%v want=%v", primary.ID, committed, got, want)
		}
	}

	return nil
}
@@ -0,0 +1,43 @@
package distsim

import "testing"

func TestRandomScenarioSeeds(t *testing.T) {
	seeds := []int64{
		1, 2, 3, 4, 5,
		11, 21, 34, 55, 89,
		101, 202, 303, 404, 505,
	}

	for _, seed := range seeds {
		seed := seed
		t.Run("seed_"+itoa64(seed), func(t *testing.T) {
			t.Parallel()
			if _, err := RunRandomScenario(seed, 60); err != nil {
				t.Fatal(err)
			}
		})
	}
}

// itoa64 formats v in base 10 without importing strconv.
// (Negating math.MinInt64 would overflow, but test seeds stay small.)
func itoa64(v int64) string {
	if v == 0 {
		return "0"
	}
	neg := v < 0
	if neg {
		v = -v
	}
	buf := make([]byte, 0, 20)
	for v > 0 {
		buf = append(buf, byte('0'+v%10))
		v /= 10
	}
	if neg {
		buf = append(buf, '-')
	}
	for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 {
		buf[i], buf[j] = buf[j], buf[i]
	}
	return string(buf)
}
@@ -0,0 +1,95 @@
package distsim

type Write struct {
	LSN   uint64
	Block uint64
	Value uint64
}

type Snapshot struct {
	LSN   uint64
	State map[uint64]uint64
}

type Reference struct {
	writes    []Write
	snapshots map[uint64]Snapshot
}

func NewReference() *Reference {
	return &Reference{snapshots: map[uint64]Snapshot{}}
}

func (r *Reference) Apply(w Write) {
	r.writes = append(r.writes, w)
}

func (r *Reference) StateAt(lsn uint64) map[uint64]uint64 {
	state := make(map[uint64]uint64)
	for _, w := range r.writes {
		if w.LSN > lsn {
			break
		}
		state[w.Block] = w.Value
	}
	return state
}

func cloneMap(in map[uint64]uint64) map[uint64]uint64 {
	out := make(map[uint64]uint64, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

func (r *Reference) TakeSnapshot(lsn uint64) Snapshot {
	s := Snapshot{LSN: lsn, State: cloneMap(r.StateAt(lsn))}
	r.snapshots[lsn] = s
	return s
}

func (r *Reference) SnapshotAt(lsn uint64) (Snapshot, bool) {
	s, ok := r.snapshots[lsn]
	return s, ok
}

type Node struct {
	Extent map[uint64]uint64
}

func NewNode() *Node {
	return &Node{Extent: map[uint64]uint64{}}
}

func (n *Node) ApplyWrite(w Write) {
	n.Extent[w.Block] = w.Value
}

func (n *Node) LoadSnapshot(s Snapshot) {
	n.Extent = cloneMap(s.State)
}

func (n *Node) ReplayFromWrites(writes []Write, startExclusive, endInclusive uint64) {
	for _, w := range writes {
		if w.LSN <= startExclusive {
			continue
		}
		if w.LSN > endInclusive {
			break
		}
		n.ApplyWrite(w)
	}
}

func EqualState(a, b map[uint64]uint64) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if b[k] != v {
			return false
		}
	}
	return true
}
@@ -0,0 +1,66 @@
package distsim

import "testing"

func TestWALReplayPreservesHistoricalValue(t *testing.T) {
	ref := NewReference()
	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})

	node := NewNode()
	node.ReplayFromWrites(ref.writes, 0, 10)

	want := ref.StateAt(10)
	if !EqualState(node.Extent, want) {
		t.Fatalf("replay mismatch: got=%v want=%v", node.Extent, want)
	}
}

func TestCurrentExtentCannotRecoverOldLSN(t *testing.T) {
	ref := NewReference()
	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})

	primary := NewNode()
	for _, w := range ref.writes {
		primary.ApplyWrite(w)
	}

	wantOld := ref.StateAt(10)
	if EqualState(primary.Extent, wantOld) {
		t.Fatalf("latest extent should not equal old LSN state: latest=%v old=%v", primary.Extent, wantOld)
	}
}

func TestSnapshotAtCpLSNRecoversCorrectHistoricalValue(t *testing.T) {
	ref := NewReference()
	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
	snap := ref.TakeSnapshot(10)
	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})

	node := NewNode()
	node.LoadSnapshot(snap)

	want := ref.StateAt(10)
	if !EqualState(node.Extent, want) {
		t.Fatalf("snapshot mismatch: got=%v want=%v", node.Extent, want)
	}
}

func TestSnapshotPlusTrailingReplayReachesTargetLSN(t *testing.T) {
	ref := NewReference()
	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
	ref.Apply(Write{LSN: 11, Block: 2, Value: 11})
	snap := ref.TakeSnapshot(11)
	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
	ref.Apply(Write{LSN: 13, Block: 9, Value: 13})

	node := NewNode()
	node.LoadSnapshot(snap)
	node.ReplayFromWrites(ref.writes, 11, 13)

	want := ref.StateAt(13)
	if !EqualState(node.Extent, want) {
		t.Fatalf("snapshot+replay mismatch: got=%v want=%v", node.Extent, want)
	}
}
@@ -0,0 +1,581 @@
package distsim

import (
	"container/heap"
	"fmt"
	"math/rand"
	"strings"
)

// --- Event types ---

type EventKind int

const (
	EvWriteStart     EventKind = iota // client writes to primary
	EvShipEntry                       // primary sends WAL entry to replica
	EvShipDeliver                     // entry arrives at replica
	EvBarrierSend                     // primary sends barrier to replica
	EvBarrierDeliver                  // barrier arrives at replica
	EvBarrierFsync                    // replica fsync completes
	EvBarrierAck                      // ack arrives back at primary
	EvNodeCrash                       // node crashes
	EvNodeRestart                     // node restarts
	EvLinkDown                        // network link drops
	EvLinkUp                          // network link restores
	EvFlusherTick                     // flusher checkpoint cycle
	EvPromote                         // coordinator promotes a node
	EvLockAcquire                     // thread tries to acquire lock
	EvLockRelease                     // thread releases lock
)

func (k EventKind) String() string {
	names := [...]string{
		"WriteStart", "ShipEntry", "ShipDeliver",
		"BarrierSend", "BarrierDeliver", "BarrierFsync", "BarrierAck",
		"NodeCrash", "NodeRestart", "LinkDown", "LinkUp",
		"FlusherTick", "Promote",
		"LockAcquire", "LockRelease",
	}
	if int(k) < len(names) {
		return names[k]
	}
	return fmt.Sprintf("Event(%d)", k)
}

type Event struct {
	Time    uint64
	ID      uint64 // unique, for stable ordering
	Kind    EventKind
	NodeID  string
	Payload EventPayload
}

type EventPayload struct {
	Write     Write  // for WriteStart, ShipEntry, ShipDeliver
	TargetLSN uint64 // for barriers
	FromNode  string // for delivered messages
	ToNode    string
	LockName  string // for lock events
	ThreadID  string
	PromoteID string // for EvPromote
}

// --- Priority queue ---

type eventHeap []Event

func (h eventHeap) Len() int      { return len(h) }
func (h eventHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h eventHeap) Less(i, j int) bool {
	if h[i].Time != h[j].Time {
		return h[i].Time < h[j].Time
	}
	return h[i].ID < h[j].ID // stable tie-break
}
func (h *eventHeap) Push(x interface{}) { *h = append(*h, x.(Event)) }
func (h *eventHeap) Pop() interface{} {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// --- Lock model ---

type lockState struct {
	held    bool
	holder  string  // threadID
	waiting []Event // parked EvLockAcquire events
}

// --- Trace ---

type TraceEntry struct {
	Time  uint64
	Event Event
	Note  string
}

// --- Simulator ---

type Simulator struct {
	Cluster   *Cluster
	rng       *rand.Rand
	queue     eventHeap
	nextID    uint64
	locks     map[string]*lockState // lockName -> state
	trace     []TraceEntry
	Errors    []string
	maxTime   uint64
	jitterMax uint64 // max random delay added to message delivery

	// Config
	FaultRate float64 // probability of injecting a fault per step [0,1]
	MaxEvents int     // stop after this many events
	eventsRun int
}

func NewSimulator(cluster *Cluster, seed int64) *Simulator {
	return &Simulator{
		Cluster:   cluster,
		rng:       rand.New(rand.NewSource(seed)),
		locks:     map[string]*lockState{},
		maxTime:   100000,
		jitterMax: 3,
		FaultRate: 0.05,
		MaxEvents: 5000,
	}
}

// Enqueue adds an event to the priority queue.
func (s *Simulator) Enqueue(e Event) {
	s.nextID++
	e.ID = s.nextID
	heap.Push(&s.queue, e)
}

// EnqueueAt is a convenience for enqueueing at a specific time.
func (s *Simulator) EnqueueAt(time uint64, kind EventKind, nodeID string, payload EventPayload) {
	s.Enqueue(Event{Time: time, Kind: kind, NodeID: nodeID, Payload: payload})
}

// jitter returns a random delay in [1, jitterMax].
func (s *Simulator) jitter() uint64 {
	if s.jitterMax <= 1 {
		return 1
	}
	return 1 + uint64(s.rng.Int63n(int64(s.jitterMax)))
}

// --- Main loop ---

// Step executes the next event. Returns false if queue is empty or limit reached.
// When multiple events share the same timestamp, one is chosen randomly
// to explore different interleavings across runs with different seeds.
func (s *Simulator) Step() bool {
	if s.queue.Len() == 0 || s.eventsRun >= s.MaxEvents {
		return false
	}
	// Collect all events at the earliest timestamp.
	earliest := s.queue[0].Time
	if earliest > s.maxTime {
		return false
	}
	var ready []Event
	for s.queue.Len() > 0 && s.queue[0].Time == earliest {
		ready = append(ready, heap.Pop(&s.queue).(Event))
	}
	// Shuffle to randomize interleaving of equal-time events.
	s.rng.Shuffle(len(ready), func(i, j int) { ready[i], ready[j] = ready[j], ready[i] })
	// Execute the first, re-enqueue the rest.
	e := ready[0]
	for _, r := range ready[1:] {
		heap.Push(&s.queue, r)
	}

	s.Cluster.Now = e.Time
	s.eventsRun++

	s.execute(e)
	s.checkInvariants(e)

	return len(s.Errors) == 0
}

// Run executes until queue empty, limit reached, or invariant violated.
func (s *Simulator) Run() {
	for s.Step() {
	}
}

// --- Event execution ---
func (s *Simulator) execute(e Event) {
	node := s.Cluster.Nodes[e.NodeID]

	switch e.Kind {
	case EvWriteStart:
		s.executeWriteStart(e)

	case EvShipEntry:
		// Primary ships entry to a replica. Enqueue delivery with jitter.
		if node != nil && node.Running {
			deliverTime := s.Cluster.Now + s.jitter()
			s.EnqueueAt(deliverTime, EvShipDeliver, e.Payload.ToNode, EventPayload{
				Write:    e.Payload.Write,
				FromNode: e.NodeID,
			})
			s.record(e, fmt.Sprintf("ship LSN=%d to %s, deliver@%d", e.Payload.Write.LSN, e.Payload.ToNode, deliverTime))
		}

	case EvShipDeliver:
		if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
			if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
				node.Storage.AppendWrite(e.Payload.Write)
				s.record(e, fmt.Sprintf("deliver LSN=%d on %s, receivedLSN=%d", e.Payload.Write.LSN, e.NodeID, node.Storage.ReceivedLSN))
			} else {
				s.record(e, fmt.Sprintf("drop LSN=%d to %s (link down)", e.Payload.Write.LSN, e.NodeID))
			}
		}

	case EvBarrierSend:
		if node != nil && node.Running {
			deliverTime := s.Cluster.Now + s.jitter()
			s.EnqueueAt(deliverTime, EvBarrierDeliver, e.Payload.ToNode, EventPayload{
				TargetLSN: e.Payload.TargetLSN,
				FromNode:  e.NodeID,
			})
			s.record(e, fmt.Sprintf("barrier LSN=%d to %s", e.Payload.TargetLSN, e.Payload.ToNode))
		}

	case EvBarrierDeliver:
		if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
			if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
				if node.Storage.ReceivedLSN >= e.Payload.TargetLSN {
					// Can fsync now. Enqueue fsync completion with small delay.
					s.EnqueueAt(s.Cluster.Now+1, EvBarrierFsync, e.NodeID, EventPayload{
						TargetLSN: e.Payload.TargetLSN,
						FromNode:  e.Payload.FromNode,
					})
					s.record(e, fmt.Sprintf("barrier deliver LSN=%d, fsync scheduled", e.Payload.TargetLSN))
				} else {
					// Not enough entries yet. Re-enqueue barrier with delay (retry).
					s.EnqueueAt(s.Cluster.Now+1, EvBarrierDeliver, e.NodeID, e.Payload)
					s.record(e, fmt.Sprintf("barrier LSN=%d waiting (received=%d)", e.Payload.TargetLSN, node.Storage.ReceivedLSN))
				}
			}
		}

	case EvBarrierFsync:
		if node != nil && node.Running {
			node.Storage.AdvanceFlush(e.Payload.TargetLSN)
			// Send ack back to primary.
			deliverTime := s.Cluster.Now + s.jitter()
			s.EnqueueAt(deliverTime, EvBarrierAck, e.Payload.FromNode, EventPayload{
				TargetLSN: e.Payload.TargetLSN,
				FromNode:  e.NodeID,
			})
			s.record(e, fmt.Sprintf("fsync LSN=%d on %s, flushedLSN=%d", e.Payload.TargetLSN, e.NodeID, node.Storage.FlushedLSN))
		}

	case EvBarrierAck:
		// Only process acks on running nodes in the current epoch.
		// After crash+promote, stale acks for the old primary must not advance commits.
		if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
			if pending := s.Cluster.Pending[e.Payload.TargetLSN]; pending != nil {
				pending.DurableOn[e.Payload.FromNode] = true
				s.Cluster.refreshCommits()
				s.record(e, fmt.Sprintf("ack LSN=%d from %s, durable=%d", e.Payload.TargetLSN, e.Payload.FromNode, s.Cluster.durableAckCount(pending)))
			}
		} else {
			s.record(e, fmt.Sprintf("ack LSN=%d from %s DROPPED (node down or stale epoch)", e.Payload.TargetLSN, e.Payload.FromNode))
		}

	case EvNodeCrash:
		if node != nil {
			node.Running = false
			// Drop all pending events for this node.
			s.dropEventsForNode(e.NodeID)
			s.record(e, fmt.Sprintf("CRASH %s", e.NodeID))
		}

	case EvNodeRestart:
		if node != nil {
			node.Running = true
			node.Epoch = s.Cluster.Coordinator.Epoch
			s.record(e, fmt.Sprintf("RESTART %s epoch=%d", e.NodeID, node.Epoch))
		}

	case EvLinkDown:
		s.Cluster.Disconnect(e.Payload.FromNode, e.Payload.ToNode)
		s.Cluster.Disconnect(e.Payload.ToNode, e.Payload.FromNode)
		s.record(e, fmt.Sprintf("LINK DOWN %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))

	case EvLinkUp:
		s.Cluster.Connect(e.Payload.FromNode, e.Payload.ToNode)
		s.Cluster.Connect(e.Payload.ToNode, e.Payload.FromNode)
		s.record(e, fmt.Sprintf("LINK UP %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))

	case EvFlusherTick:
		if node != nil && node.Running {
			node.Storage.AdvanceCheckpoint(node.Storage.FlushedLSN)
			s.record(e, fmt.Sprintf("flusher tick %s checkpoint=%d", e.NodeID, node.Storage.CheckpointLSN))
		}

	case EvPromote:
		if err := s.Cluster.Promote(e.Payload.PromoteID); err != nil {
			s.record(e, fmt.Sprintf("promote %s FAILED: %v", e.Payload.PromoteID, err))
		} else {
			s.record(e, fmt.Sprintf("PROMOTE %s epoch=%d", e.Payload.PromoteID, s.Cluster.Coordinator.Epoch))
		}

	case EvLockAcquire:
		s.executeLockAcquire(e)

	case EvLockRelease:
		s.executeLockRelease(e)
	}
}
func (s *Simulator) executeWriteStart(e Event) {
	c := s.Cluster
	primary := c.Primary()
	if primary == nil || !primary.Running || primary.Epoch != c.Coordinator.Epoch {
		s.record(e, "write rejected: no valid primary")
		return
	}
	c.nextLSN++
	w := Write{LSN: c.nextLSN, Block: e.Payload.Write.Block, Value: c.nextLSN}
	primary.Storage.AppendWrite(w)
	primary.Storage.AdvanceFlush(w.LSN)
	c.Reference.Apply(w)
	c.Pending[w.LSN] = &PendingCommit{
		Write:     w,
		DurableOn: map[string]bool{primary.ID: true},
	}
	c.refreshCommits()

	// Ship to each replica with jitter.
	for _, rid := range c.replicaIDs() {
		shipTime := s.Cluster.Now + s.jitter()
		s.EnqueueAt(shipTime, EvShipEntry, primary.ID, EventPayload{
			Write:  w,
			ToNode: rid,
		})
	}
	// Barrier after ship.
	for _, rid := range c.replicaIDs() {
		barrierTime := s.Cluster.Now + s.jitter() + 2
		s.EnqueueAt(barrierTime, EvBarrierSend, primary.ID, EventPayload{
			TargetLSN: w.LSN,
			ToNode:    rid,
		})
	}

	s.record(e, fmt.Sprintf("write block=%d LSN=%d", w.Block, w.LSN))
}

func (s *Simulator) executeLockAcquire(e Event) {
	name := e.Payload.LockName
	ls, ok := s.locks[name]
	if !ok {
		ls = &lockState{}
		s.locks[name] = ls
	}
	if !ls.held {
		ls.held = true
		ls.holder = e.Payload.ThreadID
		s.record(e, fmt.Sprintf("lock %s acquired by %s", name, e.Payload.ThreadID))
	} else {
		// Park; the waiter is granted the lock when the current holder releases.
		ls.waiting = append(ls.waiting, e)
		s.record(e, fmt.Sprintf("lock %s BLOCKED %s (held by %s)", name, e.Payload.ThreadID, ls.holder))
	}
}

func (s *Simulator) executeLockRelease(e Event) {
	name := e.Payload.LockName
	ls := s.locks[name]
	if ls == nil || !ls.held {
		return
	}
	// Validate: only the holder can release.
	if ls.holder != e.Payload.ThreadID {
		s.record(e, fmt.Sprintf("lock %s release REJECTED: %s is not holder (held by %s)", name, e.Payload.ThreadID, ls.holder))
		return
	}
	s.record(e, fmt.Sprintf("lock %s released by %s", name, ls.holder))
	ls.held = false
	ls.holder = ""
	// Grant to next waiter (random pick among waiters for interleaving exploration).
	if len(ls.waiting) > 0 {
		idx := s.rng.Intn(len(ls.waiting))
		next := ls.waiting[idx]
		ls.waiting = append(ls.waiting[:idx], ls.waiting[idx+1:]...)
		ls.held = true
		ls.holder = next.Payload.ThreadID
		s.record(next, fmt.Sprintf("lock %s granted to %s (was waiting)", name, next.Payload.ThreadID))
	}
}

func (s *Simulator) dropEventsForNode(nodeID string) {
	var kept eventHeap
	for _, e := range s.queue {
		if e.NodeID != nodeID {
			kept = append(kept, e)
		}
	}
	s.queue = kept
	heap.Init(&s.queue)
}

// --- Invariant checking ---
func (s *Simulator) checkInvariants(after Event) {
	// 1. Commit safety: committed LSN must be durable on policy-required nodes.
	for lsn := uint64(1); lsn <= s.Cluster.Coordinator.CommittedLSN; lsn++ {
		p := s.Cluster.Pending[lsn]
		if p == nil {
			continue
		}
		if !s.Cluster.commitSatisfied(p) {
			s.addError(after, fmt.Sprintf("committed LSN %d not durable per policy", lsn))
		}
	}

	// 2. No false commit on promoted node.
	primary := s.Cluster.Primary()
	if primary != nil && primary.Running {
		committedLSN := s.Cluster.Coordinator.CommittedLSN
		for lsn := committedLSN + 1; lsn <= s.Cluster.nextLSN; lsn++ {
			p := s.Cluster.Pending[lsn]
			if p != nil && !p.Committed && p.DurableOn[primary.ID] {
				// Uncommitted but durable on primary: only a problem if the primary changed.
				// This is expected on the original primary. Only flag on a PROMOTED node.
			}
		}
	}

	// 3. Data correctness: primary state matches reference at the LSN it actually has.
	// After promotion, the new primary may not have all writes the old primary committed.
	// Verify correctness only up to what the current primary has durably received.
	if primary != nil && primary.Running {
		checkLSN := primary.Storage.FlushedLSN
		if checkLSN > s.Cluster.Coordinator.CommittedLSN {
			checkLSN = s.Cluster.Coordinator.CommittedLSN
		}
		if checkLSN > 0 {
			refState := s.Cluster.Reference.StateAt(checkLSN)
			nodeState := primary.Storage.StateAt(checkLSN)
			if !EqualState(refState, nodeState) {
				s.addError(after, fmt.Sprintf("data divergence on primary %s at LSN=%d",
					primary.ID, checkLSN))
			}
		}
	}

	// 4. Epoch fencing: no node runs ahead of the coordinator's epoch.
	for id, node := range s.Cluster.Nodes {
		if node.Running && node.Epoch > s.Cluster.Coordinator.Epoch {
			s.addError(after, fmt.Sprintf("node %s has future epoch %d > coordinator %d", id, node.Epoch, s.Cluster.Coordinator.Epoch))
		}
	}

	// 5. Lock sanity: a held lock must always have a recorded holder.
	for name, ls := range s.locks {
		if ls.held && ls.holder == "" {
			s.addError(after, fmt.Sprintf("lock %s held but no holder", name))
		}
	}
}

func (s *Simulator) addError(after Event, msg string) {
	s.Errors = append(s.Errors, fmt.Sprintf("t=%d after %s on %s: %s",
		after.Time, after.Kind, after.NodeID, msg))
}

func (s *Simulator) record(e Event, note string) {
	s.trace = append(s.trace, TraceEntry{Time: e.Time, Event: e, Note: note})
}

// --- Random fault injection ---

// InjectRandomFault schedules a random fault (crash, partition, heal)
// at a random future time within [Now+1, Now+30].
func (s *Simulator) InjectRandomFault() {
	s.InjectRandomFaultWithin(30)
}

// InjectRandomFaultWithin schedules a random fault at a random time
// within [Now+1, Now+spread].
func (s *Simulator) InjectRandomFaultWithin(spread uint64) {
	if s.rng.Float64() > s.FaultRate {
		return
	}
	members := s.Cluster.Coordinator.Members
	if len(members) == 0 {
		return
	}
	faultTime := s.Cluster.Now + 1 + uint64(s.rng.Int63n(int64(spread)))

	switch s.rng.Intn(3) {
	case 0: // crash a random node
		id := members[s.rng.Intn(len(members))]
		s.EnqueueAt(faultTime, EvNodeCrash, id, EventPayload{})
	case 1: // drop a link
		from := members[s.rng.Intn(len(members))]
		to := members[s.rng.Intn(len(members))]
		if from != to {
			s.EnqueueAt(faultTime, EvLinkDown, from, EventPayload{FromNode: from, ToNode: to})
		}
	case 2: // restore a link
		from := members[s.rng.Intn(len(members))]
		to := members[s.rng.Intn(len(members))]
		if from != to {
			s.EnqueueAt(faultTime, EvLinkUp, from, EventPayload{FromNode: from, ToNode: to})
		}
	}
}

// --- Scenario helpers ---

// ScheduleWrites enqueues n writes at random times in [start, start+spread).
func (s *Simulator) ScheduleWrites(n int, start, spread uint64) {
	for i := 0; i < n; i++ {
		t := start + uint64(s.rng.Int63n(int64(spread)))
		block := uint64(s.rng.Intn(16))
		s.EnqueueAt(t, EvWriteStart, s.Cluster.Coordinator.PrimaryID, EventPayload{
			Write: Write{Block: block},
		})
	}
}

// ScheduleCrashAndPromote enqueues a primary crash at crashTime and promotes promoteID at promoteTime.
func (s *Simulator) ScheduleCrashAndPromote(crashTime uint64, promoteID string, promoteTime uint64) {
	s.EnqueueAt(crashTime, EvNodeCrash, s.Cluster.Coordinator.PrimaryID, EventPayload{})
	s.EnqueueAt(promoteTime, EvPromote, "", EventPayload{PromoteID: promoteID})
}

// ScheduleFlusherTicks enqueues periodic flusher ticks for a node.
func (s *Simulator) ScheduleFlusherTicks(nodeID string, start, interval uint64, count int) {
	for i := 0; i < count; i++ {
		s.EnqueueAt(start+uint64(i)*interval, EvFlusherTick, nodeID, EventPayload{})
	}
}

// --- Output ---

// TraceString returns the full trace as a string.
func (s *Simulator) TraceString() string {
	var sb strings.Builder
	for _, te := range s.trace {
		fmt.Fprintf(&sb, "[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note)
	}
	return sb.String()
}

// ErrorString returns all errors.
func (s *Simulator) ErrorString() string {
	return strings.Join(s.Errors, "\n")
}

// AssertCommittedDataCorrect checks that the current primary's state matches the reference.
func (s *Simulator) AssertCommittedDataCorrect() error {
	primary := s.Cluster.Primary()
	if primary == nil {
		return fmt.Errorf("no primary")
	}
	committedLSN := s.Cluster.Coordinator.CommittedLSN
	if committedLSN == 0 {
		return nil
	}
	refState := s.Cluster.Reference.StateAt(committedLSN)
	nodeState := primary.Storage.StateAt(committedLSN)
	if !EqualState(refState, nodeState) {
		return fmt.Errorf("data divergence on %s at LSN=%d: ref=%v node=%v",
			primary.ID, committedLSN, refState, nodeState)
	}
	return nil
}
@@ -0,0 +1,285 @@
package distsim

import (
	"fmt"
	"strings"
	"testing"
)

// --- Fixed scenarios ---

func TestSim_BasicWriteAndCommit(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	sim := NewSimulator(c, 42)

	sim.ScheduleWrites(3, 1, 5)
	sim.Run()

	if c.Coordinator.CommittedLSN < 1 {
		t.Fatalf("expected at least 1 committed write, got %d", c.Coordinator.CommittedLSN)
	}
	if err := sim.AssertCommittedDataCorrect(); err != nil {
		t.Fatal(err)
	}
	if len(sim.Errors) > 0 {
		t.Fatalf("invariant violations:\n%s", sim.ErrorString())
	}
}

func TestSim_CrashAfterCommit_DataSurvives(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	sim := NewSimulator(c, 99)

	// Write, let it commit, then crash primary, promote r1.
	sim.ScheduleWrites(5, 1, 3)
	sim.ScheduleCrashAndPromote(20, "r1", 22)

	sim.Run()

	if len(sim.Errors) > 0 {
		t.Fatalf("invariant violations:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString())
	}
	if err := sim.AssertCommittedDataCorrect(); err != nil {
		t.Fatal(err)
	}
}

func TestSim_PartitionThenHeal(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	sim := NewSimulator(c, 777)

	// Write some data.
	sim.ScheduleWrites(3, 1, 3)
	// Partition r2 at time 5.
	sim.EnqueueAt(5, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r2"})
	// Write more during partition.
	sim.ScheduleWrites(3, 8, 3)
	// Heal at time 15.
	sim.EnqueueAt(15, EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r2"})
	// Write after heal.
	sim.ScheduleWrites(2, 18, 3)

	sim.Run()

	if len(sim.Errors) > 0 {
		t.Fatalf("invariant violations:\n%s", sim.ErrorString())
	}
	if err := sim.AssertCommittedDataCorrect(); err != nil {
		t.Fatal(err)
	}
}

func TestSim_SyncAll_UncommittedNotVisible(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r1")
	sim := NewSimulator(c, 123)

	// Partition r1 so nothing can commit under sync_all.
	sim.EnqueueAt(0, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"})
	sim.ScheduleWrites(3, 1, 3)

	sim.Run()

	// Nothing should be committed.
	if c.Coordinator.CommittedLSN != 0 {
		t.Fatalf("sync_all with partitioned replica should not commit, got %d", c.Coordinator.CommittedLSN)
	}
	if len(sim.Errors) > 0 {
		t.Fatalf("invariant violations:\n%s", sim.ErrorString())
	}
}

func TestSim_MessageReorderingDoesNotBreakSafety(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	sim := NewSimulator(c, 555)
	sim.jitterMax = 8 // high jitter to force reordering

	sim.ScheduleWrites(10, 1, 5)
	sim.Run()

	if len(sim.Errors) > 0 {
		t.Fatalf("invariant violations with high jitter:\n%s", sim.ErrorString())
	}
	if err := sim.AssertCommittedDataCorrect(); err != nil {
		t.Fatal(err)
	}
}

// --- Randomized property-based testing ---
func TestSim_Randomized_CommitSafety(t *testing.T) {
	const numSeeds = 500
	const numWrites = 20
	failures := 0

	for seed := int64(0); seed < numSeeds; seed++ {
		c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
		sim := NewSimulator(c, seed)
		sim.MaxEvents = 2000

		// Random writes.
		sim.ScheduleWrites(numWrites, 1, 30)

		// Random crash + promote somewhere in the middle.
		crashTime := uint64(sim.rng.Intn(25) + 5)
		sim.ScheduleCrashAndPromote(crashTime, "r1", crashTime+3)

		sim.Run()

		if len(sim.Errors) > 0 {
			t.Errorf("seed %d: invariant violation:\n%s\nTrace (last 20):\n%s",
				seed, sim.ErrorString(), lastN(sim.trace, 20))
			failures++
			if failures >= 3 {
				t.Fatal("too many failures, stopping")
			}
		}
	}
	t.Logf("randomized: %d/%d seeds passed", numSeeds-failures, numSeeds)
}

func TestSim_Randomized_WithFaults(t *testing.T) {
	const numSeeds = 300
	failures := 0

	for seed := int64(0); seed < numSeeds; seed++ {
		c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
		sim := NewSimulator(c, seed)
		sim.FaultRate = 0.08
		sim.MaxEvents = 1500
		sim.jitterMax = 5

		// Interleave writes and random faults.
		for i := 0; i < 15; i++ {
			when := uint64(i*3 + 1) // named to avoid shadowing *testing.T
			sim.EnqueueAt(when, EvWriteStart, "p", EventPayload{
				Write: Write{Block: uint64(sim.rng.Intn(8))},
			})
			sim.InjectRandomFault()
		}

		sim.Run()

		if len(sim.Errors) > 0 {
			t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
			failures++
			if failures >= 3 {
				t.Fatal("too many failures, stopping")
			}
		}
	}
	t.Logf("randomized+faults: %d/%d seeds passed", numSeeds-failures, numSeeds)
}

func TestSim_Randomized_SyncAll(t *testing.T) {
	const numSeeds = 200
	failures := 0

	for seed := int64(0); seed < numSeeds; seed++ {
		c := NewCluster(CommitSyncAll, "p", "r1")
		sim := NewSimulator(c, seed)
		sim.MaxEvents = 1000

		sim.ScheduleWrites(10, 1, 20)

		// Random partition/heal.
		if sim.rng.Float64() < 0.5 {
			pTime := uint64(sim.rng.Intn(15) + 1)
			sim.EnqueueAt(pTime, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"})
			sim.EnqueueAt(pTime+uint64(sim.rng.Intn(10)+3), EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r1"})
		}

		sim.Run()

		if len(sim.Errors) > 0 {
			t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
			failures++
			if failures >= 3 {
				t.Fatal("too many failures, stopping")
			}
		}
	}
	t.Logf("sync_all randomized: %d/%d seeds passed", numSeeds-failures, numSeeds)
}

// --- Lock contention tests ---

func TestSim_LockContention_NoDoubleHold(t *testing.T) {
	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
	sim := NewSimulator(c, 42)

	// Two threads try to acquire the same lock at the same time.
	sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"})
	sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"})
|||
|
|||
// First release.
|
|||
sim.EnqueueAt(8, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"}) |
|||
// Second release (whoever got granted after writer-1 releases).
|
|||
sim.EnqueueAt(11, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"}) |
|||
|
|||
sim.Run() |
|||
|
|||
if len(sim.Errors) > 0 { |
|||
t.Fatalf("lock invariant violated:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString()) |
|||
} |
|||
|
|||
// Verify the trace shows one blocked, one granted.
|
|||
trace := sim.TraceString() |
|||
if !containsStr(trace, "BLOCKED") { |
|||
t.Fatal("expected one thread to be BLOCKED on lock contention") |
|||
} |
|||
if !containsStr(trace, "granted to") { |
|||
t.Fatal("expected blocked thread to be granted after release") |
|||
} |
|||
} |
|||
|
|||
func TestSim_LockContention_Randomized(t *testing.T) { |
|||
// Run many seeds with concurrent lock acquires at the same time.
|
|||
// The simulator should pick a random winner each time (seed-dependent).
|
|||
winners := map[string]int{} |
|||
for seed := int64(0); seed < 100; seed++ { |
|||
c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") |
|||
sim := NewSimulator(c, seed) |
|||
|
|||
sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "A"}) |
|||
sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "B"}) |
|||
sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "A"}) |
|||
sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "B"}) |
|||
|
|||
sim.Run() |
|||
|
|||
if len(sim.Errors) > 0 { |
|||
t.Fatalf("seed %d: %s", seed, sim.ErrorString()) |
|||
} |
|||
|
|||
// Check who got the lock first by looking at the trace.
|
|||
for _, te := range sim.trace { |
|||
if te.Event.Kind == EvLockAcquire && containsStr(te.Note, "acquired") { |
|||
winners[te.Event.Payload.ThreadID]++ |
|||
break |
|||
} |
|||
} |
|||
} |
|||
// Both threads should win at least some seeds (randomization works).
|
|||
if winners["A"] == 0 || winners["B"] == 0 { |
|||
t.Fatalf("lock winner not randomized: A=%d B=%d", winners["A"], winners["B"]) |
|||
} |
|||
t.Logf("lock winner distribution: A=%d B=%d", winners["A"], winners["B"]) |
|||
} |
|||
|
|||
func containsStr(s, substr string) bool { |
|||
return len(s) > 0 && len(substr) > 0 && strings.Contains(s, substr) |
|||
} |
|||
|
|||
// --- Helpers ---
|
|||
|
|||
func lastN(trace []TraceEntry, n int) string { |
|||
start := len(trace) - n |
|||
if start < 0 { |
|||
start = 0 |
|||
} |
|||
s := "" |
|||
for _, te := range trace[start:] { |
|||
s += fmt.Sprintf("[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note) |
|||
} |
|||
return s |
|||
} |
|||
@ -0,0 +1,129 @@
package distsim

import "sort"

type SnapshotState struct {
	ID    string
	LSN   uint64
	State map[uint64]uint64
}

type Storage struct {
	WAL           []Write
	Extent        map[uint64]uint64
	ReceivedLSN   uint64
	FlushedLSN    uint64
	CheckpointLSN uint64
	Snapshots     map[string]SnapshotState
	BaseSnapshot  *SnapshotState
}

func NewStorage() *Storage {
	return &Storage{
		Extent:    map[uint64]uint64{},
		Snapshots: map[string]SnapshotState{},
	}
}

func (s *Storage) AppendWrite(w Write) {
	// Insert in LSN order (handles out-of-order delivery from jitter).
	inserted := false
	for i, existing := range s.WAL {
		if w.LSN == existing.LSN {
			return // duplicate, skip
		}
		if w.LSN < existing.LSN {
			s.WAL = append(s.WAL[:i], append([]Write{w}, s.WAL[i:]...)...)
			inserted = true
			break
		}
	}
	if !inserted {
		s.WAL = append(s.WAL, w)
	}
	s.Extent[w.Block] = w.Value
	if w.LSN > s.ReceivedLSN {
		s.ReceivedLSN = w.LSN
	}
}

func (s *Storage) AdvanceFlush(lsn uint64) {
	if lsn > s.ReceivedLSN {
		lsn = s.ReceivedLSN
	}
	if lsn > s.FlushedLSN {
		s.FlushedLSN = lsn
	}
}

func (s *Storage) AdvanceCheckpoint(lsn uint64) {
	if lsn > s.FlushedLSN {
		lsn = s.FlushedLSN
	}
	if lsn > s.CheckpointLSN {
		s.CheckpointLSN = lsn
	}
}

func (s *Storage) StateAt(lsn uint64) map[uint64]uint64 {
	state := map[uint64]uint64{}
	if s.BaseSnapshot != nil {
		if s.BaseSnapshot.LSN > lsn {
			return cloneMap(s.BaseSnapshot.State)
		}
		state = cloneMap(s.BaseSnapshot.State)
	}
	for _, w := range s.WAL {
		if w.LSN > lsn {
			break
		}
		if s.BaseSnapshot != nil && w.LSN <= s.BaseSnapshot.LSN {
			continue
		}
		state[w.Block] = w.Value
	}
	return state
}

func (s *Storage) TakeSnapshot(id string, lsn uint64) SnapshotState {
	snap := SnapshotState{
		ID:    id,
		LSN:   lsn,
		State: cloneMap(s.StateAt(lsn)),
	}
	s.Snapshots[id] = snap
	return snap
}

func (s *Storage) LoadSnapshot(snap SnapshotState) {
	s.Extent = cloneMap(snap.State)
	s.FlushedLSN = snap.LSN
	s.ReceivedLSN = snap.LSN
	s.CheckpointLSN = snap.LSN
	s.BaseSnapshot = &SnapshotState{
		ID:    snap.ID,
		LSN:   snap.LSN,
		State: cloneMap(snap.State),
	}
	s.WAL = nil
}

func (s *Storage) ReplaceWAL(writes []Write) {
	s.WAL = append([]Write(nil), writes...)
	sort.Slice(s.WAL, func(i, j int) bool { return s.WAL[i].LSN < s.WAL[j].LSN })
	s.Extent = s.StateAt(s.ReceivedLSN)
}

func writesInRange(writes []Write, startExclusive, endInclusive uint64) []Write {
	out := make([]Write, 0)
	for _, w := range writes {
		if w.LSN <= startExclusive {
			continue
		}
		if w.LSN > endInclusive {
			break
		}
		out = append(out, w)
	}
	return out
}
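The ordered-insert logic in AppendWrite above is the piece that makes the simulator tolerant of jittered delivery: duplicates are dropped and late arrivals slot into LSN order. A standalone sketch of the same idea, using illustrative names (`write`, `appendOrdered`) rather than the distsim types:

```go
package main

import "fmt"

type write struct{ lsn, block, value uint64 }

// appendOrdered inserts w into wal keeping LSN order and dropping
// duplicates, mirroring Storage.AppendWrite's out-of-order handling.
func appendOrdered(wal []write, w write) []write {
	for i, e := range wal {
		if w.lsn == e.lsn {
			return wal // duplicate, skip
		}
		if w.lsn < e.lsn {
			return append(wal[:i], append([]write{w}, wal[i:]...)...)
		}
	}
	return append(wal, w)
}

func main() {
	var wal []write
	// Jittered delivery: 3 first, then 1, then 2, then a duplicate of 2.
	for _, w := range []write{{3, 0, 30}, {1, 0, 10}, {2, 0, 20}, {2, 0, 99}} {
		wal = appendOrdered(wal, w)
	}
	for _, w := range wal {
		fmt.Println(w.lsn, w.value) // prints 1 10, 2 20, 3 30
	}
}
```

The inner `append([]write{w}, wal[i:]...)` copies the tail into fresh backing storage before the outer append shifts elements, which is what keeps the standard Go insert idiom safe here.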
@ -0,0 +1,64 @@
package enginev2

// AssignmentIntent represents a coordinator-driven assignment update.
// It specifies the desired replica set and which replicas need recovery.
type AssignmentIntent struct {
	Endpoints       map[string]Endpoint    // desired replica set
	Epoch           uint64                 // current epoch
	RecoveryTargets map[string]SessionKind // replicas that need recovery (nil = no recovery)
}

// AssignmentResult records what the SenderGroup did in response to an assignment.
type AssignmentResult struct {
	Added              []string // new senders created
	Removed            []string // old senders stopped
	SessionsCreated    []string // fresh recovery sessions attached
	SessionsSuperseded []string // existing sessions superseded by new ones
	SessionsFailed     []string // recovery sessions that couldn't be created
}

// ApplyAssignment processes a coordinator assignment intent:
//  1. Reconcile endpoints — add/remove/update senders.
//  2. For each recovery target, create a recovery session on the sender.
//
// Epoch fencing: if intent.Epoch < sender.Epoch for any target, that target
// is rejected. A stale assignment intent cannot create live sessions.
func (sg *SenderGroup) ApplyAssignment(intent AssignmentIntent) AssignmentResult {
	var result AssignmentResult

	// Step 1: reconcile topology.
	result.Added, result.Removed = sg.Reconcile(intent.Endpoints, intent.Epoch)

	// Step 2: create recovery sessions for designated targets.
	if intent.RecoveryTargets == nil {
		return result
	}

	sg.mu.RLock()
	defer sg.mu.RUnlock()
	for replicaID, kind := range intent.RecoveryTargets {
		sender, ok := sg.senders[replicaID]
		if !ok {
			result.SessionsFailed = append(result.SessionsFailed, replicaID)
			continue
		}
		// Reject stale assignment: the intent epoch must not be older
		// than the sender's epoch.
		if intent.Epoch < sender.Epoch {
			result.SessionsFailed = append(result.SessionsFailed, replicaID)
			continue
		}
		_, err := sender.AttachSession(intent.Epoch, kind)
		if err != nil {
			// Session already active at the current epoch — supersede it.
			sess := sender.SupersedeSession(kind, "assignment_intent")
			if sess != nil {
				result.SessionsSuperseded = append(result.SessionsSuperseded, replicaID)
			} else {
				result.SessionsFailed = append(result.SessionsFailed, replicaID)
			}
			continue
		}
		result.SessionsCreated = append(result.SessionsCreated, replicaID)
	}
	return result
}
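The fencing gate in ApplyAssignment can be looked at in isolation: an intent whose epoch is older than the sender's is rejected, while a current or newer epoch is admitted. A minimal sketch with illustrative types (not the enginev2 `Sender`):

```go
package main

import "fmt"

type sender struct{ epoch uint64 }

// admit applies the same comparison ApplyAssignment uses: a stale
// intent (epoch older than the sender's) cannot create a session.
func admit(s sender, intentEpoch uint64) bool {
	return intentEpoch >= s.epoch
}

func main() {
	s := sender{epoch: 2}
	fmt.Println(admit(s, 1)) // stale intent, rejected: false
	fmt.Println(admit(s, 2)) // current epoch: true
	fmt.Println(admit(s, 3)) // newer epoch: true
}
```

Note the gate is "not older", not strict equality: an intent at a newer epoch is allowed through, since the coordinator may learn of an epoch bump before the sender does.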
@ -0,0 +1,420 @@
package enginev2

import "testing"

// ============================================================
// Phase 04 P1: Session execution and sender-group orchestration
// ============================================================

// --- Execution API: full lifecycle ---

func TestExec_FullRecoveryLifecycle(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	id := sess.ID

	// init → connecting
	if err := s.BeginConnect(id); err != nil {
		t.Fatalf("BeginConnect: %v", err)
	}
	if s.State != StateConnecting {
		t.Fatalf("state=%s, want connecting", s.State)
	}

	// connecting → handshake
	if err := s.RecordHandshake(id, 5, 20); err != nil {
		t.Fatalf("RecordHandshake: %v", err)
	}
	if sess.StartLSN != 5 || sess.TargetLSN != 20 {
		t.Fatalf("range: start=%d target=%d", sess.StartLSN, sess.TargetLSN)
	}

	// handshake → catchup
	if err := s.BeginCatchUp(id); err != nil {
		t.Fatalf("BeginCatchUp: %v", err)
	}
	if s.State != StateCatchingUp {
		t.Fatalf("state=%s, want catching_up", s.State)
	}

	// progress
	if err := s.RecordCatchUpProgress(id, 15); err != nil {
		t.Fatalf("progress to 15: %v", err)
	}
	if err := s.RecordCatchUpProgress(id, 20); err != nil {
		t.Fatalf("progress to 20: %v", err)
	}
	if !sess.Converged() {
		t.Fatal("should be converged at 20/20")
	}

	// complete
	if !s.CompleteSessionByID(id) {
		t.Fatal("completion should succeed")
	}
	if s.State != StateInSync {
		t.Fatalf("state=%s, want in_sync", s.State)
	}
}

// --- Stale sessionID rejection across all execution APIs ---

func TestExec_StaleID_AllAPIsReject(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess1, _ := s.AttachSession(1, SessionCatchUp)
	oldID := sess1.ID

	// Supersede with new session.
	s.UpdateEpoch(2)
	sess2, _ := s.AttachSession(2, SessionCatchUp)
	_ = sess2

	// All APIs must reject oldID.
	if err := s.BeginConnect(oldID); err == nil {
		t.Fatal("BeginConnect should reject stale ID")
	}
	if err := s.RecordHandshake(oldID, 0, 10); err == nil {
		t.Fatal("RecordHandshake should reject stale ID")
	}
	if err := s.BeginCatchUp(oldID); err == nil {
		t.Fatal("BeginCatchUp should reject stale ID")
	}
	if err := s.RecordCatchUpProgress(oldID, 5); err == nil {
		t.Fatal("RecordCatchUpProgress should reject stale ID")
	}
	if s.CompleteSessionByID(oldID) {
		t.Fatal("CompleteSessionByID should reject stale ID")
	}
}

// --- Phase ordering enforcement ---

func TestExec_WrongPhaseOrder_Rejected(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	id := sess.ID

	// Skip connecting → go directly to handshake: rejected.
	if err := s.RecordHandshake(id, 0, 10); err == nil {
		t.Fatal("handshake from init should be rejected")
	}

	// Skip to catch-up from init: rejected.
	if err := s.BeginCatchUp(id); err == nil {
		t.Fatal("catch-up from init should be rejected")
	}

	// Progress from init: rejected (not in catch-up phase).
	if err := s.RecordCatchUpProgress(id, 5); err == nil {
		t.Fatal("progress from init should be rejected")
	}

	// Correct path: init → connecting.
	s.BeginConnect(id)
	// Now try catch-up from connecting: rejected (must handshake first).
	if err := s.BeginCatchUp(id); err == nil {
		t.Fatal("catch-up from connecting should be rejected")
	}
}

// --- Progress regression rejection ---

func TestExec_ProgressRegression_Rejected(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	id := sess.ID

	s.BeginConnect(id)
	s.RecordHandshake(id, 0, 100)
	s.BeginCatchUp(id)

	s.RecordCatchUpProgress(id, 50)

	// Regression: 30 < 50.
	if err := s.RecordCatchUpProgress(id, 30); err == nil {
		t.Fatal("progress regression should be rejected")
	}

	// Same value: 50 == 50.
	if err := s.RecordCatchUpProgress(id, 50); err == nil {
		t.Fatal("non-advancing progress should be rejected")
	}

	// Advance: 60 > 50.
	if err := s.RecordCatchUpProgress(id, 60); err != nil {
		t.Fatalf("valid progress should succeed: %v", err)
	}
}

// --- Epoch bump during execution ---

func TestExec_EpochBumpDuringExecution_InvalidatesAuthority(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	id := sess.ID

	s.BeginConnect(id)
	s.RecordHandshake(id, 0, 100)
	s.BeginCatchUp(id)
	s.RecordCatchUpProgress(id, 50)

	// Epoch bumps mid-execution.
	s.UpdateEpoch(2)

	// All further execution on old session rejected.
	if err := s.RecordCatchUpProgress(id, 60); err == nil {
		t.Fatal("progress after epoch bump should be rejected")
	}
	if s.CompleteSessionByID(id) {
		t.Fatal("completion after epoch bump should be rejected")
	}

	// Sender is disconnected, ready for new session.
	if s.State != StateDisconnected {
		t.Fatalf("state=%s, want disconnected", s.State)
	}
}

// --- Endpoint change during execution ---

func TestExec_EndpointChangeDuringExecution_InvalidatesAuthority(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	id := sess.ID

	s.BeginConnect(id)
	s.RecordHandshake(id, 0, 50)
	s.BeginCatchUp(id)

	// Endpoint changes mid-execution.
	s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", CtrlAddr: "r1:9445", Version: 2})

	// All further execution rejected.
	if err := s.RecordCatchUpProgress(id, 10); err == nil {
		t.Fatal("progress after endpoint change should be rejected")
	}
	if s.CompleteSessionByID(id) {
		t.Fatal("completion after endpoint change should be rejected")
	}
}

// --- Completion authority enforcement ---

func TestExec_CompletionRejected_FromInit(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)

	if s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion from PhaseInit should be rejected")
	}
}

func TestExec_CompletionRejected_FromConnecting(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)

	if s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion from PhaseConnecting should be rejected")
	}
}

func TestExec_CompletionRejected_FromHandshakeWithGap(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)
	s.RecordHandshake(sess.ID, 5, 20) // gap exists: 5 → 20

	if s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion from PhaseHandshake with gap should be rejected")
	}
}

func TestExec_CompletionAllowed_FromHandshakeZeroGap(t *testing.T) {
	// Fast path: handshake shows replica already at target (zero gap).
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)
	s.RecordHandshake(sess.ID, 10, 10) // zero gap: start == target

	if !s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion from handshake with zero gap should be allowed")
	}
	if s.State != StateInSync {
		t.Fatalf("state=%s, want in_sync", s.State)
	}
}

func TestExec_CompletionRejected_FromCatchUpNotConverged(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)
	s.RecordHandshake(sess.ID, 0, 100)
	s.BeginCatchUp(sess.ID)
	s.RecordCatchUpProgress(sess.ID, 50) // not converged (50 < 100)

	if s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion before convergence should be rejected")
	}
}

func TestExec_HandshakeInvalidRange_Rejected(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)

	if err := s.RecordHandshake(sess.ID, 20, 5); err == nil {
		t.Fatal("handshake with target < start should be rejected")
	}
}

// --- SenderGroup orchestration ---

func TestOrch_RepeatedReconnectCycles_PreserveSenderIdentity(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
	}, 1)

	s := sg.Sender("r1:9333")
	original := s // save pointer

	// 5 reconnect cycles — sender identity preserved.
	for cycle := 0; cycle < 5; cycle++ {
		sess, err := s.AttachSession(1, SessionCatchUp)
		if err != nil {
			t.Fatalf("cycle %d attach: %v", cycle, err)
		}
		s.BeginConnect(sess.ID)
		s.RecordHandshake(sess.ID, 0, 10)
		s.BeginCatchUp(sess.ID)
		s.RecordCatchUpProgress(sess.ID, 10)
		s.CompleteSessionByID(sess.ID)

		if s.State != StateInSync {
			t.Fatalf("cycle %d: state=%s, want in_sync", cycle, s.State)
		}
	}

	// Same pointer — identity preserved.
	if sg.Sender("r1:9333") != original {
		t.Fatal("sender identity should be preserved across cycles")
	}
}

func TestOrch_EndpointUpdateSupersedesActiveSession(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1},
	}, 1)

	s := sg.Sender("r1:9333")
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)

	// Endpoint update via reconcile — session invalidated.
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 2},
	}, 1)

	if sess.Active() {
		t.Fatal("session should be invalidated by endpoint update")
	}
	// Sender preserved, session gone.
	if sg.Sender("r1:9333") != s {
		t.Fatal("sender identity should be preserved")
	}
	if s.Session() != nil {
		t.Fatal("session should be nil after endpoint invalidation")
	}
}

func TestOrch_ReconcileMixedAddRemoveUpdate(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
		"r3:9333": {DataAddr: "r3:9333", Version: 1},
	}, 1)

	r1 := sg.Sender("r1:9333")
	r2 := sg.Sender("r2:9333")

	// Attach sessions to r1 and r2.
	r1Sess, _ := r1.AttachSession(1, SessionCatchUp)
	r2Sess, _ := r2.AttachSession(1, SessionCatchUp)

	// Reconcile: keep r1, remove r2, update r3, add r4.
	added, removed := sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1}, // kept
		"r3:9333": {DataAddr: "r3:9333", Version: 2}, // updated
		"r4:9333": {DataAddr: "r4:9333", Version: 1}, // added
	}, 1)

	if len(added) != 1 || added[0] != "r4:9333" {
		t.Fatalf("added=%v", added)
	}
	if len(removed) != 1 || removed[0] != "r2:9333" {
		t.Fatalf("removed=%v", removed)
	}

	// r1: preserved with active session.
	if sg.Sender("r1:9333") != r1 {
		t.Fatal("r1 should be preserved")
	}
	if !r1Sess.Active() {
		t.Fatal("r1 session should still be active")
	}

	// r2: stopped and removed.
	if sg.Sender("r2:9333") != nil {
		t.Fatal("r2 should be removed")
	}
	if !r2.Stopped() {
		t.Fatal("r2 should be stopped")
	}
	if r2Sess.Active() {
		t.Fatal("r2 session should be invalidated (sender stopped)")
	}

	// r4: new sender, no session.
	if sg.Sender("r4:9333") == nil {
		t.Fatal("r4 should exist")
	}
}

func TestOrch_EpochBumpInvalidatesExecutingSessions(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
	}, 1)

	r1 := sg.Sender("r1:9333")
	r2 := sg.Sender("r2:9333")

	sess1, _ := r1.AttachSession(1, SessionCatchUp)
	r1.BeginConnect(sess1.ID)
	r1.RecordHandshake(sess1.ID, 0, 50)
	r1.BeginCatchUp(sess1.ID)
	r1.RecordCatchUpProgress(sess1.ID, 25) // mid-execution

	sess2, _ := r2.AttachSession(1, SessionCatchUp)
	r2.BeginConnect(sess2.ID)

	// Epoch bump.
	count := sg.InvalidateEpoch(2)
	if count != 2 {
		t.Fatalf("should invalidate 2 sessions, got %d", count)
	}

	// Both sessions dead.
	if sess1.Active() || sess2.Active() {
		t.Fatal("both sessions should be invalidated")
	}

	// r1's mid-execution progress cannot continue.
	if err := r1.RecordCatchUpProgress(sess1.ID, 30); err == nil {
		t.Fatal("progress on invalidated session should be rejected")
	}
}
@ -0,0 +1,3 @@
module github.com/seaweedfs/seaweedfs/sw-block/prototype/enginev2

go 1.23.0
@ -0,0 +1,39 @@
package enginev2

// HandshakeResult captures what the reconnect handshake reveals about a
// replica's state relative to the primary's lineage-safe boundary.
type HandshakeResult struct {
	ReplicaFlushedLSN uint64 // highest LSN durably persisted on replica
	CommittedLSN      uint64 // lineage-safe recovery target (committed prefix)
	RetentionStartLSN uint64 // oldest LSN still available in primary WAL
}

// RecoveryOutcome classifies the gap between replica and primary.
type RecoveryOutcome string

const (
	OutcomeZeroGap      RecoveryOutcome = "zero_gap"      // replica has full committed prefix
	OutcomeCatchUp      RecoveryOutcome = "catchup"       // gap within WAL retention
	OutcomeNeedsRebuild RecoveryOutcome = "needs_rebuild" // gap exceeds retention
)

// ClassifyRecoveryOutcome determines the recovery path from handshake data.
//
// Uses CommittedLSN (not WAL head) as the target boundary. This is the
// lineage-safe recovery point — only acknowledged data counts. A replica
// with FlushedLSN > CommittedLSN has a divergent/uncommitted tail that must
// NOT be treated as "already in sync."
//
// Decision matrix (matches CP13-5 gap analysis):
//   - ReplicaFlushedLSN >= CommittedLSN        → zero gap, has full committed prefix
//   - ReplicaFlushedLSN+1 >= RetentionStartLSN → recoverable via WAL catch-up
//   - otherwise                                → gap too large, needs rebuild
func ClassifyRecoveryOutcome(result HandshakeResult) RecoveryOutcome {
	if result.ReplicaFlushedLSN >= result.CommittedLSN {
		return OutcomeZeroGap
	}
	if result.RetentionStartLSN == 0 || result.ReplicaFlushedLSN+1 >= result.RetentionStartLSN {
		return OutcomeCatchUp
	}
	return OutcomeNeedsRebuild
}
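The decision matrix above hinges on two boundaries: the committed prefix and the WAL retention start. A standalone restatement of the same three-way classification, with illustrative names (`classify`, plain string results) rather than the enginev2 types, makes the off-by-one behavior at the retention boundary easy to see:

```go
package main

import "fmt"

// classify mirrors the ClassifyRecoveryOutcome decision matrix:
// zero gap if the replica holds the full committed prefix, catch-up
// if the missing range is still in WAL retention, rebuild otherwise.
// retentionStart == 0 is treated as "full WAL available".
func classify(replicaFlushed, committed, retentionStart uint64) string {
	if replicaFlushed >= committed {
		return "zero_gap"
	}
	if retentionStart == 0 || replicaFlushed+1 >= retentionStart {
		return "catchup"
	}
	return "needs_rebuild"
}

func main() {
	fmt.Println(classify(100, 100, 50)) // replica at committed prefix: zero_gap
	fmt.Println(classify(49, 100, 50))  // next needed LSN == retention start: catchup
	fmt.Println(classify(48, 100, 50))  // one LSN short of retention: needs_rebuild
}
```

The `+1` matters: the first LSN the replica needs is `replicaFlushed+1`, so recovery is possible exactly when that LSN has not yet been trimmed from the primary's WAL.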
@ -0,0 +1,482 @@
package enginev2

import "testing"

// ============================================================
// Phase 04 P2: Outcome branching, assignment intent, end-to-end
// ============================================================

// --- Recovery outcome classification ---

func TestOutcome_ZeroGap(t *testing.T) {
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 100,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeZeroGap {
		t.Fatalf("got %s, want zero_gap", o)
	}
}

func TestOutcome_ZeroGap_ReplicaAtCommitted(t *testing.T) {
	// Replica has exactly the committed prefix — zero gap.
	// Note: replica may have uncommitted tail beyond CommittedLSN;
	// that is handled by truncation, not by recovery classification.
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 100,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeZeroGap {
		t.Fatalf("got %s, want zero_gap", o)
	}
}

func TestOutcome_CatchUp(t *testing.T) {
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 80,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeCatchUp {
		t.Fatalf("got %s, want catchup", o)
	}
}

func TestOutcome_CatchUp_ExactBoundary(t *testing.T) {
	// ReplicaFlushedLSN+1 == RetentionStartLSN → recoverable (just barely).
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 49,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeCatchUp {
		t.Fatalf("got %s, want catchup (exact boundary)", o)
	}
}

func TestOutcome_NeedsRebuild(t *testing.T) {
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 10,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeNeedsRebuild {
		t.Fatalf("got %s, want needs_rebuild", o)
	}
}

func TestOutcome_NeedsRebuild_OffByOne(t *testing.T) {
	// ReplicaFlushedLSN+1 < RetentionStartLSN → unrecoverable.
	o := ClassifyRecoveryOutcome(HandshakeResult{
		ReplicaFlushedLSN: 48,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if o != OutcomeNeedsRebuild {
		t.Fatalf("got %s, want needs_rebuild (off-by-one)", o)
	}
}

// --- RecordHandshakeWithOutcome execution ---

func TestExec_HandshakeOutcome_ZeroGap_FastComplete(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)

	outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
		ReplicaFlushedLSN: 100,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if err != nil {
		t.Fatal(err)
	}
	if outcome != OutcomeZeroGap {
		t.Fatalf("outcome=%s, want zero_gap", outcome)
	}

	// Zero-gap: can complete directly from handshake phase.
	if !s.CompleteSessionByID(sess.ID) {
		t.Fatal("zero-gap fast completion should succeed")
	}
	if s.State != StateInSync {
		t.Fatalf("state=%s, want in_sync", s.State)
	}
}

func TestExec_HandshakeOutcome_CatchUp_NormalPath(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)

	outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
		ReplicaFlushedLSN: 80,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if err != nil {
		t.Fatal(err)
	}
	if outcome != OutcomeCatchUp {
		t.Fatalf("outcome=%s, want catchup", outcome)
	}

	// Must catch up before completing.
	if s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion should be rejected before catch-up")
	}

	s.BeginCatchUp(sess.ID)
	s.RecordCatchUpProgress(sess.ID, 100)
	if !s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion should succeed after convergence")
	}
}

func TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.BeginConnect(sess.ID)

	outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
		ReplicaFlushedLSN: 10,
		CommittedLSN:      100,
		RetentionStartLSN: 50,
	})
	if err != nil {
		t.Fatal(err)
	}
	if outcome != OutcomeNeedsRebuild {
		t.Fatalf("outcome=%s, want needs_rebuild", outcome)
	}

	// Session invalidated, sender at NeedsRebuild.
	if sess.Active() {
		t.Fatal("session should be invalidated")
	}
	if s.State != StateNeedsRebuild {
		t.Fatalf("state=%s, want needs_rebuild", s.State)
	}
	if s.Session() != nil {
		t.Fatal("session should be nil after NeedsRebuild")
	}
}

// --- Assignment-intent orchestration ---

func TestAssignment_CreatesSessionsForTargets(t *testing.T) {
	sg := NewSenderGroup()

	result := sg.ApplyAssignment(AssignmentIntent{
		Endpoints: map[string]Endpoint{
			"r1:9333": {DataAddr: "r1:9333", Version: 1},
			"r2:9333": {DataAddr: "r2:9333", Version: 1},
		},
		Epoch: 1,
		RecoveryTargets: map[string]SessionKind{
			"r1:9333": SessionCatchUp,
		},
	})

	if len(result.Added) != 2 {
		t.Fatalf("added=%d, want 2", len(result.Added))
	}
	if len(result.SessionsCreated) != 1 || result.SessionsCreated[0] != "r1:9333" {
		t.Fatalf("sessions created=%v", result.SessionsCreated)
	}

	// r1 has session, r2 does not.
	r1 := sg.Sender("r1:9333")
	if r1.Session() == nil {
		t.Fatal("r1 should have a session")
	}
	r2 := sg.Sender("r2:9333")
	if r2.Session() != nil {
		t.Fatal("r2 should not have a session")
	}
}

func TestAssignment_SupersedesExistingSession(t *testing.T) {
	sg := NewSenderGroup()

	// First assignment with catch-up session.
	sg.ApplyAssignment(AssignmentIntent{
		Endpoints: map[string]Endpoint{
			"r1:9333": {DataAddr: "r1:9333", Version: 1},
		},
		Epoch:           1,
		RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp},
	})
	oldSess := sg.Sender("r1:9333").Session()

	// Second assignment with rebuild session — supersedes.
	result := sg.ApplyAssignment(AssignmentIntent{
		Endpoints: map[string]Endpoint{
			"r1:9333": {DataAddr: "r1:9333", Version: 1},
		},
		Epoch:           1,
		RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild},
}) |
|||
newSess := sg.Sender("r1:9333").Session() |
|||
|
|||
if oldSess.Active() { |
|||
t.Fatal("old session should be invalidated") |
|||
} |
|||
if !newSess.Active() { |
|||
t.Fatal("new session should be active") |
|||
} |
|||
if newSess.Kind != SessionRebuild { |
|||
t.Fatalf("new session kind=%s, want rebuild", newSess.Kind) |
|||
} |
|||
if len(result.SessionsSuperseded) != 1 || result.SessionsSuperseded[0] != "r1:9333" { |
|||
t.Fatalf("superseded=%v, want [r1:9333]", result.SessionsSuperseded) |
|||
} |
|||
} |
|||
|
|||
func TestAssignment_FailsForUnknownReplica(t *testing.T) { |
|||
sg := NewSenderGroup() |
|||
|
|||
result := sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 1, |
|||
RecoveryTargets: map[string]SessionKind{"r99:9333": SessionCatchUp}, |
|||
}) |
|||
|
|||
if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r99:9333" { |
|||
t.Fatalf("sessions failed=%v, want [r99:9333]", result.SessionsFailed) |
|||
} |
|||
} |
|||
|
|||
func TestAssignment_StaleEpoch_Rejected(t *testing.T) { |
|||
sg := NewSenderGroup() |
|||
|
|||
// Epoch 2 assignment.
|
|||
sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 2, |
|||
}) |
|||
|
|||
// Stale epoch 1 assignment with recovery — must be rejected.
|
|||
result := sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 1, |
|||
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, |
|||
}) |
|||
|
|||
if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r1:9333" { |
|||
t.Fatalf("stale epoch should fail: failed=%v created=%v", result.SessionsFailed, result.SessionsCreated) |
|||
} |
|||
if sg.Sender("r1:9333").Session() != nil { |
|||
t.Fatal("stale intent must not create a session") |
|||
} |
|||
} |
|||
|
|||
// --- End-to-end prototype recovery flows ---
|
|||
|
|||
func TestE2E_CatchUpRecovery_FullFlow(t *testing.T) { |
|||
sg := NewSenderGroup() |
|||
|
|||
// Step 1: Assignment creates replicas + recovery intent.
|
|||
sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
"r2:9333": {DataAddr: "r2:9333", Version: 1}, |
|||
}, |
|||
Epoch: 1, |
|||
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, |
|||
}) |
|||
|
|||
r1 := sg.Sender("r1:9333") |
|||
sess := r1.Session() |
|||
|
|||
// Step 2: Execute recovery.
|
|||
r1.BeginConnect(sess.ID) |
|||
|
|||
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ |
|||
ReplicaFlushedLSN: 80, |
|||
CommittedLSN: 100, |
|||
RetentionStartLSN: 50, |
|||
}) |
|||
if outcome != OutcomeCatchUp { |
|||
t.Fatalf("outcome=%s", outcome) |
|||
} |
|||
|
|||
r1.BeginCatchUp(sess.ID) |
|||
r1.RecordCatchUpProgress(sess.ID, 90) |
|||
r1.RecordCatchUpProgress(sess.ID, 100) // converged
|
|||
|
|||
// Step 3: Complete.
|
|||
if !r1.CompleteSessionByID(sess.ID) { |
|||
t.Fatal("completion should succeed") |
|||
} |
|||
|
|||
// Step 4: Verify final state.
|
|||
if r1.State != StateInSync { |
|||
t.Fatalf("r1 state=%s, want in_sync", r1.State) |
|||
} |
|||
if r1.Session() != nil { |
|||
t.Fatal("session should be nil after completion") |
|||
} |
|||
|
|||
t.Logf("e2e catch-up: assignment → connect → handshake(catchup) → progress → complete → InSync") |
|||
} |
|||
|
|||
func TestE2E_NeedsRebuild_Escalation(t *testing.T) { |
|||
sg := NewSenderGroup() |
|||
|
|||
// Step 1: Assignment with catch-up intent.
|
|||
sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 1, |
|||
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, |
|||
}) |
|||
|
|||
r1 := sg.Sender("r1:9333") |
|||
sess := r1.Session() |
|||
|
|||
// Step 2: Connect + handshake → unrecoverable gap.
|
|||
r1.BeginConnect(sess.ID) |
|||
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ |
|||
ReplicaFlushedLSN: 10, |
|||
CommittedLSN: 100, |
|||
RetentionStartLSN: 50, |
|||
}) |
|||
if outcome != OutcomeNeedsRebuild { |
|||
t.Fatalf("outcome=%s", outcome) |
|||
} |
|||
|
|||
// Step 3: Sender is at NeedsRebuild, session dead.
|
|||
if r1.State != StateNeedsRebuild { |
|||
t.Fatalf("state=%s", r1.State) |
|||
} |
|||
|
|||
// Step 4: New assignment with rebuild intent.
|
|||
sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 1, |
|||
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild}, |
|||
}) |
|||
|
|||
rebuildSess := r1.Session() |
|||
if rebuildSess == nil || rebuildSess.Kind != SessionRebuild { |
|||
t.Fatal("should have rebuild session") |
|||
} |
|||
|
|||
// Step 5: Execute rebuild recovery (simulated).
|
|||
r1.BeginConnect(rebuildSess.ID) |
|||
r1.RecordHandshake(rebuildSess.ID, 0, 100) // full rebuild range
|
|||
r1.BeginCatchUp(rebuildSess.ID) |
|||
r1.RecordCatchUpProgress(rebuildSess.ID, 100) |
|||
|
|||
if !r1.CompleteSessionByID(rebuildSess.ID) { |
|||
t.Fatal("rebuild completion should succeed") |
|||
} |
|||
if r1.State != StateInSync { |
|||
t.Fatalf("after rebuild: state=%s, want in_sync", r1.State) |
|||
} |
|||
|
|||
t.Logf("e2e rebuild: catch-up→NeedsRebuild→rebuild assignment→recover→InSync") |
|||
} |
|||
|
|||
func TestE2E_ZeroGap_FastPath(t *testing.T) { |
|||
sg := NewSenderGroup() |
|||
|
|||
sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 1, |
|||
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, |
|||
}) |
|||
|
|||
r1 := sg.Sender("r1:9333") |
|||
sess := r1.Session() |
|||
|
|||
r1.BeginConnect(sess.ID) |
|||
outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ |
|||
ReplicaFlushedLSN: 100, |
|||
CommittedLSN: 100, |
|||
RetentionStartLSN: 50, |
|||
}) |
|||
if outcome != OutcomeZeroGap { |
|||
t.Fatalf("outcome=%s", outcome) |
|||
} |
|||
|
|||
// Fast path: complete directly from handshake.
|
|||
if !r1.CompleteSessionByID(sess.ID) { |
|||
t.Fatal("zero-gap fast completion should succeed") |
|||
} |
|||
if r1.State != StateInSync { |
|||
t.Fatalf("state=%s, want in_sync", r1.State) |
|||
} |
|||
|
|||
t.Logf("e2e zero-gap: assignment → connect → handshake(zero_gap) → complete → InSync") |
|||
} |
|||
|
|||
func TestE2E_EpochBump_MidRecovery_FullCycle(t *testing.T) { |
|||
sg := NewSenderGroup() |
|||
|
|||
// Epoch 1: start recovery.
|
|||
sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 1, |
|||
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, |
|||
}) |
|||
|
|||
r1 := sg.Sender("r1:9333") |
|||
sess1 := r1.Session() |
|||
r1.BeginConnect(sess1.ID) |
|||
|
|||
// Epoch bumps mid-recovery.
|
|||
sg.InvalidateEpoch(2) |
|||
// Must also update sender epoch for the new assignment.
|
|||
r1.UpdateEpoch(2) |
|||
|
|||
// Old session dead.
|
|||
if sess1.Active() { |
|||
t.Fatal("epoch-1 session should be invalidated") |
|||
} |
|||
|
|||
// Epoch 2: new assignment, new session.
|
|||
sg.ApplyAssignment(AssignmentIntent{ |
|||
Endpoints: map[string]Endpoint{ |
|||
"r1:9333": {DataAddr: "r1:9333", Version: 1}, |
|||
}, |
|||
Epoch: 2, |
|||
RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, |
|||
}) |
|||
|
|||
sess2 := r1.Session() |
|||
if sess2 == nil || sess2.Epoch != 2 { |
|||
t.Fatal("should have new session at epoch 2") |
|||
} |
|||
|
|||
// Complete at epoch 2.
|
|||
r1.BeginConnect(sess2.ID) |
|||
r1.RecordHandshakeWithOutcome(sess2.ID, HandshakeResult{ |
|||
ReplicaFlushedLSN: 100, |
|||
CommittedLSN: 100, |
|||
RetentionStartLSN: 50, |
|||
}) |
|||
r1.CompleteSessionByID(sess2.ID) |
|||
|
|||
if r1.State != StateInSync { |
|||
t.Fatalf("state=%s", r1.State) |
|||
} |
|||
t.Logf("e2e epoch bump: epoch1 recovery → bump → epoch2 recovery → InSync") |
|||
} |
|||
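The three handshake outcomes exercised above all fall out of one comparison against the test fixtures (flushed=100 → zero gap, flushed=80 → catch-up, flushed=10 → needs-rebuild, with committed=100 and retention start=50). The actual `ClassifyRecoveryOutcome` implementation lives outside this diff, so the following is only a self-contained sketch of the rule the fixtures imply; the `handshakeResult` struct and `classify` function here are illustrative stand-ins, not the enginev2 API:

```go
package main

import "fmt"

// handshakeResult mirrors the HandshakeResult fields the tests use
// (hypothetical local copy for this sketch).
type handshakeResult struct {
	replicaFlushedLSN uint64 // highest LSN durably flushed on the replica
	committedLSN      uint64 // primary's committed LSN (catch-up target)
	retentionStartLSN uint64 // oldest LSN still retained in the primary's log
}

// classify reproduces the outcome rule the fixtures imply: zero gap when
// the replica already holds the committed LSN, needs-rebuild when its gap
// begins before retained history, catch-up otherwise.
func classify(r handshakeResult) string {
	switch {
	case r.replicaFlushedLSN >= r.committedLSN:
		return "zero_gap"
	case r.replicaFlushedLSN < r.retentionStartLSN:
		return "needs_rebuild"
	default:
		return "catchup"
	}
}

func main() {
	fmt.Println(classify(handshakeResult{100, 100, 50})) // zero_gap
	fmt.Println(classify(handshakeResult{80, 100, 50}))  // catchup
	fmt.Println(classify(handshakeResult{10, 100, 50}))  // needs_rebuild
}
```

The ordering of the cases matters: the zero-gap check must come first, because a fully caught-up replica also trivially satisfies "flushed ≥ retention start".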
@@ -0,0 +1,347 @@
// Package enginev2 implements V2 per-replica sender/session ownership.
//
// Each replica has exactly one Sender that owns its identity (canonical address)
// and at most one active RecoverySession per epoch. The Sender survives topology
// changes; the session does not survive epoch bumps.
package enginev2

import (
	"fmt"
	"sync"
)

// ReplicaState tracks the per-replica replication state machine.
type ReplicaState string

const (
	StateDisconnected ReplicaState = "disconnected"
	StateConnecting   ReplicaState = "connecting"
	StateCatchingUp   ReplicaState = "catching_up"
	StateInSync       ReplicaState = "in_sync"
	StateDegraded     ReplicaState = "degraded"
	StateNeedsRebuild ReplicaState = "needs_rebuild"
)

// Endpoint represents a replica's network identity.
type Endpoint struct {
	DataAddr string
	CtrlAddr string
	Version  uint64 // bumped on address change
}

// Sender owns the replication channel to one replica. It is identified
// by ReplicaID (canonical data address at creation time) and survives
// topology changes as long as the replica stays in the set.
//
// A Sender holds at most one active RecoverySession. Normal in-sync
// operation does not require a session — Ship/Barrier work directly.
type Sender struct {
	mu sync.Mutex

	ReplicaID string       // canonical identity — stable across reconnects
	Endpoint  Endpoint     // current network address (may change via UpdateEndpoint)
	Epoch     uint64       // current epoch
	State     ReplicaState

	session *RecoverySession // nil when in-sync or disconnected without recovery
	stopped bool
}

// NewSender creates a sender for a replica at the given endpoint and epoch.
func NewSender(replicaID string, endpoint Endpoint, epoch uint64) *Sender {
	return &Sender{
		ReplicaID: replicaID,
		Endpoint:  endpoint,
		Epoch:     epoch,
		State:     StateDisconnected,
	}
}

// UpdateEpoch updates the sender's epoch. If a recovery session is active
// at a stale epoch, it is invalidated.
func (s *Sender) UpdateEpoch(epoch uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopped || epoch <= s.Epoch {
		return
	}
	oldEpoch := s.Epoch
	s.Epoch = epoch
	if s.session != nil && s.session.Epoch < epoch {
		s.session.invalidate(fmt.Sprintf("epoch_advanced_%d_to_%d", oldEpoch, epoch))
		s.session = nil
		s.State = StateDisconnected
	}
}

// UpdateEndpoint updates the sender's target address after a control-plane
// assignment refresh. If a recovery session is active and the address changed,
// the session is invalidated (the new address needs a fresh session).
func (s *Sender) UpdateEndpoint(ep Endpoint) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopped {
		return
	}
	addrChanged := s.Endpoint.DataAddr != ep.DataAddr || s.Endpoint.CtrlAddr != ep.CtrlAddr || s.Endpoint.Version != ep.Version
	s.Endpoint = ep
	if addrChanged && s.session != nil {
		s.session.invalidate("endpoint_changed")
		s.session = nil
		s.State = StateDisconnected
	}
}

// AttachSession creates and attaches a new recovery session for this sender.
// The session epoch must match the sender's current epoch — stale or future
// epoch sessions are rejected. Returns an error if a session is already active,
// the sender is stopped, or the epoch doesn't match.
func (s *Sender) AttachSession(epoch uint64, kind SessionKind) (*RecoverySession, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopped {
		return nil, fmt.Errorf("sender stopped")
	}
	if epoch != s.Epoch {
		return nil, fmt.Errorf("epoch mismatch: sender=%d session=%d", s.Epoch, epoch)
	}
	if s.session != nil && s.session.Active() {
		return nil, fmt.Errorf("session already active (epoch=%d kind=%s)", s.session.Epoch, s.session.Kind)
	}
	sess := newRecoverySession(s.ReplicaID, epoch, kind)
	s.session = sess
	// Ownership established but execution not started.
	// BeginConnect() is the first execution-state transition.
	return sess, nil
}

// SupersedeSession invalidates the current session (if any) and attaches
// a new one at the sender's current epoch. Used when an assignment change
// requires a fresh recovery path. The old session is invalidated with the
// given reason. Always uses s.Epoch — does not accept an epoch parameter,
// to prevent epoch coherence drift.
//
// Establishes ownership only — does not mutate sender state.
// BeginConnect() starts execution.
func (s *Sender) SupersedeSession(kind SessionKind, reason string) *RecoverySession {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopped {
		return nil
	}
	if s.session != nil {
		s.session.invalidate(reason)
	}
	sess := newRecoverySession(s.ReplicaID, s.Epoch, kind)
	s.session = sess
	return sess
}

// Session returns the current recovery session, or nil if none.
func (s *Sender) Session() *RecoverySession {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.session
}

// CompleteSessionByID marks the session as completed and transitions the
// sender to InSync. Requires:
//   - sessionID matches the current active session
//   - session is in PhaseCatchUp and has Converged (normal path)
//   - OR session is in PhaseHandshake and gap is zero (fast path: already in sync)
//
// Returns false if any check fails (stale ID, wrong phase, not converged).
func (s *Sender) CompleteSessionByID(sessionID uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if err := s.checkSessionAuthority(sessionID); err != nil {
		return false
	}
	sess := s.session
	switch sess.Phase {
	case PhaseCatchUp:
		if !sess.Converged() {
			return false // not converged yet
		}
	case PhaseHandshake:
		if sess.TargetLSN != sess.StartLSN {
			return false // has a gap — must catch up first
		}
		// Zero-gap fast path: handshake showed replica already at target.
	default:
		return false // not at a completion-ready phase
	}
	sess.complete()
	s.session = nil
	s.State = StateInSync
	return true
}

// === Execution APIs — sender-owned authority gate ===
//
// All execution APIs validate the sessionID against the current active session.
// This prevents stale results from old/superseded sessions from mutating state.
// The sender is the authority boundary, not the session object.

// BeginConnect transitions the session from init to connecting.
// Mutates: session.Phase → PhaseConnecting. Sender.State → StateConnecting.
// Rejects: wrong sessionID, stopped sender, session not in PhaseInit.
func (s *Sender) BeginConnect(sessionID uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if err := s.checkSessionAuthority(sessionID); err != nil {
		return err
	}
	if !s.session.Advance(PhaseConnecting) {
		return fmt.Errorf("cannot begin connect: session phase=%s", s.session.Phase)
	}
	s.State = StateConnecting
	return nil
}

// RecordHandshake records a successful handshake result and sets the catch-up range.
// Mutates: session.Phase → PhaseHandshake, session.StartLSN/TargetLSN.
// Rejects: wrong sessionID, wrong phase, invalid range.
func (s *Sender) RecordHandshake(sessionID uint64, startLSN, targetLSN uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if err := s.checkSessionAuthority(sessionID); err != nil {
		return err
	}
	if targetLSN < startLSN {
		return fmt.Errorf("invalid handshake range: target=%d < start=%d", targetLSN, startLSN)
	}
	if !s.session.Advance(PhaseHandshake) {
		return fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase)
	}
	s.session.SetRange(startLSN, targetLSN)
	return nil
}

// RecordHandshakeWithOutcome records the handshake AND classifies the recovery
// outcome. This is the preferred handshake API — it determines the recovery
// path in one step:
//   - OutcomeZeroGap: sets zero range, ready for fast completion
//   - OutcomeCatchUp: sets catch-up range, ready for BeginCatchUp
//   - OutcomeNeedsRebuild: invalidates session, transitions sender to NeedsRebuild
//
// Returns the outcome. On NeedsRebuild, the session is dead and the caller
// should not attempt further execution.
func (s *Sender) RecordHandshakeWithOutcome(sessionID uint64, result HandshakeResult) (RecoveryOutcome, error) {
	outcome := ClassifyRecoveryOutcome(result)

	s.mu.Lock()
	defer s.mu.Unlock()
	if err := s.checkSessionAuthority(sessionID); err != nil {
		return outcome, err
	}
	// Must be in PhaseConnecting — require valid execution entry point.
	if s.session.Phase != PhaseConnecting {
		return outcome, fmt.Errorf("handshake requires PhaseConnecting, got %s", s.session.Phase)
	}

	if outcome == OutcomeNeedsRebuild {
		s.session.invalidate("gap_exceeds_retention")
		s.session = nil
		s.State = StateNeedsRebuild
		return outcome, nil
	}

	if !s.session.Advance(PhaseHandshake) {
		return outcome, fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase)
	}

	switch outcome {
	case OutcomeZeroGap:
		s.session.SetRange(result.ReplicaFlushedLSN, result.ReplicaFlushedLSN)
	case OutcomeCatchUp:
		s.session.SetRange(result.ReplicaFlushedLSN, result.CommittedLSN)
	}
	return outcome, nil
}

// BeginCatchUp transitions the session from handshake to catch-up phase.
// Mutates: session.Phase → PhaseCatchUp. Sender.State → StateCatchingUp.
// Rejects: wrong sessionID, wrong phase.
func (s *Sender) BeginCatchUp(sessionID uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if err := s.checkSessionAuthority(sessionID); err != nil {
		return err
	}
	if !s.session.Advance(PhaseCatchUp) {
		return fmt.Errorf("cannot begin catch-up: session phase=%s", s.session.Phase)
	}
	s.State = StateCatchingUp
	return nil
}

// RecordCatchUpProgress records catch-up progress (highest LSN recovered).
// Mutates: session.RecoveredTo (monotonic only).
// Rejects: wrong sessionID, wrong phase, progress regression, invalidated session.
func (s *Sender) RecordCatchUpProgress(sessionID uint64, recoveredTo uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if err := s.checkSessionAuthority(sessionID); err != nil {
		return err
	}
	if s.session.Phase != PhaseCatchUp {
		return fmt.Errorf("cannot record progress: session phase=%s, want catchup", s.session.Phase)
	}
	if recoveredTo <= s.session.RecoveredTo {
		return fmt.Errorf("progress regression: current=%d proposed=%d", s.session.RecoveredTo, recoveredTo)
	}
	s.session.UpdateProgress(recoveredTo)
	return nil
}

// checkSessionAuthority validates that the sender has an active session
// matching the given ID. Must be called with s.mu held.
func (s *Sender) checkSessionAuthority(sessionID uint64) error {
	if s.stopped {
		return fmt.Errorf("sender stopped")
	}
	if s.session == nil {
		return fmt.Errorf("no active session")
	}
	if s.session.ID != sessionID {
		return fmt.Errorf("session ID mismatch: active=%d requested=%d", s.session.ID, sessionID)
	}
	if !s.session.Active() {
		return fmt.Errorf("session %d is no longer active (phase=%s)", sessionID, s.session.Phase)
	}
	return nil
}

// InvalidateSession invalidates the current session with a reason.
// Transitions the sender to the given target state.
func (s *Sender) InvalidateSession(reason string, targetState ReplicaState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.session != nil {
		s.session.invalidate(reason)
		s.session = nil
	}
	s.State = targetState
}

// Stop shuts down the sender and any active session.
func (s *Sender) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stopped {
		return
	}
	s.stopped = true
	if s.session != nil {
		s.session.invalidate("sender_stopped")
		s.session = nil
	}
}

// Stopped returns true if the sender has been stopped.
func (s *Sender) Stopped() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.stopped
}
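The authority gate above (`checkSessionAuthority`) is the same ID-based stale-fencing idea the commit message lists for distsim's session ownership tracking: results are applied only if they carry the currently issued session ID, so anything from a superseded session is rejected without extra bookkeeping. A minimal self-contained sketch of just that idea, with illustrative names (`fencedOwner`, `open`, `apply` are not the enginev2 API):

```go
package main

import "fmt"

// fencedOwner issues monotonically increasing session IDs and only
// accepts results tagged with the currently active one.
type fencedOwner struct {
	nextID   uint64
	activeID uint64 // 0 = no active session
}

// open supersedes any previous session by issuing a fresh ID.
func (o *fencedOwner) open() uint64 {
	o.nextID++
	o.activeID = o.nextID
	return o.activeID
}

// apply accepts a result only if it carries the active session ID;
// stale IDs from superseded sessions are fenced out.
func (o *fencedOwner) apply(sessionID uint64) error {
	if o.activeID == 0 {
		return fmt.Errorf("no active session")
	}
	if sessionID != o.activeID {
		return fmt.Errorf("stale session %d (active=%d)", sessionID, o.activeID)
	}
	return nil
}

func main() {
	o := &fencedOwner{}
	old := o.open()
	cur := o.open() // supersede: old ID is now fenced out
	fmt.Println(o.apply(old) != nil) // true — stale result rejected
	fmt.Println(o.apply(cur) == nil) // true — active result accepted
}
```

Because IDs are never reused, a late-arriving callback from an old session can never be mistaken for the current one — the same reason the execution APIs above take a `sessionID` rather than operating on "the" session implicitly.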
@@ -0,0 +1,119 @@
package enginev2

import (
	"sort"
	"sync"
)

// SenderGroup manages per-replica Senders with identity-preserving reconciliation.
// It is the V2 equivalent of ShipperGroup.
type SenderGroup struct {
	mu      sync.RWMutex
	senders map[string]*Sender // keyed by ReplicaID
}

// NewSenderGroup creates an empty SenderGroup.
func NewSenderGroup() *SenderGroup {
	return &SenderGroup{
		senders: map[string]*Sender{},
	}
}

// Reconcile diffs the current sender set against newEndpoints.
// Matching senders (same ReplicaID) are preserved with all state.
// Removed senders are stopped. New senders are created at the given epoch.
// Returns lists of added and removed ReplicaIDs.
func (sg *SenderGroup) Reconcile(newEndpoints map[string]Endpoint, epoch uint64) (added, removed []string) {
	sg.mu.Lock()
	defer sg.mu.Unlock()

	// Stop and remove senders not in the new set.
	for id, s := range sg.senders {
		if _, keep := newEndpoints[id]; !keep {
			s.Stop()
			delete(sg.senders, id)
			removed = append(removed, id)
		}
	}

	// Add new senders; update endpoints and epoch for existing.
	for id, ep := range newEndpoints {
		if existing, ok := sg.senders[id]; ok {
			existing.UpdateEndpoint(ep)
			existing.UpdateEpoch(epoch)
		} else {
			sg.senders[id] = NewSender(id, ep, epoch)
			added = append(added, id)
		}
	}

	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// Sender returns the sender for a ReplicaID, or nil.
func (sg *SenderGroup) Sender(replicaID string) *Sender {
	sg.mu.RLock()
	defer sg.mu.RUnlock()
	return sg.senders[replicaID]
}

// All returns all senders in deterministic order (sorted by ReplicaID).
func (sg *SenderGroup) All() []*Sender {
	sg.mu.RLock()
	defer sg.mu.RUnlock()
	out := make([]*Sender, 0, len(sg.senders))
	for _, s := range sg.senders {
		out = append(out, s)
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].ReplicaID < out[j].ReplicaID
	})
	return out
}

// Len returns the number of senders.
func (sg *SenderGroup) Len() int {
	sg.mu.RLock()
	defer sg.mu.RUnlock()
	return len(sg.senders)
}

// StopAll stops all senders.
func (sg *SenderGroup) StopAll() {
	sg.mu.Lock()
	defer sg.mu.Unlock()
	for _, s := range sg.senders {
		s.Stop()
	}
}

// InSyncCount returns the number of senders in StateInSync.
func (sg *SenderGroup) InSyncCount() int {
	sg.mu.RLock()
	defer sg.mu.RUnlock()
	count := 0
	for _, s := range sg.senders {
		if s.State == StateInSync {
			count++
		}
	}
	return count
}

// InvalidateEpoch invalidates all active sessions that are bound to
// a stale epoch. Called after promotion/epoch bump.
func (sg *SenderGroup) InvalidateEpoch(currentEpoch uint64) int {
	sg.mu.RLock()
	defer sg.mu.RUnlock()
	count := 0
	for _, s := range sg.senders {
		sess := s.Session()
		if sess != nil && sess.Epoch < currentEpoch && sess.Active() {
			s.InvalidateSession("epoch_bump", StateDisconnected)
			count++
		}
	}
	return count
}
@@ -0,0 +1,203 @@
package enginev2

import "testing"

// === SenderGroup reconciliation ===

func TestSenderGroup_Reconcile_AddNew(t *testing.T) {
	sg := NewSenderGroup()

	eps := map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
	}
	added, removed := sg.Reconcile(eps, 1)

	if len(added) != 2 || len(removed) != 0 {
		t.Fatalf("added=%v removed=%v", added, removed)
	}
	if sg.Len() != 2 {
		t.Fatalf("len: got %d, want 2", sg.Len())
	}
}

func TestSenderGroup_Reconcile_RemoveStale(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
	}, 1)

	// Remove r2, keep r1.
	_, removed := sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
	}, 1)

	if len(removed) != 1 || removed[0] != "r2:9333" {
		t.Fatalf("removed=%v, want [r2:9333]", removed)
	}
	if sg.Sender("r2:9333") != nil {
		t.Fatal("r2 should be removed")
	}
	if sg.Sender("r1:9333") == nil {
		t.Fatal("r1 should be preserved")
	}
}

func TestSenderGroup_Reconcile_PreservesState(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
	}, 1)

	// Attach session and advance.
	s := sg.Sender("r1:9333")
	sess, _ := s.AttachSession(1, SessionCatchUp)
	sess.SetRange(0, 100)
	sess.UpdateProgress(50)

	// Reconcile with same address — sender preserved.
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
	}, 1)

	s2 := sg.Sender("r1:9333")
	if s2 != s {
		t.Fatal("reconcile should preserve the same sender object")
	}
	if s2.Session() != sess {
		t.Fatal("reconcile should preserve the session")
	}
	if !sess.Active() {
		t.Fatal("session should still be active after same-address reconcile")
	}
}

func TestSenderGroup_Reconcile_MixedUpdate(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
	}, 1)

	// Keep r1, remove r2, add r3.
	added, removed := sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r3:9333": {DataAddr: "r3:9333", Version: 1},
	}, 1)

	if len(added) != 1 || added[0] != "r3:9333" {
		t.Fatalf("added=%v, want [r3:9333]", added)
	}
	if len(removed) != 1 || removed[0] != "r2:9333" {
		t.Fatalf("removed=%v, want [r2:9333]", removed)
	}
	if sg.Len() != 2 {
		t.Fatalf("len=%d, want 2", sg.Len())
	}
}

func TestSenderGroup_Reconcile_EndpointChange_InvalidatesSession(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
	}, 1)

	s := sg.Sender("r1:9333")
	sess, _ := s.AttachSession(1, SessionCatchUp)

	// Same ReplicaID but new endpoint version.
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 2},
	}, 1)

	if sess.Active() {
		t.Fatal("endpoint version change should invalidate session")
	}
	if s.Session() != nil {
		t.Fatal("session should be nil after endpoint change")
	}
}

// === Epoch invalidation ===

func TestSenderGroup_InvalidateEpoch(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
	}, 1)

	// Both have sessions at epoch 1.
	s1 := sg.Sender("r1:9333")
	s2 := sg.Sender("r2:9333")
	sess1, _ := s1.AttachSession(1, SessionCatchUp)
	sess2, _ := s2.AttachSession(1, SessionCatchUp)

	// Epoch bumps to 2. Both sessions stale.
	count := sg.InvalidateEpoch(2)
	if count != 2 {
		t.Fatalf("should invalidate 2 sessions, got %d", count)
	}
	if sess1.Active() || sess2.Active() {
		t.Fatal("both sessions should be invalidated")
	}
	if s1.State != StateDisconnected || s2.State != StateDisconnected {
		t.Fatal("senders should be disconnected after epoch invalidation")
	}
}

func TestSenderGroup_InvalidateEpoch_SkipsCurrentEpoch(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
	}, 2)

	s := sg.Sender("r1:9333")
	sess, _ := s.AttachSession(2, SessionCatchUp) // epoch 2 session

	// Invalidate epoch 2 — session AT epoch 2 should NOT be invalidated.
	count := sg.InvalidateEpoch(2)
	if count != 0 {
		t.Fatalf("should not invalidate current-epoch session, got %d", count)
	}
	if !sess.Active() {
		t.Fatal("current-epoch session should remain active")
	}
}

func TestSenderGroup_StopAll(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
	}, 1)

	sg.StopAll()

	for _, s := range sg.All() {
		if !s.Stopped() {
			t.Fatalf("%s should be stopped", s.ReplicaID)
		}
	}
}

func TestSenderGroup_All_DeterministicOrder(t *testing.T) {
	sg := NewSenderGroup()
	sg.Reconcile(map[string]Endpoint{
		"r3:9333": {DataAddr: "r3:9333", Version: 1},
		"r1:9333": {DataAddr: "r1:9333", Version: 1},
		"r2:9333": {DataAddr: "r2:9333", Version: 1},
	}, 1)

	all := sg.All()
	if len(all) != 3 {
		t.Fatalf("len=%d, want 3", len(all))
	}
	expected := []string{"r1:9333", "r2:9333", "r3:9333"}
	for i, exp := range expected {
		if all[i].ReplicaID != exp {
			t.Fatalf("all[%d]=%s, want %s", i, all[i].ReplicaID, exp)
		}
	}
}
@ -0,0 +1,407 @@ |
package enginev2

import "testing"

// === Sender lifecycle ===

func TestSender_NewSender_Disconnected(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
	if s.State != StateDisconnected {
		t.Fatalf("new sender should be Disconnected, got %s", s.State)
	}
	if s.Session() != nil {
		t.Fatal("new sender should have no session")
	}
}

func TestSender_AttachSession_Success(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, err := s.AttachSession(1, SessionCatchUp)
	if err != nil {
		t.Fatal(err)
	}
	if sess.Kind != SessionCatchUp {
		t.Fatalf("session kind: got %s, want catchup", sess.Kind)
	}
	if sess.Epoch != 1 {
		t.Fatalf("session epoch: got %d, want 1", sess.Epoch)
	}
	if !sess.Active() {
		t.Fatal("session should be active")
	}
	// AttachSession is ownership-only — sender stays Disconnected until BeginConnect.
	if s.State != StateDisconnected {
		t.Fatalf("sender state after attach: got %s, want disconnected (ownership-only)", s.State)
	}
}

func TestSender_AttachSession_RejectsDouble(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	_, err := s.AttachSession(1, SessionCatchUp)
	if err != nil {
		t.Fatal(err)
	}
	_, err = s.AttachSession(1, SessionBootstrap)
	if err == nil {
		t.Fatal("should reject second attach while session active")
	}
}

func TestSender_CompleteSession_TransitionsInSync(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)
	// Must execute full lifecycle before completing.
	s.BeginConnect(sess.ID)
	s.RecordHandshake(sess.ID, 5, 10)
	s.BeginCatchUp(sess.ID)
	s.RecordCatchUpProgress(sess.ID, 10) // converged

	if !s.CompleteSessionByID(sess.ID) {
		t.Fatal("completion should succeed when converged")
	}
	if s.State != StateInSync {
		t.Fatalf("after complete: got %s, want in_sync", s.State)
	}
	if s.Session() != nil {
		t.Fatal("session should be nil after complete")
	}
	if sess.Active() {
		t.Fatal("completed session should not be active")
	}
}

func TestSender_SupersedeSession(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	old, _ := s.AttachSession(1, SessionCatchUp)
	s.UpdateEpoch(2) // epoch bumps — old session invalidated by UpdateEpoch
	newSess := s.SupersedeSession(SessionReassign, "explicit_supersede")

	if old.Active() {
		t.Fatal("old session should be invalidated")
	}
	// Invalidated by UpdateEpoch, not by SupersedeSession (already dead).
	if old.InvalidateReason == "" {
		t.Fatal("old session should have invalidation reason")
	}
	if !newSess.Active() {
		t.Fatal("new session should be active")
	}
	if newSess.Epoch != 2 {
		t.Fatalf("new session epoch: got %d, want 2", newSess.Epoch)
	}
}

func TestSender_UpdateEndpoint_InvalidatesSession(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2})

	if sess.Active() {
		t.Fatal("session should be invalidated after endpoint change")
	}
	if sess.InvalidateReason != "endpoint_changed" {
		t.Fatalf("invalidation reason: got %q", sess.InvalidateReason)
	}
	if s.State != StateDisconnected {
		t.Fatalf("sender should be disconnected after endpoint change, got %s", s.State)
	}
	if s.Session() != nil {
		t.Fatal("session should be nil after endpoint change")
	}
}

func TestSender_UpdateEndpoint_SameAddr_PreservesSession(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", Version: 1})

	if !sess.Active() {
		t.Fatal("same-address update should preserve session")
	}
}

func TestSender_UpdateEndpoint_CtrlAddrOnly_InvalidatesSession(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9444", Version: 1})

	if sess.Active() {
		t.Fatal("CtrlAddr-only change should invalidate session")
	}
	if s.State != StateDisconnected {
		t.Fatalf("sender should be disconnected, got %s", s.State)
	}
}

func TestSender_Stop_InvalidatesSession(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.Stop()

	if sess.Active() {
		t.Fatal("session should be invalidated after stop")
	}
	if !s.Stopped() {
		t.Fatal("sender should be stopped")
	}

	// Attach after stop fails.
	_, err := s.AttachSession(1, SessionBootstrap)
	if err == nil {
		t.Fatal("attach after stop should fail")
	}
}

func TestSender_InvalidateSession_TargetState(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)
	s.InvalidateSession("timeout", StateNeedsRebuild)

	if sess.Active() {
		t.Fatal("session should be invalidated")
	}
	if s.State != StateNeedsRebuild {
		t.Fatalf("sender state: got %s, want needs_rebuild", s.State)
	}
}

// === Session lifecycle ===

func TestSession_Advance_ValidTransitions(t *testing.T) {
	sess := newRecoverySession("r1", 1, SessionCatchUp)

	if !sess.Advance(PhaseConnecting) {
		t.Fatal("init → connecting should succeed")
	}
	if !sess.Advance(PhaseHandshake) {
		t.Fatal("connecting → handshake should succeed")
	}
	if !sess.Advance(PhaseCatchUp) {
		t.Fatal("handshake → catchup should succeed")
	}
	if !sess.Advance(PhaseCompleted) {
		t.Fatal("catchup → completed should succeed")
	}
}

func TestSession_Advance_RejectsInvalidJump(t *testing.T) {
	sess := newRecoverySession("r1", 1, SessionCatchUp)

	// init → catchup is not valid (must go through connecting, handshake)
	if sess.Advance(PhaseCatchUp) {
		t.Fatal("init → catchup should be rejected")
	}
	// init → completed is not valid
	if sess.Advance(PhaseCompleted) {
		t.Fatal("init → completed should be rejected")
	}
}

func TestSession_Advance_StopsOnInvalidate(t *testing.T) {
	sess := newRecoverySession("r1", 1, SessionCatchUp)
	sess.Advance(PhaseConnecting)
	sess.Advance(PhaseHandshake)
	sess.invalidate("test")

	if sess.Advance(PhaseCatchUp) {
		t.Fatal("advance after invalidate should fail")
	}
}

func TestSender_AttachSession_RejectsEpochMismatch(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	_, err := s.AttachSession(2, SessionCatchUp)
	if err == nil {
		t.Fatal("should reject session at epoch 2 when sender is at epoch 1")
	}
}

func TestSender_UpdateEpoch_InvalidatesStaleSession(t *testing.T) {
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
	sess, _ := s.AttachSession(1, SessionCatchUp)

	s.UpdateEpoch(2)

	if sess.Active() {
		t.Fatal("session at epoch 1 should be invalidated after UpdateEpoch(2)")
	}
	if s.Epoch != 2 {
		t.Fatalf("sender epoch should be 2, got %d", s.Epoch)
	}
	if s.State != StateDisconnected {
		t.Fatalf("sender should be disconnected after epoch bump, got %s", s.State)
	}

	// Can now attach at epoch 2.
	sess2, err := s.AttachSession(2, SessionCatchUp)
	if err != nil {
		t.Fatalf("attach at new epoch should succeed: %v", err)
	}
	if sess2.Epoch != 2 {
		t.Fatalf("new session epoch: got %d, want 2", sess2.Epoch)
	}
}

func TestSession_Progress_StopsOnComplete(t *testing.T) {
	sess := newRecoverySession("r1", 1, SessionCatchUp)
	sess.SetRange(0, 100)

	sess.UpdateProgress(50)
	if sess.Converged() {
		t.Fatal("should not converge at 50/100")
	}

	sess.complete()

	if sess.UpdateProgress(100) {
		t.Fatal("update after complete should return false")
	}
}

func TestSession_Converged(t *testing.T) {
	sess := newRecoverySession("r1", 1, SessionCatchUp)
	sess.SetRange(0, 10)

	sess.UpdateProgress(9)
	if sess.Converged() {
		t.Fatal("9 < 10: not converged")
	}

	sess.UpdateProgress(10)
	if !sess.Converged() {
		t.Fatal("10 >= 10: should be converged")
	}
}

// === Bridge tests: ownership invariants matching distsim scenarios ===

func TestBridge_StaleCompletion_AfterSupersede_HasNoEffect(t *testing.T) {
	// Matches distsim TestP04a_StaleCompletion_AfterSupersede_Rejected.
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	// First session.
	sess1, _ := s.AttachSession(1, SessionCatchUp)
	sess1.Advance(PhaseConnecting)
	sess1.Advance(PhaseHandshake)
	sess1.Advance(PhaseCatchUp)

	// Supersede with new session.
	s.UpdateEpoch(2)
	sess2, _ := s.AttachSession(2, SessionCatchUp)

	// Old session: advance/complete has no effect (already invalidated).
	if sess1.Advance(PhaseCompleted) {
		t.Fatal("stale session should not advance to completed")
	}
	if sess1.Active() {
		t.Fatal("old session should be inactive")
	}

	// New session: still active and owns the sender.
	if !sess2.Active() {
		t.Fatal("new session should be active")
	}
	if s.Session() != sess2 {
		t.Fatal("sender should own the new session")
	}

	// Stale completion by OLD session ID — REJECTED by identity check.
	if s.CompleteSessionByID(sess1.ID) {
		t.Fatal("stale completion with old session ID must be rejected")
	}
	// Sender must NOT have moved to InSync.
	if s.State == StateInSync {
		t.Fatal("sender must not be InSync after stale completion")
	}
	// New session must still be active.
	if !sess2.Active() {
		t.Fatal("new session must still be active after stale completion rejected")
	}

	// Correct completion by NEW session ID — requires full execution path.
	s.BeginConnect(sess2.ID)
	s.RecordHandshake(sess2.ID, 0, 10)
	s.BeginCatchUp(sess2.ID)
	s.RecordCatchUpProgress(sess2.ID, 10)
	if !s.CompleteSessionByID(sess2.ID) {
		t.Fatal("completion with correct session ID should succeed after convergence")
	}
	if s.State != StateInSync {
		t.Fatalf("sender should be InSync after correct completion, got %s", s.State)
	}
}

func TestBridge_EpochBump_RejectedCompletion(t *testing.T) {
	// Matches distsim TestP04a_EpochBumpDuringCatchup_InvalidatesSession.
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)
	sess.Advance(PhaseConnecting)

	// Epoch bumps — session invalidated.
	s.UpdateEpoch(2)

	// Attempting to advance the old session fails.
	if sess.Advance(PhaseHandshake) {
		t.Fatal("stale session should not advance after epoch bump")
	}

	// Attempting to attach at old epoch fails.
	_, err := s.AttachSession(1, SessionCatchUp)
	if err == nil {
		t.Fatal("attach at stale epoch should fail")
	}

	// Attach at new epoch succeeds.
	sess2, err := s.AttachSession(2, SessionCatchUp)
	if err != nil {
		t.Fatalf("attach at new epoch should succeed: %v", err)
	}
	if sess2.Epoch != 2 {
		t.Fatalf("new session epoch=%d, want 2", sess2.Epoch)
	}
}

func TestBridge_EndpointChange_InvalidatesAndAllowsNewSession(t *testing.T) {
	// Matches distsim TestP04a_EndpointChangeDuringCatchup_InvalidatesSession.
	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)

	sess, _ := s.AttachSession(1, SessionCatchUp)

	// Endpoint changes.
	s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2})

	// Old session dead.
	if sess.Active() {
		t.Fatal("session should be invalidated")
	}

	// New session can be attached (same epoch, new endpoint).
	sess2, err := s.AttachSession(1, SessionCatchUp)
	if err != nil {
		t.Fatalf("new session after endpoint change: %v", err)
	}
	if !sess2.Active() {
		t.Fatal("new session should be active")
	}
}

func TestSession_DoubleInvalidate_Safe(t *testing.T) {
	sess := newRecoverySession("r1", 1, SessionCatchUp)
	sess.invalidate("first")
	sess.invalidate("second") // should not panic or change reason

	if sess.InvalidateReason != "first" {
		t.Fatalf("reason should be first, got %q", sess.InvalidateReason)
	}
}
@ -0,0 +1,151 @@
package enginev2

import (
	"sync"
	"sync/atomic"
)

// SessionKind identifies how the recovery session was created.
type SessionKind string

const (
	SessionBootstrap SessionKind = "bootstrap" // fresh replica, no prior state
	SessionCatchUp   SessionKind = "catchup"   // WAL gap recovery
	SessionRebuild   SessionKind = "rebuild"   // full extent + WAL rebuild
	SessionReassign  SessionKind = "reassign"  // address change recovery
)

// SessionPhase tracks progress within a recovery session.
type SessionPhase string

const (
	PhaseInit        SessionPhase = "init"
	PhaseConnecting  SessionPhase = "connecting"
	PhaseHandshake   SessionPhase = "handshake"
	PhaseCatchUp     SessionPhase = "catchup"
	PhaseCompleted   SessionPhase = "completed"
	PhaseInvalidated SessionPhase = "invalidated"
)

// sessionIDCounter generates unique session IDs across all senders.
var sessionIDCounter atomic.Uint64

// RecoverySession represents one recovery attempt for a specific replica
// at a specific epoch. It is owned by a Sender and has exclusive authority
// to transition the replica through connecting → handshake → catchup → complete.
//
// Each session has a unique ID. Stale completions are rejected by ID, not
// by pointer comparison. This prevents old sessions from mutating state
// even if they retain a reference to the sender.
//
// Lifecycle rules:
//   - At most one active session per Sender
//   - Session is bound to an epoch; epoch bump invalidates it
//   - Session is bound to an endpoint; address change invalidates it
//   - Completed sessions release ownership back to the Sender
//   - Invalidated sessions are dead and cannot be reused
type RecoverySession struct {
	mu sync.Mutex

	ID               uint64 // unique, monotonic, never reused
	ReplicaID        string
	Epoch            uint64
	Kind             SessionKind
	Phase            SessionPhase
	InvalidateReason string // non-empty when invalidated

	// Progress tracking.
	StartLSN    uint64 // gap start (exclusive)
	TargetLSN   uint64 // gap end (inclusive)
	RecoveredTo uint64 // highest LSN recovered so far
}

func newRecoverySession(replicaID string, epoch uint64, kind SessionKind) *RecoverySession {
	return &RecoverySession{
		ID:        sessionIDCounter.Add(1),
		ReplicaID: replicaID,
		Epoch:     epoch,
		Kind:      kind,
		Phase:     PhaseInit,
	}
}

// Active returns true if the session has not been completed or invalidated.
func (rs *RecoverySession) Active() bool {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	return rs.Phase != PhaseCompleted && rs.Phase != PhaseInvalidated
}

// validTransitions defines the allowed phase transitions.
// Each phase maps to the set of phases it can transition to.
var validTransitions = map[SessionPhase]map[SessionPhase]bool{
	PhaseInit:       {PhaseConnecting: true, PhaseInvalidated: true},
	PhaseConnecting: {PhaseHandshake: true, PhaseInvalidated: true},
	PhaseHandshake:  {PhaseCatchUp: true, PhaseCompleted: true, PhaseInvalidated: true},
	PhaseCatchUp:    {PhaseCompleted: true, PhaseInvalidated: true},
}

// Advance moves the session to the next phase. Returns false if the
// transition is not valid (wrong source phase, already terminal, or
// illegal jump). Enforces the lifecycle:
//
//	init → connecting → handshake → catchup → completed
//	         ↘ invalidated (from any non-terminal)
func (rs *RecoverySession) Advance(phase SessionPhase) bool {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
		return false
	}
	allowed := validTransitions[rs.Phase]
	if !allowed[phase] {
		return false
	}
	rs.Phase = phase
	return true
}

// UpdateProgress records catch-up progress. Returns false if stale.
func (rs *RecoverySession) UpdateProgress(recoveredTo uint64) bool {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
		return false
	}
	if recoveredTo > rs.RecoveredTo {
		rs.RecoveredTo = recoveredTo
	}
	return true
}

// SetRange sets the recovery LSN range.
func (rs *RecoverySession) SetRange(start, target uint64) {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	rs.StartLSN = start
	rs.TargetLSN = target
}

// Converged returns true if recovery has reached the target.
func (rs *RecoverySession) Converged() bool {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	return rs.TargetLSN > 0 && rs.RecoveredTo >= rs.TargetLSN
}

func (rs *RecoverySession) complete() {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	rs.Phase = PhaseCompleted
}

func (rs *RecoverySession) invalidate(reason string) {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
		return
	}
	rs.Phase = PhaseInvalidated
	rs.InvalidateReason = reason
}
@ -0,0 +1,162 @@
package fsmv2

func (f *FSM) Apply(evt Event) ([]Action, error) {
	switch evt.Kind {
	case EventEpochChanged:
		if evt.Epoch <= f.Epoch {
			return nil, nil
		}
		f.Epoch = evt.Epoch
		switch f.State {
		case StateInSync:
			f.State = StateLagging
			f.clearCatchup()
			f.clearRebuild()
			return []Action{ActionRevokeSyncEligibility}, nil
		case StateCatchingUp, StatePromotionHold, StateRebuilding, StateCatchUpAfterBuild:
			f.State = StateLagging
			f.clearCatchup()
			f.clearRebuild()
			return []Action{ActionAbortRecovery, ActionRevokeSyncEligibility}, nil
		default:
			f.clearCatchup()
			f.clearRebuild()
			return nil, nil
		}
	case EventFatal:
		f.State = StateFailed
		f.clearCatchup()
		f.clearRebuild()
		return []Action{ActionFailReplica, ActionRevokeSyncEligibility}, nil
	}

	switch f.State {
	case StateBootstrapping:
		switch evt.Kind {
		case EventBootstrapComplete:
			f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
			f.State = StateInSync
			return []Action{ActionGrantSyncEligibility}, nil
		case EventDisconnect:
			f.State = StateLagging
			return nil, nil
		}
	case StateInSync:
		switch evt.Kind {
		case EventDurableProgress:
			if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
				return nil, invalid(f.State, evt.Kind)
			}
			f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
			return nil, nil
		case EventDisconnect:
			f.State = StateLagging
			return []Action{ActionRevokeSyncEligibility}, nil
		}
	case StateLagging:
		switch evt.Kind {
		case EventReconnectCatchup:
			f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
			f.CatchupStartLSN = evt.ReplicaFlushedLSN
			f.CatchupTargetLSN = evt.TargetLSN
			f.PromotionBarrierLSN = evt.TargetLSN
			f.RecoveryReservationID = evt.ReservationID
			f.ReservationExpiry = evt.ReservationTTL
			f.State = StateCatchingUp
			return []Action{ActionStartCatchup}, nil
		case EventReconnectRebuild:
			f.State = StateNeedsRebuild
			return []Action{ActionRevokeSyncEligibility}, nil
		}
	case StateCatchingUp:
		switch evt.Kind {
		case EventCatchupProgress:
			if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
				return nil, invalid(f.State, evt.Kind)
			}
			f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
			if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN {
				f.State = StatePromotionHold
				f.PromotionHoldUntil = evt.PromotionHoldTill
				return []Action{ActionEnterPromotionHold}, nil
			}
			return nil, nil
		case EventRetentionLost, EventCatchupTimeout:
			f.State = StateNeedsRebuild
			f.clearCatchup()
			return []Action{ActionAbortRecovery}, nil
		case EventDisconnect:
			f.State = StateLagging
			f.clearCatchup()
			return []Action{ActionAbortRecovery}, nil
		}
	case StatePromotionHold:
		switch evt.Kind {
		case EventDurableProgress:
			if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
				return nil, invalid(f.State, evt.Kind)
			}
			f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
			return nil, nil
		case EventPromotionHealthy:
			if evt.Now < f.PromotionHoldUntil {
				return nil, nil
			}
			f.State = StateInSync
			f.clearCatchup()
			return []Action{ActionGrantSyncEligibility}, nil
		case EventDisconnect:
			f.State = StateLagging
			f.clearCatchup()
			return []Action{ActionRevokeSyncEligibility}, nil
		}
	case StateNeedsRebuild:
		switch evt.Kind {
		case EventStartRebuild:
			f.State = StateRebuilding
			f.SnapshotID = evt.SnapshotID
			f.SnapshotCpLSN = evt.SnapshotCpLSN
			f.RecoveryReservationID = evt.ReservationID
			f.ReservationExpiry = evt.ReservationTTL
			return []Action{ActionStartRebuild}, nil
		}
	case StateRebuilding:
		switch evt.Kind {
		case EventRebuildBaseApplied:
			f.State = StateCatchUpAfterBuild
			f.ReplicaFlushedLSN = f.SnapshotCpLSN
			f.CatchupStartLSN = f.SnapshotCpLSN
			f.CatchupTargetLSN = evt.TargetLSN
			f.PromotionBarrierLSN = evt.TargetLSN
			return []Action{ActionStartCatchup}, nil
		case EventRetentionLost, EventRebuildTooSlow, EventDisconnect:
			f.State = StateNeedsRebuild
			f.clearCatchup()
			f.clearRebuild()
			return []Action{ActionAbortRecovery}, nil
		}
	case StateCatchUpAfterBuild:
		switch evt.Kind {
		case EventCatchupProgress:
			if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN {
				return nil, invalid(f.State, evt.Kind)
			}
			f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN
			if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN {
				f.State = StatePromotionHold
				f.PromotionHoldUntil = evt.PromotionHoldTill
				return []Action{ActionEnterPromotionHold}, nil
			}
			return nil, nil
		case EventRetentionLost, EventCatchupTimeout, EventDisconnect:
			f.State = StateNeedsRebuild
			f.clearCatchup()
			f.clearRebuild()
			return []Action{ActionAbortRecovery}, nil
		}
	case StateFailed:
		return nil, nil
	}

	return nil, invalid(f.State, evt.Kind)
}
@ -0,0 +1,37 @@
package fsmv2

type EventKind string

const (
	EventBootstrapComplete  EventKind = "BootstrapComplete"
	EventDisconnect         EventKind = "Disconnect"
	EventReconnectCatchup   EventKind = "ReconnectCatchup"
	EventReconnectRebuild   EventKind = "ReconnectRebuild"
	EventDurableProgress    EventKind = "DurableProgress"
	EventCatchupProgress    EventKind = "CatchupProgress"
	EventPromotionHealthy   EventKind = "PromotionHealthy"
	EventStartRebuild       EventKind = "StartRebuild"
	EventRebuildBaseApplied EventKind = "RebuildBaseApplied"
	EventRetentionLost      EventKind = "RetentionLost"
	EventCatchupTimeout     EventKind = "CatchupTimeout"
	EventRebuildTooSlow     EventKind = "RebuildTooSlow"
	EventEpochChanged       EventKind = "EpochChanged"
	EventFatal              EventKind = "Fatal"
)

type Event struct {
	Kind EventKind

	Epoch uint64
	Now   uint64

	ReplicaFlushedLSN uint64
	TargetLSN         uint64
	PromotionHoldTill uint64

	SnapshotID    string
	SnapshotCpLSN uint64

	ReservationID  string
	ReservationTTL uint64
}
@ -0,0 +1,73 @@
package fsmv2

import "fmt"

type State string

const (
	StateBootstrapping     State = "Bootstrapping"
	StateInSync            State = "InSync"
	StateLagging           State = "Lagging"
	StateCatchingUp        State = "CatchingUp"
	StatePromotionHold     State = "PromotionHold"
	StateNeedsRebuild      State = "NeedsRebuild"
	StateRebuilding        State = "Rebuilding"
	StateCatchUpAfterBuild State = "CatchUpAfterRebuild"
	StateFailed            State = "Failed"
)

type Action string

const (
	ActionNone                  Action = "None"
	ActionGrantSyncEligibility  Action = "GrantSyncEligibility"
	ActionRevokeSyncEligibility Action = "RevokeSyncEligibility"
	ActionStartCatchup          Action = "StartCatchup"
	ActionEnterPromotionHold    Action = "EnterPromotionHold"
	ActionStartRebuild          Action = "StartRebuild"
	ActionAbortRecovery         Action = "AbortRecovery"
	ActionFailReplica           Action = "FailReplica"
)

type FSM struct {
	State State
	Epoch uint64

	ReplicaFlushedLSN   uint64
	CatchupStartLSN     uint64
	CatchupTargetLSN    uint64
	PromotionBarrierLSN uint64
	PromotionHoldUntil  uint64

	SnapshotID    string
	SnapshotCpLSN uint64

	RecoveryReservationID string
	ReservationExpiry     uint64
}

func New(epoch uint64) *FSM {
	return &FSM{State: StateBootstrapping, Epoch: epoch}
}

func (f *FSM) IsSyncEligible() bool {
	return f.State == StateInSync
}

func (f *FSM) clearCatchup() {
	f.CatchupStartLSN = 0
	f.CatchupTargetLSN = 0
	f.PromotionBarrierLSN = 0
	f.PromotionHoldUntil = 0
	f.RecoveryReservationID = ""
	f.ReservationExpiry = 0
}

func (f *FSM) clearRebuild() {
	f.SnapshotID = ""
	f.SnapshotCpLSN = 0
}

func invalid(state State, kind EventKind) error {
	return fmt.Errorf("fsmv2: invalid event %s in state %s", kind, state)
}
@ -0,0 +1,95 @@
|||
package fsmv2 |
|||
|
|||
import "testing" |
|||
|
|||
func mustApply(t *testing.T, f *FSM, evt Event) []Action { |
|||
t.Helper() |
|||
actions, err := f.Apply(evt) |
|||
if err != nil { |
|||
t.Fatalf("apply %s: %v", evt.Kind, err) |
|||
} |
|||
return actions |
|||
} |
|||
|
|||
func TestFSMBootstrapToInSync(t *testing.T) { |
|||
f := New(7) |
|||
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 10}) |
|||
if f.State != StateInSync || !f.IsSyncEligible() || f.ReplicaFlushedLSN != 10 { |
|||
t.Fatalf("unexpected bootstrap result: state=%s eligible=%v lsn=%d", f.State, f.IsSyncEligible(), f.ReplicaFlushedLSN) |
|||
} |
|||
} |
|||
|
|||
func TestFSMCatchupPromotionHoldFlow(t *testing.T) { |
|||
f := New(3) |
|||
mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 5}) |
|||
mustApply(t, f, Event{Kind: EventDisconnect}) |
|||
	mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 5, TargetLSN: 20, ReservationID: "r1", ReservationTTL: 100})
	if f.State != StateCatchingUp {
		t.Fatalf("expected catching up, got %s", f.State)
	}
	mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 20, PromotionHoldTill: 30})
	if f.State != StatePromotionHold {
		t.Fatalf("expected promotion hold, got %s", f.State)
	}
	mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 29})
	if f.State != StatePromotionHold {
		t.Fatalf("hold exited too early: %s", f.State)
	}
	mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 30})
	if f.State != StateInSync || !f.IsSyncEligible() {
		t.Fatalf("expected insync after hold, got %s eligible=%v", f.State, f.IsSyncEligible())
	}
}

func TestFSMRebuildFlow(t *testing.T) {
	f := New(11)
	mustApply(t, f, Event{Kind: EventDisconnect})
	mustApply(t, f, Event{Kind: EventReconnectRebuild})
	if f.State != StateNeedsRebuild {
		t.Fatalf("expected needs rebuild, got %s", f.State)
	}
	mustApply(t, f, Event{Kind: EventStartRebuild, SnapshotID: "snap-1", SnapshotCpLSN: 100, ReservationID: "rr", ReservationTTL: 200})
	if f.State != StateRebuilding {
		t.Fatalf("expected rebuilding, got %s", f.State)
	}
	mustApply(t, f, Event{Kind: EventRebuildBaseApplied, TargetLSN: 140})
	if f.State != StateCatchUpAfterBuild || f.ReplicaFlushedLSN != 100 {
		t.Fatalf("unexpected rebuild-base state=%s lsn=%d", f.State, f.ReplicaFlushedLSN)
	}
	mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 140, PromotionHoldTill: 150})
	mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 150})
	if f.State != StateInSync || f.SnapshotID != "snap-1" {
		t.Fatalf("expected insync after rebuild, got state=%s snapshot=%q", f.State, f.SnapshotID)
	}
}

func TestFSMEpochChangeAbortsRecovery(t *testing.T) {
	f := New(1)
	mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 1})
	mustApply(t, f, Event{Kind: EventDisconnect})
	mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "r1", ReservationTTL: 99})
	mustApply(t, f, Event{Kind: EventEpochChanged, Epoch: 2})
	if f.State != StateLagging || f.RecoveryReservationID != "" || f.IsSyncEligible() {
		t.Fatalf("unexpected state after epoch change: state=%s reservation=%q eligible=%v", f.State, f.RecoveryReservationID, f.IsSyncEligible())
	}
}

func TestFSMReservationLostNeedsRebuild(t *testing.T) {
	f := New(5)
	mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 9})
	mustApply(t, f, Event{Kind: EventDisconnect})
	mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 9, TargetLSN: 15, ReservationID: "r2", ReservationTTL: 80})
	mustApply(t, f, Event{Kind: EventRetentionLost})
	if f.State != StateNeedsRebuild {
		t.Fatalf("expected needs rebuild after reservation lost, got %s", f.State)
	}
}

func TestFSMDurableProgressWhileInSync(t *testing.T) {
	f := New(2)
	mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 4})
	mustApply(t, f, Event{Kind: EventDurableProgress, ReplicaFlushedLSN: 8})
	if f.ReplicaFlushedLSN != 8 || f.State != StateInSync {
		t.Fatalf("unexpected in-sync durable progress: state=%s lsn=%d", f.State, f.ReplicaFlushedLSN)
	}
}
@@ -0,0 +1,37 @@
param(
    [string[]]$Packages = @(
        './sw-block/prototype/fsmv2',
        './sw-block/prototype/volumefsm',
        './sw-block/prototype/distsim'
    )
)

$ErrorActionPreference = 'Stop'
$root = Split-Path -Parent (Split-Path -Parent $PSScriptRoot)
Set-Location $root

$cacheDir = Join-Path $root '.gocache_v2'
$tmpDir = Join-Path $root '.gotmp_v2'
New-Item -ItemType Directory -Force -Path $cacheDir,$tmpDir | Out-Null
$env:GOCACHE = $cacheDir
$env:GOTMPDIR = $tmpDir

foreach ($pkg in $Packages) {
    $name = Split-Path $pkg -Leaf
    $out = Join-Path $root ("sw-block\prototype\{0}\{0}.test.exe" -f $name)
    Write-Host "==> building $pkg"
    go test -c -o $out $pkg
    # Check the compiler's exit code before anything else: a non-zero exit
    # from `go test -c` means the build failed, so fail fast rather than
    # warning and continuing with a stale or missing binary.
    if ($LASTEXITCODE -ne 0) {
        throw "go test -c failed for $pkg"
    }
    if (!(Test-Path $out)) {
        throw "go test -c produced no test binary for $pkg"
    }
    Write-Host "==> running $out"
    cmd /c "cd /d $root && $out -test.v -test.count=1"
    if ($LASTEXITCODE -ne 0) {
        throw "test binary failed for $pkg"
    }
}

Write-Host "Done."
@@ -0,0 +1,148 @@
package volumefsm

import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"

type EventKind string

const (
	EventWriteCommitted            EventKind = "WriteCommitted"
	EventCheckpointAdvanced        EventKind = "CheckpointAdvanced"
	EventBarrierCompleted          EventKind = "BarrierCompleted"
	EventBootstrapReplica          EventKind = "BootstrapReplica"
	EventReplicaDisconnect         EventKind = "ReplicaDisconnect"
	EventReplicaReconnect          EventKind = "ReplicaReconnect"
	EventReplicaNeedsRebuild       EventKind = "ReplicaNeedsRebuild"
	EventReplicaCatchupProgress    EventKind = "ReplicaCatchupProgress"
	EventReplicaPromotionHealthy   EventKind = "ReplicaPromotionHealthy"
	EventReplicaStartRebuild       EventKind = "ReplicaStartRebuild"
	EventReplicaRebuildBaseApplied EventKind = "ReplicaRebuildBaseApplied"
	EventReplicaReservationLost    EventKind = "ReplicaReservationLost"
	EventReplicaCatchupTimeout     EventKind = "ReplicaCatchupTimeout"
	EventReplicaRebuildTooSlow     EventKind = "ReplicaRebuildTooSlow"
	EventPrimaryLeaseLost          EventKind = "PrimaryLeaseLost"
	EventPromoteReplica            EventKind = "PromoteReplica"
)

type Event struct {
	Kind      EventKind
	ReplicaID string

	LSN               uint64
	CheckpointLSN     uint64
	ReplicaFlushedLSN uint64
	TargetLSN         uint64
	Now               uint64
	HoldUntil         uint64
	SnapshotID        string
	SnapshotCpLSN     uint64
	ReservationID     string
	ReservationTTL    uint64
}

func (m *Model) Apply(evt Event) error {
	switch evt.Kind {
	case EventWriteCommitted:
		if evt.LSN > m.HeadLSN {
			m.HeadLSN = evt.LSN
		} else {
			// No explicit (or a stale) LSN supplied: advance the head by one.
			m.HeadLSN++
		}
		return nil
	case EventCheckpointAdvanced:
		if evt.CheckpointLSN > m.CheckpointLSN {
			m.CheckpointLSN = evt.CheckpointLSN
		}
		return nil
	case EventPrimaryLeaseLost:
		m.PrimaryState = PrimaryLost
		m.Epoch++
		for _, r := range m.Replicas {
			_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
			if err != nil {
				return err
			}
		}
		return nil
	case EventPromoteReplica:
		m.PrimaryID = evt.ReplicaID
		m.PrimaryState = PrimaryServing
		m.Epoch++
		for _, r := range m.Replicas {
			_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
			if err != nil {
				return err
			}
		}
		return nil
	}

	r := m.Replica(evt.ReplicaID)
	if r == nil {
		return nil
	}

	var fEvt fsmv2.Event
	switch evt.Kind {
	case EventBarrierCompleted:
		fEvt = fsmv2.Event{Kind: fsmv2.EventDurableProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
	case EventBootstrapReplica:
		fEvt = fsmv2.Event{Kind: fsmv2.EventBootstrapComplete, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
	case EventReplicaDisconnect:
		fEvt = fsmv2.Event{Kind: fsmv2.EventDisconnect}
	case EventReplicaReconnect:
		if evt.ReservationID != "" {
			fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectCatchup, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, TargetLSN: evt.TargetLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
		} else {
			fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
		}
	case EventReplicaNeedsRebuild:
		fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
	case EventReplicaCatchupProgress:
		fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, PromotionHoldTill: evt.HoldUntil}
	case EventReplicaPromotionHealthy:
		fEvt = fsmv2.Event{Kind: fsmv2.EventPromotionHealthy, Now: evt.Now}
	case EventReplicaStartRebuild:
		fEvt = fsmv2.Event{Kind: fsmv2.EventStartRebuild, SnapshotID: evt.SnapshotID, SnapshotCpLSN: evt.SnapshotCpLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
	case EventReplicaRebuildBaseApplied:
		fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildBaseApplied, TargetLSN: evt.TargetLSN}
	case EventReplicaReservationLost:
		fEvt = fsmv2.Event{Kind: fsmv2.EventRetentionLost}
	case EventReplicaCatchupTimeout:
		fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupTimeout}
	case EventReplicaRebuildTooSlow:
		fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildTooSlow}
	default:
		return nil
	}
	_, err := r.FSM.Apply(fEvt)
	return err
}

func (m *Model) EvaluateReconnect(replicaID string, flushedLSN, targetLSN uint64) (RecoveryDecision, error) {
	decision := m.Planner.PlanReconnect(replicaID, flushedLSN, targetLSN)
	r := m.Replica(replicaID)
	if r == nil {
		return decision, nil
	}
	switch decision.Disposition {
	case RecoveryCatchup:
		err := m.Apply(Event{
			Kind:              EventReplicaReconnect,
			ReplicaID:         replicaID,
			ReplicaFlushedLSN: flushedLSN,
			TargetLSN:         targetLSN,
			ReservationID:     decision.ReservationID,
			ReservationTTL:    decision.ReservationTTL,
		})
		return decision, err
	default:
		if r.FSM.State == fsmv2.StateNeedsRebuild {
			return decision, nil
		}
		err := m.Apply(Event{
			Kind:      EventReplicaNeedsRebuild,
			ReplicaID: replicaID,
		})
		return decision, err
	}
}
@@ -0,0 +1,38 @@
package volumefsm

import (
	"fmt"
	"sort"
	"strings"
)

func FormatSnapshot(s Snapshot) string {
	ids := make([]string, 0, len(s.Replicas))
	for id := range s.Replicas {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	parts := []string{
		fmt.Sprintf("step=%s", s.Step),
		fmt.Sprintf("epoch=%d", s.Epoch),
		fmt.Sprintf("primary=%s/%s", s.PrimaryID, s.PrimaryState),
		fmt.Sprintf("head=%d", s.HeadLSN),
		fmt.Sprintf("write=%t:%s", s.WriteGate.Allowed, s.WriteGate.Reason),
		fmt.Sprintf("ack=%t:%s", s.AckGate.Allowed, s.AckGate.Reason),
	}
	for _, id := range ids {
		r := s.Replicas[id]
		parts = append(parts, fmt.Sprintf("%s=%s@%d", id, r.State, r.FlushedLSN))
	}
	return strings.Join(parts, " ")
}

func FormatTrace(trace []Snapshot) string {
	lines := make([]string, 0, len(trace))
	for _, s := range trace {
		lines = append(lines, FormatSnapshot(s))
	}
	return strings.Join(lines, "\n")
}
@@ -0,0 +1,142 @@
package volumefsm

import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"

type Mode string

const (
	ModeBestEffort Mode = "best_effort"
	ModeSyncAll    Mode = "sync_all"
	ModeSyncQuorum Mode = "sync_quorum"
)

type PrimaryState string

const (
	PrimaryServing  PrimaryState = "serving"
	PrimaryDraining PrimaryState = "draining"
	PrimaryLost     PrimaryState = "lost"
)

type Replica struct {
	ID  string
	FSM *fsmv2.FSM
}

type Model struct {
	Epoch        uint64
	PrimaryID    string
	PrimaryState PrimaryState
	Mode         Mode

	HeadLSN       uint64
	CheckpointLSN uint64

	RequiredReplicaIDs []string
	Replicas           map[string]*Replica
	Planner            RecoveryPlanner
}

func New(primaryID string, mode Mode, epoch uint64, replicaIDs ...string) *Model {
	m := &Model{
		Epoch:        epoch,
		PrimaryID:    primaryID,
		PrimaryState: PrimaryServing,
		Mode:         mode,
		Replicas:     make(map[string]*Replica, len(replicaIDs)),
		Planner:      StaticRecoveryPlanner{},
	}
	for _, id := range replicaIDs {
		m.Replicas[id] = &Replica{ID: id, FSM: fsmv2.New(epoch)}
		m.RequiredReplicaIDs = append(m.RequiredReplicaIDs, id)
	}
	return m
}

func (m *Model) Replica(id string) *Replica {
	return m.Replicas[id]
}

func (m *Model) SyncEligibleCount() int {
	count := 0
	for _, id := range m.RequiredReplicaIDs {
		r := m.Replicas[id]
		if r != nil && r.FSM.IsSyncEligible() {
			count++
		}
	}
	return count
}

func (m *Model) DurableReplicaCount(targetLSN uint64) int {
	count := 0
	for _, id := range m.RequiredReplicaIDs {
		r := m.Replicas[id]
		if r != nil && r.FSM.IsSyncEligible() && r.FSM.ReplicaFlushedLSN >= targetLSN {
			count++
		}
	}
	return count
}

// Quorum returns the majority of the replication factor, where the
// replication factor counts the primary plus all required replicas.
func (m *Model) Quorum() int {
	rf := len(m.RequiredReplicaIDs) + 1
	return rf/2 + 1
}

func (m *Model) CanServeWrite() bool {
	return m.WriteAdmission().Allowed
}

type AdmissionDecision struct {
	Allowed bool
	Reason  string
}

func (m *Model) WriteAdmission() AdmissionDecision {
	if m.PrimaryState != PrimaryServing {
		return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"}
	}
	switch m.Mode {
	case ModeBestEffort:
		return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"}
	case ModeSyncAll:
		if m.SyncEligibleCount() == len(m.RequiredReplicaIDs) {
			return AdmissionDecision{Allowed: true, Reason: "all_replicas_sync_eligible"}
		}
		return AdmissionDecision{Allowed: false, Reason: "required_replica_not_in_sync"}
	case ModeSyncQuorum:
		if 1+m.SyncEligibleCount() >= m.Quorum() {
			return AdmissionDecision{Allowed: true, Reason: "quorum_sync_eligible"}
		}
		return AdmissionDecision{Allowed: false, Reason: "quorum_not_available"}
	default:
		return AdmissionDecision{Allowed: false, Reason: "unknown_mode"}
	}
}

func (m *Model) CanAcknowledgeLSN(targetLSN uint64) bool {
	return m.AckAdmission(targetLSN).Allowed
}

func (m *Model) AckAdmission(targetLSN uint64) AdmissionDecision {
	if m.PrimaryState != PrimaryServing {
		return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"}
	}
	switch m.Mode {
	case ModeBestEffort:
		return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"}
	case ModeSyncAll:
		if m.DurableReplicaCount(targetLSN) == len(m.RequiredReplicaIDs) {
			return AdmissionDecision{Allowed: true, Reason: "all_replicas_durable"}
		}
		return AdmissionDecision{Allowed: false, Reason: "required_replica_not_durable"}
	case ModeSyncQuorum:
		if 1+m.DurableReplicaCount(targetLSN) >= m.Quorum() {
			return AdmissionDecision{Allowed: true, Reason: "quorum_durable"}
		}
		return AdmissionDecision{Allowed: false, Reason: "durable_quorum_not_available"}
	default:
		return AdmissionDecision{Allowed: false, Reason: "unknown_mode"}
	}
}
@@ -0,0 +1,421 @@
package volumefsm

import (
	"strings"
	"testing"

	fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
)

type scriptedPlanner struct {
	decision RecoveryDecision
}

func (s scriptedPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
	return s.decision
}

func mustApply(t *testing.T, m *Model, evt Event) {
	t.Helper()
	if err := m.Apply(evt); err != nil {
		t.Fatalf("apply %s: %v", evt.Kind, err)
	}
}

func TestModelSyncAllBlocksOnLaggingReplica(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1", "r2")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
	if !m.CanServeWrite() {
		t.Fatal("sync_all should serve when all replicas are in sync")
	}
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
	if m.CanServeWrite() {
		t.Fatal("sync_all should block when one required replica lags")
	}
}

func TestModelSyncQuorumSurvivesOneLaggingReplica(t *testing.T) {
	m := New("p1", ModeSyncQuorum, 1, "r1", "r2")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})
	if !m.CanServeWrite() {
		t.Fatal("sync_quorum should still serve with primary + one in-sync replica")
	}
}

func TestModelCatchupFlowRestoresEligibility(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10})
	mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 10, ReservationID: "res-1", ReservationTTL: 100})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp {
		t.Fatalf("expected catching up, got %s", got)
	}
	mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 10, HoldUntil: 20})
	mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 20})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync {
		t.Fatalf("expected in sync, got %s", got)
	}
	if !m.CanServeWrite() {
		t.Fatal("sync_all should serve after replica returns to in-sync")
	}
}

func TestModelLongGapRebuildFlow(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap1", SnapshotCpLSN: 100, ReservationID: "rebuild-1", ReservationTTL: 200})
	mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 130})
	mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 130, HoldUntil: 150})
	mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 150})
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync {
		t.Fatalf("expected in sync after rebuild, got %s", got)
	}
}

func TestModelPrimaryLeaseLostFencesRecovery(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})
	mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "res-2", ReservationTTL: 100})
	mustApply(t, m, Event{Kind: EventPrimaryLeaseLost})
	if m.PrimaryState != PrimaryLost {
		t.Fatalf("expected lost primary, got %s", m.PrimaryState)
	}
	if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging {
		t.Fatalf("expected lagging after fencing, got %s", got)
	}
}

func TestModelPromoteReplicaChangesEpoch(t *testing.T) {
	m := New("p1", ModeSyncQuorum, 1, "r1", "r2")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 10})
	oldEpoch := m.Epoch
	mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"})
	if m.PrimaryID != "r1" {
		t.Fatalf("expected promoted primary r1, got %s", m.PrimaryID)
	}
	if m.Epoch != oldEpoch+1 {
		t.Fatalf("expected epoch increment, got %d want %d", m.Epoch, oldEpoch+1)
	}
}

func TestModelSyncQuorumWithThreeReplicasMixedStates(t *testing.T) {
	m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3")
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1})
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 1})

	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"})

	if !m.CanServeWrite() {
		t.Fatal("sync_quorum should serve with primary + two in-sync replicas out of RF=4")
	}
}
|
|||
func TestModelFailoverFencesMixedReplicaStates(t *testing.T) { |
|||
m := New("p1", ModeSyncQuorum, 10, "r1", "r2", "r3") |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 8}) |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 8}) |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 6}) |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"}) |
|||
mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r2", ReplicaFlushedLSN: 8, TargetLSN: 12, ReservationID: "catch-r2", ReservationTTL: 100}) |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r3"}) |
|||
mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r3"}) |
|||
mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r3", SnapshotID: "snap-x", SnapshotCpLSN: 6, ReservationID: "rebuild-r3", ReservationTTL: 200}) |
|||
|
|||
mustApply(t, m, Event{Kind: EventPrimaryLeaseLost}) |
|||
mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"}) |
|||
|
|||
if m.PrimaryID != "r1" { |
|||
t.Fatalf("expected r1 promoted, got %s", m.PrimaryID) |
|||
} |
|||
if got := m.Replica("r2").FSM.State; got != fsmv2.StateLagging { |
|||
t.Fatalf("expected r2 fenced back to lagging, got %s", got) |
|||
} |
|||
if got := m.Replica("r3").FSM.State; got != fsmv2.StateLagging { |
|||
t.Fatalf("expected r3 fenced back to lagging, got %s", got) |
|||
} |
|||
} |
|||
|
|||
func TestModelRebuildInterruptedByEpochChange(t *testing.T) { |
|||
m := New("p1", ModeBestEffort, 1, "r1") |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) |
|||
mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"}) |
|||
mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-2", SnapshotCpLSN: 100, ReservationID: "rebuild-2", ReservationTTL: 200}) |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateRebuilding { |
|||
t.Fatalf("expected rebuilding, got %s", got) |
|||
} |
|||
|
|||
mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"}) |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging { |
|||
t.Fatalf("expected lagging after epoch change fencing, got %s", got) |
|||
} |
|||
} |
|||
|
|||
func TestModelReservationLostDuringCatchupAfterRebuild(t *testing.T) { |
|||
m := New("p1", ModeBestEffort, 1, "r1") |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) |
|||
mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"}) |
|||
mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-3", SnapshotCpLSN: 50, ReservationID: "rebuild-3", ReservationTTL: 200}) |
|||
mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 80}) |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchUpAfterBuild { |
|||
t.Fatalf("expected catch-up-after-rebuild, got %s", got) |
|||
} |
|||
|
|||
mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"}) |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild { |
|||
t.Fatalf("expected needs rebuild after reservation loss, got %s", got) |
|||
} |
|||
} |
|||
|
|||
func TestModelSyncAllBarrierAcknowledgeTargetLSN(t *testing.T) { |
|||
m := New("p1", ModeSyncAll, 1, "r1", "r2") |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5}) |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5}) |
|||
mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10}) |
|||
|
|||
if m.CanAcknowledgeLSN(10) { |
|||
t.Fatal("sync_all should not acknowledge target LSN before barriers advance replica durability") |
|||
} |
|||
|
|||
mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}) |
|||
if m.CanAcknowledgeLSN(10) { |
|||
t.Fatal("sync_all should still wait for second replica durability") |
|||
} |
|||
|
|||
mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 10}) |
|||
if !m.CanAcknowledgeLSN(10) { |
|||
t.Fatal("sync_all should acknowledge once all required replicas are durable at target LSN") |
|||
} |
|||
} |
|||
|
|||
func TestModelSyncQuorumBarrierAcknowledgeTargetLSN(t *testing.T) { |
|||
m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3") |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5}) |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5}) |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 5}) |
|||
mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 9}) |
|||
|
|||
if m.CanAcknowledgeLSN(9) { |
|||
t.Fatal("sync_quorum should not acknowledge before any replica reaches target durability") |
|||
} |
|||
|
|||
mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 9}) |
|||
if m.CanAcknowledgeLSN(9) { |
|||
t.Fatal("sync_quorum should still wait because RF=4 quorum needs primary + two durable replicas") |
|||
} |
|||
mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 9}) |
|||
if !m.CanAcknowledgeLSN(9) { |
|||
t.Fatal("sync_quorum should acknowledge with primary + two durable replicas in RF=4") |
|||
} |
|||
} |
|||
|
|||
func TestModelWriteAdmissionReasons(t *testing.T) { |
|||
m := New("p1", ModeSyncAll, 1, "r1") |
|||
dec := m.WriteAdmission() |
|||
if dec.Allowed || dec.Reason != "required_replica_not_in_sync" { |
|||
t.Fatalf("unexpected admission before bootstrap: %+v", dec) |
|||
} |
|||
|
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}) |
|||
dec = m.WriteAdmission() |
|||
if !dec.Allowed || dec.Reason != "all_replicas_sync_eligible" { |
|||
t.Fatalf("unexpected admission after bootstrap: %+v", dec) |
|||
} |
|||
} |
|||
|
|||
func TestModelEvaluateReconnectUsesPlanner(t *testing.T) { |
|||
m := New("p1", ModeBestEffort, 1, "r1") |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2}) |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) |
|||
|
|||
decision, err := m.EvaluateReconnect("r1", 2, 8) |
|||
if err != nil { |
|||
t.Fatalf("evaluate reconnect: %v", err) |
|||
} |
|||
if decision.Disposition != RecoveryCatchup { |
|||
t.Fatalf("expected catchup decision, got %+v", decision) |
|||
} |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp { |
|||
t.Fatalf("expected catching up, got %s", got) |
|||
} |
|||
} |
|||
|
|||
func TestModelEvaluateReconnectNeedsRebuildFromPlanner(t *testing.T) { |
|||
m := New("p1", ModeBestEffort, 1, "r1") |
|||
m.Planner = scriptedPlanner{decision: RecoveryDecision{ |
|||
Disposition: RecoveryNeedsRebuild, |
|||
Reason: "payload_not_resolvable", |
|||
}} |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2}) |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) |
|||
|
|||
decision, err := m.EvaluateReconnect("r1", 2, 8) |
|||
if err != nil { |
|||
t.Fatalf("evaluate reconnect: %v", err) |
|||
} |
|||
if decision.Disposition != RecoveryNeedsRebuild || decision.Reason != "payload_not_resolvable" { |
|||
t.Fatalf("unexpected decision: %+v", decision) |
|||
} |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild { |
|||
t.Fatalf("expected needs rebuild, got %s", got) |
|||
} |
|||
} |
|||
|
|||
func TestModelEvaluateReconnectCarriesRecoveryClasses(t *testing.T) { |
|||
m := New("p1", ModeBestEffort, 1, "r1") |
|||
m.Planner = scriptedPlanner{decision: RecoveryDecision{ |
|||
Disposition: RecoveryCatchup, |
|||
ReservationID: "extent-resv", |
|||
ReservationTTL: 42, |
|||
Reason: "extent_payload_resolvable", |
|||
Classes: []RecoveryClass{RecoveryClassExtentReferenced}, |
|||
}} |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 3}) |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) |
|||
|
|||
decision, err := m.EvaluateReconnect("r1", 3, 9) |
|||
if err != nil { |
|||
t.Fatalf("evaluate reconnect: %v", err) |
|||
} |
|||
if len(decision.Classes) != 1 || decision.Classes[0] != RecoveryClassExtentReferenced { |
|||
t.Fatalf("unexpected recovery classes: %+v", decision.Classes) |
|||
} |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp { |
|||
t.Fatalf("expected catching up, got %s", got) |
|||
} |
|||
if got := m.Replica("r1").FSM.RecoveryReservationID; got != "extent-resv" { |
|||
t.Fatalf("expected reservation extent-resv, got %q", got) |
|||
} |
|||
} |
|||
|
|||
func TestModelEvaluateReconnectCanChangeOverTime(t *testing.T) { |
|||
m := New("p1", ModeBestEffort, 1, "r1") |
|||
mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4}) |
|||
mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) |
|||
|
|||
m.Planner = scriptedPlanner{decision: RecoveryDecision{ |
|||
Disposition: RecoveryCatchup, |
|||
ReservationID: "resv-1", |
|||
ReservationTTL: 10, |
|||
Reason: "temporarily_recoverable", |
|||
Classes: []RecoveryClass{RecoveryClassWALInline}, |
|||
}} |
|||
decision, err := m.EvaluateReconnect("r1", 4, 12) |
|||
if err != nil { |
|||
t.Fatalf("first evaluate reconnect: %v", err) |
|||
} |
|||
if decision.Disposition != RecoveryCatchup { |
|||
t.Fatalf("expected catchup on first evaluation, got %+v", decision) |
|||
} |
|||
|
|||
mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"}) |
|||
if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild { |
|||
t.Fatalf("expected needs rebuild after reservation loss, got %s", got) |
|||
} |
|||
|
|||
m.Planner = scriptedPlanner{decision: RecoveryDecision{ |
|||
Disposition: RecoveryNeedsRebuild, |
|||
Reason: "recoverability_expired", |
|||
}} |
|||
decision, err = m.EvaluateReconnect("r1", 4, 12) |
|||
if err != nil { |
|||
t.Fatalf("second evaluate reconnect: %v", err) |
|||
} |
|||
if decision.Reason != "recoverability_expired" { |
|||
t.Fatalf("unexpected second decision: %+v", decision) |
|||
} |
|||
} |
|||
|
|||
func TestRunScenarioProducesStateTrace(t *testing.T) { |
|||
m := New("p1", ModeSyncAll, 1, "r1") |
|||
trace, err := RunScenario(m, []ScenarioStep{ |
|||
{Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}}, |
|||
{Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}}, |
|||
{Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}}, |
|||
}) |
|||
if err != nil { |
|||
t.Fatalf("run scenario: %v", err) |
|||
} |
|||
if len(trace) != 4 { |
|||
t.Fatalf("expected 4 snapshots, got %d", len(trace)) |
|||
} |
|||
last := trace[len(trace)-1] |
|||
if last.HeadLSN != 10 { |
|||
t.Fatalf("expected head 10, got %d", last.HeadLSN) |
|||
} |
|||
if got := last.Replicas["r1"].FlushedLSN; got != 10 { |
|||
t.Fatalf("expected replica flushed 10, got %d", got) |
|||
} |
|||
if !last.AckGate.Allowed { |
|||
t.Fatalf("expected ack gate allowed at final step, got %+v", last.AckGate) |
|||
} |
|||
} |
|||
|
|||
func TestScriptedRecoveryPlannerChangesDecisionOverTime(t *testing.T) {
	m := New("p1", ModeBestEffort, 1, "r1")
	m.Planner = &ScriptedRecoveryPlanner{
		Decisions: []RecoveryDecision{
			{
				Disposition:    RecoveryCatchup,
				ReservationID:  "resv-a",
				ReservationTTL: 10,
				Reason:         "recoverable_now",
				Classes:        []RecoveryClass{RecoveryClassWALInline},
			},
			{
				Disposition: RecoveryNeedsRebuild,
				Reason:      "recoverability_expired",
			},
		},
	}
	mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4})
	mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"})

	first, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("first reconnect: %v", err)
	}
	if first.Disposition != RecoveryCatchup {
		t.Fatalf("unexpected first decision: %+v", first)
	}
	mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"})

	second, err := m.EvaluateReconnect("r1", 4, 12)
	if err != nil {
		t.Fatalf("second reconnect: %v", err)
	}
	if second.Disposition != RecoveryNeedsRebuild || second.Reason != "recoverability_expired" {
		t.Fatalf("unexpected second decision: %+v", second)
	}
}

func TestFormatTraceIncludesReplicaStatesAndGates(t *testing.T) {
	m := New("p1", ModeSyncAll, 1, "r1")
	trace, err := RunScenario(m, []ScenarioStep{
		{Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}},
		{Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}},
		{Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}},
	})
	if err != nil {
		t.Fatalf("run scenario: %v", err)
	}
	got := FormatTrace(trace)
	wantParts := []string{
		"step=bootstrap",
		"write=true:all_replicas_sync_eligible",
		"step=barrier10",
		"ack=true:all_replicas_durable",
		"r1=InSync@10",
	}
	for _, part := range wantParts {
		if !strings.Contains(got, part) {
			t.Fatalf("trace missing %q:\n%s", part, got)
		}
	}
}

@ -0,0 +1,70 @@
package volumefsm

type RecoveryClass string

const (
	RecoveryClassWALInline        RecoveryClass = "wal_inline"
	RecoveryClassExtentReferenced RecoveryClass = "extent_referenced"
)

type RecoveryDisposition string

const (
	RecoveryCatchup      RecoveryDisposition = "catchup"
	RecoveryNeedsRebuild RecoveryDisposition = "needs_rebuild"
)

type RecoveryDecision struct {
	Disposition    RecoveryDisposition
	ReservationID  string
	ReservationTTL uint64
	Reason         string
	Classes        []RecoveryClass
}

type RecoveryPlanner interface {
	PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision
}

// StaticRecoveryPlanner is the minimal default planner for the prototype.
// If targetLSN is strictly greater than flushedLSN, the reconnect is treated
// as catch-up; otherwise rebuild is required.
type StaticRecoveryPlanner struct{}

func (StaticRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
	if targetLSN > flushedLSN {
		return RecoveryDecision{
			Disposition:    RecoveryCatchup,
			ReservationID:  replicaID + "-resv",
			ReservationTTL: 100,
			Reason:         "static_recoverable_window",
			Classes:        []RecoveryClass{RecoveryClassWALInline},
		}
	}
	return RecoveryDecision{
		Disposition: RecoveryNeedsRebuild,
		Reason:      "static_no_recoverable_window",
	}
}

// ScriptedRecoveryPlanner returns pre-seeded reconnect decisions in order.
// Once the scripted list is exhausted, the last decision is reused.
type ScriptedRecoveryPlanner struct {
	Decisions []RecoveryDecision
	index     int
}

func (s *ScriptedRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision {
	if len(s.Decisions) == 0 {
		return RecoveryDecision{
			Disposition: RecoveryNeedsRebuild,
			Reason:      "scripted_no_decision",
		}
	}
	if s.index >= len(s.Decisions) {
		return s.Decisions[len(s.Decisions)-1]
	}
	d := s.Decisions[s.index]
	s.index++
	return d
}
@ -0,0 +1,61 @@
package volumefsm

import (
	"fmt"

	fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
)

type ScenarioStep struct {
	Name  string
	Event Event
}

type ReplicaSnapshot struct {
	State      fsmv2.State
	FlushedLSN uint64
}

type Snapshot struct {
	Step         string
	Epoch        uint64
	PrimaryID    string
	PrimaryState PrimaryState
	HeadLSN      uint64
	WriteGate    AdmissionDecision
	AckGate      AdmissionDecision
	Replicas     map[string]ReplicaSnapshot
}

func (m *Model) Snapshot(step string) Snapshot {
	replicas := make(map[string]ReplicaSnapshot, len(m.Replicas))
	for id, r := range m.Replicas {
		replicas[id] = ReplicaSnapshot{
			State:      r.FSM.State,
			FlushedLSN: r.FSM.ReplicaFlushedLSN,
		}
	}
	return Snapshot{
		Step:         step,
		Epoch:        m.Epoch,
		PrimaryID:    m.PrimaryID,
		PrimaryState: m.PrimaryState,
		HeadLSN:      m.HeadLSN,
		WriteGate:    m.WriteAdmission(),
		AckGate:      m.AckAdmission(m.HeadLSN),
		Replicas:     replicas,
	}
}

func RunScenario(m *Model, steps []ScenarioStep) ([]Snapshot, error) {
	trace := make([]Snapshot, 0, len(steps)+1)
	trace = append(trace, m.Snapshot("initial"))
	for _, step := range steps {
		if err := m.Apply(step.Event); err != nil {
			return trace, fmt.Errorf("scenario step %q: %w", step.Name, err)
		}
		trace = append(trace, m.Snapshot(step.Name))
	}
	return trace, nil
}
@ -0,0 +1,17 @@
# V2 Test Reference

This directory holds V2-facing test reference material copied from the project test database.

Files:

- `test_db.md`
  - copied from `learn/projects/sw-block/test/test_db.md`
  - full block-service test inventory
- `v2_selected.md`
  - V2-focused working subset
  - includes the currently selected simulator-relevant cases and the 4 Phase 13 V2-boundary tests

Use:

- `learn/projects/sw-block/test/test_db.md` as the project-wide source inventory
- `sw-block/test/v2_selected.md` as the active V2 reference/worklist
1675 sw-block/test/test_db.md
File diff suppressed because it is too large
@ -0,0 +1,105 @@
# V2 Test Database

Date: 2026-03-27
Status: working subset

## Purpose

This is the V2-focused review subset derived from:

- `sw-block/test/test_db.md`
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`

Use this file to review and track the tests that most directly help:

- V2 protocol design
- simulator coverage
- V1 / V1.5 / V2 comparison
- V2 acceptance boundaries

This is intentionally much smaller than the full `test_db.md`.

## Review Codes

### Status

- `picked`
- `reviewed`
- `mapped`

### Sim

- `sim_core`
- `sim_reduced`
- `real_only`
- `v2_boundary`
- `sim_not_needed_yet`

## V2 Boundary Tests

| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | V1/V1.5 sender identity loss; should become V2 acceptance case |
| 2 | `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | `NeedsRebuild` must remain sticky under stable per-replica sender identity |
| 3 | `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Catch-up correctness depends on identity continuity and proper recovery ownership |
| 4 | `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Multiple reconnect cycles are a V2 sender-loop / recovery-session acceptance target |

## Core Protocol Tests

| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `TestRecovery` | `recovery_test.go` | | `unit` | `picked` | `sim_core` | Crash recovery correctness is fundamental to block protocol reasoning |
| 2 | `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Durable-progress truth; barrier must count flushed progress, not send progress |
| 3 | `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Progress monotonicity invariant |
| 4 | `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Only eligible replica states count for strict durability |
| 5 | `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Epoch fencing on barrier path |
| 6 | `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | `sync_all` strictness during degraded state |
| 7 | `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | End-to-end strict replication contract |
| 8 | `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Contrasts best_effort vs strict modes |
| 9 | `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | No false durability from degraded shipper |
| 10 | `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | | `unit` | `picked` | `sim_core` | Availability semantics under strict mode |
| 11 | `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | Recoverability after degraded shipper |
| 12 | `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Short-gap catch-up |
| 13 | `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Catch-up vs rebuild boundary |
| 14 | `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery fencing during catch-up |
| 15 | `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery data correctness |
| 16 | `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Replay idempotence |
| 17 | `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | State-machine correctness during recovery |
| 18 | `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild lifecycle closure |
| 19 | `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild fencing |
| 20 | `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Safe rebuild failure behavior |
| 21 | `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention rule under lag |
| 22 | `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention timeout boundary |
| 23 | `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention budget boundary |
| 24 | `TestComponent_FailoverPromote` | `component_test.go` | | `component` | `picked` | `sim_core` | Core failover baseline |
| 25 | `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Strict-mode failover |
| 26 | `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Restart/rejoin lifecycle |
| 27 | `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Promotion safety and stale candidate rejection |
| 28 | `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Multi-promotion lineage |
| 29 | `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | | `unit` | `picked` | `sim_core` | Mode normalization |
| 30 | `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Best-effort contract |
| 31 | `CP13-8 T4a: sync_all blocks during outage` | `manual` | | `integration` | `picked` | `sim_core` | Strict outage semantics |

## Reduced / Supporting Tests

| # | Test Name | File | Line | Level | Status | Sim | Notes |
|---|---|---|---|---|---|---|---|
| 1 | `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Advisory WAL-head recovery shape |
| 2 | `testRecoverNoSuperblockPersist` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Recovery despite optimized persist behavior |
| 3 | `TestQAGroupCommitter` | `blockvol_qa_test.go` | | `unit` | `picked` | `sim_reduced` | Commit batching semantics |
| 4 | `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | | `unit` | `picked` | `sim_reduced` | Backpressure behavior |
| 5 | `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_reduced` | Idempotent flush shape |
| 6 | `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Progress initialization after rebuild |
| 7 | `TestComponent_ManualPromote` | `component_test.go` | | `component` | `picked` | `sim_reduced` | Manual control-path shape |
| 8 | `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Heartbeat observability |
| 9 | `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Control-plane visibility |
| 10 | `TestComponent_ExpandThenFailover` | `component_test.go` | | `component` | `picked` | `sim_reduced` | State continuity across operations |
| 11 | `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_reduced` | Default mode behavior |
| 12 | `CP13-8 T4b: recovery after restart` | `manual` | | `integration` | `picked` | `sim_reduced` | Recovery-time shape and control-plane/local-reconnect interaction |

## Notes

- This file is the actionable V2 subset, not the master inventory.
- If `tester` later finalizes a broader 70-case picked set, expand this file from that selection.
- The 4 V2-boundary tests must remain present even if they fail on V1/V1.5.
@ -0,0 +1,115 @@
# V2-Selected Test Worklist

Date: 2026-03-27
Status: working

## Purpose

This is the V2-facing subset of the larger block-service test database.

Sources:

- `sw-block/test/test_db.md`
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`

This file is for:

- tests that should help V2 design and simulator work
- explicit inclusion of the 4 Phase 13 V2-boundary failures
- a working set that `tester`, `sw`, and design can refine further

## Current Inclusion Rule

Include tests that are:

- `sim_core`
- `sim_reduced`
- `v2_boundary`

Prefer tests that directly inform:

- barriers and durability truth
- catch-up vs rebuild
- failover / promotion
- WAL retention / tail-chasing
- mode semantics
- endpoint / identity / reassignment behavior

## Phase 13 V2-Boundary Tests

These must stay visible in the V2 worklist:

| Test | File | Why It Matters To V2 |
|---|---|---|
| `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | Sender identity and reconnect ownership |
| `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | `NeedsRebuild` must remain sticky and identity-safe |
| `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | Catch-up must preserve data correctness under identity continuity |
| `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | Multiple reconnect cycles require stable per-replica sender ownership |

## High-Value V2 Working Set

This is the current distilled working set from `phase13_test.md`.

| Test | File | Current Result | Mapping | Why It Helps V2 |
|---|---|---|---|---|
| `TestRecovery` | `recovery_test.go` | PASS | `sim_core` | Crash recovery correctness |
| `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | PASS | `sim_core` | Barrier truth / durable progress |
| `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Monotonic progress invariant |
| `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-gated strict durability |
| `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | Epoch fencing |
| `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | PASS | `sim_core` | `sync_all` strictness during outage |
| `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | PASS | `sim_core` | End-to-end strict replication |
| `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | PASS | `sim_core` | Mode difference vs strict sync |
| `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | PASS | `sim_core` | No false durability |
| `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | PASS | `sim_core` | Availability semantics |
| `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | PASS | `sim_core` | Recoverability after degraded shipper |
| `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | PASS | `sim_core` | Short-gap catch-up |
| `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Catch-up vs rebuild boundary |
| `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery fencing |
| `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery data correctness |
| `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | PASS | `sim_core` | Replay idempotence |
| `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-machine correctness |
| `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild lifecycle |
| `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild fencing |
| `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | PASS | `sim_core` | No partial/unsafe rebuild success |
| `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention rule |
| `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention timeout boundary |
| `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention budget boundary |
| `TestComponent_FailoverPromote` | `component_test.go` | PASS | `sim_core` | Failover baseline |
| `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | PASS | `sim_core` | Strict-mode failover |
| `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | PASS | `sim_core` | Restart/rejoin lifecycle |
| `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Promotion safety |
| `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Multi-promotion lineage |
| `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | PASS | `sim_core` | Mode normalization |
| `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | PASS | `sim_core` | Best-effort contract |
| `CP13-8 T4a: sync_all blocks during outage` | `manual` | PASS | `sim_core` | Strict outage semantics |
| `CP13-8 T4b: recovery after restart` | `manual` | PASS | `sim_reduced` | Recovery-time shape |

## Reduced / Supporting Cases To Keep In View

| Test | File | Current Result | Mapping | Why It Helps V2 |
|---|---|---|---|---|
| `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | PASS | `sim_reduced` | Advisory WAL-head recovery shape |
| `testRecoverNoSuperblockPersist` | `recovery_test.go` | PASS | `sim_reduced` | Recoverability despite optimized persist behavior |
| `TestQAGroupCommitter` | `blockvol_qa_test.go` | PASS | `sim_reduced` | Commit batching semantics |
| `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | PASS | `sim_reduced` | Backpressure behavior |
| `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | PASS | `sim_reduced` | Idempotent flush shape |
| `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Progress initialization |
| `TestComponent_ManualPromote` | `component_test.go` | PASS | `sim_reduced` | Manual control-path shape |
| `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Heartbeat observability |
| `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Control-plane visibility |
| `TestComponent_ExpandThenFailover` | `component_test.go` | PASS | `sim_reduced` | Cross-operation state continuity |
| `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | PASS | `sim_reduced` | Default mode behavior |

## Working Note

`phase13_test.md` currently contains the mapped subset from the real test inventory.

This V2 copy is intentionally narrower:

- preserve the core tests that define the protocol story
- preserve the 4 V2-boundary tests explicitly
- keep a smaller reduced set for supporting invariants

If `tester` finalizes a broader 70-case working set, extend this file rather than editing the full copied database directly.