diff --git a/sw-block/.gocache_v2/README b/sw-block/.gocache_v2/README new file mode 100644 index 000000000..eeaef1c73 --- /dev/null +++ b/sw-block/.gocache_v2/README @@ -0,0 +1,4 @@ +This directory holds cached build artifacts from the Go build system. +Run "go clean -cache" if the directory is getting too large. +Run "go clean -fuzzcache" to delete the fuzz cache. +See go.dev to learn more about Go. diff --git a/sw-block/.gocache_v2/trim.txt b/sw-block/.gocache_v2/trim.txt new file mode 100644 index 000000000..0808f4667 --- /dev/null +++ b/sw-block/.gocache_v2/trim.txt @@ -0,0 +1 @@ +1774577367 \ No newline at end of file diff --git a/sw-block/.private/README.md b/sw-block/.private/README.md new file mode 100644 index 000000000..68b4db58b --- /dev/null +++ b/sw-block/.private/README.md @@ -0,0 +1,27 @@ +# .private + +Private working area for `sw-block`. + +Use this for: +- phase development notes +- roadmap/progress tracking +- draft handoff notes +- temporary design comparisons +- prototype scratch work not ready for `design/` or `prototype/` + +Recommended layout: +- `.private/phase/`: phase-by-phase development notes +- `.private/roadmap/`: short-term and medium-term execution notes +- `.private/handoff/`: notes for `sw`, `qa`, or future sessions + +Phase protocol: +- each phase should normally have: + - `phase-xx.md` + - `phase-xx-log.md` + - `phase-xx-decisions.md` +- details are defined in `.private/phase/README.md` + +Promotion rules: +- stable vision/design docs go to `../design/` +- real prototype code stays in `../prototype/` +- `.private/` is for working material, not source of truth diff --git a/sw-block/.private/phase/README.md b/sw-block/.private/phase/README.md new file mode 100644 index 000000000..d54b4beaa --- /dev/null +++ b/sw-block/.private/phase/README.md @@ -0,0 +1,36 @@ +# Phase Dev + +Use this directory for private phase development notes. 
+ +## Phase Protocol + +Each phase should use this file set: + +- `phase-01.md` + - plan + - scope + - progress + - active tasks + - exit criteria +- `phase-01-log.md` + - dated development log + - experiments + - test runs + - failures and findings +- `phase-01-decisions.md` + - key algorithm decisions + - tradeoffs + - rejected alternatives + +Suggested naming pattern: +- `phase-01.md` +- `phase-01-log.md` +- `phase-01-decisions.md` +- `phase-02.md` +- `phase-02-log.md` +- `phase-02-decisions.md` + +Rule of use: +1. if it is what we are doing -> `phase-xx.md` +2. if it is what happened -> `phase-xx-log.md` +3. if it is why we chose something -> `phase-xx-decisions.md` diff --git a/sw-block/.private/phase/phase-01-decisions.md b/sw-block/.private/phase/phase-01-decisions.md new file mode 100644 index 000000000..7208f7b27 --- /dev/null +++ b/sw-block/.private/phase/phase-01-decisions.md @@ -0,0 +1,97 @@ +# Phase 01 Decisions + +Date: 2026-03-26 +Status: active + +## Purpose + +Capture the key design decisions made during Phase 01 simulator work. + +## Initial Decisions + +### 1. `design/` vs `.private/phase/` + +Decision: +- `sw-block/design/` holds shared design truth +- `sw-block/.private/phase/` holds execution planning and progress + +Reason: +- design backlog and execution checklist should not be mixed + +### 2. Scenario source of truth + +Decision: +- `sw-block/design/v2_scenarios.md` is the scenario backlog and coverage matrix + +Reason: +- all contributors need one visible scenario list + +### 3. Phase 01 priority + +Decision: +- first close: + - `S19` + - `S20` + +Reason: +- they are the biggest remaining distributed lineage/partition scenarios + +### 4. Current simulator scope + +Decision: +- use the simulator as a V2 design-validation tool, not a product/perf harness + +Reason: +- current goal is correctness and protocol coverage, not productization + +### 5. 
Phase execution format + +Decision: +- keep phase execution in three files: + - `phase-xx.md` + - `phase-xx-log.md` + - `phase-xx-decisions.md` + +Reason: +- separates plan, evidence, and reasoning +- reduces drift between roadmap and findings + +### 6. Design backlog vs execution plan + +Decision: +- `sw-block/design/v2_scenarios.md` remains the source of truth for scenario backlog and coverage +- `.private/phase/phase-01.md` is the execution layer for `sw` + +Reason: +- design truth should be stable and shareable +- execution tasks should be easier to edit without polluting design docs + +### 7. Immediate Phase 01 priorities + +Decision: +- prioritize: + - `S19` chain of custody across multiple promotions + - `S20` live partition with competing writes + +Reason: +- these are the biggest remaining distributed-lineage gaps after current simulator milestone + +### 8. Coverage status should be conservative + +Decision: +- mark scenarios as `partial` unless the test actually exercises the core protocol obligation, not just a simplified happy path + +Reason: +- avoids overstating simulator coverage +- keeps the backlog honest for follow-up strengthening + +### 9. 
Protocol-version comparison belongs in the simulator + +Decision: +- compare `V1`, `V1.5`, and `V2` using the same scenario set where possible + +Reason: +- this is the clearest way to show: + - where V1 breaks + - where V1.5 improves but still strains + - why V2 is architecturally cleaner diff --git a/sw-block/.private/phase/phase-01-log.md b/sw-block/.private/phase/phase-01-log.md new file mode 100644 index 000000000..08a61e13a --- /dev/null +++ b/sw-block/.private/phase/phase-01-log.md @@ -0,0 +1,67 @@ +# Phase 01 Log + +Date: 2026-03-26 +Status: active + +## Log Protocol + +Use dated entries like: + +## 2026-03-26 +- work completed +- tests run +- failures found +- seeds/traces worth keeping +- follow-up items + +## Initial State + +- Phase 01 created from the earlier `phase-01-v2-scenarios.md` working note +- scenario source of truth remains: + - `sw-block/design/v2_scenarios.md` +- current active asks for `sw`: + - `S19` + - `S20` + +## 2026-03-26 + +- created Phase 01 file set: + - `phase-01.md` + - `phase-01-log.md` + - `phase-01-decisions.md` +- promoted scenario execution checklist into `phase-01.md` +- kept `sw-block/design/v2_scenarios.md` as the shared backlog and coverage matrix +- current simulator milestone: + - `fsmv2` passing + - `volumefsm` passing + - `distsim` passing + - randomized `distsim` seeds passing + - event/interleaving simulator work present in `sw-block/prototype/distsim/simulator.go` +- current immediate development priority for `sw`: + - implement `S19` + - implement `S20` +- `sw` added Phase 01 P0/P1 scenario tests in `distsim`: + - `S19` + - `S20` + - `S5` + - `S6` + - `S18` + - stronger `S12` +- review result: + - `S19` looks solid + - stronger `S12` now looks solid + - `S20`, `S5`, `S6`, `S18` are better classified as `partial` than fully closed +- updated `v2_scenarios.md` coverage matrix to reflect actual status +- next development focus: + - P2 scenarios + - stronger versions of current partial scenarios +- added 
protocol-version comparison design: + - `sw-block/design/protocol-version-simulation.md` +- added minimal protocol policy prototype in `distsim`: + - `ProtocolV1` + - `ProtocolV15` + - `ProtocolV2` + - focused on: + - catch-up policy + - tail-chasing outcome policy + - restart/rejoin policy diff --git a/sw-block/.private/phase/phase-01-v2-scenarios.md b/sw-block/.private/phase/phase-01-v2-scenarios.md new file mode 100644 index 000000000..2e84175cf --- /dev/null +++ b/sw-block/.private/phase/phase-01-v2-scenarios.md @@ -0,0 +1,11 @@ +# Deprecated + +This file is deprecated. + +Use instead: +- `phase-01.md` +- `phase-01-log.md` +- `phase-01-decisions.md` + +The scenario source of truth remains: +- `sw-block/design/v2_scenarios.md` diff --git a/sw-block/.private/phase/phase-01.md b/sw-block/.private/phase/phase-01.md new file mode 100644 index 000000000..388f21f90 --- /dev/null +++ b/sw-block/.private/phase/phase-01.md @@ -0,0 +1,164 @@ +# Phase 01 + +Date: 2026-03-26 +Status: completed +Purpose: drive V2 simulator development by closing the scenario backlog in `sw-block/design/v2_scenarios.md` + +## Goal + +Make the V2 simulator cover the important protocol scenarios as explicitly as possible. + +This phase is about: +- simulator fidelity +- scenario coverage +- invariant quality + +This phase is not about: +- product integration +- SPDK +- raw allocator +- production transport + +## Source Of Truth + +Design/source-of-truth: +- `sw-block/design/v2_scenarios.md` + +Prototype code: +- `sw-block/prototype/fsmv2/` +- `sw-block/prototype/volumefsm/` +- `sw-block/prototype/distsim/` + +## Assigned Tasks For `sw` + +### P0 + +1. `S19` chain of custody across multiple promotions +- add fixed test(s) +- verify committed data from `A -> B -> C` +- update coverage matrix + +2. `S20` live partition with competing writes +- add fixed test(s) +- stale side must not advance committed lineage +- update coverage matrix + +### P1 + +3. 
`S5` flapping replica stays recoverable +- repeated disconnect/reconnect +- no unnecessary rebuild while recovery remains possible + +4. `S6` tail-chasing under load +- primary keeps writing while replica catches up +- explicit outcome: + - converge and promote + - or abort to rebuild + +5. `S18` primary restart without failover +- same-lineage restart behavior +- no stale session assumptions + +6. stronger `S12` +- more than one promotion candidate +- choose valid lineage, not merely highest apparent LSN + +### P2 + +7. protocol-version comparison support +- model: + - `V1` + - `V1.5` + - `V2` +- use the same scenario set to show: + - V1 breaks + - V1.5 improves but still strains + - V2 handles recovery more explicitly + +8. richer Smart WAL scenarios +- time-varying `ExtentReferenced` availability +- recoverable then unrecoverable transitions + +9. delayed/drop network scenarios beyond simple disconnect + +10. multi-node reservation expiry / rebuild timeout cases + +## Invariants To Preserve + +After every scenario or random run, preserve: + +1. committed data is durable per policy +2. uncommitted data is not revived as committed +3. stale epoch traffic does not mutate current lineage +4. recovered/promoted node matches reference state at target `LSN` +5. committed prefix remains contiguous + +## Required Updates Per Task + +For each completed scenario: + +1. add or update test(s) +2. update `sw-block/design/v2_scenarios.md` + - package + - test name + - status +3. 
note any missing simulator capability + +## Current Progress + +Already in place before this phase: +- `fsmv2` local FSM prototype +- `volumefsm` orchestrator prototype +- `distsim` distributed simulator +- randomized `distsim` runs +- first event/interleaving simulator work in `distsim/simulator.go` + +Open focus: +- `S19` covered in `distsim` +- `S20` partially covered in `distsim` +- `S5` partially covered in `distsim` +- `S6` partially covered in `distsim` +- `S18` partially covered in `distsim` +- stronger `S12` covered in `distsim` +- protocol-version comparison design added in: + - `sw-block/design/protocol-version-simulation.md` +- remaining focus is now P2 plus stronger versions of partial scenarios + +## Phase Status + +### P0 + +- `S19` chain of custody across multiple promotions: done +- `S20` live partition with competing writes: partial + +### P1 + +- `S5` flapping replica stays recoverable: partial +- `S6` tail-chasing under load: partial +- `S18` primary restart without failover: partial +- stronger `S12`: done + +### P2 + +- active next step: + - protocol-version comparison support + - stronger versions of current partial scenarios + +## Exit Criteria + +Phase 01 is done when: + +1. `S19` and `S20` are covered +2. `S5`, `S6`, `S18`, and stronger `S12` are at least partially covered +3. coverage matrix in `v2_scenarios.md` is current +4. 
random simulation still passes after added scenarios + +## Completion Note + +Phase 01 completed with: +- `S19` covered +- stronger `S12` covered +- `S20`, `S5`, `S6`, `S18` strengthened but correctly left as `partial` + +Next execution phase: +- `sw-block/.private/phase/phase-02.md` diff --git a/sw-block/.private/phase/phase-02-decisions.md b/sw-block/.private/phase/phase-02-decisions.md new file mode 100644 index 000000000..2d02206b5 --- /dev/null +++ b/sw-block/.private/phase/phase-02-decisions.md @@ -0,0 +1,51 @@ +# Phase 02 Decisions + +Date: 2026-03-26 +Status: active + +## Decision 1: Extend `distsim` Instead Of Forking A New Protocol Simulator + +Reason: +- current `distsim` already has: + - node/storage model + - coordinator/epoch model + - reference oracle + - randomized runs +- the missing layer is protocol-state fidelity, not a new simulation foundation + +Implication: +- add lightweight per-node replication state and protocol decisions to `distsim` +- do not build a separate fourth simulator yet + +## Decision 2: Keep Coverage Status Conservative + +Reason: +- `S20`, `S6`, and `S18` currently prove important safety properties +- but they do not yet fully assert message-level or explicit state-transition behavior + +Implication: +- leave them `partial` until the model can assert protocol behavior directly + +## Decision 3: Use Versioned Scenario Comparison To Justify V2 + +Reason: +- the simulator should not only say "V2 works" +- it should show: + - where `V1` fails + - where `V1.5` improves but still strains + - why `V2` is worth the complexity + +Implication: +- Phase 02 includes explicit `V1` / `V1.5` / `V2` scenario comparison work + +## Decision 4: V2 Must Not Be Described As "Always Catch-Up" + +Reason: +- that wording is too optimistic and hides the real V2 design rule +- V2 is better because it makes recoverability explicit, not because it retries forever + +Implication: +- describe V2 as: + - catch-up if explicitly recoverable + - otherwise 
explicit rebuild +- keep this wording consistent in tests and docs diff --git a/sw-block/.private/phase/phase-02-log.md b/sw-block/.private/phase/phase-02-log.md new file mode 100644 index 000000000..5fbba1e53 --- /dev/null +++ b/sw-block/.private/phase/phase-02-log.md @@ -0,0 +1,93 @@ +# Phase 02 Log + +Date: 2026-03-26 +Status: active + +## 2026-03-26 + +- Phase 02 created to move `distsim` from final-state safety validation toward explicit protocol-state simulation. +- Initial focus: + - close `S20`, `S6`, and `S18` at protocol level + - compare `V1`, `V1.5`, and `V2` on the same scenarios +- Known model gap at phase start: + - current `distsim` is strong at final-state safety invariants + - current `distsim` is weaker at mid-flow protocol assertions and message-level rejection reasons +- Phase 02 progress now in place: + - delivery accept/reject tracking + - protocol-level stale-epoch rejection assertions + - explicit non-convergent catch-up state transition assertions + - initial version-comparison tests for disconnect, tail-chasing, and restart/rejoin policy +- Next simulator target: + - reproduce real `V1.5` address-instability and control-plane-recovery failures as named scenarios +- Immediate coding asks for `sw`: + - changed-address restart failure in `V1.5` + - same-address transient outage comparison across `V1` / `V1.5` / `V2` + - slow control-plane reassignment scenario derived from `CP13-8 T4b` +- Local housekeeping done: + - corrected V2 wording from "always catch-up" to "catch-up if explicitly recoverable; otherwise rebuild" + - added explicit brief-disconnect and changed-address restart policy helpers + - verified `distsim` test suite still passes with the Windows-safe runner +- Scenario status update: + - `S20` now covered via protocol-level stale-traffic rejection + committed-prefix stability + - `S6` now covered via explicit `CatchingUp -> NeedsRebuild` assertions + - `S18` now covered via explicit stale `MsgBarrierAck` rejection + prefix 
stability +- Next asks for `sw` after this closure: + - changed-address restart scenario tied directly to `CP13-8 T4b` + - same-address transient outage comparison across `V1` / `V1.5` / `V2` + - slow control-plane reassignment scenario + - Smart WAL recoverable -> unrecoverable transition scenarios +- Additional closure completed: + - `S5` now covered with both: + - repeated recoverable flapping + - budget-exceeded escalation to `NeedsRebuild` + - Smart WAL transitions now exercised with: + - recoverable -> unrecoverable during active recovery + - mixed `WALInline` + `ExtentReferenced` success + - time-varying payload availability +- Updated next asks for `sw`: + - changed-address restart scenario tied directly to `CP13-8 T4b` + - same-address transient outage comparison across `V1` / `V1.5` / `V2` + - slow control-plane reassignment scenario + - delayed/drop network beyond simple disconnect + - multi-node reservation expiry / rebuild timeout cases +- Additional Phase 02 coverage delivered: + - delayed stale messages after promote/failover + - delayed stale barrier ack rejection + - selective write-drop with barrier delivery under `sync_all` + - multi-node mixed reservation expiry outcome + - multi-node `NeedsRebuild` / snapshot rebuild recovery + - partial rebuild timeout / retry completion +- Remaining asks are now narrower: + - changed-address restart scenario tied directly to `CP13-8 T4b` + - same-address transient outage comparison across `V1` / `V1.5` / `V2` + - slow control-plane reassignment scenario + - stronger coordinator candidate-selection scenarios +- Additional closure after review: + - safe default promotion selector now refuses `NeedsRebuild` candidates + - explicit desperate-promotion API separated from safe selection + - changed-address and slow-control-plane comparison tests now prove actual data divergence / healing, not only policy shape +- New next-step assignment: + - strengthen model depth around endpoint identity and control-plane 
reassignment + - replace abstract repair helpers with more explicit event flow where practical + - reduce direct recovery state injection in comparison tests + - extend candidate selection from ranking into validity rules + +## 2026-03-27 + +- Phase 02 core simulator hardening is effectively complete. +- Delivered since the previous checkpoint: + - endpoint identity / endpoint-version modeling + - stale-endpoint rejection in delivery path + - heartbeat -> coordinator detect -> assignment-update control-plane flow + - recovery-session trigger API for `V1.5` and `V2` + - explicit candidate eligibility checks: + - running + - epoch alignment + - state eligibility + - committed-prefix sufficiency + - safe default promotion now rejects candidates without the committed prefix +- Current `distsim` status at latest review: + - 73 tests passing +- Manager bookkeeping decision: + - keep Phase 02 active only for doc maintenance / wrap-up + - treat further simulator depth as likely Phase 03 work, not unbounded Phase 02 scope creep diff --git a/sw-block/.private/phase/phase-02.md b/sw-block/.private/phase/phase-02.md new file mode 100644 index 000000000..dd4f4cdf6 --- /dev/null +++ b/sw-block/.private/phase/phase-02.md @@ -0,0 +1,191 @@ +# Phase 02 + +Date: 2026-03-27 +Status: active +Purpose: extend the V2 simulator from final-state safety checking into protocol-state simulation that can reproduce `V1`, `V1.5`, and `V2` behavior on the same scenarios + +## Goal + +Make the simulator model enough node-local replication state and message-level behavior to: + +1. reproduce `V1` / `V1.5` failure modes +2. show why those failures are structural +3. 
close the current `partial` V2 scenarios with stronger protocol assertions + +This phase is about: +- protocol-version comparison +- per-node replication state +- message-level fencing / accept / reject behavior +- explicit catch-up abort / rebuild transitions + +This phase is not about: +- product integration +- production transport +- SPDK +- raw allocator + +## Source Of Truth + +Design/source-of-truth: +- `sw-block/design/v2_scenarios.md` +- `sw-block/design/protocol-version-simulation.md` +- `sw-block/design/v1-v15-v2-simulator-goals.md` + +Prototype code: +- `sw-block/prototype/distsim/` + +## Assigned Tasks For `sw` + +### P0 + +1. Add per-node replication state to `distsim` +- minimum states: + - `InSync` + - `Lagging` + - `CatchingUp` + - `NeedsRebuild` + - `Rebuilding` +- keep state lightweight; do not clone full `fsmv2` into `distsim` + +2. Add message-level protocol decisions +- stale-epoch write / ship / barrier traffic must be explicitly rejected +- record whether a message was: + - accepted + - rejected by epoch + - rejected by state + +3. Add explicit catch-up abort / rebuild entry +- non-convergent catch-up must move to explicit modeled failure: + - `NeedsRebuild` + - or equivalent abort outcome + +### P1 + +4. Re-close `S20` at protocol level +- stale-side writes must go through protocol delivery path +- prove stale-side traffic cannot advance committed lineage + +5. Re-close `S6` at protocol level +- assert explicit abort/escalation on non-convergence +- not only final-state safety + +6. Re-close `S18` at protocol level +- assert committed-prefix behavior around delayed old ack / restart races +- not only final-state oracle checks + +### P2 + +7. Expand protocol-version comparison +- run selected scenarios under: + - `V1` + - `V1.5` + - `V2` +- at minimum: + - brief disconnect + - restart with changed address + - tail-chasing + +8. 
Add V1.5-derived failure scenarios +- replica restart with changed receiver address +- same-address transient outage +- slow control-plane recovery vs fast local reconnect + +9. Prepare richer recovery modeling +- time-varying recoverability +- reservation loss during active catch-up +- rebuild timeout / retry in mixed-state cluster + +## Invariants To Preserve + +After every scenario or random run, preserve: + +1. committed data is durable per policy +2. uncommitted data is not revived as committed +3. stale epoch traffic does not mutate current lineage +4. recovered/promoted node matches reference state at target `LSN` +5. committed prefix remains contiguous +6. protocol-state transitions are explicit, not inferred from final data only + +## Required Updates Per Task + +For each completed task: + +1. add or update test(s) +2. update `sw-block/design/v2_scenarios.md` + - package + - test name + - status + - source if new scenario was derived from V1/V1.5 behavior +3. add a short note to: + - `sw-block/.private/phase/phase-02-log.md` +4. 
if a design choice changed, record it in: + - `sw-block/.private/phase/phase-02-decisions.md` + +## Current Progress + +Already in place before this phase: +- `distsim` final-state safety invariants +- randomized simulation +- event/interleaving simulator work +- initial `ProtocolVersion` / policy scaffold +- `S19` covered +- stronger `S12` covered + +Known partials to close in this phase: +- none in the current named backlog slice + +Delivered in this phase so far: +- delivery accept/reject tracking added +- protocol-level rejection assertions added +- explicit `CatchingUp -> NeedsRebuild` state transition tested +- selected protocol-version comparison tests added +- `S20`, `S6`, and `S18` moved from `partial` to `covered` +- Smart WAL transition scenarios added +- `S5` moved from `partial` to `covered` +- endpoint identity / endpoint-version modeling added +- explicit heartbeat -> detect -> assignment-update control-plane flow added for changed-address restart +- explicit recovery-session triggers added for `V1.5` and `V2` +- promotion selection now uses explicit eligibility, including committed-prefix gating +- safe and desperate promotion paths are separated +- full `distsim` suite at latest review: 73 tests passing + +Remaining focus for `sw`: +- Phase 02 core scope is now largely delivered +- remaining work should be treated as future-strengthening, not baseline closure +- if more simulator depth is needed next, it should likely start as Phase 03: + - timeout semantics + - timer races + - richer event/interleaving behavior + - stronger endpoint/control-plane realism beyond the current abstract model + +## Immediate Next Tasks For `sw` + +1. Add a documented compare artifact for new scenarios +- for each new `V1` / `V1.5` / `V2` comparison: + - record scenario name + - what fails in `V1` + - what improves in `V1.5` + - what is explicit in `V2` +- keep `sw-block/design/v1-v15-v2-comparison.md` updated + +2. 
Keep the coverage matrix honest +- do not mark a scenario `covered` unless the test asserts protocol behavior directly +- final-state oracle checks alone are not enough + +3. Prepare Phase 03 proposal instead of broadening ad hoc +- if more depth is needed, define it cleanly first: + - timers / timeout events + - event ordering races + - richer endpoint lifecycle + - recovery-session uniqueness across competing triggers + +## Exit Criteria + +Phase 02 is done when: + +1. `S5`, `S6`, `S18`, and `S20` are covered at protocol level +2. `distsim` can reproduce at least one `V1` failure, one `V1.5` failure, and the corresponding `V2` behavior on the same named scenario +3. protocol-level rejection/accept behavior is asserted in tests, not only inferred from final-state oracle checks +4. coverage matrix in `v2_scenarios.md` is current +5. changed-address and reconnect scenarios are modeled through explicit endpoint / control-plane behavior rather than helper-only abstraction +6. promotion selection uses explicit eligibility, including committed-prefix safety diff --git a/sw-block/.private/phase/phase-03-decisions.md b/sw-block/.private/phase/phase-03-decisions.md new file mode 100644 index 000000000..6a4c858cf --- /dev/null +++ b/sw-block/.private/phase/phase-03-decisions.md @@ -0,0 +1,97 @@ +# Phase 03 Decisions + +Date: 2026-03-27 +Status: initial + +## Why Phase 03 Exists + +Phase 02 already covered the main protocol-state story: + +- V1 / V1.5 / V2 comparison +- stale traffic rejection +- catch-up vs rebuild +- changed-address restart control-plane flow +- committed-prefix-safe promotion eligibility + +The next simulator problems are different: + +- timer semantics +- timeout races +- event ordering under contention + +That deserves a separate phase so the model boundary stays clear. 
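The timer-semantics problems this phase targets can be illustrated with a minimal sketch. This is a hypothetical model, not the actual `distsim`/`eventsim` code (type and field names are illustrative): within one tick, message delivery is applied before timeout deadlines are evaluated, and a timeout that reaches its deadline without authority is recorded under `IgnoredTimeouts` rather than `FiredTimeouts`.

```go
package main

import "fmt"

// Timeout is a hypothetical scheduled deadline; Canceled marks that a
// same-tick delivery removed its authority before it could fire.
type Timeout struct {
	ID       string
	Deadline int
	Canceled bool
}

// Sim is a minimal tick-based scheduler sketch.
type Sim struct {
	Tick            int
	pendingAcks     map[int][]string // tick -> timeout IDs acked at that tick
	timeouts        []*Timeout
	FiredTimeouts   []string // had authority and changed the model
	IgnoredTimeouts []string // reached deadline but were stale and ignored
}

// Step advances one tick, applying the "data before timers" rule:
// deliveries are evaluated first, then timeout deadlines.
func (s *Sim) Step() {
	s.Tick++
	// 1. Deliver data/messages first: an ack cancels its timeout.
	for _, id := range s.pendingAcks[s.Tick] {
		for _, t := range s.timeouts {
			if t.ID == id {
				t.Canceled = true
			}
		}
	}
	// 2. Only then evaluate timers: canceled timers are ignored, not fired.
	for _, t := range s.timeouts {
		if t.Deadline == s.Tick {
			if t.Canceled {
				s.IgnoredTimeouts = append(s.IgnoredTimeouts, t.ID)
			} else {
				s.FiredTimeouts = append(s.FiredTimeouts, t.ID)
			}
		}
	}
}

func main() {
	s := &Sim{
		pendingAcks: map[int][]string{3: {"barrier-1"}},
		timeouts: []*Timeout{
			{ID: "barrier-1", Deadline: 3}, // ack arrives same tick: ignored
			{ID: "catchup-1", Deadline: 3}, // no ack: fires with authority
		},
	}
	for s.Tick < 3 {
		s.Step()
	}
	fmt.Println(s.FiredTimeouts, s.IgnoredTimeouts) // [catchup-1] [barrier-1]
}
```

Keeping fired and ignored deadlines in separate lists is what makes replay/debug traces honest: a green run can still show that a stale timeout reached its deadline and was deliberately denied authority.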
+ +## Initial Boundary + +### `distsim` + +Keep for: + +- protocol correctness +- reference-state validation +- recoverability logic +- promotion / lineage rules + +### `eventsim` + +Grow for: + +- explicit event queue behavior +- timeout events +- equal-time scheduling choices +- race exploration + +## Working Rule + +Do not move all scenarios into `eventsim`. + +Only move or duplicate scenarios when: + +- timer or event ordering is the real bug surface +- `distsim` abstraction hides the important behavior + +## Accepted Phase 03 Decisions + +### Same-tick rule + +Within one tick: + +- data/message delivery is evaluated before timeout firing + +Meaning: + +- if an ack arrives in the same tick as a timeout deadline, the ack wins and may cancel the timeout + +This is now an explicit simulator rule, not accidental behavior. + +### Timeout authority + +Not every timeout that reaches its deadline still has authority to mutate state. + +So we now distinguish: + +- `FiredTimeouts` + - timeout had authority and changed the model +- `IgnoredTimeouts` + - timeout reached deadline but was stale and ignored + +This keeps replay/debug output honest. + +### Late barrier ack rule + +Once a barrier instance times out: + +- it is marked expired +- late ack for that barrier instance is rejected + +That prevents a stale ack from reviving old durability state. + +### Review gate rule for timer work + +Timer/race work is easy to get subtly wrong while still having green tests. 
+ +So timer-related work is not accepted until: + +- code path is reviewed +- tests assert the real protocol obligation +- stale and authoritative timer behavior are clearly distinguished diff --git a/sw-block/.private/phase/phase-03-log.md b/sw-block/.private/phase/phase-03-log.md new file mode 100644 index 000000000..4755e4a49 --- /dev/null +++ b/sw-block/.private/phase/phase-03-log.md @@ -0,0 +1,36 @@ +# Phase 03 Log + +Date: 2026-03-27 +Status: active + +## 2026-03-27 + +- Phase 03 created after Phase 02 core scope was effectively delivered. +- Reason for new phase: + - remaining simulator work is about timer semantics and race behavior, not basic protocol-state coverage +- Initial target: + - define `distsim` vs `eventsim` split more clearly + - add explicit timeout semantics + - add timer-race scenarios without bloating `distsim` ad hoc +- P0 delivered: + - timeout model added for barrier / catch-up / reservation + - timeout-backed scenarios added + - same-tick ordering rule defined as data-before-timers +- First review result: + - timeout semantics accepted only after making cancellation model-driven + - late barrier ack after timeout required explicit rejection +- P0 hardening delivered: + - recovery timeout cancellation moved into model logic + - stale late barrier ack rejected via expired-barrier tracking + - stale vs authoritative timeout distinction added: + - `FiredTimeouts` + - `IgnoredTimeouts` +- P1 delivered and reviewed: + - promotion vs stale timeout race + - rebuild completion vs epoch bump race + - trace builder moved into reusable code +- Current suite state at latest accepted review: + - 86 `distsim` tests passing +- Manager decision: + - Phase 03 P0/P1 are accepted + - next work should move to deliberate P2 selection rather than broadening the phase ad hoc diff --git a/sw-block/.private/phase/phase-03.md b/sw-block/.private/phase/phase-03.md new file mode 100644 index 000000000..001a2e1ce --- /dev/null +++ 
b/sw-block/.private/phase/phase-03.md @@ -0,0 +1,193 @@ +# Phase 03 + +Date: 2026-03-27 +Status: active +Purpose: define the next simulator tier after Phase 02, focused on timeout semantics, timer races, and a cleaner split between protocol simulation and event/interleaving simulation + +## Goal + +Phase 03 exists to cover behavior that current `distsim` still abstracts away: + +1. timeout semantics +2. timer races +3. event ordering under competing triggers +4. clearer separation between: + - protocol / lineage simulation + - event / race simulation + +This phase should not reopen already-closed Phase 02 protocol scope unless a clear bug is found. + +## Why A New Phase + +Phase 02 already delivered: + +- protocol-state assertions +- V1 / V1.5 / V2 comparison scenarios +- endpoint identity modeling +- control-plane assignment-update flow +- committed-prefix-aware promotion eligibility + +What remains is different in character: + +- timers +- delayed events racing with each other +- timeout-triggered state changes +- more explicit event scheduling + +That deserves a new phase boundary. + +## Source Of Truth + +Design/source-of-truth: +- `sw-block/design/v2_scenarios.md` +- `sw-block/design/v2-dist-fsm.md` +- `sw-block/design/v2-scenario-sources-from-v1.md` +- `sw-block/design/v1-v15-v2-comparison.md` + +Current prototype base: +- `sw-block/prototype/distsim/` +- `sw-block/prototype/distsim/simulator.go` + +## Scope + +### In scope + +1. timeout semantics +- barrier timeout +- catch-up timeout +- reservation expiry timeout +- rebuild timeout + +2. timer races +- delayed ack vs timeout +- timeout vs promotion +- reconnect vs timeout +- catch-up completion vs expiry +- rebuild completion vs epoch bump + +3. 
simulator split clarification +- `distsim` keeps: + - protocol correctness + - lineage + - recoverability + - reference-state checking +- `eventsim` grows into: + - event scheduling + - timer firing + - same-time interleavings + - race exploration + +### Out of scope + +- production integration +- real transport +- real disk timings +- SPDK +- raw allocator + +## Assigned Tasks For `sw` + +### P0 + +1. Write a concrete `eventsim` scope note in code/docs +- define what stays in `distsim` +- define what moves to `eventsim` +- avoid overlap and duplicated semantics + +2. Add minimal timeout event model +- first-class timeout event type(s) +- at minimum: + - barrier timeout + - catch-up timeout + - reservation expiry + +3. Add timeout-backed scenarios +- stale delayed ack vs timeout +- catch-up timeout before convergence +- reservation expiry during active recovery + +### P1 + +4. Add race-focused tests +- promotion vs delayed stale ack +- rebuild completion vs epoch bump +- reconnect success vs timeout firing + +5. Keep traces debuggable +- failing runs must dump: + - seed + - event order + - timer events + - node states + - committed prefix + +### P2 + +6. 
Decide whether selected `distsim` scenarios should also exist in `eventsim` +- only when timer/event ordering is the real point +- do not duplicate every scenario blindly + +## Current Progress + +Delivered in this phase so far: + +- `eventsim` scope note added in code +- explicit timeout model added: + - barrier timeout + - catch-up timeout + - reservation timeout +- timeout-backed scenarios added and reviewed +- same-tick rule made explicit: + - data before timers +- recovery timeout cancellation is now model-driven, not test-driven +- stale barrier ack after timeout is explicitly rejected +- stale timeouts are separated from authoritative timeouts: + - `FiredTimeouts` + - `IgnoredTimeouts` +- race-focused scenarios added and reviewed: + - promotion vs stale catch-up timeout + - promotion vs stale barrier timeout + - rebuild completion vs epoch bump + - epoch bump vs stale catch-up timeout +- reusable trace builder added for replay/debug support +- current `distsim` suite at latest review: + - 86 tests passing + +Remaining focus for `sw`: + +- Phase 03 P0 and P1 are effectively complete +- Phase 03 P2 is also effectively complete after review +- any further simulator work should now be narrow and evidence-driven +- recommended next simulator additions only: + - control-plane latency parameter + - sustained-write convergence / tail-chasing load test + - one multi-promotion lineage extension + +## Invariants To Preserve + +1. committed data remains durable per policy +2. uncommitted data is never revived as committed +3. stale epoch traffic never mutates current lineage +4. committed prefix remains contiguous +5. timeout-triggered transitions are explicit and explainable +6. races do not silently bypass fencing or rebuild boundaries + +## Required Updates Per Task + +For each completed task: + +1. add or update tests +2. update `sw-block/design/v2_scenarios.md` if scenario coverage changed +3. add a short note to: + - `sw-block/.private/phase/phase-03-log.md` +4. 
if the simulator boundary changed, record it in: + - `sw-block/.private/phase/phase-03-decisions.md` + +## Exit Criteria + +Phase 03 is done when: + +1. timeout semantics exist as explicit simulator behavior +2. at least three important timer-race scenarios are modeled and tested +3. `distsim` vs `eventsim` responsibilities are clearly separated +4. failure traces from race/timeout scenarios are replayable enough to debug diff --git a/sw-block/.private/phase/phase-04-decisions.md b/sw-block/.private/phase/phase-04-decisions.md new file mode 100644 index 000000000..500cbca74 --- /dev/null +++ b/sw-block/.private/phase/phase-04-decisions.md @@ -0,0 +1,97 @@ +# Phase 04 Decisions + +Date: 2026-03-27 +Status: initial + +## First Slice Decision + +The first standalone V2 implementation slice is: + +- per-replica sender ownership +- one active recovery session per replica per epoch + +## Why Not Start In V1 + +V1/V1.5 remains: + +- production line +- maintenance/fix line + +It should not be the place where V2 architecture is first implemented. + +## Why This Slice + +This slice: + +- directly addresses the clearest V1.5 structural pain +- maps cleanly to the V2-boundary tests +- is narrow enough to implement without dragging in the entire future architecture + +## Accepted P0 Refinements + +### Sender epoch coherence + +Sender-owned epoch is real state, not decoration. + +So: + +- reconcile/update paths must refresh sender epoch +- stale active session must be invalidated on epoch advance + +### Session lifecycle + +The first slice should not use a totally loose lifecycle shell. + +So: + +- session phase changes now follow an explicit transition map +- invalid jumps are rejected + +### Session attach rule + +Attaching a session at the wrong epoch is invalid. 
+ +So: + +- `AttachSession(epoch, kind)` must reject epoch mismatch with the owning sender + +## Accepted P1 Refinements + +### Session identity fencing + +The standalone V2 slice must reject stale completion by explicit session identity. + +So: + +- `RecoverySession` has stable unique identity +- sender completion must be by session ID, not by "current pointer" +- stale session results are rejected at the sender authority boundary + +### Ownership vs execution + +Ownership creation is not the same as execution start. + +So: + +- `AttachSession()` and `SupersedeSession()` establish ownership only +- `BeginConnect()` is the first execution-state mutation + +### Completion authority + +An ID match alone is not enough to complete recovery. + +So: + +- completion must require a valid completion-ready phase +- normal completion requires converged catch-up +- zero-gap fast completion is allowed explicitly from handshake + +## P2 Direction + +The next prototype step is not broader simulation. + +It is: + +- recovery outcome branching +- assignment-intent orchestration +- prototype-level end-to-end recovery flow diff --git a/sw-block/.private/phase/phase-04-log.md b/sw-block/.private/phase/phase-04-log.md new file mode 100644 index 000000000..33d013a23 --- /dev/null +++ b/sw-block/.private/phase/phase-04-log.md @@ -0,0 +1,46 @@ +# Phase 04 Log + +Date: 2026-03-27 +Status: active + +## 2026-03-27 + +- Phase 04 created to start the first standalone V2 implementation slice. 
+- Decision: + - do not begin in `weed/storage/blockvol/` + - begin under `sw-block/` +- first slice chosen: + - per-replica sender ownership + - explicit recovery-session ownership +- Initial slice delivered under `sw-block/prototype/enginev2/`: + - sender + - recovery session + - sender group +- First review found: + - sender/session epoch coherence gap + - session lifecycle was shell-only, not enforcing real transitions + - attach-session epoch mismatch was not rejected +- Follow-up delivered and accepted: + - reconcile updates preserved sender epoch + - epoch bump invalidates stale session + - session transition map enforced + - attach-session rejects epoch mismatch + - enginev2 tests increased to 26 passing +- Phase 04a created to close the ownership-validation gap: + - explicit session identity in `distsim` + - bridge tests into `enginev2` +- Phase 04a ownership problem closed well enough: + - stale completion rejected by session ID + - endpoint invalidation includes `CtrlAddr` + - boundary doc aligned with real simulator/prototype evidence +- Phase 04 P1 delivered and accepted: + - sender-owned execution APIs added + - all execution APIs fence on `sessionID` + - completion now requires valid completion point + - attach/supersede now establish ownership only + - handshake range validation added + - enginev2 tests increased to 46 passing +- Next phase focus narrowed to P2: + - recovery outcome branching + - assignment-intent orchestration + - prototype end-to-end recovery flow diff --git a/sw-block/.private/phase/phase-04.md b/sw-block/.private/phase/phase-04.md new file mode 100644 index 000000000..407d1f79a --- /dev/null +++ b/sw-block/.private/phase/phase-04.md @@ -0,0 +1,153 @@ +# Phase 04 + +Date: 2026-03-27 +Status: active +Purpose: start the first standalone V2 implementation slice under `sw-block/`, centered on per-replica sender ownership and explicit recovery-session ownership + +## Goal + +Build the first real V2 implementation slice without 
destabilizing V1. + +This slice should prove: + +1. per-replica sender identity +2. explicit one-session-per-replica recovery ownership +3. endpoint/assignment-driven recovery updates +4. clean handoff between normal sender and recovery session + +## Why This Phase Exists + +The simulator and design work are now strong enough to support a narrow implementation slice. + +We should not start with: + +- Smart WAL +- new storage engine +- frontend integration + +We should start with the ownership problem that most clearly separates V2 from V1.5. + +## Source Of Truth + +Design: +- `sw-block/design/v2-first-slice-session-ownership.md` +- `sw-block/design/v2-acceptance-criteria.md` +- `sw-block/design/v2-open-questions.md` + +Simulator reference: +- `sw-block/prototype/distsim/` + +## Scope + +### In scope + +1. per-replica sender owner object +2. explicit recovery session object +3. session lifecycle rules +4. endpoint update handling +5. basic tests for sender/session ownership + +### Out of scope + +- Smart WAL in production code +- real block backend redesign +- V1 integration +- frontend publication + +## Assigned Tasks For `sw` + +### P0 + +1. create standalone V2 implementation area under `sw-block/` +- recommended: + - `sw-block/prototype/enginev2/` + +2. define sender/session types +- sender owner per replica +- recovery session per replica per epoch + +3. 
implement basic lifecycle +- create sender +- attach session +- supersede stale session +- close session on success / invalidation + +## Current Progress + +Delivered in this phase so far: + +- standalone V2 area created under: + - `sw-block/prototype/enginev2/` +- core types added: + - `Sender` + - `RecoverySession` + - `SenderGroup` +- sender/session lifecycle shell implemented +- per-replica ownership implemented +- endpoint-change invalidation implemented +- sender epoch coherence implemented +- session epoch attach validation implemented +- session phase transitions now enforce a real transition map +- session identity fencing implemented +- stale completion rejected by session ID +- execution APIs implemented: + - `BeginConnect` + - `RecordHandshake` + - `BeginCatchUp` + - `RecordCatchUpProgress` + - `CompleteSessionByID` +- completion authority tightened: + - catch-up must converge + - zero-gap handshake fast path allowed +- attach/supersede now establish ownership only +- sender-group orchestration tests added +- current `enginev2` test state at latest review: + - 46 tests passing + +Next focus for `sw`: + +- continue Phase 04 beyond execution gating: + - recovery outcome branching + - sender-group orchestration from assignment intent + - prototype-level end-to-end recovery flow +- do not integrate into V1 production tree yet + +### P1 + +4. implement endpoint update handling +- changed-address update must refresh the right sender owner + +5. implement epoch invalidation +- stale session must stop after epoch bump + +6. add tests matching the slice acceptance + +### P2 + +7. add recovery outcome branching +- distinguish: + - zero-gap fast completion + - positive-gap catch-up completion + - unrecoverable gap / `NeedsRebuild` + +8. add assignment-intent driven orchestration +- move beyond raw reconcile-only tests +- make sender-group react to explicit recovery intent + +9. 
add prototype-level end-to-end flow tests +- assignment/update +- session creation +- execution +- completion / invalidation +- rebuild escalation + +## Exit Criteria + +Phase 04 is done when: + +1. standalone V2 sender/session slice exists under `sw-block/` +2. sender ownership is per replica, not set-global +3. one active recovery session per replica per epoch is enforced +4. endpoint update and epoch invalidation are tested +5. sender-owned execution flow is validated +6. recovery outcome branching exists at prototype level diff --git a/sw-block/.private/phase/phase-04a-decisions.md b/sw-block/.private/phase/phase-04a-decisions.md new file mode 100644 index 000000000..62d9a0681 --- /dev/null +++ b/sw-block/.private/phase/phase-04a-decisions.md @@ -0,0 +1,49 @@ +# Phase 04a Decisions + +Date: 2026-03-27 +Status: initial + +## Core Decision + +The next must-fix validation problem is: + +- sender/session ownership semantics + +This outranks: + +- more timing realism +- more WAL detail +- broader scenario growth + +## Why + +V2's core claim over V1.5 is not only: + +- better recovery policy + +It is also: + +- stable per-replica sender identity +- one active recovery owner +- stale work cannot mutate current state + +If those ownership rules are not validated, the simulator can overstate confidence. + +## Validation Rule + +For this phase, a scenario is only complete when it is expressed at two levels: + +1. simulator ownership model (`distsim`) +2. standalone implementation slice (`enginev2`) + +Real `weed/` adversarial tests remain the system-level gate. + +## Scope Discipline + +Do not expand this phase into: + +- generic simulator feature growth +- Smart WAL design growth +- V1 integration work + +Keep it focused on the ownership model. 
diff --git a/sw-block/.private/phase/phase-04a-log.md b/sw-block/.private/phase/phase-04a-log.md new file mode 100644 index 000000000..99c1453a0 --- /dev/null +++ b/sw-block/.private/phase/phase-04a-log.md @@ -0,0 +1,22 @@ +# Phase 04a Log + +Date: 2026-03-27 +Status: active + +## 2026-03-27 + +- Phase 04a created as a narrow validation phase. +- Reason: + - the biggest remaining V2 validation gap is ownership semantics + - not general scenario count + - not more timer realism + - not more WAL detail +- Scope chosen: + - sender identity + - recovery session identity + - supersede / invalidate rules + - stale completion rejection + - `distsim` to `enginev2` bridge tests +- This phase is intentionally separate from broad Phase 04 implementation growth. +- Goal: + - gain confidence that V2 is validated as owned session/sender protocol state, not only as policy diff --git a/sw-block/.private/phase/phase-04a.md b/sw-block/.private/phase/phase-04a.md new file mode 100644 index 000000000..6a6c535d1 --- /dev/null +++ b/sw-block/.private/phase/phase-04a.md @@ -0,0 +1,113 @@ +# Phase 04a + +Date: 2026-03-27 +Status: active +Purpose: close the critical V2 ownership-validation gap by making sender/session ownership explicit in both simulation and the standalone `enginev2` slice + +## Goal + +Validate the core V2 claim more deeply: + +1. one stable sender identity per replica +2. one active recovery session per replica +3. endpoint change, epoch bump, and supersede rules invalidate stale work +4. stale late results from old sessions cannot mutate current state + +This phase is not about adding broad new simulator surface. +It is about proving the ownership model that is supposed to make V2 better than V1.5. 
+ +## Why This Phase Exists + +Current simulation is already strong on: + +- quorum / commit rules +- stale epoch rejection +- catch-up vs rebuild +- timeout / race ordering +- changed-address recovery at the policy level + +The remaining critical risk is narrower: + +- the simulator still validates V2 strongly as policy +- but not yet strongly enough as owned sender/session protocol state + +That is the highest-value validation gap to close before trusting V2 too much. + +## Source Of Truth + +Design: +- `sw-block/design/v2-first-slice-session-ownership.md` +- `sw-block/design/v2-acceptance-criteria.md` +- `sw-block/design/v2-open-questions.md` +- `sw-block/design/protocol-development-process.md` + +Simulator / prototype: +- `sw-block/prototype/distsim/` +- `sw-block/prototype/enginev2/` + +Historical / review context: +- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` +- `sw-block/design/v2-scenario-sources-from-v1.md` + +## Scope + +### In scope + +1. explicit sender/session identity validation in `distsim` +2. explicit stale-session invalidation rules +3. bridge tests from `distsim` scenarios to `enginev2` sender/session invariants +4. doc cleanup so V2-boundary tests point to real simulator and `enginev2` coverage + +### Out of scope + +- Smart WAL expansion +- broad new timing realism +- TCP / disk realism +- V1 production integration +- new backend/storage engine work + +## Critical Questions To Close + +1. can an old session completion mutate state after a new session supersedes it? +2. does endpoint change invalidate or supersede the active session cleanly? +3. does epoch bump remove all authority from prior sessions? +4. can duplicate recovery triggers create overlapping active sessions? + +## Assigned Tasks For `sw` + +### P0 + +1. add explicit session identity to `distsim` +- model session ID or equivalent ownership token +- make stale session results rejectable by identity, not just by coarse state + +2. 
add ownership scenarios to `distsim` +- endpoint change during active catch-up +- epoch bump during active catch-up +- stale late completion from old session +- duplicate recovery trigger while a session is already active + +3. add bridge tests in `enginev2` +- same-address reconnect preserves sender identity +- endpoint bump supersedes or invalidates active session +- epoch bump rejects stale completion +- only one active session per sender + +### P1 + +4. tighten `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` +- point to actual `distsim` scenarios +- point to actual `enginev2` bridge tests +- state what remains real-engine-only + +5. only add simulator mechanics if a bridge test exposes a real ownership gap + +## Exit Criteria + +Phase 04a is done when: + +1. `distsim` explicitly validates sender/session ownership invariants +2. `enginev2` has bridge tests for the same invariants +3. stale session work is shown unable to mutate current sender state +4. V2-boundary doc no longer has stale simulator references +5. we can say with confidence that V2 ownership semantics, not just V2 policy, are validated at prototype level diff --git a/sw-block/README.md b/sw-block/README.md new file mode 100644 index 000000000..e27860624 --- /dev/null +++ b/sw-block/README.md @@ -0,0 +1,18 @@ +# sw-block + +Private WAL V2 and standalone block-service workspace. 
+ +Purpose: +- keep WAL V2 design/prototype work isolated from WAL V1 production code in `weed/storage/blockvol` +- allow private design notes and experiments to evolve without polluting V1 delivery paths +- keep the future standalone `sw-block` product structure clean enough to split into a separate repo later if needed + +Suggested layout: +- `design/`: shared V2 design docs +- `prototype/`: code prototypes and experiments +- `.private/`: private notes, phase development, roadmap, and non-public working material + +Repository direction: +- current state: `sw-block/` is an isolated workspace inside `seaweedfs` +- likely future state: `sw-block` becomes a standalone sibling repo/product +- design and prototype structure should therefore stay product-oriented and not depend on SeaweedFS-specific paths diff --git a/sw-block/design/README.md b/sw-block/design/README.md new file mode 100644 index 000000000..a1ee51100 --- /dev/null +++ b/sw-block/design/README.md @@ -0,0 +1,26 @@ +# V2 Design + +Current WAL V2 design set: +- `wal-replication-v2.md` +- `wal-replication-v2-state-machine.md` +- `wal-replication-v2-orchestrator.md` +- `wal-v2-tiny-prototype.md` +- `wal-v1-to-v2-mapping.md` +- `v2-dist-fsm.md` +- `v2_scenarios.md` +- `v1-v15-v2-comparison.md` +- `v2-scenario-sources-from-v1.md` +- `protocol-development-process.md` +- `v2-acceptance-criteria.md` +- `v2-open-questions.md` +- `v2-first-slice-session-ownership.md` +- `v2-prototype-roadmap-and-gates.md` + +These documents are the working design home for the V2 line. + +The original project-level copies under `learn/projects/sw-block/design/` remain as shared references for now. 
+ +Execution note: +- active development tracking for the current simulator phase lives under: + - `../.private/phase/phase-01.md` + - `../.private/phase/phase-02.md` diff --git a/sw-block/design/protocol-development-process.md b/sw-block/design/protocol-development-process.md new file mode 100644 index 000000000..e4480d22a --- /dev/null +++ b/sw-block/design/protocol-development-process.md @@ -0,0 +1,288 @@ +# Protocol Development Process + +Date: 2026-03-27 + +## Purpose + +This document defines how `sw-block` protocol work should be developed. + +The process is meant to work for: + +- V2 +- future V3 +- or a later block algorithm that is not WAL-based + +The point is to make protocol work systematic rather than reactive. + +## Core Philosophy + +### 1. Design before implementation + +Do not start with production code and hope the protocol becomes clear later. + +Start with: + +1. system contract +2. invariants +3. state model +4. scenario backlog + +Only then move to implementation. + +### 2. Real failures are inputs, not just bugs + +When V1 or V1.5 fails in real testing, treat that as: + +- a design requirement +- a scenario source +- a simulator input + +Do not patch and forget. + +### 3. Simulator is part of the protocol, not a side tool + +The simulator exists to answer: + +- what should happen +- what must never happen +- which old designs fail +- why the new design is better + +It is not a replacement for real testing. +It is the design-validation layer before production implementation. + +### 4. Passing tests are not enough + +Green tests are necessary, not sufficient. + +We also require: + +- explicit invariants +- explicit scenario intent +- clear state transitions +- review of assumptions and abstraction boundaries + +### 5. Keep hot-path and recovery-path reasoning separate + +Healthy steady-state behavior and degraded recovery behavior are different problems. + +Both must be designed explicitly. 
+ +## Development Ladder + +Every major protocol feature should move through these steps: + +1. **Problem statement** +- what real bug, limit, or product goal is driving the work + +2. **Contract** +- what the protocol guarantees +- what it does not guarantee + +3. **State model** +- node state +- coordinator state +- recovery state +- role / epoch / lineage rules + +4. **Scenario backlog** +- named scenarios +- source: + - real failure + - design obligation + - adversarial distributed case + +5. **Prototype / simulator** +- reduced but explicit model +- invariant checks +- V1 / V1.5 / V2 comparison where relevant + +6. **Implementation** +- production code only after the protocol shape is clear enough + +7. **Real validation** +- unit +- component +- integration +- real hardware where needed + +8. **Feedback loop** +- turn new failures back into scenario/design inputs + +## Required Artifacts + +For protocol work to be considered real progress, we usually want: + +### Design + +- design doc +- scenario doc +- comparison doc when replacing an older approach + +### Prototype + +- simulator or prototype code +- tests that assert protocol behavior + +### Implementation + +- production patch +- production tests +- docs updated to match the actual algorithm + +### Review + +- implementation gate +- design/protocol gate + +## Two-Gate Rule + +We use two acceptance gates. + +### Gate 1: implementation + +Owned by the coding side. + +Questions: + +- does it build? +- do tests pass? +- does it behave as intended in code? + +### Gate 2: protocol/design + +Owned by the design/review side. + +Questions: + +- is the logic actually sound? +- do tests prove the intended thing? +- are assumptions explicit? +- is the abstraction boundary honest? + +A task is not accepted until both gates pass. + +## Layering Rule + +Keep simulation layers separate. 
+ +### `distsim` + +Use for: + +- protocol correctness +- state transitions +- fencing +- recoverability +- promotion / lineage +- reference-state checking + +### `eventsim` + +Use for: + +- timeout behavior +- timer races +- event ordering +- same-tick / delayed event interactions + +Do not duplicate scenarios blindly across both layers. + +## Test Selection Rule + +Do not choose simulator inputs only from failing tests. + +Review all relevant tests and classify them by: + +- protocol significance +- simulator value +- implementation specificity + +Good simulator candidates often come from: + +- barrier truth +- catch-up vs rebuild +- stale message rejection +- failover / promotion safety +- changed-address restart +- mode semantics + +Keep real-only tests for: + +- wire format +- OS timing +- exact WAL file behavior +- frontend transport specifics + +## Version Comparison Rule + +When designing a successor protocol: + +- keep the old version visible +- reproduce the old failure or limitation +- show the improved behavior in the new version + +For `sw-block`, that means: + +- `V1` +- `V1.5` +- `V2` + +should be compared explicitly where possible. + +## Documentation Rule + +The docs must track three different things: + +### `learn/projects/sw-block/` + +Use for: + +- project history +- V1/V1.5 algorithm records +- phase records +- real test history + +### `sw-block/design/` + +Use for: + +- active design truth +- V2 and later protocol docs +- scenario backlog +- comparison docs + +### `sw-block/.private/phase/` + +Use for: + +- active execution plan +- log +- decisions + +## What Good Progress Looks Like + +A good protocol iteration usually has this pattern: + +1. real failure or design pressure identified +2. scenario named and written down +3. simulator reproduces the bad case +4. new protocol handles it explicitly +5. implementation follows +6. real tests validate it + +If one of those steps is missing, confidence is weaker. 
+ +## Bottom Line + +The process is: + +1. design the contract +2. model the state +3. define the scenarios +4. simulate the protocol +5. implement carefully +6. validate in real tests +7. feed failures back into design + +That is the process we should keep using for V2 and any later protocol line. diff --git a/sw-block/design/protocol-version-simulation.md b/sw-block/design/protocol-version-simulation.md new file mode 100644 index 000000000..49bab5e94 --- /dev/null +++ b/sw-block/design/protocol-version-simulation.md @@ -0,0 +1,252 @@ +# Protocol Version Simulation + +Date: 2026-03-26 +Status: design proposal +Purpose: define how the simulator should model WAL V1, WAL V1.5 (Phase 13), and WAL V2 on the same scenario set + +## Why This Exists + +The simulator is more valuable if the same scenario can answer: + +1. how WAL V1 behaves +2. how WAL V1.5 behaves +3. how WAL V2 should behave + +That turns the simulator into: +- a regression tool for V1/V1.5 +- a justification tool for V2 +- a comparison framework across protocol generations + +## Principle + +Do not fork three separate simulators. 
+ +Instead: +- keep one simulator core +- add protocol-version behavior modes +- run the same named scenario under different modes + +## Proposed Versions + +### `ProtocolV1` + +Intent: +- represent pre-Phase-13 behavior + +Behavior shape: +- WAL is streamed optimistically +- lagging replica is degraded/excluded quickly +- no real short-gap catch-up contract +- no retention-backed recovery window +- replica usually falls toward rebuild rather than incremental recovery + +What scenarios should expose: +- short outage still causes unnecessary degrade/rebuild +- transient jitter may be over-penalized +- poor graceful rejoin story + +### `ProtocolV15` + +Intent: +- represent Phase-13 WAL V1.5 behavior + +Behavior shape: +- reconnect handshake exists +- WAL catch-up exists +- primary may retain WAL longer for lagging replica +- recovery still depends heavily on address stability and control-plane timing +- catch-up may still tail-chase or stall operationally + +What scenarios should expose: +- transient disconnects may recover +- restart with new receiver address may still fail practical recovery +- tail-chasing / retention pressure remain structural risks + +### `ProtocolV2` + +Intent: +- represent the target design + +Behavior shape: +- explicit recovery reservation +- explicit catch-up vs rebuild boundary +- lineage-first promotion +- version-correct recovery sources +- explicit abort/rebuild path on non-convergence or lost recoverability + +What scenarios should show: +- short gap recovers cleanly +- impossible catch-up fails cleanly +- rebuild is explicit, not accidental + +## Behavior Axes To Toggle + +The simulator does not need completely different code paths. +It needs protocol-version-sensitive policy on these axes: + +### 1. 
Lagging replica treatment + +`V1`: +- degrade quickly +- no meaningful WAL catch-up window + +`V1.5`: +- allow WAL catch-up while history remains available + +`V2`: +- allow catch-up only with explicit recoverability / reservation + +### 2. WAL retention / recoverability + +`V1`: +- little or no retention for lagging-replica recovery + +`V1.5`: +- retention-based recovery window +- but no strong reservation contract + +`V2`: +- recoverability check plus reservation + +### 3. Restart / address stability + +`V1`: +- generally poor rejoin path + +`V1.5`: +- reconnect may work only if replica address is stable + +`V2`: +- address/identity assumptions should be explicit in the model + +### 4. Tail-chasing behavior + +`V1`: +- usually degrades rather than catches up + +`V1.5`: +- catch-up may be attempted but may never converge + +`V2`: +- non-convergence should explicitly abort/escalate + +### 5. Promotion policy + +`V1`: +- weaker lineage reasoning + +`V1.5`: +- improved epoch/LSN handling + +`V2`: +- lineage-first promotion is a first-class rule + +## Recommended Simulator API + +Add a version enum, for example: + +```go +type ProtocolVersion string + +const ( + ProtocolV1 ProtocolVersion = "v1" + ProtocolV15 ProtocolVersion = "v1_5" + ProtocolV2 ProtocolVersion = "v2" +) +``` + +Attach it to the simulator or cluster: + +```go +type Cluster struct { + Protocol ProtocolVersion + ... +} +``` + +## Policy Hooks + +Rather than branching everywhere, centralize the differences in a few hooks: + +1. `CanAttemptCatchup(...)` +2. `CatchupConvergencePolicy(...)` +3. `RecoverabilityPolicy(...)` +4. `RestartRejoinPolicy(...)` +5. `PromotionPolicy(...)` + +That keeps the simulator readable. 
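As a sketch of how one of those hooks could centralize the version differences, here is `CanAttemptCatchup` varied across the three versions. The hook name comes from the list above; `CatchupInput` and its fields are assumptions for illustration, not the simulator's real state:

```go
package main

import "fmt"

type ProtocolVersion string

const (
	ProtocolV1  ProtocolVersion = "v1"
	ProtocolV15 ProtocolVersion = "v1_5"
	ProtocolV2  ProtocolVersion = "v2"
)

// CatchupInput is an illustrative assumption, not real simulator state.
type CatchupInput struct {
	HistoryRetained bool // retained WAL still covers the replica's gap
	HasReservation  bool // V2-style explicit recovery reservation
}

// CanAttemptCatchup encodes the per-version policy from the axes above:
// V1 has no meaningful catch-up window, V1.5 needs retained history,
// V2 additionally requires an explicit reservation.
func CanAttemptCatchup(v ProtocolVersion, in CatchupInput) bool {
	switch v {
	case ProtocolV1:
		return false
	case ProtocolV15:
		return in.HistoryRetained
	case ProtocolV2:
		return in.HistoryRetained && in.HasReservation
	default:
		return false
	}
}

func main() {
	// Same input, three protocol modes: retained history, no reservation.
	in := CatchupInput{HistoryRetained: true, HasReservation: false}
	for _, v := range []ProtocolVersion{ProtocolV1, ProtocolV15, ProtocolV2} {
		fmt.Printf("%s: catch-up=%v\n", v, CanAttemptCatchup(v, in))
	}
}
```

With this shape, scenario code calls the hook and never branches on version directly, which is what keeps the single-simulator-core principle intact.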
+ +## Example Scenario Comparisons + +### Scenario: brief disconnect + +`V1`: +- likely degrade / no efficient catch-up + +`V1.5`: +- catch-up may succeed if address/history remain stable + +`V2`: +- explicit recoverability + reservation +- catch-up only if the missing window is still recoverable +- otherwise explicit rebuild + +### Scenario: replica restart with new receiver port + +`V1`: +- poor recovery path + +`V1.5`: +- background reconnect fails if it retries stale address + +`V2`: +- identity/address model must make this explicit +- direct reconnect is not assumed +- use explicit reassignment plus catch-up if recoverable, otherwise rebuild cleanly + +### Scenario: primary writes faster than catch-up + +`V1`: +- replica degrades + +`V1.5`: +- may tail-chase indefinitely or pin WAL too long + +`V2`: +- explicit non-convergence detection -> abort / rebuild + +## What To Measure + +For each scenario, compare: + +1. does committed data remain safe? +2. does uncommitted data stay out of committed lineage? +3. does recovery complete or stall? +4. does protocol choose catch-up or rebuild? +5. is the outcome explicit or accidental? + +## Immediate Next Step + +Start with a minimal versioned policy layer: + +1. add `ProtocolVersion` +2. implement one or two version-sensitive hooks: + - `CanAttemptCatchup` + - `CatchupConvergencePolicy` +3. run existing scenarios under: + - `ProtocolV1` + - `ProtocolV15` + - `ProtocolV2` + +That is enough to begin proving: +- V1 breaks +- V1.5 improves but still strains +- V2 handles the same scenario more cleanly + +## Bottom Line + +The same scenario set should become a comparison harness across protocol generations. + +That is one of the strongest uses of the simulator: +- not only "does V2 work?" +- but "why is V2 better than V1 and V1.5?" 
diff --git a/sw-block/design/v1-v15-v2-comparison.md b/sw-block/design/v1-v15-v2-comparison.md new file mode 100644 index 000000000..4df8dc0d6 --- /dev/null +++ b/sw-block/design/v1-v15-v2-comparison.md @@ -0,0 +1,314 @@ +# V1, V1.5, and V2 Comparison + +Date: 2026-03-27 + +## Purpose + +This document compares: + +- `V1`: original replicated WAL shipping model +- `V1.5`: Phase 13 catch-up-first improvements on top of V1 +- `V2`: explicit FSM / orchestrator / recoverability-driven design under `sw-block/` + +It is a design comparison, not a marketing document. + +## 1. One-line summary + +- `V1` is simple but weak on short-gap recovery. +- `V1.5` materially improves recovery, but still relies on assumptions and incremental control-plane fixes. +- `V2` is structurally cleaner, more explicit, and easier to validate, but is not yet a production engine. + +## 2. Steady-State Hot Path + +In the healthy case, all three versions can look similar: + +1. primary appends ordered WAL +2. primary ships entries to replicas +3. replicas apply in order +4. durability barrier determines when client-visible commit completes + +### V1 + +- simplest replication path +- lagging replica typically degrades quickly +- little explicit recovery structure + +### V1.5 + +- same basic hot path as V1 +- WAL retention and reconnect/catch-up improve short outage handling +- extra logic exists, but much of it is off the hot path + +### V2 + +- can keep a similar hot path if implemented carefully +- extra complexity is mainly in: + - recovery planner + - replica state machine + - coordinator/orchestrator + - recoverability checks + +### Performance expectation + +In a normal healthy cluster: + +- `V2` should not be much heavier than `V1.5` +- most V2 complexity sits in failure/recovery/control paths +- there is no proof yet that V2 has better steady-state throughput or latency + +## 3. 
Recovery Behavior + +### V1 + +Recovery is weakly structured: + +- lagging replica tends to degrade +- short outage often becomes rebuild or long degraded state +- little explicit catch-up boundary + +### V1.5 + +Recovery is improved: + +- short outage can recover by retained-WAL catch-up +- background reconnect closes the `sync_all` dead-loop +- catch-up-first is preferred before rebuild + +But the model is still partly implicit: + +- reconnect depends on endpoint stability unless control plane refreshes assignment +- recoverability boundary is not as explicit as V2 +- tail-chasing and retention pressure still need policy care + +### V2 + +Recovery is explicit by design: + +- `InSync` +- `Lagging` +- `CatchingUp` +- `NeedsRebuild` +- `Rebuilding` + +And explicit decisions exist for: + +- catch-up vs rebuild +- stale-epoch rejection +- promotion candidate choice +- recoverable vs unrecoverable gap + +## 4. Real V1.5 Lessons + +The main V2 requirements come from real V1.5 behavior. + +### 4.1 Changed-address restart + +Observed in `CP13-8 T4b`: + +- replica restarted +- endpoint changed +- primary shipper held stale address +- direct reconnect could not succeed until control plane refreshed assignment + +V1.5 fix: + +- saved address used only as hint +- heartbeat-reported address becomes source of truth +- master refreshes primary assignment + +Lesson for V2: + +- endpoint is not identity +- reassignment must be explicit + +### 4.2 Reconnect race + +Observed in Phase 13 review: + +- barrier path and background reconnect path could both trigger reconnect + +V1.5 fix: + +- `reconnectMu` serializes reconnect / catch-up + +Lesson for V2: + +- one active recovery session per replica should be a protocol rule, not just a local mutex trick + +### 4.3 Tail-chasing + +Even with retained WAL: + +- primary may write faster than a lagging replica can recover +- catch-up may not converge + +Lesson for V2: + +- explicit abort / `NeedsRebuild` +- do not pretend catch-up will 
always work + +### 4.4 Control-plane recovery latency + +V1.5 can be correct but still operationally slow if recovery waits on slower management cycles. + +Lesson for V2: + +- keep authority in coordinator +- but make recovery decisions explicit and fast when possible + +## 5. V2 Structural Improvements + +V2 is better primarily because it is easier to reason about and validate. + +### 5.1 Better state model + +Instead of implicit recovery behavior, V2 has: + +- per-replica FSM +- volume/orchestrator model +- distributed simulator with scenario coverage + +### 5.2 Better validation + +V2 has: + +- named scenario backlog +- protocol-state assertions +- randomized simulation +- V1/V1.5/V2 comparison tests + +This is a major difference from V1/V1.5, where many fixes were discovered through implementation and hardware testing first. + +### 5.3 Better correctness boundaries + +V2 makes these explicit: + +- recoverable gap vs rebuild +- stale traffic rejection +- promotion lineage safety +- reservation or payload availability transitions + +## 6. Stability Comparison + +### Current judgment + +- `V1`: least stable under failure/recovery stress +- `V1.5`: meaningfully better and now functionally validated on real tests +- `V2`: best protocol structure and best simulator confidence + +### Important limit + +`V2` is not yet proven more stable in production because: + +- it is not a production engine yet +- confidence comes from simulator/design work, not real block workload deployment + +So the accurate statement is: + +- `V2` is more stable **architecturally** +- `V1.5` is more stable **operationally today** because it is implemented and tested on real hardware + +## 7. 
Performance Comparison + +### What is likely true + +`V2` should perform better than rebuild-heavy recovery approaches when: + +- outage is short +- gap is recoverable +- catch-up avoids full rebuild + +It should also behave better under: + +- flapping replicas +- stale delayed messages +- mixed-state replica sets + +### What is not yet proven + +We do not yet know whether `V2` has: + +- better steady-state throughput +- lower p99 latency +- lower CPU overhead +- lower memory overhead + +than `V1.5` + +That requires real implementation and benchmarking. + +## 8. Smart WAL Fit + +### Why Smart WAL is awkward in V1/V1.5 + +V1/V1.5 do not naturally model: + +- payload classes +- recoverability reservations +- historical payload resolution +- explicit recoverable/unrecoverable transition + +So Smart WAL would be harder to add cleanly there. + +### Why Smart WAL fits V2 better + +V2 already has the right conceptual slots: + +- `RecoveryClass` + - `WALInline` + - `ExtentReferenced` +- recoverability planner +- catch-up vs rebuild decision point +- simulator for payload-availability transitions + +### Important rule + +Smart WAL must not mean: + +- “read current extent for old LSN” + +That is incorrect. + +Historical correctness requires: + +- WAL inline payload +- or pinned snapshot/versioned extent state +- not current live extent contents + +## 9. What Is Proven Today + +### Proven + +- `V1.5` significantly improves V1 recovery behavior +- real `CP13-8` testing validated the V1.5 data path and `sync_all` behavior +- the V2 simulator covers: + - stale traffic rejection + - tail-chasing + - flapping replicas + - multi-promotion lineage + - changed-address restart comparison + - same-address transient outage comparison + - Smart WAL availability transitions + +### Not yet proven + +- V2 production implementation quality +- V2 steady-state performance advantage +- V2 real hardware recovery performance + +## 10. 
Bottom Line + +If choosing based on current evidence: + +- use `V1.5` as the production line today +- use `V2` as the better long-term architecture + +If choosing based on protocol quality: + +- `V2` is clearly better structured +- `V1.5` is still more ad hoc, even after successful fixes + +If choosing based on current real-world proof: + +- `V1.5` has the stronger operational evidence today +- `V2` has the stronger design and simulation evidence today diff --git a/sw-block/design/v1-v15-v2-simulator-goals.md b/sw-block/design/v1-v15-v2-simulator-goals.md new file mode 100644 index 000000000..5de67eb31 --- /dev/null +++ b/sw-block/design/v1-v15-v2-simulator-goals.md @@ -0,0 +1,281 @@ +# V1 / V1.5 / V2 Simulator Goals + +Date: 2026-03-26 +Status: working design note +Purpose: define how the simulator should be used against WAL V1, Phase-13 V1.5, and WAL V2 + +## Why This Exists + +The simulator is not only for validating V2. + +It should also be used to: + +1. break WAL V1 +2. stress WAL V1.5 / Phase 13 +3. justify why WAL V2 is needed + +This note defines what failures we want the simulator to find in each protocol generation. + +## What The Simulator Can And Cannot Do + +### What it is good at + +The simulator is good at: + +1. finding concrete counterexamples +2. exposing bad protocol assumptions +3. checking commit / failover / fencing invariants +4. checking historical data correctness at target `LSN` + +### What it is not + +The simulator is not a full proof unless promoted to formal model checking. 
+ +So the right claim is: + +- "no issue found under these modeled runs" + +not: + +- "protocol proven correct in all implementations" + +## Protocol Targets + +### WAL V1 + +Core shape: +- primary ships WAL out +- lagging replica degrades quickly +- no real recoverability contract +- no strong short-gap catch-up window + +Primary risk: +- a briefly lagging replica gets downgraded too early and forced into rebuild + +### WAL V1.5 / Phase 13 + +Core shape: +- primary retains WAL longer for lagging replicas +- reconnect / catch-up exists +- rebuild fallback exists +- primary may wait before releasing WAL + +Primary risks: +- WAL pinning +- tail chasing +- slow availability recovery +- recoverability assumptions that do not hold long enough + +### WAL V2 + +Core shape: +- explicit state machine +- explicit recoverability / reservation +- catch-up vs rebuild boundary is formalized +- eventual support for `WALInline` vs `ExtentReferenced` + +Primary goal: +- no committed data loss +- no false recovery +- cheaper and clearer short-gap recovery + +## What To Find In WAL V1 + +The simulator should try to find scenarios where V1 fails operationally or structurally. + +### V1-F1. Short Disconnect Still Forces Rebuild + +Sequence: +1. replica disconnects briefly +2. primary continues writing +3. replica returns quickly + +Expected ideal behavior: +- short-gap catch-up + +What V1 may do: +- downgrade replica too early +- no usable catch-up path +- rebuild required unnecessarily + +### V1-F2. Jitter Causes Avoidable Degrade + +Sequence: +1. replica is alive but sees delayed/reordered delivery +2. primary interprets this as lag/failure + +Failure signal: +- unnecessary downgrade or exclusion + +### V1-F3. Repeated Brief Flaps Cause Thrash + +Sequence: +1. repeated short disconnect/reconnect +2. primary repeatedly degrades replica + +Failure signal: +- poor availability +- excessive rebuild churn + +### V1-F4. No Efficient Path Back To Healthy State + +Sequence: +1. 
replica becomes degraded +2. network recovers + +Failure signal: +- control plane or protocol provides no clean short recovery path + +## What To Find In WAL V1.5 / Phase 13 + +The simulator should stress whether retention-based catch-up is actually enough. + +### V15-F1. Tail Chasing Under Ongoing Writes + +Sequence: +1. replica reconnects behind +2. primary keeps writing +3. catch-up tries to close the gap + +Failure signal: +- replica never converges +- stays forever behind +- no clean escalation path + +### V15-F2. WAL Pinning Harms System Progress + +Sequence: +1. replica lags +2. primary retains WAL to help recovery +3. lag persists + +Failure signal: +- WAL window remains pinned too long +- reclaim stalls +- system availability or throughput suffers + +### V15-F3. Catch-Up Window Expires Mid-Recovery + +Sequence: +1. catch-up begins +2. primary continues advancing +3. required recoverability disappears before completion + +Failure signal: +- protocol still claims success +- or lacks a clean abort-to-rebuild path + +### V15-F4. Restart Recovery Too Slow + +Sequence: +1. replica restarts +2. primary blocks writes correctly under `sync_all` +3. service recovery takes too long + +Failure signal: +- correctness preserved +- but availability recovery is operationally unacceptable + +### V15-F5. Multiple Lagging Replicas Poison Progress + +Sequence: +1. more than one replica lags +2. retention and recovery obligations interact + +Failure signal: +- one slow replica or mixed states poison the entire volume behavior + +## What WAL V2 Should Survive + +V2 should not merely avoid V1/V1.5 failures. +It should make them explicit and manageable. + +### V2-S1. Short Gap Recovers Cheaply + +Expected: +- brief disconnect -> catch-up -> promote +- no rebuild + +### V2-S2. Impossible Catch-Up Fails Cleanly + +Expected: +- not fully recoverable -> `NeedsRebuild` +- no pretend success + +### V2-S3. 
Reservation Loss Forces Correct Abort + +Expected: +- once recoverability is lost, catch-up aborts +- rebuild path takes over + +### V2-S4. Promotion Is Lineage-First + +Expected: +- new primary chosen from valid lineage +- not simply highest apparent `LSN` + +### V2-S5. Historical Data Correctness Is Preserved + +Expected: +- no rebuild from current extent pretending to be old state +- correct snapshot/base + replay behavior + +## Simulation Strategy By Version + +### For V1 + +Use simulator to: +- break it +- demonstrate avoidable rebuilds and downgrade behavior + +The simulator is mainly a diagnostic and justification tool here. + +### For V1.5 + +Use simulator to: +- stress retention-based catch-up +- find operational limits +- expose where retention alone is not enough + +The simulator is a stress and tradeoff tool here. + +### For V2 + +Use simulator to: +- validate named protocol scenarios +- validate random/adversarial runs +- confirm state + data correctness under failover/recovery + +The simulator is a design-validation tool here. + +## Practical Outcome + +If the simulator finds: + +### On V1 +- short outages still lead to rebuild + +Then conclusion: +- V1 lacks a real short-gap recovery story + +### On V1.5 +- retention helps but can still tail-chase or pin WAL too long + +Then conclusion: +- V1.5 is a useful bridge, but not the final architecture + +### On V2 +- catch-up/rebuild boundary is explicit and safe + +Then conclusion: +- V2 solves the protocol problem more cleanly + +## Bottom Line + +Use the simulator differently for each generation: + +1. WAL V1: find where it breaks +2. WAL V1.5: find where it strains +3. WAL V2: validate that it behaves correctly and more cleanly + +That is how the simulator justifies the architectural move from V1 to V2. 
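## Appendix: Per-Generation Harness Sketch

The three usage modes above can share one scenario entry point that only swaps the generation's policy. The sketch below is non-normative: the `Outcome` shape and the coarse gap/rate model are illustrative stand-ins for real simulator runs, and the encoded expectations mirror `V1-F1`, `V15-F1`, `V2-S1`, and `V2-S2` rather than prove them.

```go
package main

// Outcome summarizes one simulated run (illustrative, not measured).
type Outcome struct {
	Path string // "catchup", "tail-chase", or "rebuild"
}

// simulateBriefDisconnect models the brief-disconnect scenario at the
// coarsest level: the replica misses `gap` records, the primary retains
// `retained` records of WAL, writes arrive at writeRate while catch-up
// drains at catchupRate.
func simulateBriefDisconnect(version string, gap, retained, writeRate, catchupRate uint64) Outcome {
	switch version {
	case "V1":
		// No short-gap recovery story: any gap forces rebuild (V1-F1).
		return Outcome{Path: "rebuild"}
	case "V1.5":
		if gap > retained {
			return Outcome{Path: "rebuild"}
		}
		if writeRate >= catchupRate {
			// Retention helps but the replica never converges (V15-F1).
			return Outcome{Path: "tail-chase"}
		}
		return Outcome{Path: "catchup"}
	default: // "V2"
		if gap > retained || writeRate >= catchupRate {
			// Explicit escalation instead of pretend progress (V2-S2).
			return Outcome{Path: "rebuild"}
		}
		return Outcome{Path: "catchup"} // V2-S1
	}
}
```

Real runs must confirm these expectations; the harness only makes the comparison mechanical.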
diff --git a/sw-block/design/v2-acceptance-criteria.md b/sw-block/design/v2-acceptance-criteria.md new file mode 100644 index 000000000..7a77d4f09 --- /dev/null +++ b/sw-block/design/v2-acceptance-criteria.md @@ -0,0 +1,280 @@ +# V2 Acceptance Criteria + +Date: 2026-03-27 + +## Purpose + +This document defines the minimum protocol-validation bar for V2. + +It is not the full scenario backlog. + +It is the smaller acceptance set that should be true before we claim: + +- the V2 protocol shape is validated enough to guide implementation + +## Scope + +This acceptance set is about: + +- protocol correctness +- recovery correctness +- lineage / fencing correctness +- data correctness at target `LSN` + +This acceptance set is not yet about: + +- production performance +- frontend integration +- wire protocol +- disk implementation details + +## Acceptance Rule + +A V2 acceptance item should satisfy all of: + +1. named scenario +2. explicit expected behavior +3. simulator coverage +4. clear invariant or pass condition +5. mapped reason why it matters + +## Acceptance Set + +### A1. Committed Data Survives Failover + +Must prove: + +- acknowledged data is not lost after primary failure and promotion + +Evidence: + +- `S1` +- distributed simulator pass + +Pass condition: + +- promoted node matches reference state at committed `LSN` + +### A2. Uncommitted Data Is Not Revived + +Must prove: + +- non-acknowledged writes do not become committed after failover + +Evidence: + +- `S2` + +Pass condition: + +- committed prefix remains at the previous valid boundary + +### A3. Stale Epoch Traffic Is Fenced + +Must prove: + +- old primary / stale sender traffic cannot mutate current lineage + +Evidence: + +- `S3` +- stale write / stale barrier / stale delayed ack scenarios + +Pass condition: + +- stale traffic is rejected +- committed prefix does not change + +### A4. 
Short-Gap Catch-Up Works + +Must prove: + +- brief outage with recoverable gap returns via catch-up, not rebuild + +Evidence: + +- `S4` +- same-address transient outage comparison + +Pass condition: + +- recovered replica returns to `InSync` +- final state matches reference + +### A5. Non-Convergent Catch-Up Escalates Explicitly + +Must prove: + +- tail-chasing or failed catch-up does not pretend success + +Evidence: + +- `S6` + +Pass condition: + +- explicit `CatchingUp -> NeedsRebuild` + +### A6. Recoverability Boundary Is Explicit + +Must prove: + +- recoverable vs unrecoverable gap is decided explicitly + +Evidence: + +- `S7` +- Smart WAL availability transition scenarios + +Pass condition: + +- recovery aborts when reservation/payload availability is lost +- rebuild becomes the explicit fallback + +### A7. Historical Data Correctness Holds + +Must prove: + +- recovered data for target `LSN` is historically correct +- current extent cannot fake old history + +Evidence: + +- `S8` +- `S9` + +Pass condition: + +- snapshot + tail rebuild matches reference state +- current-extent reconstruction of old `LSN` fails correctness + +### A8. Durability Mode Semantics Are Correct + +Must prove: + +- `best_effort`, `sync_all`, and `sync_quorum` behave as intended under mixed replica states + +Evidence: + +- `S10` +- `S11` +- timeout-backed quorum/all race tests + +Pass condition: + +- `sync_all` remains strict +- `sync_quorum` commits only with true durable quorum +- invalid `sync_quorum` topology assumptions are rejected + +### A9. Promotion Uses Safe Candidate Eligibility + +Must prove: + +- promotion requires: + - running + - epoch alignment + - state eligibility + - committed-prefix sufficiency + +Evidence: + +- stronger `S12` +- candidate eligibility tests + +Pass condition: + +- unsafe candidates are rejected by default +- desperate promotion, if any, is explicit and separate + +### A10. 
Changed-Address Restart Is Explicitly Recoverable + +Must prove: + +- endpoint is not identity +- changed-address restart does not rely on stale endpoint reuse + +Evidence: + +- V1 / V1.5 / V2 changed-address comparison +- endpoint-version / assignment-update simulator flow + +Pass condition: + +- stale endpoint is rejected +- control-plane update refreshes primary view +- recovery proceeds only after explicit update + +### A11. Timeout Semantics Are Explicit + +Must prove: + +- barrier, catch-up, and reservation timeouts are first-class protocol behavior + +Evidence: + +- Phase 03 P0 timeout tests + +Pass condition: + +- timeout effects are explicit +- stale timeouts do not regress recovered state +- late barrier ack after timeout is rejected + +### A12. Timer Races Are Stable + +Must prove: + +- timer/event ordering does not silently break protocol guarantees + +Evidence: + +- Phase 03 P1/P2 race tests + +Pass condition: + +- same-tick ordering is explicit +- promotion / epoch bump / timeout interactions preserve invariants +- traces are debuggable + +## Compare Requirement + +Where meaningful, V2 acceptance should include comparison against: + +- `V1` +- `V1.5` + +Especially for: + +- changed-address restart +- same-address transient outage +- tail-chasing +- slow control-plane recovery + +## Required Evidence + +Before calling V2 protocol validation “good enough”, we want: + +1. scenario coverage in `v2_scenarios.md` +2. selected simulator tests in `distsim` +3. timing/race tests in `eventsim` +4. V1 / V1.5 / V2 comparison where relevant +5. review sign-off that the tests prove the right thing + +## What This Does Not Prove + +Even if all acceptance items pass, this still does not prove: + +- production implementation quality +- wire protocol correctness +- real performance +- disk-level behavior + +Those require later implementation and real-system validation. 
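## Sketch: A7 Oracle

A7's pass condition can be mechanized with the distributed simulator's deterministic oracle, where every synthetic write stores its own `LSN` as the block value (a non-normative sketch; `ReferenceModel` is a hypothetical name for the oracle used by `distsim`):

```go
package main

// write is one synthetic 4K write; the value written is always the LSN.
type write struct {
	lsn   uint64
	block uint64
}

// ReferenceModel replays committed history onto an idealized block map,
// so it can answer "what should each block contain at target LSN?".
type ReferenceModel struct {
	history []write // committed writes in LSN order
}

func (r *ReferenceModel) Append(lsn, block uint64) {
	r.history = append(r.history, write{lsn: lsn, block: block})
}

// StateAt returns the block map as of the target LSN under the
// deterministic value = LSN convention.
func (r *ReferenceModel) StateAt(target uint64) map[uint64]uint64 {
	state := make(map[uint64]uint64)
	for _, w := range r.history {
		if w.lsn > target {
			break // history is ordered; later writes are invisible at target
		}
		state[w.block] = w.lsn
	}
	return state
}
```

With block 7 written at `LSN 10` and again at `LSN 12`, any recovery that claims state at `LSN 10` but returns 12 for block 7 fails A7 immediately; that is exactly the current-extent trap.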
+ +## Bottom Line + +If A1 through A12 are satisfied, V2 is validated enough at the protocol/design level to justify: + +1. implementation slicing +2. Smart WAL design refinement +3. later real-engine integration diff --git a/sw-block/design/v2-dist-fsm.md b/sw-block/design/v2-dist-fsm.md new file mode 100644 index 000000000..6fc311c94 --- /dev/null +++ b/sw-block/design/v2-dist-fsm.md @@ -0,0 +1,234 @@ +# WAL V2 Distributed Simulator + +Date: 2026-03-26 +Status: design proposal +Purpose: define the next prototype layer above `ReplicaFSM` and `VolumeModel` so WAL V2 can be validated as a distributed state machine rather than only a local state machine + +## Why This Exists + +The current V2 prototype already has: + +- `ReplicaFSM` +- `VolumeModel` +- `RecoveryPlanner` +- scenario tracing + +That is enough to reason about local recovery logic and volume-level admission. + +It is not enough to prove the distributed safety claim. + +The real system question is: + +- when time moves forward, nodes start/stop/disconnect/reconnect, and the coordinator changes epoch, +- do all acknowledged writes remain recoverable according to the configured durability policy? + +That requires a distributed simulator. + +## Core Idea + +Model the system as: + +1. node-local state machines +2. a coordinator state machine +3. a time-driven message simulator +4. a reference data model used as the correctness oracle + +## Layers + +### 1. `NodeModel` + +Each node has: + +- role +- epoch seen +- local WAL state + - head + - tail + - `receivedLSN` + - `flushedLSN` +- checkpoint/snapshot state + - `cpLSN` +- local extent state +- local connectivity state +- local `ReplicaFSM` for each remote relationship as needed + +### 2. `CoordinatorModel` + +The coordinator owns: + +- current epoch +- primary assignment +- membership +- durability policy +- rebuild assignments +- promotion decisions + +### 3. 
`Network/Time Simulator` + +The simulator owns: + +- logical time ticks +- message delivery queues +- delay, drop, and disconnect events +- node start/stop/restart + +### 4. `Reference Model` + +The reference model is the correctness oracle. + +It applies the committed write history to an idealized block map. +At any target `LSN = X`, it can answer: + +- what value should each block contain at `X`? + +## Data Correctness Model + +### Synthetic 4K writes + +For simulation, each 4K write should be represented as: + +- block ID +- value + +A simple deterministic choice is: +- `value = LSN` + +Example: +- `LSN 10`: write block 7 = 10 +- `LSN 11`: write block 2 = 11 +- `LSN 12`: write block 7 = 12 + +This makes correctness checks trivial. + +### Why this matters + +This catches the exact extent-recovery trap: + +1. `LSN 10`: block 7 = 10 +2. `LSN 12`: block 7 = 12 + +If recovery claims to rebuild state at `LSN 10` using current extent and returns block 7 = 12, the simulator detects the bug immediately. + +## Golden Invariant + +For any node declared recovered to target `LSN = T`: + +- node extent state must equal the reference model's state at `T` + +Not: +- equal to current latest state +- equal to any valid-looking value + +Exactly: +- the reference state at target `LSN` + +## Recovery Correctness Rules + +### WAL replay correctness + +For `(startLSN, endLSN]` replay to be valid: + +- every record in the interval must exist +- every payload must be the correct historical version for its LSN +- no replay gaps are allowed +- no stale-epoch records are allowed + +### Extent/snapshot correctness + +Extent-based recovery is valid only if the data source is version-correct. 
+ +Allowed examples: +- immutable snapshot at `cpLSN` +- pinned copy-on-write generation +- pinned payload object referenced by a recovery record + +Not allowed: +- current live extent used as if it were historical state at old `cpLSN` + +## Suggested Prototype Package + +Prototype location: +- `sw-block/prototype/distsim/` + +Suggested files: +- `types.go` +- `node.go` +- `coordinator.go` +- `network.go` +- `reference.go` +- `scenario.go` +- `sim_test.go` + +## Minimal First Milestone + +Do not try to simulate the whole product first. + +First milestone: + +1. one primary +2. one replica +3. time ticks +4. synthetic 4K writes with deterministic values +5. canonical reference model +6. simple recovery check: + - WAL replay recovers correct value + - current extent alone does not recover old `LSN` + - snapshot/base image at `cpLSN` does recover correct value + +If that milestone is solid, then add: +- failover +- quorum +- multi-replica +- coordinator promotion rules + +## Test Cases To Add Early + +### 1. WAL replay preserves historical values +- write block 7 = 10 +- write block 7 = 12 +- replay only to `LSN 10` +- expect block 7 = 10 + +### 2. Current extent cannot reconstruct old `LSN` +- same write sequence +- try rebuilding `LSN 10` from latest extent +- expect mismatch/error + +### 3. Snapshot at `cpLSN` works +- snapshot at `LSN 10` +- later overwrite block 7 at `LSN 12` +- rebuild from snapshot `LSN 10` +- expect block 7 = 10 + +### 4. 
Reservation expiration invalidates recovery +- recovery window initially valid +- time advances +- reservation expires +- recovery must abort rather than return partial or wrong state + +## Relationship To Existing Prototype + +This simulator should reuse existing prototype concepts where possible: + +- `fsmv2` for node-local recovery lifecycle +- `volumefsm` ideas for mode semantics and admission +- `RecoveryPlanner` for recoverability decisions + +The simulator is the next proof layer: +- not just whether transitions are legal +- but whether data remains correct under those transitions + +## Bottom Line + +WAL V2 correctness is not only a state problem. +It is also a data-version problem. + +The distributed simulator should therefore prove two things together: + +1. state-machine safety +2. data correctness at target `LSN` + +That is the right next prototype layer if the goal is to prove: +- quorum commit safety +- no committed data loss +- no incorrect recovery from later extent state diff --git a/sw-block/design/v2-first-slice-sender-ownership.md b/sw-block/design/v2-first-slice-sender-ownership.md new file mode 100644 index 000000000..577ee63ff --- /dev/null +++ b/sw-block/design/v2-first-slice-sender-ownership.md @@ -0,0 +1,159 @@ +# V2 First Slice: Per-Replica Sender/Session Ownership + +Date: 2026-03-27 +Status: implementation-ready +Depends-on: Q1 (recovery session), Q6 (orchestrator scope), Q7 (first slice) + +## Problem + +`SetReplicaAddrs()` replaces the entire `ShipperGroup` atomically. This causes: + +1. **State loss on topology change.** All shippers are destroyed and recreated. + Recovery state (`replicaFlushedLSN`, `lastContactTime`, catch-up progress) is lost. + After a changed-address restart, the new shipper starts from scratch. + +2. **No per-replica identity.** Shippers are identified by array index. The master + cannot target a specific replica for rebuild/catch-up — it must re-issue the + entire address set. + +3. 
**Background reconnect races.** A reconnect cycle may be in progress when + `SetReplicaAddrs` replaces the group. The in-progress reconnect's connection + objects become orphaned. + +## Design + +### Per-replica sender identity + +`ShipperGroup` changes from `[]*WALShipper` to `map[string]*WALShipper`, keyed by +the replica's canonical data address. Each shipper stores its own `ReplicaID`. + +```go +type WALShipper struct { + ReplicaID string // canonical data address — identity across reconnects + // ... existing fields +} + +type ShipperGroup struct { + mu sync.RWMutex + shippers map[string]*WALShipper // keyed by ReplicaID +} +``` + +### ReconcileReplicas replaces SetReplicaAddrs + +Instead of replacing the entire group, `ReconcileReplicas` diffs old vs new: + +``` +ReconcileReplicas(newAddrs []ReplicaAddr): + for each existing shipper: + if NOT in newAddrs → Stop and remove + for each newAddr: + if matching shipper exists → keep (preserve state) + if no match → create new shipper +``` + +This preserves `replicaFlushedLSN`, `lastContactTime`, catch-up progress, and +background reconnect goroutines for replicas that stay in the set. + +`SetReplicaAddrs` becomes a wrapper: +```go +func (v *BlockVol) SetReplicaAddrs(addrs []ReplicaAddr) { + if v.shipperGroup == nil { + v.shipperGroup = NewShipperGroup(nil) + } + v.shipperGroup.ReconcileReplicas(addrs, v.makeShipperFactory()) +} +``` + +### Changed-address restart flow + +1. Replica restarts on new port. Heartbeat reports new address. +2. Master detects endpoint change (address differs, same volume). +3. Master sends assignment update to primary with new replica address. +4. Primary's `ReconcileReplicas` receives `[oldAddr1, newAddr2]`. +5. Old shipper for the changed replica is stopped (old address gone from set). +6. New shipper created with new address — but this is a fresh shipper. +7. New shipper bootstraps: Disconnected → Connecting → CatchingUp → InSync. 
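In Go, the diff-based reconcile sketched above could take this shape (illustrative; the stub `WALShipper` and `ShipperGroup` fields stand in for the real ones in `wal_shipper.go` and `shipper_group.go`):

```go
package main

import "sync"

// Minimal stubs for illustration; the real fields live in wal_shipper.go.
type WALShipper struct {
	ReplicaID string
	stopped   bool
}

func (s *WALShipper) Stop() { s.stopped = true }

type ReplicaAddr struct{ DataAddr string }

type ShipperFactory func(ReplicaAddr) *WALShipper

type ShipperGroup struct {
	mu       sync.RWMutex
	shippers map[string]*WALShipper // keyed by ReplicaID
}

// ReconcileReplicas diffs old vs new instead of replacing the group:
// removed replicas are stopped, retained replicas keep their shipper
// (and thus replicaFlushedLSN, lastContactTime, catch-up progress),
// and new replicas get a fresh shipper from the factory.
func (g *ShipperGroup) ReconcileReplicas(newAddrs []ReplicaAddr, factory ShipperFactory) {
	g.mu.Lock()
	defer g.mu.Unlock()

	want := make(map[string]bool, len(newAddrs))
	for _, a := range newAddrs {
		want[a.DataAddr] = true
	}
	// Stop and remove shippers whose replica left the set.
	for id, s := range g.shippers {
		if !want[id] {
			s.Stop()
			delete(g.shippers, id)
		}
	}
	// Keep matching shippers untouched; create shippers for new replicas.
	for _, a := range newAddrs {
		if _, ok := g.shippers[a.DataAddr]; !ok {
			g.shippers[a.DataAddr] = factory(a)
		}
	}
}
```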
+ +The improvement over V1.5: the **other** replicas in the set are NOT disturbed. +Only the changed replica gets a fresh shipper. Recovery state for stable replicas +is preserved. + +### Recovery session + +Each WALShipper already contains the recovery state machine: +- `state` (Disconnected → Connecting → CatchingUp → InSync → Degraded → NeedsRebuild) +- `replicaFlushedLSN` (authoritative progress) +- `lastContactTime` (retention budget) +- `catchupFailures` (escalation counter) +- Background reconnect goroutine + +No separate `RecoverySession` object is needed. The WALShipper IS the per-replica +recovery session. The state machine already tracks the session lifecycle. + +What changes: the session is no longer destroyed on topology change (unless the +replica itself is removed from the set). + +### Coordinator vs primary responsibilities + +| Responsibility | Owner | +|---------------|-------| +| Endpoint truth (canonical address) | Coordinator (master) | +| Assignment updates (add/remove replicas) | Coordinator | +| Epoch authority | Coordinator | +| Session creation trigger | Coordinator (via assignment) | +| Session execution (reconnect, catch-up, barrier) | Primary (via WALShipper) | +| Timeout enforcement | Primary | +| Ordered receive/apply | Replica | +| Barrier ack | Replica | +| Heartbeat reporting | Replica | + +### Migration from current code + +| Current | V2 | +|---------|-----| +| `ShipperGroup.shippers []*WALShipper` | `ShipperGroup.shippers map[string]*WALShipper` | +| `SetReplicaAddrs()` creates all new | `ReconcileReplicas()` diffs and preserves | +| `StopAll()` in demote | `StopAll()` unchanged (stops all) | +| `ShipAll(entry)` iterates slice | `ShipAll(entry)` iterates map values | +| `BarrierAll(lsn)` parallel slice | `BarrierAll(lsn)` parallel map values | +| `MinReplicaFlushedLSN()` iterates slice | Same, iterates map values | +| `ShipperStates()` iterates slice | Same, iterates map values | +| No per-shipper identity | 
`WALShipper.ReplicaID` = canonical data addr | + +### Files changed + +| File | Change | +|------|--------| +| `wal_shipper.go` | Add `ReplicaID` field, pass in constructor | +| `shipper_group.go` | `map[string]*WALShipper`, `ReconcileReplicas`, update iterators | +| `blockvol.go` | `SetReplicaAddrs` calls `ReconcileReplicas`, shipper factory | +| `promotion.go` | No change (StopAll unchanged) | +| `dist_group_commit.go` | No change (uses ShipperGroup API) | +| `block_heartbeat.go` | No change (uses ShipperStates) | + +### Acceptance bar + +The following existing tests must continue to pass: +- All CP13-1 through CP13-7 protocol tests (sync_all_protocol_test.go) +- All adversarial tests (sync_all_adversarial_test.go) +- All baseline tests (sync_all_bug_test.go) +- All rebuild tests (rebuild_v1_test.go) + +The following CP13-8 tests validate the V2 improvement: +- `TestCP13_SyncAll_ReplicaRestart_Rejoin` — changed-address recovery +- `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` — V2 reconnect protocol +- `TestAdversarial_CatchupMultipleDisconnects` — state preservation across reconnects + +New tests to add: +- `TestReconcileReplicas_PreservesExistingShipper` — stable replica keeps state +- `TestReconcileReplicas_RemovesStaleShipper` — removed replica stopped +- `TestReconcileReplicas_AddsNewShipper` — new replica bootstraps +- `TestReconcileReplicas_MixedUpdate` — one kept, one removed, one added + +## Non-goals for this slice + +- Smart WAL payload classes +- Recovery reservation protocol +- Full coordinator orchestration +- New transport layer diff --git a/sw-block/design/v2-first-slice-session-ownership.md b/sw-block/design/v2-first-slice-session-ownership.md new file mode 100644 index 000000000..e50f044c9 --- /dev/null +++ b/sw-block/design/v2-first-slice-session-ownership.md @@ -0,0 +1,193 @@ +# V2 First Slice: Per-Replica Sender and Recovery Session Ownership + +Date: 2026-03-27 + +## Purpose + +This document defines the first real V2 implementation 
slice. + +The slice is intentionally narrow: + +- per-replica sender ownership +- explicit recovery session ownership +- clear coordinator vs primary responsibility + +This is the first step toward a standalone V2 block engine under `sw-block/`. + +## Why This Slice First + +It directly addresses the clearest V1.5 structural limits: + +- sender identity loss when replica sets are refreshed +- changed-address restart recovery complexity +- repeated reconnect cycles without stable per-replica ownership +- adversarial Phase 13 boundary tests that V1.5 cannot cleanly satisfy + +It also avoids jumping too early into: + +- Smart WAL +- new backend storage layout +- full production transport redesign + +## Core Decision + +Use: + +- **one sender owner per replica** +- **at most one active recovery session per replica per epoch** + +Healthy replicas may only need their steady sender object. + +Degraded / reconnecting replicas gain an explicit recovery session owned by the primary. + +## Ownership Split + +### Coordinator + +Owns: + +- replica identity / endpoint truth +- assignment updates +- epoch authority +- session creation / destruction intent + +Does not own: + +- byte-by-byte catch-up execution +- local sender loop scheduling + +### Primary + +Owns: + +- per-replica sender objects +- per-replica recovery session execution +- reconnect / catch-up progress +- timeout enforcement for active session +- transition from: + - normal sender + - to recovery session + - back to normal sender + +### Replica + +Owns: + +- receive/apply path +- barrier ack +- heartbeat/reporting + +Replica remains passive from the recovery-orchestration point of view. 
+ +## Data Model + +## Sender Owner + +Per replica, maintain a stable sender owner with: + +- replica logical ID +- current endpoint +- current epoch view +- steady-state health/status +- optional active recovery session reference + +## Recovery Session + +Per replica, per epoch: + +- `ReplicaID` +- `Epoch` +- `EndpointVersion` or equivalent endpoint truth +- `State` + - `connecting` + - `catching_up` + - `in_sync` + - `needs_rebuild` +- `StartLSN` +- `TargetLSN` +- timeout / deadline metadata + +## Session Rules + +1. only one active session per replica per epoch +2. new assignment for same replica: +- supersedes old session only if epoch/session generation is newer +3. stale session must not continue after: +- epoch bump +- endpoint truth change +- explicit coordinator replacement + +## Minimal State Transitions + +### Healthy path + +1. replica sender exists +2. sender ships normally +3. replica remains `InSync` + +### Recovery path + +1. sender detects or is told replica is not healthy +2. coordinator provides valid assignment/endpoint truth +3. primary creates recovery session +4. session connects +5. session catches up if recoverable +6. on success: +- session closes +- steady sender resumes normal state + +### Rebuild path + +1. session determines catch-up is not sufficient +2. session transitions to `needs_rebuild` +3. higher layer rebuild flow takes over + +## What This Slice Does Not Include + +Not in the first slice: + +- Smart WAL payload classes in production +- snapshot pinning / GC logic +- new on-disk engine +- frontend publication changes +- full production event scheduler + +## Proposed V2 Workspace Target + +Do this under `sw-block/`, not `weed/storage/blockvol/`. 
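As a concrete illustration of the data model and session rules above, here is a minimal Go sketch. The struct and method names (`RecoverySession`, `SessionSlot`, `Attach`, `Invalidate`, `Generation`) are assumptions for exposition, not real prototype code; it only demonstrates the "at most one active session per replica per epoch" rule and epoch/endpoint fencing:

```go
package main

import "fmt"

// SessionState mirrors the states listed in the data model above.
type SessionState string

const (
	Connecting   SessionState = "connecting"
	CatchingUp   SessionState = "catching_up"
	InSync       SessionState = "in_sync"
	NeedsRebuild SessionState = "needs_rebuild"
)

// RecoverySession is the explicit, primary-owned recovery attempt for
// one replica. Field names are illustrative assumptions.
type RecoverySession struct {
	ReplicaID       string
	Epoch           uint64
	EndpointVersion uint64
	Generation      uint64 // session generation within an epoch
	State           SessionState
	StartLSN        uint64
	TargetLSN       uint64
}

// SessionSlot enforces "at most one active session per replica per epoch".
type SessionSlot struct {
	active *RecoverySession
}

// Attach installs a new session only if it supersedes the current one:
// a newer epoch always wins; within the same epoch, only a newer
// generation wins; anything else is rejected as stale or duplicate.
func (s *SessionSlot) Attach(next *RecoverySession) bool {
	cur := s.active
	if cur != nil {
		if next.Epoch < cur.Epoch {
			return false // stale lineage
		}
		if next.Epoch == cur.Epoch && next.Generation <= cur.Generation {
			return false // duplicate or older attempt in same epoch
		}
	}
	s.active = next
	return true
}

// Invalidate drops the active session when epoch or endpoint truth
// moves on; a stale session must not keep executing after either change.
func (s *SessionSlot) Invalidate(epoch, endpointVersion uint64) {
	if s.active != nil &&
		(s.active.Epoch < epoch || s.active.EndpointVersion < endpointVersion) {
		s.active = nil
	}
}

func main() {
	slot := &SessionSlot{}
	ok1 := slot.Attach(&RecoverySession{ReplicaID: "r1", Epoch: 3, Generation: 1, State: Connecting})
	ok2 := slot.Attach(&RecoverySession{ReplicaID: "r1", Epoch: 3, Generation: 1, State: Connecting})
	slot.Invalidate(4, 0) // epoch bump fences the old session
	fmt.Println(ok1, ok2, slot.active == nil)
	// prints: true false true
}
```

The design point is that supersession is decided by (epoch, generation) comparison alone, so a delayed or duplicated attach request from an old recovery attempt can never displace current authority.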
+ +Suggested area: + +- `sw-block/prototype/enginev2/` + +Suggested first files: + +- `sw-block/prototype/enginev2/session.go` +- `sw-block/prototype/enginev2/sender.go` +- `sw-block/prototype/enginev2/group.go` +- `sw-block/prototype/enginev2/session_test.go` + +The first code does not need full storage I/O. +It should prove ownership and transition shape first. + +## Acceptance For This Slice + +The slice is good enough when: + +1. sender identity is stable per replica +2. changed-address reassignment updates the right sender owner +3. multiple reconnect cycles do not lose recovery ownership +4. stale session does not survive epoch bump +5. the 4 Phase 13 V2-boundary tests have a clear path to become satisfiable + +## Relationship To Existing Simulator + +This slice should align with: + +- `v2-acceptance-criteria.md` +- `v2-open-questions.md` +- `v1-v15-v2-comparison.md` +- `distsim` / `eventsim` behavior + +The simulator remains the design oracle. +The first implementation slice should not contradict it. diff --git a/sw-block/design/v2-open-questions.md b/sw-block/design/v2-open-questions.md new file mode 100644 index 000000000..4ec67ff73 --- /dev/null +++ b/sw-block/design/v2-open-questions.md @@ -0,0 +1,161 @@ +# V2 Open Questions + +Date: 2026-03-27 + +## Purpose + +This document records what is still algorithmically open in V2. + +These are not bugs. + +They are design questions that should be closed deliberately before or during implementation slicing. + +## 1. Recovery Session Ownership + +Open question: + +- what is the exact ownership model for one active recovery session per replica? + +Need to decide: + +- session identity fields +- supersede vs reject vs join behavior +- how epoch/session invalidates old recovery work + +Why it matters: + +- V1.5 needed local reconnect serialization +- V2 should make this a protocol rule + +## 2. 
Promotion Threshold Strictness + +Open question: + +- must a promotion candidate always have `FlushedLSN >= CommittedLSN`, or is there any narrower safe exception? + +Current prototype: + +- uses committed-prefix sufficiency as the safety gate + +Why it matters: + +- determines how strict real failover behavior should be + +## 3. Recovery Reservation Shape + +Open question: + +- what exactly is reserved during catch-up? + +Need to decide: + +- WAL range only? +- payload pins? +- snapshot pin? +- expiry semantics? + +Why it matters: + +- recoverability must be explicit, not hopeful + +## 4. Smart WAL Payload Classes + +Open question: + +- which payload classes are allowed in V2 first? + +Current model has: + +- `WALInline` +- `ExtentReferenced` + +Need to decide: + +- whether first real implementation includes both +- whether `ExtentReferenced` requires pinned snapshot/versioned extent only + +## 5. Smart WAL Garbage Collection Boundary + +Open question: + +- when can a referenced payload stop being recoverable? + +Need to decide: + +- GC interaction +- timeout interaction +- recovery session pinning + +Why it matters: + +- this is the line between catch-up and rebuild + +## 6. Exact Orchestrator Scope + +Open question: + +- how much of the final V2 control logic belongs in: + - local node state + - coordinator + - transport/session manager + +Why it matters: + +- avoid V1-style scattered state ownership + +## 7. First Real Implementation Slice + +Open question: + +- what is the first production slice of V2? + +Candidates: + +1. per-replica sender/session ownership +2. explicit recovery-session management +3. catch-up/rebuild decision plumbing + +Recommended default: + +- per-replica sender/session ownership + +## 8. Steady-State Overhead Budget + +Open question: + +- what overhead is acceptable in the normal healthy case? 
+ +Need to decide: + +- metadata checks on hot path +- extra state bookkeeping +- what stays off the hot path + +Why it matters: + +- V2 should be structurally better without becoming needlessly heavy + +## 9. Smart WAL First-Phase Goal + +Open question: + +- is the first Smart WAL goal: + - lower recovery cost + - lower steady-state WAL volume + - or just proof of historical correctness model? + +Recommended answer: + +- first prove correctness model, then optimize + +## 10. End Condition For Simulator Work + +Open question: + +- when do we stop adding simulator depth and start implementation? + +Suggested answer: + +- once acceptance criteria are satisfied +- and the first implementation slice is clear +- and remaining simulator additions are no longer changing core protocol decisions diff --git a/sw-block/design/v2-prototype-roadmap-and-gates.md b/sw-block/design/v2-prototype-roadmap-and-gates.md new file mode 100644 index 000000000..073806566 --- /dev/null +++ b/sw-block/design/v2-prototype-roadmap-and-gates.md @@ -0,0 +1,239 @@ +# V2 Prototype Roadmap And Gates + +Date: 2026-03-27 +Status: active +Purpose: define the remaining prototype roadmap, the validation gates between stages, and the decision point between real V2 engine work and possible V2.5 redesign + +## Current Position + +V2 design/FSM/simulator work is sufficiently closed for serious prototyping, but not frozen against later `V2.5` adjustments. + +Current state: + +- design proof: high +- execution proof: medium +- data/recovery proof: low +- prototype end-to-end proof: low + +Rough prototype progress: + +- `25%` to `35%` + +This is early executable prototype, not engine-ready prototype. + +## Roadmap Goal + +Answer this question with prototype evidence: + +- can V2 become a real engine path? +- or should it become `V2.5` before real implementation begins? 
+ +## Step 1: Execution Authority Closure + +Purpose: + +- finish the sender / recovery-session authority model so stale work is unambiguously rejected + +Scope: + +1. ownership-only `AttachSession()` / `SupersedeSession()` +2. execution begins only through execution APIs +3. stale handshake / progress / completion fenced by `sessionID` +4. endpoint bump / epoch bump invalidate execution authority +5. sender-group preserve-or-kill behavior is explicit + +Done when: + +1. all execution APIs are sender-gated and reject stale `sessionID` +2. session creation is separated from execution start +3. phase ordering is enforced +4. endpoint bump / epoch bump invalidate execution authority correctly +5. mixed add/remove/update reconciliation preserves or kills state exactly as intended + +Main files: + +- `sw-block/prototype/enginev2/` +- `sw-block/prototype/distsim/` +- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` + +Key gate: + +- old recovery work cannot mutate current sender state at any execution stage + +## Step 2: Orchestrated Recovery Prototype + +Purpose: + +- move from good local sender APIs to an actual prototype recovery flow driven by assignment/update intent + +Scope: + +1. assignment/update intent creates or supersedes recovery attempts +2. reconnect / reassignment / catch-up / rebuild decision path +3. sender-group becomes orchestration entry point +4. explicit outcome branching: + - zero-gap fast completion + - positive-gap catch-up + - unrecoverable gap -> `NeedsRebuild` + +Done when: + +1. the prototype expresses a realistic recovery flow from topology/control intent +2. sender-group drives recovery creation, not only unit helpers +3. recovery outcomes are explicit and testable +4. 
orchestrator responsibility is clear enough to narrow `v2-open-questions.md` item 6 + +Key gate: + +- recovery control is no longer scattered across helper calls; it has one clear orchestration path + +## Step 3: Minimal Historical Data Prototype + +Purpose: + +- prove the recovery model against real data-history assumptions, not only control logic + +Scope: + +1. minimal WAL/history model, not full engine +2. enough to exercise: + - catch-up range + - retained prefix/window + - rebuild fallback + - historical correctness at target LSN +3. enough reservation/recoverability state to make recovery explicit + +Done when: + +1. the prototype can prove why a gap is recoverable or unrecoverable +2. catch-up and rebuild decisions are backed by minimal data/history state +3. `v2-open-questions.md` items 3, 4, 5 are closed or sharply narrowed +4. prototype evidence strengthens acceptance criteria `A5`, `A6`, and `A7` + +Key gate: + +- the prototype must explain why recovery is allowed, not just that policy says it is + +## Step 4: Prototype Scenario Closure + +Purpose: + +- make the prototype itself demonstrate the V2 story end-to-end + +Scope: + +1. map key V2 scenarios onto the prototype +2. express the 4 V2-boundary cases against prototype behavior +3. add one small end-to-end harness inside `sw-block/prototype/` +4. align prototype evidence with acceptance criteria + +Done when: + +1. prototype behavior can be reviewed scenario-by-scenario +2. key V1/V1.5 failures have prototype equivalents +3. prototype outcomes match intended V2 design claims +4. remaining gaps are clearly real-engine gaps, not protocol/prototype ambiguity + +Key gate: + +- a reviewer can trace: + - acceptance criteria -> scenario -> prototype behavior + without hand-waving + +## Gates + +### Gate 1: Design Closed Enough + +Status: + +- mostly passed + +Meaning: + +1. acceptance criteria exist +2. core simulator exists +3. 
ownership gap from V1.5 is understood + +### Gate 2: Execution Authority Closed + +Passes after Step 1. + +Meaning: + +- stale execution results cannot mutate current authority + +### Gate 3: Orchestrated Recovery Closed + +Passes after Step 2. + +Meaning: + +- recovery flow is controlled by one coherent orchestration model + +### Gate 4: Historical Data Model Closed + +Passes after Step 3. + +Meaning: + +- catch-up vs rebuild is backed by executable data-history logic + +### Gate 5: Prototype Convincing + +Passes after Step 4. + +Meaning: + +- enough evidence exists to choose: + - real V2 engine path + - or `V2.5` redesign + +## Decision Gate After Step 4 + +### Path A: Real V2 Engine Planning + +Choose this if: + +1. prototype control logic is coherent +2. recovery boundary is explicit +3. boundary cases are convincing +4. no major structural flaw remains + +Outputs: + +1. real engine slicing plan +2. migration/integration plan into future standalone `sw-block` +3. explicit non-goals for first production version + +### Path B: V2.5 Redesign + +Choose this if the prototype reveals: + +1. ownership/orchestration still too fragile +2. recovery boundary still too implicit +3. historical correctness model too costly or too unclear +4. too much complexity leaks into the hot path + +Output: + +- write `V2.5` as a design/prototype correction before engine work + +## What Not To Do Yet + +1. no Smart WAL expansion beyond what Step 3 minimally needs +2. no backend/storage-engine redesign +3. no V1 production integration +4. no frontend/wire protocol work +5. no performance optimization as a primary goal + +## Practical Summary + +Current sequence: + +1. finish execution authority +2. build orchestrated recovery +3. add minimal historical-data proof +4. close key scenarios against the prototype +5. 
decide: + - V2 engine + - or `V2.5` diff --git a/sw-block/design/v2-scenario-sources-from-v1.md b/sw-block/design/v2-scenario-sources-from-v1.md new file mode 100644 index 000000000..cf47cc95e --- /dev/null +++ b/sw-block/design/v2-scenario-sources-from-v1.md @@ -0,0 +1,249 @@ +# V2 Scenario Sources From V1 and V1.5 + +Date: 2026-03-27 + +## Purpose + +This document distills V1 / V1.5 real-test material into V2 scenario inputs. + +Sources: + +- `learn/projects/sw-block/phases/phase13_test.md` +- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` + +This is not the active scenario backlog. + +Use: + +- `v2_scenarios.md` for the active V2 scenario set +- this file for historical source and rationale + +## How To Use This File + +For each item below: + +1. keep the real V1/V1.5 test as implementation evidence +2. create or maintain a V2 simulator scenario for the protocol core +3. define the expected V2 behavior explicitly + +## Source Buckets + +### 1. Core protocol behavior + +These are the highest-value simulator inputs. + +- barrier durability truth +- reconnect + catch-up +- non-convergent catch-up -> rebuild +- rebuild fallback +- failover / promotion safety +- WAL retention / tail-chasing +- durability mode semantics + +Recommended V2 treatment: + +- `sim_core` + +### 2. Supporting invariants + +These matter, but usually as reduced simulator checks. + +- canonical address handling +- replica role/epoch gating +- committed-prefix rules +- rebuild publication cleanup +- assignment refresh behavior + +Recommended V2 treatment: + +- `sim_reduced` + +### 3. Real-only implementation behavior + +These should usually stay in real-engine tests. + +- actual wire encoding / decode bugs +- real disk / `fdatasync` timing +- NVMe / iSCSI frontend behavior +- Go concurrency artifacts tied to concrete implementation + +Recommended V2 treatment: + +- `real_only` + +### 4. V2 boundary items + +These are especially important. 
+ +They should remain visible as: + +- current V1/V1.5 limitation +- explicit V2 acceptance target + +Recommended V2 treatment: + +- `v2_boundary` + +## Distilled Scenario Inputs + +### A. Barrier truth uses durable replica progress + +Real source: + +- Phase 13 barrier / `replicaFlushedLSN` tests + +Why it matters: + +- commit must follow durable replica progress, not send progress + +V2 target: + +- barrier completion counted only from explicit durable progress state + +### B. Same-address transient outage + +Real source: + +- Phase 13 reconnect / catch-up tests +- `CP13-8` short outage recovery + +Why it matters: + +- proves cheap short-gap recovery path + +V2 target: + +- explicit recoverability check +- catch-up if recoverable +- rebuild otherwise + +### C. Changed-address restart + +Real source: + +- `CP13-8 T4b` +- changed-address refresh fixes + +Why it matters: + +- endpoint is not identity +- stale endpoint must not remain authoritative + +V2 target: + +- heartbeat/control-plane learns new endpoint +- reassignment updates sender target +- recovery session starts only after endpoint truth is updated + +### D. Non-convergent catch-up / tail-chasing + +Real source: + +- Phase 13 retention + catch-up + rebuild fallback line + +Why it matters: + +- “catch-up exists” is not enough +- must know when to stop and rebuild + +V2 target: + +- explicit `CatchingUp -> NeedsRebuild` +- no fake success + +### E. Slow control-plane recovery + +Real source: + +- `CP13-8 T4b` hardware behavior before fix + +Why it matters: + +- safety can be correct while availability recovery is poor + +V2 target: + +- explicit fast recovery path when possible +- explicit fallback when only control-plane repair can help + +### F. 
Stale message / delayed ack fencing + +Real source: + +- Phase 13 epoch/fencing tests +- V2 scenario work already mirrors this + +Why it matters: + +- old lineage must not mutate committed prefix + +V2 target: + +- stale message rejection is explicit and testable + +### G. Promotion candidate safety + +Real source: + +- failover / promotion gating tests +- V2 candidate-selection work + +Why it matters: + +- wrong promotion loses committed lineage + +V2 target: + +- candidate must satisfy: + - running + - epoch aligned + - state eligible + - committed-prefix sufficient + +### H. Rebuild boundary after failed catch-up + +Real source: + +- Phase 13 rebuild fallback behavior + +Why it matters: + +- rebuild is required when retained WAL cannot safely close the gap + +V2 target: + +- rebuild is explicit fallback, not ad hoc recovery + +## Immediate Feed Into `v2_scenarios.md` + +These are the most important V1/V1.5-derived V2 scenarios: + +1. same-address transient outage +2. changed-address restart +3. non-convergent catch-up / tail-chasing +4. stale delayed message / barrier ack rejection +5. committed-prefix-safe promotion +6. control-plane-latency recovery shape + +## What Should Not Be Copied Blindly + +Do not clone every real-engine test into the simulator. + +Do not use the simulator for: + +- exact OS timing +- exact socket/wire bugs +- exact block frontend behavior +- implementation-specific lock races + +Instead: + +- extract the protocol invariant +- model the reduced scenario if the protocol value is high + +## Bottom Line + +V1 / V1.5 tests should feed V2 in two ways: + +1. as historical evidence of what failed or mattered in real life +2. 
as scenario seeds for the V2 simulator and acceptance backlog diff --git a/sw-block/design/v2_scenarios.md b/sw-block/design/v2_scenarios.md new file mode 100644 index 000000000..30e510ff9 --- /dev/null +++ b/sw-block/design/v2_scenarios.md @@ -0,0 +1,638 @@ +# WAL V2 Scenarios + +Date: 2026-03-26 +Status: working scenario backlog +Purpose: define the scenario set that proves why WAL V2 exists, what it must do better than WAL V1, and what it should handle better than rebuild-heavy systems + +Execution note: +- active implementation planning for these scenarios lives under `../.private/phase/` +- `design/` is the design/source-of-truth view +- `.private/phase/` is the execution/checklist view for `sw` + +## Why This File Exists + +V2 should not grow by adding random simulations. + +Each new scenario should prove one of these claims: + +1. committed data is never lost +2. uncommitted data is never falsely revived +3. epoch and promotion lineage are safe +4. short-gap recovery is cheaper and cleaner than rebuild +5. catch-up vs rebuild boundary is explicit and correct +6. historical data correctness is preserved + +## Scenario Sources + +The backlog draws scenarios from three sources: + +1. **V1 / V1.5 real failures** +- real bugs and real-hardware gaps observed during Phase 12 / Phase 13 +- these are the highest-value scenarios because they came from actual system behavior + +2. **V2 design obligations** +- scenarios required by the intended V2 protocol shape +- examples: + - reservations + - lineage-first promotion + - explicit catch-up vs rebuild boundary + +3. 
**Distributed-systems adversarial cases** +- scenarios not yet seen in production, but known to be dangerous +- examples: + - zombie primary + - partitions + - message reordering + - multi-promotion lineage chains + +This file is the shared backlog for anyone extending: + +- `sw-block/prototype/fsmv2/` +- `sw-block/prototype/volumefsm/` +- `sw-block/prototype/distsim/` + +For active development sequencing, see: +- `sw-block/.private/phase/phase-01.md` +- `sw-block/.private/phase/phase-02.md` +- `sw-block/design/v2-scenario-sources-from-v1.md` + +Current simulator note: +- current `distsim` coverage already includes: + - changed-address restart comparison across `V1` / `V1.5` / `V2` + - same-address transient outage comparison + - slow control-plane recovery comparison + - stale-endpoint rejection + - committed-prefix-aware promotion eligibility + +## V2 Goals + +Compared with WAL V1, V2 should improve: + +1. state clarity +2. recovery boundary clarity +3. fencing and promotion correctness +4. testability of distributed behavior +5. proof of data correctness at a target `LSN` + +Compared with rebuild-heavy systems, V2 should improve: + +1. short-gap recovery cost +2. explicit progress semantics +3. catch-up vs rebuild decision quality + +## Scenario Format + +Each scenario should eventually define: + +1. setup +2. event sequence +3. expected commit/ack behavior +4. expected promotion/fencing behavior +5. expected final data state at target `LSN` + +Where possible, use synthetic 4K writes with: + +- `value = LSN` + +That makes correctness assertions trivial. + +## Priority 1: Commit Safety + +These scenarios prove the most important distributed claim: + +- if the system ACKed a write under the configured policy, that write is not lost + +### S1. ACK Then Primary Crash + +Goal: +- prove a quorum-acknowledged write survives failover + +Sequence: +1. primary commits a write +2. replicas durable-ACK enough nodes for policy +3. primary crashes immediately +4. 
coordinator promotes a valid replica + +Expect: +- promoted node contains the committed `LSN` +- final state matches reference model at committed `LSN` + +### S2. Non-Quorum Write Then Primary Crash + +Goal: +- prove uncommitted data is not revived after failover + +Sequence: +1. primary accepts a write locally +2. quorum durability is not reached +3. primary crashes +4. coordinator promotes another node + +Expect: +- promoted node does not expose the uncommitted write +- committed `LSN` stays at previous value + +### S3. Zombie Old Primary Is Fenced + +Goal: +- prove old-epoch traffic cannot corrupt new lineage + +Sequence: +1. primary loses lease +2. coordinator bumps epoch and promotes new primary +3. old primary continues trying to send writes / barriers + +Expect: +- all old-epoch traffic is rejected +- no stale write becomes committed under the new epoch + +## Priority 2: Short-Gap Recovery + +These scenarios justify V2 over rebuild-heavy designs. + +### S4. Brief Disconnect, WAL Catch-Up Only + +Goal: +- prove a short outage recovers via WAL catch-up, not rebuild + +Sequence: +1. replica disconnects briefly +2. primary continues writing +3. gap stays inside recoverable window +4. replica reconnects and catches up + +Expect: +- `CatchingUp -> PromotionHold -> InSync` +- no rebuild required +- final state matches reference at target `LSN` + +### S5. Flapping Replica Stays Recoverable + +Goal: +- prove transient disconnects do not force unnecessary rebuild + +Sequence: +1. replica disconnects and reconnects repeatedly +2. gaps stay within reserved recoverable windows + +Expect: +- replica may move between `Lagging`, `CatchingUp`, and `PromotionHold` +- replica does not enter `NeedsRebuild` unless recoverability is actually lost + +### S6. Tail-Chasing Under Load + +Goal: +- prove behavior when primary writes faster than catch-up rate + +Sequence: +1. replica reconnects behind +2. primary continues writing quickly +3. 
catch-up target may be reached or may fall behind again + +Expect: +- explicit result: + - converge and promote + - or abort to `NeedsRebuild` +- never silently pretend the replica is current + +## Priority 3: Catch-Up vs Rebuild Boundary + +These scenarios justify the V2 recoverability model. + +### S7. Recovery Initially Possible, Then Reservation Expires + +Goal: +- prove `check -> reserve -> recover` is enforced + +Sequence: +1. primary grants a recoverability reservation +2. catch-up starts +3. reservation expires or is revoked before completion + +Expect: +- catch-up aborts +- replica transitions to `NeedsRebuild` +- no partial recovery is treated as success + +### S8. Current Extent Cannot Recover Old LSN + +Goal: +- prove the historical correctness trap + +Sequence: +1. write block `B = 10` at `LSN 10` +2. later write block `B = 12` at `LSN 12` +3. attempt to recover state at `LSN 10` from current extent + +Expect: +- mismatch detected +- scenario must fail correctness check + +### S9. Snapshot + Tail Rebuild Works + +Goal: +- prove correct long-gap reconstruction + +Sequence: +1. take snapshot at `cpLSN` +2. later writes extend head +3. lagging replica rebuilds from snapshot +4. replay trailing WAL tail + +Expect: +- final state matches reference at target `LSN` + +## Priority 4: Quorum and Mixed Replica States + +These scenarios justify V2 mode clarity. + +### S10. Mixed States Under `sync_quorum` + +Goal: +- prove `sync_quorum` remains available with mixed replica states + +Sequence: +1. one replica `InSync` +2. one replica `CatchingUp` +3. one replica `Rebuilding` + +Expect: +- writes may continue if durable quorum exists +- ACK gating follows quorum rules exactly + +### S11. Mixed States Under `sync_all` + +Goal: +- prove `sync_all` remains strict + +Sequence: +1. same mixed-state setup as above + +Expect: +- writes/acks block or fail according to `sync_all` +- no silent downgrade to quorum or best effort + +### S12. 
Promotion Chooses Best Valid Lineage + +Goal: +- prove promotion is correctness-first, not “highest apparent LSN wins” + +Sequence: +1. candidate nodes have different: + - flushed LSN + - rebuild state + - epoch lineage +2. coordinator chooses a new primary + +Expect: +- only a valid-lineage node is promotable +- stale or inconsistent node is rejected + +## Priority 5: Smart WAL / Recovery Classes + +These scenarios justify V2’s future adaptive write path. + +### S13. `WALInline` Window Is Recoverable + +Goal: +- prove inline WAL payload replay works directly + +Sequence: +1. missing range consists of `WALInline` records +2. planner grants reservation + +Expect: +- catch-up allowed +- final state correct + +### S14. `ExtentReferenced` Payload Still Resolvable + +Goal: +- prove direct-extent records can still support catch-up when pinned + +Sequence: +1. missing range includes `ExtentReferenced` records +2. payload objects / generations are still resolvable +3. reservation pins those dependencies + +Expect: +- catch-up allowed +- final state correct + +### S15. `ExtentReferenced` Payload Lost + +Goal: +- prove metadata alone is not enough + +Sequence: +1. missing range includes `ExtentReferenced` records +2. metadata still exists +3. payload object / version is no longer resolvable + +Expect: +- planner returns `NeedsRebuild` +- catch-up is forbidden + +## Priority 6: Restart and Rebuild Robustness + +These scenarios justify operational resilience. + +### S16. Replica Restarts During Catch-Up + +Goal: +- prove restart does not corrupt catch-up state + +Sequence: +1. replica is catching up +2. replica restarts +3. reconnect and recover again + +Expect: +- no false promotion +- resume or restart recovery cleanly + +### S17. Replica Restarts During Rebuild + +Goal: +- prove rebuild interruption is safe + +Sequence: +1. replica is rebuilding from snapshot +2. 
replica restarts mid-copy + +Expect: +- rebuild aborts or restarts safely +- no partial base image is treated as valid + +### S18. Primary Restarts Without Failover + +Goal: +- prove restart with same lineage is handled explicitly + +Sequence: +1. primary stops and restarts +2. coordinator either preserves or changes epoch depending on policy + +Expect: +- replicas react consistently +- no stale assumptions about previous sender sessions + +### S19. Chain Of Custody Across Multiple Promotions + +Goal: +- prove committed data survives more than one failover lineage step + +Sequence: +1. primary `A` commits writes +2. fail over to `B` +3. `B` commits additional writes +4. fail over to `C` + +Expect: +- `C` contains all writes committed by `A` and `B` +- no committed data disappears across multiple promotions +- final state matches reference model at committed `LSN` + +### S20. Network Partition With Concurrent Write Attempts + +Goal: +- prove epoch fencing prevents split-brain writes during partition + +Sequence: +1. cluster partitions into two live sides +2. old primary side continues trying to write +3. coordinator promotes a new primary on the surviving side +4. both sides attempt to send control/data traffic + +Expect: +- only the current-epoch side can advance committed state +- stale-side writes are rejected or ignored +- no conflicting committed lineage appears + +## Suggested Implementation Order + +Implement in this order: + +1. `S1` ACK then primary crash +2. `S2` non-quorum write then primary crash +3. `S3` zombie old primary fenced +4. `S4` brief disconnect with WAL catch-up +5. `S7` reservation expiry aborts catch-up +6. `S10` mixed-state quorum policy +7. `S9` long-lag rebuild from snapshot + tail +8. 
`S13-S15` Smart WAL recoverability + +## Coverage Matrix + +Status values: +- `covered` +- `partial` +- `not_started` +- `needs_richer_model` + +| Scenario | Package | Test / Artifact | Status | Notes | +|---|---|---|---|---| +| `S1` ACK then primary crash | `distsim` | `TestQuorumCommitSurvivesPrimaryFailover` | `covered` | quorum commit survives failover | +| `S2` non-quorum write then primary crash | `distsim` | `TestUncommittedWriteNotPreservedAfterPrimaryLoss` | `covered` | no false revival | +| `S3` zombie old primary fenced | `distsim` | `TestZombieOldPrimaryWritesAreFenced` | `covered` | stale epoch traffic ignored | +| `S4` brief disconnect, WAL catch-up only | `distsim` | `TestReplicaCatchupFromPrimaryWAL` | `covered` | short-gap recovery | +| `S5` flapping replica stays recoverable | `distsim` | `TestS5_FlappingReplica_NoUnnecessaryRebuild`, `TestS5_FlappingWithStateTracking`, `TestS5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `covered` | both recoverable flapping and explicit budget-exceeded escalation are now asserted | +| `S6` tail-chasing under load | `distsim` | `TestS6_TailChasing_ConvergesOrAborts`, `TestS6_TailChasing_NonConvergent_Aborts`, `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild`, `TestP02_S6_NonConvergent_ExplicitStateTransition` | `covered` | explicit non-convergent `CatchingUp -> NeedsRebuild` path now asserted | +| `S7` reservation expiry aborts catch-up | `fsmv2`, `volumefsm`, `distsim` | `TestFSMReservationLostNeedsRebuild`, `TestModelReservationLostDuringCatchupAfterRebuild`, `TestReservationExpiryAbortsCatchup` | `covered` | present at 3 layers | +| `S8` current extent cannot recover old LSN | `distsim` | `TestCurrentExtentCannotRecoverOldLSN` | `covered` | historical correctness trap | +| `S9` snapshot + tail rebuild works | `distsim` | `TestReplicaRebuildFromSnapshotAndTail`, `TestSnapshotPlusTrailingReplayReachesTargetLSN` | `covered` | long-gap reconstruction | +| `S10` mixed states under `sync_quorum` | 
`volumefsm`, `distsim` | `TestModelSyncQuorumWithThreeReplicasMixedStates`, `TestSyncQuorumWithMixedReplicaStates` | `covered` | quorum stays available | +| `S11` mixed states under `sync_all` | `distsim` | `TestSyncAllBlocksWithMixedReplicaStates` | `covered` | strict sync_all behavior | +| `S12` promotion chooses best valid lineage | `distsim` | `TestPromotionUsesValidLineageNode`, `TestS12_PromotionChoosesBestLineage_NotHighestLSN`, `TestS12_PromotionRejectsRebuildingCandidate` | `covered` | lineage-first promotion now exercised beyond simple LSN comparison | +| `S13` `WALInline` window recoverable | `distsim` | `TestWALInlineRecordsAreRecoverable` | `covered` | inline payload recoverability | +| `S14` `ExtentReferenced` payload resolvable | `distsim` | `TestExtentReferencedResolvableRecordsAreRecoverable`, `TestMixedClassRecovery_FullSuccess` | `covered` | recoverable direct-extent and mixed-class recovery case | +| `S15` `ExtentReferenced` payload lost | `distsim` | `TestExtentReferencedUnresolvableForcesRebuild`, `TestRecoverableThenUnrecoverable`, `TestTimeVaryingAvailability` | `covered` | metadata alone not enough; active recovery can transition from recoverable to unrecoverable | +| `S16` replica restarts during catch-up | `distsim` | `TestReplicaRestartDuringCatchupRestartsSafely` | `covered` | safe recovery restart | +| `S17` replica restarts during rebuild | `distsim` | `TestReplicaRestartDuringRebuildRestartsSafely` | `covered` | rebuild interruption safe | +| `S18` primary restarts without failover | `distsim` | `TestS18_PrimaryRestart_SameLineage`, `TestS18_PrimaryRestart_ReplicasRejectOldEpoch`, `TestS18_PrimaryRestart_DelayedOldAck_DoesNotAdvancePrefix`, `TestS18_PrimaryRestart_InFlightBarrierDropped`, `TestP02_S18_DelayedAck_ExplicitRejection` | `covered` | delayed stale ack rejection and committed-prefix stability are now asserted directly | +| `S19` chain of custody across promotions | `distsim` | `TestS19_ChainOfCustody_MultiplePromotions`, 
`TestS19_ChainOfCustody_ThreePromotions` | `covered` | multi-promotion lineage continuity covered | +| `S20` live partition with competing writes | `distsim` | `TestS20_LivePartition_StaleWritesNotCommitted`, `TestS20_LivePartition_HealRecovers`, `TestS20_StalePartition_ProtocolRejectsStaleWrites`, `TestP02_S20_StaleTraffic_CommittedPrefixUnchanged` | `covered` | stale-side protocol traffic is explicitly rejected and committed prefix remains unchanged | + +## Ownership Notes + +When adding a scenario: + +1. add or extend the relevant prototype test: + - `fsmv2` + - `volumefsm` + - `distsim` +2. update this file with: + - status + - package location +3. keep correctness checks tied to: + - committed `LSN` + - reference model state + +## Current Coverage Snapshot + +Already covered in some form: + +- quorum commit survives primary failover +- uncommitted write not preserved after primary loss +- zombie old primary fenced by epoch +- lagging replica catch-up from primary WAL +- reservation expiry aborts catch-up in distributed sim +- `sync_quorum` continues with one lagging replica +- `sync_all` blocks with one lagging replica +- `sync_quorum` with mixed replica states +- `sync_all` with mixed replica states +- rebuild from snapshot + tail +- promotion uses valid lineage node +- flapping recoverable vs budget-exceeded rebuild path +- tail-chasing explicit escalation to rebuild +- restart during catch-up recovers safely +- restart during rebuild recovers safely +- primary restart delayed stale ack rejection +- `WALInline` recoverability +- `ExtentReferenced` resolvable vs unresolvable boundary +- mixed-class Smart WAL recovery and time-varying payload availability +- delayed stale messages and selective drop behavior +- multi-node reservation expiry and rebuild-timeout behavior +- current extent cannot reconstruct old `LSN` + +Still important to add: + +- explicit coordinator-driven candidate selection among competing valid/invalid lineages +- control-plane latency 
scenarios derived from `CP13-8 T4b` +- explicit V1 / V1.5 / V2 comparison scenarios for: + - changed-address restart + - same-address transient outage + - slow reassignment recovery + +## V1.5 Lessons To Add Or Strengthen + +These come directly from WAL V1.5 / Phase 13 behavior and should be treated as high-priority scenario drivers. + +### L1. Replica Restart With New Receiver Port + +Observed: +- replica VS restarts +- receiver comes back on a new random port +- primary background reconnect retries old address and fails + +Implication: +- direct reconnect only works if replica address is stable + +Backlog impact: +- strengthen `S18` +- add a restart/address-change sub-scenario under `S20` or a future network/control-plane recovery scenario + +### L2. Slow Control-Plane Reassignment Dominates Recovery + +Observed: +- sync correctness preserved +- write availability recovery waits for heartbeat/reassignment cycle + +Implication: +- "recoverable in theory" is not enough +- recovery latency is part of protocol quality + +Backlog impact: +- `S5` is now covered at current simulator level +- strengthen `S18` +- add long-running restart/rejoin timing scenarios + +### L3. Background Reconnect Helps Only Same-Address Recovery + +Observed: +- background reconnect is useful for transient network failure +- not sufficient for process restart with address change + +Implication: +- scenarios must distinguish: + - transient disconnect + - process restart + - address change + +Backlog impact: +- keep `S4` as transient disconnect +- strengthen `S18` with restart/address-stability cases + +### L4. Tail-Chasing And Retention Pressure Are Structural Risks + +Observed: +- Phase 13 reasoning repeatedly exposed: + - lagging replica may pin WAL + - catch-up may not converge while primary keeps advancing + +Implication: +- V2 must explicitly model convergence, abort, and rebuild boundaries + +Backlog impact: +- strengthen `S6` +- add multi-node retention / timeout variants + +### L5. 
Current Extent Is Not Historical State + +Observed: +- using current extent to reconstruct old `LSN` can return later values + +Implication: +- V2 must require version-correct base images or resolvable historical payloads + +Backlog impact: +- already covered by `S8` +- should remain a permanent regression scenario + +## Randomized Simulation + +In addition to fixed scenarios, V2 should keep a randomized simulator suite. + +Purpose: + +1. discover paths that were not explicitly written as named scenarios +2. stress promotion, restart, and recovery ordering +3. check invariants after each random step + +Current prototype: + +- `sw-block/prototype/distsim/random.go` +- `sw-block/prototype/distsim/random_test.go` + +Current invariants checked: + +1. current committed `LSN` remains a committed prefix +2. promotable nodes match reference state at committed `LSN` +3. current primary, if valid/running, matches reference state at committed `LSN` + +This does not replace named scenarios. +It complements them. + +## Scenario Summary + +When reviewing or adding scenarios, always record the source: + +1. from real V1/V1.5 behavior +2. from explicit V2 design obligation +3. from adversarial distributed-systems reasoning + +The best scenarios are the ones that come from real failures first, then are generalized into V2 requirements. + +## Development Phases + +Execution detail is tracked in: +- `sw-block/.private/phase/phase-01.md` +- `sw-block/.private/phase/phase-02.md` + +High-level phase order: + +1. close explicit scenario backlog + - `S19` + - `S20` +2. strengthen missing lifecycle scenarios + - `S5` + - `S6` + - `S18` + - stronger `S12` +3. extend protocol-state simulation and version comparison + - `V1` + - `V1.5` + - `V2` + - stronger closure of current `partial` scenarios +4. strengthen random/adversarial simulation +5. 
add timeout-based scenarios only when the execution path is modeled diff --git a/sw-block/design/wal-replication-v2-orchestrator.md b/sw-block/design/wal-replication-v2-orchestrator.md new file mode 100644 index 000000000..9f53f1c08 --- /dev/null +++ b/sw-block/design/wal-replication-v2-orchestrator.md @@ -0,0 +1,359 @@ +# WAL Replication V2 Orchestrator + +Date: 2026-03-26 +Status: design proposal +Purpose: define the volume-level orchestration model that sits above the per-replica WAL V2 FSM + +## Why This Document Exists + +`ReplicaFSM` alone is not enough. + +It can describe one replica relative to the current primary, but it cannot by itself model: + +- primary head continuing to advance +- multiple replicas in different states +- durability mode semantics +- primary lease loss and epoch change +- primary failover and replica promotion +- fencing of old recovery sessions + +So WAL V2 needs a second layer: +- per-replica `ReplicaFSM` +- volume-level `Orchestrator` + +## Scope + +This document defines the volume-level logic only. + +It does not define: +- exact network protocol +- exact master RPCs +- exact storage backend internals + +It assumes the per-replica state machine from: +- `wal-replication-v2-state-machine.md` + +## Core Model + +The orchestrator owns: + +1. current primary lineage +- `epoch` +- lease/authority state + +2. volume durability mode +- `best_effort` +- `sync_all` +- `sync_quorum` + +3. moving primary progress +- `headLSN` +- checkpoint/snapshot anchors + +4. replica set +- one `ReplicaFSM` per replica +- per-replica role in the current volume topology + +5. volume-level admission decision +- can writes proceed? +- can sync requests complete? +- must promotion/failover occur? 
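The volume-level admission decision (item 5) is ultimately a function of the durability mode and the per-replica states. A minimal Go sketch of that check — type and function names here are illustrative, not the real model shape:

```go
package main

import "fmt"

// ReplicaState mirrors the per-replica FSM; only InSync counts
// toward sync durability. Names are illustrative.
type ReplicaState string

const (
	InSync     ReplicaState = "InSync"
	Lagging    ReplicaState = "Lagging"
	CatchingUp ReplicaState = "CatchingUp"
)

type VolumeMode string

const (
	BestEffort VolumeMode = "best_effort"
	SyncAll    VolumeMode = "sync_all"
	SyncQuorum VolumeMode = "sync_quorum"
)

// canAckSync is a hypothetical admission check: given the states of the
// required replicas, decide whether a sync write can ACK under the mode.
// quorum is only consulted for sync_quorum.
func canAckSync(mode VolumeMode, required []ReplicaState, quorum int) bool {
	eligible := 0
	for _, s := range required {
		if s == InSync {
			eligible++
		}
	}
	switch mode {
	case BestEffort:
		return true // primary-local durability is enough
	case SyncAll:
		return eligible == len(required)
	case SyncQuorum:
		return eligible >= quorum
	default:
		return false
	}
}

func main() {
	states := []ReplicaState{InSync, CatchingUp, InSync}
	fmt.Println(canAckSync(SyncAll, states, 0))    // false: one replica not InSync
	fmt.Println(canAckSync(SyncQuorum, states, 2)) // true: quorum of 2 met
	fmt.Println(canAckSync(BestEffort, states, 0)) // true
}
```

The point of the sketch is that the decision counts sync-eligible replicas, not healthy sockets.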
+ +## Two FSM Layers + +### Layer A: `ReplicaFSM` + +Owns per-replica state such as: +- `Bootstrapping` +- `InSync` +- `Lagging` +- `CatchingUp` +- `PromotionHold` +- `NeedsRebuild` +- `Rebuilding` +- `CatchUpAfterRebuild` +- `Failed` + +### Layer B: `VolumeOrchestrator` + +Owns system-wide state such as: +- current `epoch` +- current primary identity +- durability mode +- set of required replicas +- current `headLSN` +- whether writes or promotions are allowed + +The orchestrator does not replace `ReplicaFSM`. +It drives it. + +## Volume State + +The orchestrator should track at least: + +```go +type VolumeMode string + +type PrimaryState string + +const ( + PrimaryServing PrimaryState = "Serving" + PrimaryDraining PrimaryState = "Draining" + PrimaryLost PrimaryState = "Lost" +) + +type VolumeModel struct { + Epoch uint64 + PrimaryID string + PrimaryState PrimaryState + Mode VolumeMode + + HeadLSN uint64 + CheckpointLSN uint64 + + RequiredReplicaIDs []string + Replicas map[string]*ReplicaFSM +} +``` + +This is a model shape, not a required production struct. + +## Orchestrator Responsibilities + +### 1. Advance primary head + +When primary commits a new write: +- increment `headLSN` +- enqueue/send to replica sender loops +- evaluate whether the current mode still allows ACK + +### 2. Evaluate sync eligibility + +The orchestrator computes volume-level durability from replica states. + +Derived rule: +- only `ReplicaFSM.IsSyncEligible()` counts + +### 3. Drive recovery entry + +When a replica disconnects or falls behind: +- feed disconnect/lag events into that replica FSM +- decide whether to try catch-up or rebuild +- acquire recovery reservation if required + +### 4. Handle primary authority changes + +When lease is lost or a new primary is chosen: +- increment epoch +- abort stale recovery sessions +- reevaluate all replica relationships from the new primary's perspective + +### 5. 
Drive promotion / failover + +When current primary is lost: +- choose promotion candidate +- assign new epoch +- move old primary to stale/lost +- convert the promoted replica into the new serving primary +- reclassify remaining replicas relative to the new primary + +## Required Volume-Level Events + +The orchestrator should be able to simulate at least these events. + +### Write/progress events +- `WriteCommitted(lsn)` +- `CheckpointAdvanced(lsn)` +- `BarrierCompleted(replicaID, flushedLSN)` + +### Replica health events +- `ReplicaDisconnected(replicaID)` +- `ReplicaReconnect(replicaID, flushedLSN)` +- `ReplicaReservationLost(replicaID)` +- `ReplicaCatchupTimeout(replicaID)` +- `ReplicaRebuildTooSlow(replicaID)` + +### Topology/control events +- `PrimaryLeaseLost()` +- `EpochChanged(newEpoch)` +- `PromoteReplica(replicaID)` +- `ReplicaAssigned(replicaID)` +- `ReplicaRemoved(replicaID)` + +## Mode Semantics + +### `best_effort` + +Rules: +- ACK after primary local durability +- replicas may be `Lagging`, `CatchingUp`, `NeedsRebuild`, or `Rebuilding` +- background recovery continues + +Volume implication: +- primary can keep serving while replicas recover + +### `sync_all` + +Rules: +- ACK only when all required replicas are `InSync` and durable through target LSN +- bounded retry only +- no silent downgrade + +Volume implication: +- one lagging required replica can block sync completion +- orchestrator may fail requests, not silently reinterpret policy + +### `sync_quorum` + +Rules: +- ACK when quorum of required nodes are durable through target LSN +- lagging replicas may recover in background as long as quorum remains + +Volume implication: +- orchestrator must count eligible replicas, not just healthy sockets + +## Primary-Head Simulation Rules + +The orchestrator must explicitly model that the primary keeps moving. + +### Rule 1: head moves independently of replica recovery + +A replica entering `CatchingUp` does not freeze `headLSN`. 
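Rule 1 can be sketched directly: when a catch-up attempt is admitted, its target is frozen at the head's value at admission time (`H0`), while `headLSN` keeps advancing. The types below are hypothetical, a sketch rather than the real orchestrator:

```go
package main

import "fmt"

// primary models only the moving head for this sketch.
type primary struct{ headLSN uint64 }

func (p *primary) commit() { p.headLSN++ }

// catchup captures an admitted recovery attempt with an explicit,
// fixed target (H0), chosen at admission time.
type catchup struct {
	targetLSN uint64 // H0, frozen at admission
	replayed  uint64 // replica's replay progress
}

func admitCatchup(p *primary, replicaFlushedLSN uint64) *catchup {
	return &catchup{targetLSN: p.headLSN, replayed: replicaFlushedLSN}
}

func main() {
	p := &primary{headLSN: 100}
	c := admitCatchup(p, 80) // H0 = 100

	// The primary keeps committing while the replica replays toward H0.
	for i := 0; i < 20; i++ {
		p.commit()
		c.replayed++
	}

	fmt.Println(c.replayed >= c.targetLSN) // true: replica reached H0
	fmt.Println(p.headLSN > c.targetLSN)   // true: head moved past H0
	// Reaching H0 does not restore InSync: barrier confirmation and
	// PromotionHold still apply, or another bounded cycle begins.
}
```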
+ +### Rule 2: each recovery attempt uses explicit targets + +For a replica in recovery, orchestrator chooses: +- `catchupTargetLSN = H0` +- or `snapshotCpLSN = C` and replay target `H0` + +### Rule 3: promotion is explicit + +A replica is not restored to `InSync` just because it reaches `H0`. + +It must still pass: +- barrier confirmation +- `PromotionHold` + +## Failover / Promotion Model + +The orchestrator must be able to simulate: + +1. old primary loses lease +2. old primary is fenced by epoch change +3. one replica is promoted +4. promoted replica becomes new primary under a higher epoch +5. all old recovery sessions from the old primary are invalidated +6. remaining replicas are reevaluated relative to the new primary's head and retained history + +Important consequence: +- failover is not a `ReplicaFSM` transition only +- it is a volume-level re-rooting of all replica relationships + +## Suggested Promotion Rules + +Promotion candidate should prefer: +1. highest valid durable progress +2. current epoch-consistent history +3. 
healthiest replica among tied candidates

After promotion:
- `PrimaryID` changes
- `Epoch` increments
- all replica reservations from the previous primary are void
- all non-primary replicas must renegotiate recovery against the new primary

## Multi-Replica Examples

### Example 1: `sync_all`

- replica A = `InSync`
- replica B = `Lagging`
- replica C = `InSync`

If all three are required replicas in RF=3 `sync_all`:
- writes needing sync durability fail or wait
- even though only one replica is lagging

### Example 2: `sync_quorum`

- replica A = `InSync`
- replica B = `CatchingUp`
- replica C = `InSync`

If quorum is 2:
- volume can continue serving sync requests
- B recovers in background

### Example 3: failover

- old primary lost
- replica A promoted
- replica B was previously `CatchingUp` under old epoch

After promotion:
- B's old session is aborted
- B re-enters evaluation against A's history

## What The Tiny Prototype Should Simulate

The V2 prototype should be able to drive at least these scenarios:

1. steady state keep-up
- primary head advances
- all required replicas remain `InSync`

2. short outage
- one replica disconnects
- primary keeps writing
- reconnect succeeds within recoverable window
- replica returns via `PromotionHold`

3. long outage
- one replica disconnects too long
- recoverability expires
- replica goes `NeedsRebuild`
- rebuild and trailing replay complete

4. tail chasing
- replica catch-up speed is below primary ingest speed
- orchestrator chooses fail, throttle, or rebuild path depending on mode

5. failover
- primary lease lost
- new epoch assigned
- replica promoted
- old recovery sessions fenced

6. 
mixed-state quorum +- different replicas in different states +- orchestrator computes correct `sync_all` / `sync_quorum` result + +## Relationship To WAL V1 + +WAL V1 already contains pieces of this logic, but they are scattered across: +- shipper state +- barrier code +- retention code +- assignment/promotion code +- rebuild code +- heartbeat/master logic + +V2 should separate these into: +- per-replica recovery FSM +- volume-level orchestrator + +## Bottom Line + +The next step after `ReplicaFSM` is not `Smart WAL`. + +The next step is the volume-level orchestrator model. + +Why: +- primary keeps moving +- durability mode is volume-scoped +- failover/promotion is volume-scoped +- replica recovery must be evaluated in the context of the whole volume + +So V2 needs: +- `ReplicaFSM` for one replica +- `VolumeOrchestrator` for the moving multi-replica system diff --git a/sw-block/design/wal-replication-v2-state-machine.md b/sw-block/design/wal-replication-v2-state-machine.md new file mode 100644 index 000000000..c42919c66 --- /dev/null +++ b/sw-block/design/wal-replication-v2-state-machine.md @@ -0,0 +1,632 @@ +# WAL Replication V2 State Machine + +Date: 2026-03-26 +Status: design proposal +Purpose: define the V2 replication state machine for a moving-head primary where replicas may transition between keep-up, catch-up, and reconstruction while the primary continues accepting writes + +## Why This Document Exists + +The hard part of V2 is not the existence of three modes: + +- keep-up +- catch-up +- reconstruction + +The hard part is that the primary head continues advancing while replicas move between those modes. + +So V2 must be specified as a real state machine: + +- state definitions +- state-owned LSN anchors +- allowed transitions +- retention obligations +- abort rules + +This document treats edge cases as state-transition cases. + +## Scope + +This is a protocol/state-machine design. 
+ +It does not yet define: +- exact RPC payloads +- exact snapshot storage format +- exact implementation package boundaries + +Those can follow after the state model is stable. + +## Core Terms + +### `headLSN` + +The primary's current highest WAL LSN. + +### `replicaFlushedLSN` + +The highest LSN durably persisted on the replica. + +### `cpLSN` + +A checkpoint/snapshot base point. A snapshot at `cpLSN` represents the block state exactly at that LSN. + +### `promotionBarrierLSN` + +The LSN a replica must durably reach before it can re-enter `InSync`. + +### `Recovery Feasibility` + +Whether `(startLSN, endLSN]` can be reconstructed completely, in order, under the current epoch. + +This is not a static fact. It changes over time as WAL is reclaimed, payload generations are garbage-collected, or snapshots are released. + +### `Recovery Reservation` + +A bounded primary-side reservation proving a recovery window is recoverable and pinning all dependencies needed to finish the current catch-up or rebuild-tail replay. + +A transition into recovery is valid only after the reservation is granted. + +## State Set + +Replica may be in one of these states: + +1. `Bootstrapping` +2. `InSync` +3. `Lagging` +4. `CatchingUp` +5. `PromotionHold` +6. `NeedsRebuild` +7. `Rebuilding` +8. `CatchUpAfterRebuild` +9. `Failed` + +Only `InSync` replicas count for sync durability. + +## State Semantics + +### 1. `Bootstrapping` + +Replica has not yet earned sync eligibility and does not yet have trusted reconnect progress. + +Properties: +- fresh replica identity or newly assigned replica +- may receive initial baseline/live stream +- not yet eligible for `sync_all` + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: background/bootstrap only + +Owned anchors: +- current assignment epoch + +### 2. `InSync` + +Replica is eligible for sync durability. 
+ +Properties: +- receiving live ordered stream +- `replicaFlushedLSN` is near the primary head +- normal barrier protocol is valid + +Counts for: +- `sync_all`: yes +- `sync_quorum`: yes +- `best_effort`: yes, but not required for ACK + +Owned anchors: +- `replicaFlushedLSN` + +### 3. `Lagging` + +Replica has fallen out of the normal live-stream envelope but recovery path is not yet chosen. + +Properties: +- primary no longer treats it as sync-eligible +- replica may still be recoverable from WAL or extent-backed recovery records +- or may require rebuild + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: background recovery only + +Owned anchors: +- last known `replicaFlushedLSN` + +### 4. `CatchingUp` + +Replica is replaying from its own durable point toward a chosen target. + +Properties: +- short-gap recovery mode +- primary must reserve and pin the required recovery window +- primary head continues to move + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: background recovery only + +Owned anchors: +- `catchupStartLSN = replicaFlushedLSN` +- `catchupTargetLSN` +- `promotionBarrierLSN` +- `recoveryReservationID` +- `reservationExpiry` + +### 5. `PromotionHold` + +Replica has reached the chosen promotion point but must demonstrate short stability before re-entering `InSync`. + +Properties: +- prevents immediate flapping back into sync eligibility +- replica has already reached `promotionBarrierLSN` +- promotion requires stable barriers or elapsed hold time + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: stabilization only + +Owned anchors: +- `promotionBarrierLSN` +- `promotionHoldUntil` or equivalent hold criterion + +### 6. `NeedsRebuild` + +Replica cannot recover from retained recovery records alone. 
+ +Properties: +- catch-up window is insufficient or no longer provable +- replica must not count toward sync durability +- replica no longer pins old catch-up history + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: background repair candidate only + +Owned anchors: +- last known `replicaFlushedLSN` + +### 7. `Rebuilding` + +Replica is fetching and installing a checkpoint/snapshot base image. + +Properties: +- primary must preserve the chosen snapshot/base +- primary must preserve the required WAL or recovery tail after `cpLSN` + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: background rebuild only + +Owned anchors: +- `snapshotID` +- `snapshotCpLSN` +- `tailReplayStartLSN = snapshotCpLSN + 1` +- `recoveryReservationID` +- `reservationExpiry` + +### 8. `CatchUpAfterRebuild` + +Replica has installed the base image and is replaying trailing history after it. + +Properties: +- semantically similar to `CatchingUp` +- base point is checkpoint/snapshot, not the replica's original own state + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: background recovery only + +Owned anchors: +- `snapshotCpLSN` +- `catchupTargetLSN` +- `promotionBarrierLSN` +- `recoveryReservationID` +- `reservationExpiry` + +### 9. `Failed` + +Replica recovery failed in a way that needs operator/control-plane action beyond normal retry. 
+ +Properties: +- terminal or semi-terminal fault state +- may require delete/recreate/manual intervention + +Counts for: +- `sync_all`: no +- `sync_quorum`: no +- `best_effort`: no direct role + +## Transition Rules + +### `Bootstrapping -> InSync` + +Trigger: +- initial bootstrap completes +- barrier confirms durable progress under the current epoch + +Action: +- establish trusted `replicaFlushedLSN` +- grant sync eligibility for the first time + +### `InSync -> Lagging` + +Trigger: +- disconnect +- barrier timeout +- barrier fsync failure +- stream error + +Action: +- remove sync eligibility immediately + +### `Lagging -> CatchingUp` + +Trigger: +- reconnect succeeds +- primary grants a recovery reservation proving `(replicaFlushedLSN, catchupTargetLSN]` is recoverable for a bounded window + +Action: +- choose `catchupTargetLSN` +- pin required recovery dependencies for the reservation lifetime + +### `Lagging -> NeedsRebuild` + +Trigger: +- required recovery window is not recoverable +- impossible progress reported +- epoch mismatch invalidates direct catch-up +- background janitor determines the replica is outside recoverable budget + +Action: +- stop treating replica as a catch-up candidate + +### `CatchingUp -> PromotionHold` + +Trigger: +- replica replays to `catchupTargetLSN` +- barrier confirms `promotionBarrierLSN` + +Action: +- start promotion debounce window + +### `PromotionHold -> InSync` + +Trigger: +- promotion hold criteria satisfied + - stable barrier successes + - or elapsed hold time + +Action: +- restore sync eligibility +- clear promotion anchors + +### `PromotionHold -> Lagging` + +Trigger: +- disconnect +- failed barrier +- failed live stream health check + +Action: +- cancel promotion attempt +- remove sync eligibility + +### `CatchingUp -> NeedsRebuild` + +Trigger: +- catch-up cannot converge +- recovery reservation is lost +- catch-up timeout policy exceeded +- epoch changes + +Action: +- abandon WAL-only catch-up +- move to 
reconstruction path + +### `NeedsRebuild -> Rebuilding` + +Trigger: +- control plane or primary chooses reconstruction base +- snapshot/base image transfer starts +- primary grants a rebuild reservation + +Action: +- bind replica to `snapshotID` and `snapshotCpLSN` + +### `Rebuilding -> CatchUpAfterRebuild` + +Trigger: +- snapshot/base image installed successfully +- trailing recovery reservation is still valid + +Action: +- replay trailing history after `snapshotCpLSN` + +### `Rebuilding -> NeedsRebuild` + +Trigger: +- rebuild copy fails +- rebuild reservation is lost +- rebuild WAL-tail budget is exceeded +- epoch changes + +Action: +- abort current rebuild session +- remain excluded from sync durability + +### `CatchUpAfterRebuild -> PromotionHold` + +Trigger: +- trailing replay reaches target +- barrier confirms durable replay through `promotionBarrierLSN` + +Action: +- start promotion debounce + +### `CatchUpAfterRebuild -> NeedsRebuild` + +Trigger: +- reservation is lost +- replay cannot converge +- epoch changes + +Action: +- abandon current attempt +- require a fresh rebuild plan + +### Any state -> `Failed` + +Trigger examples: +- unrecoverable protocol inconsistency +- repeated rebuild failure beyond retry policy +- snapshot corruption +- local replica storage failure + +## Retention Obligations By State + +The key V2 rule is: + +- recoverability is not a static fact +- it is a bounded promise the primary must honor once it admits a replica into recovery + +### `InSync` + +Primary must retain: +- recent WAL under normal retention policy + +Primary does not need: +- snapshot pin purely for this replica + +### `Lagging` + +Primary must retain: +- enough recent information to evaluate recoverability or intentionally declare `NeedsRebuild` + +This state should be short-lived. 
+ +### `CatchingUp` + +Primary must retain for the reservation lifetime: +- recovery metadata for `(catchupStartLSN, promotionBarrierLSN]` +- every payload referenced by that recovery window +- current epoch lineage for the session + +### `PromotionHold` + +Primary must retain: +- whatever live-stream and barrier state is required to validate promotion + +This state should be brief and must not pin long-lived history. + +### `NeedsRebuild` + +Primary retains: +- no special old recovery window for this replica + +This state explicitly releases the old catch-up hold. + +### `Rebuilding` + +Primary must retain for the reservation lifetime: +- chosen `snapshotID` +- any base-image dependencies +- trailing history after `snapshotCpLSN` + +### `CatchUpAfterRebuild` + +Primary must retain for the reservation lifetime: +- recovery metadata for `(snapshotCpLSN, promotionBarrierLSN]` +- every payload referenced by that trailing window + +## Moving-Head Rules + +The primary head continues advancing during: +- `CatchingUp` +- `Rebuilding` +- `CatchUpAfterRebuild` + +Therefore transitions must never use current head at finish time as an implicit target. + +Instead, each transition must select explicit targets. + +### Catch-up target + +When catch-up starts, choose: +- `catchupTargetLSN = H0` + +Replica first chases to `H0`, not to an infinite moving head. + +Then: +- either enter `PromotionHold` and promote +- or begin another bounded cycle +- or abort to rebuild + +### Rebuild target + +When rebuild starts, choose: +- `snapshotCpLSN = C` +- trailing replay target `H0` + +Replica installs the snapshot at `C`, then replays `(C, H0]`, then enters `PromotionHold`. + +## Tail-Chasing Rule + +Replica may fail to converge if: +- catch-up speed < primary ingest speed + +V2 must define bounded behavior: + +1. bounded catch-up window +2. bounded catch-up time +3. 
policy after failure to converge: + - for `sync_all`: bounded retry, then fail requests + - for `best_effort`: keep serving and continue background recovery or escalate to rebuild + +No silent downgrade of `sync_all` is allowed. + +## Recovery Feasibility + +The primary must not admit a replica into catch-up based on a best-effort guess. + +It must prove the requested recovery window is recoverable and then reserve it. + +Recommended abstraction: + +- `CheckRecoveryFeasibility(startLSN, endLSN) -> fully recoverable | needs rebuild` +- `ReserveRecoveryWindow(startLSN, endLSN) -> reservation` + +Only a successful reservation may drive: +- `Lagging -> CatchingUp` +- `NeedsRebuild -> Rebuilding` +- `Rebuilding -> CatchUpAfterRebuild` + +## Recovery Classes + +V2 must support more than one local record type without leaking that detail into replica state. + +### `WALInline` + +Properties: +- payload lives directly in WAL +- recoverable while WAL is retained + +### `ExtentReferenced` + +Properties: +- recovery metadata points at payload outside WAL +- payload must be resolved from extent/snapshot generation state + +The FSM does not care how payload is stored. + +It only cares whether the requested window is fully recoverable for the lifetime of the reservation. + +The engine-level rule is: + +- every record in `(startLSN, endLSN]` must be payload-resolvable +- the resolved version must correspond to that record's historical state +- the payload must stay pinned until the reservation ends + +If any required payload is not resolvable: +- the window is not recoverable +- the replica must go to `NeedsRebuild` + +## Snapshot Rule + +Rebuild must use a real checkpoint/snapshot base image. + +Valid: +- immutable snapshot at `cpLSN` +- copy-on-write checkpoint image +- frozen base image with exact `cpLSN` + +Invalid: +- current extent treated as historical `cpLSN` + +## Epoch / Fencing Rule + +Every transition is epoch-bound. 
+ +If epoch changes during: +- `Bootstrapping` +- `Lagging` +- `CatchingUp` +- `PromotionHold` +- `Rebuilding` +- `CatchUpAfterRebuild` + +Then: +- abort current transition +- discard old sender assumptions +- restart negotiation under the new epoch + +This prevents stale-primary recovery traffic from being accepted. + +## Multi-Replica Volume Rules + +Different replicas may be in different states simultaneously. + +Example: +- replica A = `InSync` +- replica B = `CatchingUp` +- replica C = `Rebuilding` + +Volume-level durability policy is computed per mode. + +### `sync_all` +- all required replicas must be `InSync` + +### `sync_quorum` +- enough replicas must be `InSync` + +### `best_effort` +- primary local durability only +- replicas recover in background + +## Illegal or Suspicious Conditions + +These should force rejection or abort: + +1. replica reports `replicaFlushedLSN > headLSN` +2. replica progress belongs to wrong epoch +3. requested recovery window is not recoverable +4. recovery reservation cannot be granted +5. snapshot base does not match claimed `cpLSN` +6. replay stream shows impossible gap/ordering after reconstruction + +## Design Guidance + +V2 should be implemented so that: + +1. state owns recovery semantics +2. anchors make transitions explicit +3. retention obligations are derived from state +4. catch-up admission requires reservation, not guesswork +5. mode semantics are derived from `InSync` eligibility + +This is better than burying recovery behavior across many ad hoc code paths. + +## Bottom Line + +V2 is fundamentally a state machine problem. 
+ +The correct abstraction is not: +- some edge cases around WAL replay + +It is: +- replicas move through explicit states while the primary head continues advancing and recovery windows must be provable and reserved + +So V2 must be designed around: +- state definitions +- anchor LSNs +- transition rules +- retention obligations +- recoverability checks +- recovery reservations +- abort conditions diff --git a/sw-block/design/wal-replication-v2.md b/sw-block/design/wal-replication-v2.md new file mode 100644 index 000000000..473485b6d --- /dev/null +++ b/sw-block/design/wal-replication-v2.md @@ -0,0 +1,401 @@ +# WAL Replication V2 + +Date: 2026-03-26 +Status: design proposal +Purpose: redesign WAL-based block replication around explicit short-gap catch-up and long-gap reconstruction + +## Goal + +Provide a replication architecture that: + +- keeps the primary write path fast +- supports correct synchronous durability semantics +- supports short-gap reconnect catch-up using WAL +- avoids paying unbounded WAL retention tax for long-lag replicas +- uses reconstruction from a real checkpoint/snapshot base for larger lag + +This design replaces a "WAL does everything" mindset with a 3-tier recovery model. + +## Core Principle + +WAL is excellent for: +- recent ordered delta +- local crash recovery +- short-gap replica catch-up + +WAL is not the right long-range recovery mechanism for lagging block replicas. + +Long-gap recovery should use: +- a real checkpoint/snapshot base image +- plus WAL tail replay after that base point + +## Correctness Boundary + +Never reconstruct old state from current extent alone. + +Example: + +1. `LSN 100`: block `A = foo` +2. `LSN 120`: block `A = bar` + +If a replica needs state at `LSN 100`, current extent contains `bar`, not `foo`. + +Therefore: +- current extent is latest state +- not historical state + +So long-gap recovery must use a base image that is known to represent a real checkpoint/snapshot `cpLSN`. 
+ +## 3-Tier Replication Model + +### Tier A: Keep-up + +Replica is close enough to the primary that normal ordered streaming keeps it current. + +Properties: +- normal steady-state mode +- no special recovery path +- replica stays `InSync` + +### Tier B: Lagging Catch-up + +Replica fell behind, but the primary still has enough recoverable history covering the missing range. + +Properties: +- reconnect handshake determines the replica durable point +- primary proves and reserves a bounded recovery window +- primary replays missing history +- replica returns to `InSync` only after replay, barrier confirmation, and promotion hold + +### Tier C: Reconstruction + +Replica is too far behind for direct replay. + +Properties: +- replica must rebuild from a real checkpoint/snapshot base +- after base image install, primary replays trailing history after `cpLSN` +- replica only re-enters `InSync` after durable catch-up completes + +## Architecture + +### Primary Artifacts + +The primary owns three forms of state: + +1. `Active WAL` +- recent ordered metadata/delta stream +- bounded by retention policy + +2. `Checkpoint Snapshot` +- immutable point-in-time base image at `cpLSN` +- used for long-gap reconstruction + +3. `Current Extent` +- latest live block state +- not a substitute for historical checkpoint state + +### Replica Artifacts + +Replica maintains: + +1. local WAL or equivalent recovery log +2. replica `receivedLSN` +3. replica `flushedLSN` +4. local extent state + +## Sender Model + +Do not ship recovery data inline from foreground write goroutines. + +Per replica, use: +- one ordered send queue +- one sender loop + +The sender loop owns: +- live stream shipping +- reconnect handling +- short-gap catch-up +- reconstruction tail replay + +This guarantees: +- strict LSN order per replica +- clean transport state ownership +- no inline shipping races in the primary write path + +## Write Path + +Primary write path: + +1. allocate monotonic `LSN` +2. 
append recovery metadata to local WAL or journal +3. enqueue the record to each replica sender queue +4. return according to durability mode semantics + +Flusher later: +- flushes dirty data to extent +- manages checkpoints +- manages bounded retention of WAL and other recovery dependencies + +## Recovery Classes + +V2 supports more than one local record type. + +### `WALInline` + +Properties: +- payload lives directly in WAL +- recoverable while WAL is retained + +### `ExtentReferenced` + +Properties: +- journal entry contains metadata only +- payload is resolved from extent/snapshot generation state +- direct-extent writes and future smart-WAL paths fall into this class + +Replica state does not encode these classes. + +Instead, the primary must answer a stricter question for reconnect: +- is `(startLSN, endLSN]` fully recoverable under the current epoch, and can it be reserved for the duration of recovery? + +## Replica Progress Model + +Each replica reports progress explicitly. + +### `receivedLSN` +- highest LSN received and appended locally +- not yet a durability guarantee + +### `flushedLSN` +- highest LSN durably persisted on the replica +- authoritative sync durability signal + +Only `flushedLSN` counts for: +- `sync_all` +- `sync_quorum` + +## Replica States + +Replica state is defined by `wal-replication-v2-state-machine.md`. + +Important highlights: +- `Bootstrapping` +- `InSync` +- `Lagging` +- `CatchingUp` +- `PromotionHold` +- `NeedsRebuild` +- `Rebuilding` +- `CatchUpAfterRebuild` +- `Failed` + +Only `InSync` replicas count toward sync durability. + +## Protocol + +### 1. Normal Streaming + +Primary sender loop: +- sends ordered replicated write records + +Replica: +1. validates ordering +2. appends locally +3. advances `receivedLSN` + +### 2. Barrier / Sync + +Primary sends: +- `BarrierReq{LSN, Epoch}` + +Replica: +1. wait until `receivedLSN >= LSN` +2. flush durable local state +3. set `flushedLSN = LSN` +4. 
reply `BarrierResp{Status, FlushedLSN}` + +Primary uses this to evaluate mode policy. + +### 3. Reconnect Handshake + +On reconnect, primary obtains: +- current epoch +- primary head +- replica durable `flushedLSN` + +Then primary evaluates recovery feasibility. + +Possible outcomes: + +1. replica already caught up +- state -> `PromotionHold` or `InSync` depending on policy + +2. bounded catch-up possible +- reserve recovery window +- state -> `CatchingUp` + +3. direct replay not possible +- state -> `NeedsRebuild` + +## Recovery Feasibility and Reservation + +The key V2 rule is: +- `fully recoverable` is not enough +- the primary must also reserve the recovery window + +Recommended engine-side flow: + +1. `CheckRecoveryFeasibility(startLSN, endLSN)` +2. if feasible, `ReserveRecoveryWindow(startLSN, endLSN)` +3. only then start `CatchingUp` or `CatchUpAfterRebuild` + +A recovery reservation pins: +- recovery metadata +- referenced payload generations +- required snapshots/base images +- current epoch lineage for the session + +If the reservation is lost during recovery: +- abort the current attempt +- fall back to `NeedsRebuild` + +## Tier B: Lagging Catch-up Algorithm + +When a replica is behind but within a recoverable retained window: + +1. choose a bounded target `H0` +2. reserve `(ReplicaFlushedLSN, H0]` +3. replay the missing range +4. barrier confirms durable `flushedLSN >= H0` +5. enter `PromotionHold` +6. only then restore `InSync` + +### Tail-chasing problem + +If the primary is writing faster than the replica can catch up, the replica may never converge. + +To handle this: + +1. define a bounded catch-up window +2. if catch-up rate is slower than ingest rate for too long: + - either temporarily throttle primary admission for strict `sync_all` + - or fail `sync_all` requests and let control-plane policy react + - or abort to rebuild +3. 
do not let a replica remain in unbounded perpetual `CatchingUp` + +### Important rule + +For `sync_all`, the data path must not silently downgrade to `best_effort`. + +Correct behavior: +- bounded retry +- then fail + +Any mode change must be explicit policy, not silent transport behavior. + +## Tier C: Reconstruction Algorithm + +When a replica is too far behind for direct replay: + +1. mark replica `NeedsRebuild` +2. choose a real checkpoint/snapshot base at `cpLSN` +3. create a rebuild reservation +4. replica enters `Rebuilding` +5. replica pulls immutable checkpoint/snapshot image +6. replica installs that base image and sets base progress to `cpLSN` +7. primary replays trailing history `(cpLSN, H0]` +8. barrier confirms durable replay +9. replica enters `PromotionHold` +10. replica returns to `InSync` + +### Why snapshot/base image must be real + +If the replica needs state at `cpLSN`, the base image must represent exactly that checkpoint. + +Invalid: +- current extent copied at some later time and treated as historical `cpLSN` + +Valid: +- immutable snapshot +- copy-on-write checkpoint image +- frozen base image + +## Retention and Budget + +V2 retention is bounded. 
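The bounded-retention rule can be sketched as a simple window check. This is a hedged approximation, not the engine's API (`retention`, `replayable`, and the field names are invented for illustration): a catch-up window `(startLSN, endLSN]` is directly replayable only if it lies entirely inside the retained WAL range; otherwise the replica must fall back to rebuild.

```go
package main

import "fmt"

// retention is a hypothetical view of the primary's retained WAL range.
type retention struct {
	retainedFloorLSN uint64 // oldest LSN still held in retained WAL
	headLSN          uint64 // newest LSN written by the primary
}

// replayable reports whether (startLSN, endLSN] can be served from retained WAL.
func (r retention) replayable(startLSN, endLSN uint64) bool {
	return startLSN >= r.retainedFloorLSN && endLSN <= r.headLSN
}

func main() {
	r := retention{retainedFloorLSN: 500, headLSN: 900}
	fmt.Println(r.replayable(600, 900)) // true: inside the retained window
	fmt.Println(r.replayable(100, 900)) // false: gap predates retention floor -> NeedsRebuild
}
```

In the full design this check is only step one; per the reservation rule, a feasible window must also be reserved before catch-up starts.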
+ +### WAL / recovery metadata retention + +Primary keeps only a bounded recent recovery window: +- `max_retained_wal_bytes` +- optionally `max_retained_wal_time` + +### Recovery reservation budget + +Reservations are also bounded: +- timeout +- bytes pinned +- snapshot dependency lifetime + +If a catch-up or rebuild session exceeds its reservation budget: +- primary aborts the session +- replica falls back to `NeedsRebuild` +- a newer rebuild plan may be chosen later + +## Sync Modes + +### `best_effort` +- ACK after primary local durability +- replicas may lag +- background catch-up or rebuild allowed + +### `sync_all` +- ACK only when all required replicas are `InSync` and durably at target LSN +- bounded retry only +- no silent downgrade + +### `sync_quorum` +- ACK when enough replicas are `InSync` and durably at target LSN + +## Why This Direction + +V2 separates three different concerns cleanly: + +1. fast steady-state replication +2. short-gap replay +3. long-gap reconstruction + +This avoids forcing WAL alone to solve all recovery cases. + +## Implementation Order + +Recommended order: + +1. pure FSM +2. ordered sender loop +3. bounded direct replay +4. checkpoint/snapshot reconstruction +5. smarter local write path and recovery classes +6. policy and control-plane integration + +## Phase 13 current direction + +Current Phase 13 / WAL V1 is still: +- fixing correctness of WAL-centered sync replication +- still focused mainly on bounded WAL replay and rebuild fallback + +That is the right bridge. + +V2 should follow after WAL V1 closes. + +## Bottom Line + +V2 is not "more WAL features." 
+ +It is: +- explicit recovery feasibility +- explicit recovery reservations +- ordered sender loops +- short-gap replay for recent lag +- checkpoint/snapshot reconstruction for long lag +- promotion back to `InSync` only after durable proof diff --git a/sw-block/design/wal-v1-to-v2-mapping.md b/sw-block/design/wal-v1-to-v2-mapping.md new file mode 100644 index 000000000..c6ab33d39 --- /dev/null +++ b/sw-block/design/wal-v1-to-v2-mapping.md @@ -0,0 +1,349 @@ +# WAL V1 To V2 Mapping + +Date: 2026-03-26 +Status: working note +Purpose: map the current WAL V1 scattered state across `sw-block` into the proposed WAL V2 FSM vocabulary + +## Why This Note Exists + +Current WAL V1 correctness logic is spread across: + +- `wal_shipper.go` +- `replica_apply.go` +- `dist_group_commit.go` +- `blockvol.go` +- `promotion.go` +- `rebuild.go` +- heartbeat/master reporting + +This note does not propose immediate code changes. + +It exists to answer two questions: + +1. what state already exists in WAL V1 today? +2. how does that state map into the cleaner WAL V2 FSM model? + +## Current V1 State Owners + +### 1. Shipper state + +Primary-side per-replica transport and recovery state lives mainly in: +- `weed/storage/blockvol/wal_shipper.go` + +Current V1 shipper states: +- `ReplicaDisconnected` +- `ReplicaConnecting` +- `ReplicaCatchingUp` +- `ReplicaInSync` +- `ReplicaDegraded` +- `ReplicaNeedsRebuild` + +Other shipper-owned flags/anchors: +- `replicaFlushedLSN` +- `hasFlushedProgress` +- `catchupFailures` +- `lastContactTime` + +### 2. Replica receiver progress + +Replica-side receive/apply progress lives mainly in: +- `weed/storage/blockvol/replica_apply.go` + +Current V1 replica progress: +- `receivedLSN` +- `flushedLSN` +- duplicate/gap handling in `applyEntry()` + +### 3. 
Volume-level durability policy + +Volume-level sync semantics live mainly in: +- `weed/storage/blockvol/dist_group_commit.go` + +Current V1 policy uses: +- local WAL sync result +- per-shipper barrier results +- `DurabilityBestEffort` +- `DurabilitySyncAll` +- `DurabilitySyncQuorum` + +### 4. Volume-level retention/checkpoint state + +Primary-side local checkpoint and WAL retention state lives mainly in: +- `weed/storage/blockvol/blockvol.go` +- `weed/storage/blockvol/flusher.go` + +Current V1 anchors: +- `nextLSN` +- `CheckpointLSN()` +- WAL retained range +- retention-floor callbacks from `ShipperGroup` + +### 5. Role/assignment state + +Master-driven volume role state lives mainly in: +- `weed/storage/blockvol/promotion.go` +- `weed/storage/blockvol/blockvol.go` +- `weed/server/volume_server_block.go` + +Current V1 roles: +- `RolePrimary` +- `RoleReplica` +- `RoleStale` +- `RoleRebuilding` +- `RoleDraining` + +### 6. Rebuild state + +Existing V1 rebuild transport/process lives mainly in: +- `weed/storage/blockvol/rebuild.go` + +Current V1 rebuild phases: +- WAL catch-up attempt +- full extent copy +- trailing WAL catch-up +- rejoin via assignment + fresh shipper bootstrap + +### 7. Heartbeat/master-visible replication state + +Master-visible state lives mainly in: +- `weed/storage/blockvol/block_heartbeat.go` +- `weed/storage/blockvol/blockvol.go` +- server-side registry/master handling + +Current V1 visible fields include: +- `ReplicaDegraded` +- `ReplicaShipperStates []ReplicaShipperStatus` +- role/epoch/checkpoint/head state + +## V1 To V2 Mapping + +### Shipper state mapping + +| WAL V1 shipper state | Proposed WAL V2 FSM state | Notes | +| --- | --- | --- | +| `ReplicaDisconnected` | `Bootstrapping` or `Lagging` | Fresh shipper with no durable progress maps to `Bootstrapping`; previously-synced disconnected replica maps to `Lagging`. 
| +| `ReplicaConnecting` | transitional part of `Lagging -> CatchingUp` | V2 should model this as an event/session phase, not a durable steady state. | +| `ReplicaCatchingUp` | `CatchingUp` | Direct mapping for short-gap replay. | +| `ReplicaInSync` | `InSync` | Direct mapping. | +| `ReplicaDegraded` | `Lagging` | V1 transport failure state becomes the cleaner V2 recovery-needed state. | +| `ReplicaNeedsRebuild` | `NeedsRebuild` | Direct mapping. | + +Main V1 cleanup opportunity: +- V1 mixes transport/session detail (`Connecting`) with recovery lifecycle state. +- V2 should keep the long-lived FSM smaller and push connection mechanics into sender-loop/session logic. + +### Replica receiver progress mapping + +| WAL V1 field | WAL V2 concept | Notes | +| --- | --- | --- | +| `receivedLSN` | `receivedLSN` | Keep as transport/apply progress only. | +| `flushedLSN` | `replicaFlushedLSN` | Keep as authoritative durability anchor. | +| duplicate/gap rules | replay validity rules | These become part of the V2 replay contract, not ad hoc receiver behavior. | + +Main V1 cleanup opportunity: +- V1 receiver progress is already conceptually sound. +- V2 should keep it but drive it from explicit FSM transitions and replay reservations. + +### Volume durability policy mapping + +| WAL V1 behavior | WAL V2 concept | Notes | +| --- | --- | --- | +| `BarrierAll` against current shippers | promotion and sync gate | V2 should keep barrier-based durability truth. | +| `sync_all` requires all barriers | `InSync` eligibility gate | Same rule, but V2 eligibility should come from FSM state rather than scattered checks. | +| `best_effort` ignores barrier failures | background recovery mode | Same high-level policy. | +| `sync_quorum` counts successful barriers | quorum over `InSync` replicas | Same direction, but should be derived from explicit FSM state. 
| + +Main V1 cleanup opportunity: +- durability mode logic should depend on `IsSyncEligible()`-style state, not raw shipper state enums spread across code. + +### Retention/checkpoint mapping + +| WAL V1 concept | WAL V2 concept | Notes | +| --- | --- | --- | +| `CheckpointLSN()` | checkpoint/base anchor | Keep, but V2 also adds explicit `cpLSN` snapshot semantics. | +| retention floor from recoverable replicas | recoverability budget | Keep the idea, but V2 turns this into explicit reservation management. | +| timeout-based `NeedsRebuild` | janitor-driven `Lagging -> NeedsRebuild` | Keep as background control logic, not hot-path mutation. | + +Main V1 cleanup opportunity: +- V1 retains data because replicas might need it. +- V2 should reserve specific recovery windows, not rely only on ambient retention conditions. + +### Role/assignment mapping + +| WAL V1 role state | WAL V2 meaning | Notes | +| --- | --- | --- | +| `RolePrimary` | primary ownership / epoch authority | Not a replica FSM state; remains volume/control-plane state. | +| `RoleReplica` | replica service role | Orthogonal to replication FSM state. A replica volume may be `RoleReplica` while its sender-facing state is `Bootstrapping`, `Lagging`, or `InSync`. | +| `RoleStale` | pre-rebuild/non-serving | Closest to `NeedsRebuild` preparation on the volume role side. | +| `RoleRebuilding` | rebuild session role | Maps to volume-wide orchestration around V2 `Rebuilding`. | +| `RoleDraining` | assignment/failover coordination | Outside replica FSM; remains a volume transition role. | + +Main V1 cleanup opportunity: +- role state and replication FSM state are different dimensions. +- V1 sometimes implicitly blends them. 
+- V2 should keep them separate: + - control-plane role FSM + - per-replica replication FSM + +### Rebuild flow mapping + +| WAL V1 rebuild phase | WAL V2 FSM phase | Notes | +| --- | --- | --- | +| WAL catch-up pre-pass | `Lagging -> CatchingUp` if feasible | Same idea, but V2 requires recoverability proof and reservation. | +| full extent copy | `NeedsRebuild -> Rebuilding` | Same high-level phase. | +| trailing WAL catch-up | `CatchUpAfterRebuild` | Direct conceptual mapping. | +| fresh shipper bootstrap after reassignment | `Bootstrapping` then promotion | V1 does this through assignment refresh; V2 may eventually do it with cleaner local transitions. | + +Main V1 cleanup opportunity: +- V1 rebuild success is currently rejoined indirectly through control-plane reassignment. +- V2 should eventually make rebuild completion and promotion explicit FSM transitions. + +### Heartbeat/master state mapping + +| WAL V1 visible state | WAL V2 meaning | Notes | +| --- | --- | --- | +| `ReplicaShipperStatus{DataAddr, State, FlushedLSN}` | control-plane view of per-replica FSM | Good starting shape. | +| `ReplicaDegraded` | derived summary only | Too coarse for V2 decision-making; keep only as convenience/compat field. | +| role/epoch/head/checkpoint | role FSM + replication anchors | Continue reporting; V2 may need richer recovery reservation visibility later. | + +Main V1 cleanup opportunity: +- master-facing replication state should be per replica, not summarized as one degraded bit. 
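To illustrate why one degraded bit is too coarse, here is a minimal sketch loosely following the `ReplicaShipperStatus{DataAddr, State, FlushedLSN}` shape mentioned above (the struct and function here are illustrative, not the production types):

```go
package main

import "fmt"

// ReplicaStatus is an illustrative per-replica report, loosely modeled on
// the V1 ReplicaShipperStatus{DataAddr, State, FlushedLSN} shape.
type ReplicaStatus struct {
	DataAddr   string
	State      string
	FlushedLSN uint64
}

// degraded is the coarse V1-style summary: any non-InSync replica flips one bit.
func degraded(statuses []ReplicaStatus) bool {
	for _, s := range statuses {
		if s.State != "InSync" {
			return true
		}
	}
	return false
}

func main() {
	statuses := []ReplicaStatus{
		{DataAddr: "r1:9333", State: "InSync", FlushedLSN: 900},
		{DataAddr: "r2:9333", State: "CatchingUp", FlushedLSN: 750},
	}
	// The single bit only says "something is wrong"; the per-replica list
	// tells the control plane which replica, in which state, how far behind.
	fmt.Println("degraded:", degraded(statuses))
	for _, s := range statuses {
		fmt.Printf("%s state=%s flushed=%d\n", s.DataAddr, s.State, s.FlushedLSN)
	}
}
```

The derived bit can stay as a compatibility field, but V2 decisions (quorum eligibility, rebuild scheduling) need the per-replica rows.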
+ +## Current V1 Event Sources vs V2 Events + +### V1 event source: `Barrier()` outcome + +Current effects: +- mark `InSync` +- update `replicaFlushedLSN` +- mark degraded on error + +V2 event mapping: +- `BarrierSuccess` +- `BarrierFailure` +- `PromotionHealthy` + +### V1 event source: reconnect handshake + +Current effects: +- `Connecting` +- choose `InSync`, `CatchingUp`, or `NeedsRebuild` + +V2 event mapping: +- `ReconnectObserved` +- `RecoveryFeasible` +- `RecoveryReservationGranted` +- `ReconnectNeedsRebuild` + +### V1 event source: retention budget evaluation + +Current effects: +- stale replica becomes `NeedsRebuild` + +V2 event mapping: +- `RecoverabilityExpired` +- `BackgroundJanitorNeedsRebuild` + +### V1 event source: rebuild assignment and `StartRebuild` + +Current effects: +- role becomes `RoleRebuilding` +- run baseline + trailing catch-up +- rejoin later via reassignment + +V2 event mapping: +- `StartRebuild` +- `RebuildBaseApplied` +- `RebuildReservationLost` +- `RebuildCompleteReadyForPromotion` + +## Main Gaps Between V1 And V2 + +### 1. V1 has shipper state, but not a pure FSM + +Current V1 state is embedded in: +- transport logic +- barrier logic +- retention logic +- rebuild orchestration + +V2 goal: +- one pure FSM that owns state and anchors +- transport/session code only executes actions + +### 2. V1 does not model reservation explicitly + +Current V1 asks, roughly: +- is WAL still retained? + +V2 must ask: +- is `(startLSN, endLSN]` fully recoverable? +- can the primary reserve that window until recovery completes? + +### 3. V1 has no explicit promotion debounce state + +Current V1 goes effectively: +- caught up -> `InSync` + +V2 adds: +- `PromotionHold` + +### 4. V1 rebuild completion is control-plane indirect + +Current V1: +- old `NeedsRebuild` shipper stays stuck +- master reassigns +- fresh shipper bootstraps + +V2 likely wants: +- cleaner local FSM transitions, even if control plane still participates + +### 5. 
V1 does not yet encode recovery classes + +Current V1 is mostly WAL-centric. + +V2 should support: +- `WALInline` +- `ExtentReferenced` + +without leaking storage details into replica state. + +## What Should Stay From V1 + +These V1 ideas are solid and should be preserved: + +1. `replicaFlushedLSN` as sync truth +2. barrier-driven durability confirmation +3. explicit `NeedsRebuild` +4. per-replica status reporting to master +5. retention budgets eventually forcing rebuild +6. rebuild as a separate path from normal catch-up + +## What Should Move In V2 + +These are the main redesign items: + +1. move scattered shipper/recovery state into one pure FSM +2. separate transport/session phases from durable FSM state +3. add `Bootstrapping` and `PromotionHold` +4. add recoverability proof and reservation as first-class concepts +5. make replay/rebuild admission depend on reservation, not just present-time checks +6. cleanly separate: + - control-plane role FSM + - per-replica replication FSM + +## Bottom Line + +WAL V1 already contains most of the important primitives: + +- durable progress +- barrier truth +- catch-up +- rebuild detection +- master-visible per-replica state + +What V2 changes is not the existence of these ideas. + +It changes their organization: +- from scattered transport/rebuild logic +- to one explicit, testable FSM with recovery reservations and cleaner state boundaries diff --git a/sw-block/design/wal-v2-tiny-prototype.md b/sw-block/design/wal-v2-tiny-prototype.md new file mode 100644 index 000000000..4ec9b9d3c --- /dev/null +++ b/sw-block/design/wal-v2-tiny-prototype.md @@ -0,0 +1,277 @@ +# WAL V2 Tiny Prototype + +Date: 2026-03-26 +Status: design/prototyping plan +Purpose: validate the core V2 replication logic before committing to a broader redesign + +## Goal + +Build a small, non-production prototype that proves the core V2 ideas: + +1. `ExtentBackend` abstraction +2. 3-tier replication FSM +3. async ordered sender loop +4. 
barrier-driven durability tracking +5. short-gap catch-up vs long-gap rebuild boundary +6. recovery feasibility and reservation semantics + +This prototype is for discovering: +- state complexity +- recovery correctness +- sender-loop behavior +- performance shape + +It is not for shipping. + +## Prototype Scope + +### 1. Extent backend isolation layer + +Define a clean backend interface for extent reads/writes. + +Initial implementation: +- `FileBackend` +- normal Linux file +- `pread` +- `pwrite` +- optional `fallocate` + +Do not start with raw-device allocation. + +The point is to stabilize: +- extent semantics +- base-image import/export assumptions +- checkpoint/snapshot integration points + +### 2. V2 asynchronous replication FSM + +Build a pure in-memory FSM for one replica. + +FSM owns: +- state +- anchor LSNs +- transition legality +- sync eligibility +- action suggestions +- recovery reservation metadata + +Target state set: +- `Bootstrapping` +- `InSync` +- `Lagging` +- `CatchingUp` +- `PromotionHold` +- `NeedsRebuild` +- `Rebuilding` +- `CatchUpAfterRebuild` +- `Failed` + +The FSM must not do: +- network I/O +- disk I/O +- goroutine management + +### 3. Sender loop + barrier primitive + +For each replica: +- one ordered sender goroutine +- one non-blocking enqueue path from primary write path +- one barrier/progress path + +Primary write path: +1. allocate `LSN` +2. append local WAL/journal metadata +3. enqueue to sender loop +4. return according to durability mode + +The sender loop is responsible for: +- live ordered send +- reconnect handling +- catch-up replay +- rebuild-tail replay + +## Explicit Non-Goals + +These are intentionally excluded from the tiny prototype: + +- raw allocator +- garbage collection +- `NVMe-oF` +- `ublk` +- chain replication +- CSI / control plane +- multi-replica quorum +- encryption +- real snapshot storage optimization + +These are extension layers, not the core logic being validated here. 
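The non-blocking enqueue from the primary write path into a per-replica sender queue can be sketched as below. This is a prototype-level sketch under stated assumptions (the `senderQueue` type and the full-queue policy are hypothetical): the foreground write path never blocks on a slow replica; a full queue simply means the replica has fallen behind and the recovery path takes over.

```go
package main

import "fmt"

// record is a minimal stand-in for a replicated write record.
type record struct {
	lsn uint64
}

// senderQueue is a hypothetical bounded, ordered queue feeding one
// per-replica sender loop.
type senderQueue struct {
	ch chan record
}

// enqueue never blocks the foreground write path. On a full queue it
// returns false; the caller would mark the replica Lagging and let the
// sender loop drive catch-up later.
func (q *senderQueue) enqueue(r record) bool {
	select {
	case q.ch <- r:
		return true
	default:
		return false // queue full: replica falls behind
	}
}

func main() {
	q := &senderQueue{ch: make(chan record, 2)}
	fmt.Println(q.enqueue(record{lsn: 1})) // true
	fmt.Println(q.enqueue(record{lsn: 2})) // true
	fmt.Println(q.enqueue(record{lsn: 3})) // false: bounded queue is full
}
```

Because a single goroutine drains the channel in FIFO order, strict per-replica LSN ordering falls out of the queue itself rather than from locking in the write path.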
+ +## Design Principle + +Those excluded items are not being rejected. + +They are treated as: +- extensions of the core logic + +The prototype should be designed so they can later plug in without rewriting the state machine. + +## Suggested Layout + +One reasonable layout: + +- `weed/storage/blockvol/fsmv2/` + - `fsm.go` + - `events.go` + - `actions.go` + - `fsm_test.go` +- `weed/storage/blockvol/prototypev2/` + - `backend.go` + - `file_backend.go` + - `sender_loop.go` + - `barrier.go` + - `prototype_test.go` + +Preferred direction: +- keep it close enough to production packages that later reuse is easy +- but clearly marked experimental + +## Core Interfaces + +### Extent backend + +Example direction: + +```go +type ExtentBackend interface { + ReadAt(p []byte, off int64) (int, error) + WriteAt(p []byte, off int64) (int, error) + Sync() error + Size() uint64 +} +``` + +### FSM + +Example direction: + +```go +type ReplicaFSM struct { + // state + // epoch + // anchor LSNs + // reservation metadata +} + +func (f *ReplicaFSM) Apply(evt ReplicaEvent) ([]ReplicaAction, error) +``` + +### Sender loop + +Example direction: + +```go +type SenderLoop struct { + // input queue + // FSM + // transport mock/adapter +} +``` + +## What The Prototype Must Prove + +### A. FSM correctness + +The FSM must show that the state set is sufficient and coherent. + +Key scenarios: + +1. `Bootstrapping -> InSync` +2. `InSync -> Lagging -> CatchingUp -> PromotionHold -> InSync` +3. `Lagging -> NeedsRebuild -> Rebuilding -> CatchUpAfterRebuild -> PromotionHold -> InSync` +4. epoch change aborts catch-up +5. epoch change aborts rebuild +6. reservation-lost aborts catch-up +7. rebuild-too-slow aborts reconstruction +8. flapping replica does not instantly re-enter `InSync` + +### B. Sender ordering + +The sender loop must prove: +- strict LSN order per replica +- no inline ship races from concurrent writes +- decoupled foreground write path + +### C. 
Barrier semantics + +Barrier must prove: +- it waits on replica progress +- it uses `flushedLSN`, not transport guesses +- it can drive promotion eligibility cleanly + +### D. Recovery boundary + +Prototype must make the handoff explicit: +- recent lag -> reserved replay window +- long lag -> rebuild from base image + trailing replay + +### E. Recovery reservation + +Prototype must make this explicit: +- a window is not enough +- it must be provable and then reserved +- losing the reservation must abort recovery cleanly + +## Performance Questions The Prototype Should Answer + +Not benchmark headlines. + +Instead: + +1. how much contention disappears from the hot write path after removing inline ship +2. how queue depth grows under slow replicas +3. when catch-up stops converging +4. how expensive promotion hold is +5. how much complexity is added by rebuild-tail replay +6. how much complexity is added by reservation management + +## Success Criteria + +The tiny prototype is successful if it gives clear answers to: + +1. can the V2 FSM be made explicit and testable? +2. does sender-loop ordering materially simplify the replication path? +3. is the catch-up vs rebuild boundary coherent under a moving primary head? +4. does reservation-based recoverability make the design safer and clearer? +5. does the architecture look simpler than extending WAL V1 forever? + +## Failure Criteria + +The prototype should be considered unsuccessful if: + +1. state count explodes and remains hard to reason about +2. sender loop does not materially simplify ordering/recovery +3. promotion and recovery rules remain too coupled to ad hoc timers and network callbacks +4. rebuild-from-base + trailing replay is still ambiguous even in a controlled prototype +5. reservation handling turns into unbounded complexity + +## Relationship To WAL V1 + +WAL V1 remains the current delivery line. 
+ +This prototype is not a replacement for: +- `CP13-6` +- `CP13-7` +- `CP13-8` +- `CP13-9` + +It exists to inform what should move into WAL V2 after WAL V1 closes. + +## Bottom Line + +The tiny prototype should validate the core logic only: + +- clean backend boundary +- explicit FSM +- ordered async sender +- recoverability as a proof-plus-reservation problem +- rebuild as a separate recovery mode, not a WAL accident diff --git a/sw-block/private/README.md b/sw-block/private/README.md new file mode 100644 index 000000000..fcbad1da6 --- /dev/null +++ b/sw-block/private/README.md @@ -0,0 +1,14 @@ +# private + +Deprecated in favor of `../.private/`. + +Private working area for: +- design sketches +- draft notes +- temporary comparison docs +- prototype experiments not ready to move into shared design docs + +Keep production-independent work here until it is ready to be promoted into: +- `../design/` +- `../prototype/` +- or the main repo docs under `learn/projects/sw-block/` diff --git a/sw-block/prototype/README.md b/sw-block/prototype/README.md new file mode 100644 index 000000000..dbc63bd30 --- /dev/null +++ b/sw-block/prototype/README.md @@ -0,0 +1,23 @@ +# V2 Prototype + +Experimental WAL V2 prototype code lives here. + +Current prototype: +- `fsmv2/`: pure in-memory replication FSM prototype +- `volumefsm/`: volume-level orchestrator prototype above `fsmv2` +- `distsim/`: early distributed/data-correctness simulator with synthetic 4K block values + +Rules: +- do not wire this directly into WAL V1 production code +- keep interfaces and tests focused on architecture learning +- promote pieces into production only after V2 design stabilizes + +## Windows test workflow + +Because normal `go test` may be blocked by Windows Defender when it executes temporary test binaries from `%TEMP%`, use: + +```powershell +powershell -ExecutionPolicy Bypass -File .\sw-block\prototype\run-tests.ps1 +``` + +This builds test binaries into the workspace and runs them directly. 
diff --git a/sw-block/prototype/distsim/cluster.go b/sw-block/prototype/distsim/cluster.go new file mode 100644 index 000000000..6e65a5e55 --- /dev/null +++ b/sw-block/prototype/distsim/cluster.go @@ -0,0 +1,1120 @@ +package distsim + +import ( + "fmt" + "sort" +) + +type Role string + +const ( + RolePrimary Role = "primary" + RoleReplica Role = "replica" +) + +type CommitMode string + +const ( + CommitBestEffort CommitMode = "best_effort" + CommitSyncAll CommitMode = "sync_all" + CommitSyncQuorum CommitMode = "sync_quorum" +) + +type MessageKind string + +const ( + MsgWrite MessageKind = "write" + MsgBarrier MessageKind = "barrier" + MsgBarrierAck MessageKind = "barrier_ack" +) + +type Message struct { + Kind MessageKind + From string + To string + Epoch uint64 + Write Write + TargetLSN uint64 +} + +type inFlightMessage struct { + deliverAt uint64 + msg Message +} + +type PendingCommit struct { + Write Write + DurableOn map[string]bool + Committed bool + CommittedAt uint64 +} + +type RecoveryClass string + +const ( + RecoveryClassWALInline RecoveryClass = "wal_inline" + RecoveryClassExtentReferenced RecoveryClass = "extent_referenced" +) + +type RecoveryRecord struct { + Write + Class RecoveryClass + PayloadResolvable bool +} + +// ReplicaNodeState tracks per-node replication protocol state. +type ReplicaNodeState string + +const ( + NodeStateInSync ReplicaNodeState = "InSync" + NodeStateLagging ReplicaNodeState = "Lagging" + NodeStateCatchingUp ReplicaNodeState = "CatchingUp" + NodeStateNeedsRebuild ReplicaNodeState = "NeedsRebuild" + NodeStateRebuilding ReplicaNodeState = "Rebuilding" +) + +// MessageRejectReason tracks why a message was rejected. 
+type MessageRejectReason string + +const ( + RejectEpochMismatch MessageRejectReason = "epoch_mismatch" + RejectNodeDown MessageRejectReason = "node_down" + RejectLinkDown MessageRejectReason = "link_down" + RejectStaleEndpoint MessageRejectReason = "stale_endpoint" + RejectBarrierExpired MessageRejectReason = "barrier_expired" +) + +// CachedEndpoint represents the primary's cached view of a replica's address. +type CachedEndpoint struct { + DataAddr string + EndpointVersion uint64 +} + +type NodeModel struct { + ID string + Role Role + Epoch uint64 + Running bool + + Storage *Storage + + // Endpoint identity. + DataAddr string // current data-plane address + EndpointVersion uint64 // bumped on address change (restart with new port) + + // Protocol state (V2 model extension). + ReplicaState ReplicaNodeState + CatchupAttempts int // consecutive catch-up attempts without convergence +} + +func NewNodeModel(id string, role Role, epoch uint64) *NodeModel { + return &NodeModel{ + ID: id, + Role: role, + Epoch: epoch, + Running: true, + Storage: NewStorage(), + ReplicaState: NodeStateInSync, + DataAddr: fmt.Sprintf("%s:9333", id), + EndpointVersion: 1, + } +} + +type Coordinator struct { + Epoch uint64 + PrimaryID string + Mode CommitMode + Members []string + CommittedLSN uint64 +} + +// RejectedMessage records a message that was rejected with a reason. +type RejectedMessage struct { + Msg Message + Reason MessageRejectReason + Time uint64 +} + +// DeliveryResult records the outcome of a message delivery attempt. +type DeliveryResult struct { + Msg Message + Accepted bool + Reason MessageRejectReason // empty if accepted + Time uint64 +} + +type Cluster struct { + Now uint64 + Coordinator *Coordinator + Reference *Reference + Protocol ProtocolPolicy // version-aware protocol decisions + + Nodes map[string]*NodeModel + Links map[string]map[string]bool + Queue []inFlightMessage + Pending map[uint64]*PendingCommit + nextLSN uint64 + + // Protocol tracking. 
+ Rejected []RejectedMessage // messages rejected by deliver() + Deliveries []DeliveryResult // all delivery attempts with accept/reject outcome + + // Endpoint tracking: primary's cached view of replica addresses. + CachedReplicaEndpoints map[string]CachedEndpoint + + // Timeout tracking (eventsim layer). + Timeouts []PendingTimeout + FiredTimeouts []FiredTimeout // timeouts that fired with authority to mutate state + IgnoredTimeouts []FiredTimeout // timeouts that reached deadline but had no authority (stale) + ExpiredBarriers map[barrierExpiredKey]bool // late acks for expired barriers are rejected + TickLog []TickEvent // ordered event log for race debugging + + // Session tracking (ownership layer). + Sessions map[string]*TrackedSession // active session per replica (keyed by ReplicaID) + SessionHistory []*TrackedSession // all sessions for debugging + nextSessionID uint64 + + // Timeout config: 0 = disabled. + BarrierTimeoutTicks uint64 // auto-register barrier timeout per replica on CommitWrite + + // Catch-up config. + MaxCatchupAttempts int // default 10; escalates to NeedsRebuild +} + +func NewCluster(mode CommitMode, primaryID string, replicaIDs ...string) *Cluster { + return NewClusterWithProtocol(mode, ProtocolV2, primaryID, replicaIDs...) 
+} + +func NewClusterWithProtocol(mode CommitMode, proto ProtocolVersion, primaryID string, replicaIDs ...string) *Cluster { + c := &Cluster{ + Coordinator: &Coordinator{ + Epoch: 1, + PrimaryID: primaryID, + Mode: mode, + Members: append([]string{primaryID}, replicaIDs...), + }, + Protocol: ProtocolPolicy{Version: proto}, + Reference: NewReference(), + Nodes: map[string]*NodeModel{}, + Links: map[string]map[string]bool{}, + MaxCatchupAttempts: 10, + Pending: map[uint64]*PendingCommit{}, + } + c.CachedReplicaEndpoints = map[string]CachedEndpoint{} + c.ExpiredBarriers = map[barrierExpiredKey]bool{} + c.Sessions = map[string]*TrackedSession{} + c.AddNode(primaryID, RolePrimary) + for _, id := range replicaIDs { + c.AddNode(id, RoleReplica) + n := c.Nodes[id] + c.CachedReplicaEndpoints[id] = CachedEndpoint{ + DataAddr: n.DataAddr, + EndpointVersion: n.EndpointVersion, + } + } + for _, from := range c.Coordinator.Members { + for _, to := range c.Coordinator.Members { + if from == to { + continue + } + c.Connect(from, to) + } + } + return c +} + +func (c *Cluster) AddNode(id string, role Role) { + c.Nodes[id] = NewNodeModel(id, role, c.Coordinator.Epoch) + if c.Links[id] == nil { + c.Links[id] = map[string]bool{} + } +} + +func (c *Cluster) Connect(from, to string) { + if c.Links[from] == nil { + c.Links[from] = map[string]bool{} + } + c.Links[from][to] = true +} + +func (c *Cluster) Disconnect(from, to string) { + if c.Links[from] == nil { + return + } + delete(c.Links[from], to) +} + +func (c *Cluster) StopNode(id string) { + if n := c.Nodes[id]; n != nil { + n.Running = false + } +} + +func (c *Cluster) StartNode(id string) { + if n := c.Nodes[id]; n != nil { + n.Running = true + n.Epoch = c.Coordinator.Epoch + } +} + +func (c *Cluster) Primary() *NodeModel { + return c.Nodes[c.Coordinator.PrimaryID] +} + +func (c *Cluster) replicaIDs() []string { + ids := make([]string, 0, len(c.Coordinator.Members)-1) + for _, id := range c.Coordinator.Members { + if id != 
c.Coordinator.PrimaryID { + ids = append(ids, id) + } + } + sort.Strings(ids) + return ids +} + +func (c *Cluster) quorumSize() int { + return len(c.Coordinator.Members)/2 + 1 +} + +func (c *Cluster) durableAckCount(p *PendingCommit) int { + if p == nil { + return 0 + } + count := 0 + for _, id := range c.Coordinator.Members { + if p.DurableOn[id] { + count++ + } + } + return count +} + +func (c *Cluster) commitSatisfied(p *PendingCommit) bool { + switch c.Coordinator.Mode { + case CommitBestEffort: + return true + case CommitSyncAll: + return c.durableAckCount(p) == len(c.Coordinator.Members) + case CommitSyncQuorum: + return c.durableAckCount(p) >= c.quorumSize() + default: + return false + } +} + +func (c *Cluster) CommitWrite(block uint64) uint64 { + primary := c.Primary() + if primary == nil || !primary.Running || primary.Epoch != c.Coordinator.Epoch { + return 0 + } + c.nextLSN++ + w := Write{LSN: c.nextLSN, Block: block, Value: c.nextLSN} + primary.Storage.AppendWrite(w) + primary.Storage.AdvanceFlush(w.LSN) + c.Reference.Apply(w) + c.Pending[w.LSN] = &PendingCommit{ + Write: w, + DurableOn: map[string]bool{primary.ID: true}, + Committed: false, + } + c.refreshCommits() + + for _, id := range c.replicaIDs() { + c.enqueue(Message{ + Kind: MsgWrite, + From: primary.ID, + To: id, + Epoch: c.Coordinator.Epoch, + Write: w, + }, c.Now+1) + c.enqueue(Message{ + Kind: MsgBarrier, + From: primary.ID, + To: id, + Epoch: c.Coordinator.Epoch, + TargetLSN: w.LSN, + }, c.Now+2) + if c.BarrierTimeoutTicks > 0 { + c.RegisterTimeout(TimeoutBarrier, id, w.LSN, c.Now+c.BarrierTimeoutTicks) + } + } + return w.LSN +} + +// StaleWrite simulates a partitioned old-primary attempting writes through +// the message protocol at a stale epoch. The writes go through enqueue/deliver +// and should be rejected by epoch fencing — not silently succeed. +// Returns the number of messages that were delivered (should be 0 if fencing works). 
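The epoch-fencing rule that `StaleWrite` exercises reduces to a single comparison: a receiver accepts a message only when its epoch matches the coordinator's current epoch. A minimal standalone sketch of that rule (the `message` and `accept` names are illustrative, not part of distsim):

```go
package main

import "fmt"

// message carries the sender's epoch; receivers reject anything stale.
type message struct {
	epoch uint64
	lsn   uint64
}

// accept returns true only when the message epoch matches the receiver's
// current epoch. This is what fences a partitioned old primary: after a
// promotion bumps the epoch, all of its in-flight writes fail this check.
func accept(receiverEpoch uint64, m message) bool {
	return m.epoch == receiverEpoch
}

func main() {
	const epochAfterPromotion = 2
	stale := message{epoch: 1, lsn: 99}  // old primary's write
	fresh := message{epoch: 2, lsn: 5}   // new primary's write
	fmt.Println(accept(epochAfterPromotion, stale)) // false: fenced
	fmt.Println(accept(epochAfterPromotion, fresh)) // true
}
```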
+func (c *Cluster) StaleWrite(staleNodeID string, staleEpoch uint64, block uint64) int { + node := c.Nodes[staleNodeID] + if node == nil || !node.Running { + return 0 + } + // Allocate a fake LSN from the stale node's perspective. + staleLSN := node.Storage.ReceivedLSN + 1 + w := Write{LSN: staleLSN, Block: block, Value: staleLSN} + node.Storage.AppendWrite(w) + node.Storage.AdvanceFlush(staleLSN) + + rejectedBefore := len(c.Rejected) + + // Try to ship to all other nodes at the stale epoch. + for _, id := range c.Coordinator.Members { + if id == staleNodeID { + continue + } + c.enqueue(Message{ + Kind: MsgWrite, + From: staleNodeID, + To: id, + Epoch: staleEpoch, + Write: w, + }, c.Now+1) + c.enqueue(Message{ + Kind: MsgBarrier, + From: staleNodeID, + To: id, + Epoch: staleEpoch, + TargetLSN: staleLSN, + }, c.Now+2) + } + c.TickN(5) + + // Count how many of those messages were actually delivered (not rejected). + rejectedAfter := len(c.Rejected) + rejectedCount := rejectedAfter - rejectedBefore + // Each target gets 2 messages (write + barrier). Total sent = (members-1) * 2. + totalSent := (len(c.Coordinator.Members) - 1) * 2 + delivered := totalSent - rejectedCount + return delivered +} + +// CatchUpWithEscalation attempts partial catch-up and tracks attempts. +// If max attempts exceeded, transitions the replica to NeedsRebuild. +// Returns true if catch-up converged, false if escalated or in-progress. +func (c *Cluster) CatchUpWithEscalation(replicaID string, batchSize int) bool { + replica := c.Nodes[replicaID] + primary := c.Primary() + if replica == nil || primary == nil { + return false + } + // NeedsRebuild is sticky — don't attempt catch-up. 
+ if replica.ReplicaState == NodeStateNeedsRebuild { + return false + } + + target := c.Coordinator.CommittedLSN + start := replica.Storage.FlushedLSN + + if start >= target { + replica.ReplicaState = NodeStateInSync + replica.CatchupAttempts = 0 + c.cancelRecoveryTimeouts(replicaID) // auto-cancel on convergence + return true + } + + recovered, err := c.RecoverReplicaFromPrimaryPartial(replicaID, start, target, batchSize) + if err != nil { + replica.CatchupAttempts++ + if replica.CatchupAttempts >= c.MaxCatchupAttempts { + replica.ReplicaState = NodeStateNeedsRebuild + c.cancelRecoveryTimeouts(replicaID) // auto-cancel on escalation + } + return false + } + + if recovered >= target { + replica.ReplicaState = NodeStateInSync + replica.CatchupAttempts = 0 + c.cancelRecoveryTimeouts(replicaID) // auto-cancel on convergence + return true + } + + replica.ReplicaState = NodeStateCatchingUp + replica.CatchupAttempts++ + if replica.CatchupAttempts >= c.MaxCatchupAttempts { + replica.ReplicaState = NodeStateNeedsRebuild + c.cancelRecoveryTimeouts(replicaID) // auto-cancel on escalation + } + return false +} + +// RejectedByReason returns the count of rejected messages for a specific reason. +func (c *Cluster) RejectedByReason(reason MessageRejectReason) int { + count := 0 + for _, r := range c.Rejected { + if r.Reason == reason { + count++ + } + } + return count +} + +func (c *Cluster) recordDelivery(msg Message, accepted bool, reason MessageRejectReason) { + c.Deliveries = append(c.Deliveries, DeliveryResult{ + Msg: msg, Accepted: accepted, Reason: reason, Time: c.Now, + }) + // Use Write.LSN for MsgWrite, TargetLSN for barrier/ack. 
+ lsn := msg.TargetLSN + if msg.Kind == MsgWrite { + lsn = msg.Write.LSN + } + if accepted { + c.logEvent(EventDeliveryAccepted, fmt.Sprintf("%s %s→%s lsn=%d", + msg.Kind, msg.From, msg.To, lsn)) + } else { + c.Rejected = append(c.Rejected, RejectedMessage{Msg: msg, Reason: reason, Time: c.Now}) + c.logEvent(EventDeliveryRejected, fmt.Sprintf("%s %s→%s lsn=%d reason=%s", + msg.Kind, msg.From, msg.To, lsn, reason)) + } +} + +// AcceptedCount returns the number of accepted deliveries. +func (c *Cluster) AcceptedCount() int { + count := 0 + for _, d := range c.Deliveries { + if d.Accepted { + count++ + } + } + return count +} + +// AcceptedByKind returns accepted deliveries of a specific message kind. +func (c *Cluster) AcceptedByKind(kind MessageKind) int { + count := 0 + for _, d := range c.Deliveries { + if d.Accepted && d.Msg.Kind == kind { + count++ + } + } + return count +} + +func (c *Cluster) enqueue(msg Message, deliverAt uint64) { + c.Queue = append(c.Queue, inFlightMessage{deliverAt: deliverAt, msg: msg}) +} + +func (c *Cluster) Tick() { + c.Now++ + current := append([]inFlightMessage(nil), c.Queue...) + c.Queue = nil + remaining := make([]inFlightMessage, 0, len(current)) + for _, item := range current { + if item.deliverAt > c.Now { + remaining = append(remaining, item) + continue + } + if !c.deliver(item.msg) { + remaining = append(remaining, item) + } + } + c.Queue = append(c.Queue, remaining...) 
+	c.fireTimeouts() // eventsim: fire timeouts AFTER message delivery (data before timers)
+	c.refreshCommits()
+}
+
+func (c *Cluster) TickN(n int) {
+	for i := 0; i < n; i++ {
+		c.Tick()
+	}
+}
+
+func (c *Cluster) deliver(msg Message) bool {
+	from := c.Nodes[msg.From]
+	to := c.Nodes[msg.To]
+	if from == nil || to == nil || !from.Running || !to.Running {
+		c.recordDelivery(msg, false, RejectNodeDown)
+		return true
+	}
+	if msg.Epoch != c.Coordinator.Epoch || to.Epoch != c.Coordinator.Epoch {
+		c.recordDelivery(msg, false, RejectEpochMismatch)
+		return true
+	}
+	if !c.Links[msg.From][msg.To] {
+		c.recordDelivery(msg, false, RejectLinkDown)
+		return true
+	}
+	// Endpoint match: primary→replica messages fail if cached endpoint is stale.
+	if c.CachedReplicaEndpoints != nil && msg.From == c.Coordinator.PrimaryID {
+		if cached, ok := c.CachedReplicaEndpoints[msg.To]; ok {
+			if cached.EndpointVersion != to.EndpointVersion {
+				c.recordDelivery(msg, false, RejectStaleEndpoint)
+				return true
+			}
+		}
+	}
+	// Reject a late ack for a barrier that already timed out. This must happen
+	// before the blanket accept below, so the same message is not recorded
+	// twice (once as accepted, once as rejected).
+	if msg.Kind == MsgBarrierAck && c.ExpiredBarriers[barrierExpiredKey{msg.From, msg.TargetLSN}] {
+		c.recordDelivery(msg, false, RejectBarrierExpired)
+		return true
+	}
+
+	c.recordDelivery(msg, true, "")
+
+	switch msg.Kind {
+	case MsgWrite:
+		to.Storage.AppendWrite(msg.Write)
+	case MsgBarrier:
+		if to.Storage.ReceivedLSN >= msg.TargetLSN {
+			to.Storage.AdvanceFlush(msg.TargetLSN)
+			c.enqueue(Message{
+				Kind:      MsgBarrierAck,
+				From:      to.ID,
+				To:        msg.From,
+				Epoch:     msg.Epoch,
+				TargetLSN: msg.TargetLSN,
+			}, c.Now+1)
+		} else {
+			c.enqueue(msg, c.Now+1)
+		}
+	case MsgBarrierAck:
+		if pending := c.Pending[msg.TargetLSN]; pending != nil {
+			pending.DurableOn[msg.From] = true
+		}
+		c.CancelTimeout(TimeoutBarrier, msg.From, msg.TargetLSN)
+	}
+	return true
+}
+
+func (c *Cluster) refreshCommits() {
+	lsns := make([]uint64, 0, len(c.Pending))
+	for lsn := range c.Pending {
+		lsns = append(lsns, lsn)
+	}
+	sort.Slice(lsns, func(i, j int) bool { return lsns[i] < lsns[j] })
+	for _, lsn := range lsns {
+		p := c.Pending[lsn]
+		if !p.Committed && c.commitSatisfied(p) {
+			p.Committed = true
+			p.CommittedAt = c.Now
+		}
+	}
+
+	for {
+		next := c.Coordinator.CommittedLSN + 1
+		p := c.Pending[next]
+		if p == nil || !p.Committed {
+			break
+		}
+		c.Coordinator.CommittedLSN = next
+	}
+}
+
+func (c *Cluster) Promote(newPrimaryID string) error {
+	newPrimary := c.Nodes[newPrimaryID]
+	if newPrimary == nil || !newPrimary.Running {
+		return fmt.Errorf("distsim: cannot promote missing/down node %s", newPrimaryID)
+	}
+	oldPrimary := c.Primary()
+	c.Coordinator.Epoch++
+	c.Coordinator.PrimaryID = newPrimaryID
+	for _, n := range c.Nodes {
+		if n == nil || !n.Running {
+			continue
+		}
+		n.Epoch = c.Coordinator.Epoch
+		n.Role = RoleReplica
+	}
+	newPrimary.Role = RolePrimary
+	if oldPrimary != nil && oldPrimary.ID != newPrimaryID {
+		oldPrimary.Role = RoleReplica
+	}
+	// Epoch bump invalidates all active recovery sessions.
+	c.InvalidateAllSessions("epoch_bump_promotion")
+	// Refresh cached endpoints for new primary's view of replicas.
+ c.CachedReplicaEndpoints = map[string]CachedEndpoint{} + for _, id := range c.Coordinator.Members { + if id == newPrimaryID { + continue + } + if n := c.Nodes[id]; n != nil { + c.CachedReplicaEndpoints[id] = CachedEndpoint{ + DataAddr: n.DataAddr, + EndpointVersion: n.EndpointVersion, + } + } + } + return nil +} + +func (c *Cluster) RecoverReplicaFromPrimary(replicaID string, startExclusive, endInclusive uint64) error { + primary := c.Primary() + replica := c.Nodes[replicaID] + if primary == nil || replica == nil { + return fmt.Errorf("distsim: missing primary or replica") + } + if !primary.Running || !replica.Running { + return fmt.Errorf("distsim: primary or replica not running") + } + for _, w := range writesInRange(primary.Storage.WAL, startExclusive, endInclusive) { + replica.Storage.AppendWrite(w) + } + replica.Storage.AdvanceFlush(endInclusive) + return nil +} + +func (c *Cluster) RecoverReplicaFromPrimaryPartial(replicaID string, startExclusive, endInclusive uint64, maxWrites int) (uint64, error) { + primary := c.Primary() + replica := c.Nodes[replicaID] + if primary == nil || replica == nil { + return startExclusive, fmt.Errorf("distsim: missing primary or replica") + } + if !primary.Running || !replica.Running { + return startExclusive, fmt.Errorf("distsim: primary or replica not running") + } + lastRecovered := startExclusive + applied := 0 + for _, w := range writesInRange(primary.Storage.WAL, startExclusive, endInclusive) { + if applied >= maxWrites { + break + } + replica.Storage.AppendWrite(w) + lastRecovered = w.LSN + applied++ + } + replica.Storage.AdvanceFlush(lastRecovered) + return lastRecovered, nil +} + +func (c *Cluster) RecoverReplicaFromPrimaryReserved(replicaID string, startExclusive, endInclusive, reservationExpiry uint64) error { + primary := c.Primary() + replica := c.Nodes[replicaID] + if primary == nil || replica == nil { + return fmt.Errorf("distsim: missing primary or replica") + } + if !primary.Running || !replica.Running { + 
return fmt.Errorf("distsim: primary or replica not running") + } + lastRecovered := startExclusive + for _, w := range writesInRange(primary.Storage.WAL, startExclusive, endInclusive) { + if c.Now >= reservationExpiry { + replica.Storage.AdvanceFlush(lastRecovered) + return fmt.Errorf("distsim: recovery reservation expired at time %d before LSN %d", c.Now, w.LSN) + } + replica.Storage.AppendWrite(w) + lastRecovered = w.LSN + c.Tick() + } + replica.Storage.AdvanceFlush(lastRecovered) + if lastRecovered != endInclusive { + return fmt.Errorf("distsim: incomplete recovery range, got %d want %d", lastRecovered, endInclusive) + } + return nil +} + +func (c *Cluster) RebuildReplicaFromSnapshot(replicaID, snapshotID string, targetLSN uint64) error { + primary := c.Primary() + replica := c.Nodes[replicaID] + if primary == nil || replica == nil { + return fmt.Errorf("distsim: missing primary or replica") + } + snap, ok := primary.Storage.Snapshots[snapshotID] + if !ok { + return fmt.Errorf("distsim: snapshot %s missing", snapshotID) + } + replica.Storage.LoadSnapshot(snap) + for _, w := range writesInRange(primary.Storage.WAL, snap.LSN, targetLSN) { + replica.Storage.AppendWrite(w) + } + replica.Storage.AdvanceFlush(targetLSN) + return nil +} + +func (c *Cluster) RebuildReplicaFromSnapshotPartial(replicaID, snapshotID string, targetLSN uint64, maxWrites int) (uint64, error) { + primary := c.Primary() + replica := c.Nodes[replicaID] + if primary == nil || replica == nil { + return 0, fmt.Errorf("distsim: missing primary or replica") + } + snap, ok := primary.Storage.Snapshots[snapshotID] + if !ok { + return 0, fmt.Errorf("distsim: snapshot %s missing", snapshotID) + } + replica.Storage.LoadSnapshot(snap) + lastRecovered := snap.LSN + applied := 0 + for _, w := range writesInRange(primary.Storage.WAL, snap.LSN, targetLSN) { + if applied >= maxWrites { + break + } + replica.Storage.AppendWrite(w) + lastRecovered = w.LSN + applied++ + } + 
replica.Storage.AdvanceFlush(lastRecovered) + return lastRecovered, nil +} + +func (c *Cluster) InjectMessage(msg Message, deliverAt uint64) { + c.enqueue(msg, deliverAt) +} + +func FullyRecoverable(records []RecoveryRecord) bool { + for _, r := range records { + switch r.Class { + case RecoveryClassWALInline: + continue + case RecoveryClassExtentReferenced: + if !r.PayloadResolvable { + return false + } + default: + return false + } + } + return true +} + +func ApplyRecoveryRecords(records []RecoveryRecord, startExclusive, endInclusive uint64) map[uint64]uint64 { + state := map[uint64]uint64{} + for _, r := range records { + if r.LSN <= startExclusive { + continue + } + if r.LSN > endInclusive { + break + } + state[r.Block] = r.Value + } + return state +} + +// PromotionCandidate represents a replica's suitability for primary promotion. +type PromotionCandidate struct { + ID string + FlushedLSN uint64 + State ReplicaNodeState + Running bool +} + +// statePromotionRank returns priority rank for promotion. Lower = better. +func statePromotionRank(s ReplicaNodeState) int { + switch s { + case NodeStateInSync: + return 0 + case NodeStateCatchingUp: + return 1 + case NodeStateLagging: + return 2 + case NodeStateRebuilding: + return 3 + case NodeStateNeedsRebuild: + return 4 + default: + return 5 + } +} + +// PromotionCandidates returns all replicas ranked by promotion suitability. +// Ranking: Running first, then by state rank (InSync > CatchingUp > ... > NeedsRebuild), +// then by FlushedLSN descending, then by ID (alphabetical tie-break). 
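The ranking described above (running first, then state rank, then flushed LSN descending, then ID) can be sketched standalone. This is an illustrative reimplementation over a simplified candidate struct, not the distsim code itself:

```go
package main

import (
	"fmt"
	"sort"
)

// rank maps a replica state to a promotion preference; lower is better.
func rank(state string) int {
	order := map[string]int{
		"InSync": 0, "CatchingUp": 1, "Lagging": 2, "Rebuilding": 3, "NeedsRebuild": 4,
	}
	if r, ok := order[state]; ok {
		return r
	}
	return 5 // unknown states sort last
}

type candidate struct {
	id         string
	flushedLSN uint64
	state      string
	running    bool
}

// sortCandidates orders candidates: running first, then by state rank,
// then by flushed LSN descending, then by ID as a stable tie-break.
func sortCandidates(cs []candidate) {
	sort.SliceStable(cs, func(i, j int) bool {
		a, b := cs[i], cs[j]
		if a.running != b.running {
			return a.running
		}
		if ra, rb := rank(a.state), rank(b.state); ra != rb {
			return ra < rb
		}
		if a.flushedLSN != b.flushedLSN {
			return a.flushedLSN > b.flushedLSN
		}
		return a.id < b.id
	})
}

func main() {
	cs := []candidate{
		{"r1", 10, "Lagging", true},
		{"r2", 12, "InSync", true},
		{"r3", 15, "NeedsRebuild", true},
	}
	sortCandidates(cs)
	fmt.Println(cs[0].id) // r2: InSync outranks a higher LSN in NeedsRebuild
}
```

Note the deliberate choice that state rank dominates flushed LSN: a `NeedsRebuild` replica with more data is still a worse primary than an `InSync` one.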
+func (c *Cluster) PromotionCandidates() []PromotionCandidate { + var candidates []PromotionCandidate + for _, id := range c.replicaIDs() { + n := c.Nodes[id] + if n == nil { + continue + } + candidates = append(candidates, PromotionCandidate{ + ID: id, + FlushedLSN: n.Storage.FlushedLSN, + State: n.ReplicaState, + Running: n.Running, + }) + } + sort.SliceStable(candidates, func(i, j int) bool { + ci, cj := candidates[i], candidates[j] + if ci.Running != cj.Running { + return ci.Running + } + ri, rj := statePromotionRank(ci.State), statePromotionRank(cj.State) + if ri != rj { + return ri < rj + } + if ci.FlushedLSN != cj.FlushedLSN { + return ci.FlushedLSN > cj.FlushedLSN + } + return ci.ID < cj.ID + }) + return candidates +} + +// BestPromotionCandidate returns the best eligible candidate. +// Uses EvaluateCandidateEligibility: must be running, epoch-aligned, +// state-eligible (not NeedsRebuild/Rebuilding), and have flushed data. +// Returns "" if no eligible candidate exists. +func (c *Cluster) BestPromotionCandidate() string { + eligible := c.EligiblePromotionCandidates() + if len(eligible) == 0 { + return "" + } + return eligible[0].ID +} + +// BestPromotionCandidateDesperate returns the best running candidate regardless +// of state. Use only as explicit fallback when no safe candidate exists and +// availability is prioritized over consistency. +func (c *Cluster) BestPromotionCandidateDesperate() string { + candidates := c.PromotionCandidates() + if len(candidates) == 0 || !candidates[0].Running { + return "" + } + return candidates[0].ID +} + +// === Endpoint lifecycle === + +// RestartNodeWithNewAddress simulates a node restarting on a different address. +// The EndpointVersion bumps, making the primary's cached endpoint stale. +// Messages from primary to this node will be rejected as RejectStaleEndpoint +// until the control-plane flow updates the cached endpoint. 
+func (c *Cluster) RestartNodeWithNewAddress(id string) { + n := c.Nodes[id] + if n == nil { + return + } + n.Running = true + n.Epoch = c.Coordinator.Epoch + n.EndpointVersion++ + n.DataAddr = fmt.Sprintf("%s:v%d", id, n.EndpointVersion) + // Endpoint change invalidates any active session for this replica. + c.InvalidateReplicaSession(id, "endpoint_changed") +} + +// === Control-plane flow: heartbeat → detect → assignment → trigger === + +// HeartbeatReport represents what a node reports to the coordinator. +type HeartbeatReport struct { + NodeID string + DataAddr string + EndpointVersion uint64 + FlushedLSN uint64 + State ReplicaNodeState + Running bool +} + +// ReportHeartbeat returns the current heartbeat for a node. +func (c *Cluster) ReportHeartbeat(nodeID string) HeartbeatReport { + n := c.Nodes[nodeID] + if n == nil { + return HeartbeatReport{NodeID: nodeID} + } + return HeartbeatReport{ + NodeID: nodeID, + DataAddr: n.DataAddr, + EndpointVersion: n.EndpointVersion, + FlushedLSN: n.Storage.FlushedLSN, + State: n.ReplicaState, + Running: n.Running, + } +} + +// AssignmentUpdate represents a master-driven update to the primary's replica view. +type AssignmentUpdate struct { + TargetNodeID string // which primary receives this + UpdatedEndpoints map[string]CachedEndpoint // new cached endpoints +} + +// CoordinatorDetectEndpointChange checks if a heartbeat reveals an address change. +// Returns an assignment update if the endpoint version differs from cached, nil otherwise. 
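The detection step above hinges on one comparison: the heartbeat's endpoint version versus the cached one. A minimal sketch under assumed names (`cachedEndpoint`, `needsUpdate` are illustrative, not the distsim API):

```go
package main

import "fmt"

// cachedEndpoint is the coordinator's last known view of a replica address.
type cachedEndpoint struct {
	addr    string
	version uint64
}

// needsUpdate reports whether a heartbeat reveals a stale cache entry: the
// node restarted on a new address and bumped its endpoint version.
func needsUpdate(cached cachedEndpoint, reportedVersion uint64) bool {
	return cached.version != reportedVersion
}

func main() {
	cached := cachedEndpoint{addr: "r2:9333", version: 1}
	fmt.Println(needsUpdate(cached, 1)) // false: cache still valid
	fmt.Println(needsUpdate(cached, 2)) // true: replica restarted, push an update
}
```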
+func (c *Cluster) CoordinatorDetectEndpointChange(report HeartbeatReport) *AssignmentUpdate { + cached, ok := c.CachedReplicaEndpoints[report.NodeID] + if ok && cached.EndpointVersion == report.EndpointVersion { + return nil + } + updated := map[string]CachedEndpoint{} + for id, ep := range c.CachedReplicaEndpoints { + updated[id] = ep + } + updated[report.NodeID] = CachedEndpoint{ + DataAddr: report.DataAddr, + EndpointVersion: report.EndpointVersion, + } + return &AssignmentUpdate{ + TargetNodeID: c.Coordinator.PrimaryID, + UpdatedEndpoints: updated, + } +} + +// ApplyAssignmentUpdate overwrites the cluster-global cached replica endpoints. +// The current model has a single primary view (CachedReplicaEndpoints is +// cluster-level, not per-node). TargetNodeID is carried for API shape but +// is not used for routing — the update is always applied to the global cache. +func (c *Cluster) ApplyAssignmentUpdate(update AssignmentUpdate) { + c.CachedReplicaEndpoints = update.UpdatedEndpoints +} + +// === Recovery session ownership === + +// RecoveryTrigger identifies how a recovery session was initiated. +type RecoveryTrigger string + +const ( + TriggerNone RecoveryTrigger = "none" + TriggerBackgroundReconnect RecoveryTrigger = "background_reconnect" // V1.5 + TriggerReassignment RecoveryTrigger = "reassignment" // V2 +) + +// TrackedSession is an explicitly identified recovery session. Each replica +// has at most one active session. Sessions are invalidated by epoch bump, +// endpoint change, or explicit supersede. Stale session completions are +// rejected by ID. +type TrackedSession struct { + ID uint64 + ReplicaID string + Epoch uint64 + Trigger RecoveryTrigger + Active bool + Reason string // non-empty when invalidated +} + +// TriggerRecoverySession initiates a version-specific recovery session with +// explicit identity tracking. Rejects duplicate triggers while a session is +// active. Returns (trigger, sessionID, ok). 
+func (c *Cluster) TriggerRecoverySession(replicaID string) (RecoveryTrigger, uint64, bool) { + node := c.Nodes[replicaID] + if node == nil || !node.Running { + return TriggerNone, 0, false + } + // Reject duplicate trigger while session active. + if existing := c.Sessions[replicaID]; existing != nil && existing.Active { + return TriggerNone, 0, false + } + cached := c.CachedReplicaEndpoints[replicaID] + addrStable := cached.EndpointVersion == node.EndpointVersion + + var trigger RecoveryTrigger + switch c.Protocol.Version { + case ProtocolV1: + return TriggerNone, 0, false + case ProtocolV15: + if !addrStable { + return TriggerNone, 0, false + } + trigger = TriggerBackgroundReconnect + case ProtocolV2: + if !addrStable { + return TriggerNone, 0, false + } + trigger = TriggerReassignment + default: + return TriggerNone, 0, false + } + + c.nextSessionID++ + sess := &TrackedSession{ + ID: c.nextSessionID, + ReplicaID: replicaID, + Epoch: c.Coordinator.Epoch, + Trigger: trigger, + Active: true, + } + c.Sessions[replicaID] = sess + c.SessionHistory = append(c.SessionHistory, sess) + + node.ReplicaState = NodeStateCatchingUp + node.CatchupAttempts = 0 + return trigger, sess.ID, true +} + +// CompleteRecoverySession marks the session as completed and transitions the +// replica to InSync. Returns false if the session ID doesn't match the active +// session (stale completion from an old/superseded session). +func (c *Cluster) CompleteRecoverySession(replicaID string, sessionID uint64) bool { + sess := c.Sessions[replicaID] + if sess == nil || !sess.Active || sess.ID != sessionID { + return false // stale: old session, already completed, or wrong ID + } + sess.Active = false + if node := c.Nodes[replicaID]; node != nil { + node.ReplicaState = NodeStateInSync + node.CatchupAttempts = 0 + } + c.cancelRecoveryTimeouts(replicaID) + return true +} + +// InvalidateReplicaSession invalidates the active session for a replica. 
+func (c *Cluster) InvalidateReplicaSession(replicaID, reason string) { + if sess := c.Sessions[replicaID]; sess != nil && sess.Active { + sess.Active = false + sess.Reason = reason + } +} + +// InvalidateAllSessions invalidates all active sessions (e.g., epoch bump). +func (c *Cluster) InvalidateAllSessions(reason string) int { + count := 0 + for _, sess := range c.Sessions { + if sess.Active { + sess.Active = false + sess.Reason = reason + count++ + } + } + return count +} + +// === Candidate eligibility === + +// CandidateEligibility describes why a node is or is not eligible for promotion. +type CandidateEligibility struct { + ID string + Eligible bool + Reasons []string // non-empty when ineligible +} + +// EvaluateCandidateEligibility checks all promotion prerequisites for a node. +// A candidate must have the full committed prefix (FlushedLSN >= CommittedLSN) +// to be eligible. Promoting a replica that is missing committed data would +// lose acknowledged writes. +func (c *Cluster) EvaluateCandidateEligibility(candidateID string) CandidateEligibility { + n := c.Nodes[candidateID] + if n == nil { + return CandidateEligibility{ID: candidateID, Reasons: []string{"not_found"}} + } + var reasons []string + if !n.Running { + reasons = append(reasons, "not_running") + } + if n.Epoch != c.Coordinator.Epoch { + reasons = append(reasons, "epoch_misaligned") + } + if n.ReplicaState == NodeStateNeedsRebuild || n.ReplicaState == NodeStateRebuilding { + reasons = append(reasons, "state_ineligible") + } + if n.Storage.FlushedLSN < c.Coordinator.CommittedLSN { + reasons = append(reasons, "insufficient_committed_prefix") + } + return CandidateEligibility{ + ID: candidateID, + Eligible: len(reasons) == 0, + Reasons: reasons, + } +} + +// EligiblePromotionCandidates returns only eligible candidates, ranked by suitability. 
+func (c *Cluster) EligiblePromotionCandidates() []PromotionCandidate { + var eligible []PromotionCandidate + for _, pc := range c.PromotionCandidates() { + e := c.EvaluateCandidateEligibility(pc.ID) + if e.Eligible { + eligible = append(eligible, pc) + } + } + return eligible +} + +func (c *Cluster) AssertCommittedRecoverable(nodeID string) error { + node := c.Nodes[nodeID] + if node == nil { + return fmt.Errorf("distsim: missing node %s", nodeID) + } + want := c.Reference.StateAt(c.Coordinator.CommittedLSN) + got := node.Storage.StateAt(c.Coordinator.CommittedLSN) + if !EqualState(got, want) { + return fmt.Errorf("distsim: node %s mismatch at committed LSN %d: got=%v want=%v", nodeID, c.Coordinator.CommittedLSN, got, want) + } + return nil +} diff --git a/sw-block/prototype/distsim/cluster_test.go b/sw-block/prototype/distsim/cluster_test.go new file mode 100644 index 000000000..8fd38f82d --- /dev/null +++ b/sw-block/prototype/distsim/cluster_test.go @@ -0,0 +1,1004 @@ +package distsim + +import "testing" + +func TestQuorumCommitSurvivesPrimaryFailover(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + lsn1 := c.CommitWrite(7) + if lsn1 != 1 { + t.Fatalf("unexpected lsn1=%d", lsn1) + } + c.TickN(4) + if c.Coordinator.CommittedLSN != lsn1 { + t.Fatalf("expected committed lsn %d, got %d", lsn1, c.Coordinator.CommittedLSN) + } + + c.StopNode("p") + if err := c.Promote("r1"); err != nil { + t.Fatalf("promote: %v", err) + } + + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } +} + +func TestUncommittedWriteNotPreservedAfterPrimaryLoss(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + lsn1 := c.CommitWrite(4) + if lsn1 != 1 { + t.Fatalf("unexpected lsn1=%d", lsn1) + } + c.TickN(4) + if c.Coordinator.CommittedLSN != 0 { + t.Fatalf("expected no committed lsn, got %d", 
c.Coordinator.CommittedLSN) + } + + c.StopNode("p") + c.StartNode("r1") + if err := c.Promote("r1"); err != nil { + t.Fatalf("promote: %v", err) + } + + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } + if got := c.Nodes["r1"].Storage.StateAt(1); len(got) != 0 { + t.Fatalf("uncommitted state leaked into promoted primary: %v", got) + } +} + +func TestReplicaCatchupFromPrimaryWAL(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + c.CommitWrite(1) + c.TickN(4) + c.CommitWrite(2) + c.TickN(4) + + if c.Coordinator.CommittedLSN != 2 { + t.Fatalf("expected committed lsn 2, got %d", c.Coordinator.CommittedLSN) + } + if c.Nodes["r2"].Storage.FlushedLSN != 0 { + t.Fatalf("expected r2 to lag, got flushed=%d", c.Nodes["r2"].Storage.FlushedLSN) + } + + c.Connect("p", "r2") + c.Connect("r2", "p") + if err := c.RecoverReplicaFromPrimary("r2", 0, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("recover replica: %v", err) + } + + want := c.Reference.StateAt(c.Coordinator.CommittedLSN) + got := c.Nodes["r2"].Storage.StateAt(c.Coordinator.CommittedLSN) + if !EqualState(got, want) { + t.Fatalf("catchup mismatch: got=%v want=%v", got, want) + } +} + +func TestReplicaRebuildFromSnapshotAndTail(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(7) + c.TickN(4) + c.CommitWrite(8) + c.TickN(4) + + snap := c.Primary().Storage.TakeSnapshot("snap-2", 2) + + c.CommitWrite(7) + c.TickN(4) + + r2 := c.Nodes["r2"] + r2.Storage = NewStorage() + if err := c.RebuildReplicaFromSnapshot("r2", snap.ID, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("rebuild replica: %v", err) + } + + want := c.Reference.StateAt(c.Coordinator.CommittedLSN) + got := r2.Storage.StateAt(c.Coordinator.CommittedLSN) + if !EqualState(got, want) { + t.Fatalf("rebuild mismatch: got=%v want=%v", got, want) + } +} + +func TestPromotionUsesValidLineageNode(t *testing.T) { + c := 
NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(3) + c.TickN(4) + c.CommitWrite(5) + c.TickN(4) + + c.StopNode("p") + if err := c.Promote("r2"); err != nil { + t.Fatalf("promote: %v", err) + } + if c.Nodes["p"].Epoch == c.Coordinator.Epoch { + t.Fatalf("stopped primary should not auto-advance epoch") + } + if c.Nodes["r2"].Epoch != c.Coordinator.Epoch { + t.Fatalf("new primary epoch mismatch: node=%d coord=%d", c.Nodes["r2"].Epoch, c.Coordinator.Epoch) + } + if err := c.AssertCommittedRecoverable("r2"); err != nil { + t.Fatal(err) + } +} + +func TestZombieOldPrimaryWritesAreFenced(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(3) + c.TickN(4) + + c.StopNode("p") + if err := c.Promote("r1"); err != nil { + t.Fatalf("promote: %v", err) + } + + c.StartNode("p") + c.InjectMessage(Message{ + Kind: MsgWrite, + From: "p", + To: "r1", + Epoch: 1, + Write: Write{LSN: 99, Block: 42, Value: 99}, + }, c.Now+1) + c.InjectMessage(Message{ + Kind: MsgBarrierAck, + From: "p", + To: "r1", + Epoch: 1, + TargetLSN: 99, + }, c.Now+1) + c.TickN(2) + + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("stale message changed committed lsn: got=%d", c.Coordinator.CommittedLSN) + } + if got := c.Nodes["r1"].Storage.Extent[42]; got != 0 { + t.Fatalf("stale message mutated new primary extent: block42=%d", got) + } +} + +func TestReservationExpiryAbortsCatchup(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + c.CommitWrite(1) + c.TickN(4) + c.CommitWrite(2) + c.TickN(4) + c.CommitWrite(3) + c.TickN(4) + + c.Connect("p", "r2") + c.Connect("r2", "p") + err := c.RecoverReplicaFromPrimaryReserved("r2", 0, c.Coordinator.CommittedLSN, c.Now+2) + if err == nil { + t.Fatalf("expected reservation-expiry failure") + } + if c.Nodes["r2"].Storage.FlushedLSN >= c.Coordinator.CommittedLSN { + t.Fatalf("replica should not be fully caught up after expired reservation: flushed=%d 
committed=%d", c.Nodes["r2"].Storage.FlushedLSN, c.Coordinator.CommittedLSN) + } +} + +func TestSyncQuorumContinuesWithOneLaggingReplica(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + lsn := c.CommitWrite(9) + c.TickN(4) + if c.Coordinator.CommittedLSN != lsn { + t.Fatalf("expected quorum commit with one lagging replica, got committed=%d", c.Coordinator.CommittedLSN) + } +} + +func TestSyncAllBlocksWithOneLaggingReplica(t *testing.T) { + c := NewCluster(CommitSyncAll, "p", "r1", "r2") + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + lsn := c.CommitWrite(9) + c.TickN(4) + if c.Coordinator.CommittedLSN >= lsn { + t.Fatalf("sync_all should not commit with one lagging replica: committed=%d", c.Coordinator.CommittedLSN) + } +} + +func TestSyncQuorumWithMixedReplicaStates(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + lsn := c.CommitWrite(9) + c.TickN(4) + if c.Coordinator.CommittedLSN != lsn { + t.Fatalf("expected quorum commit with mixed replica states: committed=%d", c.Coordinator.CommittedLSN) + } +} + +func TestSyncAllBlocksWithMixedReplicaStates(t *testing.T) { + c := NewCluster(CommitSyncAll, "p", "r1", "r2", "r3") + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + c.StopNode("r3") + + lsn := c.CommitWrite(9) + c.TickN(4) + if c.Coordinator.CommittedLSN >= lsn { + t.Fatalf("sync_all should not commit with mixed replica states: committed=%d", c.Coordinator.CommittedLSN) + } +} + +func TestReplicaRestartDuringCatchupRestartsSafely(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + c.CommitWrite(1) + c.TickN(4) + c.CommitWrite(2) + c.TickN(4) + c.CommitWrite(3) + c.TickN(4) + + c.Connect("p", "r2") + c.Connect("r2", "p") + last, err := c.RecoverReplicaFromPrimaryPartial("r2", 0, c.Coordinator.CommittedLSN, 1) + if err 
!= nil { + t.Fatalf("partial catchup: %v", err) + } + if last != 1 { + t.Fatalf("expected partial catchup to reach lsn 1, got %d", last) + } + + c.StopNode("r2") + c.StartNode("r2") + + if err := c.RecoverReplicaFromPrimary("r2", c.Nodes["r2"].Storage.FlushedLSN, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("restart catchup: %v", err) + } + if err := c.AssertCommittedRecoverable("r2"); err != nil { + t.Fatal(err) + } +} + +func TestReplicaRestartDuringRebuildRestartsSafely(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(7) + c.TickN(4) + c.CommitWrite(8) + c.TickN(4) + snap := c.Primary().Storage.TakeSnapshot("snap-2", 2) + c.CommitWrite(7) + c.TickN(4) + c.CommitWrite(9) + c.TickN(4) + + last, err := c.RebuildReplicaFromSnapshotPartial("r2", snap.ID, c.Coordinator.CommittedLSN, 1) + if err != nil { + t.Fatalf("partial rebuild: %v", err) + } + if last != 3 { + t.Fatalf("expected partial rebuild to reach lsn 3, got %d", last) + } + + c.StopNode("r2") + c.StartNode("r2") + if err := c.RebuildReplicaFromSnapshot("r2", snap.ID, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("restart rebuild: %v", err) + } + if err := c.AssertCommittedRecoverable("r2"); err != nil { + t.Fatal(err) + } +} + +func TestWALInlineRecordsAreRecoverable(t *testing.T) { + records := []RecoveryRecord{ + {Write: Write{LSN: 1, Block: 7, Value: 1}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 2, Block: 7, Value: 2}, Class: RecoveryClassWALInline}, + } + if !FullyRecoverable(records) { + t.Fatalf("wal-inline records should be recoverable") + } + got := ApplyRecoveryRecords(records, 0, 2) + want := map[uint64]uint64{7: 2} + if !EqualState(got, want) { + t.Fatalf("wal-inline recovery mismatch: got=%v want=%v", got, want) + } +} + +func TestExtentReferencedResolvableRecordsAreRecoverable(t *testing.T) { + records := []RecoveryRecord{ + {Write: Write{LSN: 1, Block: 7, Value: 1}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 2, Block: 9, 
Value: 2}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true}, + } + if !FullyRecoverable(records) { + t.Fatalf("resolvable extent-referenced records should be recoverable") + } + got := ApplyRecoveryRecords(records, 0, 2) + want := map[uint64]uint64{7: 1, 9: 2} + if !EqualState(got, want) { + t.Fatalf("extent-referenced recovery mismatch: got=%v want=%v", got, want) + } +} + +func TestExtentReferencedUnresolvableForcesRebuild(t *testing.T) { + records := []RecoveryRecord{ + {Write: Write{LSN: 1, Block: 7, Value: 1}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 2, Block: 9, Value: 2}, Class: RecoveryClassExtentReferenced, PayloadResolvable: false}, + } + if FullyRecoverable(records) { + t.Fatalf("unresolvable extent-referenced records must not be recoverable") + } +} + +// --- S19: Chain of custody across multiple promotions --- + +func TestS19_ChainOfCustody_MultiplePromotions(t *testing.T) { + // A writes → committed → A crashes → promote B → + // B writes → committed → B crashes → promote C → + // C must have all committed data from both A and B. + c := NewCluster(CommitSyncQuorum, "A", "B", "C") + + // Phase 1: A is primary, writes blocks 1-3. + for i := uint64(1); i <= 3; i++ { + c.CommitWrite(i) + } + c.TickN(5) + if c.Coordinator.CommittedLSN != 3 { + t.Fatalf("phase 1: expected committedLSN=3, got %d", c.Coordinator.CommittedLSN) + } + + // Crash A, promote B. + c.StopNode("A") + if err := c.Promote("B"); err != nil { + t.Fatal(err) + } + + // Phase 2: B is primary, writes blocks 4-6. + for i := uint64(4); i <= 6; i++ { + c.CommitWrite(i) + } + c.TickN(5) + // CommittedLSN should have advanced (B + C form quorum). + phase2Committed := c.Coordinator.CommittedLSN + if phase2Committed < 4 { + t.Fatalf("phase 2: expected committedLSN >= 4, got %d", phase2Committed) + } + + // Crash B, promote C. + c.StopNode("B") + if err := c.Promote("C"); err != nil { + t.Fatal(err) + } + + // Phase 3: Verify C has all committed data from both A and B. 
+ if err := c.AssertCommittedRecoverable("C"); err != nil { + t.Fatalf("chain of custody broken: %v", err) + } + + // Verify specific blocks: data written by A (blocks 1-3) and B (blocks 4-6). + cState := c.Nodes["C"].Storage.StateAt(phase2Committed) + refState := c.Reference.StateAt(phase2Committed) + for block := uint64(1); block <= 6; block++ { + if cState[block] != refState[block] { + t.Fatalf("block %d: C has %d, reference has %d", block, cState[block], refState[block]) + } + } +} + +func TestS19_ChainOfCustody_ThreePromotions(t *testing.T) { + // Even longer chain: A → B → C → back to A (restarted). + c := NewCluster(CommitSyncQuorum, "A", "B", "C") + + // A writes. + c.CommitWrite(1) + c.TickN(5) + + // A crashes, promote B. + c.StopNode("A") + if err := c.Promote("B"); err != nil { + t.Fatalf("promote B: %v", err) + } + + // B writes. + c.CommitWrite(2) + c.TickN(5) + + // B crashes, promote C. + c.StopNode("B") + if err := c.Promote("C"); err != nil { + t.Fatalf("promote C: %v", err) + } + + // C writes. + c.CommitWrite(3) + c.TickN(5) + + // Restart A, promote back to A. + c.StartNode("A") + // Recover A from C (current primary). + if err := c.RecoverReplicaFromPrimary("A", 0, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("recover A: %v", err) + } + if err := c.Promote("A"); err != nil { + t.Fatalf("promote A: %v", err) + } + + // A must have all committed data from A, B, and C eras. + if err := c.AssertCommittedRecoverable("A"); err != nil { + t.Fatalf("triple chain of custody broken: %v", err) + } +} + +// --- S20: Live partition with competing writes --- + +func TestS20_LivePartition_StaleWritesNotCommitted(t *testing.T) { + // Partition splits cluster: A can reach B but not C. + // A thinks it's still primary and writes. + // Meanwhile coordinator promotes C (which can reach B). + // A's writes must NOT become committed under the new epoch. + c := NewCluster(CommitSyncQuorum, "A", "B", "C") + + // Phase 1: Normal operation — all connected. + c.CommitWrite(1) + c.TickN(5) + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("expected committedLSN=1, got %d", c.Coordinator.CommittedLSN) + } + + // Phase 2: Partition — A isolated from C (but A still connected to B).
+ c.Disconnect("A", "C") + c.Disconnect("C", "A") + + // Coordinator detects A is unreachable from C's perspective, promotes C. + if err := c.Promote("C"); err != nil { + t.Fatalf("promote C: %v", err) + } + + // Phase 3: A (stale primary, old epoch) tries to write. + // CommitWrite uses coordinator's current primary, which is now C. + // So we manually simulate A trying to write with its old epoch. + oldEpoch := c.Coordinator.Epoch - 1 + staleNode := c.Nodes["A"] + staleNode.Epoch = oldEpoch // A still thinks it's the old epoch + + // C (new primary) writes under new epoch. + c.CommitWrite(2) + c.TickN(5) + + // Phase 4: Verify. + // C's committed data should be correct. + if err := c.AssertCommittedRecoverable("C"); err != nil { + t.Fatalf("new primary data incorrect: %v", err) + } + + // A's stale epoch means its Storage may have old data, but it should NOT + // have data from the new epoch that it couldn't have received. + if staleNode.Epoch == c.Coordinator.Epoch { + t.Fatal("stale node should NOT have the current epoch") + } +} + +func TestS20_LivePartition_HealRecovers(t *testing.T) { + // After partition heals, the stale side can catch up and rejoin. + c := NewCluster(CommitSyncQuorum, "A", "B", "C") + + // Normal writes. + c.CommitWrite(1) + c.TickN(5) + + // Partition A from B and C. + c.Disconnect("A", "B") + c.Disconnect("B", "A") + c.Disconnect("A", "C") + c.Disconnect("C", "A") + + // Promote B. + if err := c.Promote("B"); err != nil { + t.Fatalf("promote B: %v", err) + } + + // B writes during partition. + c.CommitWrite(2) + c.CommitWrite(3) + c.TickN(5) + + // Heal partition. + c.Connect("A", "B") + c.Connect("B", "A") + c.Connect("A", "C") + c.Connect("C", "A") + + // Recover A from B. + c.StartNode("A") // re-sync A to the current epoch before catch-up + if err := c.RecoverReplicaFromPrimary("A", 0, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("recover A: %v", err) + } + + // Verify A now has all committed data.
+ aState := c.Nodes["A"].Storage.StateAt(c.Coordinator.CommittedLSN) + refState := c.Reference.StateAt(c.Coordinator.CommittedLSN) + if !EqualState(aState, refState) { + t.Fatalf("A after heal: got %v, want %v", aState, refState) + } +} + +// --- S5: Flapping replica stays recoverable --- + +func TestS5_FlappingReplica_NoUnnecessaryRebuild(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + // Write initial data. + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // Simulate 5 flapping cycles: disconnect, write, reconnect, catch up. + for cycle := 0; cycle < 5; cycle++ { + // Disconnect r1. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + // Write during disconnect (r2 still connected — quorum holds). + c.CommitWrite(uint64(3 + cycle*2)) + c.CommitWrite(uint64(4 + cycle*2)) + c.TickN(3) + + // Reconnect r1. + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Catch r1 up from primary WAL (not rebuild). + r1 := c.Nodes["r1"] + if err := c.RecoverReplicaFromPrimary("r1", r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("cycle %d catch-up: %v", cycle, err) + } + c.TickN(3) + } + + // After 5 flapping cycles, r1 should have all committed data. + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatalf("flapping replica lost data: %v", err) + } + // r1 was never rebuilt — only WAL catch-up. + if c.Nodes["r1"].Storage.BaseSnapshot != nil { + t.Fatal("r1 should NOT have been rebuilt from snapshot — only WAL catch-up") + } +} + +// --- S6: Tail-chasing under load --- + +func TestS6_TailChasing_ConvergesOrAborts(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + // Write initial batch. + for i := uint64(1); i <= 5; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // Disconnect r1. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + // Write a LOT more while r1 is disconnected. + for i := uint64(6); i <= 30; i++ { + c.CommitWrite(i % 8) + } + c.TickN(5) + + // Reconnect r1.
+ c.Connect("p", "r1") + c.Connect("r1", "p") + + // Partial catch-up: simulate primary still writing during catch-up. + r1 := c.Nodes["r1"] + lastRecovered := r1.Storage.FlushedLSN + + // Try catching up in batches of 5; the primary keeps writing one more entry each batch. + for attempt := 0; attempt < 10; attempt++ { + target := c.Coordinator.CommittedLSN + recovered, err := c.RecoverReplicaFromPrimaryPartial("r1", lastRecovered, target, 5) + if err != nil { + t.Fatalf("partial catch-up attempt %d: %v", attempt, err) + } + lastRecovered = recovered + + // Primary writes more during catch-up. + c.CommitWrite(uint64(31 + attempt)) + c.TickN(2) + + // Check convergence. + if lastRecovered >= c.Coordinator.CommittedLSN { + break + } + } + + // Final full catch-up to close any remaining gap. + if err := c.RecoverReplicaFromPrimary("r1", lastRecovered, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("final catch-up: %v", err) + } + + // Verify correctness. + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatalf("tail-chasing replica diverged: %v", err) + } +} + +// --- S18: Primary restart without failover --- + +func TestS18_PrimaryRestart_SameLineage(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + // Write and commit. + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + committedBefore := c.Coordinator.CommittedLSN + + // Primary stops (but NOT crashed — coordinator doesn't promote). + c.StopNode("p") + + // Coordinator bumps epoch (same primary, just restarted). + c.Coordinator.Epoch++ + + // Broadcast new epoch to all running nodes (replicas learn via heartbeat). + for _, n := range c.Nodes { + if n.Running { + n.Epoch = c.Coordinator.Epoch + } + } + + // Restart primary with new epoch. + c.StartNode("p") + + // Primary writes after restart. + c.CommitWrite(3) + c.TickN(5) + + // All committed data should be intact.
+ if c.Coordinator.CommittedLSN <= committedBefore { + t.Fatalf("no new commits after restart: before=%d after=%d", committedBefore, c.Coordinator.CommittedLSN) + } + if err := c.AssertCommittedRecoverable("p"); err != nil { + t.Fatalf("data lost after primary restart: %v", err) + } +} + +func TestS18_PrimaryRestart_ReplicasRejectOldEpoch(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // Restart with epoch bump. + c.StopNode("p") + c.Coordinator.Epoch++ + c.StartNode("p") + + // Simulate old-epoch message arriving at r1 (stale, should be rejected). + oldMsg := Message{ + Kind: MsgWrite, + From: "p", + To: "r1", + Epoch: c.Coordinator.Epoch - 1, // old epoch + Write: Write{LSN: 999, Block: 0, Value: 999}, + } + // deliver should reject due to epoch mismatch. + c.deliver(oldMsg) + + // r1 should NOT have the stale write. + if c.Nodes["r1"].Storage.ReceivedLSN >= 999 { + t.Fatal("r1 accepted stale-epoch write after primary restart") + } +} + +// --- S12 (stronger): Promotion chooses best valid lineage --- + +func TestS12_PromotionChoosesBestLineage_NotHighestLSN(t *testing.T) { + // Setup: p writes, replicates to r1 and r2. + // r2 gets MORE data than r1 but at a STALE epoch (simulating split-brain). + // Promotion should choose r1 (valid lineage) over r2 (higher LSN but stale). + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + // Normal writes, all replicated. + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // Partition r2 from current epoch. + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + // More writes only to r1 (r2 is partitioned). + c.CommitWrite(3) + c.CommitWrite(4) + c.TickN(5) + + // Artificially give r2 higher ReceivedLSN but at stale epoch. + // This simulates r2 having local writes from a stale primary. 
+ r2 := c.Nodes["r2"] + r2.Epoch = c.Coordinator.Epoch - 1 // stale epoch + r2.Storage.AppendWrite(Write{LSN: 100, Block: 0, Value: 100}) + + // Now r2 has ReceivedLSN=100, r1 has ReceivedLSN=4. + // Naive "highest LSN wins" would pick r2. Correct logic picks r1. + + // Crash primary. + c.StopNode("p") + + // Promotion: choose between r1 (current epoch, LSN 4) and r2 (stale epoch, LSN 100). + // r2 is at wrong epoch — should not be promotable. + r1 := c.Nodes["r1"] + if r1.Epoch != c.Coordinator.Epoch { + t.Fatalf("r1 should be at current epoch") + } + if r2.Epoch == c.Coordinator.Epoch { + t.Fatal("r2 should be at stale epoch for this test") + } + + // Promote r1 (valid lineage). This should succeed. + if err := c.Promote("r1"); err != nil { + t.Fatalf("promote r1: %v", err) + } + + // Verify r1 has the correct committed data. + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatalf("promoted r1 has wrong data: %v", err) + } + + // r2's stale data should NOT be in the committed lineage. + r1State := c.Nodes["r1"].Storage.StateAt(c.Coordinator.CommittedLSN) + if _, hasBlock0 := r1State[0]; hasBlock0 && r1State[0] == 100 { + t.Fatal("stale r2 data leaked into promoted lineage") + } +} + +func TestS12_PromotionRejectsRebuildingCandidate(t *testing.T) { + // A node that is mid-rebuild should NOT be promoted even if it + // has a high ReceivedLSN (from partial rebuild data). + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // r2 is "rebuilding" — mark it as not at current epoch. + c.Nodes["r2"].Epoch = 0 // not current + c.Nodes["r2"].Running = false // simulate down for rebuild + + // Crash primary. + c.StopNode("p") + + // Only r1 is promotable. + if err := c.Promote("r1"); err != nil { + t.Fatalf("promote r1: %v", err) + } + + // r2 should not be primary. 
+ if c.Coordinator.PrimaryID != "r1" { + t.Fatalf("expected r1 as primary, got %s", c.Coordinator.PrimaryID) + } + + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } +} + +// ============================================================ +// Strengthened partials: S20, S6, S5, S18 +// ============================================================ + +// --- S20 (strengthened): stale side routes writes through protocol --- + +func TestS20_StalePartition_ProtocolRejectsStaleWrites(t *testing.T) { + // True competing writes through the message protocol. + // Stale side attempts writes via StaleWrite (enqueue/deliver path). + // All stale messages must be rejected by epoch fencing. + c := NewCluster(CommitSyncQuorum, "A", "B", "C") + + // Phase 1: normal writes. + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + committedBefore := c.Coordinator.CommittedLSN + + // Phase 2: full partition — A isolated. + c.Disconnect("A", "B") + c.Disconnect("B", "A") + c.Disconnect("A", "C") + c.Disconnect("C", "A") + + // Promote B (new epoch). + if err := c.Promote("B"); err != nil { + t.Fatalf("promote B: %v", err) + } + // A never learned about the new epoch. + c.Nodes["A"].Epoch = c.Coordinator.Epoch - 1 + staleEpoch := c.Nodes["A"].Epoch + + // Phase 3: B (new primary) writes through protocol — succeeds. + c.CommitWrite(3) + c.CommitWrite(4) + c.TickN(5) + if c.Coordinator.CommittedLSN <= committedBefore { + t.Fatalf("B didn't advance commits") + } + + // Phase 4: A (stale) attempts writes through the protocol. + // StaleWrite routes through enqueue/deliver — epoch fencing should reject. + delivered := c.StaleWrite("A", staleEpoch, 99) + if delivered > 0 { + t.Fatalf("stale writes were delivered! %d messages passed epoch fencing", delivered) + } + + // Verify: all stale messages were rejected with epoch_mismatch.
+ epochRejects := c.RejectedByReason(RejectEpochMismatch) + if epochRejects == 0 { + t.Fatal("expected epoch_mismatch rejections for stale writes, got 0") + } + t.Logf("stale writes correctly rejected: %d epoch_mismatch rejections", epochRejects) + + // Phase 5: committed lineage unchanged by stale traffic. + if err := c.AssertCommittedRecoverable("B"); err != nil { + t.Fatalf("committed data corrupted by stale traffic: %v", err) + } + + // A's stale block 99 is local-only, never in reference. + refState := c.Reference.StateAt(c.Coordinator.CommittedLSN) + if _, has99 := refState[99]; has99 { + t.Fatal("stale block 99 leaked into reference") + } +} + +// --- S6 (strengthened): non-converging tail-chase aborts --- + +func TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild(t *testing.T) { + // Primary writes FASTER than catch-up rate. + // Replica can never converge. CatchUpWithEscalation must transition + // to NeedsRebuild after MaxCatchupAttempts. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + // MaxCatchupAttempts = 3 means after 3 non-convergent attempts → NeedsRebuild. + c.MaxCatchupAttempts = 3 + + // Initial writes — all replicated. + for i := uint64(1); i <= 5; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // Disconnect r1. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + // Write heavily while disconnected — large initial gap. + for i := uint64(6); i <= 200; i++ { + c.CommitWrite(i % 8) + } + c.TickN(5) + + // Reconnect r1. + c.Connect("p", "r1") + c.Connect("r1", "p") + + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + + // Each attempt: recover 1 entry, then primary writes 20 more. + // Ratio: 1 recovered per 20 new → gap always grows → never converges. 
+ for attempt := 0; attempt < 10; attempt++ { + c.CatchUpWithEscalation("r1", 1) + + if r1.ReplicaState == NodeStateNeedsRebuild { + t.Logf("correctly escalated to NeedsRebuild after %d catch-up attempts", r1.CatchupAttempts) + return + } + + // Primary writes 20 more while r1 is disconnected again — gap grows. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + for w := 0; w < 20; w++ { + c.CommitWrite(uint64(201 + attempt*20 + w) % 8) + } + c.TickN(2) + c.Connect("p", "r1") + c.Connect("r1", "p") + } + + t.Fatalf("expected NeedsRebuild escalation but state is %s after %d attempts", + r1.ReplicaState, r1.CatchupAttempts) +} + +// --- S18 (strengthened): restart races --- + +func TestS18_PrimaryRestart_DelayedOldAck_DoesNotAdvancePrefix(t *testing.T) { + // Old barrier ack arrives AFTER primary restart + epoch bump. + // Must be rejected by epoch fencing. committedLSN must NOT advance + // from the stale ack — assert before/after prefix. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // Write LSN 2 but don't let barrier ack arrive yet. + c.Disconnect("r1", "p") // acks from r1 can't reach p + c.CommitWrite(2) + c.TickN(3) + + // Primary restarts with epoch bump. + c.StopNode("p") + c.Coordinator.Epoch++ + for _, n := range c.Nodes { + if n.Running { + n.Epoch = c.Coordinator.Epoch + } + } + c.StartNode("p") + + // Snapshot committed prefix BEFORE stale ack. + committedBefore := c.Coordinator.CommittedLSN + + // Reconnect r1 and deliver old-epoch ack. + c.Connect("r1", "p") + c.Connect("p", "r1") + + oldAck := Message{ + Kind: MsgBarrierAck, + From: "r1", + To: "p", + Epoch: c.Coordinator.Epoch - 1, // old epoch! + TargetLSN: 2, + } + c.deliver(oldAck) + c.refreshCommits() + + // Committed prefix must NOT have advanced from the stale ack. 
+ committedAfter := c.Coordinator.CommittedLSN + if committedAfter > committedBefore { + t.Fatalf("stale ack advanced committed prefix: before=%d after=%d", committedBefore, committedAfter) + } + + // The stale ack should have been rejected. + epochRejects := c.RejectedByReason(RejectEpochMismatch) + if epochRejects == 0 { + t.Fatal("expected epoch_mismatch rejection for stale ack") + } + + // Data correctness still holds. + if err := c.AssertCommittedRecoverable("p"); err != nil { + t.Fatalf("data incorrect: %v", err) + } +} + +func TestS18_PrimaryRestart_InFlightBarrierDropped(t *testing.T) { + // Barrier is in-flight when primary restarts. The in-flight barrier + // should be dropped (events cleared on crash), not processed post-restart. + c := NewCluster(CommitSyncAll, "p", "r1") + + c.CommitWrite(1) + c.TickN(5) + + // Write LSN 2. + c.CommitWrite(2) + // Don't tick — barrier is "in flight". + + // Crash primary (drops in-flight events). + c.StopNode("p") + + // Tick to deliver any remaining messages. + c.TickN(5) + + // Under sync_all, LSN 2 should NOT be committed if barrier didn't complete. + // (It depends on whether the ack was already queued before crash.) + // The key invariant: whatever committedLSN is, the data must be correct. 
+ c.Coordinator.Epoch++ + c.StartNode("p") + for _, n := range c.Nodes { + if n.Running { + n.Epoch = c.Coordinator.Epoch + } + } + + if err := c.AssertCommittedRecoverable("p"); err != nil { + t.Fatalf("data incorrect after in-flight barrier drop: %v", err) + } +} diff --git a/sw-block/prototype/distsim/distsim.test.exe b/sw-block/prototype/distsim/distsim.test.exe new file mode 100644 index 000000000..c25c46c83 Binary files /dev/null and b/sw-block/prototype/distsim/distsim.test.exe differ diff --git a/sw-block/prototype/distsim/eventsim.go b/sw-block/prototype/distsim/eventsim.go new file mode 100644 index 000000000..a5b092956 --- /dev/null +++ b/sw-block/prototype/distsim/eventsim.go @@ -0,0 +1,266 @@ +// eventsim.go — timeout events and timer-race infrastructure. +// +// This file implements the eventsim layer within the distsim package. +// The two conceptual layers share the Cluster model but serve different purposes: +// +// distsim (protocol layer — cluster.go, protocol.go): +// - Protocol correctness: epoch fencing, barrier semantics, commit rules +// - Reference-state validation: AssertCommittedRecoverable +// - Recoverability logic: catch-up, rebuild, reservation +// - Promotion/lineage: candidate eligibility, ranking +// - Endpoint identity: address versioning, stale endpoint rejection +// - Control-plane flow: heartbeat → detect → assignment +// +// eventsim (timing/race layer — this file): +// - Explicit timeout events: barrier, catch-up, reservation +// - Timer-triggered state transitions +// - Same-tick race resolution: data events process before timeouts +// - Timeout cancellation on successful ack/convergence +// +// Boundary rule: +// - A scenario belongs in distsim tests if the bug is protocol-level +// (wrong state, wrong commit, wrong rejection). +// - A scenario belongs in eventsim tests if the bug is timing-level +// (race between ack and timeout, ordering of concurrent events). 
+// - Do not duplicate scenarios across both layers unless +// timer/event ordering is the actual bug surface. + +package distsim + +import "fmt" + +// TimeoutKind identifies the type of timeout event. +type TimeoutKind string + +const ( + TimeoutBarrier TimeoutKind = "barrier" + TimeoutCatchup TimeoutKind = "catchup" + TimeoutReservation TimeoutKind = "reservation" +) + +// PendingTimeout represents a registered timeout that has not yet fired or been cancelled. +type PendingTimeout struct { + Kind TimeoutKind + ReplicaID string + LSN uint64 // for barrier timeouts: which LSN's barrier + DeadlineAt uint64 // absolute tick when timeout fires + Cancelled bool +} + +// FiredTimeout records a timeout that actually fired (was not cancelled in time). +type FiredTimeout struct { + PendingTimeout + FiredAt uint64 +} + +// barrierExpiredKey uniquely identifies a timed-out barrier instance. +type barrierExpiredKey struct { + ReplicaID string + LSN uint64 +} + +// RegisterTimeout adds a pending timeout to the cluster. +func (c *Cluster) RegisterTimeout(kind TimeoutKind, replicaID string, lsn uint64, deadline uint64) { + c.Timeouts = append(c.Timeouts, PendingTimeout{ + Kind: kind, + ReplicaID: replicaID, + LSN: lsn, + DeadlineAt: deadline, + }) +} + +// CancelTimeout cancels all pending timeouts matching the given kind, replica, and LSN. +// For catch-up/reservation timeouts, LSN is ignored (matched by kind+replica only). +func (c *Cluster) CancelTimeout(kind TimeoutKind, replicaID string, lsn uint64) { + for i := range c.Timeouts { + t := &c.Timeouts[i] + if t.Cancelled { + continue + } + if t.Kind != kind || t.ReplicaID != replicaID { + continue + } + if kind == TimeoutBarrier && t.LSN != lsn { + continue + } + t.Cancelled = true + c.logEvent(EventTimeoutCancelled, fmt.Sprintf("%s replica=%s lsn=%d", kind, replicaID, t.LSN)) + } +} + +// fireTimeouts checks all pending timeouts against the current tick.
+// Called by Tick() AFTER message delivery, so data events (acks) get +// a chance to cancel timeouts before they fire. This is the same-tick +// race resolution rule: data before timers. +// +// State-guard rules (prevent stale timeout from mutating post-success state): +// - CatchupTimeout only fires if replica is still CatchingUp +// - ReservationTimeout only fires if replica is still CatchingUp +// - BarrierTimeout marks the barrier instance as expired (late acks rejected) +func (c *Cluster) fireTimeouts() { + var remaining []PendingTimeout + for i := range c.Timeouts { + t := c.Timeouts[i] + if t.Cancelled { + continue + } + if c.Now < t.DeadlineAt { + remaining = append(remaining, t) + continue + } + // Check whether the timeout still has authority to mutate state. + stale := false + switch t.Kind { + case TimeoutBarrier: + // Barrier timeouts always apply — they mark the instance as expired. + case TimeoutCatchup, TimeoutReservation: + // Only valid if replica is still CatchingUp. If already recovered + // or escalated, the timeout is stale and has no authority. + if n := c.Nodes[t.ReplicaID]; n == nil || n.ReplicaState != NodeStateCatchingUp { + stale = true + } + } + + if stale { + c.IgnoredTimeouts = append(c.IgnoredTimeouts, FiredTimeout{ + PendingTimeout: t, + FiredAt: c.Now, + }) + c.logEvent(EventTimeoutIgnored, fmt.Sprintf("%s replica=%s lsn=%d (stale)", t.Kind, t.ReplicaID, t.LSN)) + continue + } + + // Timeout fires with authority. 
+ c.FiredTimeouts = append(c.FiredTimeouts, FiredTimeout{ + PendingTimeout: t, + FiredAt: c.Now, + }) + c.logEvent(EventTimeoutFired, fmt.Sprintf("%s replica=%s lsn=%d", t.Kind, t.ReplicaID, t.LSN)) + switch t.Kind { + case TimeoutBarrier: + c.removeQueuedBarrier(t.ReplicaID, t.LSN) + c.ExpiredBarriers[barrierExpiredKey{t.ReplicaID, t.LSN}] = true + case TimeoutCatchup: + c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild + case TimeoutReservation: + c.Nodes[t.ReplicaID].ReplicaState = NodeStateNeedsRebuild + } + } + c.Timeouts = remaining +} + +// removeQueuedBarrier removes a re-queuing barrier from the message queue +// after its timeout fires. Without this, the barrier would re-queue indefinitely. +func (c *Cluster) removeQueuedBarrier(replicaID string, lsn uint64) { + var kept []inFlightMessage + for _, item := range c.Queue { + if item.msg.Kind == MsgBarrier && item.msg.To == replicaID && item.msg.TargetLSN == lsn { + continue + } + kept = append(kept, item) + } + c.Queue = kept +} + +// cancelRecoveryTimeouts cancels all catch-up and reservation timeouts for a replica. +// Called automatically by CatchUpWithEscalation on convergence or escalation, +// so stale timeouts cannot regress a replica that already recovered or failed. +func (c *Cluster) cancelRecoveryTimeouts(replicaID string) { + c.CancelTimeout(TimeoutCatchup, replicaID, 0) + c.CancelTimeout(TimeoutReservation, replicaID, 0) +} + +// === Tick event log === + +// TickEventKind identifies the type of event within a tick. +type TickEventKind string + +const ( + EventDeliveryAccepted TickEventKind = "delivery_accepted" + EventDeliveryRejected TickEventKind = "delivery_rejected" + EventTimeoutFired TickEventKind = "timeout_fired" + EventTimeoutIgnored TickEventKind = "timeout_ignored" + EventTimeoutCancelled TickEventKind = "timeout_cancelled" +) + +// TickEvent records a single event within a tick, in processing order. 
+type TickEvent struct { + Tick uint64 + Kind TickEventKind + Detail string +} + +// logEvent appends a tick event to the cluster's event log. +func (c *Cluster) logEvent(kind TickEventKind, detail string) { + c.TickLog = append(c.TickLog, TickEvent{Tick: c.Now, Kind: kind, Detail: detail}) +} + +// TickEventsAt returns all events recorded at a specific tick. +func (c *Cluster) TickEventsAt(tick uint64) []TickEvent { + var events []TickEvent + for _, e := range c.TickLog { + if e.Tick == tick { + events = append(events, e) + } + } + return events +} + +// === Trace infrastructure === + +// Trace captures a snapshot of cluster state for debugging failed scenarios. +// Reusable across test files and future replay/debug tooling. +type Trace struct { + Tick uint64 + CommittedLSN uint64 + PrimaryID string + Epoch uint64 + NodeStates map[string]string + FiredTimeouts []string + IgnoredTimeouts []string + TickEvents []TickEvent // full ordered event log + Deliveries int + Rejections int + QueueDepth int +} + +// BuildTrace captures the current cluster state as a debuggable trace. 
+func BuildTrace(c *Cluster) Trace { + tr := Trace{ + Tick: c.Now, + CommittedLSN: c.Coordinator.CommittedLSN, + PrimaryID: c.Coordinator.PrimaryID, + Epoch: c.Coordinator.Epoch, + NodeStates: map[string]string{}, + TickEvents: c.TickLog, + Deliveries: len(c.Deliveries), + Rejections: len(c.Rejected), + QueueDepth: len(c.Queue), + } + for id, n := range c.Nodes { + tr.NodeStates[id] = fmt.Sprintf("role=%s state=%s epoch=%d flushed=%d running=%v", + n.Role, n.ReplicaState, n.Epoch, n.Storage.FlushedLSN, n.Running) + } + for _, ft := range c.FiredTimeouts { + tr.FiredTimeouts = append(tr.FiredTimeouts, + fmt.Sprintf("%s replica=%s lsn=%d fired_at=%d", ft.Kind, ft.ReplicaID, ft.LSN, ft.FiredAt)) + } + for _, it := range c.IgnoredTimeouts { + tr.IgnoredTimeouts = append(tr.IgnoredTimeouts, + fmt.Sprintf("%s replica=%s lsn=%d stale_at=%d", it.Kind, it.ReplicaID, it.LSN, it.FiredAt)) + } + return tr +} + +// === Query helpers === + +// FiredTimeoutsByKind returns the count of fired timeouts of a specific kind. +func (c *Cluster) FiredTimeoutsByKind(kind TimeoutKind) int { + count := 0 + for _, ft := range c.FiredTimeouts { + if ft.Kind == kind { + count++ + } + } + return count +} diff --git a/sw-block/prototype/distsim/phase02_advanced_test.go b/sw-block/prototype/distsim/phase02_advanced_test.go new file mode 100644 index 000000000..82e0cd8fc --- /dev/null +++ b/sw-block/prototype/distsim/phase02_advanced_test.go @@ -0,0 +1,213 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 02: Item 4 — Smart WAL recovery-class transitions +// ============================================================ + +// Test: recovery starts with resolvable ExtentReferenced records, +// then a payload becomes unresolvable during active recovery. +// Protocol must detect the transition and abort to NeedsRebuild. 
+ +func TestP02_SmartWAL_RecoverableThenUnrecoverable(t *testing.T) { + // Build recovery records: first 3 WALInline, then 2 ExtentReferenced. + records := []RecoveryRecord{ + {Write: Write{LSN: 1, Block: 1, Value: 1}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 2, Block: 2, Value: 2}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 3, Block: 3, Value: 3}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 4, Block: 4, Value: 4}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true}, + {Write: Write{LSN: 5, Block: 5, Value: 5}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true}, + } + + // Initially fully recoverable. + if !FullyRecoverable(records) { + t.Fatal("initial records should be fully recoverable") + } + + // Simulate payload becoming unresolvable (e.g., extent generation GC'd). + records[4].PayloadResolvable = false + + // Now NOT recoverable — must detect and abort. + if FullyRecoverable(records) { + t.Fatal("after payload loss, records should NOT be recoverable") + } + + // Apply only the recoverable prefix. 
+ state := ApplyRecoveryRecords(records[:4], 0, 4) // only first 4 + if state[4] != 4 { + t.Fatalf("partial apply: block 4 should be 4, got %d", state[4]) + } + if _, has5 := state[5]; has5 { + t.Fatal("block 5 should NOT be in partial state — payload was lost") + } +} + +func TestP02_SmartWAL_MixedClassRecovery_FullSuccess(t *testing.T) { + records := []RecoveryRecord{ + {Write: Write{LSN: 1, Block: 0, Value: 10}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 2, Block: 1, Value: 20}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true}, + {Write: Write{LSN: 3, Block: 0, Value: 30}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 4, Block: 2, Value: 40}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true}, + } + + if !FullyRecoverable(records) { + t.Fatal("all resolvable — should be recoverable") + } + + state := ApplyRecoveryRecords(records, 0, 4) + // Block 0 overwritten: 10 then 30. + if state[0] != 30 { + t.Fatalf("block 0: got %d, want 30", state[0]) + } + if state[1] != 20 { + t.Fatalf("block 1: got %d, want 20", state[1]) + } + if state[2] != 40 { + t.Fatalf("block 2: got %d, want 40", state[2]) + } +} + +func TestP02_SmartWAL_TimeVaryingAvailability(t *testing.T) { + // Simulate time-varying payload availability: + // At time T1, all records are recoverable. + // At time T2, one becomes unrecoverable. + // At time T3, it becomes recoverable again (re-pinned). + + records := []RecoveryRecord{ + {Write: Write{LSN: 1, Block: 0, Value: 1}, Class: RecoveryClassWALInline}, + {Write: Write{LSN: 2, Block: 1, Value: 2}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true}, + {Write: Write{LSN: 3, Block: 2, Value: 3}, Class: RecoveryClassExtentReferenced, PayloadResolvable: true}, + } + + // T1: all recoverable. + if !FullyRecoverable(records) { + t.Fatal("T1: should be recoverable") + } + + // T2: payload for LSN 2 lost. 
+ records[1].PayloadResolvable = false + if FullyRecoverable(records) { + t.Fatal("T2: should NOT be recoverable after payload loss") + } + + // T3: payload re-pinned (e.g., operator restores snapshot). + records[1].PayloadResolvable = true + if !FullyRecoverable(records) { + t.Fatal("T3: should be recoverable after re-pin") + } +} + +// ============================================================ +// Phase 02: Item 5 — Strengthen S5 (flapping replica) +// ============================================================ + +// S5 strengthened: repeated disconnect/reconnect with catch-up +// state tracking. If flapping exceeds budget, escalate to NeedsRebuild. + +func TestP02_S5_FlappingWithStateTracking(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.MaxCatchupAttempts = 10 // generous for flapping + + // Initial writes. + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + r1 := c.Nodes["r1"] + + // 5 flapping cycles — each creates a small gap then catches up. + for cycle := 0; cycle < 5; cycle++ { + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + c.CommitWrite(uint64(3 + cycle*2)) + c.CommitWrite(uint64(4 + cycle*2)) + c.TickN(3) + + c.Connect("p", "r1") + c.Connect("r1", "p") + + r1.ReplicaState = NodeStateCatchingUp + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatalf("cycle %d: catch-up should converge for small gap", cycle) + } + if r1.ReplicaState != NodeStateInSync { + t.Fatalf("cycle %d: expected InSync, got %s", cycle, r1.ReplicaState) + } + } + + // After 5 successful flaps, CatchupAttempts should be 0 (reset on success). + if r1.CatchupAttempts != 0 { + t.Fatalf("CatchupAttempts should be 0 after successful catch-ups, got %d", r1.CatchupAttempts) + } + + // No unnecessary rebuild — r1 should NOT have a base snapshot. 
+ if r1.Storage.BaseSnapshot != nil { + t.Fatal("flapping replica should not have been rebuilt — only WAL catch-up") + } + + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } +} + +func TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.MaxCatchupAttempts = 3 // tight budget + + c.CommitWrite(1) + c.TickN(5) + + r1 := c.Nodes["r1"] + + // Each flap creates a gap, but primary writes a LOT during disconnect. + // Catch-up recovers only 1 entry per attempt. After MaxCatchupAttempts + // non-convergent attempts, escalate. + for cycle := 0; cycle < 5; cycle++ { + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + // Large writes during disconnect. + for w := 0; w < 30; w++ { + c.CommitWrite(uint64(cycle*30+w+2) % 8) + } + c.TickN(3) + + c.Connect("p", "r1") + c.Connect("r1", "p") + + r1.ReplicaState = NodeStateCatchingUp + + // Try catch-up with small batch — will not converge. + for attempt := 0; attempt < 5; attempt++ { + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + for w := 0; w < 10; w++ { + c.CommitWrite(uint64(200+cycle*50+attempt*10+w) % 8) + } + c.TickN(2) + c.Connect("p", "r1") + c.Connect("r1", "p") + + c.CatchUpWithEscalation("r1", 1) + + if r1.ReplicaState == NodeStateNeedsRebuild { + t.Logf("flapping escalated to NeedsRebuild at cycle %d, attempt %d", cycle, attempt) + // Verify: NeedsRebuild is sticky. + c.CatchUpWithEscalation("r1", 100) + if r1.ReplicaState != NodeStateNeedsRebuild { + t.Fatal("NeedsRebuild should be sticky — catch-up should not reset it") + } + return + } + } + } + + // If we got here, the budget wasn't reached. That's wrong. 
+ t.Fatalf("expected NeedsRebuild escalation, but state is %s with %d attempts", + r1.ReplicaState, r1.CatchupAttempts) +} diff --git a/sw-block/prototype/distsim/phase02_candidate_test.go b/sw-block/prototype/distsim/phase02_candidate_test.go new file mode 100644 index 000000000..c24568043 --- /dev/null +++ b/sw-block/prototype/distsim/phase02_candidate_test.go @@ -0,0 +1,445 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 02: Coordinator candidate-selection tests +// Verifies promotion ranking under mixed replica states. +// ============================================================ + +func TestP02_CandidateSelection_AllEqual_AlphabeticalTieBreak(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + c.CommitWrite(1) + c.TickN(5) + + // All replicas InSync with same FlushedLSN → alphabetical tie-break. + best := c.BestPromotionCandidate() + if best != "r1" { + t.Fatalf("all equal: expected r1 (alphabetical), got %q", best) + } + + candidates := c.PromotionCandidates() + if len(candidates) != 3 { + t.Fatalf("expected 3 candidates, got %d", len(candidates)) + } + for i, exp := range []string{"r1", "r2", "r3"} { + if candidates[i].ID != exp { + t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp) + } + } +} + +func TestP02_CandidateSelection_HigherLSN_Wins(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + + // Directly set FlushedLSN to simulate different progress. + // All InSync — higher LSN wins. 
+ for _, id := range []string{"r1", "r2", "r3"} { + c.Nodes[id].ReplicaState = NodeStateInSync + } + c.Nodes["r1"].Storage.FlushedLSN = 10 + c.Nodes["r2"].Storage.FlushedLSN = 20 + c.Nodes["r3"].Storage.FlushedLSN = 15 + + best := c.BestPromotionCandidate() + if best != "r2" { + t.Fatalf("higher LSN: expected r2, got %q", best) + } + + candidates := c.PromotionCandidates() + if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" { + t.Fatalf("order: got [%s, %s, %s], want [r2, r3, r1]", + candidates[0].ID, candidates[1].ID, candidates[2].ID) + } +} + +func TestP02_CandidateSelection_StoppedNode_Excluded(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.Nodes["r1"].Storage.FlushedLSN = 100 + c.Nodes["r2"].Storage.FlushedLSN = 50 + c.StopNode("r1") // highest LSN but stopped + + best := c.BestPromotionCandidate() + if best != "r2" { + t.Fatalf("stopped excluded: expected r2, got %q", best) + } + + // r1 should be last in ranking (not running). + candidates := c.PromotionCandidates() + if candidates[0].ID != "r2" { + t.Fatalf("first candidate should be r2, got %s", candidates[0].ID) + } + if candidates[1].Running { + t.Fatal("r1 should be marked not running") + } +} + +func TestP02_CandidateSelection_InSync_Beats_CatchingUp(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + + // r1: CatchingUp with highest LSN. + c.Nodes["r1"].ReplicaState = NodeStateCatchingUp + c.Nodes["r1"].Storage.FlushedLSN = 100 + + // r2: InSync with lower LSN. + c.Nodes["r2"].ReplicaState = NodeStateInSync + c.Nodes["r2"].Storage.FlushedLSN = 50 + + // r3: InSync with even lower LSN. + c.Nodes["r3"].ReplicaState = NodeStateInSync + c.Nodes["r3"].Storage.FlushedLSN = 40 + + // InSync with lower LSN beats CatchingUp with higher LSN. 
+ best := c.BestPromotionCandidate() + if best != "r2" { + t.Fatalf("InSync beats CatchingUp: expected r2, got %q", best) + } + + candidates := c.PromotionCandidates() + // r2 (InSync, 50), r3 (InSync, 40), r1 (CatchingUp, 100) + if candidates[0].ID != "r2" || candidates[1].ID != "r3" || candidates[2].ID != "r1" { + t.Fatalf("order: got [%s, %s, %s]", candidates[0].ID, candidates[1].ID, candidates[2].ID) + } +} + +func TestP02_CandidateSelection_AllCatchingUp_HighestLSN_Wins(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + + for _, id := range []string{"r1", "r2", "r3"} { + c.Nodes[id].ReplicaState = NodeStateCatchingUp + } + c.Nodes["r1"].Storage.FlushedLSN = 30 + c.Nodes["r2"].Storage.FlushedLSN = 80 + c.Nodes["r3"].Storage.FlushedLSN = 50 + + best := c.BestPromotionCandidate() + if best != "r2" { + t.Fatalf("all CatchingUp: expected r2 (highest LSN), got %q", best) + } +} + +func TestP02_CandidateSelection_NeedsRebuild_Skipped(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + + // r1: NeedsRebuild with highest LSN. + c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild + c.Nodes["r1"].Storage.FlushedLSN = 100 + + // r2: InSync with moderate LSN. + c.Nodes["r2"].ReplicaState = NodeStateInSync + c.Nodes["r2"].Storage.FlushedLSN = 50 + + // r3: CatchingUp with low LSN. 
+ c.Nodes["r3"].ReplicaState = NodeStateCatchingUp + c.Nodes["r3"].Storage.FlushedLSN = 20 + + best := c.BestPromotionCandidate() + if best != "r2" { + t.Fatalf("NeedsRebuild skipped: expected r2, got %q", best) + } + + candidates := c.PromotionCandidates() + // r2 (InSync, 50), r3 (CatchingUp, 20), r1 (NeedsRebuild, 100) + if candidates[0].ID != "r2" { + t.Fatalf("first should be r2, got %s", candidates[0].ID) + } + if candidates[2].ID != "r1" { + t.Fatalf("last should be r1 (NeedsRebuild), got %s", candidates[2].ID) + } +} + +func TestP02_CandidateSelection_NoRunning_ReturnsEmpty(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.StopNode("r1") + c.StopNode("r2") + + best := c.BestPromotionCandidate() + if best != "" { + t.Fatalf("no running: expected empty, got %q", best) + } +} + +func TestP02_CandidateSelection_AfterPartition_RankingUpdates(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // All InSync at FlushedLSN=2. Best = r1 (alphabetical). + if best := c.BestPromotionCandidate(); best != "r1" { + t.Fatalf("before partition: expected r1, got %q", best) + } + + // Partition r1. Write more via p+r2+r3. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + // With 4 members (p, r1, r2, r3), quorum = 4/2+1 = 3, so p+r2+r3 = 3 still meets quorum. + c.CommitWrite(3) + c.CommitWrite(4) + c.CommitWrite(5) + c.TickN(5) + + // r1 lagging, r2/r3 ahead. + c.Nodes["r1"].ReplicaState = NodeStateCatchingUp + + // Now r2 or r3 should win (both InSync with higher LSN). 
+ best := c.BestPromotionCandidate() + if best == "r1" { + t.Fatal("after partition: r1 should not be best (CatchingUp)") + } + if best != "r2" { + t.Fatalf("after partition: expected r2 (InSync, alphabetical tie-break), got %q", best) + } + t.Logf("after partition: best=%s", best) +} + +func TestP02_CandidateSelection_MixedStates_FullRanking(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4", "r5") + + // Set up a diverse state mix: + // r1: InSync, LSN=50 + // r2: InSync, LSN=60 (highest InSync) + // r3: CatchingUp, LSN=80 (highest overall but CatchingUp) + // r4: NeedsRebuild, LSN=90 (highest but NeedsRebuild) + // r5: stopped, LSN=100 (highest but not running) + c.Nodes["r1"].ReplicaState = NodeStateInSync + c.Nodes["r1"].Storage.FlushedLSN = 50 + c.Nodes["r2"].ReplicaState = NodeStateInSync + c.Nodes["r2"].Storage.FlushedLSN = 60 + c.Nodes["r3"].ReplicaState = NodeStateCatchingUp + c.Nodes["r3"].Storage.FlushedLSN = 80 + c.Nodes["r4"].ReplicaState = NodeStateNeedsRebuild + c.Nodes["r4"].Storage.FlushedLSN = 90 + c.Nodes["r5"].Storage.FlushedLSN = 100 + c.StopNode("r5") + + best := c.BestPromotionCandidate() + if best != "r2" { + t.Fatalf("mixed states: expected r2 (InSync+highest among InSync), got %q", best) + } + + candidates := c.PromotionCandidates() + // Expected order: r2(InSync,60), r1(InSync,50), r3(CatchingUp,80), + // r4(NeedsRebuild,90), r5(stopped,100) + expected := []string{"r2", "r1", "r3", "r4", "r5"} + for i, exp := range expected { + if candidates[i].ID != exp { + t.Fatalf("candidate[%d]: got %q, want %q", i, candidates[i].ID, exp) + } + } + t.Logf("full ranking: %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d) > %s(%s/%d)", + candidates[0].ID, candidates[0].State, candidates[0].FlushedLSN, + candidates[1].ID, candidates[1].State, candidates[1].FlushedLSN, + candidates[2].ID, candidates[2].State, candidates[2].FlushedLSN, + candidates[3].ID, candidates[3].State, candidates[3].FlushedLSN, + candidates[4].ID, 
candidates[4].State, candidates[4].FlushedLSN) +} + +func TestP02_CandidateSelection_AllNeedsRebuild_SafeDefaultEmpty(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.Nodes["r1"].ReplicaState = NodeStateNeedsRebuild + c.Nodes["r1"].Storage.FlushedLSN = 50 + c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild + c.Nodes["r2"].Storage.FlushedLSN = 80 + + // Safe default: refuses NeedsRebuild candidates. + safe := c.BestPromotionCandidate() + if safe != "" { + t.Fatalf("safe default should return empty for all-NeedsRebuild, got %q", safe) + } +} + +func TestP02_CandidateSelection_DesperationPromotion_ExplicitAPI(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + for _, id := range []string{"r1", "r2", "r3"} { + c.Nodes[id].ReplicaState = NodeStateNeedsRebuild + } + c.Nodes["r1"].Storage.FlushedLSN = 10 + c.Nodes["r2"].Storage.FlushedLSN = 30 + c.Nodes["r3"].Storage.FlushedLSN = 20 + + safe := c.BestPromotionCandidate() + if safe != "" { + t.Fatalf("safe default should return empty, got %q", safe) + } + + desperate := c.BestPromotionCandidateDesperate() + if desperate != "r2" { + t.Fatalf("desperation: expected r2 (highest LSN), got %q", desperate) + } +} + +// === Candidate eligibility tests === + +func TestP02_CandidateEligibility_Running(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.CommitWrite(1) + c.TickN(5) + + e := c.EvaluateCandidateEligibility("r1") + if !e.Eligible { + t.Fatalf("running InSync replica should be eligible, reasons: %v", e.Reasons) + } + + c.StopNode("r1") + e = c.EvaluateCandidateEligibility("r1") + if e.Eligible { + t.Fatal("stopped replica should not be eligible") + } + if e.Reasons[0] != "not_running" { + t.Fatalf("expected not_running reason, got %v", e.Reasons) + } +} + +func TestP02_CandidateEligibility_EpochAlignment(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.CommitWrite(1) + c.TickN(5) + + // Manually desync r1's epoch. 
+ c.Nodes["r1"].Epoch = c.Coordinator.Epoch - 1 + + e := c.EvaluateCandidateEligibility("r1") + if e.Eligible { + t.Fatal("epoch-misaligned replica should not be eligible") + } + found := false + for _, r := range e.Reasons { + if r == "epoch_misaligned" { + found = true + } + } + if !found { + t.Fatalf("expected epoch_misaligned reason, got %v", e.Reasons) + } +} + +func TestP02_CandidateEligibility_StateIneligible(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.CommitWrite(1) + c.TickN(5) + + for _, state := range []ReplicaNodeState{NodeStateNeedsRebuild, NodeStateRebuilding} { + c.Nodes["r1"].ReplicaState = state + e := c.EvaluateCandidateEligibility("r1") + if e.Eligible { + t.Fatalf("%s should not be eligible", state) + } + } + + // CatchingUp IS eligible (data may be mostly current). + c.Nodes["r1"].ReplicaState = NodeStateCatchingUp + e := c.EvaluateCandidateEligibility("r1") + if !e.Eligible { + t.Fatalf("CatchingUp should be eligible, reasons: %v", e.Reasons) + } +} + +func TestP02_CandidateEligibility_InsufficientCommittedPrefix(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.CommitWrite(1) + c.TickN(5) + + // r1 has FlushedLSN=1, CommittedLSN=1 → eligible. + e := c.EvaluateCandidateEligibility("r1") + if !e.Eligible { + t.Fatalf("r1 at committed prefix should be eligible, reasons: %v", e.Reasons) + } + + // Manually set r1 behind committed prefix. + c.Nodes["r1"].Storage.FlushedLSN = 0 + e = c.EvaluateCandidateEligibility("r1") + if e.Eligible { + t.Fatal("FlushedLSN=0 with CommittedLSN=1 should not be eligible") + } + found := false + for _, r := range e.Reasons { + if r == "insufficient_committed_prefix" { + found = true + } + } + if !found { + t.Fatalf("expected insufficient_committed_prefix reason, got %v", e.Reasons) + } +} + +func TestP02_CandidateEligibility_InSyncButLagging_Rejected(t *testing.T) { + // Scenario from finding: r1 is InSync with correct epoch but FlushedLSN << CommittedLSN. 
+ // r2 is CatchingUp but has the committed prefix. r2 should be selected over r1. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3") + + // Set committed prefix high. + c.Coordinator.CommittedLSN = 100 + + // r1: InSync, correct epoch, but FlushedLSN=1. Ineligible. + c.Nodes["r1"].ReplicaState = NodeStateInSync + c.Nodes["r1"].Storage.FlushedLSN = 1 + + // r2: CatchingUp, correct epoch, FlushedLSN=100. Eligible. + c.Nodes["r2"].ReplicaState = NodeStateCatchingUp + c.Nodes["r2"].Storage.FlushedLSN = 100 + + // r3: InSync, correct epoch, FlushedLSN=100. Eligible. + c.Nodes["r3"].ReplicaState = NodeStateInSync + c.Nodes["r3"].Storage.FlushedLSN = 100 + + // r1 is ineligible despite being InSync. + e1 := c.EvaluateCandidateEligibility("r1") + if e1.Eligible { + t.Fatal("r1 (InSync, FlushedLSN=1, CommittedLSN=100) should be ineligible") + } + + // r2 and r3 are eligible. + e2 := c.EvaluateCandidateEligibility("r2") + if !e2.Eligible { + t.Fatalf("r2 should be eligible, reasons: %v", e2.Reasons) + } + + // BestPromotionCandidate should pick r3 (InSync with prefix) over r2 (CatchingUp). + best := c.BestPromotionCandidate() + if best != "r3" { + t.Fatalf("expected r3 (InSync+prefix), got %q", best) + } + + // r1 must NOT be in the eligible list at all. 
+ eligible := c.EligiblePromotionCandidates() + for _, pc := range eligible { + if pc.ID == "r1" { + t.Fatal("r1 should not appear in eligible candidates") + } + } + t.Logf("committed-prefix gate: r1(InSync/flushed=1) rejected, r3(InSync/flushed=100) selected") +} + +func TestP02_CandidateEligibility_EligiblePromotionCandidates(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4") + c.CommitWrite(1) + c.TickN(5) + + // r1: InSync, eligible + // r2: NeedsRebuild, ineligible + c.Nodes["r2"].ReplicaState = NodeStateNeedsRebuild + // r3: stopped, ineligible + c.StopNode("r3") + // r4: epoch misaligned, ineligible + c.Nodes["r4"].Epoch = 0 + + eligible := c.EligiblePromotionCandidates() + if len(eligible) != 1 { + t.Fatalf("expected 1 eligible candidate, got %d", len(eligible)) + } + if eligible[0].ID != "r1" { + t.Fatalf("expected r1 as only eligible, got %s", eligible[0].ID) + } + + // BestPromotionCandidate uses eligibility. + best := c.BestPromotionCandidate() + if best != "r1" { + t.Fatalf("BestPromotionCandidate should return r1, got %q", best) + } +} diff --git a/sw-block/prototype/distsim/phase02_network_test.go b/sw-block/prototype/distsim/phase02_network_test.go new file mode 100644 index 000000000..0ba91622f --- /dev/null +++ b/sw-block/prototype/distsim/phase02_network_test.go @@ -0,0 +1,371 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 02: Delayed/drop network + multi-node reservation expiry +// ============================================================ + +// --- Item 4: Stale delayed messages after heal/promote --- + +// Scenario: messages from old primary are in-flight when partition heals +// and a new primary is promoted. The stale messages arrive AFTER the +// promotion. They must be rejected by epoch fencing. 
+ +func TestP02_DelayedStaleMessages_AfterPromote(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "A", "B", "C") + + // Phase 1: A writes, ships to B and C. + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // Phase 2: A writes more, but we manually enqueue delayed delivery + // to simulate in-flight messages when partition happens. + c.CommitWrite(3) // LSN 3 ships normally + // Don't tick yet — messages are in the queue. + + // Phase 3: Partition A from everyone, promote B. + c.Disconnect("A", "B") + c.Disconnect("B", "A") + c.Disconnect("A", "C") + c.Disconnect("C", "A") + c.StopNode("A") + c.Promote("B") + + // Phase 4: Manually inject stale messages as if they were delayed in the network. + // These represent A's write(3) + barrier(3) that were in-flight when A crashed. + staleEpoch := c.Coordinator.Epoch - 1 + c.InjectMessage(Message{ + Kind: MsgWrite, From: "A", To: "B", Epoch: staleEpoch, + Write: Write{LSN: 3, Block: 3, Value: 3}, + }, c.Now+1) + c.InjectMessage(Message{ + Kind: MsgBarrier, From: "A", To: "B", Epoch: staleEpoch, + TargetLSN: 3, + }, c.Now+2) + c.InjectMessage(Message{ + Kind: MsgWrite, From: "A", To: "C", Epoch: staleEpoch, + Write: Write{LSN: 3, Block: 3, Value: 3}, + }, c.Now+1) + + // Phase 5: Tick to deliver stale messages. + committedBefore := c.Coordinator.CommittedLSN + c.TickN(5) + + // All stale messages must be rejected — either by epoch fencing or node-down. + epochRejects := c.RejectedByReason(RejectEpochMismatch) + nodeDownRejects := c.RejectedByReason(RejectNodeDown) + totalRejects := epochRejects + nodeDownRejects + if totalRejects == 0 { + t.Fatal("stale delayed messages were not rejected") + } + + // Committed prefix must not change from stale messages. + if c.Coordinator.CommittedLSN != committedBefore { + t.Fatalf("stale delayed messages changed committed prefix: before=%d after=%d", + committedBefore, c.Coordinator.CommittedLSN) + } + + // Data correct on new primary. 
+ if err := c.AssertCommittedRecoverable("B"); err != nil { + t.Fatalf("data incorrect after stale delayed messages: %v", err) + } + t.Logf("stale delayed messages: %d rejected by epoch_mismatch", epochRejects) +} + +// Scenario: old barrier ACK arrives after promotion with long delay. +// This is different from S18 — the delay is network-level, not restart-level. + +func TestP02_DelayedBarrierAck_LongNetworkDelay(t *testing.T) { + c := NewCluster(CommitSyncAll, "p", "r1") + + c.CommitWrite(1) + c.TickN(5) + + // Write 2 — barrier sent to r1. + c.CommitWrite(2) + c.TickN(2) // barrier in flight + + // Promote r1 (simulate primary failure + promotion). + c.StopNode("p") + c.Promote("r1") + + committedBefore := c.Coordinator.CommittedLSN + + // Long-delayed barrier ack from r1 → dead primary p. + c.InjectMessage(Message{ + Kind: MsgBarrierAck, From: "r1", To: "p", + Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2, + }, c.Now+10) + + c.TickN(15) + + // Must be rejected — p is dead and epoch is stale. + nodeDownRejects := c.RejectedByReason(RejectNodeDown) + epochRejects := c.RejectedByReason(RejectEpochMismatch) + if nodeDownRejects == 0 && epochRejects == 0 { + t.Fatal("delayed barrier ack should be rejected (node down or epoch mismatch)") + } + + // Stale ack must not advance committed prefix. + if c.Coordinator.CommittedLSN != committedBefore { + t.Fatalf("stale ack changed committed prefix: before=%d after=%d", + committedBefore, c.Coordinator.CommittedLSN) + } +} + +// Scenario: write ships to replica, network drops the write but delivers +// the barrier. Barrier should timeout or detect missing data. + +func TestP02_DroppedWrite_BarrierDelivered_Stalls(t *testing.T) { + c := NewCluster(CommitSyncAll, "p", "r1") + + c.CommitWrite(1) + c.TickN(5) + + // Write 2 — but drop the write message to r1 (link down for data only). + // We simulate by writing but not ticking, then dropping queued writes. 
+ c.CommitWrite(2) // enqueues write(2) + barrier(2) to r1 + + // Remove only the write message from the queue (simulate selective drop). + var kept []inFlightMessage + for _, item := range c.Queue { + if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 2 { + continue // drop this write + } + kept = append(kept, item) + } + c.Queue = kept + + // Tick — barrier arrives at r1 but r1 doesn't have LSN 2. + // Barrier should re-queue (waiting for data). + c.TickN(10) + + // Assert 1: sync_all blocked — CommittedLSN stuck at 1. + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("sync_all should be blocked at LSN 1, got committed=%d", c.Coordinator.CommittedLSN) + } + + // Assert 2: LSN 2 is pending but NOT committed. + p2 := c.Pending[2] + if p2 == nil { + t.Fatal("LSN 2 should be pending") + } + if p2.Committed { + t.Fatal("LSN 2 committed under sync_all but r1 never received the write — safety violation") + } + + // Assert 3: barrier still re-queuing — stall proven positively. + barrierRequeued := false + for _, item := range c.Queue { + if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 { + barrierRequeued = true + break + } + } + if !barrierRequeued { + t.Fatal("barrier for LSN 2 should still be re-queuing — stall not proven") + } + t.Logf("dropped write stall proven: committed=%d, pending[2].committed=%v, barrier re-queuing=%v", + c.Coordinator.CommittedLSN, p2.Committed, barrierRequeued) +} + +// --- Item 5: Multi-node reservation expiry / rebuild timeout --- + +// Scenario: RF=3 cluster. Two replicas need catch-up. One's reservation +// expires during recovery. Must handle correctly: one rebuilds, one catches up. + +func TestP02_MultiNode_ReservationExpiry_MixedOutcome(t *testing.T) { + // 5 nodes: p+r3+r4 provide quorum (3 of 5) while r1+r2 are disconnected. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4") + + // Write initial data. 
+ for i := uint64(1); i <= 10; i++ { + c.CommitWrite(i % 4) + } + c.TickN(5) + + // Take snapshot for rebuild. + c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN) + + // r1+r2 disconnect. r3+r4 stay for quorum (p+r3+r4 = 3 of 5). + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + // Write more during disconnect — committed via p+r3+r4 quorum. + for i := uint64(11); i <= 30; i++ { + c.CommitWrite(i % 4) + } + c.TickN(5) + + // Reconnect both. + c.Connect("p", "r1") + c.Connect("r1", "p") + c.Connect("p", "r2") + c.Connect("r2", "p") + + // r1: reserved catch-up with tight expiry — MUST expire. + // 20 entries to replay, but only 2 ticks of budget. + r1 := c.Nodes["r1"] + shortExpiry := c.Now + 2 + err := c.RecoverReplicaFromPrimaryReserved("r1", r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN, shortExpiry) + if err == nil { + t.Fatal("r1 reservation must expire — 20 entries with 2-tick budget") + } + r1.ReplicaState = NodeStateNeedsRebuild + t.Logf("r1 reservation expired: %v", err) + + // r2: full catch-up (no reservation pressure). + r2 := c.Nodes["r2"] + if err := c.RecoverReplicaFromPrimary("r2", r2.Storage.FlushedLSN, c.Coordinator.CommittedLSN); err != nil { + t.Fatalf("r2 full catch-up failed: %v", err) + } + r2.ReplicaState = NodeStateInSync + + // Deterministic mixed outcome: r1=NeedsRebuild, r2=InSync. + if r1.ReplicaState != NodeStateNeedsRebuild { + t.Fatalf("r1 should be NeedsRebuild, got %s", r1.ReplicaState) + } + if r2.ReplicaState != NodeStateInSync { + t.Fatalf("r2 should be InSync, got %s", r2.ReplicaState) + } + + // r2 data correct. + if err := c.AssertCommittedRecoverable("r2"); err != nil { + t.Fatalf("r2 data incorrect: %v", err) + } + + // r1 rebuild from snapshot. 
+ c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN) + r1.ReplicaState = NodeStateInSync + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatalf("r1 data incorrect after rebuild: %v", err) + } + + t.Logf("mixed outcome proven: r1=NeedsRebuild→rebuilt, r2=InSync") +} + +// Scenario: all replicas need rebuild while only one snapshot exists. +// Both failed replicas escalate to NeedsRebuild and are then rebuilt +// from that single shared snapshot. + +func TestP02_MultiNode_AllNeedRebuild(t *testing.T) { + // Use 5 nodes so quorum (3 of 5) can be met with p+r3+r4 while r1+r2 are down. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2", "r3", "r4") + c.MaxCatchupAttempts = 2 + + for i := uint64(1); i <= 5; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + c.Primary().Storage.TakeSnapshot("snap-all", c.Coordinator.CommittedLSN) + + // r1 and r2 disconnect. r3+r4 stay connected so quorum (p+r3+r4) can commit. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + for i := uint64(6); i <= 100; i++ { + c.CommitWrite(i % 8) + } + c.TickN(5) + + // Try catch-up for r1 and r2 — both will escalate. + // Pattern: write while target disconnected, then try partial catch-up. + for _, id := range []string{"r1", "r2"} { + n := c.Nodes[id] + n.ReplicaState = NodeStateCatchingUp + for attempt := 0; attempt < 5; attempt++ { + // Write MORE while target is still disconnected (r3+r4 provide quorum). + for w := 0; w < 20; w++ { + c.CommitWrite(uint64(101+attempt*20+w) % 8) + } + c.TickN(3) // 3 ticks: deliver writes, barriers, then acks + // Now try catch-up (partial, batch=1). Target stays disconnected — + // RecoverReplicaFromPrimaryPartial reads directly from primary WAL. + c.CatchUpWithEscalation(id, 1) + if n.ReplicaState == NodeStateNeedsRebuild { + break + } + } + } + + // Reconnect all for rebuild. 
+ c.Connect("p", "r1") + c.Connect("r1", "p") + c.Connect("p", "r2") + c.Connect("r2", "p") + + // Both should be NeedsRebuild. + if c.Nodes["r1"].ReplicaState != NodeStateNeedsRebuild { + t.Fatalf("r1: expected NeedsRebuild, got %s", c.Nodes["r1"].ReplicaState) + } + if c.Nodes["r2"].ReplicaState != NodeStateNeedsRebuild { + t.Fatalf("r2: expected NeedsRebuild, got %s", c.Nodes["r2"].ReplicaState) + } + + // Rebuild both from snapshot. + c.RebuildReplicaFromSnapshot("r1", "snap-all", c.Coordinator.CommittedLSN) + c.RebuildReplicaFromSnapshot("r2", "snap-all", c.Coordinator.CommittedLSN) + c.Nodes["r1"].ReplicaState = NodeStateInSync + c.Nodes["r2"].ReplicaState = NodeStateInSync + + // Both correct. + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } + if err := c.AssertCommittedRecoverable("r2"); err != nil { + t.Fatal(err) + } + t.Logf("multi-node rebuild complete: both replicas recovered from snapshot") +} + +// Scenario: rebuild timeout — rebuild takes too long, coordinator +// should be able to abort and retry or fail explicitly. + +func TestP02_RebuildTimeout_PartialRebuildAborts(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + for i := uint64(1); i <= 20; i++ { + c.CommitWrite(i % 4) + } + c.TickN(5) + + c.Primary().Storage.TakeSnapshot("snap-timeout", c.Coordinator.CommittedLSN) + + // Write much more. + for i := uint64(21); i <= 100; i++ { + c.CommitWrite(i % 4) + } + c.TickN(5) + + // r1 needs rebuild — use partial rebuild with small max. + lastRecovered, err := c.RebuildReplicaFromSnapshotPartial("r1", "snap-timeout", c.Coordinator.CommittedLSN, 5) + if err != nil { + t.Fatalf("partial rebuild: %v", err) + } + + // Partial rebuild: not complete. + if lastRecovered >= c.Coordinator.CommittedLSN { + t.Fatal("expected partial rebuild, not complete") + } + + // r1 state should remain NeedsRebuild (not promoted to InSync). 
+ c.Nodes["r1"].ReplicaState = NodeStateRebuilding + if c.Nodes["r1"].ReplicaState == NodeStateInSync { + t.Fatal("partial rebuild should not grant InSync") + } + + // Full rebuild to complete. + c.RebuildReplicaFromSnapshot("r1", "snap-timeout", c.Coordinator.CommittedLSN) + c.Nodes["r1"].ReplicaState = NodeStateInSync + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } +} diff --git a/sw-block/prototype/distsim/phase02_test.go b/sw-block/prototype/distsim/phase02_test.go new file mode 100644 index 000000000..78b4a42fb --- /dev/null +++ b/sw-block/prototype/distsim/phase02_test.go @@ -0,0 +1,359 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 02: Protocol-state assertions + version comparison +// ============================================================ + +// --- P0: Protocol-level rejection assertions --- + +func TestP02_EpochFencing_AllStaleTrafficRejected(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // Partition + promote. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + c.Promote("r1") + staleEpoch := c.Coordinator.Epoch - 1 + c.Nodes["p"].Epoch = staleEpoch + + // Stale writes through protocol. + delivered := c.StaleWrite("p", staleEpoch, 99) + + // Protocol-level assertion: zero accepted, all rejected by epoch. + if delivered > 0 { + t.Fatalf("stale traffic accepted: %d messages passed fencing", delivered) + } + epochRejects := c.RejectedByReason(RejectEpochMismatch) + if epochRejects == 0 { + t.Fatal("no epoch rejections recorded — fencing not tracked") + } + + // Delivery log must show explicit rejections (protocol behavior, not just final state). 
+ totalRejected := 0 + for _, d := range c.Deliveries { + if !d.Accepted { + totalRejected++ + } + } + if totalRejected == 0 { + t.Fatal("delivery log has no rejections — protocol behavior not recorded") + } + t.Logf("protocol-level: %d rejected, %d epoch_mismatch", totalRejected, epochRejects) +} + +func TestP02_AcceptedDeliveries_Tracked(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // Should have accepted write + barrier deliveries. + accepted := c.AcceptedCount() + if accepted == 0 { + t.Fatal("no accepted deliveries recorded") + } + acceptedWrites := c.AcceptedByKind(MsgWrite) + if acceptedWrites == 0 { + t.Fatal("no accepted write deliveries") + } + t.Logf("after 1 write: %d accepted total, %d writes", accepted, acceptedWrites) +} + +// --- P1: S20 protocol-level closure --- + +func TestP02_S20_StaleTraffic_CommittedPrefixUnchanged(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "A", "B", "C") + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // Partition A, promote B. + c.Disconnect("A", "B") + c.Disconnect("B", "A") + c.Disconnect("A", "C") + c.Disconnect("C", "A") + c.Promote("B") + c.Nodes["A"].Epoch = c.Coordinator.Epoch - 1 + + // B writes (new epoch). + c.CommitWrite(3) + c.TickN(5) + committedBefore := c.Coordinator.CommittedLSN + + // A stale writes through protocol. + c.StaleWrite("A", c.Nodes["A"].Epoch, 99) + + // Protocol assertion: committed prefix unchanged by stale traffic. + committedAfter := c.Coordinator.CommittedLSN + if committedAfter != committedBefore { + t.Fatalf("stale traffic changed committed prefix: before=%d after=%d", committedBefore, committedAfter) + } + + // All stale messages rejected by epoch. 
+ if c.RejectedByReason(RejectEpochMismatch) == 0 { + t.Fatal("no epoch rejections for stale traffic") + } +} + +// --- P1: S6 protocol-level closure --- + +func TestP02_S6_NonConvergent_ExplicitStateTransition(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.MaxCatchupAttempts = 3 + + for i := uint64(1); i <= 5; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + for i := uint64(6); i <= 100; i++ { + c.CommitWrite(i % 8) + } + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + + // Protocol assertion: state transitions are explicit. + // Track the state at each step. + var stateTrace []ReplicaNodeState + for attempt := 0; attempt < 10; attempt++ { + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + for w := 0; w < 20; w++ { + c.CommitWrite(uint64(101+attempt*20+w) % 8) + } + c.TickN(2) + c.Connect("p", "r1") + c.Connect("r1", "p") + + c.CatchUpWithEscalation("r1", 1) + stateTrace = append(stateTrace, r1.ReplicaState) + + if r1.ReplicaState == NodeStateNeedsRebuild { + break + } + } + + // Must have explicit state transitions: CatchingUp → ... → NeedsRebuild. + if r1.ReplicaState != NodeStateNeedsRebuild { + t.Fatalf("expected NeedsRebuild, got %s", r1.ReplicaState) + } + // Trace must show CatchingUp before NeedsRebuild. + hasCatchingUp := false + for _, s := range stateTrace { + if s == NodeStateCatchingUp { + hasCatchingUp = true + } + } + if !hasCatchingUp { + t.Fatal("state trace should include CatchingUp before NeedsRebuild") + } + t.Logf("state trace: %v", stateTrace) +} + +// --- P1: S18 protocol-level closure --- + +func TestP02_S18_DelayedAck_ExplicitRejection(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // Write 2 without r1 ack. + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.TickN(3) + + // Restart primary with epoch bump. 
+ c.StopNode("p") + c.Coordinator.Epoch++ + for _, n := range c.Nodes { + if n.Running { + n.Epoch = c.Coordinator.Epoch + } + } + c.StartNode("p") + + committedBefore := c.Coordinator.CommittedLSN + deliveriesBefore := len(c.Deliveries) + + // Reconnect r1, deliver stale ack. + c.Connect("r1", "p") + c.Connect("p", "r1") + oldAck := Message{ + Kind: MsgBarrierAck, From: "r1", To: "p", + Epoch: c.Coordinator.Epoch - 1, TargetLSN: 2, + } + c.deliver(oldAck) + c.refreshCommits() + + // Protocol assertion 1: committed prefix unchanged. + if c.Coordinator.CommittedLSN > committedBefore { + t.Fatalf("stale ack advanced prefix: %d → %d", committedBefore, c.Coordinator.CommittedLSN) + } + + // Protocol assertion 2: the delivery was explicitly recorded as rejected. + newDeliveries := c.Deliveries[deliveriesBefore:] + found := false + for _, d := range newDeliveries { + if !d.Accepted && d.Reason == RejectEpochMismatch && d.Msg.Kind == MsgBarrierAck { + found = true + } + } + if !found { + t.Fatal("stale ack not recorded as epoch_mismatch rejection in delivery log") + } +} + +// --- P2: Version comparison --- + +func TestP02_VersionComparison_BriefDisconnect(t *testing.T) { + // Same scenario under V1, V1.5, V2 — different expected outcomes. + for _, tc := range []struct { + version ProtocolVersion + expectCatchup bool + expectRebuild bool + }{ + {ProtocolV1, false, false}, // V1: no catch-up, stays degraded + {ProtocolV15, true, false}, // V1.5: catch-up possible if address stable + {ProtocolV2, true, false}, // V2: catch-up allowed for this recoverable short-gap case + } { + t.Run(string(tc.version), func(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, tc.version, "p", "r1", "r2") + c.MaxCatchupAttempts = 5 + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // Brief disconnect. 
+ c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(3) + c.CommitWrite(4) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + canCatchup := c.Protocol.CanAttemptCatchup(true) + if canCatchup != tc.expectCatchup { + t.Fatalf("CanAttemptCatchup: got %v, want %v", canCatchup, tc.expectCatchup) + } + + if canCatchup { + // Catch up r1. + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("expected catch-up to converge for short gap") + } + if r1.ReplicaState != NodeStateInSync { + t.Fatalf("expected InSync after catch-up, got %s", r1.ReplicaState) + } + } + }) + } +} + +func TestP02_VersionComparison_BriefDisconnectActions(t *testing.T) { + for _, tc := range []struct { + version ProtocolVersion + addrStable bool + recoverable bool + expectAction string + }{ + {ProtocolV1, true, true, "degrade_or_rebuild"}, + {ProtocolV15, true, true, "catchup_if_history_survives"}, + {ProtocolV15, false, true, "stall_or_control_plane_recovery"}, + {ProtocolV2, true, true, "reserved_catchup"}, + {ProtocolV2, false, false, "explicit_rebuild"}, + } { + t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable)+"_recoverable="+boolStr(tc.recoverable), func(t *testing.T) { + policy := ProtocolPolicy{Version: tc.version} + action := policy.BriefDisconnectAction(tc.addrStable, tc.recoverable) + if action != tc.expectAction { + t.Fatalf("BriefDisconnectAction(%v,%v): got %q, want %q", tc.addrStable, tc.recoverable, action, tc.expectAction) + } + }) + } +} + +func TestP02_VersionComparison_TailChasing(t *testing.T) { + for _, tc := range []struct { + version ProtocolVersion + expectAction string + }{ + {ProtocolV1, "degrade"}, + {ProtocolV15, "stall_or_rebuild"}, + {ProtocolV2, "abort_to_rebuild"}, + } { + t.Run(string(tc.version), func(t *testing.T) { + policy := ProtocolPolicy{Version: tc.version} + action := policy.TailChasingAction(false) // non-convergent + if action != 
tc.expectAction { + t.Fatalf("TailChasingAction(false): got %q, want %q", action, tc.expectAction) + } + }) + } +} + +func TestP02_VersionComparison_RestartRejoin(t *testing.T) { + for _, tc := range []struct { + version ProtocolVersion + addrStable bool + expectAction string + }{ + {ProtocolV1, true, "control_plane_only"}, + {ProtocolV1, false, "control_plane_only"}, + {ProtocolV15, true, "background_reconnect_or_control_plane"}, + {ProtocolV15, false, "control_plane_only"}, + {ProtocolV2, true, "direct_reconnect_or_control_plane"}, + {ProtocolV2, false, "explicit_reassignment_or_rebuild"}, + } { + t.Run(string(tc.version)+"_stable="+boolStr(tc.addrStable), func(t *testing.T) { + policy := ProtocolPolicy{Version: tc.version} + action := policy.RestartRejoinAction(tc.addrStable) + if action != tc.expectAction { + t.Fatalf("RestartRejoinAction(%v): got %q, want %q", tc.addrStable, action, tc.expectAction) + } + }) + } +} + +func TestP02_VersionComparison_V15RestartAddressInstability(t *testing.T) { + v15 := ProtocolPolicy{Version: ProtocolV15} + v2 := ProtocolPolicy{Version: ProtocolV2} + + if got := v15.RestartRejoinAction(false); got != "control_plane_only" { + t.Fatalf("v1.5 changed-address restart should fall back to control plane, got %q", got) + } + if got := v2.ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" { + t.Fatalf("v2 changed-address recoverable restart should use explicit reassignment + catch-up, got %q", got) + } + if got := v2.ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" { + t.Fatalf("v2 changed-address unrecoverable restart should go to explicit reassignment/rebuild, got %q", got) + } +} + +func boolStr(b bool) string { + if b { + return "true" + } + return "false" +} diff --git a/sw-block/prototype/distsim/phase02_v1_failures_test.go b/sw-block/prototype/distsim/phase02_v1_failures_test.go new file mode 100644 index 000000000..b129897d5 --- /dev/null +++ 
b/sw-block/prototype/distsim/phase02_v1_failures_test.go @@ -0,0 +1,434 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 02 P2: Real V1/V1.5 failure reproductions +// Source: actual Phase 13 hardware behavior and CP13-8 findings +// ============================================================ + +// --- Scenario: Changed-address restart (CP13-8 T4b) --- +// Real bug: replica restarts on a different port. V1.5 shipper retries +// the old address forever. Catch-up never succeeds because the old +// address is dead. + +func TestP02_V1_ChangedAddressRestart_NeverRecovers(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2") + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // r1 restarts with changed address — endpoint version bumps. + c.StopNode("r1") + c.Coordinator.Epoch++ + for _, n := range c.Nodes { + if n.Running { + n.Epoch = c.Coordinator.Epoch + } + } + c.RestartNodeWithNewAddress("r1") + + // Messages from primary to r1 now rejected: stale endpoint. + staleRejects := c.RejectedByReason(RejectStaleEndpoint) + + // Writes accumulate — r1 can't receive (endpoint mismatch). + for i := uint64(3); i <= 12; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // Verify: messages rejected by stale endpoint, not just link down. + newStaleRejects := c.RejectedByReason(RejectStaleEndpoint) - staleRejects + if newStaleRejects == 0 { + t.Fatal("V1: writes to r1 should be rejected by stale_endpoint") + } + + // V1: no recovery trigger available. + trigger, _, ok := c.TriggerRecoverySession("r1") + if ok { + t.Fatalf("V1 should not trigger recovery, got %s", trigger) + } + + // Gap confirmed. 
+ r1 := c.Nodes["r1"] + if err := c.AssertCommittedRecoverable("r1"); err == nil { + t.Fatal("V1: r1 should have data inconsistency") + } + t.Logf("V1: gap=%d, %d stale_endpoint rejections, no recovery path", + c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN, newStaleRejects) +} + +func TestP02_V15_ChangedAddressRestart_RetriesToStaleAddress(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2") + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // r1 restarts with changed address — endpoint version bumps. + c.StopNode("r1") + c.Coordinator.Epoch++ + for _, n := range c.Nodes { + if n.Running { + n.Epoch = c.Coordinator.Epoch + } + } + c.RestartNodeWithNewAddress("r1") + + // Writes accumulate — rejected by stale endpoint. + for i := uint64(3); i <= 12; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // V1.5: recovery trigger fails — address mismatch detected. + trigger, _, ok := c.TriggerRecoverySession("r1") + if ok { + t.Fatalf("V1.5 should not trigger recovery with changed address, got %s", trigger) + } + + // Heartbeat reveals new endpoint, but V1.5 can only do control_plane_only. + report := c.ReportHeartbeat("r1") + update := c.CoordinatorDetectEndpointChange(report) + if update == nil { + t.Fatal("coordinator should detect endpoint change") + } + // V1.5: does NOT apply assignment update — no mechanism to update primary. + if got := c.Protocol.ChangedAddressRestartAction(true); got != "control_plane_only" { + t.Fatalf("V1.5: got %q, want control_plane_only", got) + } + + // Gap persists, data inconsistency. 
+ r1 := c.Nodes["r1"] + if err := c.AssertCommittedRecoverable("r1"); err == nil { + t.Fatal("V1.5: r1 should have data inconsistency") + } + t.Logf("V1.5: gap=%d, stale endpoint blocks recovery — control_plane_only", + c.Coordinator.CommittedLSN-r1.Storage.FlushedLSN) +} + +func TestP02_V2_ChangedAddressRestart_ExplicitReassignment(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2") + c.MaxCatchupAttempts = 5 + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // r1 restarts with changed address — endpoint version bumps. + c.StopNode("r1") + c.Coordinator.Epoch++ + for _, n := range c.Nodes { + if n.Running { + n.Epoch = c.Coordinator.Epoch + } + } + c.RestartNodeWithNewAddress("r1") + + // Writes accumulate — rejected by stale endpoint. + for i := uint64(3); i <= 12; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // Before control-plane flow: recovery trigger fails (stale endpoint). + trigger, _, ok := c.TriggerRecoverySession("r1") + if ok { + t.Fatalf("V2: recovery should fail before assignment update, got %s", trigger) + } + + // Step 1: heartbeat discovers new endpoint. + report := c.ReportHeartbeat("r1") + update := c.CoordinatorDetectEndpointChange(report) + if update == nil { + t.Fatal("coordinator should detect endpoint change") + } + + // Step 2: coordinator applies assignment — primary learns new address. + c.ApplyAssignmentUpdate(*update) + + // Step 3: recovery trigger now succeeds (endpoint matches). + trigger, _, ok = c.TriggerRecoverySession("r1") + if !ok || trigger != TriggerReassignment { + t.Fatalf("V2: expected reassignment trigger after update, got %s/%v", trigger, ok) + } + + // Step 4: catch-up via protocol. + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("V2: catch-up should converge after reassignment") + } + + // Data correct after full control-plane flow. 
+ if err := c.AssertCommittedRecoverable("r1"); err != nil {
+ t.Fatalf("V2: data incorrect after reassignment+catchup: %v", err)
+ }
+ t.Logf("V2: recovered via heartbeat→detect→assignment→trigger→catchup")
+}
+
+// --- Scenario: Same-address transient outage ---
+// Common case: brief network hiccup, same ports.
+
+func TestP02_V1_TransientOutage_Degrades(t *testing.T) {
+ c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2")
+
+ c.CommitWrite(1)
+ c.TickN(5)
+
+ // Brief partition.
+ c.Disconnect("p", "r1")
+ c.Disconnect("r1", "p")
+ c.CommitWrite(2)
+ c.CommitWrite(3)
+ c.TickN(5)
+
+ // Heal.
+ c.Connect("p", "r1")
+ c.Connect("r1", "p")
+
+ // V1: no catch-up. r1 stays at flushed=1.
+ if c.Protocol.CanAttemptCatchup(true) {
+ t.Fatal("V1 should not catch-up even with stable address")
+ }
+
+ c.TickN(5)
+ r1 := c.Nodes["r1"]
+ // V1 doesn't catch up. r1 can reach CommittedLSN here only if messages
+ // enqueued before the disconnect are still delivering; in our model that
+ // is a V1 "accident", not protocol behavior, so log rather than assert
+ // on FlushedLSN.
+ t.Logf("V1 transient outage: flushed=%d committed=%d action=%s",
+ r1.Storage.FlushedLSN, c.Coordinator.CommittedLSN,
+ c.Protocol.BriefDisconnectAction(true, true))
+}
+
+func TestP02_V15_TransientOutage_CatchesUp(t *testing.T) {
+ c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2")
+ c.MaxCatchupAttempts = 5
+
+ c.CommitWrite(1)
+ c.TickN(5)
+
+ c.Disconnect("p", "r1")
+ c.Disconnect("r1", "p")
+ c.CommitWrite(2)
+ c.CommitWrite(3)
+ c.TickN(5)
+
+ c.Connect("p", "r1")
+ c.Connect("r1", "p")
+
+ // V1.5: catch-up works if address stable. 
+ if !c.Protocol.CanAttemptCatchup(true) { + t.Fatal("V1.5 should catch-up with stable address") + } + + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("V1.5: should converge for short gap with stable address") + } + if r1.ReplicaState != NodeStateInSync { + t.Fatalf("V1.5: expected InSync, got %s", r1.ReplicaState) + } + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } + t.Logf("V1.5 transient outage: recovered via catch-up, flushed=%d", r1.Storage.FlushedLSN) +} + +func TestP02_V2_TransientOutage_ReservedCatchup(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2") + c.MaxCatchupAttempts = 5 + + c.CommitWrite(1) + c.TickN(5) + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.CommitWrite(3) + c.TickN(5) + + c.Connect("p", "r1") + c.Connect("r1", "p") + + // V2: reserved catch-up — explicit recoverability check. + action := c.Protocol.BriefDisconnectAction(true, true) + if action != "reserved_catchup" { + t.Fatalf("V2 brief disconnect: got %q, want reserved_catchup", action) + } + + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("V2: should converge for short gap") + } + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatal(err) + } + t.Logf("V2 transient outage: reserved catch-up succeeded") +} + +// --- Scenario: Slow control-plane recovery --- +// Source: real Phase 13 hardware behavior. +// Data path recovers fast. Control plane (master) is slow to re-issue +// assignments. During this window, V1/V1.5 behavior differs from V2. + +func TestP02_SlowControlPlane_V1_WaitsForMaster(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV1, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // r1 disconnects. 
Stays disconnected through outage + control-plane delay. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + // Writes accumulate: outage write + delay-window writes. r1 misses all. + for i := uint64(2); i <= 10; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // Data path heals — but V1 has no catch-up protocol. + c.Connect("p", "r1") + c.Connect("r1", "p") + + // V1: no recovery trigger even with address stable. + trigger, _, ok := c.TriggerRecoverySession("r1") + if ok { + t.Fatalf("V1 should not trigger recovery, got %s", trigger) + } + + // r1 is behind: FlushedLSN=1, CommittedLSN=10. Gap = 9. + r1 := c.Nodes["r1"] + gap := c.Coordinator.CommittedLSN - r1.Storage.FlushedLSN + if gap < 9 { + t.Fatalf("V1: expected gap >= 9, got %d", gap) + } + + // V1 data inconsistency: r1 missed writes 2-10. No self-heal mechanism. + err := c.AssertCommittedRecoverable("r1") + if err == nil { + t.Fatal("V1: r1 should have data inconsistency — no catch-up mechanism") + } + t.Logf("V1 slow control-plane: gap=%d, data inconsistency — %v", gap, err) +} + +func TestP02_SlowControlPlane_V15_BackgroundReconnect(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV15, "p", "r1", "r2") + c.MaxCatchupAttempts = 5 + + c.CommitWrite(1) + c.TickN(5) + + // r1 disconnects. Stays disconnected through outage + delay window. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + // Writes accumulate while r1 is disconnected. + for i := uint64(2); i <= 10; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // Data path heals. + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Before catch-up: r1 is behind (FlushedLSN=1, CommittedLSN=10). + r1 := c.Nodes["r1"] + if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN { + t.Fatal("V1.5: r1 should be behind before catch-up") + } + if err := c.AssertCommittedRecoverable("r1"); err == nil { + t.Fatal("V1.5: r1 should have data gap before catch-up") + } + + // V1.5 policy: background reconnect if address stable. 
+ if c.Protocol.RestartRejoinAction(true) != "background_reconnect_or_control_plane" { + t.Fatal("V1.5 stable-address should be background_reconnect_or_control_plane") + } + + // V1.5 recovery trigger: background reconnect (address stable → endpoint matches). + trigger, _, ok := c.TriggerRecoverySession("r1") + if !ok || trigger != TriggerBackgroundReconnect { + t.Fatalf("V1.5: expected background_reconnect trigger, got %s/%v", trigger, ok) + } + // r1.ReplicaState is now CatchingUp (set by TriggerRecoverySession). + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("V1.5: should catch up with stable address") + } + + // After catch-up: data correct. + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatalf("V1.5: data should be correct after catch-up — %v", err) + } + + // V1.5 changed-address: falls back to control plane. + if c.Protocol.RestartRejoinAction(false) != "control_plane_only" { + t.Fatal("V1.5 changed-address should fall back to control_plane_only") + } + t.Logf("V1.5 slow control-plane: caught up %d entries via background reconnect", + c.Coordinator.CommittedLSN-1) +} + +func TestP02_SlowControlPlane_V2_DirectReconnect(t *testing.T) { + c := NewClusterWithProtocol(CommitSyncQuorum, ProtocolV2, "p", "r1", "r2") + c.MaxCatchupAttempts = 5 + + c.CommitWrite(1) + c.TickN(5) + + // r1 disconnects. Stays disconnected through outage + delay window. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + // Writes accumulate while r1 is disconnected. + for i := uint64(2); i <= 10; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // Data path heals. + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Before catch-up: r1 is behind. 
+ r1 := c.Nodes["r1"] + if r1.Storage.FlushedLSN >= c.Coordinator.CommittedLSN { + t.Fatal("V2: r1 should be behind before direct reconnect") + } + if err := c.AssertCommittedRecoverable("r1"); err == nil { + t.Fatal("V2: r1 should have data gap before direct reconnect") + } + + // V2 policy: direct reconnect, doesn't wait for master. + if c.Protocol.RestartRejoinAction(true) != "direct_reconnect_or_control_plane" { + t.Fatal("V2 should be direct_reconnect_or_control_plane") + } + + // V2 recovery trigger: reassignment (address stable → endpoint matches). + trigger, _, ok := c.TriggerRecoverySession("r1") + if !ok || trigger != TriggerReassignment { + t.Fatalf("V2: expected reassignment trigger, got %s/%v", trigger, ok) + } + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("V2: should catch up directly without master intervention") + } + + // After catch-up: data correct. + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatalf("V2: data should be correct after direct reconnect — %v", err) + } + t.Logf("V2 slow control-plane: caught up %d entries immediately via direct reconnect", + c.Coordinator.CommittedLSN-1) +} diff --git a/sw-block/prototype/distsim/phase03_p2_race_test.go b/sw-block/prototype/distsim/phase03_p2_race_test.go new file mode 100644 index 000000000..478f21f36 --- /dev/null +++ b/sw-block/prototype/distsim/phase03_p2_race_test.go @@ -0,0 +1,287 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 03 P2: Timer-ordering races +// ============================================================ + +// --- Race 1: Concurrent barrier timeouts under sync_quorum --- + +func TestP03_P2_ConcurrentBarrierTimeout_QuorumEdge(t *testing.T) { + // RF=3 (p, r1, r2). sync_quorum (quorum=2). + // Both r1 and r2 have barrier timeouts. r1's ack arrives in the same tick + // as r2's timeout fires. 
The "data before timers" rule means: + // r1 ack processed → cancels r1 timeout → r2 timeout fires → quorum = p+r1 = 2 → committed. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.BarrierTimeoutTicks = 5 + + // r2 disconnected — barrier will time out. r1 connected — will ack. + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + c.CommitWrite(1) // barrier to r1 at Now+2, barrier to r2 at Now+2 + // Barrier timeout for both at Now+5. + + c.TickN(10) + + // r1 ack arrived → cancelled r1 timeout. + // r2 barrier timed out (link down, no ack). + firedBarriers := c.FiredTimeoutsByKind(TimeoutBarrier) + if firedBarriers != 1 { + t.Fatalf("expected 1 barrier timeout (r2), got %d", firedBarriers) + } + + // Event log: r1's barrier timeout was cancelled (ack arrived earlier). + // r2's barrier timeout fired. Verify both are in the TickLog. + var cancelCount, fireCount int + for _, e := range c.TickLog { + if e.Kind == EventTimeoutCancelled { + cancelCount++ + } + if e.Kind == EventTimeoutFired { + fireCount++ + } + } + if cancelCount != 1 { + t.Fatalf("expected 1 timeout cancel (r1 ack), got %d", cancelCount) + } + if fireCount != 1 { + t.Fatalf("expected 1 timeout fire (r2), got %d", fireCount) + } + + // Quorum: p + r1 = 2 of 3 → committed. + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("LSN 1 should commit via quorum (p+r1), committed=%d", c.Coordinator.CommittedLSN) + } + + // DurableOn: p=true (self-ack), r1=true (ack), r2 NOT set (timed out). + p1 := c.Pending[1] + if !p1.DurableOn["p"] || !p1.DurableOn["r1"] { + t.Fatal("DurableOn should have p and r1") + } + if p1.DurableOn["r2"] { + t.Fatal("DurableOn should NOT have r2 (timed out)") + } + + t.Logf("concurrent timeout: r1 acked, r2 timed out, quorum met, committed=%d", c.Coordinator.CommittedLSN) +} + +func TestP03_P2_ConcurrentBarrierTimeout_BothTimeout_NoQuorum(t *testing.T) { + // Both r1 and r2 disconnected. Both timeouts fire. Quorum = p alone = 1 < 2. 
+ c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.BarrierTimeoutTicks = 5 + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + c.CommitWrite(1) + c.TickN(10) + + // Both barriers timed out. + if c.FiredTimeoutsByKind(TimeoutBarrier) != 2 { + t.Fatalf("expected 2 barrier timeouts, got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) + } + + // No quorum — uncommitted. + if c.Coordinator.CommittedLSN != 0 { + t.Fatalf("LSN 1 should not commit without quorum, committed=%d", c.Coordinator.CommittedLSN) + } + + // Neither r1 nor r2 in DurableOn. + p1 := c.Pending[1] + if p1.DurableOn["r1"] || p1.DurableOn["r2"] { + t.Fatal("DurableOn should not have r1 or r2") + } + t.Logf("both timeouts: no quorum, LSN 1 uncommitted") +} + +func TestP03_P2_ConcurrentBarrierTimeout_SameTick_AckAndTimeout(t *testing.T) { + // The precise same-tick race: r1 ack arrives at exactly the tick when r2's + // timeout fires. Verify data-before-timers ordering in the event log. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.BarrierTimeoutTicks = 4 // timeout at Now+4 + + c.Disconnect("p", "r2") + c.Disconnect("r2", "p") + + c.CommitWrite(1) + // Write at Now+1, barrier at Now+2, ack back at Now+3. + // Timeout for r1 at Now+4, timeout for r2 at Now+4. + + // Tick to barrier ack arrival (tick 3): r1 ack delivered, cancels r1 timeout. + // Tick 4: r2 timeout fires. r1 timeout already cancelled. + c.TickN(6) + + // Check event ordering at the timeout tick. + timeoutTick := uint64(0) + for _, ft := range c.FiredTimeouts { + timeoutTick = ft.FiredAt + } + events := c.TickEventsAt(timeoutTick) + + // At the timeout tick, we should see: r2 timeout fired (r1 was cancelled earlier). 
+ var firedDetails []string + for _, e := range events { + if e.Kind == EventTimeoutFired { + firedDetails = append(firedDetails, e.Detail) + } + } + if len(firedDetails) != 1 { + t.Fatalf("expected 1 timeout fire at tick %d, got %d: %v", timeoutTick, len(firedDetails), firedDetails) + } + + // Committed via quorum. + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("committed=%d, want 1", c.Coordinator.CommittedLSN) + } + t.Logf("same-tick race: r1 ack cancelled at tick 3, r2 timeout fired at tick %d, committed=1", timeoutTick) +} + +// --- Race 2: Epoch bump during active barrier timeout window --- + +func TestP03_P2_EpochBumpDuringBarrierTimeout_CrossSurface(t *testing.T) { + // Three cleanup mechanisms interact for the same barrier: + // 1. Epoch fencing in deliver() rejects old-epoch messages + // 2. Barrier timeout in fireTimeouts() removes queued barriers + marks expired + // 3. ExpiredBarriers in deliver() rejects late acks + // + // Scenario: barrier re-queues (r1 missing data), epoch bumps, then timeout fires. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.BarrierTimeoutTicks = 10 + + c.CommitWrite(1) // write+barrier to r1, r2 + + // Drop write to r1 so barrier keeps re-queuing. + var kept []inFlightMessage + for _, item := range c.Queue { + if item.msg.Kind == MsgWrite && item.msg.To == "r1" && item.msg.Write.LSN == 1 { + continue + } + kept = append(kept, item) + } + c.Queue = kept + + // Tick 1-3: r1's barrier delivers but r1 doesn't have data → re-queues. + // r2 gets write+barrier normally → acks. + c.TickN(3) + + // Epoch bump: promote r2 (p stays running as demoted replica). + // This ensures the old-epoch barrier hits epoch fencing, not node_down. + if err := c.Promote("r2"); err != nil { + t.Fatal(err) + } + + // Record state before timeout window. + epochRejectsBefore := c.RejectedByReason(RejectEpochMismatch) + + // Tick 4-5: old-epoch barrier (p→r1) is in queue. 
deliver() rejects + // with epoch_mismatch (msg epoch=1 vs coordinator epoch=2). + c.TickN(2) + + // Old barrier rejected by epoch fencing. + epochRejectsAfter := c.RejectedByReason(RejectEpochMismatch) + newEpochRejects := epochRejectsAfter - epochRejectsBefore + if newEpochRejects == 0 { + t.Fatal("old-epoch barrier should be rejected by epoch fencing") + } + + // Tick past barrier timeout deadline. + c.TickN(10) + + // Barrier timeout fires for r1/LSN 1 (removes any remaining queued copies). + if c.FiredTimeoutsByKind(TimeoutBarrier) == 0 { + t.Fatal("barrier timeout should fire for r1/LSN 1") + } + + // Expired barrier marked. + if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] { + t.Fatal("r1/LSN 1 should be in ExpiredBarriers") + } + + // Inject late ack from r1 for LSN 1 at current epoch (to new primary r2). + // The barrier is expired — ack should be rejected by barrier_expired. + deliveriesBefore := len(c.Deliveries) + c.InjectMessage(Message{ + Kind: MsgBarrierAck, From: "r1", To: "r2", + Epoch: c.Coordinator.Epoch, TargetLSN: 1, + }, c.Now+1) + c.TickN(2) + + // Late ack rejected by barrier_expired. + lateRejected := false + for _, d := range c.Deliveries[deliveriesBefore:] { + if d.Msg.Kind == MsgBarrierAck && d.Msg.From == "r1" && d.Msg.TargetLSN == 1 { + if !d.Accepted && d.Reason == RejectBarrierExpired { + lateRejected = true + } + } + } + if !lateRejected { + t.Fatal("late ack for expired barrier should be rejected as barrier_expired") + } + + // Verify event log shows the cross-surface interaction. 
+ var epochRejectEvents, timeoutFireEvents int + for _, e := range c.TickLog { + if e.Kind == EventDeliveryRejected { + epochRejectEvents++ + } + if e.Kind == EventTimeoutFired { + timeoutFireEvents++ + } + } + if epochRejectEvents == 0 || timeoutFireEvents == 0 { + t.Fatalf("event log should show both epoch rejections (%d) and timeout fires (%d)", + epochRejectEvents, timeoutFireEvents) + } + + t.Logf("cross-surface: epoch_rejects=%d, timeout_fires=%d, expired_barrier=true, late_ack_rejected=true", + newEpochRejects, c.FiredTimeoutsByKind(TimeoutBarrier)) +} + +// --- TickEvents trace verification --- + +func TestP03_P2_TickEvents_OrderingVerifiable(t *testing.T) { + // Verify that TickEvents captures delivery → timeout ordering within a tick. + c := NewCluster(CommitSyncAll, "p", "r1") + c.BarrierTimeoutTicks = 5 + + c.CommitWrite(1) + c.TickN(10) // normal flow: ack cancels timeout + + // TickLog should have events. + if len(c.TickLog) == 0 { + t.Fatal("TickLog should record events") + } + + // Find delivery events and timeout events. + var deliveries, cancels int + for _, e := range c.TickLog { + switch e.Kind { + case EventDeliveryAccepted: + deliveries++ + case EventTimeoutCancelled: + cancels++ + } + } + if deliveries == 0 { + t.Fatal("should have delivery events") + } + if cancels == 0 { + t.Fatal("should have timeout cancel events (ack cancelled barrier timeout)") + } + + // BuildTrace includes TickEvents. 
+ trace := BuildTrace(c) + if len(trace.TickEvents) == 0 { + t.Fatal("BuildTrace should include TickEvents") + } + + t.Logf("tick events: %d deliveries, %d cancels, %d total events", + deliveries, cancels, len(c.TickLog)) +} diff --git a/sw-block/prototype/distsim/phase03_race_test.go b/sw-block/prototype/distsim/phase03_race_test.go new file mode 100644 index 000000000..9bf6631ea --- /dev/null +++ b/sw-block/prototype/distsim/phase03_race_test.go @@ -0,0 +1,281 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 03 P1: Race-focused tests with trace quality +// ============================================================ + +// --- Race 1: Promotion vs delayed catch-up timeout --- + +func TestP03_Race_PromotionThenStaleCatchupTimeout(t *testing.T) { + // r1 is CatchingUp with a catch-up timeout registered. + // Before the timeout fires, primary crashes and r1 is promoted. + // The stale catch-up timeout must not regress r1 (now primary) to NeedsRebuild. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // r1 falls behind, starts catching up. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(3) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10) + + // r1 catches up successfully. + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("r1 should converge before promotion") + } + + // Primary crashes. Promote r1. + c.StopNode("p") + if err := c.Promote("r1"); err != nil { + t.Fatal(err) + } + + // Tick past the catch-up timeout deadline. + c.TickN(15) + + // Stale timeout must not fire (was auto-cancelled on convergence). 
+ if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 { + t.Fatal("stale catch-up timeout must not fire after promotion") + } + // r1 must remain primary and running. + if r1.Role != RolePrimary { + t.Fatalf("r1 should be primary, got %s", r1.Role) + } + if r1.ReplicaState == NodeStateNeedsRebuild { + t.Fatal("stale timeout regressed promoted r1 to NeedsRebuild") + } + t.Logf("promotion vs timeout: stale catch-up timeout suppressed, r1 is primary") +} + +func TestP03_Race_PromotionThenStaleBarrierTimeout(t *testing.T) { + // Barrier timeout registered for r1 at old epoch. + // Promotion bumps epoch. The stale barrier timeout fires but must not + // affect the new epoch's commit state. + c := NewCluster(CommitSyncAll, "p", "r1") + c.BarrierTimeoutTicks = 8 + + // Write 1 — barrier to r1. Disconnect r1 so barrier can't ack. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(1) + + // Tick 2 — barrier timeout registered at Now+8. + c.TickN(2) + + // Primary crashes, promote r1 (even though it doesn't have write 1). + c.StopNode("p") + c.StartNode("r1") + if err := c.Promote("r1"); err != nil { + t.Fatal(err) + } + + // Snapshot committed prefix before stale timeout window. + committedBefore := c.Coordinator.CommittedLSN + + // r1 is now primary at new epoch. Write new data. + c.CommitWrite(10) + c.TickN(10) // well past barrier timeout deadline + + // Stale barrier timeout fires (from old epoch, old primary "p" → old replica "r1"). + barriersFired := c.FiredTimeoutsByKind(TimeoutBarrier) + + // Assert 1: old timed-out barrier did not change committed prefix unexpectedly. + // CommittedLSN may advance from r1's new-epoch writes, but must not regress + // or be influenced by the stale barrier timeout. + if c.Coordinator.CommittedLSN < committedBefore { + t.Fatalf("committed prefix regressed: before=%d after=%d", + committedBefore, c.Coordinator.CommittedLSN) + } + + // Assert 2: old-epoch barrier did not set DurableOn for new-epoch writes. 
+	// LSN 1 was written by old primary "p". Under the new epoch, DurableOn
+	// should not have been modified by the stale barrier's timeout path.
+	if p1 := c.Pending[1]; p1 != nil {
+		if p1.DurableOn["r1"] {
+			t.Fatal("stale barrier timeout should not set DurableOn[r1] for old-epoch LSN 1")
+		}
+	}
+
+	// Assert 3: old-epoch LSN 1 barrier is marked expired (stale timeout fired correctly).
+	if !c.ExpiredBarriers[barrierExpiredKey{"r1", 1}] {
+		t.Fatal("old-epoch barrier for r1/LSN 1 should be in ExpiredBarriers")
+	}
+
+	t.Logf("promotion vs barrier timeout: committed=%d, fired=%d, DurableOn[r1]=%v, expired[r1/1]=%v",
+		c.Coordinator.CommittedLSN, barriersFired,
+		c.Pending[1] != nil && c.Pending[1].DurableOn["r1"],
+		c.ExpiredBarriers[barrierExpiredKey{"r1", 1}])
+}
+
+// --- Race 2: Rebuild completion vs epoch bump ---
+
+func TestP03_Race_RebuildCompletes_ThenEpochBumps(t *testing.T) {
+	// r1 needs rebuild. Rebuild completes, but before r1 can rejoin,
+	// epoch bumps (another failover). The rebuild result is valid but
+	// the replica must re-validate against the new epoch before rejoining.
+	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
+
+	for i := uint64(1); i <= 10; i++ {
+		c.CommitWrite(i)
+	}
+	c.TickN(5)
+	c.Primary().Storage.TakeSnapshot("snap-1", c.Coordinator.CommittedLSN)
+
+	// r1 needs rebuild.
+	r1 := c.Nodes["r1"]
+	r1.ReplicaState = NodeStateNeedsRebuild
+
+	// Rebuild from snapshot — succeeds.
+	c.RebuildReplicaFromSnapshot("r1", "snap-1", c.Coordinator.CommittedLSN)
+	r1.ReplicaState = NodeStateRebuilding // transitional
+
+	// Before r1 can rejoin: epoch bumps (simulate another failure/promotion).
+	epochBefore := c.Coordinator.Epoch
+	c.StopNode("p")
+	if err := c.Promote("r2"); err != nil {
+		t.Fatal(err)
+	}
+	epochAfter := c.Coordinator.Epoch
+
+	if epochAfter <= epochBefore {
+		t.Fatal("epoch should have bumped")
+	}
+
+	// r1 is still running, so Promote("r2") set r1.Epoch to the new coordinator
+	// epoch (epochAfter); r1.Role remains RoleReplica.
+
+	// The rebuild data is from the OLD epoch's committed prefix.
+	// Under the new primary (r2), committed prefix may differ.
+	// r1 must NOT be promoted to InSync until validated against new epoch.
+
+	// Eligibility check: r1 is Rebuilding — ineligible for promotion.
+	e := c.EvaluateCandidateEligibility("r1")
+	if e.Eligible {
+		t.Fatal("r1 in Rebuilding state should not be eligible")
+	}
+
+	// r1 should NOT be InSync until it completes catch-up from new primary.
+	if r1.ReplicaState == NodeStateInSync {
+		t.Fatal("r1 should not be InSync after epoch bump during rebuild")
+	}
+
+	// After catch-up from new primary (r2), r1 can rejoin.
+	r1.ReplicaState = NodeStateCatchingUp
+	converged := c.CatchUpWithEscalation("r1", 100)
+	if !converged {
+		t.Fatal("r1 should converge from new primary")
+	}
+	if err := c.AssertCommittedRecoverable("r1"); err != nil {
+		t.Fatalf("r1 data incorrect after post-epoch-bump catch-up: %v", err)
+	}
+
+	t.Logf("rebuild vs epoch bump: r1 rebuilt at epoch %d, bumped to %d, caught up from r2",
+		epochBefore, epochAfter)
+}
+
+func TestP03_Race_EpochBumpsDuringCatchupTimeout(t *testing.T) {
+	// Catch-up timeout registered. Epoch bumps before timeout fires.
+	// The timeout is now stale (different epoch context).
+	// Must not mutate state under the new epoch.
+	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
+
+	c.CommitWrite(1)
+	c.TickN(5)
+
+	c.Disconnect("p", "r1")
+	c.Disconnect("r1", "p")
+	c.CommitWrite(2)
+	c.TickN(5)
+	c.Connect("p", "r1")
+	c.Connect("r1", "p")
+
+	r1 := c.Nodes["r1"]
+	r1.ReplicaState = NodeStateCatchingUp
+	c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10)
+
+	// Epoch bumps (promotion) before timeout.
+	c.StopNode("p")
+	if err := c.Promote("r1"); err != nil {
+		t.Fatal(err)
+	}
+	// r1 is now primary. In production, promotion resets the replica state;
+	// model that here by marking the new primary InSync.
+	r1.ReplicaState = NodeStateInSync
+
+	// Tick past timeout deadline.
+	c.TickN(15)
+
+	// Timeout should be ignored (r1 is InSync, not CatchingUp).
+	if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 {
+		t.Fatal("catch-up timeout should not fire after epoch bump + promotion")
+	}
+	if len(c.IgnoredTimeouts) != 1 {
+		t.Fatalf("expected 1 ignored (stale) timeout, got %d", len(c.IgnoredTimeouts))
+	}
+	if r1.ReplicaState != NodeStateInSync {
+		t.Fatalf("r1 should remain InSync, got %s", r1.ReplicaState)
+	}
+	t.Logf("epoch bump vs timeout: stale catch-up timeout correctly ignored")
+}
+
+// --- Trace quality: dump state on failure ---
+
+func TestP03_TraceQuality_FailingScenarioDumpsState(t *testing.T) {
+	// Verify that the timeout model produces debuggable traces.
+	// This test does NOT intentionally fail — it verifies that trace
+	// information is available for inspection.
+	c := NewCluster(CommitSyncAll, "p", "r1")
+	c.BarrierTimeoutTicks = 5
+
+	c.CommitWrite(1)
+	c.TickN(3)
+
+	c.Disconnect("p", "r1")
+	c.Disconnect("r1", "p")
+	c.CommitWrite(2)
+	c.TickN(10)
+
+	// Build trace.
+	trace := BuildTrace(c)
+
+	// Trace must contain key debugging information.
+ if trace.Tick == 0 { + t.Fatal("trace should have non-zero tick") + } + if trace.CommittedLSN == 0 && len(c.Pending) == 0 { + t.Fatal("trace should reflect cluster state") + } + if len(trace.FiredTimeouts) == 0 { + t.Fatal("trace should include fired timeouts") + } + if len(trace.NodeStates) < 2 { + t.Fatal("trace should include all node states") + } + if trace.Deliveries == 0 { + t.Fatal("trace should include deliveries") + } + + t.Logf("trace: tick=%d committed=%d fired_timeouts=%d deliveries=%d nodes=%v", + trace.Tick, trace.CommittedLSN, len(trace.FiredTimeouts), + trace.Deliveries, trace.NodeStates) +} + +// Trace infrastructure lives in eventsim.go (BuildTrace / Trace type). diff --git a/sw-block/prototype/distsim/phase03_timeout_test.go b/sw-block/prototype/distsim/phase03_timeout_test.go new file mode 100644 index 000000000..44e688a78 --- /dev/null +++ b/sw-block/prototype/distsim/phase03_timeout_test.go @@ -0,0 +1,333 @@ +package distsim + +import ( + "testing" +) + +// ============================================================ +// Phase 03 P0: Timeout-backed scenarios +// ============================================================ + +// --- Barrier timeout --- + +func TestP03_BarrierTimeout_SyncAllBlocked(t *testing.T) { + // Barrier sent to replica, link goes down, ack never arrives. + // Barrier timeout fires → barrier removed from queue. + // sync_all: write stays uncommitted. + c := NewCluster(CommitSyncAll, "p", "r1") + c.BarrierTimeoutTicks = 5 + + c.CommitWrite(1) + c.TickN(10) // enough for barrier timeout to fire and normal commit + + // LSN 1: p self-acks. r1 acks. sync_all: both must ack. Should commit. + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("LSN 1 should commit normally, got committed=%d", c.Coordinator.CommittedLSN) + } + // No timeouts fired for LSN 1 (ack arrived in time). + if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 { + t.Fatal("no barrier timeouts should have fired for LSN 1") + } + + // Now disconnect r1. Write LSN 2. 
Barrier can't be acked. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.TickN(10) // barrier timeout fires after 5 ticks + + // Barrier timeout should have fired for r1/LSN 2. + if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 { + t.Fatalf("expected 1 barrier timeout, got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) + } + + // sync_all: LSN 2 NOT committed (r1 never acked). + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("LSN 2 should not commit under sync_all without r1 ack, committed=%d", + c.Coordinator.CommittedLSN) + } + + // Barrier removed from queue (no indefinite re-queuing). + for _, item := range c.Queue { + if item.msg.Kind == MsgBarrier && item.msg.To == "r1" && item.msg.TargetLSN == 2 { + t.Fatal("timed-out barrier should be removed from queue") + } + } + t.Logf("barrier timeout: LSN 2 uncommitted, barrier cleaned from queue") +} + +func TestP03_BarrierTimeout_SyncQuorum_StillCommits(t *testing.T) { + // RF=3 sync_quorum: r1 times out, but r2 acks → quorum met → commits. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + c.BarrierTimeoutTicks = 5 + + // Disconnect r1 only. r2 stays connected. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + + c.CommitWrite(1) + c.TickN(10) + + // r1 barrier times out, but r2 acked. quorum = p + r2 = 2 of 3. + if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 { + t.Fatalf("expected 1 barrier timeout (r1), got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) + } + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("LSN 1 should commit via quorum (p+r2), committed=%d", c.Coordinator.CommittedLSN) + } + t.Logf("barrier timeout: r1 timed out, LSN 1 committed via quorum") +} + +// --- Catch-up timeout --- + +func TestP03_CatchupTimeout_EscalatesToNeedsRebuild(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // r1 disconnects, primary writes more. 
+ c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + for i := uint64(2); i <= 20; i++ { + c.CommitWrite(i) + } + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Register catch-up timeout: 3 ticks from now. + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+3) + + // Tick 3 times — timeout fires before catch-up completes. + c.TickN(3) + + if r1.ReplicaState != NodeStateNeedsRebuild { + t.Fatalf("catch-up timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState) + } + if c.FiredTimeoutsByKind(TimeoutCatchup) != 1 { + t.Fatalf("expected 1 catchup timeout, got %d", c.FiredTimeoutsByKind(TimeoutCatchup)) + } + t.Logf("catch-up timeout: escalated to NeedsRebuild after 3 ticks") +} + +// --- Reservation expiry as timeout event --- + +func TestP03_ReservationTimeout_AbortsCatchup(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + for i := uint64(1); i <= 10; i++ { + c.CommitWrite(i) + } + c.TickN(5) + + // r1 disconnects, more writes. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + for i := uint64(11); i <= 30; i++ { + c.CommitWrite(i) + } + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Register reservation timeout: 2 ticks. + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+2) + + c.TickN(2) + + if r1.ReplicaState != NodeStateNeedsRebuild { + t.Fatalf("reservation timeout should escalate to NeedsRebuild, got %s", r1.ReplicaState) + } + if c.FiredTimeoutsByKind(TimeoutReservation) != 1 { + t.Fatalf("expected 1 reservation timeout, got %d", c.FiredTimeoutsByKind(TimeoutReservation)) + } +} + +// --- Timer-race scenarios: same-tick resolution --- + +func TestP03_Race_AckArrivesBeforeTimeout_Cancels(t *testing.T) { + // Barrier ack arrives in the same tick as the timeout deadline. + // Rule: data events (ack) process before timeouts → timeout is cancelled. 
+ c := NewCluster(CommitSyncAll, "p", "r1") + c.BarrierTimeoutTicks = 4 // timeout at Now+4 + + c.CommitWrite(1) // barrier enqueued at Now+2, ack back at Now+3 + // Barrier timeout registered at Now+4. + + // Tick 1: write delivered. + // Tick 2: barrier delivered, ack enqueued at Now+1 = tick 3. + // Tick 3: ack delivered → cancels timeout. + // Tick 4: timeout deadline reached — but already cancelled. + c.TickN(5) + + // Ack arrived first → timeout cancelled → LSN 1 committed. + if c.FiredTimeoutsByKind(TimeoutBarrier) != 0 { + t.Fatal("barrier timeout should be cancelled by ack arriving first") + } + if c.Coordinator.CommittedLSN != 1 { + t.Fatalf("LSN 1 should commit (ack arrived before timeout), committed=%d", + c.Coordinator.CommittedLSN) + } + t.Logf("race resolved: ack cancelled timeout, LSN 1 committed") +} + +func TestP03_Race_TimeoutBeforeAck_Fires(t *testing.T) { + // Timeout fires before barrier can deliver (timeout < barrier delivery time). + // CommitWrite enqueues barrier at Now+2. Timeout at Now+1 fires first. + c := NewCluster(CommitSyncAll, "p", "r1") + c.BarrierTimeoutTicks = 1 // timeout at Now+1 — before barrier delivers at Now+2 + + c.CommitWrite(1) + c.TickN(5) + + // Timeout fires at tick 1. Barrier would deliver at tick 2, but timeout + // removes it from queue first. + if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 { + t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) + } + // sync_all: uncommitted (r1 never acked). + if c.Coordinator.CommittedLSN != 0 { + t.Fatalf("LSN 1 should not commit (timeout before barrier delivery), committed=%d", + c.Coordinator.CommittedLSN) + } + t.Logf("race resolved: timeout fired before barrier delivery, LSN 1 uncommitted") +} + +func TestP03_Race_CatchupConverges_CancelsTimeout(t *testing.T) { + // Catch-up completes before the timeout fires. 
+ c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.CommitWrite(3) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Register catch-up timeout: 10 ticks (generous). + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + c.RegisterTimeout(TimeoutCatchup, "r1", 0, c.Now+10) + + // Catch-up completes immediately (small gap). + // CatchUpWithEscalation auto-cancels recovery timeouts on convergence. + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("catch-up should converge for small gap") + } + + // Tick past deadline — timeout should already be cancelled. + c.TickN(15) + + // Timeout should NOT have fired (was cancelled). + if c.FiredTimeoutsByKind(TimeoutCatchup) != 0 { + t.Fatal("catch-up timeout should be cancelled on convergence") + } + if r1.ReplicaState != NodeStateInSync { + t.Fatalf("r1 should be InSync after convergence, got %s", r1.ReplicaState) + } + t.Logf("race resolved: catch-up converged, timeout auto-cancelled") +} + +// --- Stale timeout hardening --- + +func TestP03_StaleReservationTimeout_AfterRecoverySuccess(t *testing.T) { + // Reservation timeout registered, but recovery completes before deadline. + // The stale timeout must NOT regress state from InSync back to NeedsRebuild. + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(3) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Register reservation timeout: 10 ticks. + r1 := c.Nodes["r1"] + r1.ReplicaState = NodeStateCatchingUp + c.RegisterTimeout(TimeoutReservation, "r1", 0, c.Now+10) + + // Catch-up succeeds immediately — auto-cancels reservation timeout. 
+ converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("catch-up should converge") + } + if r1.ReplicaState != NodeStateInSync { + t.Fatalf("expected InSync after convergence, got %s", r1.ReplicaState) + } + + // Tick well past the deadline. + c.TickN(20) + + // Stale reservation timeout must NOT fire (cancelled by convergence). + if c.FiredTimeoutsByKind(TimeoutReservation) != 0 { + t.Fatal("stale reservation timeout should not fire after recovery success") + } + if r1.ReplicaState != NodeStateInSync { + t.Fatalf("stale timeout regressed state: expected InSync, got %s", r1.ReplicaState) + } + t.Logf("stale reservation timeout correctly suppressed after recovery") +} + +func TestP03_LateBarrierAck_AfterTimeout_Rejected(t *testing.T) { + // Barrier times out, then a late ack arrives. The late ack must be + // rejected — it must not count toward DurableOn. + c := NewCluster(CommitSyncAll, "p", "r1") + c.BarrierTimeoutTicks = 1 // timeout at Now+1 + + c.CommitWrite(1) + + // Tick 1: write delivered, timeout fires (barrier at Now+2 not yet delivered). + c.TickN(1) + + if c.FiredTimeoutsByKind(TimeoutBarrier) != 1 { + t.Fatalf("expected barrier timeout to fire, got %d", c.FiredTimeoutsByKind(TimeoutBarrier)) + } + + // LSN 1 should NOT be committed. + if c.Coordinator.CommittedLSN != 0 { + t.Fatalf("LSN 1 should not be committed after timeout, got %d", c.Coordinator.CommittedLSN) + } + + // Now inject a late barrier ack (as if the network delayed it massively). + c.InjectMessage(Message{ + Kind: MsgBarrierAck, + From: "r1", + To: "p", + Epoch: c.Coordinator.Epoch, + TargetLSN: 1, + }, c.Now+1) + + c.TickN(5) + + // Late ack must be rejected with barrier_expired reason. + expiredRejects := c.RejectedByReason(RejectBarrierExpired) + if expiredRejects == 0 { + t.Fatal("late barrier ack should be rejected as barrier_expired") + } + + // LSN 1 must still be uncommitted (late ack did not count). 
+ if c.Coordinator.CommittedLSN != 0 { + t.Fatalf("late ack should not commit LSN 1, got committed=%d", c.Coordinator.CommittedLSN) + } + + // DurableOn should NOT include r1. + p1 := c.Pending[1] + if p1 != nil && p1.DurableOn["r1"] { + t.Fatal("late ack should not set DurableOn for r1") + } + t.Logf("late barrier ack: rejected as barrier_expired, LSN 1 stays uncommitted") +} diff --git a/sw-block/prototype/distsim/phase04a_ownership_test.go b/sw-block/prototype/distsim/phase04a_ownership_test.go new file mode 100644 index 000000000..e31102cb6 --- /dev/null +++ b/sw-block/prototype/distsim/phase04a_ownership_test.go @@ -0,0 +1,243 @@ +package distsim + +import "testing" + +// ============================================================ +// Phase 04a: Session ownership validation in distsim +// ============================================================ + +// --- Scenario 1: Endpoint change during active catch-up --- + +func TestP04a_EndpointChangeDuringCatchup_InvalidatesSession(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + // Start catch-up session for r1. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + trigger, sessID, ok := c.TriggerRecoverySession("r1") + if !ok || trigger != TriggerReassignment { + t.Fatalf("should trigger reassignment, got %s/%v", trigger, ok) + } + + // Session is active. + sess := c.Sessions["r1"] + if !sess.Active { + t.Fatal("session should be active") + } + + // Endpoint changes (replica restarts on new address). + c.StopNode("r1") + c.RestartNodeWithNewAddress("r1") + + // Session invalidated by endpoint change. + if sess.Active { + t.Fatal("session should be invalidated after endpoint change") + } + if sess.Reason != "endpoint_changed" { + t.Fatalf("invalidation reason: got %q, want endpoint_changed", sess.Reason) + } + + // Stale completion from old session is rejected. 
+ if c.CompleteRecoverySession("r1", sessID) { + t.Fatal("stale session completion should be rejected") + } + t.Logf("endpoint change: session %d invalidated, stale completion rejected", sessID) +} + +// --- Scenario 2: Epoch bump during active catch-up --- + +func TestP04a_EpochBumpDuringCatchup_InvalidatesSession(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + _, sessID, ok := c.TriggerRecoverySession("r1") + if !ok { + t.Fatal("trigger should succeed") + } + sess := c.Sessions["r1"] + + // Epoch bumps (promotion). + c.StopNode("p") + c.Promote("r2") + + // Session invalidated by epoch bump. + if sess.Active { + t.Fatal("session should be invalidated after epoch bump") + } + if sess.Reason != "epoch_bump_promotion" { + t.Fatalf("reason: got %q", sess.Reason) + } + + // Stale completion rejected. + if c.CompleteRecoverySession("r1", sessID) { + t.Fatal("stale completion after epoch bump should be rejected") + } + t.Logf("epoch bump: session %d invalidated, completion rejected", sessID) +} + +// --- Scenario 3: Stale late completion from old session --- + +func TestP04a_StaleCompletion_AfterSupersede_Rejected(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + // First session. + _, oldSessID, _ := c.TriggerRecoverySession("r1") + oldSess := c.Sessions["r1"] + + // Invalidate old session manually (simulate timeout or abort). + c.InvalidateReplicaSession("r1", "timeout") + if oldSess.Active { + t.Fatal("old session should be invalidated") + } + + // New session triggered. 
+ c.Nodes["r1"].ReplicaState = NodeStateLagging // reset state to allow retrigger + _, newSessID, ok := c.TriggerRecoverySession("r1") + if !ok { + t.Fatal("second trigger should succeed after invalidation") + } + newSess := c.Sessions["r1"] + + // Old session completion attempt — must be rejected by ID mismatch. + if c.CompleteRecoverySession("r1", oldSessID) { + t.Fatal("old session completion must be rejected") + } + // New session still active. + if !newSess.Active { + t.Fatal("new session should still be active") + } + + // New session completion succeeds. + if !c.CompleteRecoverySession("r1", newSessID) { + t.Fatal("new session completion should succeed") + } + if c.Nodes["r1"].ReplicaState != NodeStateInSync { + t.Fatalf("r1 should be InSync after new session completes, got %s", c.Nodes["r1"].ReplicaState) + } + t.Logf("stale completion: old=%d rejected, new=%d accepted", oldSessID, newSessID) +} + +// --- Scenario 4: Duplicate recovery trigger while session active --- + +func TestP04a_DuplicateTrigger_WhileActive_Rejected(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.TickN(5) + + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + c.CommitWrite(2) + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + // First trigger succeeds. + _, _, ok := c.TriggerRecoverySession("r1") + if !ok { + t.Fatal("first trigger should succeed") + } + + // Duplicate trigger while session active — rejected. + _, _, ok = c.TriggerRecoverySession("r1") + if ok { + t.Fatal("duplicate trigger should be rejected while session active") + } + + // Session count: only one in history. 
+ sessCount := 0 + for _, s := range c.SessionHistory { + if s.ReplicaID == "r1" { + sessCount++ + } + } + if sessCount != 1 { + t.Fatalf("should have exactly 1 session in history, got %d", sessCount) + } + t.Logf("duplicate trigger correctly rejected") +} + +// --- Scenario 5: Session tracking through full lifecycle --- + +func TestP04a_FullLifecycle_SessionTracking(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + + c.CommitWrite(1) + c.CommitWrite(2) + c.TickN(5) + + // Disconnect, write, reconnect. + c.Disconnect("p", "r1") + c.Disconnect("r1", "p") + for i := uint64(3); i <= 10; i++ { + c.CommitWrite(i) + } + c.TickN(5) + c.Connect("p", "r1") + c.Connect("r1", "p") + + // Trigger session. + trigger, sessID, ok := c.TriggerRecoverySession("r1") + if !ok { + t.Fatal("trigger failed") + } + if trigger != TriggerReassignment { + t.Fatalf("expected reassignment, got %s", trigger) + } + + // Catch up. + converged := c.CatchUpWithEscalation("r1", 100) + if !converged { + t.Fatal("catch-up should converge") + } + + // Complete session. + if !c.CompleteRecoverySession("r1", sessID) { + t.Fatal("completion should succeed") + } + + // Verify final state. + if c.Nodes["r1"].ReplicaState != NodeStateInSync { + t.Fatalf("r1 should be InSync, got %s", c.Nodes["r1"].ReplicaState) + } + if err := c.AssertCommittedRecoverable("r1"); err != nil { + t.Fatalf("data incorrect: %v", err) + } + + // Session in history, not active. 
+ sess := c.Sessions["r1"] + if sess.Active { + t.Fatal("session should not be active after completion") + } + if len(c.SessionHistory) != 1 { + t.Fatalf("expected 1 session in history, got %d", len(c.SessionHistory)) + } + t.Logf("full lifecycle: trigger=%s session=%d → catch-up → complete → InSync", trigger, sessID) +} diff --git a/sw-block/prototype/distsim/protocol.go b/sw-block/prototype/distsim/protocol.go new file mode 100644 index 000000000..0b6cce3f0 --- /dev/null +++ b/sw-block/prototype/distsim/protocol.go @@ -0,0 +1,102 @@ +package distsim + +type ProtocolVersion string + +const ( + ProtocolV1 ProtocolVersion = "v1" + ProtocolV15 ProtocolVersion = "v1_5" + ProtocolV2 ProtocolVersion = "v2" +) + +type ProtocolPolicy struct { + Version ProtocolVersion +} + +func (p ProtocolPolicy) CanAttemptCatchup(addressStable bool) bool { + switch p.Version { + case ProtocolV1: + return false + case ProtocolV15: + return addressStable + case ProtocolV2: + return true + default: + return false + } +} + +func (p ProtocolPolicy) BriefDisconnectAction(addressStable, recoverable bool) string { + switch p.Version { + case ProtocolV1: + return "degrade_or_rebuild" + case ProtocolV15: + if addressStable && recoverable { + return "catchup_if_history_survives" + } + return "stall_or_control_plane_recovery" + case ProtocolV2: + if recoverable { + return "reserved_catchup" + } + return "explicit_rebuild" + default: + return "unknown" + } +} + +func (p ProtocolPolicy) TailChasingAction(converged bool) string { + switch p.Version { + case ProtocolV1: + if converged { + return "unexpected_catchup" + } + return "degrade" + case ProtocolV15: + if converged { + return "catchup" + } + return "stall_or_rebuild" + case ProtocolV2: + if converged { + return "catchup" + } + return "abort_to_rebuild" + default: + return "unknown" + } +} + +func (p ProtocolPolicy) RestartRejoinAction(addressStable bool) string { + switch p.Version { + case ProtocolV1: + return "control_plane_only" + case 
ProtocolV15:
+		if addressStable {
+			return "background_reconnect_or_control_plane"
+		}
+		return "control_plane_only"
+	case ProtocolV2:
+		if addressStable {
+			return "direct_reconnect_or_control_plane"
+		}
+		return "explicit_reassignment_or_rebuild"
+	default:
+		return "unknown"
+	}
+}
+
+func (p ProtocolPolicy) ChangedAddressRestartAction(recoverable bool) string {
+	switch p.Version {
+	case ProtocolV1:
+		return "control_plane_only"
+	case ProtocolV15:
+		return "control_plane_only"
+	case ProtocolV2:
+		if recoverable {
+			return "explicit_reassignment_then_catchup"
+		}
+		return "explicit_reassignment_or_rebuild"
+	default:
+		return "unknown"
+	}
+}
diff --git a/sw-block/prototype/distsim/protocol_test.go b/sw-block/prototype/distsim/protocol_test.go
new file mode 100644
index 000000000..777de667f
--- /dev/null
+++ b/sw-block/prototype/distsim/protocol_test.go
@@ -0,0 +1,84 @@
+package distsim
+
+import "testing"
+
+func TestProtocolV1CannotAttemptCatchup(t *testing.T) {
+	p := ProtocolPolicy{Version: ProtocolV1}
+	if p.CanAttemptCatchup(true) {
+		t.Fatal("v1 should not expose meaningful catch-up path")
+	}
+}
+
+func TestProtocolV15CatchupDependsOnStableAddress(t *testing.T) {
+	p := ProtocolPolicy{Version: ProtocolV15}
+	if !p.CanAttemptCatchup(true) {
+		t.Fatal("v1.5 should allow catch-up when address is stable")
+	}
+	if p.CanAttemptCatchup(false) {
+		t.Fatal("v1.5 should not assume reconnect with changed address")
+	}
+}
+
+func TestProtocolV2AllowsCatchupByPolicy(t *testing.T) {
+	p := ProtocolPolicy{Version: ProtocolV2}
+	if !p.CanAttemptCatchup(true) || !p.CanAttemptCatchup(false) {
+		t.Fatal("v2 policy should allow catch-up attempt subject to explicit recoverability checks")
+	}
+}
+
+func TestProtocolBriefDisconnectActions(t *testing.T) {
+	if got := (ProtocolPolicy{Version: ProtocolV1}).BriefDisconnectAction(true, true); got != "degrade_or_rebuild" {
+		t.Fatalf("v1 brief-disconnect action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: 
ProtocolV15}).BriefDisconnectAction(true, true); got != "catchup_if_history_survives" {
+		t.Fatalf("v1.5 brief-disconnect action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV15}).BriefDisconnectAction(false, true); got != "stall_or_control_plane_recovery" {
+		t.Fatalf("v1.5 changed-address brief-disconnect action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(true, false); got != "explicit_rebuild" {
+		t.Fatalf("v2 unrecoverable brief-disconnect action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV2}).BriefDisconnectAction(false, true); got != "reserved_catchup" {
+		t.Fatalf("v2 recoverable brief-disconnect action = %s", got)
+	}
+}
+
+func TestProtocolTailChasingActions(t *testing.T) {
+	if got := (ProtocolPolicy{Version: ProtocolV1}).TailChasingAction(false); got != "degrade" {
+		t.Fatalf("v1 tail-chasing action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV15}).TailChasingAction(false); got != "stall_or_rebuild" {
+		t.Fatalf("v1.5 tail-chasing action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV2}).TailChasingAction(false); got != "abort_to_rebuild" {
+		t.Fatalf("v2 tail-chasing action = %s", got)
+	}
+}
+
+func TestProtocolRestartRejoinActions(t *testing.T) {
+	if got := (ProtocolPolicy{Version: ProtocolV1}).RestartRejoinAction(true); got != "control_plane_only" {
+		t.Fatalf("v1 restart action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV15}).RestartRejoinAction(false); got != "control_plane_only" {
+		t.Fatalf("v1.5 changed-address restart action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV2}).RestartRejoinAction(false); got != "explicit_reassignment_or_rebuild" {
+		t.Fatalf("v2 changed-address restart action = %s", got)
+	}
+}
+
+func TestProtocolChangedAddressRestartActions(t *testing.T) {
+	if got := (ProtocolPolicy{Version: ProtocolV1}).ChangedAddressRestartAction(true); got != "control_plane_only" {
+		t.Fatalf("v1 
changed-address restart action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV15}).ChangedAddressRestartAction(true); got != "control_plane_only" {
+		t.Fatalf("v1.5 changed-address restart action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(true); got != "explicit_reassignment_then_catchup" {
+		t.Fatalf("v2 recoverable changed-address restart action = %s", got)
+	}
+	if got := (ProtocolPolicy{Version: ProtocolV2}).ChangedAddressRestartAction(false); got != "explicit_reassignment_or_rebuild" {
+		t.Fatalf("v2 unrecoverable changed-address restart action = %s", got)
+	}
+}
diff --git a/sw-block/prototype/distsim/random.go b/sw-block/prototype/distsim/random.go
new file mode 100644
index 000000000..f51938a95
--- /dev/null
+++ b/sw-block/prototype/distsim/random.go
@@ -0,0 +1,256 @@
+package distsim
+
+import (
+	"fmt"
+	"math/rand"
+	"sort"
+)
+
+type RandomEvent string
+
+const (
+	RandomCommitWrite  RandomEvent = "commit_write"
+	RandomTick         RandomEvent = "tick"
+	RandomDisconnect   RandomEvent = "disconnect"
+	RandomReconnect    RandomEvent = "reconnect"
+	RandomStopNode     RandomEvent = "stop_node"
+	RandomStartNode    RandomEvent = "start_node"
+	RandomPromote      RandomEvent = "promote"
+	RandomTakeSnapshot RandomEvent = "take_snapshot"
+	RandomCatchup      RandomEvent = "catchup"
+	RandomRebuild      RandomEvent = "rebuild"
+)
+
+type RandomStep struct {
+	Step   int
+	Event  RandomEvent
+	Detail string
+}
+
+type RandomResult struct {
+	Seed      int64
+	Steps     []RandomStep
+	Cluster   *Cluster
+	Snapshots []string
+}
+
+func RunRandomScenario(seed int64, steps int) (*RandomResult, error) {
+	rng := rand.New(rand.NewSource(seed))
+	cluster := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
+	result := &RandomResult{
+		Seed:    seed,
+		Cluster: cluster,
+	}
+
+	for i := 0; i < steps; i++ {
+		step, err := runRandomStep(cluster, rng, i)
+		if err != nil {
+			result.Steps = append(result.Steps, step)
+			return result, err
+		}
+		result.Steps = 
append(result.Steps, step)
+		if err := assertClusterInvariants(cluster); err != nil {
+			return result, fmt.Errorf("seed=%d step=%d event=%s detail=%s: %w", seed, i, step.Event, step.Detail, err)
+		}
+	}
+	return result, assertClusterInvariants(cluster)
+}
+
+func runRandomStep(c *Cluster, rng *rand.Rand, step int) (RandomStep, error) {
+	events := []RandomEvent{
+		RandomCommitWrite,
+		RandomTick,
+		RandomDisconnect,
+		RandomReconnect,
+		RandomStopNode,
+		RandomStartNode,
+		RandomPromote,
+		RandomTakeSnapshot,
+		RandomCatchup,
+		RandomRebuild,
+	}
+	ev := events[rng.Intn(len(events))]
+	rs := RandomStep{Step: step, Event: ev}
+
+	switch ev {
+	case RandomCommitWrite:
+		block := uint64(rng.Intn(8) + 1)
+		lsn := c.CommitWrite(block)
+		rs.Detail = fmt.Sprintf("block=%d lsn=%d", block, lsn)
+	case RandomTick:
+		n := rng.Intn(3) + 1
+		c.TickN(n)
+		rs.Detail = fmt.Sprintf("ticks=%d", n)
+	case RandomDisconnect:
+		from, to := randomPair(c, rng)
+		c.Disconnect(from, to)
+		rs.Detail = fmt.Sprintf("%s->%s", from, to)
+	case RandomReconnect:
+		from, to := randomPair(c, rng)
+		c.Connect(from, to)
+		rs.Detail = fmt.Sprintf("%s->%s", from, to)
+	case RandomStopNode:
+		id := randomNodeID(c, rng)
+		c.StopNode(id)
+		rs.Detail = id
+	case RandomStartNode:
+		id := randomNodeID(c, rng)
+		c.StartNode(id)
+		rs.Detail = id
+	case RandomPromote:
+		if primary := c.Primary(); primary != nil && primary.Running {
+			rs.Detail = "primary_still_running"
+			return rs, nil
+		}
+		candidates := promotableNodes(c)
+		if len(candidates) == 0 {
+			rs.Detail = "no_candidate"
+			return rs, nil
+		}
+		id := candidates[rng.Intn(len(candidates))]
+		rs.Detail = id
+		if err := c.Promote(id); err != nil {
+			return rs, err
+		}
+	case RandomTakeSnapshot:
+		primary := c.Primary()
+		if primary == nil || !primary.Running {
+			rs.Detail = "no_primary"
+			return rs, nil
+		}
+		lsn := c.Coordinator.CommittedLSN
+		id := fmt.Sprintf("snap-%s-%d", primary.ID, lsn)
+		primary.Storage.TakeSnapshot(id, lsn)
+		rs.Detail = 
fmt.Sprintf("%s@%d", id, lsn)
+	case RandomCatchup:
+		id := randomReplicaID(c, rng)
+		if id == "" {
+			rs.Detail = "no_replica"
+			return rs, nil
+		}
+		node := c.Nodes[id]
+		if node == nil || !node.Running {
+			rs.Detail = id + ":down"
+			return rs, nil
+		}
+		start := node.Storage.FlushedLSN
+		end := c.Coordinator.CommittedLSN
+		if end <= start {
+			rs.Detail = fmt.Sprintf("%s:no_gap", id)
+			return rs, nil
+		}
+		rs.Detail = fmt.Sprintf("%s:%d..%d", id, start+1, end)
+		if err := c.RecoverReplicaFromPrimary(id, start, end); err != nil {
+			return rs, err
+		}
+	case RandomRebuild:
+		id := randomReplicaID(c, rng)
+		if id == "" {
+			rs.Detail = "no_replica"
+			return rs, nil
+		}
+		primary := c.Primary()
+		node := c.Nodes[id]
+		if primary == nil || node == nil || !primary.Running || !node.Running {
+			rs.Detail = id + ":unavailable"
+			return rs, nil
+		}
+		snapshotIDs := make([]string, 0, len(primary.Storage.Snapshots))
+		for snapID := range primary.Storage.Snapshots {
+			snapshotIDs = append(snapshotIDs, snapID)
+		}
+		if len(snapshotIDs) == 0 {
+			rs.Detail = id + ":no_snapshot"
+			return rs, nil
+		}
+		sort.Strings(snapshotIDs)
+		snapID := snapshotIDs[rng.Intn(len(snapshotIDs))]
+		rs.Detail = fmt.Sprintf("%s:%s->%d", id, snapID, c.Coordinator.CommittedLSN)
+		if err := c.RebuildReplicaFromSnapshot(id, snapID, c.Coordinator.CommittedLSN); err != nil {
+			return rs, err
+		}
+	default:
+		return rs, fmt.Errorf("unknown random event %s", ev)
+	}
+
+	return rs, nil
+}
+
+func randomNodeID(c *Cluster, rng *rand.Rand) string {
+	ids := append([]string(nil), c.Coordinator.Members...)
+	sort.Strings(ids)
+	if len(ids) == 0 {
+		return ""
+	}
+	return ids[rng.Intn(len(ids))]
+}
+
+func randomReplicaID(c *Cluster, rng *rand.Rand) string {
+	ids := c.replicaIDs()
+	if len(ids) == 0 {
+		return ""
+	}
+	return ids[rng.Intn(len(ids))]
+}
+
+func randomPair(c *Cluster, rng *rand.Rand) (string, string) {
+	from := randomNodeID(c, rng)
+	to := randomNodeID(c, rng)
+	if from == to {
+		ids := append([]string(nil), c.Coordinator.Members...)
+		sort.Strings(ids)
+		for _, id := range ids {
+			if id != from {
+				to = id
+				break
+			}
+		}
+	}
+	return from, to
+}
+
+func promotableNodes(c *Cluster) []string {
+	out := make([]string, 0)
+	want := c.Reference.StateAt(c.Coordinator.CommittedLSN)
+	for _, id := range c.Coordinator.Members {
+		n := c.Nodes[id]
+		if n == nil || !n.Running || n.Storage.FlushedLSN < c.Coordinator.CommittedLSN {
+			continue
+		}
+		if !EqualState(n.Storage.StateAt(c.Coordinator.CommittedLSN), want) {
+			continue
+		}
+		out = append(out, id)
+	}
+	sort.Strings(out)
+	return out
+}
+
+func assertClusterInvariants(c *Cluster) error {
+	committed := c.Coordinator.CommittedLSN
+	want := c.Reference.StateAt(committed)
+
+	for lsn, p := range c.Pending {
+		if p.Committed && lsn > committed {
+			return fmt.Errorf("pending lsn %d marked committed above coordinator committed lsn %d", lsn, committed)
+		}
+	}
+
+	for _, id := range promotableNodes(c) {
+		n := c.Nodes[id]
+		got := n.Storage.StateAt(committed)
+		if !EqualState(got, want) {
+			return fmt.Errorf("promotable node %s mismatch at committed lsn %d: got=%v want=%v", id, committed, got, want)
+		}
+	}
+
+	primary := c.Primary()
+	if primary != nil && primary.Running && primary.Epoch == c.Coordinator.Epoch {
+		got := primary.Storage.StateAt(committed)
+		if !EqualState(got, want) {
+			return fmt.Errorf("primary %s mismatch at committed lsn %d: got=%v want=%v", primary.ID, committed, got, want)
+		}
+	}
+
+	return nil
+}
diff --git a/sw-block/prototype/distsim/random_test.go 
b/sw-block/prototype/distsim/random_test.go
new file mode 100644
index 000000000..68f2c3fb4
--- /dev/null
+++ b/sw-block/prototype/distsim/random_test.go
@@ -0,0 +1,24 @@
+package distsim
+
+import (
+	"strconv"
+	"testing"
+)
+
+func TestRandomScenarioSeeds(t *testing.T) {
+	seeds := []int64{
+		1, 2, 3, 4, 5,
+		11, 21, 34, 55, 89,
+		101, 202, 303, 404, 505,
+	}
+
+	for _, seed := range seeds {
+		seed := seed
+		t.Run("seed_"+strconv.FormatInt(seed, 10), func(t *testing.T) {
+			t.Parallel()
+			if _, err := RunRandomScenario(seed, 60); err != nil {
+				t.Fatal(err)
+			}
+		})
+	}
+}
diff --git a/sw-block/prototype/distsim/reference.go b/sw-block/prototype/distsim/reference.go
new file mode 100644
index 000000000..c8168e870
--- /dev/null
+++ b/sw-block/prototype/distsim/reference.go
@@ -0,0 +1,95 @@
+package distsim
+
+type Write struct {
+	LSN   uint64
+	Block uint64
+	Value uint64
+}
+
+type Snapshot struct {
+	LSN   uint64
+	State map[uint64]uint64
+}
+
+type Reference struct {
+	writes    []Write
+	snapshots map[uint64]Snapshot
+}
+
+func NewReference() *Reference {
+	return &Reference{snapshots: map[uint64]Snapshot{}}
+}
+
+func (r *Reference) Apply(w Write) {
+	r.writes = append(r.writes, w)
+}
+
+func (r *Reference) StateAt(lsn uint64) map[uint64]uint64 {
+	state := make(map[uint64]uint64)
+	for _, w := range r.writes {
+		if w.LSN > lsn {
+			break
+		}
+		state[w.Block] = w.Value
+	}
+	return state
+}
+
+func cloneMap(in map[uint64]uint64) map[uint64]uint64 {
+	out := make(map[uint64]uint64, len(in))
+	for k, v := range in {
+		out[k] = v
+	}
+	return out
+}
+
+func (r *Reference) TakeSnapshot(lsn uint64) Snapshot {
+	s := Snapshot{LSN: lsn, State: 
cloneMap(r.StateAt(lsn))}
+	r.snapshots[lsn] = s
+	return s
+}
+
+func (r *Reference) SnapshotAt(lsn uint64) (Snapshot, bool) {
+	s, ok := r.snapshots[lsn]
+	return s, ok
+}
+
+type Node struct {
+	Extent map[uint64]uint64
+}
+
+func NewNode() *Node {
+	return &Node{Extent: map[uint64]uint64{}}
+}
+
+func (n *Node) ApplyWrite(w Write) {
+	n.Extent[w.Block] = w.Value
+}
+
+func (n *Node) LoadSnapshot(s Snapshot) {
+	n.Extent = cloneMap(s.State)
+}
+
+func (n *Node) ReplayFromWrites(writes []Write, startExclusive, endInclusive uint64) {
+	for _, w := range writes {
+		if w.LSN <= startExclusive {
+			continue
+		}
+		if w.LSN > endInclusive {
+			break
+		}
+		n.ApplyWrite(w)
+	}
+}
+
+func EqualState(a, b map[uint64]uint64) bool {
+	if len(a) != len(b) {
+		return false
+	}
+	for k, v := range a {
+		if b[k] != v {
+			return false
+		}
+	}
+	return true
+}
diff --git a/sw-block/prototype/distsim/reference_test.go b/sw-block/prototype/distsim/reference_test.go
new file mode 100644
index 000000000..124bb9ad5
--- /dev/null
+++ b/sw-block/prototype/distsim/reference_test.go
@@ -0,0 +1,66 @@
+package distsim
+
+import "testing"
+
+func TestWALReplayPreservesHistoricalValue(t *testing.T) {
+	ref := NewReference()
+	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
+	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
+
+	node := NewNode()
+	node.ReplayFromWrites(ref.writes, 0, 10)
+
+	want := ref.StateAt(10)
+	if !EqualState(node.Extent, want) {
+		t.Fatalf("replay mismatch: got=%v want=%v", node.Extent, want)
+	}
+}
+
+func TestCurrentExtentCannotRecoverOldLSN(t *testing.T) {
+	ref := NewReference()
+	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
+	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
+
+	primary := NewNode()
+	for _, w := range ref.writes {
+		primary.ApplyWrite(w)
+	}
+
+	wantOld := ref.StateAt(10)
+	if EqualState(primary.Extent, wantOld) {
+		t.Fatalf("latest extent should not equal old LSN state: latest=%v old=%v", primary.Extent, wantOld)
+	}
+}
+
+func 
TestSnapshotAtCpLSNRecoversCorrectHistoricalValue(t *testing.T) {
+	ref := NewReference()
+	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
+	snap := ref.TakeSnapshot(10)
+	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
+
+	node := NewNode()
+	node.LoadSnapshot(snap)
+
+	want := ref.StateAt(10)
+	if !EqualState(node.Extent, want) {
+		t.Fatalf("snapshot mismatch: got=%v want=%v", node.Extent, want)
+	}
+}
+
+func TestSnapshotPlusTrailingReplayReachesTargetLSN(t *testing.T) {
+	ref := NewReference()
+	ref.Apply(Write{LSN: 10, Block: 7, Value: 10})
+	ref.Apply(Write{LSN: 11, Block: 2, Value: 11})
+	snap := ref.TakeSnapshot(11)
+	ref.Apply(Write{LSN: 12, Block: 7, Value: 12})
+	ref.Apply(Write{LSN: 13, Block: 9, Value: 13})
+
+	node := NewNode()
+	node.LoadSnapshot(snap)
+	node.ReplayFromWrites(ref.writes, 11, 13)
+
+	want := ref.StateAt(13)
+	if !EqualState(node.Extent, want) {
+		t.Fatalf("snapshot+replay mismatch: got=%v want=%v", node.Extent, want)
+	}
+}
diff --git a/sw-block/prototype/distsim/simulator.go b/sw-block/prototype/distsim/simulator.go
new file mode 100644
index 000000000..24da8d3f2
--- /dev/null
+++ b/sw-block/prototype/distsim/simulator.go
@@ -0,0 +1,581 @@
+package distsim
+
+import (
+	"container/heap"
+	"fmt"
+	"math/rand"
+	"strings"
+)
+
+// --- Event types ---
+
+type EventKind int
+
+const (
+	EvWriteStart     EventKind = iota // client writes to primary
+	EvShipEntry                       // primary sends WAL entry to replica
+	EvShipDeliver                     // entry arrives at replica
+	EvBarrierSend                     // primary sends barrier to replica
+	EvBarrierDeliver                  // barrier arrives at replica
+	EvBarrierFsync                    // replica fsync completes
+	EvBarrierAck                      // ack arrives back at primary
+	EvNodeCrash                       // node crashes
+	EvNodeRestart                     // node restarts
+	EvLinkDown                        // network link drops
+	EvLinkUp                          // network link restores
+	EvFlusherTick                     // flusher checkpoint cycle
+	EvPromote                         // coordinator promotes a node
+	EvLockAcquire                     // thread tries to acquire lock
+	EvLockRelease                     // thread releases lock
+)
+
+func 
(k EventKind) String() string {
+	names := [...]string{
+		"WriteStart", "ShipEntry", "ShipDeliver",
+		"BarrierSend", "BarrierDeliver", "BarrierFsync", "BarrierAck",
+		"NodeCrash", "NodeRestart", "LinkDown", "LinkUp",
+		"FlusherTick", "Promote",
+		"LockAcquire", "LockRelease",
+	}
+	if int(k) < len(names) {
+		return names[k]
+	}
+	return fmt.Sprintf("Event(%d)", k)
+}
+
+type Event struct {
+	Time    uint64
+	ID      uint64 // unique, for stable ordering
+	Kind    EventKind
+	NodeID  string
+	Payload EventPayload
+}
+
+type EventPayload struct {
+	Write     Write  // for WriteStart, ShipEntry, ShipDeliver
+	TargetLSN uint64 // for barriers
+	FromNode  string // for delivered messages
+	ToNode    string
+	LockName  string // for lock events
+	ThreadID  string
+	PromoteID string // for EvPromote
+}
+
+// --- Priority queue ---
+
+type eventHeap []Event
+
+func (h eventHeap) Len() int      { return len(h) }
+func (h eventHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
+func (h eventHeap) Less(i, j int) bool {
+	if h[i].Time != h[j].Time {
+		return h[i].Time < h[j].Time
+	}
+	return h[i].ID < h[j].ID // stable tie-break
+}
+func (h *eventHeap) Push(x interface{}) { *h = append(*h, x.(Event)) }
+func (h *eventHeap) Pop() interface{} {
+	old := *h
+	n := len(old)
+	e := old[n-1]
+	*h = old[:n-1]
+	return e
+}
+
+// --- Lock model ---
+
+type lockState struct {
+	held    bool
+	holder  string  // threadID
+	waiting []Event // parked EvLockAcquire events
+}
+
+// --- Trace ---
+
+type TraceEntry struct {
+	Time  uint64
+	Event Event
+	Note  string
+}
+
+// --- Simulator ---
+
+type Simulator struct {
+	Cluster   *Cluster
+	rng       *rand.Rand
+	queue     eventHeap
+	nextID    uint64
+	locks     map[string]*lockState // lockName -> state
+	trace     []TraceEntry
+	Errors    []string
+	maxTime   uint64
+	jitterMax uint64 // max random delay added to message delivery
+
+	// Config
+	FaultRate float64 // probability of injecting a fault per step [0,1]
+	MaxEvents int     // stop after this many events
+	eventsRun int
+}
+
+func 
NewSimulator(cluster *Cluster, seed int64) *Simulator {
+	return &Simulator{
+		Cluster:   cluster,
+		rng:       rand.New(rand.NewSource(seed)),
+		locks:     map[string]*lockState{},
+		maxTime:   100000,
+		jitterMax: 3,
+		FaultRate: 0.05,
+		MaxEvents: 5000,
+	}
+}
+
+// Enqueue adds an event to the priority queue.
+func (s *Simulator) Enqueue(e Event) {
+	s.nextID++
+	e.ID = s.nextID
+	heap.Push(&s.queue, e)
+}
+
+// EnqueueAt is a convenience for enqueueing at a specific time.
+func (s *Simulator) EnqueueAt(time uint64, kind EventKind, nodeID string, payload EventPayload) {
+	s.Enqueue(Event{Time: time, Kind: kind, NodeID: nodeID, Payload: payload})
+}
+
+// jitter returns a random delay in [1, jitterMax].
+func (s *Simulator) jitter() uint64 {
+	if s.jitterMax <= 1 {
+		return 1
+	}
+	return 1 + uint64(s.rng.Int63n(int64(s.jitterMax)))
+}
+
+// --- Main loop ---
+
+// Step executes the next event. It returns false once the queue is empty,
+// the event or time limit is reached, or an invariant violation has been
+// recorded. When multiple events share the same timestamp, one is chosen
+// randomly to explore different interleavings across runs with different seeds.
+func (s *Simulator) Step() bool {
+	if s.queue.Len() == 0 || s.eventsRun >= s.MaxEvents {
+		return false
+	}
+	// Collect all events at the earliest timestamp.
+	earliest := s.queue[0].Time
+	if earliest > s.maxTime {
+		return false
+	}
+	var ready []Event
+	for s.queue.Len() > 0 && s.queue[0].Time == earliest {
+		ready = append(ready, heap.Pop(&s.queue).(Event))
+	}
+	// Shuffle to randomize interleaving of equal-time events.
+	s.rng.Shuffle(len(ready), func(i, j int) { ready[i], ready[j] = ready[j], ready[i] })
+	// Execute the first, re-enqueue the rest.
+	e := ready[0]
+	for _, r := range ready[1:] {
+		heap.Push(&s.queue, r)
+	}
+
+	s.Cluster.Now = e.Time
+	s.eventsRun++
+
+	s.execute(e)
+	s.checkInvariants(e)
+
+	return len(s.Errors) == 0
+}
+
+// Run executes until queue empty, limit reached, or invariant violated.
+func (s *Simulator) Run() {
+	for s.Step() {
+	}
+}
+
+// --- Event execution ---
+
+func (s *Simulator) execute(e Event) {
+	node := s.Cluster.Nodes[e.NodeID]
+
+	switch e.Kind {
+	case EvWriteStart:
+		s.executeWriteStart(e)
+
+	case EvShipEntry:
+		// Primary ships entry to a replica. Enqueue delivery with jitter.
+		if node != nil && node.Running {
+			deliverTime := s.Cluster.Now + s.jitter()
+			s.EnqueueAt(deliverTime, EvShipDeliver, e.Payload.ToNode, EventPayload{
+				Write:    e.Payload.Write,
+				FromNode: e.NodeID,
+			})
+			s.record(e, fmt.Sprintf("ship LSN=%d to %s, deliver@%d", e.Payload.Write.LSN, e.Payload.ToNode, deliverTime))
+		}
+
+	case EvShipDeliver:
+		if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
+			if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
+				node.Storage.AppendWrite(e.Payload.Write)
+				s.record(e, fmt.Sprintf("deliver LSN=%d on %s, receivedLSN=%d", e.Payload.Write.LSN, e.NodeID, node.Storage.ReceivedLSN))
+			} else {
+				s.record(e, fmt.Sprintf("drop LSN=%d to %s (link down)", e.Payload.Write.LSN, e.NodeID))
+			}
+		}
+
+	case EvBarrierSend:
+		if node != nil && node.Running {
+			deliverTime := s.Cluster.Now + s.jitter()
+			s.EnqueueAt(deliverTime, EvBarrierDeliver, e.Payload.ToNode, EventPayload{
+				TargetLSN: e.Payload.TargetLSN,
+				FromNode:  e.NodeID,
+			})
+			s.record(e, fmt.Sprintf("barrier LSN=%d to %s", e.Payload.TargetLSN, e.Payload.ToNode))
+		}
+
+	case EvBarrierDeliver:
+		if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
+			if s.Cluster.Links[e.Payload.FromNode] != nil && s.Cluster.Links[e.Payload.FromNode][e.NodeID] {
+				if node.Storage.ReceivedLSN >= e.Payload.TargetLSN {
+					// Can fsync now. Enqueue fsync completion with small delay.
+					s.EnqueueAt(s.Cluster.Now+1, EvBarrierFsync, e.NodeID, EventPayload{
+						TargetLSN: e.Payload.TargetLSN,
+						FromNode:  e.Payload.FromNode,
+					})
+					s.record(e, fmt.Sprintf("barrier deliver LSN=%d, fsync scheduled", e.Payload.TargetLSN))
+				} else {
+					// Not enough entries yet. Re-enqueue barrier with delay (retry).
+					s.EnqueueAt(s.Cluster.Now+1, EvBarrierDeliver, e.NodeID, e.Payload)
+					s.record(e, fmt.Sprintf("barrier LSN=%d waiting (received=%d)", e.Payload.TargetLSN, node.Storage.ReceivedLSN))
+				}
+			}
+		}
+
+	case EvBarrierFsync:
+		if node != nil && node.Running {
+			node.Storage.AdvanceFlush(e.Payload.TargetLSN)
+			// Send ack back to primary.
+			deliverTime := s.Cluster.Now + s.jitter()
+			s.EnqueueAt(deliverTime, EvBarrierAck, e.Payload.FromNode, EventPayload{
+				TargetLSN: e.Payload.TargetLSN,
+				FromNode:  e.NodeID,
+			})
+			s.record(e, fmt.Sprintf("fsync LSN=%d on %s, flushedLSN=%d", e.Payload.TargetLSN, e.NodeID, node.Storage.FlushedLSN))
+		}
+
+	case EvBarrierAck:
+		// Only process acks on running nodes in the current epoch.
+		// After crash+promote, stale acks for the old primary must not advance commits.
+		if node != nil && node.Running && node.Epoch == s.Cluster.Coordinator.Epoch {
+			if pending := s.Cluster.Pending[e.Payload.TargetLSN]; pending != nil {
+				pending.DurableOn[e.Payload.FromNode] = true
+				s.Cluster.refreshCommits()
+				s.record(e, fmt.Sprintf("ack LSN=%d from %s, durable=%d", e.Payload.TargetLSN, e.Payload.FromNode, s.Cluster.durableAckCount(pending)))
+			}
+		} else {
+			s.record(e, fmt.Sprintf("ack LSN=%d from %s DROPPED (node down or stale epoch)", e.Payload.TargetLSN, e.Payload.FromNode))
+		}
+
+	case EvNodeCrash:
+		if node != nil {
+			node.Running = false
+			// Drop all pending events for this node.
+			s.dropEventsForNode(e.NodeID)
+			s.record(e, fmt.Sprintf("CRASH %s", e.NodeID))
+		}
+
+	case EvNodeRestart:
+		if node != nil {
+			node.Running = true
+			node.Epoch = s.Cluster.Coordinator.Epoch
+			s.record(e, fmt.Sprintf("RESTART %s epoch=%d", e.NodeID, node.Epoch))
+		}
+
+	case EvLinkDown:
+		s.Cluster.Disconnect(e.Payload.FromNode, e.Payload.ToNode)
+		s.Cluster.Disconnect(e.Payload.ToNode, e.Payload.FromNode)
+		s.record(e, fmt.Sprintf("LINK DOWN %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))
+
+	case EvLinkUp:
+		s.Cluster.Connect(e.Payload.FromNode, e.Payload.ToNode)
+		s.Cluster.Connect(e.Payload.ToNode, e.Payload.FromNode)
+		s.record(e, fmt.Sprintf("LINK UP %s <-> %s", e.Payload.FromNode, e.Payload.ToNode))
+
+	case EvFlusherTick:
+		if node != nil && node.Running {
+			node.Storage.AdvanceCheckpoint(node.Storage.FlushedLSN)
+			s.record(e, fmt.Sprintf("flusher tick %s checkpoint=%d", e.NodeID, node.Storage.CheckpointLSN))
+		}
+
+	case EvPromote:
+		if err := s.Cluster.Promote(e.Payload.PromoteID); err != nil {
+			s.record(e, fmt.Sprintf("promote %s FAILED: %v", e.Payload.PromoteID, err))
+		} else {
+			s.record(e, fmt.Sprintf("PROMOTE %s epoch=%d", e.Payload.PromoteID, s.Cluster.Coordinator.Epoch))
+		}
+
+	case EvLockAcquire:
+		s.executeLockAcquire(e)
+
+	case EvLockRelease:
+		s.executeLockRelease(e)
+	}
+}
+
+func (s *Simulator) executeWriteStart(e Event) {
+	c := s.Cluster
+	primary := c.Primary()
+	if primary == nil || !primary.Running || primary.Epoch != c.Coordinator.Epoch {
+		s.record(e, "write rejected: no valid primary")
+		return
+	}
+	c.nextLSN++
+	w := Write{LSN: c.nextLSN, Block: e.Payload.Write.Block, Value: c.nextLSN}
+	primary.Storage.AppendWrite(w)
+	primary.Storage.AdvanceFlush(w.LSN)
+	c.Reference.Apply(w)
+	c.Pending[w.LSN] = &PendingCommit{
+		Write:     w,
+		DurableOn: map[string]bool{primary.ID: true},
+	}
+	c.refreshCommits()
+
+	// Ship to each replica with jitter.
+	for _, rid := range c.replicaIDs() {
+		shipTime := s.Cluster.Now + s.jitter()
+		s.EnqueueAt(shipTime, EvShipEntry, primary.ID, EventPayload{
+			Write:  w,
+			ToNode: rid,
+		})
+	}
+	// Barrier after ship.
+	for _, rid := range c.replicaIDs() {
+		barrierTime := s.Cluster.Now + s.jitter() + 2
+		s.EnqueueAt(barrierTime, EvBarrierSend, primary.ID, EventPayload{
+			TargetLSN: w.LSN,
+			ToNode:    rid,
+		})
+	}
+
+	s.record(e, fmt.Sprintf("write block=%d LSN=%d", w.Block, w.LSN))
+}
+
+func (s *Simulator) executeLockAcquire(e Event) {
+	name := e.Payload.LockName
+	ls, ok := s.locks[name]
+	if !ok {
+		ls = &lockState{}
+		s.locks[name] = ls
+	}
+	if !ls.held {
+		ls.held = true
+		ls.holder = e.Payload.ThreadID
+		s.record(e, fmt.Sprintf("lock %s acquired by %s", name, e.Payload.ThreadID))
+	} else {
+		// Park the waiter; it is granted the lock when the current holder releases.
+		ls.waiting = append(ls.waiting, e)
+		s.record(e, fmt.Sprintf("lock %s BLOCKED %s (held by %s)", name, e.Payload.ThreadID, ls.holder))
+	}
+}
+
+func (s *Simulator) executeLockRelease(e Event) {
+	name := e.Payload.LockName
+	ls := s.locks[name]
+	if ls == nil || !ls.held {
+		return
+	}
+	// Validate: only the holder can release.
+	if ls.holder != e.Payload.ThreadID {
+		s.record(e, fmt.Sprintf("lock %s release REJECTED: %s is not holder (held by %s)", name, e.Payload.ThreadID, ls.holder))
+		return
+	}
+	s.record(e, fmt.Sprintf("lock %s released by %s", name, ls.holder))
+	ls.held = false
+	ls.holder = ""
+	// Grant to next waiter (random pick among waiters for interleaving exploration).
+	if len(ls.waiting) > 0 {
+		idx := s.rng.Intn(len(ls.waiting))
+		next := ls.waiting[idx]
+		ls.waiting = append(ls.waiting[:idx], ls.waiting[idx+1:]...)
+		ls.held = true
+		ls.holder = next.Payload.ThreadID
+		s.record(next, fmt.Sprintf("lock %s granted to %s (was waiting)", name, next.Payload.ThreadID))
+	}
+}
+
+func (s *Simulator) dropEventsForNode(nodeID string) {
+	var kept eventHeap
+	for _, e := range s.queue {
+		if e.NodeID != nodeID {
+			kept = append(kept, e)
+		}
+	}
+	s.queue = kept
+	heap.Init(&s.queue)
+}
+
+// --- Invariant checking ---
+
+func (s *Simulator) checkInvariants(after Event) {
+	// 1. Commit safety: committed LSN must be durable on policy-required nodes.
+	for lsn := uint64(1); lsn <= s.Cluster.Coordinator.CommittedLSN; lsn++ {
+		p := s.Cluster.Pending[lsn]
+		if p == nil {
+			continue
+		}
+		if !s.Cluster.commitSatisfied(p) {
+			s.addError(after, fmt.Sprintf("committed LSN %d not durable per policy", lsn))
+		}
+	}
+
+	// 2. No false commit on promoted node.
+	// An uncommitted-but-durable entry is expected on the original primary and
+	// only indicates a bug when it appears on a newly promoted node. The
+	// simulator does not yet distinguish a promoted node from the original
+	// primary, so this invariant is documented here but not asserted.
+	primary := s.Cluster.Primary()
+
+	// 3. Data correctness: primary state matches reference at the LSN it actually has.
+	// After promotion, the new primary may not have all writes the old primary committed.
+	// Verify correctness only up to what the current primary has durably received.
+	if primary != nil && primary.Running {
+		checkLSN := primary.Storage.FlushedLSN
+		if checkLSN > s.Cluster.Coordinator.CommittedLSN {
+			checkLSN = s.Cluster.Coordinator.CommittedLSN
+		}
+		if checkLSN > 0 {
+			refState := s.Cluster.Reference.StateAt(checkLSN)
+			nodeState := primary.Storage.StateAt(checkLSN)
+			if !EqualState(refState, nodeState) {
+				s.addError(after, fmt.Sprintf("data divergence on primary %s at LSN=%d",
+					primary.ID, checkLSN))
+			}
+		}
+	}
+
+	// 4. Epoch fencing: no running node claims an epoch ahead of the coordinator.
+	for id, node := range s.Cluster.Nodes {
+		if node.Running && node.Epoch > s.Cluster.Coordinator.Epoch {
+			s.addError(after, fmt.Sprintf("node %s has future epoch %d > coordinator %d", id, node.Epoch, s.Cluster.Coordinator.Epoch))
+		}
+	}
+
+	// 5. Lock sanity: mutual exclusion holds by construction (a lock is granted
+	// to one holder at a time), so check the bookkeeping instead: a held lock
+	// must always record its holder.
+	for name, ls := range s.locks {
+		if ls.held && ls.holder == "" {
+			s.addError(after, fmt.Sprintf("lock %s held but no holder", name))
+		}
+	}
+}
+
+func (s *Simulator) addError(after Event, msg string) {
+	s.Errors = append(s.Errors, fmt.Sprintf("t=%d after %s on %s: %s",
+		after.Time, after.Kind, after.NodeID, msg))
+}
+
+func (s *Simulator) record(e Event, note string) {
+	s.trace = append(s.trace, TraceEntry{Time: e.Time, Event: e, Note: note})
+}
+
+// --- Random fault injection ---
+
+// InjectRandomFault schedules a random fault (crash, partition, heal)
+// at a random future time within [Now+1, Now+30].
+func (s *Simulator) InjectRandomFault() {
+	s.InjectRandomFaultWithin(30)
+}
+
+// InjectRandomFaultWithin schedules a random fault, with probability FaultRate,
+// at a random time within [Now+1, Now+spread].
+func (s *Simulator) InjectRandomFaultWithin(spread uint64) {
+	if s.rng.Float64() > s.FaultRate {
+		return
+	}
+	members := s.Cluster.Coordinator.Members
+	if len(members) == 0 {
+		return
+	}
+	faultTime := s.Cluster.Now + 1 + uint64(s.rng.Int63n(int64(spread)))
+
+	switch s.rng.Intn(3) {
+	case 0: // crash a random node
+		id := members[s.rng.Intn(len(members))]
+		s.EnqueueAt(faultTime, EvNodeCrash, id, EventPayload{})
+	case 1: // drop a link
+		from := members[s.rng.Intn(len(members))]
+		to := members[s.rng.Intn(len(members))]
+		if from != to {
+			s.EnqueueAt(faultTime, EvLinkDown, from, EventPayload{FromNode: from, ToNode: to})
+		}
+	case 2: // restore a link
+		from := members[s.rng.Intn(len(members))]
+		to := members[s.rng.Intn(len(members))]
+		if from != to {
+			s.EnqueueAt(faultTime, EvLinkUp, from, EventPayload{FromNode: from, ToNode: to})
+		}
+	}
+}
+
+// --- Scenario helpers ---
+
+// ScheduleWrites enqueues n writes at random times in [start, start+spread).
+func (s *Simulator) ScheduleWrites(n int, start, spread uint64) {
+	for i := 0; i < n; i++ {
+		t := start + uint64(s.rng.Int63n(int64(spread)))
+		block := uint64(s.rng.Intn(16))
+		s.EnqueueAt(t, EvWriteStart, s.Cluster.Coordinator.PrimaryID, EventPayload{
+			Write: Write{Block: block},
+		})
+	}
+}
+
+// ScheduleCrashAndPromote enqueues a primary crash at crashTime and promotes promoteID at promoteTime.
+func (s *Simulator) ScheduleCrashAndPromote(crashTime uint64, promoteID string, promoteTime uint64) {
+	s.EnqueueAt(crashTime, EvNodeCrash, s.Cluster.Coordinator.PrimaryID, EventPayload{})
+	s.EnqueueAt(promoteTime, EvPromote, "", EventPayload{PromoteID: promoteID})
+}
+
+// ScheduleFlusherTicks enqueues periodic flusher ticks for a node.
+func (s *Simulator) ScheduleFlusherTicks(nodeID string, start, interval uint64, count int) { + for i := 0; i < count; i++ { + s.EnqueueAt(start+uint64(i)*interval, EvFlusherTick, nodeID, EventPayload{}) + } +} + +// --- Output --- + +// TraceString returns the full trace as a string. +func (s *Simulator) TraceString() string { + var sb strings.Builder + for _, te := range s.trace { + fmt.Fprintf(&sb, "[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note) + } + return sb.String() +} + +// ErrorString returns all errors. +func (s *Simulator) ErrorString() string { + return strings.Join(s.Errors, "\n") +} + +// AssertCommittedDataCorrect checks that the current primary's state matches the reference. +func (s *Simulator) AssertCommittedDataCorrect() error { + primary := s.Cluster.Primary() + if primary == nil { + return fmt.Errorf("no primary") + } + committedLSN := s.Cluster.Coordinator.CommittedLSN + if committedLSN == 0 { + return nil + } + refState := s.Cluster.Reference.StateAt(committedLSN) + nodeState := primary.Storage.StateAt(committedLSN) + if !EqualState(refState, nodeState) { + return fmt.Errorf("data divergence on %s at LSN=%d: ref=%v node=%v", + primary.ID, committedLSN, refState, nodeState) + } + return nil +} diff --git a/sw-block/prototype/distsim/simulator_test.go b/sw-block/prototype/distsim/simulator_test.go new file mode 100644 index 000000000..a1ac0482c --- /dev/null +++ b/sw-block/prototype/distsim/simulator_test.go @@ -0,0 +1,285 @@ +package distsim + +import ( + "fmt" + "strings" + "testing" +) + +// --- Fixed scenarios --- + +func TestSim_BasicWriteAndCommit(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + sim := NewSimulator(c, 42) + + sim.ScheduleWrites(3, 1, 5) + sim.Run() + + if c.Coordinator.CommittedLSN < 1 { + t.Fatalf("expected at least 1 committed write, got %d", c.Coordinator.CommittedLSN) + } + if err := sim.AssertCommittedDataCorrect(); err != nil { + t.Fatal(err) + } + if len(sim.Errors) 
> 0 { + t.Fatalf("invariant violations:\n%s", sim.ErrorString()) + } +} + +func TestSim_CrashAfterCommit_DataSurvives(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + sim := NewSimulator(c, 99) + + // Write, let it commit, then crash primary, promote r1. + sim.ScheduleWrites(5, 1, 3) + sim.ScheduleCrashAndPromote(20, "r1", 22) + + sim.Run() + + if len(sim.Errors) > 0 { + t.Fatalf("invariant violations:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString()) + } + if err := sim.AssertCommittedDataCorrect(); err != nil { + t.Fatal(err) + } +} + +func TestSim_PartitionThenHeal(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + sim := NewSimulator(c, 777) + + // Write some data. + sim.ScheduleWrites(3, 1, 3) + // Partition r2 at time 5. + sim.EnqueueAt(5, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r2"}) + // Write more during partition. + sim.ScheduleWrites(3, 8, 3) + // Heal at time 15. + sim.EnqueueAt(15, EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r2"}) + // Write after heal. + sim.ScheduleWrites(2, 18, 3) + + sim.Run() + + if len(sim.Errors) > 0 { + t.Fatalf("invariant violations:\n%s", sim.ErrorString()) + } + if err := sim.AssertCommittedDataCorrect(); err != nil { + t.Fatal(err) + } +} + +func TestSim_SyncAll_UncommittedNotVisible(t *testing.T) { + c := NewCluster(CommitSyncAll, "p", "r1") + sim := NewSimulator(c, 123) + + // Partition r1 so nothing can commit under sync_all. + sim.EnqueueAt(0, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"}) + sim.ScheduleWrites(3, 1, 3) + + sim.Run() + + // Nothing should be committed. 
+ if c.Coordinator.CommittedLSN != 0 { + t.Fatalf("sync_all with partitioned replica should not commit, got %d", c.Coordinator.CommittedLSN) + } + if len(sim.Errors) > 0 { + t.Fatalf("invariant violations:\n%s", sim.ErrorString()) + } +} + +func TestSim_MessageReorderingDoesNotBreakSafety(t *testing.T) { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + sim := NewSimulator(c, 555) + sim.jitterMax = 8 // high jitter to force reordering + + sim.ScheduleWrites(10, 1, 5) + sim.Run() + + if len(sim.Errors) > 0 { + t.Fatalf("invariant violations with high jitter:\n%s", sim.ErrorString()) + } + if err := sim.AssertCommittedDataCorrect(); err != nil { + t.Fatal(err) + } +} + +// --- Randomized property-based testing --- + +func TestSim_Randomized_CommitSafety(t *testing.T) { + const numSeeds = 500 + const numWrites = 20 + failures := 0 + + for seed := int64(0); seed < numSeeds; seed++ { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + sim := NewSimulator(c, seed) + sim.MaxEvents = 2000 + + // Random writes. + sim.ScheduleWrites(numWrites, 1, 30) + + // Random crash + promote somewhere in the middle. + crashTime := uint64(sim.rng.Intn(25) + 5) + sim.ScheduleCrashAndPromote(crashTime, "r1", crashTime+3) + + sim.Run() + + if len(sim.Errors) > 0 { + t.Errorf("seed %d: invariant violation:\n%s\nTrace (last 20):\n%s", + seed, sim.ErrorString(), lastN(sim.trace, 20)) + failures++ + if failures >= 3 { + t.Fatal("too many failures, stopping") + } + } + } + t.Logf("randomized: %d/%d seeds passed", numSeeds-failures, numSeeds) +} + +func TestSim_Randomized_WithFaults(t *testing.T) { + const numSeeds = 300 + failures := 0 + + for seed := int64(0); seed < numSeeds; seed++ { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + sim := NewSimulator(c, seed) + sim.FaultRate = 0.08 + sim.MaxEvents = 1500 + sim.jitterMax = 5 + + // Interleave writes and random faults. 
+		for i := 0; i < 15; i++ {
+			writeTime := uint64(i*3 + 1)
+			sim.EnqueueAt(writeTime, EvWriteStart, "p", EventPayload{
+				Write: Write{Block: uint64(sim.rng.Intn(8))},
+			})
+			sim.InjectRandomFault()
+		}
+
+		sim.Run()
+
+		if len(sim.Errors) > 0 {
+			t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
+			failures++
+			if failures >= 3 {
+				t.Fatal("too many failures, stopping")
+			}
+		}
+	}
+	t.Logf("randomized+faults: %d/%d seeds passed", numSeeds-failures, numSeeds)
+}
+
+func TestSim_Randomized_SyncAll(t *testing.T) {
+	const numSeeds = 200
+	failures := 0
+
+	for seed := int64(0); seed < numSeeds; seed++ {
+		c := NewCluster(CommitSyncAll, "p", "r1")
+		sim := NewSimulator(c, seed)
+		sim.MaxEvents = 1000
+
+		sim.ScheduleWrites(10, 1, 20)
+
+		// Random partition/heal.
+		if sim.rng.Float64() < 0.5 {
+			pTime := uint64(sim.rng.Intn(15) + 1)
+			sim.EnqueueAt(pTime, EvLinkDown, "p", EventPayload{FromNode: "p", ToNode: "r1"})
+			sim.EnqueueAt(pTime+uint64(sim.rng.Intn(10)+3), EvLinkUp, "p", EventPayload{FromNode: "p", ToNode: "r1"})
+		}
+
+		sim.Run()
+
+		if len(sim.Errors) > 0 {
+			t.Errorf("seed %d: invariant violation:\n%s", seed, sim.ErrorString())
+			failures++
+			if failures >= 3 {
+				t.Fatal("too many failures, stopping")
+			}
+		}
+	}
+	t.Logf("sync_all randomized: %d/%d seeds passed", numSeeds-failures, numSeeds)
+}
+
+// --- Lock contention tests ---
+
+func TestSim_LockContention_NoDoubleHold(t *testing.T) {
+	c := NewCluster(CommitSyncQuorum, "p", "r1", "r2")
+	sim := NewSimulator(c, 42)
+
+	// Two threads try to acquire the same lock at the same time.
+	sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"})
+	sim.EnqueueAt(5, EvLockAcquire, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"})
+
+	// First release.
+	sim.EnqueueAt(8, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-1"})
+	// Second release (whoever got granted after writer-1 releases).
+ sim.EnqueueAt(11, EvLockRelease, "p", EventPayload{LockName: "shipMu", ThreadID: "writer-2"}) + + sim.Run() + + if len(sim.Errors) > 0 { + t.Fatalf("lock invariant violated:\n%s\nTrace:\n%s", sim.ErrorString(), sim.TraceString()) + } + + // Verify the trace shows one blocked, one granted. + trace := sim.TraceString() + if !containsStr(trace, "BLOCKED") { + t.Fatal("expected one thread to be BLOCKED on lock contention") + } + if !containsStr(trace, "granted to") { + t.Fatal("expected blocked thread to be granted after release") + } +} + +func TestSim_LockContention_Randomized(t *testing.T) { + // Run many seeds with concurrent lock acquires at the same time. + // The simulator should pick a random winner each time (seed-dependent). + winners := map[string]int{} + for seed := int64(0); seed < 100; seed++ { + c := NewCluster(CommitSyncQuorum, "p", "r1", "r2") + sim := NewSimulator(c, seed) + + sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "A"}) + sim.EnqueueAt(1, EvLockAcquire, "p", EventPayload{LockName: "mu", ThreadID: "B"}) + sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "A"}) + sim.EnqueueAt(3, EvLockRelease, "p", EventPayload{LockName: "mu", ThreadID: "B"}) + + sim.Run() + + if len(sim.Errors) > 0 { + t.Fatalf("seed %d: %s", seed, sim.ErrorString()) + } + + // Check who got the lock first by looking at the trace. + for _, te := range sim.trace { + if te.Event.Kind == EvLockAcquire && containsStr(te.Note, "acquired") { + winners[te.Event.Payload.ThreadID]++ + break + } + } + } + // Both threads should win at least some seeds (randomization works). 
+ if winners["A"] == 0 || winners["B"] == 0 { + t.Fatalf("lock winner not randomized: A=%d B=%d", winners["A"], winners["B"]) + } + t.Logf("lock winner distribution: A=%d B=%d", winners["A"], winners["B"]) +} + +func containsStr(s, substr string) bool { + return len(s) > 0 && len(substr) > 0 && strings.Contains(s, substr) +} + +// --- Helpers --- + +func lastN(trace []TraceEntry, n int) string { + start := len(trace) - n + if start < 0 { + start = 0 + } + s := "" + for _, te := range trace[start:] { + s += fmt.Sprintf("[t=%d] %s on %s: %s\n", te.Time, te.Event.Kind, te.Event.NodeID, te.Note) + } + return s +} diff --git a/sw-block/prototype/distsim/storage.go b/sw-block/prototype/distsim/storage.go new file mode 100644 index 000000000..b5f52153b --- /dev/null +++ b/sw-block/prototype/distsim/storage.go @@ -0,0 +1,129 @@ +package distsim + +import "sort" + +type SnapshotState struct { + ID string + LSN uint64 + State map[uint64]uint64 +} + +type Storage struct { + WAL []Write + Extent map[uint64]uint64 + ReceivedLSN uint64 + FlushedLSN uint64 + CheckpointLSN uint64 + Snapshots map[string]SnapshotState + BaseSnapshot *SnapshotState +} + +func NewStorage() *Storage { + return &Storage{ + Extent: map[uint64]uint64{}, + Snapshots: map[string]SnapshotState{}, + } +} + +func (s *Storage) AppendWrite(w Write) { + // Insert in LSN order (handles out-of-order delivery from jitter). + inserted := false + for i, existing := range s.WAL { + if w.LSN == existing.LSN { + return // duplicate, skip + } + if w.LSN < existing.LSN { + s.WAL = append(s.WAL[:i], append([]Write{w}, s.WAL[i:]...)...) 
+			inserted = true
+			break
+		}
+	}
+	if !inserted {
+		s.WAL = append(s.WAL, w)
+	}
+	if w.LSN > s.ReceivedLSN {
+		// Newest write overall: safe to apply directly to the extent.
+		s.ReceivedLSN = w.LSN
+		s.Extent[w.Block] = w.Value
+	} else {
+		// Out-of-order insert: rebuild the extent so a stale write cannot
+		// overwrite a newer value for the same block.
+		s.Extent = s.StateAt(s.ReceivedLSN)
+	}
+}
+
+func (s *Storage) AdvanceFlush(lsn uint64) {
+	if lsn > s.ReceivedLSN {
+		lsn = s.ReceivedLSN
+	}
+	if lsn > s.FlushedLSN {
+		s.FlushedLSN = lsn
+	}
+}
+
+func (s *Storage) AdvanceCheckpoint(lsn uint64) {
+	if lsn > s.FlushedLSN {
+		lsn = s.FlushedLSN
+	}
+	if lsn > s.CheckpointLSN {
+		s.CheckpointLSN = lsn
+	}
+}
+
+func (s *Storage) StateAt(lsn uint64) map[uint64]uint64 {
+	state := map[uint64]uint64{}
+	if s.BaseSnapshot != nil {
+		if s.BaseSnapshot.LSN > lsn {
+			return cloneMap(s.BaseSnapshot.State)
+		}
+		state = cloneMap(s.BaseSnapshot.State)
+	}
+	for _, w := range s.WAL {
+		if w.LSN > lsn {
+			break
+		}
+		if s.BaseSnapshot != nil && w.LSN <= s.BaseSnapshot.LSN {
+			continue
+		}
+		state[w.Block] = w.Value
+	}
+	return state
+}
+
+func (s *Storage) TakeSnapshot(id string, lsn uint64) SnapshotState {
+	snap := SnapshotState{
+		ID:    id,
+		LSN:   lsn,
+		State: cloneMap(s.StateAt(lsn)),
+	}
+	s.Snapshots[id] = snap
+	return snap
+}
+
+func (s *Storage) LoadSnapshot(snap SnapshotState) {
+	s.Extent = cloneMap(snap.State)
+	s.FlushedLSN = snap.LSN
+	s.ReceivedLSN = snap.LSN
+	s.CheckpointLSN = snap.LSN
+	s.BaseSnapshot = &SnapshotState{
+		ID:    snap.ID,
+		LSN:   snap.LSN,
+		State: cloneMap(snap.State),
+	}
+	s.WAL = nil
+}
+
+func (s *Storage) ReplaceWAL(writes []Write) {
+	s.WAL = append([]Write(nil), writes...)
+ sort.Slice(s.WAL, func(i, j int) bool { return s.WAL[i].LSN < s.WAL[j].LSN }) + s.Extent = s.StateAt(s.ReceivedLSN) +} + +func writesInRange(writes []Write, startExclusive, endInclusive uint64) []Write { + out := make([]Write, 0) + for _, w := range writes { + if w.LSN <= startExclusive { + continue + } + if w.LSN > endInclusive { + break + } + out = append(out, w) + } + return out +} diff --git a/sw-block/prototype/enginev2/assignment.go b/sw-block/prototype/enginev2/assignment.go new file mode 100644 index 000000000..4f775a565 --- /dev/null +++ b/sw-block/prototype/enginev2/assignment.go @@ -0,0 +1,64 @@ +package enginev2 + +// AssignmentIntent represents a coordinator-driven assignment update. +// It specifies the desired replica set and which replicas need recovery. +type AssignmentIntent struct { + Endpoints map[string]Endpoint // desired replica set + Epoch uint64 // current epoch + RecoveryTargets map[string]SessionKind // replicas that need recovery (nil = no recovery) +} + +// AssignmentResult records what the SenderGroup did in response to an assignment. +type AssignmentResult struct { + Added []string // new senders created + Removed []string // old senders stopped + SessionsCreated []string // fresh recovery sessions attached + SessionsSuperseded []string // existing sessions superseded by new ones + SessionsFailed []string // recovery sessions that couldn't be created +} + +// ApplyAssignment processes a coordinator assignment intent: +// 1. Reconcile endpoints — add/remove/update senders +// 2. For each recovery target, create a recovery session on the sender +// +// Epoch fencing: if intent.Epoch < sender.Epoch for any target, that target +// is rejected. Stale assignment intent cannot create live sessions. +func (sg *SenderGroup) ApplyAssignment(intent AssignmentIntent) AssignmentResult { + var result AssignmentResult + + // Step 1: reconcile topology. 
+	result.Added, result.Removed = sg.Reconcile(intent.Endpoints, intent.Epoch)
+
+	// Step 2: create recovery sessions for designated targets.
+	if intent.RecoveryTargets == nil {
+		return result
+	}
+
+	sg.mu.RLock()
+	defer sg.mu.RUnlock()
+	for replicaID, kind := range intent.RecoveryTargets {
+		sender, ok := sg.senders[replicaID]
+		if !ok {
+			result.SessionsFailed = append(result.SessionsFailed, replicaID)
+			continue
+		}
+		// Reject stale assignment: the intent's epoch must not be older than the sender's.
+		if intent.Epoch < sender.Epoch {
+			result.SessionsFailed = append(result.SessionsFailed, replicaID)
+			continue
+		}
+		_, err := sender.AttachSession(intent.Epoch, kind)
+		if err != nil {
+			// Session already active at current epoch — supersede it.
+			sess := sender.SupersedeSession(kind, "assignment_intent")
+			if sess != nil {
+				result.SessionsSuperseded = append(result.SessionsSuperseded, replicaID)
+			} else {
+				result.SessionsFailed = append(result.SessionsFailed, replicaID)
+			}
+			continue
+		}
+		result.SessionsCreated = append(result.SessionsCreated, replicaID)
+	}
+	return result
+}
diff --git a/sw-block/prototype/enginev2/execution_test.go b/sw-block/prototype/enginev2/execution_test.go
new file mode 100644
index 000000000..9f293c041
--- /dev/null
+++ b/sw-block/prototype/enginev2/execution_test.go
@@ -0,0 +1,420 @@
+package enginev2
+
+import "testing"
+
+// ============================================================
+// Phase 04 P1: Session execution and sender-group orchestration
+// ============================================================
+
+// --- Execution API: full lifecycle ---
+
+func TestExec_FullRecoveryLifecycle(t *testing.T) {
+	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+	sess, _ := s.AttachSession(1, SessionCatchUp)
+	id := sess.ID
+
+	// init → connecting
+	if err := s.BeginConnect(id); err != nil {
+		t.Fatalf("BeginConnect: %v", err)
+	}
+	if s.State != StateConnecting {
+		t.Fatalf("state=%s, want connecting", s.State)
+	}
+
+ // connecting → handshake + if err := s.RecordHandshake(id, 5, 20); err != nil { + t.Fatalf("RecordHandshake: %v", err) + } + if sess.StartLSN != 5 || sess.TargetLSN != 20 { + t.Fatalf("range: start=%d target=%d", sess.StartLSN, sess.TargetLSN) + } + + // handshake → catchup + if err := s.BeginCatchUp(id); err != nil { + t.Fatalf("BeginCatchUp: %v", err) + } + if s.State != StateCatchingUp { + t.Fatalf("state=%s, want catching_up", s.State) + } + + // progress + if err := s.RecordCatchUpProgress(id, 15); err != nil { + t.Fatalf("progress to 15: %v", err) + } + if err := s.RecordCatchUpProgress(id, 20); err != nil { + t.Fatalf("progress to 20: %v", err) + } + if !sess.Converged() { + t.Fatal("should be converged at 20/20") + } + + // complete + if !s.CompleteSessionByID(id) { + t.Fatal("completion should succeed") + } + if s.State != StateInSync { + t.Fatalf("state=%s, want in_sync", s.State) + } +} + +// --- Stale sessionID rejection across all execution APIs --- + +func TestExec_StaleID_AllAPIsReject(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess1, _ := s.AttachSession(1, SessionCatchUp) + oldID := sess1.ID + + // Supersede with new session. + s.UpdateEpoch(2) + sess2, _ := s.AttachSession(2, SessionCatchUp) + _ = sess2 + + // All APIs must reject oldID. 
+ if err := s.BeginConnect(oldID); err == nil { + t.Fatal("BeginConnect should reject stale ID") + } + if err := s.RecordHandshake(oldID, 0, 10); err == nil { + t.Fatal("RecordHandshake should reject stale ID") + } + if err := s.BeginCatchUp(oldID); err == nil { + t.Fatal("BeginCatchUp should reject stale ID") + } + if err := s.RecordCatchUpProgress(oldID, 5); err == nil { + t.Fatal("RecordCatchUpProgress should reject stale ID") + } + if s.CompleteSessionByID(oldID) { + t.Fatal("CompleteSessionByID should reject stale ID") + } +} + +// --- Phase ordering enforcement --- + +func TestExec_WrongPhaseOrder_Rejected(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + id := sess.ID + + // Skip connecting → go directly to handshake: rejected. + if err := s.RecordHandshake(id, 0, 10); err == nil { + t.Fatal("handshake from init should be rejected") + } + + // Skip to catch-up from init: rejected. + if err := s.BeginCatchUp(id); err == nil { + t.Fatal("catch-up from init should be rejected") + } + + // Progress from init: rejected (not in catch-up phase). + if err := s.RecordCatchUpProgress(id, 5); err == nil { + t.Fatal("progress from init should be rejected") + } + + // Correct path: init → connecting. + s.BeginConnect(id) + // Now try catch-up from connecting: rejected (must handshake first). + if err := s.BeginCatchUp(id); err == nil { + t.Fatal("catch-up from connecting should be rejected") + } +} + +// --- Progress regression rejection --- + +func TestExec_ProgressRegression_Rejected(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + id := sess.ID + + s.BeginConnect(id) + s.RecordHandshake(id, 0, 100) + s.BeginCatchUp(id) + + s.RecordCatchUpProgress(id, 50) + + // Regression: 30 < 50. 
+ if err := s.RecordCatchUpProgress(id, 30); err == nil { + t.Fatal("progress regression should be rejected") + } + + // Same value: 50 = 50. + if err := s.RecordCatchUpProgress(id, 50); err == nil { + t.Fatal("non-advancing progress should be rejected") + } + + // Advance: 60 > 50. + if err := s.RecordCatchUpProgress(id, 60); err != nil { + t.Fatalf("valid progress should succeed: %v", err) + } +} + +// --- Epoch bump during execution --- + +func TestExec_EpochBumpDuringExecution_InvalidatesAuthority(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + id := sess.ID + + s.BeginConnect(id) + s.RecordHandshake(id, 0, 100) + s.BeginCatchUp(id) + s.RecordCatchUpProgress(id, 50) + + // Epoch bumps mid-execution. + s.UpdateEpoch(2) + + // All further execution on old session rejected. + if err := s.RecordCatchUpProgress(id, 60); err == nil { + t.Fatal("progress after epoch bump should be rejected") + } + if s.CompleteSessionByID(id) { + t.Fatal("completion after epoch bump should be rejected") + } + + // Sender is disconnected, ready for new session. + if s.State != StateDisconnected { + t.Fatalf("state=%s, want disconnected", s.State) + } +} + +// --- Endpoint change during execution --- + +func TestExec_EndpointChangeDuringExecution_InvalidatesAuthority(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + id := sess.ID + + s.BeginConnect(id) + s.RecordHandshake(id, 0, 50) + s.BeginCatchUp(id) + + // Endpoint changes mid-execution. + s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", CtrlAddr: "r1:9445", Version: 2}) + + // All further execution rejected. 
+ if err := s.RecordCatchUpProgress(id, 10); err == nil { + t.Fatal("progress after endpoint change should be rejected") + } + if s.CompleteSessionByID(id) { + t.Fatal("completion after endpoint change should be rejected") + } +} + +// --- Completion authority enforcement --- + +func TestExec_CompletionRejected_FromInit(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + + if s.CompleteSessionByID(sess.ID) { + t.Fatal("completion from PhaseInit should be rejected") + } +} + +func TestExec_CompletionRejected_FromConnecting(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + + if s.CompleteSessionByID(sess.ID) { + t.Fatal("completion from PhaseConnecting should be rejected") + } +} + +func TestExec_CompletionRejected_FromHandshakeWithGap(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + s.RecordHandshake(sess.ID, 5, 20) // gap exists: 5 → 20 + + if s.CompleteSessionByID(sess.ID) { + t.Fatal("completion from PhaseHandshake with gap should be rejected") + } +} + +func TestExec_CompletionAllowed_FromHandshakeZeroGap(t *testing.T) { + // Fast path: handshake shows replica already at target (zero gap). 
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + s.RecordHandshake(sess.ID, 10, 10) // zero gap: start == target + + if !s.CompleteSessionByID(sess.ID) { + t.Fatal("completion from handshake with zero gap should be allowed") + } + if s.State != StateInSync { + t.Fatalf("state=%s, want in_sync", s.State) + } +} + +func TestExec_CompletionRejected_FromCatchUpNotConverged(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + s.RecordHandshake(sess.ID, 0, 100) + s.BeginCatchUp(sess.ID) + s.RecordCatchUpProgress(sess.ID, 50) // not converged (50 < 100) + + if s.CompleteSessionByID(sess.ID) { + t.Fatal("completion before convergence should be rejected") + } +} + +func TestExec_HandshakeInvalidRange_Rejected(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + + if err := s.RecordHandshake(sess.ID, 20, 5); err == nil { + t.Fatal("handshake with target < start should be rejected") + } +} + +// --- SenderGroup orchestration --- + +func TestOrch_RepeatedReconnectCycles_PreserveSenderIdentity(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, 1) + + s := sg.Sender("r1:9333") + original := s // save pointer + + // 5 reconnect cycles — sender identity preserved. 
+ for cycle := 0; cycle < 5; cycle++ { + sess, err := s.AttachSession(1, SessionCatchUp) + if err != nil { + t.Fatalf("cycle %d attach: %v", cycle, err) + } + s.BeginConnect(sess.ID) + s.RecordHandshake(sess.ID, 0, 10) + s.BeginCatchUp(sess.ID) + s.RecordCatchUpProgress(sess.ID, 10) + s.CompleteSessionByID(sess.ID) + + if s.State != StateInSync { + t.Fatalf("cycle %d: state=%s, want in_sync", cycle, s.State) + } + } + + // Same pointer — identity preserved. + if sg.Sender("r1:9333") != original { + t.Fatal("sender identity should be preserved across cycles") + } +} + +func TestOrch_EndpointUpdateSupersedesActiveSession(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, + }, 1) + + s := sg.Sender("r1:9333") + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + + // Endpoint update via reconcile — session invalidated. + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 2}, + }, 1) + + if sess.Active() { + t.Fatal("session should be invalidated by endpoint update") + } + // Sender preserved, session gone. + if sg.Sender("r1:9333") != s { + t.Fatal("sender identity should be preserved") + } + if s.Session() != nil { + t.Fatal("session should be nil after endpoint invalidation") + } +} + +func TestOrch_ReconcileMixedAddRemoveUpdate(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + "r3:9333": {DataAddr: "r3:9333", Version: 1}, + }, 1) + + r1 := sg.Sender("r1:9333") + r2 := sg.Sender("r2:9333") + + // Attach sessions to r1 and r2. + r1Sess, _ := r1.AttachSession(1, SessionCatchUp) + r2Sess, _ := r2.AttachSession(1, SessionCatchUp) + + // Reconcile: keep r1, remove r2, update r3, add r4. 
+	added, removed := sg.Reconcile(map[string]Endpoint{
+		"r1:9333": {DataAddr: "r1:9333", Version: 1}, // kept
+		"r3:9333": {DataAddr: "r3:9333", Version: 2}, // updated
+		"r4:9333": {DataAddr: "r4:9333", Version: 1}, // added
+	}, 1)
+
+	if len(added) != 1 || added[0] != "r4:9333" {
+		t.Fatalf("added=%v", added)
+	}
+	if len(removed) != 1 || removed[0] != "r2:9333" {
+		t.Fatalf("removed=%v", removed)
+	}
+
+	// r1: preserved with active session.
+	if sg.Sender("r1:9333") != r1 {
+		t.Fatal("r1 should be preserved")
+	}
+	if !r1Sess.Active() {
+		t.Fatal("r1 session should still be active")
+	}
+
+	// r2: stopped and removed.
+	if sg.Sender("r2:9333") != nil {
+		t.Fatal("r2 should be removed")
+	}
+	if !r2.Stopped() {
+		t.Fatal("r2 should be stopped")
+	}
+	if r2Sess.Active() {
+		t.Fatal("r2 session should be invalidated (sender stopped)")
+	}
+
+	// r4: new sender, no session.
+	if sg.Sender("r4:9333") == nil {
+		t.Fatal("r4 should exist")
+	}
+}
+
+func TestOrch_EpochBumpInvalidatesExecutingSessions(t *testing.T) {
+	sg := NewSenderGroup()
+	sg.Reconcile(map[string]Endpoint{
+		"r1:9333": {DataAddr: "r1:9333", Version: 1},
+		"r2:9333": {DataAddr: "r2:9333", Version: 1},
+	}, 1)
+
+	r1 := sg.Sender("r1:9333")
+	r2 := sg.Sender("r2:9333")
+
+	sess1, _ := r1.AttachSession(1, SessionCatchUp)
+	r1.BeginConnect(sess1.ID)
+	r1.RecordHandshake(sess1.ID, 0, 50)
+	r1.BeginCatchUp(sess1.ID)
+	r1.RecordCatchUpProgress(sess1.ID, 25) // mid-execution
+
+	sess2, _ := r2.AttachSession(1, SessionCatchUp)
+	r2.BeginConnect(sess2.ID)
+
+	// Epoch bump.
+	count := sg.InvalidateEpoch(2)
+	if count != 2 {
+		t.Fatalf("should invalidate 2 sessions, got %d", count)
+	}
+
+	// Both sessions dead.
+	if sess1.Active() || sess2.Active() {
+		t.Fatal("both sessions should be invalidated")
+	}
+
+	// r1's mid-execution progress cannot continue.
+ if err := r1.RecordCatchUpProgress(sess1.ID, 30); err == nil { + t.Fatal("progress on invalidated session should be rejected") + } +} diff --git a/sw-block/prototype/enginev2/go.mod b/sw-block/prototype/enginev2/go.mod new file mode 100644 index 000000000..058957b63 --- /dev/null +++ b/sw-block/prototype/enginev2/go.mod @@ -0,0 +1,3 @@ +module github.com/seaweedfs/seaweedfs/sw-block/prototype/enginev2 + +go 1.23.0 diff --git a/sw-block/prototype/enginev2/outcome.go b/sw-block/prototype/enginev2/outcome.go new file mode 100644 index 000000000..0f0573761 --- /dev/null +++ b/sw-block/prototype/enginev2/outcome.go @@ -0,0 +1,39 @@ +package enginev2 + +// HandshakeResult captures what the reconnect handshake reveals about a +// replica's state relative to the primary's lineage-safe boundary. +type HandshakeResult struct { + ReplicaFlushedLSN uint64 // highest LSN durably persisted on replica + CommittedLSN uint64 // lineage-safe recovery target (committed prefix) + RetentionStartLSN uint64 // oldest LSN still available in primary WAL +} + +// RecoveryOutcome classifies the gap between replica and primary. +type RecoveryOutcome string + +const ( + OutcomeZeroGap RecoveryOutcome = "zero_gap" // replica has full committed prefix + OutcomeCatchUp RecoveryOutcome = "catchup" // gap within WAL retention + OutcomeNeedsRebuild RecoveryOutcome = "needs_rebuild" // gap exceeds retention +) + +// ClassifyRecoveryOutcome determines the recovery path from handshake data. +// +// Uses CommittedLSN (not WAL head) as the target boundary. This is the +// lineage-safe recovery point — only acknowledged data counts. A replica +// with FlushedLSN > CommittedLSN has divergent/uncommitted tail that must +// NOT be treated as "already in sync." 
+//
+// Decision matrix (matches CP13-5 gap analysis):
+//   - ReplicaFlushedLSN >= CommittedLSN → zero gap, has full committed prefix
+//   - ReplicaFlushedLSN+1 >= RetentionStartLSN → recoverable via WAL catch-up
+//   - otherwise → gap too large, needs rebuild
+func ClassifyRecoveryOutcome(result HandshakeResult) RecoveryOutcome {
+	if result.ReplicaFlushedLSN >= result.CommittedLSN {
+		return OutcomeZeroGap
+	}
+	if result.RetentionStartLSN == 0 || result.ReplicaFlushedLSN+1 >= result.RetentionStartLSN {
+		return OutcomeCatchUp
+	}
+	return OutcomeNeedsRebuild
+}
diff --git a/sw-block/prototype/enginev2/p2_test.go b/sw-block/prototype/enginev2/p2_test.go
new file mode 100644
index 000000000..c1904a08d
--- /dev/null
+++ b/sw-block/prototype/enginev2/p2_test.go
@@ -0,0 +1,482 @@
+package enginev2
+
+import "testing"
+
+// ============================================================
+// Phase 04 P2: Outcome branching, assignment intent, end-to-end
+// ============================================================
+
+// --- Recovery outcome classification ---
+
+func TestOutcome_ZeroGap(t *testing.T) {
+	o := ClassifyRecoveryOutcome(HandshakeResult{
+		ReplicaFlushedLSN: 100,
+		CommittedLSN:      100,
+		RetentionStartLSN: 50,
+	})
+	if o != OutcomeZeroGap {
+		t.Fatalf("got %s, want zero_gap", o)
+	}
+}
+
+func TestOutcome_ZeroGap_ReplicaAheadOfCommitted(t *testing.T) {
+	// Replica flushed beyond the committed prefix: still zero gap.
+	// The uncommitted tail beyond CommittedLSN is handled by truncation,
+	// not by recovery classification.
+	o := ClassifyRecoveryOutcome(HandshakeResult{
+		ReplicaFlushedLSN: 120,
+		CommittedLSN:      100,
+		RetentionStartLSN: 50,
+	})
+	if o != OutcomeZeroGap {
+		t.Fatalf("got %s, want zero_gap", o)
+	}
+}
+
+func TestOutcome_CatchUp(t *testing.T) {
+	o := ClassifyRecoveryOutcome(HandshakeResult{
+		ReplicaFlushedLSN: 80,
+		CommittedLSN:      100,
+		RetentionStartLSN: 50,
+	})
+	if o != OutcomeCatchUp {
+		t.Fatalf("got %s, want catchup", o)
+	}
+}
+
+func TestOutcome_CatchUp_ExactBoundary(t *testing.T) {
+	// ReplicaFlushedLSN+1 == RetentionStartLSN → recoverable (just barely).
+	o := ClassifyRecoveryOutcome(HandshakeResult{
+		ReplicaFlushedLSN: 49,
+		CommittedLSN:      100,
+		RetentionStartLSN: 50,
+	})
+	if o != OutcomeCatchUp {
+		t.Fatalf("got %s, want catchup (exact boundary)", o)
+	}
+}
+
+func TestOutcome_NeedsRebuild(t *testing.T) {
+	o := ClassifyRecoveryOutcome(HandshakeResult{
+		ReplicaFlushedLSN: 10,
+		CommittedLSN:      100,
+		RetentionStartLSN: 50,
+	})
+	if o != OutcomeNeedsRebuild {
+		t.Fatalf("got %s, want needs_rebuild", o)
+	}
+}
+
+func TestOutcome_NeedsRebuild_OffByOne(t *testing.T) {
+	// ReplicaFlushedLSN+1 < RetentionStartLSN → unrecoverable.
+	o := ClassifyRecoveryOutcome(HandshakeResult{
+		ReplicaFlushedLSN: 48,
+		CommittedLSN:      100,
+		RetentionStartLSN: 50,
+	})
+	if o != OutcomeNeedsRebuild {
+		t.Fatalf("got %s, want needs_rebuild (off-by-one)", o)
+	}
+}
+
+// --- RecordHandshakeWithOutcome execution ---
+
+func TestExec_HandshakeOutcome_ZeroGap_FastComplete(t *testing.T) {
+	s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+	sess, _ := s.AttachSession(1, SessionCatchUp)
+	s.BeginConnect(sess.ID)
+
+	outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{
+		ReplicaFlushedLSN: 100,
+		CommittedLSN:      100,
+		RetentionStartLSN: 50,
+	})
+	if err != nil {
+		t.Fatal(err)
+	}
+	if outcome != OutcomeZeroGap {
+		t.Fatalf("outcome=%s, want zero_gap", outcome)
+	}
+
+	// Zero-gap: can complete directly from handshake phase.
+ if !s.CompleteSessionByID(sess.ID) { + t.Fatal("zero-gap fast completion should succeed") + } + if s.State != StateInSync { + t.Fatalf("state=%s, want in_sync", s.State) + } +} + +func TestExec_HandshakeOutcome_CatchUp_NormalPath(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + + outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ + ReplicaFlushedLSN: 80, + CommittedLSN: 100, + RetentionStartLSN: 50, + }) + if err != nil { + t.Fatal(err) + } + if outcome != OutcomeCatchUp { + t.Fatalf("outcome=%s, want catchup", outcome) + } + + // Must catch up before completing. + if s.CompleteSessionByID(sess.ID) { + t.Fatal("completion should be rejected before catch-up") + } + + s.BeginCatchUp(sess.ID) + s.RecordCatchUpProgress(sess.ID, 100) + if !s.CompleteSessionByID(sess.ID) { + t.Fatal("completion should succeed after convergence") + } +} + +func TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + s.BeginConnect(sess.ID) + + outcome, err := s.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ + ReplicaFlushedLSN: 10, + CommittedLSN: 100, + RetentionStartLSN: 50, + }) + if err != nil { + t.Fatal(err) + } + if outcome != OutcomeNeedsRebuild { + t.Fatalf("outcome=%s, want needs_rebuild", outcome) + } + + // Session invalidated, sender at NeedsRebuild. 
+ if sess.Active() { + t.Fatal("session should be invalidated") + } + if s.State != StateNeedsRebuild { + t.Fatalf("state=%s, want needs_rebuild", s.State) + } + if s.Session() != nil { + t.Fatal("session should be nil after NeedsRebuild") + } +} + +// --- Assignment-intent orchestration --- + +func TestAssignment_CreatesSessionsForTargets(t *testing.T) { + sg := NewSenderGroup() + + result := sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{ + "r1:9333": SessionCatchUp, + }, + }) + + if len(result.Added) != 2 { + t.Fatalf("added=%d, want 2", len(result.Added)) + } + if len(result.SessionsCreated) != 1 || result.SessionsCreated[0] != "r1:9333" { + t.Fatalf("sessions created=%v", result.SessionsCreated) + } + + // r1 has session, r2 does not. + r1 := sg.Sender("r1:9333") + if r1.Session() == nil { + t.Fatal("r1 should have a session") + } + r2 := sg.Sender("r2:9333") + if r2.Session() != nil { + t.Fatal("r2 should not have a session") + } +} + +func TestAssignment_SupersedesExistingSession(t *testing.T) { + sg := NewSenderGroup() + + // First assignment with catch-up session. + sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, + }) + oldSess := sg.Sender("r1:9333").Session() + + // Second assignment with rebuild session — supersedes. 
+ result := sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild}, + }) + newSess := sg.Sender("r1:9333").Session() + + if oldSess.Active() { + t.Fatal("old session should be invalidated") + } + if !newSess.Active() { + t.Fatal("new session should be active") + } + if newSess.Kind != SessionRebuild { + t.Fatalf("new session kind=%s, want rebuild", newSess.Kind) + } + if len(result.SessionsSuperseded) != 1 || result.SessionsSuperseded[0] != "r1:9333" { + t.Fatalf("superseded=%v, want [r1:9333]", result.SessionsSuperseded) + } +} + +func TestAssignment_FailsForUnknownReplica(t *testing.T) { + sg := NewSenderGroup() + + result := sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r99:9333": SessionCatchUp}, + }) + + if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r99:9333" { + t.Fatalf("sessions failed=%v, want [r99:9333]", result.SessionsFailed) + } +} + +func TestAssignment_StaleEpoch_Rejected(t *testing.T) { + sg := NewSenderGroup() + + // Epoch 2 assignment. + sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 2, + }) + + // Stale epoch 1 assignment with recovery — must be rejected. 
+ result := sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, + }) + + if len(result.SessionsFailed) != 1 || result.SessionsFailed[0] != "r1:9333" { + t.Fatalf("stale epoch should fail: failed=%v created=%v", result.SessionsFailed, result.SessionsCreated) + } + if sg.Sender("r1:9333").Session() != nil { + t.Fatal("stale intent must not create a session") + } +} + +// --- End-to-end prototype recovery flows --- + +func TestE2E_CatchUpRecovery_FullFlow(t *testing.T) { + sg := NewSenderGroup() + + // Step 1: Assignment creates replicas + recovery intent. + sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, + }) + + r1 := sg.Sender("r1:9333") + sess := r1.Session() + + // Step 2: Execute recovery. + r1.BeginConnect(sess.ID) + + outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ + ReplicaFlushedLSN: 80, + CommittedLSN: 100, + RetentionStartLSN: 50, + }) + if outcome != OutcomeCatchUp { + t.Fatalf("outcome=%s", outcome) + } + + r1.BeginCatchUp(sess.ID) + r1.RecordCatchUpProgress(sess.ID, 90) + r1.RecordCatchUpProgress(sess.ID, 100) // converged + + // Step 3: Complete. + if !r1.CompleteSessionByID(sess.ID) { + t.Fatal("completion should succeed") + } + + // Step 4: Verify final state. + if r1.State != StateInSync { + t.Fatalf("r1 state=%s, want in_sync", r1.State) + } + if r1.Session() != nil { + t.Fatal("session should be nil after completion") + } + + t.Logf("e2e catch-up: assignment → connect → handshake(catchup) → progress → complete → InSync") +} + +func TestE2E_NeedsRebuild_Escalation(t *testing.T) { + sg := NewSenderGroup() + + // Step 1: Assignment with catch-up intent. 
+ sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, + }) + + r1 := sg.Sender("r1:9333") + sess := r1.Session() + + // Step 2: Connect + handshake → unrecoverable gap. + r1.BeginConnect(sess.ID) + outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ + ReplicaFlushedLSN: 10, + CommittedLSN: 100, + RetentionStartLSN: 50, + }) + if outcome != OutcomeNeedsRebuild { + t.Fatalf("outcome=%s", outcome) + } + + // Step 3: Sender is at NeedsRebuild, session dead. + if r1.State != StateNeedsRebuild { + t.Fatalf("state=%s", r1.State) + } + + // Step 4: New assignment with rebuild intent. + sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionRebuild}, + }) + + rebuildSess := r1.Session() + if rebuildSess == nil || rebuildSess.Kind != SessionRebuild { + t.Fatal("should have rebuild session") + } + + // Step 5: Execute rebuild recovery (simulated). 
+ r1.BeginConnect(rebuildSess.ID) + r1.RecordHandshake(rebuildSess.ID, 0, 100) // full rebuild range + r1.BeginCatchUp(rebuildSess.ID) + r1.RecordCatchUpProgress(rebuildSess.ID, 100) + + if !r1.CompleteSessionByID(rebuildSess.ID) { + t.Fatal("rebuild completion should succeed") + } + if r1.State != StateInSync { + t.Fatalf("after rebuild: state=%s, want in_sync", r1.State) + } + + t.Logf("e2e rebuild: catch-up→NeedsRebuild→rebuild assignment→recover→InSync") +} + +func TestE2E_ZeroGap_FastPath(t *testing.T) { + sg := NewSenderGroup() + + sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, + }) + + r1 := sg.Sender("r1:9333") + sess := r1.Session() + + r1.BeginConnect(sess.ID) + outcome, _ := r1.RecordHandshakeWithOutcome(sess.ID, HandshakeResult{ + ReplicaFlushedLSN: 100, + CommittedLSN: 100, + RetentionStartLSN: 50, + }) + if outcome != OutcomeZeroGap { + t.Fatalf("outcome=%s", outcome) + } + + // Fast path: complete directly from handshake. + if !r1.CompleteSessionByID(sess.ID) { + t.Fatal("zero-gap fast completion should succeed") + } + if r1.State != StateInSync { + t.Fatalf("state=%s, want in_sync", r1.State) + } + + t.Logf("e2e zero-gap: assignment → connect → handshake(zero_gap) → complete → InSync") +} + +func TestE2E_EpochBump_MidRecovery_FullCycle(t *testing.T) { + sg := NewSenderGroup() + + // Epoch 1: start recovery. + sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 1, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, + }) + + r1 := sg.Sender("r1:9333") + sess1 := r1.Session() + r1.BeginConnect(sess1.ID) + + // Epoch bumps mid-recovery. + sg.InvalidateEpoch(2) + // Must also update sender epoch for the new assignment. + r1.UpdateEpoch(2) + + // Old session dead. 
+ if sess1.Active() { + t.Fatal("epoch-1 session should be invalidated") + } + + // Epoch 2: new assignment, new session. + sg.ApplyAssignment(AssignmentIntent{ + Endpoints: map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, + Epoch: 2, + RecoveryTargets: map[string]SessionKind{"r1:9333": SessionCatchUp}, + }) + + sess2 := r1.Session() + if sess2 == nil || sess2.Epoch != 2 { + t.Fatal("should have new session at epoch 2") + } + + // Complete at epoch 2. + r1.BeginConnect(sess2.ID) + r1.RecordHandshakeWithOutcome(sess2.ID, HandshakeResult{ + ReplicaFlushedLSN: 100, + CommittedLSN: 100, + RetentionStartLSN: 50, + }) + r1.CompleteSessionByID(sess2.ID) + + if r1.State != StateInSync { + t.Fatalf("state=%s", r1.State) + } + t.Logf("e2e epoch bump: epoch1 recovery → bump → epoch2 recovery → InSync") +} diff --git a/sw-block/prototype/enginev2/sender.go b/sw-block/prototype/enginev2/sender.go new file mode 100644 index 000000000..d9020222d --- /dev/null +++ b/sw-block/prototype/enginev2/sender.go @@ -0,0 +1,347 @@ +// Package enginev2 implements V2 per-replica sender/session ownership. +// +// Each replica has exactly one Sender that owns its identity (canonical address) +// and at most one active RecoverySession per epoch. The Sender survives topology +// changes; the session does not survive epoch bumps. +package enginev2 + +import ( + "fmt" + "sync" +) + +// ReplicaState tracks the per-replica replication state machine. +type ReplicaState string + +const ( + StateDisconnected ReplicaState = "disconnected" + StateConnecting ReplicaState = "connecting" + StateCatchingUp ReplicaState = "catching_up" + StateInSync ReplicaState = "in_sync" + StateDegraded ReplicaState = "degraded" + StateNeedsRebuild ReplicaState = "needs_rebuild" +) + +// Endpoint represents a replica's network identity. +type Endpoint struct { + DataAddr string + CtrlAddr string + Version uint64 // bumped on address change +} + +// Sender owns the replication channel to one replica. 
It is identified
+// by ReplicaID (canonical data address at creation time) and survives
+// topology changes as long as the replica stays in the set.
+//
+// A Sender holds at most one active RecoverySession. Normal in-sync
+// operation does not require a session — Ship/Barrier work directly.
+type Sender struct {
+	mu sync.Mutex
+
+	ReplicaID string   // canonical identity — stable across reconnects
+	Endpoint  Endpoint // current network address (may change via UpdateEndpoint)
+	Epoch     uint64   // current epoch
+	State     ReplicaState
+
+	session *RecoverySession // nil when in-sync or disconnected without recovery
+	stopped bool
+}
+
+// NewSender creates a sender for a replica at the given endpoint and epoch.
+func NewSender(replicaID string, endpoint Endpoint, epoch uint64) *Sender {
+	return &Sender{
+		ReplicaID: replicaID,
+		Endpoint:  endpoint,
+		Epoch:     epoch,
+		State:     StateDisconnected,
+	}
+}
+
+// UpdateEpoch updates the sender's epoch. If a recovery session is active
+// at a stale epoch, it is invalidated.
+func (s *Sender) UpdateEpoch(epoch uint64) {
+	s.mu.Lock()
+	defer s.mu.Unlock()
+	if s.stopped || epoch <= s.Epoch {
+		return
+	}
+	oldEpoch := s.Epoch
+	s.Epoch = epoch
+	if s.session != nil && s.session.Epoch < epoch {
+		s.session.invalidate(fmt.Sprintf("epoch_advanced_%d_to_%d", oldEpoch, epoch))
+		s.session = nil
+		s.State = StateDisconnected
+	}
+}
+
+// UpdateEndpoint updates the sender's target address after a control-plane
+// assignment refresh. If a recovery session is active and the endpoint changed
+// (data address, ctrl address, or version), the session is invalidated (the
+// new endpoint needs a fresh session).
+func (s *Sender) UpdateEndpoint(ep Endpoint) { + s.mu.Lock() + defer s.mu.Unlock() + if s.stopped { + return + } + addrChanged := s.Endpoint.DataAddr != ep.DataAddr || s.Endpoint.CtrlAddr != ep.CtrlAddr || s.Endpoint.Version != ep.Version + s.Endpoint = ep + if addrChanged && s.session != nil { + s.session.invalidate("endpoint_changed") + s.session = nil + s.State = StateDisconnected + } +} + +// AttachSession creates and attaches a new recovery session for this sender. +// The session epoch must match the sender's current epoch — stale or future +// epoch sessions are rejected. Returns an error if a session is already active, +// the sender is stopped, or the epoch doesn't match. +func (s *Sender) AttachSession(epoch uint64, kind SessionKind) (*RecoverySession, error) { + s.mu.Lock() + defer s.mu.Unlock() + if s.stopped { + return nil, fmt.Errorf("sender stopped") + } + if epoch != s.Epoch { + return nil, fmt.Errorf("epoch mismatch: sender=%d session=%d", s.Epoch, epoch) + } + if s.session != nil && s.session.Active() { + return nil, fmt.Errorf("session already active (epoch=%d kind=%s)", s.session.Epoch, s.session.Kind) + } + sess := newRecoverySession(s.ReplicaID, epoch, kind) + s.session = sess + // Ownership established but execution not started. + // BeginConnect() is the first execution-state transition. + return sess, nil +} + +// SupersedeSession invalidates the current session (if any) and attaches +// a new one at the sender's current epoch. Used when an assignment change +// requires a fresh recovery path. The old session is invalidated with the +// given reason. Always uses s.Epoch — does not accept an epoch parameter +// to prevent epoch coherence drift. +// +// Establishes ownership only — does not mutate sender state. +// BeginConnect() starts execution. 
+func (s *Sender) SupersedeSession(kind SessionKind, reason string) *RecoverySession { + s.mu.Lock() + defer s.mu.Unlock() + if s.stopped { + return nil + } + if s.session != nil { + s.session.invalidate(reason) + } + sess := newRecoverySession(s.ReplicaID, s.Epoch, kind) + s.session = sess + return sess +} + +// Session returns the current recovery session, or nil if none. +func (s *Sender) Session() *RecoverySession { + s.mu.Lock() + defer s.mu.Unlock() + return s.session +} + +// CompleteSessionByID marks the session as completed and transitions the +// sender to InSync. Requires: +// - sessionID matches the current active session +// - session is in PhaseCatchUp and has Converged (normal path) +// - OR session is in PhaseHandshake and gap is zero (fast path: already in sync) +// +// Returns false if any check fails (stale ID, wrong phase, not converged). +func (s *Sender) CompleteSessionByID(sessionID uint64) bool { + s.mu.Lock() + defer s.mu.Unlock() + if err := s.checkSessionAuthority(sessionID); err != nil { + return false + } + sess := s.session + switch sess.Phase { + case PhaseCatchUp: + if !sess.Converged() { + return false // not converged yet + } + case PhaseHandshake: + if sess.TargetLSN != sess.StartLSN { + return false // has a gap — must catch up first + } + // Zero-gap fast path: handshake showed replica already at target. + default: + return false // not at a completion-ready phase + } + sess.complete() + s.session = nil + s.State = StateInSync + return true +} + +// === Execution APIs — sender-owned authority gate === +// +// All execution APIs validate the sessionID against the current active session. +// This prevents stale results from old/superseded sessions from mutating state. +// The sender is the authority boundary, not the session object. + +// BeginConnect transitions the session from init to connecting. +// Mutates: session.Phase → PhaseConnecting. Sender.State → StateConnecting. 
+// Rejects: wrong sessionID, stopped sender, session not in PhaseInit. +func (s *Sender) BeginConnect(sessionID uint64) error { + s.mu.Lock() + defer s.mu.Unlock() + if err := s.checkSessionAuthority(sessionID); err != nil { + return err + } + if !s.session.Advance(PhaseConnecting) { + return fmt.Errorf("cannot begin connect: session phase=%s", s.session.Phase) + } + s.State = StateConnecting + return nil +} + +// RecordHandshake records a successful handshake result and sets the catch-up range. +// Mutates: session.Phase → PhaseHandshake, session.StartLSN/TargetLSN. +// Rejects: wrong sessionID, wrong phase, invalid range. +func (s *Sender) RecordHandshake(sessionID uint64, startLSN, targetLSN uint64) error { + s.mu.Lock() + defer s.mu.Unlock() + if err := s.checkSessionAuthority(sessionID); err != nil { + return err + } + if targetLSN < startLSN { + return fmt.Errorf("invalid handshake range: target=%d < start=%d", targetLSN, startLSN) + } + if !s.session.Advance(PhaseHandshake) { + return fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase) + } + s.session.SetRange(startLSN, targetLSN) + return nil +} + +// RecordHandshakeWithOutcome records the handshake AND classifies the recovery +// outcome. This is the preferred handshake API — it determines the recovery +// path in one step: +// - OutcomeZeroGap: sets zero range, ready for fast completion +// - OutcomeCatchUp: sets catch-up range, ready for BeginCatchUp +// - OutcomeNeedsRebuild: invalidates session, transitions sender to NeedsRebuild +// +// Returns the outcome. On NeedsRebuild, the session is dead and the caller +// should not attempt further execution. 
+func (s *Sender) RecordHandshakeWithOutcome(sessionID uint64, result HandshakeResult) (RecoveryOutcome, error) { + outcome := ClassifyRecoveryOutcome(result) + + s.mu.Lock() + defer s.mu.Unlock() + if err := s.checkSessionAuthority(sessionID); err != nil { + return outcome, err + } + // Must be in PhaseConnecting — require valid execution entry point. + if s.session.Phase != PhaseConnecting { + return outcome, fmt.Errorf("handshake requires PhaseConnecting, got %s", s.session.Phase) + } + + if outcome == OutcomeNeedsRebuild { + s.session.invalidate("gap_exceeds_retention") + s.session = nil + s.State = StateNeedsRebuild + return outcome, nil + } + + if !s.session.Advance(PhaseHandshake) { + return outcome, fmt.Errorf("cannot record handshake: session phase=%s", s.session.Phase) + } + + switch outcome { + case OutcomeZeroGap: + s.session.SetRange(result.ReplicaFlushedLSN, result.ReplicaFlushedLSN) + case OutcomeCatchUp: + s.session.SetRange(result.ReplicaFlushedLSN, result.CommittedLSN) + } + return outcome, nil +} + +// BeginCatchUp transitions the session from handshake to catch-up phase. +// Mutates: session.Phase → PhaseCatchUp. Sender.State → StateCatchingUp. +// Rejects: wrong sessionID, wrong phase. +func (s *Sender) BeginCatchUp(sessionID uint64) error { + s.mu.Lock() + defer s.mu.Unlock() + if err := s.checkSessionAuthority(sessionID); err != nil { + return err + } + if !s.session.Advance(PhaseCatchUp) { + return fmt.Errorf("cannot begin catch-up: session phase=%s", s.session.Phase) + } + s.State = StateCatchingUp + return nil +} + +// RecordCatchUpProgress records catch-up progress (highest LSN recovered). +// Mutates: session.RecoveredTo (monotonic only). +// Rejects: wrong sessionID, wrong phase, progress regression, invalidated session. 
+func (s *Sender) RecordCatchUpProgress(sessionID uint64, recoveredTo uint64) error { + s.mu.Lock() + defer s.mu.Unlock() + if err := s.checkSessionAuthority(sessionID); err != nil { + return err + } + if s.session.Phase != PhaseCatchUp { + return fmt.Errorf("cannot record progress: session phase=%s, want catchup", s.session.Phase) + } + if recoveredTo <= s.session.RecoveredTo { + return fmt.Errorf("progress regression: current=%d proposed=%d", s.session.RecoveredTo, recoveredTo) + } + s.session.UpdateProgress(recoveredTo) + return nil +} + +// checkSessionAuthority validates that the sender has an active session +// matching the given ID. Must be called with s.mu held. +func (s *Sender) checkSessionAuthority(sessionID uint64) error { + if s.stopped { + return fmt.Errorf("sender stopped") + } + if s.session == nil { + return fmt.Errorf("no active session") + } + if s.session.ID != sessionID { + return fmt.Errorf("session ID mismatch: active=%d requested=%d", s.session.ID, sessionID) + } + if !s.session.Active() { + return fmt.Errorf("session %d is no longer active (phase=%s)", sessionID, s.session.Phase) + } + return nil +} + +// InvalidateSession invalidates the current session with a reason. +// Transitions the sender to the given target state. +func (s *Sender) InvalidateSession(reason string, targetState ReplicaState) { + s.mu.Lock() + defer s.mu.Unlock() + if s.session != nil { + s.session.invalidate(reason) + s.session = nil + } + s.State = targetState +} + +// Stop shuts down the sender and any active session. +func (s *Sender) Stop() { + s.mu.Lock() + defer s.mu.Unlock() + if s.stopped { + return + } + s.stopped = true + if s.session != nil { + s.session.invalidate("sender_stopped") + s.session = nil + } +} + +// Stopped returns true if the sender has been stopped. 
+func (s *Sender) Stopped() bool { + s.mu.Lock() + defer s.mu.Unlock() + return s.stopped +} diff --git a/sw-block/prototype/enginev2/sender_group.go b/sw-block/prototype/enginev2/sender_group.go new file mode 100644 index 000000000..31c21919c --- /dev/null +++ b/sw-block/prototype/enginev2/sender_group.go @@ -0,0 +1,119 @@ +package enginev2 + +import ( + "sort" + "sync" +) + +// SenderGroup manages per-replica Senders with identity-preserving reconciliation. +// It is the V2 equivalent of ShipperGroup. +type SenderGroup struct { + mu sync.RWMutex + senders map[string]*Sender // keyed by ReplicaID +} + +// NewSenderGroup creates an empty SenderGroup. +func NewSenderGroup() *SenderGroup { + return &SenderGroup{ + senders: map[string]*Sender{}, + } +} + +// Reconcile diffs the current sender set against newEndpoints. +// Matching senders (same ReplicaID) are preserved with all state. +// Removed senders are stopped. New senders are created at the given epoch. +// Returns lists of added and removed ReplicaIDs. +func (sg *SenderGroup) Reconcile(newEndpoints map[string]Endpoint, epoch uint64) (added, removed []string) { + sg.mu.Lock() + defer sg.mu.Unlock() + + // Stop and remove senders not in the new set. + for id, s := range sg.senders { + if _, keep := newEndpoints[id]; !keep { + s.Stop() + delete(sg.senders, id) + removed = append(removed, id) + } + } + + // Add new senders; update endpoints and epoch for existing. + for id, ep := range newEndpoints { + if existing, ok := sg.senders[id]; ok { + existing.UpdateEndpoint(ep) + existing.UpdateEpoch(epoch) + } else { + sg.senders[id] = NewSender(id, ep, epoch) + added = append(added, id) + } + } + + sort.Strings(added) + sort.Strings(removed) + return added, removed +} + +// Sender returns the sender for a ReplicaID, or nil. 
+func (sg *SenderGroup) Sender(replicaID string) *Sender { + sg.mu.RLock() + defer sg.mu.RUnlock() + return sg.senders[replicaID] +} + +// All returns all senders in deterministic order (sorted by ReplicaID). +func (sg *SenderGroup) All() []*Sender { + sg.mu.RLock() + defer sg.mu.RUnlock() + out := make([]*Sender, 0, len(sg.senders)) + for _, s := range sg.senders { + out = append(out, s) + } + sort.Slice(out, func(i, j int) bool { + return out[i].ReplicaID < out[j].ReplicaID + }) + return out +} + +// Len returns the number of senders. +func (sg *SenderGroup) Len() int { + sg.mu.RLock() + defer sg.mu.RUnlock() + return len(sg.senders) +} + +// StopAll stops all senders. +func (sg *SenderGroup) StopAll() { + sg.mu.Lock() + defer sg.mu.Unlock() + for _, s := range sg.senders { + s.Stop() + } +} + +// InSyncCount returns the number of senders in StateInSync. +func (sg *SenderGroup) InSyncCount() int { + sg.mu.RLock() + defer sg.mu.RUnlock() + count := 0 + for _, s := range sg.senders { + if s.State == StateInSync { + count++ + } + } + return count +} + +// InvalidateEpoch invalidates all active sessions that are bound to +// a stale epoch. Called after promotion/epoch bump. 
+func (sg *SenderGroup) InvalidateEpoch(currentEpoch uint64) int { + sg.mu.RLock() + defer sg.mu.RUnlock() + count := 0 + for _, s := range sg.senders { + sess := s.Session() + if sess != nil && sess.Epoch < currentEpoch && sess.Active() { + s.InvalidateSession("epoch_bump", StateDisconnected) + count++ + } + } + return count +} diff --git a/sw-block/prototype/enginev2/sender_group_test.go b/sw-block/prototype/enginev2/sender_group_test.go new file mode 100644 index 000000000..fcd1d5cec --- /dev/null +++ b/sw-block/prototype/enginev2/sender_group_test.go @@ -0,0 +1,203 @@ +package enginev2 + +import "testing" + +// === SenderGroup reconciliation === + +func TestSenderGroup_Reconcile_AddNew(t *testing.T) { + sg := NewSenderGroup() + + eps := map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + } + added, removed := sg.Reconcile(eps, 1) + + if len(added) != 2 || len(removed) != 0 { + t.Fatalf("added=%v removed=%v", added, removed) + } + if sg.Len() != 2 { + t.Fatalf("len: got %d, want 2", sg.Len()) + } +} + +func TestSenderGroup_Reconcile_RemoveStale(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + }, 1) + + // Remove r2, keep r1. + _, removed := sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, 1) + + if len(removed) != 1 || removed[0] != "r2:9333" { + t.Fatalf("removed=%v, want [r2:9333]", removed) + } + if sg.Sender("r2:9333") != nil { + t.Fatal("r2 should be removed") + } + if sg.Sender("r1:9333") == nil { + t.Fatal("r1 should be preserved") + } +} + +func TestSenderGroup_Reconcile_PreservesState(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, 1) + + // Attach session and advance. 
+ s := sg.Sender("r1:9333") + sess, _ := s.AttachSession(1, SessionCatchUp) + sess.SetRange(0, 100) + sess.UpdateProgress(50) + + // Reconcile with same address — sender preserved. + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, 1) + + s2 := sg.Sender("r1:9333") + if s2 != s { + t.Fatal("reconcile should preserve the same sender object") + } + if s2.Session() != sess { + t.Fatal("reconcile should preserve the session") + } + if !sess.Active() { + t.Fatal("session should still be active after same-address reconcile") + } +} + +func TestSenderGroup_Reconcile_MixedUpdate(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + }, 1) + + // Keep r1, remove r2, add r3. + added, removed := sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r3:9333": {DataAddr: "r3:9333", Version: 1}, + }, 1) + + if len(added) != 1 || added[0] != "r3:9333" { + t.Fatalf("added=%v, want [r3:9333]", added) + } + if len(removed) != 1 || removed[0] != "r2:9333" { + t.Fatalf("removed=%v, want [r2:9333]", removed) + } + if sg.Len() != 2 { + t.Fatalf("len=%d, want 2", sg.Len()) + } +} + +func TestSenderGroup_Reconcile_EndpointChange_InvalidatesSession(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, 1) + + s := sg.Sender("r1:9333") + sess, _ := s.AttachSession(1, SessionCatchUp) + + // Same ReplicaID but new endpoint version. 
+ sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 2}, + }, 1) + + if sess.Active() { + t.Fatal("endpoint version change should invalidate session") + } + if s.Session() != nil { + t.Fatal("session should be nil after endpoint change") + } +} + +// === Epoch invalidation === + +func TestSenderGroup_InvalidateEpoch(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + }, 1) + + // Both have sessions at epoch 1. + s1 := sg.Sender("r1:9333") + s2 := sg.Sender("r2:9333") + sess1, _ := s1.AttachSession(1, SessionCatchUp) + sess2, _ := s2.AttachSession(1, SessionCatchUp) + + // Epoch bumps to 2. Both sessions stale. + count := sg.InvalidateEpoch(2) + if count != 2 { + t.Fatalf("should invalidate 2 sessions, got %d", count) + } + if sess1.Active() || sess2.Active() { + t.Fatal("both sessions should be invalidated") + } + if s1.State != StateDisconnected || s2.State != StateDisconnected { + t.Fatal("senders should be disconnected after epoch invalidation") + } +} + +func TestSenderGroup_InvalidateEpoch_SkipsCurrentEpoch(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + }, 2) + + s := sg.Sender("r1:9333") + sess, _ := s.AttachSession(2, SessionCatchUp) // epoch 2 session + + // Invalidate epoch 2 — session AT epoch 2 should NOT be invalidated. 
+ count := sg.InvalidateEpoch(2) + if count != 0 { + t.Fatalf("should not invalidate current-epoch session, got %d", count) + } + if !sess.Active() { + t.Fatal("current-epoch session should remain active") + } +} + +func TestSenderGroup_StopAll(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + }, 1) + + sg.StopAll() + + for _, s := range sg.All() { + if !s.Stopped() { + t.Fatalf("%s should be stopped", s.ReplicaID) + } + } +} + +func TestSenderGroup_All_DeterministicOrder(t *testing.T) { + sg := NewSenderGroup() + sg.Reconcile(map[string]Endpoint{ + "r3:9333": {DataAddr: "r3:9333", Version: 1}, + "r1:9333": {DataAddr: "r1:9333", Version: 1}, + "r2:9333": {DataAddr: "r2:9333", Version: 1}, + }, 1) + + all := sg.All() + if len(all) != 3 { + t.Fatalf("len=%d, want 3", len(all)) + } + expected := []string{"r1:9333", "r2:9333", "r3:9333"} + for i, exp := range expected { + if all[i].ReplicaID != exp { + t.Fatalf("all[%d]=%s, want %s", i, all[i].ReplicaID, exp) + } + } +} diff --git a/sw-block/prototype/enginev2/sender_test.go b/sw-block/prototype/enginev2/sender_test.go new file mode 100644 index 000000000..db59138ba --- /dev/null +++ b/sw-block/prototype/enginev2/sender_test.go @@ -0,0 +1,407 @@ +package enginev2 + +import "testing" + +// === Sender lifecycle === + +func TestSender_NewSender_Disconnected(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1) + if s.State != StateDisconnected { + t.Fatalf("new sender should be Disconnected, got %s", s.State) + } + if s.Session() != nil { + t.Fatal("new sender should have no session") + } +} + +func TestSender_AttachSession_Success(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + + sess, err := s.AttachSession(1, SessionCatchUp) + if err != nil { + t.Fatal(err) + } + if sess.Kind != SessionCatchUp { + 
t.Fatalf("session kind: got %s, want catchup", sess.Kind)
+ }
+ if sess.Epoch != 1 {
+ t.Fatalf("session epoch: got %d, want 1", sess.Epoch)
+ }
+ if !sess.Active() {
+ t.Fatal("session should be active")
+ }
+ // AttachSession is ownership-only — sender stays Disconnected until BeginConnect.
+ if s.State != StateDisconnected {
+ t.Fatalf("sender state after attach: got %s, want disconnected (ownership-only)", s.State)
+ }
+}
+
+func TestSender_AttachSession_RejectsDouble(t *testing.T) {
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+
+ _, err := s.AttachSession(1, SessionCatchUp)
+ if err != nil {
+ t.Fatal(err)
+ }
+ _, err = s.AttachSession(1, SessionBootstrap)
+ if err == nil {
+ t.Fatal("should reject second attach while session active")
+ }
+}
+
+func TestSender_CompleteSession_TransitionsInSync(t *testing.T) {
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+
+ sess, _ := s.AttachSession(1, SessionCatchUp)
+ // Must execute full lifecycle before completing.
+ s.BeginConnect(sess.ID)
+ s.RecordHandshake(sess.ID, 5, 10)
+ s.BeginCatchUp(sess.ID)
+ s.RecordCatchUpProgress(sess.ID, 10) // converged
+
+ if !s.CompleteSessionByID(sess.ID) {
+ t.Fatal("completion should succeed when converged")
+ }
+ if s.State != StateInSync {
+ t.Fatalf("after complete: got %s, want in_sync", s.State)
+ }
+ if s.Session() != nil {
+ t.Fatal("session should be nil after complete")
+ }
+ if sess.Active() {
+ t.Fatal("completed session should not be active")
+ }
+}
+
+func TestSender_SupersedeSession(t *testing.T) {
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+
+ old, _ := s.AttachSession(1, SessionCatchUp)
+ s.UpdateEpoch(2) // epoch bumps — old session invalidated by UpdateEpoch
+ newSess := s.SupersedeSession(SessionReassign, "explicit_supersede")
+
+ if old.Active() {
+ t.Fatal("old session should be invalidated")
+ }
+ // Invalidated by UpdateEpoch, not by SupersedeSession (already dead).
+ if old.InvalidateReason == "" {
+ t.Fatal("old session should have invalidation reason")
+ }
+ if !newSess.Active() {
+ t.Fatal("new session should be active")
+ }
+ if newSess.Epoch != 2 {
+ t.Fatalf("new session epoch: got %d, want 2", newSess.Epoch)
+ }
+}
+
+func TestSender_UpdateEndpoint_InvalidatesSession(t *testing.T) {
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+
+ sess, _ := s.AttachSession(1, SessionCatchUp)
+ s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2})
+
+ if sess.Active() {
+ t.Fatal("session should be invalidated after endpoint change")
+ }
+ if sess.InvalidateReason != "endpoint_changed" {
+ t.Fatalf("invalidation reason: got %q", sess.InvalidateReason)
+ }
+ if s.State != StateDisconnected {
+ t.Fatalf("sender should be disconnected after endpoint change, got %s", s.State)
+ }
+ if s.Session() != nil {
+ t.Fatal("session should be nil after endpoint change")
+ }
+}
+
+func TestSender_UpdateEndpoint_SameAddr_PreservesSession(t *testing.T) {
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+
+ sess, _ := s.AttachSession(1, SessionCatchUp)
+ s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", Version: 1})
+
+ if !sess.Active() {
+ t.Fatal("same-address update should preserve session")
+ }
+}
+
+func TestSender_UpdateEndpoint_CtrlAddrOnly_InvalidatesSession(t *testing.T) {
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9334", Version: 1}, 1)
+
+ sess, _ := s.AttachSession(1, SessionCatchUp)
+ s.UpdateEndpoint(Endpoint{DataAddr: "r1:9333", CtrlAddr: "r1:9444", Version: 1})
+
+ if sess.Active() {
+ t.Fatal("CtrlAddr-only change should invalidate session")
+ }
+ if s.State != StateDisconnected {
+ t.Fatalf("sender should be disconnected, got %s", s.State)
+ }
+}
+
+func TestSender_Stop_InvalidatesSession(t *testing.T) {
+ s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1)
+
+ sess, _ := s.AttachSession(1, SessionCatchUp)
+ s.Stop()
+
+ if sess.Active() {
t.Fatal("session should be invalidated after stop") + } + if !s.Stopped() { + t.Fatal("sender should be stopped") + } + + // Attach after stop fails. + _, err := s.AttachSession(1, SessionBootstrap) + if err == nil { + t.Fatal("attach after stop should fail") + } +} + +func TestSender_InvalidateSession_TargetState(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + + sess, _ := s.AttachSession(1, SessionCatchUp) + s.InvalidateSession("timeout", StateNeedsRebuild) + + if sess.Active() { + t.Fatal("session should be invalidated") + } + if s.State != StateNeedsRebuild { + t.Fatalf("sender state: got %s, want needs_rebuild", s.State) + } +} + +// === Session lifecycle === + +func TestSession_Advance_ValidTransitions(t *testing.T) { + sess := newRecoverySession("r1", 1, SessionCatchUp) + + if !sess.Advance(PhaseConnecting) { + t.Fatal("init → connecting should succeed") + } + if !sess.Advance(PhaseHandshake) { + t.Fatal("connecting → handshake should succeed") + } + if !sess.Advance(PhaseCatchUp) { + t.Fatal("handshake → catchup should succeed") + } + if !sess.Advance(PhaseCompleted) { + t.Fatal("catchup → completed should succeed") + } +} + +func TestSession_Advance_RejectsInvalidJump(t *testing.T) { + sess := newRecoverySession("r1", 1, SessionCatchUp) + + // init → catchup is not valid (must go through connecting, handshake) + if sess.Advance(PhaseCatchUp) { + t.Fatal("init → catchup should be rejected") + } + // init → completed is not valid + if sess.Advance(PhaseCompleted) { + t.Fatal("init → completed should be rejected") + } +} + +func TestSession_Advance_StopsOnInvalidate(t *testing.T) { + sess := newRecoverySession("r1", 1, SessionCatchUp) + sess.Advance(PhaseConnecting) + sess.Advance(PhaseHandshake) + sess.invalidate("test") + + if sess.Advance(PhaseCatchUp) { + t.Fatal("advance after invalidate should fail") + } +} + +func TestSender_AttachSession_RejectsEpochMismatch(t *testing.T) { + s := NewSender("r1:9333", 
Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + + _, err := s.AttachSession(2, SessionCatchUp) + if err == nil { + t.Fatal("should reject session at epoch 2 when sender is at epoch 1") + } +} + +func TestSender_UpdateEpoch_InvalidatesStaleSession(t *testing.T) { + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + sess, _ := s.AttachSession(1, SessionCatchUp) + + s.UpdateEpoch(2) + + if sess.Active() { + t.Fatal("session at epoch 1 should be invalidated after UpdateEpoch(2)") + } + if s.Epoch != 2 { + t.Fatalf("sender epoch should be 2, got %d", s.Epoch) + } + if s.State != StateDisconnected { + t.Fatalf("sender should be disconnected after epoch bump, got %s", s.State) + } + + // Can now attach at epoch 2. + sess2, err := s.AttachSession(2, SessionCatchUp) + if err != nil { + t.Fatalf("attach at new epoch should succeed: %v", err) + } + if sess2.Epoch != 2 { + t.Fatalf("new session epoch: got %d, want 2", sess2.Epoch) + } +} + +func TestSession_Progress_StopsOnComplete(t *testing.T) { + sess := newRecoverySession("r1", 1, SessionCatchUp) + sess.SetRange(0, 100) + + sess.UpdateProgress(50) + if sess.Converged() { + t.Fatal("should not converge at 50/100") + } + + sess.complete() + + if sess.UpdateProgress(100) { + t.Fatal("update after complete should return false") + } +} + +func TestSession_Converged(t *testing.T) { + sess := newRecoverySession("r1", 1, SessionCatchUp) + sess.SetRange(0, 10) + + sess.UpdateProgress(9) + if sess.Converged() { + t.Fatal("9 < 10: not converged") + } + + sess.UpdateProgress(10) + if !sess.Converged() { + t.Fatal("10 >= 10: should be converged") + } +} + +// === Bridge tests: ownership invariants matching distsim scenarios === + +func TestBridge_StaleCompletion_AfterSupersede_HasNoEffect(t *testing.T) { + // Matches distsim TestP04a_StaleCompletion_AfterSupersede_Rejected. + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + + // First session. 
+ sess1, _ := s.AttachSession(1, SessionCatchUp) + sess1.Advance(PhaseConnecting) + sess1.Advance(PhaseHandshake) + sess1.Advance(PhaseCatchUp) + + // Supersede with new session. + s.UpdateEpoch(2) + sess2, _ := s.AttachSession(2, SessionCatchUp) + + // Old session: advance/complete has no effect (already invalidated). + if sess1.Advance(PhaseCompleted) { + t.Fatal("stale session should not advance to completed") + } + if sess1.Active() { + t.Fatal("old session should be inactive") + } + + // New session: still active and owns the sender. + if !sess2.Active() { + t.Fatal("new session should be active") + } + if s.Session() != sess2 { + t.Fatal("sender should own the new session") + } + + // Stale completion by OLD session ID — REJECTED by identity check. + if s.CompleteSessionByID(sess1.ID) { + t.Fatal("stale completion with old session ID must be rejected") + } + // Sender must NOT have moved to InSync. + if s.State == StateInSync { + t.Fatal("sender must not be InSync after stale completion") + } + // New session must still be active. + if !sess2.Active() { + t.Fatal("new session must still be active after stale completion rejected") + } + + // Correct completion by NEW session ID — requires full execution path. + s.BeginConnect(sess2.ID) + s.RecordHandshake(sess2.ID, 0, 10) + s.BeginCatchUp(sess2.ID) + s.RecordCatchUpProgress(sess2.ID, 10) + if !s.CompleteSessionByID(sess2.ID) { + t.Fatal("completion with correct session ID should succeed after convergence") + } + if s.State != StateInSync { + t.Fatalf("sender should be InSync after correct completion, got %s", s.State) + } +} + +func TestBridge_EpochBump_RejectedCompletion(t *testing.T) { + // Matches distsim TestP04a_EpochBumpDuringCatchup_InvalidatesSession. + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + + sess, _ := s.AttachSession(1, SessionCatchUp) + sess.Advance(PhaseConnecting) + + // Epoch bumps — session invalidated. 
+ s.UpdateEpoch(2) + + // Attempting to advance the old session fails. + if sess.Advance(PhaseHandshake) { + t.Fatal("stale session should not advance after epoch bump") + } + + // Attempting to attach at old epoch fails. + _, err := s.AttachSession(1, SessionCatchUp) + if err == nil { + t.Fatal("attach at stale epoch should fail") + } + + // Attach at new epoch succeeds. + sess2, err := s.AttachSession(2, SessionCatchUp) + if err != nil { + t.Fatalf("attach at new epoch should succeed: %v", err) + } + if sess2.Epoch != 2 { + t.Fatalf("new session epoch=%d, want 2", sess2.Epoch) + } +} + +func TestBridge_EndpointChange_InvalidatesAndAllowsNewSession(t *testing.T) { + // Matches distsim TestP04a_EndpointChangeDuringCatchup_InvalidatesSession. + s := NewSender("r1:9333", Endpoint{DataAddr: "r1:9333", Version: 1}, 1) + + sess, _ := s.AttachSession(1, SessionCatchUp) + + // Endpoint changes. + s.UpdateEndpoint(Endpoint{DataAddr: "r1:9444", Version: 2}) + + // Old session dead. + if sess.Active() { + t.Fatal("session should be invalidated") + } + + // New session can be attached (same epoch, new endpoint). + sess2, err := s.AttachSession(1, SessionCatchUp) + if err != nil { + t.Fatalf("new session after endpoint change: %v", err) + } + if !sess2.Active() { + t.Fatal("new session should be active") + } +} + +func TestSession_DoubleInvalidate_Safe(t *testing.T) { + sess := newRecoverySession("r1", 1, SessionCatchUp) + sess.invalidate("first") + sess.invalidate("second") // should not panic or change reason + + if sess.InvalidateReason != "first" { + t.Fatalf("reason should be first, got %q", sess.InvalidateReason) + } +} diff --git a/sw-block/prototype/enginev2/session.go b/sw-block/prototype/enginev2/session.go new file mode 100644 index 000000000..922138af6 --- /dev/null +++ b/sw-block/prototype/enginev2/session.go @@ -0,0 +1,151 @@ +package enginev2 + +import ( + "sync" + "sync/atomic" +) + +// SessionKind identifies how the recovery session was created. 
+type SessionKind string + +const ( + SessionBootstrap SessionKind = "bootstrap" // fresh replica, no prior state + SessionCatchUp SessionKind = "catchup" // WAL gap recovery + SessionRebuild SessionKind = "rebuild" // full extent + WAL rebuild + SessionReassign SessionKind = "reassign" // address change recovery +) + +// SessionPhase tracks progress within a recovery session. +type SessionPhase string + +const ( + PhaseInit SessionPhase = "init" + PhaseConnecting SessionPhase = "connecting" + PhaseHandshake SessionPhase = "handshake" + PhaseCatchUp SessionPhase = "catchup" + PhaseCompleted SessionPhase = "completed" + PhaseInvalidated SessionPhase = "invalidated" +) + +// sessionIDCounter generates unique session IDs across all senders. +var sessionIDCounter atomic.Uint64 + +// RecoverySession represents one recovery attempt for a specific replica +// at a specific epoch. It is owned by a Sender and has exclusive authority +// to transition the replica through connecting → handshake → catchup → complete. +// +// Each session has a unique ID. Stale completions are rejected by ID, not +// by pointer comparison. This prevents old sessions from mutating state +// even if they retain a reference to the sender. +// +// Lifecycle rules: +// - At most one active session per Sender +// - Session is bound to an epoch; epoch bump invalidates it +// - Session is bound to an endpoint; address change invalidates it +// - Completed sessions release ownership back to the Sender +// - Invalidated sessions are dead and cannot be reused +type RecoverySession struct { + mu sync.Mutex + + ID uint64 // unique, monotonic, never reused + ReplicaID string + Epoch uint64 + Kind SessionKind + Phase SessionPhase + InvalidateReason string // non-empty when invalidated + + // Progress tracking. 
+ StartLSN uint64 // gap start (exclusive)
+ TargetLSN uint64 // gap end (inclusive)
+ RecoveredTo uint64 // highest LSN recovered so far
+}
+
+func newRecoverySession(replicaID string, epoch uint64, kind SessionKind) *RecoverySession {
+ return &RecoverySession{
+ ID: sessionIDCounter.Add(1),
+ ReplicaID: replicaID,
+ Epoch: epoch,
+ Kind: kind,
+ Phase: PhaseInit,
+ }
+}
+
+// Active returns true if the session has not been completed or invalidated.
+func (rs *RecoverySession) Active() bool {
+ rs.mu.Lock()
+ defer rs.mu.Unlock()
+ return rs.Phase != PhaseCompleted && rs.Phase != PhaseInvalidated
+}
+
+// validTransitions defines the allowed phase transitions.
+// Each phase maps to the set of phases it can transition to.
+var validTransitions = map[SessionPhase]map[SessionPhase]bool{
+ PhaseInit: {PhaseConnecting: true, PhaseInvalidated: true},
+ PhaseConnecting: {PhaseHandshake: true, PhaseInvalidated: true},
+ PhaseHandshake: {PhaseCatchUp: true, PhaseCompleted: true, PhaseInvalidated: true},
+ PhaseCatchUp: {PhaseCompleted: true, PhaseInvalidated: true},
+}
+
+// Advance moves the session to the next phase. Returns false if the
+// transition is not valid (wrong source phase, already terminal, or
+// illegal jump). Enforces the lifecycle:
+//
+// init → connecting → handshake → catchup → completed
+// ↘ invalidated (from any non-terminal)
+//
+// A direct handshake → completed transition is also permitted by
+// validTransitions, for sessions that have no catch-up work to do.
+func (rs *RecoverySession) Advance(phase SessionPhase) bool {
+ rs.mu.Lock()
+ defer rs.mu.Unlock()
+ if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated {
+ return false
+ }
+ allowed := validTransitions[rs.Phase]
+ if !allowed[phase] {
+ return false
+ }
+ rs.Phase = phase
+ return true
+}
+
+// UpdateProgress records catch-up progress. Returns false if stale.
+func (rs *RecoverySession) UpdateProgress(recoveredTo uint64) bool { + rs.mu.Lock() + defer rs.mu.Unlock() + if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated { + return false + } + if recoveredTo > rs.RecoveredTo { + rs.RecoveredTo = recoveredTo + } + return true +} + +// SetRange sets the recovery LSN range. +func (rs *RecoverySession) SetRange(start, target uint64) { + rs.mu.Lock() + defer rs.mu.Unlock() + rs.StartLSN = start + rs.TargetLSN = target +} + +// Converged returns true if recovery has reached the target. +func (rs *RecoverySession) Converged() bool { + rs.mu.Lock() + defer rs.mu.Unlock() + return rs.TargetLSN > 0 && rs.RecoveredTo >= rs.TargetLSN +} + +func (rs *RecoverySession) complete() { + rs.mu.Lock() + defer rs.mu.Unlock() + rs.Phase = PhaseCompleted +} + +func (rs *RecoverySession) invalidate(reason string) { + rs.mu.Lock() + defer rs.mu.Unlock() + if rs.Phase == PhaseCompleted || rs.Phase == PhaseInvalidated { + return + } + rs.Phase = PhaseInvalidated + rs.InvalidateReason = reason +} diff --git a/sw-block/prototype/fsmv2/apply.go b/sw-block/prototype/fsmv2/apply.go new file mode 100644 index 000000000..95adeee6e --- /dev/null +++ b/sw-block/prototype/fsmv2/apply.go @@ -0,0 +1,162 @@ +package fsmv2 + +func (f *FSM) Apply(evt Event) ([]Action, error) { + switch evt.Kind { + case EventEpochChanged: + if evt.Epoch <= f.Epoch { + return nil, nil + } + f.Epoch = evt.Epoch + switch f.State { + case StateInSync: + f.State = StateLagging + f.clearCatchup() + f.clearRebuild() + return []Action{ActionRevokeSyncEligibility}, nil + case StateCatchingUp, StatePromotionHold, StateRebuilding, StateCatchUpAfterBuild: + f.State = StateLagging + f.clearCatchup() + f.clearRebuild() + return []Action{ActionAbortRecovery, ActionRevokeSyncEligibility}, nil + default: + f.clearCatchup() + f.clearRebuild() + return nil, nil + } + case EventFatal: + f.State = StateFailed + f.clearCatchup() + f.clearRebuild() + return []Action{ActionFailReplica, 
ActionRevokeSyncEligibility}, nil + } + + switch f.State { + case StateBootstrapping: + switch evt.Kind { + case EventBootstrapComplete: + f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN + f.State = StateInSync + return []Action{ActionGrantSyncEligibility}, nil + case EventDisconnect: + f.State = StateLagging + return nil, nil + } + case StateInSync: + switch evt.Kind { + case EventDurableProgress: + if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN { + return nil, invalid(f.State, evt.Kind) + } + f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN + return nil, nil + case EventDisconnect: + f.State = StateLagging + return []Action{ActionRevokeSyncEligibility}, nil + } + case StateLagging: + switch evt.Kind { + case EventReconnectCatchup: + f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN + f.CatchupStartLSN = evt.ReplicaFlushedLSN + f.CatchupTargetLSN = evt.TargetLSN + f.PromotionBarrierLSN = evt.TargetLSN + f.RecoveryReservationID = evt.ReservationID + f.ReservationExpiry = evt.ReservationTTL + f.State = StateCatchingUp + return []Action{ActionStartCatchup}, nil + case EventReconnectRebuild: + f.State = StateNeedsRebuild + return []Action{ActionRevokeSyncEligibility}, nil + } + case StateCatchingUp: + switch evt.Kind { + case EventCatchupProgress: + if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN { + return nil, invalid(f.State, evt.Kind) + } + f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN + if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN { + f.State = StatePromotionHold + f.PromotionHoldUntil = evt.PromotionHoldTill + return []Action{ActionEnterPromotionHold}, nil + } + return nil, nil + case EventRetentionLost, EventCatchupTimeout: + f.State = StateNeedsRebuild + f.clearCatchup() + return []Action{ActionAbortRecovery}, nil + case EventDisconnect: + f.State = StateLagging + f.clearCatchup() + return []Action{ActionAbortRecovery}, nil + } + case StatePromotionHold: + switch evt.Kind { + case EventDurableProgress: + if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN { + return nil, 
invalid(f.State, evt.Kind) + } + f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN + return nil, nil + case EventPromotionHealthy: + if evt.Now < f.PromotionHoldUntil { + return nil, nil + } + f.State = StateInSync + f.clearCatchup() + return []Action{ActionGrantSyncEligibility}, nil + case EventDisconnect: + f.State = StateLagging + f.clearCatchup() + return []Action{ActionRevokeSyncEligibility}, nil + } + case StateNeedsRebuild: + switch evt.Kind { + case EventStartRebuild: + f.State = StateRebuilding + f.SnapshotID = evt.SnapshotID + f.SnapshotCpLSN = evt.SnapshotCpLSN + f.RecoveryReservationID = evt.ReservationID + f.ReservationExpiry = evt.ReservationTTL + return []Action{ActionStartRebuild}, nil + } + case StateRebuilding: + switch evt.Kind { + case EventRebuildBaseApplied: + f.State = StateCatchUpAfterBuild + f.ReplicaFlushedLSN = f.SnapshotCpLSN + f.CatchupStartLSN = f.SnapshotCpLSN + f.CatchupTargetLSN = evt.TargetLSN + f.PromotionBarrierLSN = evt.TargetLSN + return []Action{ActionStartCatchup}, nil + case EventRetentionLost, EventRebuildTooSlow, EventDisconnect: + f.State = StateNeedsRebuild + f.clearCatchup() + f.clearRebuild() + return []Action{ActionAbortRecovery}, nil + } + case StateCatchUpAfterBuild: + switch evt.Kind { + case EventCatchupProgress: + if evt.ReplicaFlushedLSN < f.ReplicaFlushedLSN { + return nil, invalid(f.State, evt.Kind) + } + f.ReplicaFlushedLSN = evt.ReplicaFlushedLSN + if evt.ReplicaFlushedLSN >= f.PromotionBarrierLSN { + f.State = StatePromotionHold + f.PromotionHoldUntil = evt.PromotionHoldTill + return []Action{ActionEnterPromotionHold}, nil + } + return nil, nil + case EventRetentionLost, EventCatchupTimeout, EventDisconnect: + f.State = StateNeedsRebuild + f.clearCatchup() + f.clearRebuild() + return []Action{ActionAbortRecovery}, nil + } + case StateFailed: + return nil, nil + } + + return nil, invalid(f.State, evt.Kind) +} diff --git a/sw-block/prototype/fsmv2/events.go b/sw-block/prototype/fsmv2/events.go new file mode 100644 
index 000000000..f2d21c4ed --- /dev/null +++ b/sw-block/prototype/fsmv2/events.go @@ -0,0 +1,37 @@ +package fsmv2 + +type EventKind string + +const ( + EventBootstrapComplete EventKind = "BootstrapComplete" + EventDisconnect EventKind = "Disconnect" + EventReconnectCatchup EventKind = "ReconnectCatchup" + EventReconnectRebuild EventKind = "ReconnectRebuild" + EventDurableProgress EventKind = "DurableProgress" + EventCatchupProgress EventKind = "CatchupProgress" + EventPromotionHealthy EventKind = "PromotionHealthy" + EventStartRebuild EventKind = "StartRebuild" + EventRebuildBaseApplied EventKind = "RebuildBaseApplied" + EventRetentionLost EventKind = "RetentionLost" + EventCatchupTimeout EventKind = "CatchupTimeout" + EventRebuildTooSlow EventKind = "RebuildTooSlow" + EventEpochChanged EventKind = "EpochChanged" + EventFatal EventKind = "Fatal" +) + +type Event struct { + Kind EventKind + + Epoch uint64 + Now uint64 + + ReplicaFlushedLSN uint64 + TargetLSN uint64 + PromotionHoldTill uint64 + + SnapshotID string + SnapshotCpLSN uint64 + + ReservationID string + ReservationTTL uint64 +} diff --git a/sw-block/prototype/fsmv2/fsm.go b/sw-block/prototype/fsmv2/fsm.go new file mode 100644 index 000000000..3f08239b1 --- /dev/null +++ b/sw-block/prototype/fsmv2/fsm.go @@ -0,0 +1,73 @@ +package fsmv2 + +import "fmt" + +type State string + +const ( + StateBootstrapping State = "Bootstrapping" + StateInSync State = "InSync" + StateLagging State = "Lagging" + StateCatchingUp State = "CatchingUp" + StatePromotionHold State = "PromotionHold" + StateNeedsRebuild State = "NeedsRebuild" + StateRebuilding State = "Rebuilding" + StateCatchUpAfterBuild State = "CatchUpAfterRebuild" + StateFailed State = "Failed" +) + +type Action string + +const ( + ActionNone Action = "None" + ActionGrantSyncEligibility Action = "GrantSyncEligibility" + ActionRevokeSyncEligibility Action = "RevokeSyncEligibility" + ActionStartCatchup Action = "StartCatchup" + ActionEnterPromotionHold Action = 
"EnterPromotionHold" + ActionStartRebuild Action = "StartRebuild" + ActionAbortRecovery Action = "AbortRecovery" + ActionFailReplica Action = "FailReplica" +) + +type FSM struct { + State State + Epoch uint64 + + ReplicaFlushedLSN uint64 + CatchupStartLSN uint64 + CatchupTargetLSN uint64 + PromotionBarrierLSN uint64 + PromotionHoldUntil uint64 + + SnapshotID string + SnapshotCpLSN uint64 + + RecoveryReservationID string + ReservationExpiry uint64 +} + +func New(epoch uint64) *FSM { + return &FSM{State: StateBootstrapping, Epoch: epoch} +} + +func (f *FSM) IsSyncEligible() bool { + return f.State == StateInSync +} + +func (f *FSM) clearCatchup() { + f.CatchupStartLSN = 0 + f.CatchupTargetLSN = 0 + f.PromotionBarrierLSN = 0 + f.PromotionHoldUntil = 0 + f.RecoveryReservationID = "" + f.ReservationExpiry = 0 +} + +func (f *FSM) clearRebuild() { + f.SnapshotID = "" + f.SnapshotCpLSN = 0 +} + +func invalid(state State, kind EventKind) error { + return fmt.Errorf("fsmv2: invalid event %s in state %s", kind, state) +} diff --git a/sw-block/prototype/fsmv2/fsm_test.go b/sw-block/prototype/fsmv2/fsm_test.go new file mode 100644 index 000000000..77c795784 --- /dev/null +++ b/sw-block/prototype/fsmv2/fsm_test.go @@ -0,0 +1,95 @@ +package fsmv2 + +import "testing" + +func mustApply(t *testing.T, f *FSM, evt Event) []Action { + t.Helper() + actions, err := f.Apply(evt) + if err != nil { + t.Fatalf("apply %s: %v", evt.Kind, err) + } + return actions +} + +func TestFSMBootstrapToInSync(t *testing.T) { + f := New(7) + mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 10}) + if f.State != StateInSync || !f.IsSyncEligible() || f.ReplicaFlushedLSN != 10 { + t.Fatalf("unexpected bootstrap result: state=%s eligible=%v lsn=%d", f.State, f.IsSyncEligible(), f.ReplicaFlushedLSN) + } +} + +func TestFSMCatchupPromotionHoldFlow(t *testing.T) { + f := New(3) + mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 5}) + mustApply(t, f, Event{Kind: 
EventDisconnect}) + mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 5, TargetLSN: 20, ReservationID: "r1", ReservationTTL: 100}) + if f.State != StateCatchingUp { + t.Fatalf("expected catching up, got %s", f.State) + } + mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 20, PromotionHoldTill: 30}) + if f.State != StatePromotionHold { + t.Fatalf("expected promotion hold, got %s", f.State) + } + mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 29}) + if f.State != StatePromotionHold { + t.Fatalf("hold exited too early: %s", f.State) + } + mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 30}) + if f.State != StateInSync || !f.IsSyncEligible() { + t.Fatalf("expected insync after hold, got %s eligible=%v", f.State, f.IsSyncEligible()) + } +} + +func TestFSMRebuildFlow(t *testing.T) { + f := New(11) + mustApply(t, f, Event{Kind: EventDisconnect}) + mustApply(t, f, Event{Kind: EventReconnectRebuild}) + if f.State != StateNeedsRebuild { + t.Fatalf("expected needs rebuild, got %s", f.State) + } + mustApply(t, f, Event{Kind: EventStartRebuild, SnapshotID: "snap-1", SnapshotCpLSN: 100, ReservationID: "rr", ReservationTTL: 200}) + if f.State != StateRebuilding { + t.Fatalf("expected rebuilding, got %s", f.State) + } + mustApply(t, f, Event{Kind: EventRebuildBaseApplied, TargetLSN: 140}) + if f.State != StateCatchUpAfterBuild || f.ReplicaFlushedLSN != 100 { + t.Fatalf("unexpected rebuild-base state=%s lsn=%d", f.State, f.ReplicaFlushedLSN) + } + mustApply(t, f, Event{Kind: EventCatchupProgress, ReplicaFlushedLSN: 140, PromotionHoldTill: 150}) + mustApply(t, f, Event{Kind: EventPromotionHealthy, Now: 150}) + if f.State != StateInSync || f.SnapshotID != "snap-1" { + t.Fatalf("expected insync after rebuild, got state=%s snapshot=%q", f.State, f.SnapshotID) + } +} + +func TestFSMEpochChangeAbortsRecovery(t *testing.T) { + f := New(1) + mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 1}) + mustApply(t, f, 
Event{Kind: EventDisconnect}) + mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "r1", ReservationTTL: 99}) + mustApply(t, f, Event{Kind: EventEpochChanged, Epoch: 2}) + if f.State != StateLagging || f.RecoveryReservationID != "" || f.IsSyncEligible() { + t.Fatalf("unexpected state after epoch change: state=%s reservation=%q eligible=%v", f.State, f.RecoveryReservationID, f.IsSyncEligible()) + } +} + +func TestFSMReservationLostNeedsRebuild(t *testing.T) { + f := New(5) + mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 9}) + mustApply(t, f, Event{Kind: EventDisconnect}) + mustApply(t, f, Event{Kind: EventReconnectCatchup, ReplicaFlushedLSN: 9, TargetLSN: 15, ReservationID: "r2", ReservationTTL: 80}) + mustApply(t, f, Event{Kind: EventRetentionLost}) + if f.State != StateNeedsRebuild { + t.Fatalf("expected needs rebuild after reservation lost, got %s", f.State) + } +} + +func TestFSMDurableProgressWhileInSync(t *testing.T) { + f := New(2) + mustApply(t, f, Event{Kind: EventBootstrapComplete, ReplicaFlushedLSN: 4}) + mustApply(t, f, Event{Kind: EventDurableProgress, ReplicaFlushedLSN: 8}) + if f.ReplicaFlushedLSN != 8 || f.State != StateInSync { + t.Fatalf("unexpected in-sync durable progress: state=%s lsn=%d", f.State, f.ReplicaFlushedLSN) + } +} diff --git a/sw-block/prototype/fsmv2/fsmv2.test.exe b/sw-block/prototype/fsmv2/fsmv2.test.exe new file mode 100644 index 000000000..ff0ad2bc1 Binary files /dev/null and b/sw-block/prototype/fsmv2/fsmv2.test.exe differ diff --git a/sw-block/prototype/run-tests.ps1 b/sw-block/prototype/run-tests.ps1 new file mode 100644 index 000000000..6c1f06d16 --- /dev/null +++ b/sw-block/prototype/run-tests.ps1 @@ -0,0 +1,37 @@ +param( + [string[]]$Packages = @( + './sw-block/prototype/fsmv2', + './sw-block/prototype/volumefsm', + './sw-block/prototype/distsim' + ) +) + +$ErrorActionPreference = 'Stop' +$root = Split-Path -Parent (Split-Path -Parent 
$PSScriptRoot)
+Set-Location $root
+
+$cacheDir = Join-Path $root '.gocache_v2'
+$tmpDir = Join-Path $root '.gotmp_v2'
+New-Item -ItemType Directory -Force -Path $cacheDir,$tmpDir | Out-Null
+$env:GOCACHE = $cacheDir
+$env:GOTMPDIR = $tmpDir
+
+foreach ($pkg in $Packages) {
+    $name = Split-Path $pkg -Leaf
+    $out = Join-Path $root ("sw-block\\prototype\\{0}\\{0}.test.exe" -f $name)
+    Write-Host "==> building $pkg"
+    go test -c -o $out $pkg
+    if ($LASTEXITCODE -ne 0) {
+        throw "go test -c failed for $pkg"
+    }
+    if (!(Test-Path $out)) {
+        throw "go test -c reported success but produced no test binary for $pkg"
+    }
+    Write-Host "==> running $out"
+    cmd /c "cd /d $root && $out -test.v -test.count=1"
+    if ($LASTEXITCODE -ne 0) {
+        throw "test binary failed for $pkg"
+    }
+}
+
+Write-Host "Done."
diff --git a/sw-block/prototype/volumefsm/events.go b/sw-block/prototype/volumefsm/events.go
new file mode 100644
index 000000000..c50633eb0
--- /dev/null
+++ b/sw-block/prototype/volumefsm/events.go
@@ -0,0 +1,148 @@
+package volumefsm
+
+import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2"
+
+type EventKind string
+
+const (
+	EventWriteCommitted            EventKind = "WriteCommitted"
+	EventCheckpointAdvanced        EventKind = "CheckpointAdvanced"
+	EventBarrierCompleted          EventKind = "BarrierCompleted"
+	EventBootstrapReplica          EventKind = "BootstrapReplica"
+	EventReplicaDisconnect         EventKind = "ReplicaDisconnect"
+	EventReplicaReconnect          EventKind = "ReplicaReconnect"
+	EventReplicaNeedsRebuild       EventKind = "ReplicaNeedsRebuild"
+	EventReplicaCatchupProgress    EventKind = "ReplicaCatchupProgress"
+	EventReplicaPromotionHealthy   EventKind = "ReplicaPromotionHealthy"
+	EventReplicaStartRebuild       EventKind = "ReplicaStartRebuild"
+	EventReplicaRebuildBaseApplied EventKind = "ReplicaRebuildBaseApplied"
+	EventReplicaReservationLost    EventKind = "ReplicaReservationLost"
+	EventReplicaCatchupTimeout     EventKind = "ReplicaCatchupTimeout"
+	EventReplicaRebuildTooSlow     EventKind = "ReplicaRebuildTooSlow"
+	EventPrimaryLeaseLost          EventKind = "PrimaryLeaseLost"
+	EventPromoteReplica            EventKind = "PromoteReplica"
+)
+
+type Event struct {
+	Kind      EventKind
+	ReplicaID string
+
+	LSN               uint64
+	CheckpointLSN     uint64
+	ReplicaFlushedLSN uint64
+	TargetLSN         uint64
+	Now               uint64
+	HoldUntil         uint64
+	SnapshotID        string
+	SnapshotCpLSN     uint64
+	ReservationID     string
+	ReservationTTL    uint64
+}
+
+func (m *Model) Apply(evt Event) error {
+	switch evt.Kind {
+	case EventWriteCommitted:
+		if evt.LSN > m.HeadLSN {
+			m.HeadLSN = evt.LSN
+		} else {
+			m.HeadLSN++
+		}
+		return nil
+	case EventCheckpointAdvanced:
+		if evt.CheckpointLSN > m.CheckpointLSN {
+			m.CheckpointLSN = evt.CheckpointLSN
+		}
+		return nil
+	case EventPrimaryLeaseLost:
+		m.PrimaryState = PrimaryLost
+		m.Epoch++
+		for _, r := range m.Replicas {
+			_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
+			if err != nil {
+				return err
+			}
+		}
+		return nil
+	case EventPromoteReplica:
+		m.PrimaryID = evt.ReplicaID
+		m.PrimaryState = PrimaryServing
+		m.Epoch++
+		for _, r := range m.Replicas {
+			_, err := r.FSM.Apply(fsmv2.Event{Kind: fsmv2.EventEpochChanged, Epoch: m.Epoch})
+			if err != nil {
+				return err
+			}
+		}
+		return nil
+	}
+
+	r := m.Replica(evt.ReplicaID)
+	if r == nil {
+		return nil
+	}
+
+	var fEvt fsmv2.Event
+	switch evt.Kind {
+	case EventBarrierCompleted:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventDurableProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
+	case EventBootstrapReplica:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventBootstrapComplete, ReplicaFlushedLSN: evt.ReplicaFlushedLSN}
+	case EventReplicaDisconnect:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventDisconnect}
+	case EventReplicaReconnect:
+		if evt.ReservationID != "" {
+			fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectCatchup, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, TargetLSN: evt.TargetLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
+		} else {
+			fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
+		}
+	case EventReplicaNeedsRebuild:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventReconnectRebuild}
+	case EventReplicaCatchupProgress:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupProgress, ReplicaFlushedLSN: evt.ReplicaFlushedLSN, PromotionHoldTill: evt.HoldUntil}
+	case EventReplicaPromotionHealthy:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventPromotionHealthy, Now: evt.Now}
+	case EventReplicaStartRebuild:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventStartRebuild, SnapshotID: evt.SnapshotID, SnapshotCpLSN: evt.SnapshotCpLSN, ReservationID: evt.ReservationID, ReservationTTL: evt.ReservationTTL}
+	case EventReplicaRebuildBaseApplied:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildBaseApplied, TargetLSN: evt.TargetLSN}
+	case EventReplicaReservationLost:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventRetentionLost}
+	case EventReplicaCatchupTimeout:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventCatchupTimeout}
+	case EventReplicaRebuildTooSlow:
+		fEvt = fsmv2.Event{Kind: fsmv2.EventRebuildTooSlow}
+	default:
+		return nil
+	}
+	_, err := r.FSM.Apply(fEvt)
+	return err
+}
+
+func (m *Model) EvaluateReconnect(replicaID string, flushedLSN, targetLSN uint64) (RecoveryDecision, error) {
+	decision := m.Planner.PlanReconnect(replicaID, flushedLSN, targetLSN)
+	r := m.Replica(replicaID)
+	if r == nil {
+		return decision, nil
+	}
+	switch decision.Disposition {
+	case RecoveryCatchup:
+		err := m.Apply(Event{
+			Kind:              EventReplicaReconnect,
+			ReplicaID:         replicaID,
+			ReplicaFlushedLSN: flushedLSN,
+			TargetLSN:         targetLSN,
+			ReservationID:     decision.ReservationID,
+			ReservationTTL:    decision.ReservationTTL,
+		})
+		return decision, err
+	default:
+		if r.FSM.State == fsmv2.StateNeedsRebuild {
+			return decision, nil
+		}
+		err := m.Apply(Event{
+			Kind:      EventReplicaNeedsRebuild,
+			ReplicaID: replicaID,
+		})
+		return decision, err
+	}
+}
diff --git a/sw-block/prototype/volumefsm/format.go b/sw-block/prototype/volumefsm/format.go
new file mode 100644
index 000000000..555613d44
--- /dev/null +++ b/sw-block/prototype/volumefsm/format.go @@ -0,0 +1,38 @@ +package volumefsm + +import ( + "fmt" + "sort" + "strings" +) + +func FormatSnapshot(s Snapshot) string { + ids := make([]string, 0, len(s.Replicas)) + for id := range s.Replicas { + ids = append(ids, id) + } + sort.Strings(ids) + + parts := []string{ + fmt.Sprintf("step=%s", s.Step), + fmt.Sprintf("epoch=%d", s.Epoch), + fmt.Sprintf("primary=%s/%s", s.PrimaryID, s.PrimaryState), + fmt.Sprintf("head=%d", s.HeadLSN), + fmt.Sprintf("write=%t:%s", s.WriteGate.Allowed, s.WriteGate.Reason), + fmt.Sprintf("ack=%t:%s", s.AckGate.Allowed, s.AckGate.Reason), + } + for _, id := range ids { + r := s.Replicas[id] + parts = append(parts, fmt.Sprintf("%s=%s@%d", id, r.State, r.FlushedLSN)) + } + return strings.Join(parts, " ") +} + +func FormatTrace(trace []Snapshot) string { + lines := make([]string, 0, len(trace)) + for _, s := range trace { + lines = append(lines, FormatSnapshot(s)) + } + return strings.Join(lines, "\n") +} + diff --git a/sw-block/prototype/volumefsm/model.go b/sw-block/prototype/volumefsm/model.go new file mode 100644 index 000000000..d6bb9bd0a --- /dev/null +++ b/sw-block/prototype/volumefsm/model.go @@ -0,0 +1,142 @@ +package volumefsm + +import fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2" + +type Mode string + +const ( + ModeBestEffort Mode = "best_effort" + ModeSyncAll Mode = "sync_all" + ModeSyncQuorum Mode = "sync_quorum" +) + +type PrimaryState string + +const ( + PrimaryServing PrimaryState = "serving" + PrimaryDraining PrimaryState = "draining" + PrimaryLost PrimaryState = "lost" +) + +type Replica struct { + ID string + FSM *fsmv2.FSM +} + +type Model struct { + Epoch uint64 + PrimaryID string + PrimaryState PrimaryState + Mode Mode + + HeadLSN uint64 + CheckpointLSN uint64 + + RequiredReplicaIDs []string + Replicas map[string]*Replica + Planner RecoveryPlanner +} + +func New(primaryID string, mode Mode, epoch uint64, replicaIDs ...string) *Model { + m 
:= &Model{ + Epoch: epoch, + PrimaryID: primaryID, + PrimaryState: PrimaryServing, + Mode: mode, + Replicas: make(map[string]*Replica, len(replicaIDs)), + Planner: StaticRecoveryPlanner{}, + } + for _, id := range replicaIDs { + m.Replicas[id] = &Replica{ID: id, FSM: fsmv2.New(epoch)} + m.RequiredReplicaIDs = append(m.RequiredReplicaIDs, id) + } + return m +} + +func (m *Model) Replica(id string) *Replica { + return m.Replicas[id] +} + +func (m *Model) SyncEligibleCount() int { + count := 0 + for _, id := range m.RequiredReplicaIDs { + r := m.Replicas[id] + if r != nil && r.FSM.IsSyncEligible() { + count++ + } + } + return count +} + +func (m *Model) DurableReplicaCount(targetLSN uint64) int { + count := 0 + for _, id := range m.RequiredReplicaIDs { + r := m.Replicas[id] + if r != nil && r.FSM.IsSyncEligible() && r.FSM.ReplicaFlushedLSN >= targetLSN { + count++ + } + } + return count +} + +func (m *Model) Quorum() int { + rf := len(m.RequiredReplicaIDs) + 1 + return rf/2 + 1 +} + +func (m *Model) CanServeWrite() bool { + return m.WriteAdmission().Allowed +} + +type AdmissionDecision struct { + Allowed bool + Reason string +} + +func (m *Model) WriteAdmission() AdmissionDecision { + if m.PrimaryState != PrimaryServing { + return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"} + } + switch m.Mode { + case ModeBestEffort: + return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"} + case ModeSyncAll: + if m.SyncEligibleCount() == len(m.RequiredReplicaIDs) { + return AdmissionDecision{Allowed: true, Reason: "all_replicas_sync_eligible"} + } + return AdmissionDecision{Allowed: false, Reason: "required_replica_not_in_sync"} + case ModeSyncQuorum: + if 1+m.SyncEligibleCount() >= m.Quorum() { + return AdmissionDecision{Allowed: true, Reason: "quorum_sync_eligible"} + } + return AdmissionDecision{Allowed: false, Reason: "quorum_not_available"} + default: + return AdmissionDecision{Allowed: false, Reason: "unknown_mode"} + } +} + +func (m 
*Model) CanAcknowledgeLSN(targetLSN uint64) bool { + return m.AckAdmission(targetLSN).Allowed +} + +func (m *Model) AckAdmission(targetLSN uint64) AdmissionDecision { + if m.PrimaryState != PrimaryServing { + return AdmissionDecision{Allowed: false, Reason: "primary_not_serving"} + } + switch m.Mode { + case ModeBestEffort: + return AdmissionDecision{Allowed: true, Reason: "best_effort_local_durable"} + case ModeSyncAll: + if m.DurableReplicaCount(targetLSN) == len(m.RequiredReplicaIDs) { + return AdmissionDecision{Allowed: true, Reason: "all_replicas_durable"} + } + return AdmissionDecision{Allowed: false, Reason: "required_replica_not_durable"} + case ModeSyncQuorum: + if 1+m.DurableReplicaCount(targetLSN) >= m.Quorum() { + return AdmissionDecision{Allowed: true, Reason: "quorum_durable"} + } + return AdmissionDecision{Allowed: false, Reason: "durable_quorum_not_available"} + default: + return AdmissionDecision{Allowed: false, Reason: "unknown_mode"} + } +} diff --git a/sw-block/prototype/volumefsm/model_test.go b/sw-block/prototype/volumefsm/model_test.go new file mode 100644 index 000000000..7cd45dfdc --- /dev/null +++ b/sw-block/prototype/volumefsm/model_test.go @@ -0,0 +1,421 @@ +package volumefsm + +import ( + "strings" + "testing" + + fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2" +) + +type scriptedPlanner struct { + decision RecoveryDecision +} + +func (s scriptedPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision { + return s.decision +} + +func mustApply(t *testing.T, m *Model, evt Event) { + t.Helper() + if err := m.Apply(evt); err != nil { + t.Fatalf("apply %s: %v", evt.Kind, err) + } +} + +func TestModelSyncAllBlocksOnLaggingReplica(t *testing.T) { + m := New("p1", ModeSyncAll, 1, "r1", "r2") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1}) + if !m.CanServeWrite() 
{ + t.Fatal("sync_all should serve when all replicas are in sync") + } + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"}) + if m.CanServeWrite() { + t.Fatal("sync_all should block when one required replica lags") + } +} + +func TestModelSyncQuorumSurvivesOneLaggingReplica(t *testing.T) { + m := New("p1", ModeSyncQuorum, 1, "r1", "r2") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"}) + if !m.CanServeWrite() { + t.Fatal("sync_quorum should still serve with primary + one in-sync replica") + } +} + +func TestModelCatchupFlowRestoresEligibility(t *testing.T) { + m := New("p1", ModeSyncAll, 1, "r1") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10}) + mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 10, ReservationID: "res-1", ReservationTTL: 100}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp { + t.Fatalf("expected catching up, got %s", got) + } + mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 10, HoldUntil: 20}) + mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 20}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync { + t.Fatalf("expected in sync, got %s", got) + } + if !m.CanServeWrite() { + t.Fatal("sync_all should serve after replica returns to in-sync") + } +} + +func TestModelLongGapRebuildFlow(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"}) + 
mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap1", SnapshotCpLSN: 100, ReservationID: "rebuild-1", ReservationTTL: 200}) + mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 130}) + mustApply(t, m, Event{Kind: EventReplicaCatchupProgress, ReplicaID: "r1", ReplicaFlushedLSN: 130, HoldUntil: 150}) + mustApply(t, m, Event{Kind: EventReplicaPromotionHealthy, ReplicaID: "r1", Now: 150}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateInSync { + t.Fatalf("expected in sync after rebuild, got %s", got) + } +} + +func TestModelPrimaryLeaseLostFencesRecovery(t *testing.T) { + m := New("p1", ModeSyncAll, 1, "r1") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r1", ReplicaFlushedLSN: 1, TargetLSN: 5, ReservationID: "res-2", ReservationTTL: 100}) + mustApply(t, m, Event{Kind: EventPrimaryLeaseLost}) + if m.PrimaryState != PrimaryLost { + t.Fatalf("expected lost primary, got %s", m.PrimaryState) + } + if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging { + t.Fatalf("expected lagging after fencing, got %s", got) + } +} + +func TestModelPromoteReplicaChangesEpoch(t *testing.T) { + m := New("p1", ModeSyncQuorum, 1, "r1", "r2") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 10}) + oldEpoch := m.Epoch + mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"}) + if m.PrimaryID != "r1" { + t.Fatalf("expected promoted primary r1, got %s", m.PrimaryID) + } + if m.Epoch != oldEpoch+1 { + t.Fatalf("expected epoch increment, got %d want %d", m.Epoch, oldEpoch+1) + } +} + +func TestModelSyncQuorumWithThreeReplicasMixedStates(t *testing.T) { + m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", 
ReplicaFlushedLSN: 1}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 1}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 1}) + + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"}) + + if !m.CanServeWrite() { + t.Fatal("sync_quorum should serve with primary + two in-sync replicas out of RF=4") + } +} + +func TestModelFailoverFencesMixedReplicaStates(t *testing.T) { + m := New("p1", ModeSyncQuorum, 10, "r1", "r2", "r3") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 8}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 8}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 6}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r2"}) + mustApply(t, m, Event{Kind: EventReplicaReconnect, ReplicaID: "r2", ReplicaFlushedLSN: 8, TargetLSN: 12, ReservationID: "catch-r2", ReservationTTL: 100}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r3"}) + mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r3"}) + mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r3", SnapshotID: "snap-x", SnapshotCpLSN: 6, ReservationID: "rebuild-r3", ReservationTTL: 200}) + + mustApply(t, m, Event{Kind: EventPrimaryLeaseLost}) + mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"}) + + if m.PrimaryID != "r1" { + t.Fatalf("expected r1 promoted, got %s", m.PrimaryID) + } + if got := m.Replica("r2").FSM.State; got != fsmv2.StateLagging { + t.Fatalf("expected r2 fenced back to lagging, got %s", got) + } + if got := m.Replica("r3").FSM.State; got != fsmv2.StateLagging { + t.Fatalf("expected r3 fenced back to lagging, got %s", got) + } +} + +func TestModelRebuildInterruptedByEpochChange(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: 
"r1"}) + mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"}) + mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-2", SnapshotCpLSN: 100, ReservationID: "rebuild-2", ReservationTTL: 200}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateRebuilding { + t.Fatalf("expected rebuilding, got %s", got) + } + + mustApply(t, m, Event{Kind: EventPromoteReplica, ReplicaID: "r1"}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateLagging { + t.Fatalf("expected lagging after epoch change fencing, got %s", got) + } +} + +func TestModelReservationLostDuringCatchupAfterRebuild(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + mustApply(t, m, Event{Kind: EventReplicaNeedsRebuild, ReplicaID: "r1"}) + mustApply(t, m, Event{Kind: EventReplicaStartRebuild, ReplicaID: "r1", SnapshotID: "snap-3", SnapshotCpLSN: 50, ReservationID: "rebuild-3", ReservationTTL: 200}) + mustApply(t, m, Event{Kind: EventReplicaRebuildBaseApplied, ReplicaID: "r1", TargetLSN: 80}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchUpAfterBuild { + t.Fatalf("expected catch-up-after-rebuild, got %s", got) + } + + mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild { + t.Fatalf("expected needs rebuild after reservation loss, got %s", got) + } +} + +func TestModelSyncAllBarrierAcknowledgeTargetLSN(t *testing.T) { + m := New("p1", ModeSyncAll, 1, "r1", "r2") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5}) + mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 10}) + + if m.CanAcknowledgeLSN(10) { + t.Fatal("sync_all should not acknowledge target LSN before barriers advance replica durability") + } + + mustApply(t, m, 
Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}) + if m.CanAcknowledgeLSN(10) { + t.Fatal("sync_all should still wait for second replica durability") + } + + mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 10}) + if !m.CanAcknowledgeLSN(10) { + t.Fatal("sync_all should acknowledge once all required replicas are durable at target LSN") + } +} + +func TestModelSyncQuorumBarrierAcknowledgeTargetLSN(t *testing.T) { + m := New("p1", ModeSyncQuorum, 1, "r1", "r2", "r3") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 5}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r2", ReplicaFlushedLSN: 5}) + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r3", ReplicaFlushedLSN: 5}) + mustApply(t, m, Event{Kind: EventWriteCommitted, LSN: 9}) + + if m.CanAcknowledgeLSN(9) { + t.Fatal("sync_quorum should not acknowledge before any replica reaches target durability") + } + + mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 9}) + if m.CanAcknowledgeLSN(9) { + t.Fatal("sync_quorum should still wait because RF=4 quorum needs primary + two durable replicas") + } + mustApply(t, m, Event{Kind: EventBarrierCompleted, ReplicaID: "r2", ReplicaFlushedLSN: 9}) + if !m.CanAcknowledgeLSN(9) { + t.Fatal("sync_quorum should acknowledge with primary + two durable replicas in RF=4") + } +} + +func TestModelWriteAdmissionReasons(t *testing.T) { + m := New("p1", ModeSyncAll, 1, "r1") + dec := m.WriteAdmission() + if dec.Allowed || dec.Reason != "required_replica_not_in_sync" { + t.Fatalf("unexpected admission before bootstrap: %+v", dec) + } + + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}) + dec = m.WriteAdmission() + if !dec.Allowed || dec.Reason != "all_replicas_sync_eligible" { + t.Fatalf("unexpected admission after bootstrap: %+v", dec) + } +} + +func 
TestModelEvaluateReconnectUsesPlanner(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + + decision, err := m.EvaluateReconnect("r1", 2, 8) + if err != nil { + t.Fatalf("evaluate reconnect: %v", err) + } + if decision.Disposition != RecoveryCatchup { + t.Fatalf("expected catchup decision, got %+v", decision) + } + if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp { + t.Fatalf("expected catching up, got %s", got) + } +} + +func TestModelEvaluateReconnectNeedsRebuildFromPlanner(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + m.Planner = scriptedPlanner{decision: RecoveryDecision{ + Disposition: RecoveryNeedsRebuild, + Reason: "payload_not_resolvable", + }} + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 2}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + + decision, err := m.EvaluateReconnect("r1", 2, 8) + if err != nil { + t.Fatalf("evaluate reconnect: %v", err) + } + if decision.Disposition != RecoveryNeedsRebuild || decision.Reason != "payload_not_resolvable" { + t.Fatalf("unexpected decision: %+v", decision) + } + if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild { + t.Fatalf("expected needs rebuild, got %s", got) + } +} + +func TestModelEvaluateReconnectCarriesRecoveryClasses(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + m.Planner = scriptedPlanner{decision: RecoveryDecision{ + Disposition: RecoveryCatchup, + ReservationID: "extent-resv", + ReservationTTL: 42, + Reason: "extent_payload_resolvable", + Classes: []RecoveryClass{RecoveryClassExtentReferenced}, + }} + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 3}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + + decision, err := 
m.EvaluateReconnect("r1", 3, 9) + if err != nil { + t.Fatalf("evaluate reconnect: %v", err) + } + if len(decision.Classes) != 1 || decision.Classes[0] != RecoveryClassExtentReferenced { + t.Fatalf("unexpected recovery classes: %+v", decision.Classes) + } + if got := m.Replica("r1").FSM.State; got != fsmv2.StateCatchingUp { + t.Fatalf("expected catching up, got %s", got) + } + if got := m.Replica("r1").FSM.RecoveryReservationID; got != "extent-resv" { + t.Fatalf("expected reservation extent-resv, got %q", got) + } +} + +func TestModelEvaluateReconnectCanChangeOverTime(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + + m.Planner = scriptedPlanner{decision: RecoveryDecision{ + Disposition: RecoveryCatchup, + ReservationID: "resv-1", + ReservationTTL: 10, + Reason: "temporarily_recoverable", + Classes: []RecoveryClass{RecoveryClassWALInline}, + }} + decision, err := m.EvaluateReconnect("r1", 4, 12) + if err != nil { + t.Fatalf("first evaluate reconnect: %v", err) + } + if decision.Disposition != RecoveryCatchup { + t.Fatalf("expected catchup on first evaluation, got %+v", decision) + } + + mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"}) + if got := m.Replica("r1").FSM.State; got != fsmv2.StateNeedsRebuild { + t.Fatalf("expected needs rebuild after reservation loss, got %s", got) + } + + m.Planner = scriptedPlanner{decision: RecoveryDecision{ + Disposition: RecoveryNeedsRebuild, + Reason: "recoverability_expired", + }} + decision, err = m.EvaluateReconnect("r1", 4, 12) + if err != nil { + t.Fatalf("second evaluate reconnect: %v", err) + } + if decision.Reason != "recoverability_expired" { + t.Fatalf("unexpected second decision: %+v", decision) + } +} + +func TestRunScenarioProducesStateTrace(t *testing.T) { + m := New("p1", ModeSyncAll, 1, "r1") + trace, err := 
RunScenario(m, []ScenarioStep{ + {Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}}, + {Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}}, + {Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}}, + }) + if err != nil { + t.Fatalf("run scenario: %v", err) + } + if len(trace) != 4 { + t.Fatalf("expected 4 snapshots, got %d", len(trace)) + } + last := trace[len(trace)-1] + if last.HeadLSN != 10 { + t.Fatalf("expected head 10, got %d", last.HeadLSN) + } + if got := last.Replicas["r1"].FlushedLSN; got != 10 { + t.Fatalf("expected replica flushed 10, got %d", got) + } + if !last.AckGate.Allowed { + t.Fatalf("expected ack gate allowed at final step, got %+v", last.AckGate) + } +} + +func TestScriptedRecoveryPlannerChangesDecisionOverTime(t *testing.T) { + m := New("p1", ModeBestEffort, 1, "r1") + m.Planner = &ScriptedRecoveryPlanner{ + Decisions: []RecoveryDecision{ + { + Disposition: RecoveryCatchup, + ReservationID: "resv-a", + ReservationTTL: 10, + Reason: "recoverable_now", + Classes: []RecoveryClass{RecoveryClassWALInline}, + }, + { + Disposition: RecoveryNeedsRebuild, + Reason: "recoverability_expired", + }, + }, + } + mustApply(t, m, Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 4}) + mustApply(t, m, Event{Kind: EventReplicaDisconnect, ReplicaID: "r1"}) + + first, err := m.EvaluateReconnect("r1", 4, 12) + if err != nil { + t.Fatalf("first reconnect: %v", err) + } + if first.Disposition != RecoveryCatchup { + t.Fatalf("unexpected first decision: %+v", first) + } + mustApply(t, m, Event{Kind: EventReplicaReservationLost, ReplicaID: "r1"}) + + second, err := m.EvaluateReconnect("r1", 4, 12) + if err != nil { + t.Fatalf("second reconnect: %v", err) + } + if second.Disposition != RecoveryNeedsRebuild || second.Reason != "recoverability_expired" { + t.Fatalf("unexpected second decision: %+v", second) + } +} + +func 
TestFormatTraceIncludesReplicaStatesAndGates(t *testing.T) {
+	m := New("p1", ModeSyncAll, 1, "r1")
+	trace, err := RunScenario(m, []ScenarioStep{
+		{Name: "bootstrap", Event: Event{Kind: EventBootstrapReplica, ReplicaID: "r1", ReplicaFlushedLSN: 1}},
+		{Name: "write10", Event: Event{Kind: EventWriteCommitted, LSN: 10}},
+		{Name: "barrier10", Event: Event{Kind: EventBarrierCompleted, ReplicaID: "r1", ReplicaFlushedLSN: 10}},
+	})
+	if err != nil {
+		t.Fatalf("run scenario: %v", err)
+	}
+	got := FormatTrace(trace)
+	wantParts := []string{
+		"step=bootstrap",
+		"write=true:all_replicas_sync_eligible",
+		"step=barrier10",
+		"ack=true:all_replicas_durable",
+		"r1=InSync@10",
+	}
+	for _, part := range wantParts {
+		if !strings.Contains(got, part) {
+			t.Fatalf("trace missing %q:\n%s", part, got)
+		}
+	}
+}
diff --git a/sw-block/prototype/volumefsm/recovery.go b/sw-block/prototype/volumefsm/recovery.go
new file mode 100644
index 000000000..03085cb8d
--- /dev/null
+++ b/sw-block/prototype/volumefsm/recovery.go
@@ -0,0 +1,70 @@
+package volumefsm
+
+type RecoveryClass string
+
+const (
+	RecoveryClassWALInline        RecoveryClass = "wal_inline"
+	RecoveryClassExtentReferenced RecoveryClass = "extent_referenced"
+)
+
+type RecoveryDisposition string
+
+const (
+	RecoveryCatchup      RecoveryDisposition = "catchup"
+	RecoveryNeedsRebuild RecoveryDisposition = "needs_rebuild"
+)
+
+type RecoveryDecision struct {
+	Disposition    RecoveryDisposition
+	ReservationID  string
+	ReservationTTL uint64
+	Reason         string
+	Classes        []RecoveryClass
+}
+
+type RecoveryPlanner interface {
+	PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision
+}
+
+// StaticRecoveryPlanner is the minimal default planner for the prototype.
+// If targetLSN is strictly ahead of flushedLSN, reconnect is treated as
+// catch-up; otherwise rebuild is required.
+type StaticRecoveryPlanner struct{} + +func (StaticRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision { + if targetLSN > flushedLSN { + return RecoveryDecision{ + Disposition: RecoveryCatchup, + ReservationID: replicaID + "-resv", + ReservationTTL: 100, + Reason: "static_recoverable_window", + Classes: []RecoveryClass{RecoveryClassWALInline}, + } + } + return RecoveryDecision{ + Disposition: RecoveryNeedsRebuild, + Reason: "static_no_recoverable_window", + } +} + +// ScriptedRecoveryPlanner returns pre-seeded reconnect decisions in order. +// Once the scripted list is exhausted, the last decision is reused. +type ScriptedRecoveryPlanner struct { + Decisions []RecoveryDecision + index int +} + +func (s *ScriptedRecoveryPlanner) PlanReconnect(replicaID string, flushedLSN, targetLSN uint64) RecoveryDecision { + if len(s.Decisions) == 0 { + return RecoveryDecision{ + Disposition: RecoveryNeedsRebuild, + Reason: "scripted_no_decision", + } + } + if s.index >= len(s.Decisions) { + return s.Decisions[len(s.Decisions)-1] + } + d := s.Decisions[s.index] + s.index++ + return d +} diff --git a/sw-block/prototype/volumefsm/scenario.go b/sw-block/prototype/volumefsm/scenario.go new file mode 100644 index 000000000..421b91751 --- /dev/null +++ b/sw-block/prototype/volumefsm/scenario.go @@ -0,0 +1,61 @@ +package volumefsm + +import ( + "fmt" + + fsmv2 "github.com/seaweedfs/seaweedfs/sw-block/prototype/fsmv2" +) + +type ScenarioStep struct { + Name string + Event Event +} + +type ReplicaSnapshot struct { + State fsmv2.State + FlushedLSN uint64 +} + +type Snapshot struct { + Step string + Epoch uint64 + PrimaryID string + PrimaryState PrimaryState + HeadLSN uint64 + WriteGate AdmissionDecision + AckGate AdmissionDecision + Replicas map[string]ReplicaSnapshot +} + +func (m *Model) Snapshot(step string) Snapshot { + replicas := make(map[string]ReplicaSnapshot, len(m.Replicas)) + for id, r := range m.Replicas { + replicas[id] = 
ReplicaSnapshot{ + State: r.FSM.State, + FlushedLSN: r.FSM.ReplicaFlushedLSN, + } + } + return Snapshot{ + Step: step, + Epoch: m.Epoch, + PrimaryID: m.PrimaryID, + PrimaryState: m.PrimaryState, + HeadLSN: m.HeadLSN, + WriteGate: m.WriteAdmission(), + AckGate: m.AckAdmission(m.HeadLSN), + Replicas: replicas, + } +} + +func RunScenario(m *Model, steps []ScenarioStep) ([]Snapshot, error) { + trace := make([]Snapshot, 0, len(steps)+1) + trace = append(trace, m.Snapshot("initial")) + for _, step := range steps { + if err := m.Apply(step.Event); err != nil { + return trace, fmt.Errorf("scenario step %q: %w", step.Name, err) + } + trace = append(trace, m.Snapshot(step.Name)) + } + return trace, nil +} + diff --git a/sw-block/prototype/volumefsm/volumefsm.test.exe b/sw-block/prototype/volumefsm/volumefsm.test.exe new file mode 100644 index 000000000..bf234eaee Binary files /dev/null and b/sw-block/prototype/volumefsm/volumefsm.test.exe differ diff --git a/sw-block/test/README.md b/sw-block/test/README.md new file mode 100644 index 000000000..f08e83a9c --- /dev/null +++ b/sw-block/test/README.md @@ -0,0 +1,17 @@ +# V2 Test Reference + +This directory holds V2-facing test reference material copied from the project test database. 
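The scenario-runner pattern in `prototype/volumefsm/scenario.go` (snapshot the model, apply each named step, snapshot again) can be sketched standalone. This is a simplified illustration with a toy head-LSN counter in place of the real `Model`; `step` and `runScenario` here are hypothetical names, not part of the package:

```go
package main

import "fmt"

// step is a named event, mirroring the shape of ScenarioStep in scenario.go.
type step struct {
	name  string
	delta uint64
}

// runScenario applies each step to a toy head-LSN counter and records a
// snapshot line after every step, plus an initial snapshot before any step,
// in the same style as RunScenario in scenario.go.
func runScenario(head uint64, steps []step) []string {
	trace := []string{fmt.Sprintf("step=initial head=%d", head)}
	for _, s := range steps {
		head += s.delta
		trace = append(trace, fmt.Sprintf("step=%s head=%d", s.name, head))
	}
	return trace
}

func main() {
	for _, line := range runScenario(0, []step{{"write10", 10}, {"write5", 5}}) {
		fmt.Println(line)
	}
}
```

The leading `initial` snapshot mirrors `RunScenario`'s `m.Snapshot("initial")`, which is why a trace always has one more entry than the step list.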
+ +Files: + +- `test_db.md` + - copied from `learn/projects/sw-block/test/test_db.md` + - full block-service test inventory +- `v2_selected.md` + - V2-focused working subset + - includes the currently selected simulator-relevant cases and the 4 Phase 13 V2-boundary tests + +Use: + +- `learn/projects/sw-block/test/test_db.md` as the project-wide source inventory +- `sw-block/test/v2_selected.md` as the active V2 reference/worklist diff --git a/sw-block/test/test_db.md b/sw-block/test/test_db.md new file mode 100644 index 000000000..ebd4a1901 --- /dev/null +++ b/sw-block/test/test_db.md @@ -0,0 +1,1675 @@ +# Block Service Test Database + +Date: 2026-03-27 +Total tests: 1573 + +## Domain Summary + +| Domain | Count | Description | +|--------|-------|-------------| +| other | 10 | Uncategorized | +| testrunner | 179 | sw-test-runner: engine, parser, actions, reporting | +| mode | 82 | Durability mode validation (best_effort, sync_all, sync_quorum) | +| component | 14 | Component tests (real weed processes, localhost) | +| replication | 73 | sync_all protocol, barriers, catch-up, rebuild, shipper | +| operations | 143 | Expand, resize, snapshot, scrub, health score, profiles, presets | +| wal | 81 | WAL writer, entries, admission control, pressure, hardening | +| durability | 3 | Group commit, flusher, fsync batching | +| access | 8 | NVMe/iSCSI adapter, naming (IQN/NQN) | +| nvme | 252 | NVMe-oF target: protocol, controller, I/O, fabric, QA | +| control | 683 | Master registry, failover, heartbeat, assignment, placement, API | +| recovery | 1 | Crash recovery, WAL replay, defensive scan | +| iscsi | 19 | iSCSI target: PDU, login, discovery, SCSI, session, stability | +| fencing | 7 | Epoch, lease, role, write gate | +| engine | 18 | BlockVol core: create, open, write, read, trim, close | + +## Full Test List + + +### access + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | 
`TestAdapterImplementsInterface` | adapter_test.go | 10 | | | | +| 2 | `TestAdapterALUAProvider` | adapter_test.go | 72 | | | | +| 3 | `TestRoleToALUA` | adapter_test.go | 108 | | | | +| 4 | `TestUUIDToNAA` | adapter_test.go | 128 | | | | +| 5 | `TestSanitizeFilename` | naming_test.go | 8 | | | | +| 6 | `TestSanitizeIQN` | naming_test.go | 29 | | | | +| 7 | `TestSanitizeIQN_Truncation` | naming_test.go | 48 | | | | +| 8 | `TestSanitizeConsistency` | naming_test.go | 60 | | | | + +### component + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestMain(m *testing.M) {` | component_test.go | 22 | | | | +| 2 | `TestComponent_VolumeLifecycle` | component_test.go | 69 | | | | +| 3 | `TestComponent_FailoverPromote` | component_test.go | 155 | | | | +| 4 | `TestComponent_ManualPromote` | component_test.go | 227 | | | | +| 5 | `TestComponent_FastReconnect` | component_test.go | 297 | | | | +| 6 | `TestComponent_MultiReplica` | component_test.go | 357 | | | | +| 7 | `TestComponent_ExpandThenFailover` | component_test.go | 437 | | | | +| 8 | `TestComponent_NVMePublicationLifecycle` | component_test.go | 509 | | | | +| 9 | `TestCP13_SyncAll_CreateVerifyMode` | cp13_protocol_test.go | 31 | | | | +| 10 | `TestCP13_BestEffort_SurvivesReplicaDeath` | cp13_protocol_test.go | 94 | | | | +| 11 | `TestCP13_SyncAll_FailoverPromotesReplica` | cp13_protocol_test.go | 159 | | | | +| 12 | `TestCP13_SyncAll_ReplicaRestart_Rejoin` | cp13_protocol_test.go | 228 | | | | +| 13 | `TestCP13_DurabilityModeDefault` | cp13_protocol_test.go | 312 | | | | +| 14 | `TestCP13_ReplicaAddressCanonical` | cp13_protocol_test.go | 343 | | | | + +### control + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestBlockQA_StopBeforeRun_NoPanic` | block_heartbeat_loop_test.go | 22 | | | | +| 2 | `TestBlockQA_DoubleStop` | block_heartbeat_loop_test.go | 41 | | | | +| 3 | 
`TestBlockQA_StopDuringCallback` | block_heartbeat_loop_test.go | 62 | | | | +| 4 | `TestBlockQA_ZeroInterval_Clamped` | block_heartbeat_loop_test.go | 96 | | | | +| 5 | `TestBlockQA_NegativeInterval_Clamped` | block_heartbeat_loop_test.go | 113 | | | | +| 6 | `TestBlockQA_CallbackPanic_Survives` | block_heartbeat_loop_test.go | 132 | | | | +| 7 | `TestBlockQA_SlowCallback_NoAccumulation` | block_heartbeat_loop_test.go | 161 | | | | +| 8 | `TestBlockQA_CallbackSetAfterRun` | block_heartbeat_loop_test.go | 186 | | | | +| 9 | `TestBlockQA_ConcurrentStop` | block_heartbeat_loop_test.go | 217 | | | | +| 10 | `TestBlockCollectorPeriodicTick` | block_heartbeat_loop_test.go | 258 | | | | +| 11 | `TestBlockCollectorStopNoLeak` | block_heartbeat_loop_test.go | 305 | | | | +| 12 | `TestBlockCollectorNilCallback` | block_heartbeat_loop_test.go | 342 | | | | +| 13 | `TestBlockAssign_Success` | block_heartbeat_loop_test.go | 371 | | | | +| 14 | `TestBlockAssign_UnknownVolume` | block_heartbeat_loop_test.go | 394 | | | | +| 15 | `TestBlockAssign_InvalidTransition` | block_heartbeat_loop_test.go | 413 | | | | +| 16 | `TestBlockAssign_EmptyAssignments` | block_heartbeat_loop_test.go | 429 | | | | +| 17 | `TestBlockAssign_NilSource` | block_heartbeat_loop_test.go | 445 | | | | +| 18 | `TestBlockAssign_MixedBatch` | block_heartbeat_loop_test.go | 468 | | | | +| 19 | `TestBlockAssign_RoleNoneIgnored` | block_heartbeat_loop_test.go | 500 | | | | +| 20 | `TestInfoMessageRoundTrip` | block_heartbeat_proto_test.go | 8 | | | | +| 21 | `TestShortInfoRoundTrip` | block_heartbeat_proto_test.go | 27 | | | | +| 22 | `TestAssignmentRoundTrip` | block_heartbeat_proto_test.go | 41 | | | | +| 23 | `TestInfoMessagesSliceRoundTrip` | block_heartbeat_proto_test.go | 55 | | | | +| 24 | `TestAssignmentRoundTripWithReplicaAddrs` | block_heartbeat_proto_test.go | 72 | | | | +| 25 | `TestInfoMessageRoundTripWithReplicaAddrs` | block_heartbeat_proto_test.go | 89 | | | | +| 26 | 
`TestAssignmentFromProtoNilFields` | block_heartbeat_proto_test.go | 110 | | | | +| 27 | `TestInfoMessageFromProtoNilFields` | block_heartbeat_proto_test.go | 124 | | | | +| 28 | `TestLeaseTTLWithReplicaAddrs` | block_heartbeat_proto_test.go | 136 | | | | +| 29 | `TestInfoMessage_ReplicaAddrsRoundTrip` | block_heartbeat_proto_test.go | 155 | | | | +| 30 | `TestAssignmentsToProto` | block_heartbeat_proto_test.go | 171 | | | | +| 31 | `TestNilProtoConversions` | block_heartbeat_proto_test.go | 188 | | | | +| 32 | `TestInfoMessage_HealthScoreRoundTrip` | block_heartbeat_proto_test.go | 206 | | | | +| 33 | `TestInfoMessage_ReplicaDegradedRoundTrip` | block_heartbeat_proto_test.go | 229 | | | | +| 34 | `TestAssignment_MultiReplicaRoundTrip` | block_heartbeat_proto_test.go | 250 | | | | +| 35 | `TestAssignment_PrecedenceRule` | block_heartbeat_proto_test.go | 280 | | | | +| 36 | `TestAssignment_BackwardCompatScalar` | block_heartbeat_proto_test.go | 315 | | | | +| 37 | `TestAssignmentsSlice_MultiReplicaRoundTrip` | block_heartbeat_proto_test.go | 339 | | | | +| 38 | `TestInfoMessage_DurabilityModeRoundTrip` | block_heartbeat_proto_test.go | 373 | | | | +| 39 | `TestInfoMessage_DurabilityModeEmpty_BackwardCompat` | block_heartbeat_proto_test.go | 393 | | | | +| 40 | `TestInfoMessage_NvmeFieldsRoundTrip` | block_heartbeat_proto_test.go | 406 | | | | +| 41 | `TestInfoMessage_NvmeFieldsEmpty_BackwardCompat` | block_heartbeat_proto_test.go | 429 | | | | +| 42 | `TestInfoMessage_HealthFieldsZeroDefault` | block_heartbeat_proto_test.go | 445 | | | | +| 43 | `TestBlockHeartbeat` | block_heartbeat_test.go | 10 | | | | +| 44 | `TestIntegration_FailoverCSIPublish` | integration_block_test.go | 57 | | | | +| 45 | `TestIntegration_RebuildOnRecovery` | integration_block_test.go | 136 | | | | +| 46 | `TestIntegration_AssignmentDeliveryConfirmation` | integration_block_test.go | 237 | | | | +| 47 | `TestIntegration_LeaseAwarePromotion` | integration_block_test.go | 333 | | | | +| 48 | 
`TestIntegration_ReplicaFailureSingleCopy` | integration_block_test.go | 381 | | | | +| 49 | `TestIntegration_TransientDisconnectNoSplitBrain` | integration_block_test.go | 441 | | | | +| 50 | `TestIntegration_FullLifecycle` | integration_block_test.go | 507 | | | | +| 51 | `TestIntegration_DoubleFailover` | integration_block_test.go | 624 | | | | +| 52 | `TestIntegration_MultiVolumeFailoverRebuild` | integration_block_test.go | 700 | | | | +| 53 | `TestQueue_EnqueuePeek` | master_block_assignment_queue_test.go | 14 | | | | +| 54 | `TestQueue_PeekEmpty` | master_block_assignment_queue_test.go | 23 | | | | +| 55 | `TestQueue_EnqueueBatch` | master_block_assignment_queue_test.go | 31 | | | | +| 56 | `TestQueue_PeekDoesNotRemove` | master_block_assignment_queue_test.go | 42 | | | | +| 57 | `TestQueue_PeekDoesNotAffectOtherServers` | master_block_assignment_queue_test.go | 52 | | | | +| 58 | `TestQueue_ConcurrentEnqueuePeek` | master_block_assignment_queue_test.go | 65 | | | | +| 59 | `TestQueue_Pending` | master_block_assignment_queue_test.go | 83 | | | | +| 60 | `TestQueue_MultipleEnqueue` | master_block_assignment_queue_test.go | 95 | | | | +| 61 | `TestQueue_ConfirmRemovesMatching` | master_block_assignment_queue_test.go | 105 | | | | +| 62 | `TestQueue_ConfirmFromHeartbeat_PrunesConfirmed` | master_block_assignment_queue_test.go | 125 | | | | +| 63 | `TestQueue_PeekPrunesStaleEpochs` | master_block_assignment_queue_test.go | 146 | | | | +| 64 | `TestFailover_PrimaryDies_ReplicaPromoted` | master_block_failover_test.go | 74 | | | | +| 65 | `TestFailover_ReplicaDies_NoAction` | master_block_failover_test.go | 89 | | | | +| 66 | `TestFailover_NoReplica_NoPromotion` | master_block_failover_test.go | 102 | | | | +| 67 | `TestFailover_EpochBumped` | master_block_failover_test.go | 127 | | | | +| 68 | `TestFailover_RegistryUpdated` | master_block_failover_test.go | 139 | | | | +| 69 | `TestFailover_AssignmentQueued` | master_block_failover_test.go | 157 | | | | +| 70 | 
`TestFailover_MultipleVolumes` | master_block_failover_test.go | 183 | | | | +| 71 | `TestFailover_LeaseNotExpired_DeferredPromotion` | master_block_failover_test.go | 200 | | | | +| 72 | `TestFailover_LeaseExpired_ImmediatePromotion` | master_block_failover_test.go | 242 | | | | +| 73 | `TestRebuild_PendingRecordedOnFailover` | master_block_failover_test.go | 260 | | | | +| 74 | `TestRebuild_ReconnectTriggersDrain` | master_block_failover_test.go | 278 | | | | +| 75 | `TestRebuild_StaleAndRebuildingAssignments` | master_block_failover_test.go | 296 | | | | +| 76 | `TestRebuild_VolumeDeletedWhileDown` | master_block_failover_test.go | 317 | | | | +| 77 | `TestRebuild_PendingClearedAfterDrain` | master_block_failover_test.go | 338 | | | | +| 78 | `TestRebuild_NoPendingRebuilds_NoAction` | master_block_failover_test.go | 355 | | | | +| 79 | `TestRebuild_MultipleVolumes` | master_block_failover_test.go | 367 | | | | +| 80 | `TestRebuild_RegistryUpdatedWithNewReplica` | master_block_failover_test.go | 388 | | | | +| 81 | `TestRebuild_AssignmentContainsRebuildAddr` | master_block_failover_test.go | 406 | | | | +| 82 | `TestFailover_TransientDisconnect_NoPromotion` | master_block_failover_test.go | 454 | | | | +| 83 | `TestFailover_NoPrimary_NoAction` | master_block_failover_test.go | 489 | | | | +| 84 | `TestLifecycle_CreateFailoverRebuild` | master_block_failover_test.go | 515 | | | | +| 85 | `TestRF3_PrimaryDies_BestReplicaPromoted` | master_block_failover_test.go | 622 | | | | +| 86 | `TestRF3_OneReplicaDies_PrimaryUnchanged` | master_block_failover_test.go | 651 | | | | +| 87 | `TestRF3_RecoverRebuildsDeadReplica` | master_block_failover_test.go | 685 | | | | +| 88 | `TestRF3_OtherReplicasSurvivePromotion` | master_block_failover_test.go | 718 | | | | +| 89 | `TestRF2_Unchanged_AfterCP82` | master_block_failover_test.go | 761 | | | | +| 90 | `TestRF3_AllReplicasDead_NoPromotion` | master_block_failover_test.go | 788 | | | | +| 91 | `TestRF3_LeaseDeferred_Promotion` 
| master_block_failover_test.go | 815 | | | | +| 92 | `TestRF3_CancelDeferredOnReconnect` | master_block_failover_test.go | 865 | | | | +| 93 | `TestT2_OrphanedPrimary_ReplicaReconnect_Promotes` | master_block_failover_test.go | 922 | | | | +| 94 | `TestT2_PrimaryAlive_NoPromotion` | master_block_failover_test.go | 943 | | | | +| 95 | `TestT2_MultipleOrphanedVolumes` | master_block_failover_test.go | 960 | | | | +| 96 | `TestT2_RepeatedHeartbeats_NoDuplicatePromotion` | master_block_failover_test.go | 996 | | | | +| 97 | `TestT2_OrphanedPrimary_LeaseNotExpired_DefersPromotion` | master_block_failover_test.go | 1021 | | | | +| 98 | `TestT3_DeferredTimer_VolumeDeleted_NoPromotion` | master_block_failover_test.go | 1075 | | | | +| 99 | `TestT3_DeferredTimer_EpochChanged_NoPromotion` | master_block_failover_test.go | 1108 | | | | +| 100 | `TestT4_RebuildEmptyAddr_StillQueued` | master_block_failover_test.go | 1152 | | | | +| 101 | `TestHealthState_Healthy_PrimaryWithReplicas` | master_block_observability_test.go | 19 | | | | +| 102 | `TestHealthState_Unsafe_NotPrimary` | master_block_observability_test.go | 33 | | | | +| 103 | `TestHealthState_Unsafe_StrictBelowRequired` | master_block_observability_test.go | 44 | | | | +| 104 | `TestHealthState_Rebuilding` | master_block_observability_test.go | 57 | | | | +| 105 | `TestHealthState_Degraded_BelowDesired` | master_block_observability_test.go | 71 | | | | +| 106 | `TestHealthState_Degraded_FlagSet` | master_block_observability_test.go | 86 | | | | +| 107 | `TestHealthState_WithLiveness_PrimaryDead` | master_block_observability_test.go | 99 | | | | +| 108 | `TestHealthState_BestEffort_ZeroReplicas_NotUnsafe` | master_block_observability_test.go | 111 | | | | +| 109 | `TestHealthState_RF1_NoReplicas_Healthy` | master_block_observability_test.go | 125 | | | | +| 110 | `TestClusterHealthSummary` | master_block_observability_test.go | 139 | | | | +| 111 | `TestBlockStatusHandler_IncludesHealthCounts` | 
master_block_observability_test.go | 180 | | | | +| 112 | `TestEntryToVolumeInfo_IncludesHealthState` | master_block_observability_test.go | 220 | | | | +| 113 | `TestEntryToVolumeInfo_PrimaryDead_Unsafe` | master_block_observability_test.go | 235 | | | | +| 114 | `TestAlertRules_FileExists` | master_block_observability_test.go | 257 | | | | +| 115 | `TestAlertRules_ReferencesRealMetrics` | master_block_observability_test.go | 275 | | | | +| 116 | `TestDashboard_FileExists` | master_block_observability_test.go | 299 | | | | +| 117 | `TestDashboard_QueriesReferenceRealMetrics` | master_block_observability_test.go | 324 | | | | +| 118 | `TestEvaluateBlockPlacement_SingleCandidate_RF1` | master_block_plan_test.go | 19 | | | | +| 119 | `TestEvaluateBlockPlacement_LeastLoaded` | master_block_plan_test.go | 38 | | | | +| 120 | `TestEvaluateBlockPlacement_DeterministicTiebreak` | master_block_plan_test.go | 60 | | | | +| 121 | `TestEvaluateBlockPlacement_RF2_PrimaryAndReplica` | master_block_plan_test.go | 79 | | | | +| 122 | `TestEvaluateBlockPlacement_RF3_AllSelected` | master_block_plan_test.go | 97 | | | | +| 123 | `TestEvaluateBlockPlacement_RF_ExceedsServers` | master_block_plan_test.go | 115 | | | | +| 124 | `TestEvaluateBlockPlacement_SingleServer_BestEffort_RF2` | master_block_plan_test.go | 142 | | | | +| 125 | `TestEvaluateBlockPlacement_SingleServer_SyncAll_RF2` | master_block_plan_test.go | 167 | | | | +| 126 | `TestEvaluateBlockPlacement_NoServers` | master_block_plan_test.go | 188 | | | | +| 127 | `TestEvaluateBlockPlacement_DiskTypeMismatch` | master_block_plan_test.go | 207 | | | | +| 128 | `TestEvaluateBlockPlacement_InsufficientSpace` | master_block_plan_test.go | 229 | | | | +| 129 | `TestEvaluateBlockPlacement_UnknownCapacity_Allowed` | master_block_plan_test.go | 246 | | | | +| 130 | `TestBlockVolumePlanHandler_HappyPath` | master_block_plan_test.go | 288 | | | | +| 131 | `TestBlockVolumePlanHandler_WithPreset` | master_block_plan_test.go | 321 | | | 
| +| 132 | `TestBlockVolumePlanHandler_NoServers` | master_block_plan_test.go | 348 | | | | +| 133 | `TestRegistry_RegisterLookup` | master_block_registry_test.go | 14 | | | | +| 134 | `TestRegistry_Unregister` | master_block_registry_test.go | 42 | | | | +| 135 | `TestRegistry_DuplicateRegister` | master_block_registry_test.go | 58 | | | | +| 136 | `TestRegistry_ListByServer` | master_block_registry_test.go | 67 | | | | +| 137 | `TestRegistry_UpdateFullHeartbeat` | master_block_registry_test.go | 87 | | | | +| 138 | `TestRegistry_UpdateDeltaHeartbeat` | master_block_registry_test.go | 116 | | | | +| 139 | `TestRegistry_PendingToActive` | master_block_registry_test.go | 142 | | | | +| 140 | `TestRegistry_PickServer` | master_block_registry_test.go | 160 | | | | +| 141 | `TestRegistry_PickServerEmpty` | master_block_registry_test.go | 176 | | | | +| 142 | `TestRegistry_InflightLock` | master_block_registry_test.go | 184 | | | | +| 143 | `TestRegistry_UnmarkDeadServer` | master_block_registry_test.go | 212 | | | | +| 144 | `TestRegistry_FullHeartbeatUpdatesSizeBytes` | master_block_registry_test.go | 234 | | | | +| 145 | `TestRegistry_ConcurrentAccess` | master_block_registry_test.go | 252 | | | | +| 146 | `TestRegistry_SetReplica` | master_block_registry_test.go | 297 | | | | +| 147 | `TestRegistry_ClearReplica` | master_block_registry_test.go | 327 | | | | +| 148 | `TestRegistry_SetReplicaNotFound` | master_block_registry_test.go | 352 | | | | +| 149 | `TestRegistry_SwapPrimaryReplica` | master_block_registry_test.go | 360 | | | | +| 150 | `TestFullHeartbeat_UpdatesReplicaAddrs` | master_block_registry_test.go | 404 | | | | +| 151 | `TestRegistry_AddReplica` | master_block_registry_test.go | 443 | | | | +| 152 | `TestRegistry_AddReplica_TwoRF3` | master_block_registry_test.go | 476 | | | | +| 153 | `TestRegistry_AddReplica_Upsert` | master_block_registry_test.go | 496 | | | | +| 154 | `TestRegistry_RemoveReplica` | master_block_registry_test.go | 512 | | | | +| 155 
| `TestRegistry_PromoteBestReplica_PicksHighest` | master_block_registry_test.go | 540 | | | | +| 156 | `TestRegistry_PromoteBestReplica_NoReplica` | master_block_registry_test.go | 586 | | | | +| 157 | `TestRegistry_PromoteBestReplica_TiebreakByLSN` | master_block_registry_test.go | 596 | | | | +| 158 | `TestRegistry_PromoteBestReplica_KeepsOthers` | master_block_registry_test.go | 633 | | | | +| 159 | `TestRegistry_BackwardCompatAccessors` | master_block_registry_test.go | 664 | | | | +| 160 | `TestRegistry_ReplicaFactorDefault` | master_block_registry_test.go | 696 | | | | +| 161 | `TestRegistry_FullHeartbeat_UpdatesHealthScore` | master_block_registry_test.go | 714 | | | | +| 162 | `TestRegistry_ReplicaHeartbeat_DoesNotDeleteVolume` | master_block_registry_test.go | 745 | | | | +| 163 | `TestRegistry_ReplicaHeartbeat_StaleReplicaRemoved` | master_block_registry_test.go | 773 | | | | +| 164 | `TestRegistry_ReplicaHeartbeat_ReconstructsAfterRestart` | master_block_registry_test.go | 803 | | | | +| 165 | `TestRegistry_PromoteBestReplica_StaleHeartbeatIneligible` | master_block_registry_test.go | 847 | | | | +| 166 | `TestRegistry_PromoteBestReplica_WALLagIneligible` | master_block_registry_test.go | 874 | | | | +| 167 | `TestRegistry_PromoteBestReplica_RebuildingIneligible` | master_block_registry_test.go | 902 | | | | +| 168 | `TestRegistry_PromoteBestReplica_EligibilityFiltersCorrectly` | master_block_registry_test.go | 930 | | | | +| 169 | `TestRegistry_PromoteBestReplica_ConfigurableTolerance` | master_block_registry_test.go | 971 | | | | +| 170 | `TestRegistry_PromoteBestReplica_DeadServerIneligible` | master_block_registry_test.go | 1012 | | | | +| 171 | `TestRegistry_PromoteBestReplica_DeadSkipped_AlivePromoted` | master_block_registry_test.go | 1044 | | | | +| 172 | `TestRegistry_EvaluatePromotion_Basic` | master_block_registry_test.go | 1075 | | | | +| 173 | `TestRegistry_EvaluatePromotion_AllRejected` | master_block_registry_test.go | 1112 | | | | +| 174 
| `TestRegistry_EvaluatePromotion_NotFound` | master_block_registry_test.go | 1144 | | | | +| 175 | `TestRegistry_PromoteBestReplica_NoHeartbeatIneligible` | master_block_registry_test.go | 1153 | | | | +| 176 | `TestRegistry_PromoteBestReplica_UnsetRoleIneligible` | master_block_registry_test.go | 1185 | | | | +| 177 | `TestRegistry_PromoteBestReplica_ClearsRebuildAddr` | master_block_registry_test.go | 1217 | | | | +| 178 | `TestRegistry_LeaseGrants_PrimaryOnly` | master_block_registry_test.go | 1243 | | | | +| 179 | `TestRegistry_LeaseGrants_PendingExcluded` | master_block_registry_test.go | 1297 | | | | +| 180 | `TestRegistry_LeaseGrants_InactiveExcluded` | master_block_registry_test.go | 1332 | | | | +| 181 | `TestRegistry_LeaseGrants_UnknownServer` | master_block_registry_test.go | 1352 | | | | +| 182 | `TestRegistry_IsBlockCapable` | master_block_registry_test.go | 1364 | | | | +| 183 | `TestRegistry_VolumesWithDeadPrimary_Basic` | master_block_registry_test.go | 1381 | | | | +| 184 | `TestRegistry_VolumesWithDeadPrimary_PrimaryServer_NotIncluded` | master_block_registry_test.go | 1407 | | | | +| 185 | `TestRegistry_EvaluatePromotion_PrimaryDead_StillShowsCandidate` | master_block_registry_test.go | 1425 | | | | +| 186 | `TestRegistry_ManualPromote_AutoTarget` | master_block_registry_test.go | 1460 | | | | +| 187 | `TestRegistry_ManualPromote_SpecificTarget` | master_block_registry_test.go | 1493 | | | | +| 188 | `TestRegistry_ManualPromote_TargetNotFound` | master_block_registry_test.go | 1523 | | | | +| 189 | `TestRegistry_ManualPromote_PrimaryAlive_Rejected` | master_block_registry_test.go | 1545 | | | | +| 190 | `TestRegistry_ManualPromote_Force_StaleHeartbeat` | master_block_registry_test.go | 1573 | | | | +| 191 | `TestRegistry_ManualPromote_Force_StillRejectsDeadServer` | master_block_registry_test.go | 1603 | | | | +| 192 | `TestMasterRestart_HigherEpochWins` | master_block_registry_test.go | 1626 | | | | +| 193 | 
`TestMasterRestart_LowerEpochBecomesReplica` | master_block_registry_test.go | 1660 | | | | +| 194 | `TestMasterRestart_SameEpoch_HigherLSNWins` | master_block_registry_test.go | 1685 | | | | +| 195 | `TestMasterRestart_SameEpoch_SameLSN_ExistingWins` | master_block_registry_test.go | 1707 | | | | +| 196 | `TestMasterRestart_ReplicaHeartbeat_AddedCorrectly` | master_block_registry_test.go | 1729 | | | | +| 197 | `TestMasterRestart_SameEpoch_RoleTrusted` | master_block_registry_test.go | 1751 | | | | +| 198 | `TestMasterRestart_DuplicateReplicaHeartbeat_NoDuplicate` | master_block_registry_test.go | 1774 | | | | +| 199 | `TestLookup_ReturnsCopy` | master_block_registry_test.go | 1807 | | | | +| 200 | `TestLookup_ReplicaSliceCopy` | master_block_registry_test.go | 1838 | | | | +| 201 | `TestListAll_ReturnsCopies` | master_block_registry_test.go | 1863 | | | | +| 202 | `TestUpdateEntry_MutatesRegistry` | master_block_registry_test.go | 1878 | | | | +| 203 | `TestUpdateEntry_NotFound` | master_block_registry_test.go | 1894 | | | | +| 204 | `TestMaster_CreateBlockVolume` | master_grpc_server_block_test.go | 37 | | | | +| 205 | `TestMaster_CreateIdempotent` | master_grpc_server_block_test.go | 69 | | | | +| 206 | `TestMaster_CreateIdempotentSizeMismatch` | master_grpc_server_block_test.go | 94 | | | | +| 207 | `TestMaster_CreateNoServers` | master_grpc_server_block_test.go | 125 | | | | +| 208 | `TestMaster_CreateVSFailure_Retry` | master_grpc_server_block_test.go | 138 | | | | +| 209 | `TestMaster_CreateVSFailure_Cleanup` | master_grpc_server_block_test.go | 171 | | | | +| 210 | `TestMaster_CreateConcurrentSameName` | master_grpc_server_block_test.go | 193 | | | | +| 211 | `TestMaster_DeleteBlockVolume` | master_grpc_server_block_test.go | 239 | | | | +| 212 | `TestMaster_DeleteNotFound` | master_grpc_server_block_test.go | 263 | | | | +| 213 | `TestMaster_CreateWithReplica` | master_grpc_server_block_test.go | 274 | | | | +| 214 | 
`TestMaster_CreateSingleServer_NoReplica` | master_grpc_server_block_test.go | 328 | | | | +| 215 | `TestMaster_CreateReplica_SecondFails_SingleCopy` | master_grpc_server_block_test.go | 364 | | | | +| 216 | `TestMaster_CreateEnqueuesAssignments` | master_grpc_server_block_test.go | 402 | | | | +| 217 | `TestMaster_CreateSingleCopy_NoReplicaAssignment` | master_grpc_server_block_test.go | 441 | | | | +| 218 | `TestMaster_LookupReturnsReplicaServer` | master_grpc_server_block_test.go | 463 | | | | +| 219 | `TestMaster_CreateBlockSnapshot` | master_grpc_server_block_test.go | 498 | | | | +| 220 | `TestMaster_CreateBlockSnapshot_VolumeNotFound` | master_grpc_server_block_test.go | 524 | | | | +| 221 | `TestMaster_DeleteBlockSnapshot` | master_grpc_server_block_test.go | 534 | | | | +| 222 | `TestMaster_ListBlockSnapshots` | master_grpc_server_block_test.go | 553 | | | | +| 223 | `TestMaster_ExpandBlockVolume` | master_grpc_server_block_test.go | 578 | | | | +| 224 | `TestMaster_ExpandBlockVolume_VSFailure` | master_grpc_server_block_test.go | 609 | | | | +| 225 | `TestMaster_LookupBlockVolume` | master_grpc_server_block_test.go | 634 | | | | +| 226 | `TestMaster_CreateRF3_ThreeServers` | master_grpc_server_block_test.go | 691 | | | | +| 227 | `TestMaster_CreateRF3_TwoServers` | master_grpc_server_block_test.go | 728 | | | | +| 228 | `TestMaster_CreateRF2_Unchanged` | master_grpc_server_block_test.go | 759 | | | | +| 229 | `TestMaster_DeleteRF3_DeletesAllReplicas` | master_grpc_server_block_test.go | 785 | | | | +| 230 | `TestMaster_ExpandRF3_ExpandsAllReplicas` | master_grpc_server_block_test.go | 818 | | | | +| 231 | `TestMaster_CreateRF3_AssignmentsIncludeReplicaAddrs` | master_grpc_server_block_test.go | 860 | | | | +| 232 | `TestMaster_CreateResponse_IncludesReplicaServers` | master_grpc_server_block_test.go | 890 | | | | +| 233 | `TestMaster_LookupResponse_IncludesReplicaFields` | master_grpc_server_block_test.go | 917 | | | | +| 234 | 
`TestMaster_CreateIdempotent_IncludesReplicaServers` | master_grpc_server_block_test.go | 946 | | | | +| 235 | `TestMaster_LookupResponse_ReplicaFactorDefault` | master_grpc_server_block_test.go | 971 | | | | +| 236 | `TestMaster_ResponseConsistency_ReplicaServerVsReplicaServers` | master_grpc_server_block_test.go | 997 | | | | +| 237 | `TestMaster_NvmeFieldsFlowThroughCreateAndLookup` | master_grpc_server_block_test.go | 1028 | | | | +| 238 | `TestMaster_NoNvmeFieldsWhenDisabled` | master_grpc_server_block_test.go | 1084 | | | | +| 239 | `TestMaster_PromotionCopiesNvmeFields` | master_grpc_server_block_test.go | 1114 | | | | +| 240 | `TestMaster_ExpandCoordinated_Success` | master_grpc_server_block_test.go | 1219 | | | | +| 241 | `TestMaster_ExpandCoordinated_PrepareFailure_Cancels` | master_grpc_server_block_test.go | 1268 | | | | +| 242 | `TestMaster_ExpandCoordinated_Standalone_DirectCommit` | master_grpc_server_block_test.go | 1317 | | | | +| 243 | `TestMaster_ExpandCoordinated_ConcurrentRejected` | master_grpc_server_block_test.go | 1345 | | | | +| 244 | `TestMaster_ExpandCoordinated_Idempotent` | master_grpc_server_block_test.go | 1395 | | | | +| 245 | `TestMaster_ExpandCoordinated_CommitFailure_MarksInconsistent` | master_grpc_server_block_test.go | 1419 | | | | +| 246 | `TestMaster_ExpandCoordinated_HeartbeatSuppressedAfterPartialCommit` | master_grpc_server_block_test.go | 1497 | | | | +| 247 | `TestMaster_ExpandCoordinated_FailoverDuringPrepare` | master_grpc_server_block_test.go | 1557 | | | | +| 248 | `TestMaster_ExpandCoordinated_RestartRecovery` | master_grpc_server_block_test.go | 1610 | | | | +| 249 | `TestMaster_ExpandCoordinated_B09_ReReadsEntryAfterLock` | master_grpc_server_block_test.go | 1662 | | | | +| 250 | `TestMaster_ExpandCoordinated_B10_HeartbeatDoesNotDeleteDuringExpand` | master_grpc_server_block_test.go | 1747 | | | | +| 251 | `TestBlockVolumeCreateHandler` | master_server_handlers_block_test.go | 48 | | | | +| 252 | 
`TestBlockVolumeListHandler` | master_server_handlers_block_test.go | 76 | | | | +| 253 | `TestBlockVolumeLookupHandler` | master_server_handlers_block_test.go | 110 | | | | +| 254 | `TestBlockVolumeDeleteHandler` | master_server_handlers_block_test.go | 157 | | | | +| 255 | `TestBlockAssignHandler` | master_server_handlers_block_test.go | 183 | | | | +| 256 | `TestBlockServersHandler` | master_server_handlers_block_test.go | 230 | | | | +| 257 | `TestListAll` | master_server_handlers_block_test.go | 254 | | | | +| 258 | `TestServerSummaries` | master_server_handlers_block_test.go | 269 | | | | +| 259 | `TestQABlockAssignmentProcessing` | qa_block_assign_test.go | 24 | | | | +| 260 | `TestQABlockAssignmentProcessing` | qa_block_assign_test.go | 24 | | | | +| 261 | `TestQA_CP11B1_DatabasePreset_Defaults` | qa_block_cp11b1_adversarial_test.go | 64 | | | | +| 262 | `TestQA_CP11B1_DatabasePreset_Defaults` | qa_block_cp11b1_adversarial_test.go | 64 | | | | +| 263 | `TestQA_CP11B1_OverridePrecedence_DurabilityWins` | qa_block_cp11b1_adversarial_test.go | 96 | | | | +| 264 | `TestQA_CP11B1_OverridePrecedence_DurabilityWins` | qa_block_cp11b1_adversarial_test.go | 96 | | | | +| 265 | `TestQA_CP11B1_InvalidPreset_Rejected` | qa_block_cp11b1_adversarial_test.go | 118 | | | | +| 266 | `TestQA_CP11B1_InvalidPreset_Rejected` | qa_block_cp11b1_adversarial_test.go | 118 | | | | +| 267 | `TestQA_CP11B1_SyncQuorum_RF2_Rejected` | qa_block_cp11b1_adversarial_test.go | 129 | | | | +| 268 | `TestQA_CP11B1_SyncQuorum_RF2_Rejected` | qa_block_cp11b1_adversarial_test.go | 129 | | | | +| 269 | `TestQA_CP11B1_NVMePref_NoNVMe_Warning` | qa_block_cp11b1_adversarial_test.go | 142 | | | | +| 270 | `TestQA_CP11B1_NVMePref_NoNVMe_Warning` | qa_block_cp11b1_adversarial_test.go | 142 | | | | +| 271 | `TestQA_CP11B1_NoPreset_BackwardCompat` | qa_block_cp11b1_adversarial_test.go | 161 | | | | +| 272 | `TestQA_CP11B1_NoPreset_BackwardCompat` | qa_block_cp11b1_adversarial_test.go | 161 | | | | +| 273 
| `TestQA_CP11B1_ResolveHandler_HTTP` | qa_block_cp11b1_adversarial_test.go | 183 | | | | +| 274 | `TestQA_CP11B1_ResolveHandler_HTTP` | qa_block_cp11b1_adversarial_test.go | 183 | | | | +| 275 | `TestQA_CP11B1_CreateWithPreset_StoresPreset` | qa_block_cp11b1_adversarial_test.go | 216 | | | | +| 276 | `TestQA_CP11B1_CreateWithPreset_StoresPreset` | qa_block_cp11b1_adversarial_test.go | 216 | | | | +| 277 | `TestQA_CP11B1_RF_ExceedsServers_Warning` | qa_block_cp11b1_adversarial_test.go | 251 | | | | +| 278 | `TestQA_CP11B1_RF_ExceedsServers_Warning` | qa_block_cp11b1_adversarial_test.go | 251 | | | | +| 279 | `TestQA_CP11B1_StableOutputFields` | qa_block_cp11b1_adversarial_test.go | 270 | | | | +| 280 | `TestQA_CP11B1_StableOutputFields` | qa_block_cp11b1_adversarial_test.go | 270 | | | | +| 281 | `TestQA_CP11B1_CreateResolve_Parity` | qa_block_cp11b1_adversarial_test.go | 318 | | | | +| 282 | `TestQA_CP11B1_CreateResolve_Parity` | qa_block_cp11b1_adversarial_test.go | 318 | | | | +| 283 | `TestQA_CP11B1_NVMeFromHeartbeat_FreshCluster` | qa_block_cp11b1_adversarial_test.go | 367 | | | | +| 284 | `TestQA_CP11B1_NVMeFromHeartbeat_FreshCluster` | qa_block_cp11b1_adversarial_test.go | 367 | | | | +| 285 | `TestQA_CP11B1_NVMeLostAfterUnmark` | qa_block_cp11b1_adversarial_test.go | 392 | | | | +| 286 | `TestQA_CP11B1_NVMeLostAfterUnmark` | qa_block_cp11b1_adversarial_test.go | 392 | | | | +| 287 | `TestQA_CP11B1_MarkBlockCapable_PreservesNVMe` | qa_block_cp11b1_adversarial_test.go | 410 | | | | +| 288 | `TestQA_CP11B1_MarkBlockCapable_PreservesNVMe` | qa_block_cp11b1_adversarial_test.go | 410 | | | | +| 289 | `TestQA_CP11B1_InvalidDurabilityString_Rejected` | qa_block_cp11b1_adversarial_test.go | 425 | | | | +| 290 | `TestQA_CP11B1_InvalidDurabilityString_Rejected` | qa_block_cp11b1_adversarial_test.go | 425 | | | | +| 291 | `TestQA_CP11B1_OverrideDiskType` | qa_block_cp11b1_adversarial_test.go | 438 | | | | +| 292 | `TestQA_CP11B1_OverrideDiskType` | 
qa_block_cp11b1_adversarial_test.go | 438 | | | | +| 293 | `TestQA_CP11B1_AllOverridesAtOnce` | qa_block_cp11b1_adversarial_test.go | 458 | | | | +| 294 | `TestQA_CP11B1_AllOverridesAtOnce` | qa_block_cp11b1_adversarial_test.go | 458 | | | | +| 295 | `TestQA_CP11B1_PresetOverride_SyncAll_RF1_Warning` | qa_block_cp11b1_adversarial_test.go | 485 | | | | +| 296 | `TestQA_CP11B1_PresetOverride_SyncAll_RF1_Warning` | qa_block_cp11b1_adversarial_test.go | 485 | | | | +| 297 | `TestQA_CP11B1_ZeroServerCount_NoRFWarning` | qa_block_cp11b1_adversarial_test.go | 505 | | | | +| 298 | `TestQA_CP11B1_ZeroServerCount_NoRFWarning` | qa_block_cp11b1_adversarial_test.go | 505 | | | | +| 299 | `TestQA_CP11B1_ResolveEndpoint_InvalidPreset_Returns200WithErrors` | qa_block_cp11b1_adversarial_test.go | 520 | | | | +| 300 | `TestQA_CP11B1_ResolveEndpoint_InvalidPreset_Returns200WithErrors` | qa_block_cp11b1_adversarial_test.go | 520 | | | | +| 301 | `TestQA_CP11B1_CreateHandler_InvalidPreset_Returns400` | qa_block_cp11b1_adversarial_test.go | 543 | | | | +| 302 | `TestQA_CP11B1_CreateHandler_InvalidPreset_Returns400` | qa_block_cp11b1_adversarial_test.go | 543 | | | | +| 303 | `TestQA_CP11B1_ConcurrentResolve_NoPanic` | qa_block_cp11b1_adversarial_test.go | 559 | | | | +| 304 | `TestQA_CP11B1_ConcurrentResolve_NoPanic` | qa_block_cp11b1_adversarial_test.go | 559 | | | | +| 305 | `TestQA_CP11B2_ConcurrentPlanCalls` | qa_block_cp11b2_adversarial_test.go | 22 | | | | +| 306 | `TestQA_CP11B2_ConcurrentPlanCalls` | qa_block_cp11b2_adversarial_test.go | 22 | | | | +| 307 | `TestQA_CP11B2_NoBlockCapableServers` | qa_block_cp11b2_adversarial_test.go | 49 | | | | +| 308 | `TestQA_CP11B2_NoBlockCapableServers` | qa_block_cp11b2_adversarial_test.go | 49 | | | | +| 309 | `TestQA_CP11B2_RF_ExceedsAvailable` | qa_block_cp11b2_adversarial_test.go | 76 | | | | +| 310 | `TestQA_CP11B2_RF_ExceedsAvailable` | qa_block_cp11b2_adversarial_test.go | 76 | | | | +| 311 | 
`TestQA_CP11B2_PlanThenCreate_PolicyConsistency` | qa_block_cp11b2_adversarial_test.go | 108 | | | |
+| 313 | `TestQA_CP11B2_PlanThenCreate_OrderedCandidateParity` | qa_block_cp11b2_adversarial_test.go | 148 | | | |
+| 315 | `TestQA_CP11B2_PlanThenCreate_ReplicaOrderParity` | qa_block_cp11b2_adversarial_test.go | 197 | | | |
+| 317 | `TestQA_CP11B2_Create_FallbackOnRPCFailure` | qa_block_cp11b2_adversarial_test.go | 251 | | | |
+| 319 | `TestQA_CP11B2_PlanIsReadOnly` | qa_block_cp11b2_adversarial_test.go | 307 | | | |
+| 321 | `TestQA_CP11B2_RejectionReasonStability` | qa_block_cp11b2_adversarial_test.go | 337 | | | |
+| 323 | `TestQA_CP11B2_DeterministicOrder_MultipleInvocations` | qa_block_cp11b2_adversarial_test.go | 368 | | | |
+| 325 | `TestQA_CP11B2_RF0_BehavesAsRF1` | qa_block_cp11b2_adversarial_test.go | 409 | | | |
+| 327 | `TestQA_CP11B2_RF1_NoReplicaNeeded` | qa_block_cp11b2_adversarial_test.go | 427 | | | |
+| 329 | `TestQA_CP11B2_AllRejected_DiskType` | qa_block_cp11b2_adversarial_test.go | 444 | | | |
+| 331 | `TestQA_CP11B2_AllRejected_Capacity` | qa_block_cp11b2_adversarial_test.go | 468 | | | |
+| 333 | `TestQA_CP11B2_MixedRejections` | qa_block_cp11b2_adversarial_test.go | 483 | | | |
+| 335 | `TestQA_CP11B2_SyncQuorum_RF3_FilteredTo2` | qa_block_cp11b2_adversarial_test.go | 510 | | | |
+| 337 | `TestQA_CP11B2_UnknownDiskType_PassesFilter` | qa_block_cp11b2_adversarial_test.go | 532 | | | |
+| 339 | `TestQA_CP11B2_LargeCandidateList` | qa_block_cp11b2_adversarial_test.go | 548 | | | |
+| 341 | `TestQA_CP11B2_FailedPrimary_TriedAsReplica` | qa_block_cp11b2_adversarial_test.go | 573 | | | |
+| 343 | `TestQA_CP11B2_PlanWithInvalidPreset` | qa_block_cp11b2_adversarial_test.go | 605 | | | |
+| 345 | `TestQA_T1_AllGatesFail_SingleReplica` | qa_block_cp11b3_adversarial_test.go | 26 | | | |
+| 347 | `TestQA_T1_WALLag_ExactBoundary` | qa_block_cp11b3_adversarial_test.go | 56 | | | |
+| 349 | `TestQA_T1_ZeroLeaseTTL_FallbackFreshness` | qa_block_cp11b3_adversarial_test.go | 93 | | | |
+| 351 | `TestQA_T1_RF3_MixedGates_OnlyHealthyPromoted` | qa_block_cp11b3_adversarial_test.go | 118 | | | |
+| 353 | `TestQA_T1_EvaluatePromotion_ReadOnly` | qa_block_cp11b3_adversarial_test.go | 155 | | | |
+| 355 | `TestQA_T1_ConcurrentEvaluateAndPromote` | qa_block_cp11b3_adversarial_test.go | 190 | | | |
+| 357 | `TestQA_T1_PromotionDuringExpand` | qa_block_cp11b3_adversarial_test.go | 233 | | | |
+| 359 | `TestQA_T1_DoublePromotion_SecondFails` | qa_block_cp11b3_adversarial_test.go | 261 | | | |
+| 361 | `TestQA_T2_OrphanAndFailover_NoDoublePromotion` | qa_block_cp11b3_adversarial_test.go | 292 | | | |
+| 363 | `TestQA_T2_OrphanButReplicaNotPromotable` | qa_block_cp11b3_adversarial_test.go | 317 | | | |
+| 365 | `TestQA_T2_ConcurrentReevaluation_NoPanic` | qa_block_cp11b3_adversarial_test.go | 351 | | | |
+| 367 | `TestQA_T2_HeartbeatOrphanCheck_NoVolumes_NoOp` | qa_block_cp11b3_adversarial_test.go | 377 | | | |
+| 369 | 
`TestQA_T3_VolumeRecreated_TimerRejected` | qa_block_cp11b3_adversarial_test.go | 389 | | | |
+| 371 | `TestQA_T3_MultipleTimers_AllCancelled` | qa_block_cp11b3_adversarial_test.go | 430 | | | |
+| 373 | `TestQA_T4_PromotionClearsStaleMetadata` | qa_block_cp11b3_adversarial_test.go | 483 | | | |
+| 375 | `TestQA_T4_RebuildAddr_FromOldPrimary_NotUsed` | qa_block_cp11b3_adversarial_test.go | 522 | | | |
+| 377 | `TestQA_T6_Preflight_NoReplicas` | qa_block_cp11b3_adversarial_test.go | 563 | | | |
+| 379 | `TestQA_T6_Preflight_MultipleRejections` | qa_block_cp11b3_adversarial_test.go | 580 | | | |
+| 381 | `TestQA_T6_Preflight_NonExistent` | qa_block_cp11b3_adversarial_test.go | 621 | | | |
+| 383 | `TestQA_T1_ZeroPrimaryLSN_AllReplicasEligible` | qa_block_cp11b3_adversarial_test.go | 637 | | | |
+| 385 | `TestQA_T1_ReplicaWithPrimaryRole_Rejected` | qa_block_cp11b3_adversarial_test.go | 663 | | | |
+| 387 | `TestQA_T1_HeartbeatExactlyAtCutoff` | qa_block_cp11b3_adversarial_test.go | 686 | | | |
+| 389 | `TestQA_T2_RF3_OrphanedPrimary_BestReplicaPromoted` | qa_block_cp11b3_adversarial_test.go | 713 | | | |
+| 391 | `TestQA_T2_FailoverThenOrphan_SameVolume_NoDuplicate` | qa_block_cp11b3_adversarial_test.go | 743 | | | |
+| 393 | `TestQA_T2_OrphanDeferredTimer_CancelledOnPrimaryReconnect` | qa_block_cp11b3_adversarial_test.go | 771 | | | |
+| 395 | `TestQA_T2_VolumeDeletedDuringReevaluation` | qa_block_cp11b3_adversarial_test.go | 823 | | | |
+| 397 | `TestQA_T3_OrphanDeferredTimer_FiresAndPromotes` | qa_block_cp11b3_adversarial_test.go | 844 | | | |
+| 399 | `TestQA_T3_OrphanDeferredTimer_EpochChanged_NoPromotion` | qa_block_cp11b3_adversarial_test.go | 884 | | | |
+| 401 | `TestQA_T4_RebuildAddr_UpdatedByHeartbeat` | qa_block_cp11b3_adversarial_test.go | 921 | | | |
+| 403 | `TestQA_T6_Preflight_FullResultFields` | qa_block_cp11b3_adversarial_test.go | 956 | | | |
+| 405 | `TestQA_T6_Preflight_StaleRole_Rejected` | qa_block_cp11b3_adversarial_test.go | 1003 | | | |
+| 407 | `TestQA_T6_Preflight_DrainingRole_Rejected` | qa_block_cp11b3_adversarial_test.go | 1026 | | | |
+| 409 | `TestQA_ConcurrentFailoverAndOrphanReevaluation` | qa_block_cp11b3_adversarial_test.go | 1051 | | | |
+| 411 | `TestQA_ConcurrentVolumesWithDeadPrimaryAndUnmark` | qa_block_cp11b3_adversarial_test.go | 1072 | | | |
+| 413 | `TestQA_T5_ManualPromote_ForceNoHeartbeat_Rejected` | qa_block_cp11b3_adversarial_test.go | 1104 | | | |
+| 415 | `TestQA_T5_ManualPromote_ForceWrongRole_Rejected` | qa_block_cp11b3_adversarial_test.go | 1127 | | | |
+| 417 | `TestQA_T5_ManualPromote_ForceBypassesWALLag` | qa_block_cp11b3_adversarial_test.go | 1150 | | | |
+| 419 | `TestQA_T5_ManualPromote_PrimaryAlive_ForceOverrides` | qa_block_cp11b3_adversarial_test.go | 1185 | | | |
+| 421 | `TestQA_T5_ManualPromote_ConcurrentWithAutoPromotion` | qa_block_cp11b3_adversarial_test.go | 1223 | | | |
+| 423 | `TestQA_T5_ManualPromote_ReturnsStructuredRejections` | qa_block_cp11b3_adversarial_test.go | 1264 | | | |
+| 425 | `TestQA_T5_PromoteHandler_HTTP` | qa_block_cp11b3_adversarial_test.go | 1301 | | | |
+| 427 | `TestQA_T5_PromotionsTotal_CountsBothAutoAndManual` | qa_block_cp11b3_adversarial_test.go | 1361 | | | |
+| 429 | `TestQA_T5_ManualPromote_ReturnsOldPrimary` | qa_block_cp11b3_adversarial_test.go | 1412 | | | |
+| 431 | `TestQA_T5_ManualPromote_DoubleExhaustsReplicas` | qa_block_cp11b3_adversarial_test.go | 1445 | | | |
+| 433 | `TestQA_T5_ManualPromote_TransfersNVMeFields` | qa_block_cp11b3_adversarial_test.go | 1480 | | | |
+| 435 | `TestQA_T5_RF3_ForceSpecificTarget_LowerHealth` | qa_block_cp11b3_adversarial_test.go | 1508 | | | |
+| 437 | `TestQA_T5_ManualPromote_DuringExpand` | qa_block_cp11b3_adversarial_test.go | 1545 | | | |
+| 439 | `TestQA_T5_ManualPromote_NonExistentVolume` | qa_block_cp11b3_adversarial_test.go | 1577 | | | |
+| 441 | `TestQA_Reg_FullHeartbeatCrossTalk` | qa_block_cp62_test.go | 26 | | | |
+| 443 | 
`TestQA_Reg_FullHeartbeatEmptyServer` | qa_block_cp62_test.go | 46 | | | |
+| 445 | `TestQA_Reg_ConcurrentHeartbeatAndRegister` | qa_block_cp62_test.go | 64 | | | |
+| 447 | `TestQA_Reg_DeltaHeartbeatUnknownPath` | qa_block_cp62_test.go | 107 | | | |
+| 449 | `TestQA_Reg_PickServerTiebreaker` | qa_block_cp62_test.go | 125 | | | |
+| 451 | `TestQA_Reg_ReregisterDifferentServer` | qa_block_cp62_test.go | 146 | | | |
+| 453 | `TestQA_Reg_InflightIndependence` | qa_block_cp62_test.go | 173 | | | |
+| 455 | `TestQA_Reg_BlockCapableServersAfterUnmark` | qa_block_cp62_test.go | 194 | | | |
+| 457 | `TestQA_Master_DeleteVSUnreachable` | qa_block_cp62_test.go | 219 | | | |
+| 459 | `TestQA_Master_CreateSanitizedName` | qa_block_cp62_test.go | 249 | | | |
+| 461 | `TestQA_Master_ConcurrentCreateDelete` | qa_block_cp62_test.go | 274 | | | |
+| 463 | `TestQA_Master_AllVSFailNoOrphan` | qa_block_cp62_test.go | 316 | | | |
+| 465 | `TestQA_Master_SlowAllocateBlocksSecond` | qa_block_cp62_test.go | 346 | | | |
+| 467 | `TestQA_Master_CreateZeroSize` | qa_block_cp62_test.go | 392 | | | |
+| 469 | `TestQA_Master_CreateEmptyName` | qa_block_cp62_test.go | 405 | | | |
+| 471 | `TestQA_Master_EmptyNameValidation` | qa_block_cp62_test.go | 417 | | | |
+| 473 | `TestQA_VS_ConcurrentCreate` | qa_block_cp62_test.go | 436 | | | |
+| 475 | `TestQA_VS_ConcurrentCreateDelete` | qa_block_cp62_test.go | 475 | | | |
+| 477 | `TestQA_VS_DeleteCleansSnapshots` | qa_block_cp62_test.go | 510 | | | |
+| 479 | `TestQA_VS_SanitizationCollision` | qa_block_cp62_test.go | 530 | | | |
+| 481 | `TestQA_VS_CreateIdempotentReaddTarget` | qa_block_cp62_test.go | 552 | | | |
+| 483 | `TestQA_VS_GrpcNilBlockService` | qa_block_cp62_test.go | 573 | | | |
+| 485 | `TestQA_Queue_ConfirmWrongEpoch` | qa_block_cp63_test.go | 92 | | | |
+| 487 | `TestQA_Queue_HeartbeatPartialConfirm` | qa_block_cp63_test.go | 112 | | | |
+| 489 | `TestQA_Queue_HeartbeatWrongEpochNoConfirm` | qa_block_cp63_test.go | 131 | | | | 
+| 491 | `TestQA_Queue_SamePathSameEpochDifferentRoles` | qa_block_cp63_test.go | 144 | | | |
+| 493 | `TestQA_Queue_ConfirmOnUnknownServer` | qa_block_cp63_test.go | 157 | | | |
+| 495 | `TestQA_Queue_PeekReturnsCopy` | qa_block_cp63_test.go | 164 | | | |
+| 497 | `TestQA_Queue_ConcurrentEnqueueConfirmPeek` | qa_block_cp63_test.go | 179 | | | |
+| 499 | `TestQA_Reg_DoubleSwap` | qa_block_cp63_test.go | 205 | | | |
+| 501 | `TestQA_Reg_SwapNoReplica` | qa_block_cp63_test.go | 244 | | | |
+| 503 | `TestQA_Reg_SwapNotFound` | qa_block_cp63_test.go | 257 | | | |
+| 505 | `TestQA_Reg_ConcurrentSwapAndLookup` | qa_block_cp63_test.go | 265 | | | |
+| 507 | `TestQA_Reg_SetReplicaTwice_ReplacesOld` | qa_block_cp63_test.go | 291 | | | |
+| 509 | `TestQA_Reg_FullHeartbeatDoesNotClobberReplicaServer` | qa_block_cp63_test.go | 322 | | | |
+| 511 | `TestQA_Reg_ListByServerIncludesBothPrimaryAndReplica` | qa_block_cp63_test.go | 342 | | | |
+| 513 | `TestQA_Failover_DeferredCancelledOnReconnect` | qa_block_cp63_test.go | 363 | | | |
+| 515 | `TestQA_Failover_DoubleDisconnect_NoPanic` | qa_block_cp63_test.go | 389 | | | |
+| 517 | `TestQA_Failover_PromoteIdempotent_NoReplicaAfterFirstSwap` | qa_block_cp63_test.go | 398 | | | |
+| 519 | `TestQA_Failover_MixedLeaseStates` | qa_block_cp63_test.go | 431 | | | |
+| 521 | `TestQA_Failover_NoRegistryNoPanic` | qa_block_cp63_test.go | 460 | | | |
+| 523 | `TestQA_Failover_VolumeDeletedDuringDeferredTimer` | qa_block_cp63_test.go | 466 | | | |
+| 525 | `TestQA_Failover_ConcurrentFailoverDifferentServers` | qa_block_cp63_test.go | 485 | | | |
+| 527 | `TestQA_Create_LeaseNonZero_ImmediateFailoverSafe` | qa_block_cp63_test.go | 512 | | | |
+| 529 | `TestQA_Create_ReplicaDeleteOnVolDelete` | qa_block_cp63_test.go | 540 | | | |
+| 531 | `TestQA_Create_ReplicaDeleteFailure_PrimaryStillDeleted` | qa_block_cp63_test.go | 579 | | | |
+| 533 | `TestQA_Rebuild_DoubleReconnect_NoDuplicateAssignments` | qa_block_cp63_test.go | 616 | | | |
+| 535 | 
`TestQA_Rebuild_RecoverNilFailoverState` | qa_block_cp63_test.go | 635 | | | |
+| 537 | `TestQA_Rebuild_FullCycle_CreateFailoverRecoverRebuild` | qa_block_cp63_test.go | 647 | | | |
+| 539 | `TestQA_FailoverEnqueuesNewPrimaryAssignment` | qa_block_cp63_test.go | 711 | | | |
+| 541 | `TestQA_HeartbeatConfirmsFailoverAssignment` | qa_block_cp63_test.go | 733 | | | |
+| 543 | `TestQA_SwapEpochMonotonicallyIncreasing` | qa_block_cp63_test.go | 754 | | | |
+| 545 | `TestQA_CancelDeferredTimers_NoPendingRebuilds` | qa_block_cp63_test.go | 775 | | | |
+| 547 | `TestQA_Failover_ReplicaServerDies_PrimaryUntouched` | qa_block_cp63_test.go | 781 | | | |
+| 549 | `TestQA_Queue_EnqueueBatchEmpty` | qa_block_cp63_test.go | 797 | | | |
+| 551 | `TestQA_CP82_PrimaryCrash_AfterAck_BeforeReplicaBarrier` | qa_block_cp82_adversarial_test.go | 159 | | | |
+| 553 | `TestQA_CP82_ReplicaHeartbeatSpoof_DoesNotDeletePrimary` | qa_block_cp82_adversarial_test.go | 187 | | | |
+| 555 | `TestQA_CP82_PromotionRejects_StaleButHealthyReplica` | 
qa_block_cp82_adversarial_test.go | 229 | | | |
+| 557 | `TestQA_CP82_PromotionRejects_RebuildingReplica` | qa_block_cp82_adversarial_test.go | 263 | | | |
+| 559 | `TestQA_CP82_PromotionToleranceBoundary_ExactLSN` | qa_block_cp82_adversarial_test.go | 317 | | | |
+| 561 | `TestQA_CP82_MasterRestart_ReconstructReplicas_ThenFailover` | qa_block_cp82_adversarial_test.go | 372 | | | |
+| 563 | `TestQA_CP82_RF3_OneReplicaFlaps_UnderWriteLoad` | qa_block_cp82_adversarial_test.go | 443 | | | |
+| 565 | `TestQA_CP82_AssignmentPrecedence_ReplicaAddrsVsScalar` | qa_block_cp82_adversarial_test.go | 539 | | | |
+| 567 | `TestQA_CP82_ScrubConcurrentWrites_NoFalseCorruption` | qa_block_cp82_adversarial_test.go | 594 | | | |
+| 569 | `TestQA_CP82_ScrubDetectsCorruption_HealthDrops_PromotionAvoids` | qa_block_cp82_adversarial_test.go | 667 | | | |
+| 571 | `TestQA_CP82_ExpandRF3_PartialReplicaFailure` | qa_block_cp82_adversarial_test.go | 723 | | | |
+| 573 | 
`TestQA_CP82_ByServerIndexConsistency_AfterReplicaMove` | qa_block_cp82_adversarial_test.go | 800 | | | |
+| 575 | `TestEdge_LSNLag_StaleReplicaSkipped` | qa_block_edge_cases_test.go | 25 | | | |
+| 577 | `TestEdge_CascadeFailover_RF3_EpochChain` | qa_block_edge_cases_test.go | 73 | | | |
+| 579 | `TestEdge_ConcurrentFailoverAndHeartbeat_NoPanic` | qa_block_edge_cases_test.go | 144 | | | |
+| 581 | `TestEdge_LSNWithinTolerance_HealthWins` | qa_block_edge_cases_test.go | 187 | | | |
+| 583 | `TestEdge_NetworkFlap_RapidMarkUnmark` | qa_block_edge_cases_test.go | 220 | | | |
+| 585 | `TestEdge_RF3_MixedGates_BestEligiblePromoted` | qa_block_edge_cases_test.go | 280 | | | |
+| 587 | `TestEdge_PromotionUpdatesPublication` | qa_block_edge_cases_test.go | 340 | | | |
+| 589 | `TestEdge_OrphanReevaluation_LSNLag_StillPromotes` | qa_block_edge_cases_test.go | 375 | | | |
+| 591 | `TestEdge_RebuildAddr_ClearedThenRepopulated` | qa_block_edge_cases_test.go | 415 | | | |
+| 593 | 
`TestEdge_MultipleVolumes_SameServer_AllFailover` | qa_block_edge_cases_test.go | 444 | | | |
+| 595 | `TestQABlockHeartbeatCollector` | qa_block_heartbeat_loop_test.go | 11 | | | |
+| 597 | `TestQA_NVMe_CreateSetsFields` | qa_block_nvme_publication_test.go | 28 | | | |
+| 599 | `TestQA_NVMe_MissingFieldsDegradeToISCSI` | qa_block_nvme_publication_test.go | 61 | | | |
+| 601 | `TestQA_NVMe_HeartbeatSetsNvmeFields` | qa_block_nvme_publication_test.go | 94 | | | |
+| 603 | `TestQA_NVMe_HeartbeatClearsStaleNvme` | qa_block_nvme_publication_test.go | 135 | | | |
+| 605 | `TestQA_NVMe_PartialFields_OnlyAddr` | qa_block_nvme_publication_test.go | 170 | | | |
+| 607 | `TestQA_NVMe_PartialFields_OnlyNQN` | qa_block_nvme_publication_test.go | 192 | | | |
+| 609 | `TestQA_NVMe_SwapPrimaryReplica_PreservesNvme` | qa_block_nvme_publication_test.go | 213 | | | |
+| 611 | `TestQA_NVMe_PromoteBestReplica_NvmeFieldsCopied` | qa_block_nvme_publication_test.go | 256 | | | |
+| 613 | `TestQA_NVMe_HeartbeatProto_RoundTrip` 
| qa_block_nvme_publication_test.go | 303 | | | |
+| 615 | `TestQA_NVMe_HeartbeatProto_EmptyFields` | qa_block_nvme_publication_test.go | 333 | | | |
+| 617 | `TestQA_NVMe_FullHeartbeat_MasterRestart` | qa_block_nvme_publication_test.go | 358 | | | |
+| 619 | `TestQA_NVMe_ListByServerIncludesNvmeFields` | qa_block_nvme_publication_test.go | 402 | | | |
+| 621 | `TestIntegration_NVMe_CreateReturnsNvmeAddr` | qa_block_nvme_publication_test.go | 484 | | | |
+| 623 | `TestIntegration_NVMe_LookupReturnsNvmeAddr` | qa_block_nvme_publication_test.go | 519 | | | |
+| 625 | `TestIntegration_NVMe_FailoverUpdatesNvmeAddr` | qa_block_nvme_publication_test.go | 559 | | | |
+| 627 | `TestIntegration_NVMe_HeartbeatReconstructionAfterMasterRestart` | qa_block_nvme_publication_test.go | 626 | | | |
+| 629 | `TestIntegration_NVMe_MixedCluster` | qa_block_nvme_publication_test.go | 676 | | | |
+| 631 | `TestIntegration_NVMe_VolumeServerHeartbeatCollector` | qa_block_nvme_publication_test.go | 741 | | | |
+| 633 | `TestIntegration_NVMe_VolumeServerNoNvme` | qa_block_nvme_publication_test.go | 791 | | | |
+| 635 | `TestIntegration_NVMe_FullLifecycle_K8s` | qa_block_nvme_publication_test.go | 825 | | | |
+| 637 | `TestQA_NVMe_ToggleNvmeOnRunningVS` | qa_block_nvme_publication_test.go | 945 | | | |
+| 639 | `TestQA_NVMe_ToggleNvmeOnRunningVS_ReplicaSide` | qa_block_nvme_publication_test.go | 1028 | | | |
+| 641 | `TestQA_NVMe_PromotionThenImmediateLookup` | qa_block_nvme_publication_test.go | 1128 | | | |
+| 643 | `TestQA_RF3_CreateAndVerifyReplicas` | qa_block_rf3_test.go | 51 | | | |
+| 645 | `TestQA_RF3_PrimaryDies_BestReplicaPromoted` | qa_block_rf3_test.go | 125 | | | |
+| 647 | `TestQA_RF3_OneReplicaDies_WritesUnaffected` | qa_block_rf3_test.go | 175 | | | |
+| 649 | `TestQA_RF3_Rebuild_DeadReplicaCatchesUp` | qa_block_rf3_test.go | 214 | | | |
+| 651 | `TestQA_RF3_HealthScore_FailoverPreference` | qa_block_rf3_test.go | 265 | | | |
+| 653 | `TestQA_RF3_BackwardCompat_RF2_Unchanged` 
| qa_block_rf3_test.go | 308 | | | |
+| 655 | `TestQA_RF3_FullLifecycle` | qa_block_rf3_test.go | 357 | | | |
+| 657 | `TestVS_AllocateBlockVolume` | volume_grpc_block_test.go | 23 | | | |
+| 658 | `TestVS_AllocateIdempotent` | volume_grpc_block_test.go | 49 | | | |
+| 659 | `TestVS_AllocateSizeMismatch` | volume_grpc_block_test.go | 67 | | | |
+| 660 | `TestVS_DeleteBlockVolume` | volume_grpc_block_test.go | 82 | | | |
+| 661 | `TestVS_DeleteNotFound` | volume_grpc_block_test.go | 103 | | | |
+| 662 | `TestVS_SnapshotBlockVol` | volume_grpc_block_test.go | 112 | | | |
+| 663 | `TestVS_SnapshotVolumeNotFound` | volume_grpc_block_test.go | 129 | | | |
+| 664 | `TestVS_DeleteBlockSnapshot` | volume_grpc_block_test.go | 138 | | | |
+| 665 | `TestVS_ListBlockSnapshots` | volume_grpc_block_test.go | 154 | | | |
+| 666 | `TestVS_ListSnapshotsVolumeNotFound` | volume_grpc_block_test.go | 173 | | | |
+| 667 | `TestVS_ExpandBlockVol` | volume_grpc_block_test.go | 182 | | | |
+| 668 | `TestVS_ExpandVolumeNotFound` | volume_grpc_block_test.go | 196 | | | |
+| 669 | `TestVS_PrepareExpand` | volume_grpc_block_test.go | 205 | | | |
+| 670 | `TestVS_CommitExpand` | volume_grpc_block_test.go | 214 | | | |
+| 671 | `TestVS_CancelExpand` | volume_grpc_block_test.go | 230 | | | |
+| 672 | `TestVS_PrepareExpand_AlreadyInFlight` | volume_grpc_block_test.go | 242 | | | |
+| 673 | `TestBlockServiceDisabledByDefault` | volume_server_block_test.go | 27 | | | |
+| 674 | `TestBlockServiceStartAndShutdown` | volume_server_block_test.go | 40 | | | |
+| 675 | `TestBlockService_ProcessAssignment_Primary` | volume_server_block_test.go | 91 | | | |
+| 676 | `TestBlockService_ProcessAssignment_Replica` | volume_server_block_test.go | 112 | | | |
+| 677 | `TestBlockService_ProcessAssignment_UnknownVolume` | volume_server_block_test.go | 130 
| | | | +| 678 | `TestBlockService_ProcessAssignment_LeaseRefresh` | volume_server_block_test.go | 138 | | | | +| 679 | `TestBlockService_ProcessAssignment_WithReplicaAddrs` | volume_server_block_test.go | 156 | | | | +| 680 | `TestBlockService_HeartbeatIncludesReplicaAddrs` | volume_server_block_test.go | 173 | | | | +| 681 | `TestBlockService_ReplicationPorts_Deterministic` | volume_server_block_test.go | 190 | | | | +| 682 | `TestBlockService_ReplicationPorts_StableAcrossRestarts` | volume_server_block_test.go | 202 | | | | +| 683 | `TestBlockService_ProcessAssignment_InvalidTransition` | volume_server_block_test.go | 212 | | | | + +### durability + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestBugGCPanicWaitersHung` | bug_gc_panic_test.go | 19 | | | | +| 2 | `TestFlusher` | flusher_test.go | 13 | | | | +| 3 | `TestGroupCommitter` | group_commit_test.go | 11 | | | | + +### engine + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestQA` | blockvol_qa_test.go | 20 | | | | +| 2 | `TestQASuperblockValidation` | blockvol_qa_test.go | 804 | | | | +| 3 | `TestQAWALWriterEdgeCases` | blockvol_qa_test.go | 862 | | | | +| 4 | `TestQAWALEntryEdgeCases` | blockvol_qa_test.go | 960 | | | | +| 5 | `TestQAGroupCommitter` | blockvol_qa_test.go | 1020 | | | | +| 6 | `TestQAFlusher` | blockvol_qa_test.go | 1291 | | | | +| 7 | `TestQARecovery` | blockvol_qa_test.go | 1677 | | | | +| 8 | `TestQALifecycle` | blockvol_qa_test.go | 2221 | | | | +| 9 | `TestQACrashStress` | blockvol_qa_test.go | 2517 | | | | +| 10 | `TestQARecoveryEdgeCases` | blockvol_qa_test.go | 2999 | | | | +| 11 | `TestQAFlusherEdgeCases` | blockvol_qa_test.go | 3330 | | | | +| 12 | `TestQALifecycleConcurrency` | blockvol_qa_test.go | 3536 | | | | +| 13 | `TestQAParameterExtremes` | blockvol_qa_test.go | 3718 | | | | +| 14 | `TestBlockVol` | blockvol_test.go | 18 | 
| | | +| 15 | `TestBlockVolConfig` | config_test.go | 9 | | | | +| 16 | `TestDirtyMap` | dirty_map_test.go | 8 | | | | +| 17 | `TestLBAValidation` | lba_test.go | 8 | | | | +| 18 | `TestSuperblock` | superblock_test.go | 10 | | | | + +### fencing + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestLeaseGrant` | lease_grant_test.go | 10 | | | | +| 2 | `TestQAPhase4ACP1` | qa_phase4a_cp1_test.go | 16 | | | | +| 3 | `TestQAPhase4ACP2` | qa_phase4a_cp2_test.go | 18 | | | | +| 4 | `TestQAPhase4ACP3` | qa_phase4a_cp3_test.go | 15 | | | | +| 5 | `TestQAPhase4ACP4a` | qa_phase4a_cp4a_test.go | 15 | | | | +| 6 | `TestQAPhase4ACP4b1` | qa_phase4a_cp4b1_test.go | 13 | | | | +| 7 | `TestQAPhase4ACP4b4` | qa_phase4a_cp4b4_test.go | 15 | | | | + +### iscsi + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestCHAP_LoginSuccess` | auth_test.go | 13 | | | | +| 2 | `TestCHAP_LoginWrongPassword` | auth_test.go | 105 | | | | +| 3 | `TestCHAP_DisabledAllowsLogin` | auth_test.go | 146 | | | | +| 4 | `TestBugCollectDataOutNoTimeout` | bug_dataout_timeout_test.go | 23 | | | | +| 5 | `TestBugPendingQueueUnbounded` | bug_pending_unbounded_test.go | 22 | | | | +| 6 | `TestDataIO` | dataio_test.go | 8 | | | | +| 7 | `TestDiscovery` | discovery_test.go | 8 | | | | +| 8 | `TestLogin` | login_test.go | 27 | | | | +| 9 | `TestPDU` | pdu_test.go | 11 | | | | +| 10 | `TestQACHAP` | qa_chap_test.go | 12 | | | | +| 11 | `TestQAPhase3Engine` | qa_phase3_engine_test.go | 15 | | | | +| 12 | `TestQAPhase3RXTX` | qa_rxtx_test.go | 15 | | | | +| 13 | `TestQAPhase3Stability` | qa_stability_test.go | 20 | | | | +| 14 | `TestMapBlockVolError_Durability_HardwareError` | scsi_durability_test.go | 11 | | | | +| 15 | `TestMapBlockVolError_NonDurability_MediumError` | scsi_durability_test.go | 32 | | | | +| 16 | `TestMapBlockVolError_StringMatchNoLongerWorks` | 
scsi_durability_test.go | 50 | | | |
+| 17 | `TestSCSI` | scsi_test.go | 74 | | | |
+| 18 | `TestSession` | session_test.go | 52 | | | |
+| 19 | `TestTarget` | target_test.go | 13 | | | |
+
+### mode
+
+| # | Test Name | File | Line | Status | Sim | Notes |
+|---|-----------|------|------|--------|-----|-------|
+| 1 | `TestParseDurabilityMode_Valid` | durability_mode_test.go | 10 | | | |
+| 2 | `TestParseDurabilityMode_Invalid` | durability_mode_test.go | 31 | | | |
+| 3 | `TestDurabilityMode_StringRoundTrip` | durability_mode_test.go | 38 | | | |
+| 4 | `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | durability_mode_test.go | 51 | | | |
+| 5 | `TestDurabilityMode_Validate_SyncQuorum_RF3_OK` | durability_mode_test.go | 58 | | | |
+| 6 | `TestDurabilityMode_Validate_BestEffort_RF1_OK` | durability_mode_test.go | 64 | | | |
+| 7 | `TestDurabilityMode_RequiredReplicas_SyncAll_RF2` | durability_mode_test.go | 70 | | | |
+| 8 | `TestDurabilityMode_RequiredReplicas_SyncQuorum_RF3` | durability_mode_test.go | 77 | | | |
+| 9 | `TestDurabilityMode_IsStrict` | durability_mode_test.go | 84 | | | |
+| 10 | `TestDurabilityMode_CreateCloseOpenRoundTrip` | durability_mode_test.go | 96 | | | |
+| 11 | `TestDurabilityMode_Accessor_Default` | durability_mode_test.go | 122 | | | |
+| 12 | `TestDurabilityMode_InvalidOnDisk_OpenFails` | durability_mode_test.go | 136 | | | |
+| 13 | `TestQA_CP831_IdempotentCreate_ModeMismatch_Rejected` | qa_block_cp831_adversarial_test.go | 29 | | | |
+| 14 | `TestQA_CP831_IdempotentCreate_RFMismatch_Rejected` | qa_block_cp831_adversarial_test.go | 63 | | | |
+| 15 | `TestQA_CP831_IdempotentCreate_AllMatch_Succeeds` | qa_block_cp831_adversarial_test.go | 97 | | | |
+| 16 | `TestQA_CP831_InvalidDurabilityMode_Rejected` | qa_block_cp831_adversarial_test.go | 131 | | | |
+| 17 | `TestQA_CP831_HeartbeatEmptyMode_DoesNotOverwriteStrict` | qa_block_cp831_adversarial_test.go | 158 | | | |
+| 18 | `TestQA_CP831_HeartbeatNonEmptyMode_DoesUpdate` | qa_block_cp831_adversarial_test.go | 200 | | | |
+| 19 | `TestQA_CP831_SyncAll_RF3_PartialReplica_OneOfTwo_Fails` | qa_block_cp831_adversarial_test.go | 240 | | | |
+| 20 | `TestQA_CP831_SyncQuorum_RF3_OneReplicaOK_Succeeds` | qa_block_cp831_adversarial_test.go | 298 | | | |
+| 21 | `TestQA_CP831_SyncQuorum_RF3_AllReplicasFail_Fails` | qa_block_cp831_adversarial_test.go | 343 | | | |
+| 22 | `TestQA_CP831_ConcurrentCreate_SameName_DifferentModes` | qa_block_cp831_adversarial_test.go | 393 | | | |
+| 23 | `TestQA_CP831_FailoverPreservesDurabilityMode` | qa_block_cp831_adversarial_test.go | 464 | | | |
+| 24 | `TestQA_CP831_ExpandPreservesDurabilityMode` | qa_block_cp831_adversarial_test.go | 513 | | | |
+| 25 | `TestQA_CP831_LookupReturnsDurabilityMode` | qa_block_cp831_adversarial_test.go | 550 | | | |
+| 26 | `TestQA_CP831_BestEffort_NoReplicaCreate_StillSucceeds` | qa_block_cp831_adversarial_test.go | 590 | | | |
+| 27 | `TestQA_CP831_SyncAll_SingleServer_Fails` | qa_block_cp831_adversarial_test.go | 624 | | | |
+| 28 | `TestQA_CP831_CleanupPartialCreate_DeletesFails_NoRegistryLeak` | qa_block_cp831_adversarial_test.go | 666 | | | |
+| 29 | `TestQA_CP831_MasterRestart_AutoRegister_PreservesDurabilityMode` | qa_block_cp831_adversarial_test.go | 713 | | | |
+| 30 | `TestQA_CP831_DurabilityMode_Superblock_Roundtrip_AllModes` | qa_block_cp831_adversarial_test.go | 759 | | | |
+| 31 | `TestQA_CP831_DurabilityMode_Validate_EdgeCases` | qa_block_cp831_adversarial_test.go | 803 | | | |
+| 32 | `TestQA_CP831_DurabilityMode_RequiredReplicas_Math` | qa_block_cp831_adversarial_test.go | 841 | | | |
+| 33 | `TestQA_CP831_SentinelErrors_AreProperlyCategorized` | qa_block_cp831_adversarial_test.go | 880 | | | |
+| 34 | `TestDurability_BestEffort_BarrierFail_WriteSucceeds` | qa_block_durability_test.go | 62 | | | |
+| 35 | `TestDurability_SyncAll_BarrierFail_WriteErrors` | qa_block_durability_test.go | 87 | | | |
+| 36 | `TestDurability_SyncAll_AllSucceed_WriteOK` | qa_block_durability_test.go | 112 | | | |
+| 37 | `TestDurability_SyncAll_ZeroGroupStandalone` | qa_block_durability_test.go | 135 | | | |
+| 38 | `TestDurability_SyncQuorum_RF3_OneFailOK` | qa_block_durability_test.go | 158 | | | |
+| 39 | `TestDurability_SyncQuorum_RF3_TwoFail_Error` | qa_block_durability_test.go | 172 | | | |
+| 40 | `TestDurability_SyncQuorum_RF2_Rejected` | qa_block_durability_test.go | 185 | | | |
+| 41 | `TestDurability_SuperblockPersistence` | qa_block_durability_test.go | 204 | | | |
+| 42 | `TestDurability_V1Compat_DefaultBestEffort` | qa_block_durability_test.go | 232 | | | |
+| 43 | `TestDurability_SyncAll_DegradedRecovery` | qa_block_durability_test.go | 261 | | | |
+| 44 | `TestDurability_Heartbeat_ReportsMode` | qa_block_durability_test.go | 290 | | | |
+| 45 | `TestDurability_MixedModes_SameServer` | qa_block_durability_test.go | 346 | | | |
+| 46 | `TestDurability_SyncAll_PartialReplicaFails` | qa_block_durability_test.go | 386 | | | |
+| 47 | `TestDurability_BestEffort_PartialReplicaOK` | qa_block_durability_test.go | 439 | | | |
+
+### nvme
+
+| # | Test Name | File | Line | Status | Sim | Notes |
+|---|-----------|------|------|--------|-----|-------|
+| 1 | `TestNVMeActions_Registration` | nvme_bench_test.go | 18 | | | |
+| 2 | `TestNVMeActions_TierGating` | nvme_bench_test.go | 41 | | | |
+| 3 | `TestBenchActions_Registration` | nvme_bench_test.go | 63 | | | |
+| 4 | `TestFindNVMeDevice_Parse_LiveTCP` | nvme_bench_test.go | 109 | | | |
+| 5 | `TestFindNVMeDevice_Parse_NoMatch` | nvme_bench_test.go | 121 | | | |
+| 6 | `TestFindNVMeDevice_Parse_MultipleSubsystems` | nvme_bench_test.go | 133 | | | |
+| 7 | `TestFindNVMeDevice_Parse_PreferLiveTCP` | nvme_bench_test.go | 148 | | | |
+| 8 | `TestFindNVMeDevice_Parse_FallbackNonLive` | nvme_bench_test.go | 164 | | | |
+| 9 | `TestFindNVMeDevice_Parse_EmptyPaths` | nvme_bench_test.go | 176 | | | |
+| 10 | `TestFindNVMeDevice_Parse_EmptyName` | nvme_bench_test.go | 185 | | | |
+| 11 | `TestFindNVMeDevice_Parse_EmptySubsystems` | nvme_bench_test.go | 197 | | | |
+| 12 | `TestFindNVMeDevice_Parse_CaseInsensitive` | nvme_bench_test.go | 204 | | | |
+| 13 | `TestTargetSpec_NQN_WithNQNSuffix` | nvme_bench_test.go | 220 | | | |
+| 14 | `TestTargetSpec_NQN_FallbackToIQN` | 
nvme_bench_test.go | 228 | | | | +| 15 | `TestTargetSpec_NQN_BothEmpty` | nvme_bench_test.go | 236 | | | | +| 16 | `TestParseFioMetric_MixedAutoDetectPicksWrite` | nvme_bench_test.go | 249 | | | | +| 17 | `TestParseFioMetric_AllLatencyMetrics` | nvme_bench_test.go | 260 | | | | +| 18 | `TestParseFioMetric_BWBytes` | nvme_bench_test.go | 281 | | | | +| 19 | `TestParseFioMetric_MissingPercentile` | nvme_bench_test.go | 291 | | | | +| 20 | `TestParseFioMetric_NilPercentile` | nvme_bench_test.go | 307 | | | | +| 21 | `TestComputeBenchResult_LatencyWarn` | nvme_bench_test.go | 327 | | | | +| 22 | `TestComputeBenchResult_LatencyMuchWorse` | nvme_bench_test.go | 338 | | | | +| 23 | `TestComputeBenchResult_ExactGate` | nvme_bench_test.go | 348 | | | | +| 24 | `TestComputeBenchResult_JustBelowGate` | nvme_bench_test.go | 355 | | | | +| 25 | `TestComputeBenchResult_ZeroCandidate` | nvme_bench_test.go | 362 | | | | +| 26 | `TestComputeBenchResult_BothZero` | nvme_bench_test.go | 372 | | | | +| 27 | `TestComputeBenchResult_LatencyZeroCandidate` | nvme_bench_test.go | 379 | | | | +| 28 | `TestComputeBenchResult_DeltaSign_ThroughputUp` | nvme_bench_test.go | 389 | | | | +| 29 | `TestComputeBenchResult_DeltaSign_ThroughputDown` | nvme_bench_test.go | 396 | | | | +| 30 | `TestComputeBenchResult_DeltaSign_LatencyDown` | nvme_bench_test.go | 403 | | | | +| 31 | `TestComputeBenchResult_DeltaSign_LatencyUp` | nvme_bench_test.go | 410 | | | | +| 32 | `TestFormatBenchReport_EmptyResults` | nvme_bench_test.go | 421 | | | | +| 33 | `TestFormatBenchReport_MixedPassFail` | nvme_bench_test.go | 428 | | | | +| 34 | `TestBenchCompare_MissingParams` | nvme_bench_test.go | 451 | | | | +| 35 | `TestBenchCompare_EmptyVarValues` | nvme_bench_test.go | 482 | | | | +| 36 | `TestBenchCompare_InvalidGate` | nvme_bench_test.go | 501 | | | | +| 37 | `TestBenchCompare_PassWithDirection` | nvme_bench_test.go | 520 | | | | +| 38 | `TestBenchCompare_LatencyGatePass` | nvme_bench_test.go | 543 | | | | +| 39 | 
`TestFioParse_Action` | nvme_bench_test.go | 571 | | | | +| 40 | `TestFioParse_MissingVar` | nvme_bench_test.go | 591 | | | | +| 41 | `TestFioParse_MissingParams` | nvme_bench_test.go | 608 | | | | +| 42 | `TestFioParse_WithDirection` | nvme_bench_test.go | 633 | | | | +| 43 | `TestEngine_NVMeBenchScenario` | nvme_bench_test.go | 674 | | | | +| 44 | `TestEngine_BenchCompare_FailsGate` | nvme_bench_test.go | 747 | | | | +| 45 | `TestEngine_BenchCompare_LatencyFails` | nvme_bench_test.go | 788 | | | | +| 46 | `TestBenchCompare_WarnGate_InWarnBand` | nvme_bench_test.go | 833 | | | | +| 47 | `TestBenchCompare_WarnGate_BelowWarnGate` | nvme_bench_test.go | 862 | | | | +| 48 | `TestBenchCompare_WarnGate_AboveGate` | nvme_bench_test.go | 891 | | | | +| 49 | `TestBenchCompare_WarnGate_InvalidValue` | nvme_bench_test.go | 915 | | | | +| 50 | `TestBenchCompare_WarnGate_LatencyInWarnBand` | nvme_bench_test.go | 935 | | | | +| 51 | `TestTargetSpec_NQN_Sanitized` | nvme_bench_test.go | 968 | | | | +| 52 | `TestTargetSpec_IQN_Sanitized` | nvme_bench_test.go | 977 | | | | +| 53 | `TestTargetSpec_NQN_LongNameTruncated` | nvme_bench_test.go | 986 | | | | +| 54 | `TestParamDefault` | nvme_bench_test.go | 1002 | | | | +| 55 | `TestQA_Wire_TruncatedHeaderEOF` | nvme_qa_test.go | 27 | | | | +| 56 | `TestQA_Wire_ZeroLengthStream` | nvme_qa_test.go | 39 | | | | +| 57 | `TestQA_Wire_HeaderLength_Exactly8` | nvme_qa_test.go | 51 | | | | +| 58 | `TestQA_Wire_AllZeroHeader` | nvme_qa_test.go | 71 | | | | +| 59 | `TestQA_Wire_GarbageAfterValidPDU` | nvme_qa_test.go | 82 | | | | +| 60 | `TestQA_UnexpectedPDUType` | nvme_qa_test.go | 119 | | | | +| 61 | `TestQA_CapsuleBeforeIC` | nvme_qa_test.go | 154 | | | | +| 62 | `TestQA_IOWrite_OnAdminQueue` | nvme_qa_test.go | 182 | | | | +| 63 | `TestQA_UnknownAdminOpcode` | nvme_qa_test.go | 201 | | | | +| 64 | `TestQA_UnknownIOOpcode` | nvme_qa_test.go | 218 | | | | +| 65 | `TestQA_ConnectEmptyPayload` | nvme_qa_test.go | 256 | | | | +| 66 | 
`TestQA_ConnectNoPayload` | nvme_qa_test.go | 292 | | | | +| 67 | `TestQA_UnknownFabricFCType` | nvme_qa_test.go | 327 | | | | +| 68 | `TestQA_PropertyGetUnknownOffset` | nvme_qa_test.go | 345 | | | | +| 69 | `TestQA_Disconnect_CleanShutdown` | nvme_qa_test.go | 364 | | | | +| 70 | `TestQA_IO_WriteOversizedPayload` | nvme_qa_test.go | 414 | | | | +| 71 | `TestQA_IO_WriteExactBoundary` | nvme_qa_test.go | 435 | | | | +| 72 | `TestQA_IO_ReadExactBoundary` | nvme_qa_test.go | 470 | | | | +| 73 | `TestQA_IO_MultiBlockWrite` | nvme_qa_test.go | 512 | | | | +| 74 | `TestQA_IO_WriteZerosOutOfBounds` | nvme_qa_test.go | 573 | | | | +| 75 | `TestQA_IO_FlushOnReplica` | nvme_qa_test.go | 592 | | | | +| 76 | `TestQA_IO_WriteZerosOnReplica` | nvme_qa_test.go | 610 | | | | +| 77 | `TestQA_IO_ReadOnReplicaSucceeds` | nvme_qa_test.go | 630 | | | | +| 78 | `TestQA_IO_SyncCacheError` | nvme_qa_test.go | 659 | | | | +| 79 | `TestQA_IO_TrimError` | nvme_qa_test.go | 680 | | | | +| 80 | `TestQA_Admin_UnknownFeatureID` | nvme_qa_test.go | 704 | | | | +| 81 | `TestQA_Admin_UnknownIdentifyCNS` | nvme_qa_test.go | 735 | | | | +| 82 | `TestQA_Admin_UnknownLogPageLID` | nvme_qa_test.go | 753 | | | | +| 83 | `TestQA_Admin_SetFeaturesZeroQueues` | nvme_qa_test.go | 771 | | | | +| 84 | `TestQA_Admin_GetLogPage_ErrorLog` | nvme_qa_test.go | 795 | | | | +| 85 | `TestQA_ANA_TransitionMidIO` | nvme_qa_test.go | 833 | | | | +| 86 | `TestQA_ANA_NonOptimizedAllowsWrite` | nvme_qa_test.go | 875 | | | | +| 87 | `TestQA_ANA_LogReflectsState` | nvme_qa_test.go | 890 | | | | +| 88 | `TestQA_Server_ConnectAfterVolumeRemoved` | nvme_qa_test.go | 938 | | | | +| 89 | `TestQA_Server_RapidConnectDisconnect` | nvme_qa_test.go | 966 | | | | +| 90 | `TestQA_Server_ConcurrentIO` | nvme_qa_test.go | 1008 | | | | +| 91 | `TestQA_SQHDWraparound` | nvme_qa_test.go | 1063 | | | | +| 92 | `TestQA_LargeReadChunking` | nvme_qa_test.go | 1128 | | | | +| 93 | `TestQA_ErrorInjectionMidStream` | nvme_qa_test.go | 1192 | | | | 
+| 94 | `TestQA_PropertySet_DisableController` | nvme_qa_test.go | 1247 | | | | +| 95 | `TestQA_IdentifyWithoutSubsystem` | nvme_qa_test.go | 1287 | | | | +| 96 | `TestQA_IO_WithoutSubsystem` | nvme_qa_test.go | 1319 | | | | +| 97 | `TestQA_FlowCtlOff_SQHD` | nvme_qa_test.go | 1372 | | | | +| 98 | `TestQA_IO_4KBlockSize` | nvme_qa_test.go | 1437 | | | | +| 99 | `TestQA_Identify_ControllerModelSerial` | nvme_qa_test.go | 1492 | | | | +| 100 | `TestQA_MapBlockError_WriteHeuristic` | nvme_qa_test.go | 1551 | | | | +| 101 | `TestQA_MapBlockError_WriteHeuristicCapital` | nvme_qa_test.go | 1560 | | | | +| 102 | `TestQA_MapBlockError_ReadHeuristic` | nvme_qa_test.go | 1569 | | | | +| 103 | `TestQA_MapBlockError_ReadHeuristicCapital` | nvme_qa_test.go | 1578 | | | | +| 104 | `TestQA_MapBlockError_UnknownError` | nvme_qa_test.go | 1587 | | | | +| 105 | `TestQA_MapBlockError_Nil` | nvme_qa_test.go | 1596 | | | | +| 106 | `TestQA_PropertySet_8ByteValue` | nvme_qa_test.go | 1607 | | | | +| 107 | `TestQA_KATO_Zero_NoTimer` | nvme_qa_test.go | 1643 | | | | +| 108 | `TestQA_LogPage_ErrorLog_LargeNUMD` | nvme_qa_test.go | 1692 | | | | +| 109 | `TestQA_LogPage_SMART_LargeNUMD` | nvme_qa_test.go | 1728 | | | | +| 110 | `TestQA_LogPage_ANA_LargeNUMD` | nvme_qa_test.go | 1774 | | | | +| 111 | `TestQA_SetFeatures_MultipleCallsSameSession` | nvme_qa_test.go | 1816 | | | | +| 112 | `TestQA_SetFeatures_KATOOverwrite` | nvme_qa_test.go | 1876 | | | | +| 113 | `TestQA_CNTLID_MonotonicallyIncreasing` | nvme_qa_test.go | 1918 | | | | +| 114 | `TestQA_Wire_ConnectionDropMidReceive` | nvme_qa_test.go | 1937 | | | | +| 115 | `TestQA_Wire_ConnectionDropMidPayload` | nvme_qa_test.go | 1975 | | | | +| 116 | `TestQA_IO_WriteZeros_DEALLOC_TrimError` | nvme_qa_test.go | 2025 | | | | +| 117 | `TestQA_IO_WriteZeros_NoDEALLOC_WriteError` | nvme_qa_test.go | 2044 | | | | +| 118 | `TestQA_PropertyGet_CAP_8Byte` | nvme_qa_test.go | 2071 | | | | +| 119 | `TestQA_PropertyGet_CC_4Byte` | nvme_qa_test.go | 2094 
| | | | +| 120 | `TestQA_Connect_QueueSizeConversion` | nvme_qa_test.go | 2116 | | | | +| 121 | `TestQA_ANA_NonANAProvider_Healthy` | nvme_qa_test.go | 2182 | | | | +| 122 | `TestQA_ANA_NonANAProvider_Unhealthy` | nvme_qa_test.go | 2211 | | | | +| 123 | `TestQA_Capsule_Lba64Bit` | nvme_qa_test.go | 2269 | | | | +| 124 | `TestQA_Capsule_LbaLengthZeroBased` | nvme_qa_test.go | 2278 | | | | +| 125 | `TestQA_Admin_AsyncEvent_Stub` | nvme_qa_test.go | 2299 | | | | +| 126 | `TestQA_H2CTermReq_ClosesSession` | nvme_qa_test.go | 2317 | | | | +| 127 | `TestQA_Padding_MaxDataOffset255` | nvme_qa_test.go | 2355 | | | | +| 128 | `TestQA_Padding_ExactlyPadBufBoundary` | nvme_qa_test.go | 2397 | | | | +| 129 | `TestQA_Padding_OneBeyondPadBuf` | nvme_qa_test.go | 2436 | | | | +| 130 | `TestQA_Padding_ZeroPad` | nvme_qa_test.go | 2474 | | | | +| 131 | `TestQA_Padding_StreamEOFMidPad` | nvme_qa_test.go | 2516 | | | | +| 132 | `TestQA_Padding_TwoConsecutivePDUs` | nvme_qa_test.go | 2547 | | | | +| 133 | `TestQA_BufPool_StaleDataNotLeaked` | nvme_qa_test.go | 2619 | | | | +| 134 | `TestQA_BufPool_ConcurrentGetPut` | nvme_qa_test.go | 2647 | | | | +| 135 | `TestQA_BufPool_ZeroSize` | nvme_qa_test.go | 2668 | | | | +| 136 | `TestQA_BufPool_PutWrongCap` | nvme_qa_test.go | 2682 | | | | +| 137 | `TestQA_BufPool_WriteZerosPooled` | nvme_qa_test.go | 2689 | | | | +| 138 | `TestQA_Batch_MultiChunkC2H_InterleavedVerify` | nvme_qa_test.go | 2748 | | | | +| 139 | `TestQA_Batch_SingleBlockNoChunking` | nvme_qa_test.go | 2861 | | | | +| 140 | `TestQA_Batch_WriteReadCycle_PooledBuffers` | nvme_qa_test.go | 2915 | | | | +| 141 | `TestQA_MaxDataLen_VerySmallChunk` | nvme_qa_test.go | 2995 | | | | +| 142 | `TestQA_MaxDataLen_ExactMultiple` | nvme_qa_test.go | 3067 | | | | +| 143 | `TestQA_MaxDataLen_NonMultiple` | nvme_qa_test.go | 3125 | | | | +| 144 | `TestQA_NQN_SpecialChars` | nvme_qa_test.go | 3191 | | | | +| 145 | `TestQA_NQN_LongName` | nvme_qa_test.go | 3214 | | | | +| 146 | 
`TestQA_TuneConn_RapidAcceptClose` | nvme_qa_test.go | 3248 | | | | +| 147 | `TestQA_Batch_FlushBufWithoutWrite` | nvme_qa_test.go | 3283 | | | | +| 148 | `TestQA_Batch_MultipleFlushBuf` | nvme_qa_test.go | 3296 | | | | +| 149 | `TestQA_WAL_ConcurrentWritesUnderPressure` | nvme_qa_test.go | 3329 | | | | +| 150 | `TestQA_WAL_ReadsDuringWritePressure` | nvme_qa_test.go | 3372 | | | | +| 151 | `TestQA_WAL_WriteZerosUnderPressure` | nvme_qa_test.go | 3430 | | | | +| 152 | `TestQA_WAL_PressureTransition` | nvme_qa_test.go | 3471 | | | | +| 153 | `TestQA_WAL_ErrorEscalationPrevention` | nvme_qa_test.go | 3513 | | | | +| 154 | `TestQA_WAL_ThrottleDoesNotBlockReads` | nvme_qa_test.go | 3577 | | | | +| 155 | `TestQA_WAL_WrappedErrorProtocolPath` | nvme_qa_test.go | 3628 | | | | +| 156 | `TestQA_WAL_FlushDuringPressure` | nvme_qa_test.go | 3655 | | | | +| 157 | `TestQA_Batch_BackToBack_HeaderOnly` | nvme_qa_test.go | 3684 | | | | +| 158 | `TestCommonHeader_MarshalRoundTrip` | nvme_test.go | 111 | | | | +| 159 | `TestCapsuleCommand_MarshalRoundTrip` | nvme_test.go | 129 | | | | +| 160 | `TestCapsuleResponse_MarshalRoundTrip` | nvme_test.go | 154 | | | | +| 161 | `TestICRequest_MarshalRoundTrip` | nvme_test.go | 173 | | | | +| 162 | `TestICResponse_MarshalRoundTrip` | nvme_test.go | 189 | | | | +| 163 | `TestC2HDataHeader_MarshalRoundTrip` | nvme_test.go | 201 | | | | +| 164 | `TestConnectData_MarshalRoundTrip` | nvme_test.go | 217 | | | | +| 165 | `TestStatusWord_Encoding` | nvme_test.go | 237 | | | | +| 166 | `TestStatusWord_IsError` | nvme_test.go | 275 | | | | +| 167 | `TestWire_WriteReadRoundTrip_HeaderOnly` | nvme_test.go | 288 | | | | +| 168 | `TestWire_WriteReadRoundTrip_WithData` | nvme_test.go | 321 | | | | +| 169 | `TestWire_MultiPDU` | nvme_test.go | 372 | | | | +| 170 | `TestWire_PayloadSize` | nvme_test.go | 404 | | | | +| 171 | `TestWire_CapsuleCmdWithData` | nvme_test.go | 419 | | | | +| 172 | `TestController_ICHandshake` | nvme_test.go | 568 | | | | +| 173 | 
`TestController_AdminConnect` | nvme_test.go | 588 | | | | +| 174 | `TestController_ConnectUnknownNQN` | nvme_test.go | 598 | | | | +| 175 | `TestController_PropertyGetCAP` | nvme_test.go | 621 | | | | +| 176 | `TestController_PropertySetCC_EN` | nvme_test.go | 648 | | | | +| 177 | `TestIdentify_Controller` | nvme_test.go | 688 | | | | +| 178 | `TestIdentify_Namespace_512B` | nvme_test.go | 764 | | | | +| 179 | `TestIdentify_Namespace_4K` | nvme_test.go | 768 | | | | +| 180 | `TestIdentify_ActiveNSList` | nvme_test.go | 833 | | | | +| 181 | `TestIdentify_NSDescriptors` | nvme_test.go | 863 | | | | +| 182 | `TestAdmin_SetFeatures_NumQueues` | nvme_test.go | 897 | | | | +| 183 | `TestAdmin_GetLogPage_SMART` | nvme_test.go | 925 | | | | +| 184 | `TestAdmin_GetLogPage_ANA` | nvme_test.go | 956 | | | | +| 185 | `TestAdmin_KeepAlive` | nvme_test.go | 993 | | | | +| 186 | `TestAdmin_GetFeatures` | nvme_test.go | 1011 | | | | +| 187 | `TestIO_ReadWrite` | nvme_test.go | 1068 | | | | +| 188 | `TestIO_HandleRead` | nvme_test.go | 1107 | | | | +| 189 | `TestIO_HandleWrite` | nvme_test.go | 1173 | | | | +| 190 | `TestIO_HandleFlush` | nvme_test.go | 1220 | | | | +| 191 | `TestIO_HandleWriteZeros_Trim` | nvme_test.go | 1256 | | | | +| 192 | `TestIO_ReadOutOfBounds` | nvme_test.go | 1306 | | | | +| 193 | `TestIO_WriteR2TFlow` | nvme_test.go | 1349 | | | | +| 194 | `TestIO_WriteUnhealthy` | nvme_test.go | 1429 | | | | +| 195 | `TestIO_ReadError` | nvme_test.go | 1470 | | | | +| 196 | `TestIO_WriteError` | nvme_test.go | 1510 | | | | +| 197 | `TestErrorMapping_AllSentinels` | nvme_test.go | 1555 | | | | +| 198 | `TestErrorMapping_DNR` | nvme_test.go | 1581 | | | | +| 199 | `TestANAState_AllRoles` | nvme_test.go | 1598 | | | | +| 200 | `TestNGUID_Generation` | nvme_test.go | 1622 | | | | +| 201 | `TestServer_StartStop` | nvme_test.go | 1652 | | | | +| 202 | `TestServer_DisabledNoOp` | nvme_test.go | 1672 | | | | +| 203 | `TestServer_AddRemoveVolume` | nvme_test.go | 1682 | | | | +| 
204 | `TestServer_ConcurrentAccept` | nvme_test.go | 1699 | | | | +| 205 | `TestController_KATOTimeout` | nvme_test.go | 1749 | | | | +| 206 | `TestFullSequence_ICConnectIdentifyReadWrite` | nvme_test.go | 1802 | | | | +| 207 | `TestServer_NQN` | nvme_test.go | 1913 | | | | +| 208 | `TestIOQueue_CrossConnection` | nvme_test.go | 1930 | | | | +| 209 | `TestIOQueue_InvalidCNTLID` | nvme_test.go | 2031 | | | | +| 210 | `TestIOQueue_NQNMismatch` | nvme_test.go | 2062 | | | | +| 211 | `TestAdminSession_UnregisteredOnShutdown` | nvme_test.go | 2112 | | | | +| 212 | `TestReader_MalformedHeader_TooSmall` | nvme_test.go | 2151 | | | | +| 213 | `TestReader_MalformedHeader_TooLarge` | nvme_test.go | 2167 | | | | +| 214 | `TestReader_MalformedHeader_DataOffsetLessThanHeaderLength` | nvme_test.go | 2182 | | | | +| 215 | `TestReader_MalformedHeader_DataOffsetGtDataLength` | nvme_test.go | 2198 | | | | +| 216 | `TestReader_MalformedHeader_DataLengthLtHeaderLength` | nvme_test.go | 2214 | | | | +| 217 | `TestReader_MalformedHeader_DataOffsetZero_ExtraDataLength` | nvme_test.go | 2230 | | | | +| 218 | `TestIOQueue_HostNQNMismatch` | nvme_test.go | 2251 | | | | +| 219 | `TestIO_WritePayloadSizeMismatch` | nvme_test.go | 2301 | | | | +| 220 | `TestIO_WritePayloadTooLarge` | nvme_test.go | 2344 | | | | +| 221 | `TestDisconnect_NoError` | nvme_test.go | 2391 | | | | +| 222 | `TestReader_LargePadding` | nvme_test.go | 2425 | | | | +| 223 | `TestTuneConn_NoError` | nvme_test.go | 2489 | | | | +| 224 | `TestTuneConn_NonTCP` | nvme_test.go | 2515 | | | | +| 225 | `TestWriterBatchedFlush` | nvme_test.go | 2523 | | | | +| 226 | `TestSendWithData_UsesSharedEncode` | nvme_test.go | 2556 | | | | +| 227 | `TestNewWriterSize` | nvme_test.go | 2610 | | | | +| 228 | `TestBufPool_GetPut` | nvme_test.go | 2628 | | | | +| 229 | `TestBufPool_WriteReuse` | nvme_test.go | 2654 | | | | +| 230 | `TestMaxH2CDataLen_Config` | nvme_test.go | 2736 | | | | +| 231 | `TestMaxH2CDataLen_Default` | nvme_test.go | 
2774 | | | | +| 232 | `TestC2HChunking_ConfigurableMaxDataLen` | nvme_test.go | 2807 | | | | +| 233 | `TestDataOffset_LargePadding` | nvme_test.go | 2900 | | | | +| 234 | `TestNQN_Sanitization` | nvme_test.go | 2951 | | | | +| 235 | `TestIsRetryableWALPressure_Classification` | nvme_test.go | 2968 | | | | +| 236 | `TestWriteWithRetry_TransientSuccess` | nvme_test.go | 3004 | | | | +| 237 | `TestWriteWithRetry_PersistentFailure` | nvme_test.go | 3065 | | | | +| 238 | `TestWriteWithRetry_NonWALError` | nvme_test.go | 3092 | | | | +| 239 | `TestWriteWithRetry_ImmediateSuccess` | nvme_test.go | 3112 | | | | +| 240 | `TestThrottleOnWALPressure_Deterministic` | nvme_test.go | 3131 | | | | +| 241 | `TestWriteWithRetry_ConcurrentPressure` | nvme_test.go | 3204 | | | | +| 242 | `TestWriteWithRetry_ConcurrentTransient` | nvme_test.go | 3241 | | | | +| 243 | `TestWriteWithRetry_WrappedWALError` | nvme_test.go | 3286 | | | | +| 244 | `TestMockDevice_WALPressureProvider` | nvme_test.go | 3307 | | | | +| 245 | `TestIO_WriteWALPressure_ProtocolResponse` | nvme_test.go | 3324 | | | | +| 246 | `TestWriteWithRetry_SharedTransientConcurrency` | nvme_test.go | 3385 | | | | +| 247 | `TestTxLoop_R2TDoneClosedOnError` | nvme_test.go | 3446 | | | | +| 248 | `TestTxLoop_R2TDoneClosedOnError_Integration` | nvme_test.go | 3471 | | | | +| 249 | `TestShutdown_ReleasePendingCapsuleBuffers` | nvme_test.go | 3528 | | | | +| 250 | `TestCompleteWaiters_NilR2TDone` | nvme_test.go | 3552 | | | | +| 251 | `TestIsPathSafe` | protocol_test.go | 7 | | | | +| 252 | `TestAuthTokenHeaderConstant` | protocol_test.go | 32 | | | | + +### operations + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestExpand_Standalone_DirectCommit` | expand_test.go | 31 | | | | +| 2 | `TestExpand_Standalone_Idempotent` | expand_test.go | 47 | | | | +| 3 | `TestExpand_Standalone_ShrinkRejected` | expand_test.go | 59 | | | | +| 4 | 
`TestExpand_Standalone_SurvivesReopen` | expand_test.go | 69 | | | | +| 5 | `TestPrepareExpand_Success` | expand_test.go | 88 | | | | +| 6 | `TestPrepareExpand_WriteBeyondOldSize_Rejected` | expand_test.go | 105 | | | | +| 7 | `TestPrepareExpand_WriteWithinOldSize_OK` | expand_test.go | 121 | | | | +| 8 | `TestCommitExpand_Success` | expand_test.go | 143 | | | | +| 9 | `TestCommitExpand_WriteBeyondNewSize_OK` | expand_test.go | 163 | | | | +| 10 | `TestCommitExpand_EpochMismatch_Rejected` | expand_test.go | 189 | | | | +| 11 | `TestCancelExpand_ClearsPreparedState` | expand_test.go | 205 | | | | +| 12 | `TestCancelExpand_WriteStillRejectedInNewRange` | expand_test.go | 225 | | | | +| 13 | `TestPrepareExpand_AlreadyInFlight_Rejected` | expand_test.go | 244 | | | | +| 14 | `TestRecovery_PreparedState_Cleared` | expand_test.go | 257 | | | | +| 15 | `TestExpand_WithProfile_Single` | expand_test.go | 280 | | | | +| 16 | `TestHealthScore_DefaultPerfect` | health_score_test.go | 8 | | | | +| 17 | `TestHealthScore_ScrubErrorsLowerScore` | health_score_test.go | 15 | | | | +| 18 | `TestHealthScore_CappedPenalty` | health_score_test.go | 27 | | | | +| 19 | `TestHealthScore_StaleScrubPenalty` | health_score_test.go | 38 | | | | +| 20 | `TestHealthScore_RecordScrubComplete` | health_score_test.go | 48 | | | | +| 21 | `TestResolvePolicy_DatabaseDefaults` | preset_test.go | 8 | | | | +| 22 | `TestResolvePolicy_GeneralDefaults` | preset_test.go | 42 | | | | +| 23 | `TestResolvePolicy_ThroughputDefaults` | preset_test.go | 67 | | | | +| 24 | `TestResolvePolicy_OverrideDurability` | preset_test.go | 86 | | | | +| 25 | `TestResolvePolicy_OverrideRF` | preset_test.go | 101 | | | | +| 26 | `TestResolvePolicy_NoPreset_SystemDefaults` | preset_test.go | 116 | | | | +| 27 | `TestResolvePolicy_InvalidPreset` | preset_test.go | 138 | | | | +| 28 | `TestResolvePolicy_IncompatibleCombo` | preset_test.go | 148 | | | | +| 29 | `TestResolvePolicy_WALWarning` | preset_test.go | 160 | | | | +| 30 
| `TestQA_B09_ExpandAfterDoubleFailover_RF3` | qa_block_expand_adversarial_test.go | 82 | | | |
+| 31 | `TestQA_B09_ExpandSeesDeletedVolume_AfterLockAcquire` | qa_block_expand_adversarial_test.go | 139 | | | |
+| 32 | `TestQA_B09_ConcurrentExpandAndFailover` | qa_block_expand_adversarial_test.go | 183 | | | |
+| 33 | `TestQA_B09_ConcurrentExpandsSameVolume` | qa_block_expand_adversarial_test.go | 232 | | | |
+| 34 | `TestQA_B10_RepeatedEmptyHeartbeats_DuringExpand` | qa_block_expand_adversarial_test.go | 280 | | | |
+| 35 | `TestQA_B10_ExpandFailed_HeartbeatStillProtected` | qa_block_expand_adversarial_test.go | 312 | | | |
+| 36 | `TestQA_B10_HeartbeatSizeSuppress_DuringExpand` | qa_block_expand_adversarial_test.go | 355 | | | |
+| 37 | `TestQA_B10_ConcurrentHeartbeatsAndExpand` | qa_block_expand_adversarial_test.go | 409 | | | |
+| 38 | `TestQA_Expand_ConcurrentPrepare` | qa_expand_test.go | 36 | | | |
+| 39 | `TestQA_Expand_CommitWithoutPrepare` | qa_expand_test.go | 74 | | | |
+| 40 | `TestQA_Expand_CancelWithoutPrepare_ForceEpoch` | qa_expand_test.go | 90 | | | |
+| 41 | `TestQA_Expand_CancelWithWrongEpoch` | qa_expand_test.go | 105 | | | |
+| 42 | `TestQA_Expand_ForceCancel_IgnoresEpoch` | qa_expand_test.go | 126 | | | |
+| 43 | `TestQA_Expand_DoubleCommit` | qa_expand_test.go | 145 | | | |
+| 44 | `TestQA_Expand_PrepareAfterCommit` | qa_expand_test.go | 165 | | | |
+| 45 | `TestQA_Expand_PrepareAfterCancel` | qa_expand_test.go | 194 | | | |
+| 46 | `TestQA_Expand_PrepareShrink` | qa_expand_test.go | 216 | | | |
+| 47 | `TestQA_Expand_PrepareUnaligned` | qa_expand_test.go | 227 | | | |
+| 48 | `TestQA_Expand_DataIntegrityAcrossCommit` | qa_expand_test.go | 244 | | | |
+| 49 | `TestQA_Expand_RecoveryClearsAndDataSurvives` | qa_expand_test.go | 304 | | | |
+| 50 | `TestQA_Expand_CommittedSurvivesReopen` | qa_expand_test.go | 351 | | | |
+| 51 | `TestQA_Expand_ClosedVolume` | qa_expand_test.go | 394 | | | |
+| 52 | `TestQA_Expand_PrepareSameSize` | qa_expand_test.go | 414 | | | |
+| 53 | `TestQA_Expand_ConcurrentWriteDuringPrepare` | qa_expand_test.go | 431 | | | |
+| 54 | `TestQA_Expand_ExpandStateRaceWithCommit` | qa_expand_test.go | 478 | | | |
+| 55 | `TestQA_Expand_TrimDuringPrepared` | qa_expand_test.go | 519 | | | |
+| 56 | `TestQA_Expand_SuperblockValidatePreparedSize` | qa_expand_test.go | 553 | | | |
+| 57 | `TestQA_Expand_SuperblockValidateOrphanEpoch` | qa_expand_test.go | 572 | | | |
+| 58 | `TestQAResize` | qa_resize_test.go | 11 | | | |
+| 59 | `TestQA_Export_ConcurrentWriteDuringImport` | qa_snapshot_export_adversarial_test.go | 26 | | | |
+| 60 | `TestQA_Export_PartialImportFailure_ExtentState` | qa_snapshot_export_adversarial_test.go | 95 | | | |
+| 61 | `TestQA_Export_ImportWithActiveSnapshot_Rejected` | qa_snapshot_export_adversarial_test.go | 190 | | | |
+| 62 | `TestQA_Export_DoubleImportRejected` | qa_snapshot_export_adversarial_test.go | 244 | | | |
+| 63 | `TestQA_Export_ExportAfterClose` | qa_snapshot_export_adversarial_test.go | 289 | | | |
+| 64 | `TestQA_Export_ImportAfterClose` | qa_snapshot_export_adversarial_test.go | 306 | | | |
+| 65 | `TestQA_Export_ConcurrentExports_NoCollision` | qa_snapshot_export_adversarial_test.go | 329 | | | |
+| 66 | `TestQA_Export_FlagImportedSurvivesReopen` | qa_snapshot_export_adversarial_test.go | 385 | | | |
+| 67 | `TestQA_Export_ImportContextCancelMidStream` | qa_snapshot_export_adversarial_test.go | 439 | | | |
+| 68 | `TestQA_Export_NonChunkAlignedBlockCount` | qa_snapshot_export_adversarial_test.go | 515 | | | |
+| 69 | `TestQA_Export_ZeroDataVolume` | qa_snapshot_export_adversarial_test.go | 566 | | | |
+| 70 | `TestQA_Export_TempSnapIDUniqueness` | qa_snapshot_export_adversarial_test.go | 604 | | | |
+| 71 | `TestQA_SnapshotExport_TruncatedData` | qa_snapshot_export_test.go | 14 | | | |
+| 72 | `TestQA_SnapshotExport_WrongChecksum` | qa_snapshot_export_test.go | 38 | | | |
+| 73 | `TestQA_SnapshotExport_CorruptedManifest` | qa_snapshot_export_test.go | 62 | | | |
+| 74 | `TestQA_SnapshotExport_DataSizeMismatch` | qa_snapshot_export_test.go | 77 | | | |
+| 75 | `TestQA_SnapshotExport_RoundTripIntegrity` | qa_snapshot_export_test.go | 98 | | | |
+| 76 | `TestQA_SnapshotExport_ContextCancellation` | qa_snapshot_export_test.go | 158 | | | |
+| 77 | `TestQA_SnapshotExport_ManifestSerializationStable` | qa_snapshot_export_test.go | 173 | | | |
+| 78 | `TestQA_SnapshotExport_ExportDuringLiveIO` | qa_snapshot_export_test.go | 204 | | | |
+| 79 | `TestQA_SnapshotExport_NonexistentSnapshotReject` | qa_snapshot_export_test.go | 250 | | | |
+| 80 | `TestQASnapshot` | qa_snapshot_test.go | 13 | | | |
+| 81 | `TestQA_Profile_WritePath_SingleCorrect` | qa_storage_profile_test.go | 27 | | | |
+| 82 | `TestQA_Profile_ConcurrentWrites_Single` | qa_storage_profile_test.go | 104 | | | |
+| 83 | `TestQA_Profile_SurvivesCrashRecovery` | qa_storage_profile_test.go | 179 | | | |
+| 84 | `TestQA_Profile_CorruptByte_AllValues` | qa_storage_profile_test.go | 230 | | | |
+| 85 | `TestQA_Profile_StripedReject_NoFileLeaked` | qa_storage_profile_test.go | 275 | | | |
+| 86 | 
`TestQA_Profile_ConcurrentCreateSameFile` | qa_storage_profile_test.go | 299 | | | | +| 95 | `TestQA_Profile_SuperblockByteOffset` | qa_storage_profile_test.go | 351 | | | | +| 96 | `TestQA_Profile_MultiBlockWriteRead` | qa_storage_profile_test.go | 392 | | | | +| 97 | `TestQA_Profile_ExpandPreservesProfile` | qa_storage_profile_test.go | 431 | | | | +| 98 | `TestQA_Profile_SnapshotPreservesProfile` | qa_storage_profile_test.go | 508 | | | | +| 99 | `TestResize_ExpandWorks` | resize_test.go | 12 | | | | +| 100 | `TestResize_ShrinkRejected` | resize_test.go | 85 | | | | +| 101 | `TestResize_WithSnapshotsRejected` | resize_test.go | 112 | | | | +| 102 | `TestScrub_CleanVolume` | scrub_test.go | 26 | | | | +| 103 | `TestScrub_DetectCorruption` | scrub_test.go | 54 | | | | +| 104 | `TestScrub_SkipDirtyBlocks` | scrub_test.go | 95 | | | | +| 105 | `TestScrub_SkipRecentlyWritten` | scrub_test.go | 122 | | | | +| 106 | `TestScrub_StatsUpdated` | scrub_test.go | 154 | | | | +| 107 | `TestScrub_TriggerNow` | scrub_test.go | 171 | | | | +| 108 | `TestScrub_StopIdempotent` | scrub_test.go | 190 | | | | +| 109 | `TestScrub_HealthScoreImpact` | scrub_test.go | 200 | | | | +| 110 | `TestScrub_InPassWrite_NoFalsePositive` | scrub_test.go | 248 | | | | +| 111 | `TestManifest_RoundTrip` | snapshot_export_test.go | 12 | | | | +| 112 | `TestManifest_Validate_BadVersion` | snapshot_export_test.go | 52 | | | | +| 113 | `TestManifest_Validate_BadProfile` | snapshot_export_test.go | 60 | | | | +| 114 | `TestManifest_Validate_BadLayout` | snapshot_export_test.go | 68 | | | | +| 115 | `TestManifest_Validate_MissingFields` | snapshot_export_test.go | 76 | | | | +| 116 | `TestExportSnapshot_Basic` | snapshot_export_test.go | 97 | | | | +| 117 | `TestExportSnapshot_ChecksumCorrect` | snapshot_export_test.go | 134 | | | | +| 118 | `TestExportSnapshot_ExistingSnapshot` | snapshot_export_test.go | 159 | | | | +| 119 | `TestExportSnapshot_ProfileReject` | snapshot_export_test.go | 194 | | | | +| 
120 | `TestImportSnapshot_Basic` | snapshot_export_test.go | 206 | | | | +| 121 | `TestImportSnapshot_SizeMismatch` | snapshot_export_test.go | 242 | | | | +| 122 | `TestImportSnapshot_NonEmptyReject` | snapshot_export_test.go | 258 | | | | +| 123 | `TestImportSnapshot_AllowOverwrite` | snapshot_export_test.go | 278 | | | | +| 124 | `TestImportSnapshot_ChecksumMismatch` | snapshot_export_test.go | 313 | | | | +| 125 | `TestImportSnapshot_DoubleImportReject` | snapshot_export_test.go | 336 | | | | +| 126 | `TestExportSnapshot_ClosedVolume` | snapshot_export_test.go | 373 | | | | +| 127 | `TestImportSnapshot_ClosedVolume` | snapshot_export_test.go | 385 | | | | +| 128 | `TestImportSnapshot_ActiveSnapshotsReject` | snapshot_export_test.go | 403 | | | | +| 129 | `TestExportSnapshot_UniqueTempSnapIDs` | snapshot_export_test.go | 440 | | | | +| 130 | `TestSnapshots` | snapshot_test.go | 12 | | | | +| 131 | `TestStorageProfile_String` | storage_profile_test.go | 11 | | | | +| 132 | `TestParseStorageProfile_Valid` | storage_profile_test.go | 28 | | | | +| 133 | `TestParseStorageProfile_Invalid` | storage_profile_test.go | 51 | | | | +| 134 | `TestStorageProfile_StringRoundTrip` | storage_profile_test.go | 58 | | | | +| 135 | `TestSuperblock_ProfilePersistence` | storage_profile_test.go | 71 | | | | +| 136 | `TestSuperblock_BackwardCompat_ProfileZero` | storage_profile_test.go | 95 | | | | +| 137 | `TestSuperblock_InvalidProfileRejected` | storage_profile_test.go | 110 | | | | +| 138 | `TestCreate_WithProfile` | storage_profile_test.go | 123 | | | | +| 139 | `TestCreate_DefaultProfile` | storage_profile_test.go | 140 | | | | +| 140 | `TestOpen_ProfileSurvivesReopen` | storage_profile_test.go | 154 | | | | +| 141 | `TestCreate_StripedRejected` | storage_profile_test.go | 177 | | | | +| 142 | `TestCreate_InvalidProfileRejected` | storage_profile_test.go | 194 | | | | +| 143 | `TestOpen_InvalidProfileOnDisk` | storage_profile_test.go | 211 | | | | + +### other + +| # | Test 
Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestDevOpsActions_Registration` | devops_test.go | 10 | | | | +| 2 | `TestDevOpsActions_Tier` | devops_test.go | 39 | | | | +| 3 | `TestDevOpsActions_TierGating` | devops_test.go | 59 | | | | +| 4 | `TestAllActions_Registration` | devops_test.go | 81 | | | | +| 5 | `TestK8sActions_Registration` | devops_test.go | 123 | | | | +| 6 | `TestK8sActions_TierGating` | devops_test.go | 156 | | | | +| 7 | `TestIntegration` | integration_test.go | 116 | | | | +| 8 | `TestParams` | params_test.go | 9 | | | | +| 9 | `TestQAALUA` | qa_alua_test.go | 11 | | | | +| 10 | `TestQA` | qa_test.go | 18 | | | | + +### recovery + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestRecovery` | recovery_test.go | 10 | | | | + +### replication + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestDistSync_BestEffort_NilGroup` | dist_group_commit_test.go | 36 | | | | +| 2 | `TestDistSync_SyncAll_NilGroup_Succeeds` | dist_group_commit_test.go | 54 | | | | +| 3 | `TestDistSync_SyncAll_AllDegraded_Fails` | dist_group_commit_test.go | 65 | | | | +| 4 | `TestDistSync_SyncQuorum_AllDegraded_RF3_Fails` | dist_group_commit_test.go | 83 | | | | +| 5 | `TestDistSync_BestEffort_BackwardCompat` | dist_group_commit_test.go | 100 | | | | +| 6 | `TestDistSync_Metrics_IncrementOnFailure` | dist_group_commit_test.go | 115 | | | | +| 7 | `TestDistSync_LocalFsyncFail_AlwaysErrors` | dist_group_commit_test.go | 131 | | | | +| 8 | `TestCanonicalizeAddr_WildcardIPv4_UsesAdvertised` | net_util_test.go | 9 | | | | +| 9 | `TestCanonicalizeAddr_WildcardIPv6_UsesAdvertised` | net_util_test.go | 17 | | | | +| 10 | `TestCanonicalizeAddr_NilIP_UsesAdvertised` | net_util_test.go | 25 | | | | +| 11 | `TestCanonicalizeAddr_AlreadyCanonical_Unchanged` | net_util_test.go | 33 | 
| | | +| 12 | `TestCanonicalizeAddr_Loopback_Unchanged` | net_util_test.go | 41 | | | | +| 13 | `TestCanonicalizeAddr_NoAdvertised_FallsBackToOutbound` | net_util_test.go | 49 | | | | +| 14 | `TestPreferredOutboundIP_NotEmpty` | net_util_test.go | 61 | | | | +| 15 | `TestHeartbeat_ReportsPerReplicaState` | rebuild_v1_test.go | 13 | | | | +| 16 | `TestHeartbeat_ReportsNeedsRebuild` | rebuild_v1_test.go | 51 | | | | +| 17 | `TestReplicaState_RebuildComplete_ReentersInSync` | rebuild_v1_test.go | 80 | | | | +| 18 | `TestRebuild_AbortOnEpochChange` | rebuild_v1_test.go | 183 | | | | +| 19 | `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | rebuild_v1_test.go | 231 | | | | +| 20 | `TestRebuild_MissingTailRestartsOrFailsCleanly` | rebuild_v1_test.go | 338 | | | | +| 21 | `TestShipperGroup_ShipAll_Single` | shipper_group_test.go | 8 | | | | +| 22 | `TestShipperGroup_ShipAll_Two` | shipper_group_test.go | 19 | | | | +| 23 | `TestShipperGroup_BarrierAll_AllSucceed` | shipper_group_test.go | 30 | | | | +| 24 | `TestShipperGroup_BarrierAll_OneFail` | shipper_group_test.go | 42 | | | | +| 25 | `TestShipperGroup_BarrierAll_AllFail` | shipper_group_test.go | 53 | | | | +| 26 | `TestShipperGroup_AllDegraded_Empty` | shipper_group_test.go | 67 | | | | +| 27 | `TestShipperGroup_AllDegraded_Mixed` | shipper_group_test.go | 74 | | | | +| 28 | `TestShipperGroup_StopAll` | shipper_group_test.go | 88 | | | | +| 29 | `TestShipperGroup_DegradedCount` | shipper_group_test.go | 98 | | | | +| 30 | `TestAdversarial_ConcurrentBarrierDoesNotCorruptCatchupFailures` | sync_all_adversarial_test.go | 21 | | | | +| 31 | `TestAdversarial_FreshShipperUsesBootstrapNotReconnect` | sync_all_adversarial_test.go | 72 | | | | +| 32 | `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | sync_all_adversarial_test.go | 122 | | | | +| 33 | `TestAdversarial_ReplicaRejectsDuplicateLSN` | sync_all_adversarial_test.go | 193 | | | | +| 34 | `TestAdversarial_ReplicaRejectsGapLSN` | sync_all_adversarial_test.go | 
247 | | | | +| 35 | `TestAdversarial_NeedsRebuildBlocksAllPaths` | sync_all_adversarial_test.go | 290 | | | | +| 36 | `TestAdversarial_CatchupDoesNotOverwriteNewerData` | sync_all_adversarial_test.go | 417 | | | | +| 37 | `TestAdversarial_CatchupMultipleDisconnects` | sync_all_adversarial_test.go | 486 | | | | +| 38 | `TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind` | sync_all_bug_test.go | 30 | | | | +| 39 | `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | sync_all_bug_test.go | 73 | | | | +| 40 | `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | sync_all_bug_test.go | 163 | | | | +| 41 | `TestSyncAll_FullRoundTrip_WriteAndFlush` | sync_all_bug_test.go | 217 | | | | +| 42 | `TestSyncAll_MultipleFlush_NoWritesBetween` | sync_all_bug_test.go | 264 | | | | +| 43 | `TestReplicaProgress_BarrierUsesFlushedLSN` | sync_all_protocol_test.go | 29 | | | | +| 44 | `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | sync_all_protocol_test.go | 86 | | | | +| 45 | `TestBarrier_RejectsReplicaNotInSync` | sync_all_protocol_test.go | 134 | | | | +| 46 | `TestBarrier_EpochMismatchRejected` | sync_all_protocol_test.go | 169 | | | | +| 47 | `TestReconnect_CatchupFromRetainedWal` | sync_all_protocol_test.go | 225 | | | | +| 48 | `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | sync_all_protocol_test.go | 303 | | | | +| 49 | `TestWalRetention_RequiredReplicaBlocksReclaim` | sync_all_protocol_test.go | 396 | | | | +| 50 | `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | sync_all_protocol_test.go | 464 | | | | +| 51 | `TestReconnect_EpochChangeDuringCatchup_Aborts` | sync_all_protocol_test.go | 520 | | | | +| 52 | `TestReconnect_CatchupTimeout_TransitionsDegraded` | sync_all_protocol_test.go | 592 | | | | +| 53 | `TestBarrier_DuringCatchup_Rejected` | sync_all_protocol_test.go | 658 | | | | +| 54 | `TestBarrier_ReplicaSlowFsync_Timeout` | sync_all_protocol_test.go | 734 | | | | +| 55 | `TestWalRetention_TimeoutTriggersNeedsRebuild` | sync_all_protocol_test.go | 
798 | | | | +| 56 | `TestWalRetention_MaxBytesTriggersNeedsRebuild` | sync_all_protocol_test.go | 872 | | | | +| 57 | `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | sync_all_protocol_test.go | 957 | | | | +| 58 | `TestCatchupReplay_DuplicateEntry_Idempotent` | sync_all_protocol_test.go | 1040 | | | | +| 59 | `TestBestEffort_FlushSucceeds_ReplicaDown` | sync_all_protocol_test.go | 1124 | | | | +| 60 | `TestReplicaState_InitialDisconnected` | sync_all_protocol_test.go | 1185 | | | | +| 61 | `TestReplicaState_ShipDoesNotGrantInSync` | sync_all_protocol_test.go | 1192 | | | | +| 62 | `TestReplicaState_BarrierBootstrapGrantsInSync` | sync_all_protocol_test.go | 1227 | | | | +| 63 | `TestReplicaState_ShipFailureTransitionsToDegraded` | sync_all_protocol_test.go | 1266 | | | | +| 64 | `TestReplicaState_BarrierDegradedReconnectFail_StaysDegraded` | sync_all_protocol_test.go | 1310 | | | | +| 65 | `TestReplicaState_BarrierDegradedReconnectSuccess_RestoresInSync` | sync_all_protocol_test.go | 1324 | | | | +| 66 | `TestShipperGroup_InSyncCount` | sync_all_protocol_test.go | 1359 | | | | +| 67 | `TestBarrierResp_FlushedLSN_Roundtrip` | sync_all_protocol_test.go | 1392 | | | | +| 68 | `TestBarrierResp_BackwardCompat_1Byte` | sync_all_protocol_test.go | 1407 | | | | +| 69 | `TestReplica_FlushedLSN_OnlyAfterSync` | sync_all_protocol_test.go | 1419 | | | | +| 70 | `TestReplica_FlushedLSN_NotOnReceive` | sync_all_protocol_test.go | 1473 | | | | +| 71 | `TestShipper_ReplicaFlushedLSN_UpdatedOnBarrier` | sync_all_protocol_test.go | 1508 | | | | +| 72 | `TestShipper_ReplicaFlushedLSN_Monotonic` | sync_all_protocol_test.go | 1558 | | | | +| 73 | `TestShipperGroup_MinReplicaFlushedLSN` | sync_all_protocol_test.go | 1601 | | | | + +### testrunner + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestAgent_Health` | agent_test.go | 42 | | | | +| 2 | `TestAgent_Health_NoAuth` | agent_test.go | 68 | | | | +| 3 | 
`TestAgent_Auth_Rejection` | agent_test.go | 83 | | | | +| 4 | `TestAgent_Auth_ValidToken` | agent_test.go | 99 | | | | +| 5 | `TestAgent_Phase_EchoAction` | agent_test.go | 117 | | | | +| 6 | `TestAgent_Phase_FailStopsExecution` | agent_test.go | 158 | | | | +| 7 | `TestAgent_Upload_PathSafety` | agent_test.go | 196 | | | | +| 8 | `TestAgent_Upload_ValidPath` | agent_test.go | 223 | | | | +| 9 | `TestAgent_Exec_DisabledByDefault` | agent_test.go | 262 | | | | +| 10 | `TestAgent_Exec_Enabled` | agent_test.go | 277 | | | | +| 11 | `TestAgent_Artifacts_PathSafety` | agent_test.go | 301 | | | | +| 12 | `TestAgent_Artifacts_MissingDir` | agent_test.go | 331 | | | | +| 13 | `TestAgent_Artifacts_NoAuth` | agent_test.go | 346 | | | | +| 14 | `TestAgent_Artifacts_ValidDir` | agent_test.go | 361 | | | | +| 15 | `TestAgent_Phase_VarSubstitution` | agent_test.go | 399 | | | | +| 16 | `TestSaveAndLoadBaseline` | baseline_test.go | 10 | | | | +| 17 | `TestSaveBaselineIdempotent` | baseline_test.go | 55 | | | | +| 18 | `TestLoadLatestBaseline_Empty` | baseline_test.go | 77 | | | | +| 19 | `TestLoadLatestBaseline_PicksLatest` | baseline_test.go | 85 | | | | +| 20 | `TestParseFioMetric_WriteIOPS` | bench_test.go | 86 | | | | +| 21 | `TestParseFioMetric_WriteBW` | bench_test.go | 96 | | | | +| 22 | `TestParseFioMetric_WriteLatency` | bench_test.go | 107 | | | | +| 23 | `TestParseFioMetric_WriteP99` | bench_test.go | 118 | | | | +| 24 | `TestParseFioMetric_ReadIOPS` | bench_test.go | 129 | | | | +| 25 | `TestParseFioMetric_ExplicitDirection` | bench_test.go | 139 | | | | +| 26 | `TestParseFioMetric_AutoDetect` | bench_test.go | 159 | | | | +| 27 | `TestParseFioMetric_UnknownMetric` | bench_test.go | 179 | | | | +| 28 | `TestParseFioMetric_PlainNumber` | bench_test.go | 186 | | | | +| 29 | `TestParseFioMetric_PlainInteger` | bench_test.go | 196 | | | | +| 30 | `TestParseFioMetric_QuotedNumber` | bench_test.go | 206 | | | | +| 31 | `TestParseFioMetric_NumberWithWhitespace` | 
bench_test.go | 216 | | | | +| 32 | `TestParseFioMetric_InvalidInput` | bench_test.go | 226 | | | | +| 33 | `TestParseFioMetric_InvalidJSON` | bench_test.go | 233 | | | | +| 34 | `TestParseFioMetric_EmptyJobs` | bench_test.go | 240 | | | | +| 35 | `TestComputeBenchResult_ThroughputPass` | bench_test.go | 247 | | | | +| 36 | `TestComputeBenchResult_ThroughputFail` | bench_test.go | 257 | | | | +| 37 | `TestComputeBenchResult_ThroughputWarn` | bench_test.go | 264 | | | | +| 38 | `TestComputeBenchResult_LatencyPass` | bench_test.go | 275 | | | | +| 39 | `TestComputeBenchResult_LatencyFail` | bench_test.go | 287 | | | | +| 40 | `TestComputeBenchResult_ZeroBaseline` | bench_test.go | 295 | | | | +| 41 | `TestFormatBenchReport` | bench_test.go | 302 | | | | +| 42 | `TestParsePgbenchTPS` | bench_test.go | 340 | | | | +| 43 | `TestTrimValues` | bench_test.go | 391 | | | | +| 44 | `TestTargetSpecNQN` | bench_test.go | 404 | | | | +| 45 | `TestExtractHost` | benchmark_test.go | 7 | | | | +| 46 | `TestBenchmarkReportHeader_CrossMachineDetection` | benchmark_test.go | 29 | | | | +| 47 | `TestPostcheckPgdataLocalDetection` | benchmark_test.go | 45 | | | | +| 48 | `TestPreflightAddressCheck` | benchmark_test.go | 64 | | | | +| 49 | `TestClusterManager_NilSpec_Noop` | cluster_manager_test.go | 58 | | | | +| 50 | `TestClusterManager_Fallback_Fail` | cluster_manager_test.go | 70 | | | | +| 51 | `TestClusterManager_Fallback_Skip` | cluster_manager_test.go | 90 | | | | +| 52 | `TestClusterManager_SetVars` | cluster_manager_test.go | 110 | | | | +| 53 | `TestClusterManager_Teardown_AutoManaged_Kills` | cluster_manager_test.go | 136 | | | | +| 54 | `TestClusterManager_Teardown_AutoAttached_NoKill` | cluster_manager_test.go | 166 | | | | +| 55 | `TestClusterManager_Teardown_DestroyAttached_Kills` | cluster_manager_test.go | 180 | | | | +| 56 | `TestClusterManager_Teardown_Keep_NoAction` | cluster_manager_test.go | 194 | | | | +| 57 | `TestClusterManager_MeetsRequirements` | 
cluster_manager_test.go | 211 | | | | +| 58 | `TestConsole_ScenariosEndpoint` | console_test.go | 15 | | | | +| 59 | `TestConsole_StatusEndpoint_NoRun` | console_test.go | 41 | | | | +| 60 | `TestConsole_RunEndpoint_MissingScenario` | console_test.go | 59 | | | | +| 61 | `TestConsole_RunEndpoint_InvalidScenario` | console_test.go | 75 | | | | +| 62 | `TestConsole_RunAndPollStatus` | console_test.go | 91 | | | | +| 63 | `TestConsole_ConflictOnDoubleRun` | console_test.go | 159 | | | | +| 64 | `TestConsole_AgentsEndpoint` | console_test.go | 191 | | | | +| 65 | `TestConsole_TiersEndpoint` | console_test.go | 210 | | | | +| 66 | `TestConsole_ReportNotFound` | console_test.go | 239 | | | | +| 67 | `TestConsole_IndexPage` | console_test.go | 254 | | | | +| 68 | `TestCoordinator_RegisterAgent` | coordinator_test.go | 28 | | | | +| 69 | `TestCoordinator_WaitForAgents_AlreadyRegistered` | coordinator_test.go | 44 | | | | +| 70 | `TestCoordinator_WaitForAgents_Timeout` | coordinator_test.go | 54 | | | | +| 71 | `TestCoordinator_WaitForAgents_NoExpected` | coordinator_test.go | 66 | | | | +| 72 | `TestCoordinator_BuildNodeAgentMap` | coordinator_test.go | 74 | | | | +| 73 | `TestCoordinator_ResolveActionAgent` | coordinator_test.go | 94 | | | | +| 74 | `TestCoordinatorAgent_PhaseDispatch` | coordinator_test.go | 133 | | | | +| 75 | `TestCoordinatorAgent_VarMerge` | coordinator_test.go | 222 | | | | +| 76 | `TestCoordinatorAgent_FailureStopsNormalPhases` | coordinator_test.go | 288 | | | | +| 77 | `TestCoordinatorAgent_RetryOnFailure` | coordinator_test.go | 341 | | | | +| 78 | `TestCoordinatorAgent_VarsInResult` | coordinator_test.go | 390 | | | | +| 79 | `TestCoordinatorAgent_DryRun` | coordinator_test.go | 446 | | | | +| 80 | `TestCoordinator_RegisterTokenValidation` | coordinator_test.go | 474 | | | | +| 81 | `TestCoordinator_RegisterNoToken` | coordinator_test.go | 531 | | | | +| 82 | `TestAgent_PersistentReRegistration` | coordinator_test.go | 562 | | | | +| 83 | 
`TestEngine_BasicFlow` | engine_test.go | 26 | | | | +| 84 | `TestEngine_FailureStopsPhase` | engine_test.go | 74 | | | | +| 85 | `TestEngine_IgnoreError` | engine_test.go | 116 | | | | +| 86 | `TestEngine_AlwaysPhaseRunsAfterFailure` | engine_test.go | 155 | | | | +| 87 | `TestEngine_VarSubstitution` | engine_test.go | 203 | | | | +| 88 | `TestEngine_EnvVars` | engine_test.go | 242 | | | | +| 89 | `TestEngine_Timeout` | engine_test.go | 278 | | | | +| 90 | `TestEngine_UnknownAction` | engine_test.go | 323 | | | | +| 91 | `TestEngine_ParallelPhase` | engine_test.go | 351 | | | | +| 92 | `TestResolveVars` | engine_test.go | 398 | | | | +| 93 | `TestEngine_VarsInResult` | engine_test.go | 423 | | | | +| 94 | `TestEngine_Repeat3Pass` | engine_test.go | 466 | | | | +| 95 | `TestEngine_RepeatFailStopsEarly` | engine_test.go | 516 | | | | +| 96 | `TestEngine_RepeatAggregateMedian` | engine_test.go | 562 | | | | +| 97 | `TestEngine_RepeatAggregateMean` | engine_test.go | 628 | | | | +| 98 | `TestEngine_RepeatAggregateNone` | engine_test.go | 674 | | | | +| 99 | `TestTrimOutliers` | engine_test.go | 721 | | | | +| 100 | `TestParse_InlineParams` | engine_test.go | 749 | | | | +| 101 | `TestResolveAction_PreservesInlineParams` | engine_test.go | 820 | | | | +| 102 | `TestEngine_CleanupVars` | engine_test.go | 841 | | | | +| 103 | `TestEngine_ActionTimeout_Enforced` | engine_test.go | 893 | | | | +| 104 | `TestEngine_TempRoot_UniquePerRun` | engine_test.go | 947 | | | | +| 105 | `TestEngine_TempRoot_PreservedIfSet` | engine_test.go | 1006 | | | | +| 106 | `TestParse_AggregateValidation` | engine_test.go | 1036 | | | | +| 107 | `TestEngine_EnvMerge_ExistingVarsWin` | engine_test.go | 1095 | | | | +| 108 | `TestInclude_Basic` | include_test.go | 10 | | | | +| 109 | `TestInclude_Params` | include_test.go | 47 | | | | +| 110 | `TestInclude_NestedInclude` | include_test.go | 79 | | | | +| 111 | `TestInclude_CircularDetected` | include_test.go | 123 | | | | +| 112 | 
`TestInclude_MissingFile` | include_test.go | 150 | | | | +| 113 | `TestInclude_MultiplePhases` | include_test.go | 168 | | | | +| 114 | `TestInclude_ParamsSubstituteNodeAndSaveAs` | include_test.go | 212 | | | | +| 115 | `TestLocalNode_Run_Echo` | local_node_test.go | 18 | | | | +| 116 | `TestLocalNode_Run_ExitCode` | local_node_test.go | 36 | | | | +| 117 | `TestLocalNode_Run_Timeout` | local_node_test.go | 51 | | | | +| 118 | `TestLocalNode_RunRoot_NonRoot` | local_node_test.go | 63 | | | | +| 119 | `TestLocalNode_Upload` | local_node_test.go | 77 | | | | +| 120 | `TestLocalNode_Close` | local_node_test.go | 112 | | | | +| 121 | `TestLocalNode_DetectRoot` | local_node_test.go | 117 | | | | +| 122 | `TestParsePrometheusText` | metrics_test.go | 8 | | | | +| 123 | `TestParsePrometheusText_Empty` | metrics_test.go | 32 | | | | +| 124 | `TestComputeStats` | metrics_test.go | 39 | | | | +| 125 | `TestComputeStats_Empty` | metrics_test.go | 63 | | | | +| 126 | `TestComputeStats_Single` | metrics_test.go | 70 | | | | +| 127 | `TestParsePerfLogLines` | metrics_test.go | 77 | | | | +| 128 | `TestParsePerfLogLines_Empty` | metrics_test.go | 100 | | | | +| 129 | `TestFormatStats` | metrics_test.go | 107 | | | | +| 130 | `TestParse_ValidScenario` | parser_test.go | 8 | | | | +| 131 | `TestParse_MissingName` | parser_test.go | 84 | | | | +| 132 | `TestParse_InvalidNodeRef` | parser_test.go | 98 | | | | +| 133 | `TestParse_PortConflict` | parser_test.go | 124 | | | | +| 134 | `TestParse_InvalidTargetRef` | parser_test.go | 154 | | | | +| 135 | `TestParse_MissingIQNSuffix` | parser_test.go | 180 | | | | +| 136 | `TestParse_NoPhases` | parser_test.go | 205 | | | | +| 137 | `TestParse_AgentTopology_Valid` | parser_test.go | 215 | | | | +| 138 | `TestParse_AgentTopology_InvalidAgentRef` | parser_test.go | 253 | | | | +| 139 | `TestParse_ParallelPhase_SaveAsConflict` | parser_test.go | 275 | | | | +| 140 | `TestParse_SequentialPhase_SaveAsDuplicate_Allowed` | parser_test.go | 303 
| | | | +| 141 | `TestParse_ActionRetryAndTimeout` | parser_test.go | 330 | | | | +| 142 | `TestExtractVarsFromString` | parser_test.go | 361 | | | | +| 143 | `TestRegistry_TierGating` | registry_test.go | 9 | | | | +| 144 | `TestRegistry_ListByTier` | registry_test.go | 54 | | | | +| 145 | `TestRegistry_ActionTier` | registry_test.go | 75 | | | | +| 146 | `TestRegistry_EmptyTiersAllowsAll` | registry_test.go | 91 | | | | +| 147 | `TestCompareBaseline_P99Increase` | regression_test.go | 7 | | | | +| 148 | `TestCompareBaseline_P99WithinLimit` | regression_test.go | 22 | | | | +| 149 | `TestCompareBaseline_IOPSDecrease` | regression_test.go | 34 | | | | +| 150 | `TestCompareBaseline_IOPSWithinLimit` | regression_test.go | 46 | | | | +| 151 | `TestHardFail_DataMismatch` | regression_test.go | 58 | | | | +| 152 | `TestHardFail_BarrierLagUnbounded` | regression_test.go | 68 | | | | +| 153 | `TestHardFail_BarrierLagOK` | regression_test.go | 78 | | | | +| 154 | `TestHardFail_BarrierErrorRate` | regression_test.go | 88 | | | | +| 155 | `TestHardFail_HealthZero` | regression_test.go | 101 | | | | +| 156 | `TestHardFail_HealthZeroDuringFault_OK` | regression_test.go | 111 | | | | +| 157 | `TestHardFail_WALFullStall` | regression_test.go | 121 | | | | +| 158 | `TestHardFail_AllPass` | regression_test.go | 131 | | | | +| 159 | `TestFormatRegressionReport` | regression_test.go | 144 | | | | +| 160 | `TestWriteHTMLReport_Basic` | reporter_html_test.go | 11 | | | | +| 161 | `TestWriteHTMLReport_WithFailure` | reporter_html_test.go | 64 | | | | +| 162 | `TestWriteHTMLReport_WithPerfAndMetrics` | reporter_html_test.go | 103 | | | | +| 163 | `TestWriteHTMLReport_WithArtifacts` | reporter_html_test.go | 139 | | | | +| 164 | `TestPrintSummary` | reporter_test.go | 14 | | | | +| 165 | `TestPrintSummary_WithFailure` | reporter_test.go | 55 | | | | +| 166 | `TestWriteJSON` | reporter_test.go | 84 | | | | +| 167 | `TestWriteJUnitXML` | reporter_test.go | 116 | | | | +| 168 | 
`TestPrintSummary_WithPerfTable` | reporter_test.go | 162 | | | | +| 169 | `TestPrintSummary_WithMetricsTable` | reporter_test.go | 196 | | | | +| 170 | `TestPrintSummary_WithPerfStatsLine` | reporter_test.go | 219 | | | | +| 171 | `TestParsePerfStatsLine` | reporter_test.go | 242 | | | | +| 172 | `TestParsePerfStatsLine_Invalid` | reporter_test.go | 259 | | | | +| 173 | `TestFormatInt` | reporter_test.go | 266 | | | | +| 174 | `TestWriteJSON_WithVars` | reporter_test.go | 284 | | | | +| 175 | `TestBaselineCompare` | reporter_test.go | 322 | | | | +| 176 | `TestCreateRunBundle_CreatesDirectoryAndFiles` | runbundle_test.go | 12 | | | | +| 177 | `TestRunBundle_Finalize_WritesAllOutputs` | runbundle_test.go | 73 | | | | +| 178 | `TestRunBundle_UniqueRunIDs` | runbundle_test.go | 122 | | | | +| 179 | `TestRunBundle_CommandLineRecorded` | runbundle_test.go | 141 | | | | + +### wal + +| # | Test Name | File | Line | Status | Sim | Notes | +|---|-----------|------|------|--------|-----|-------| +| 1 | `TestQA_ParseIOBackend_ValidInputs` | qa_iobackend_config_test.go | 19 | | | | +| 2 | `TestQA_ParseIOBackend_InvalidInputs` | qa_iobackend_config_test.go | 52 | | | | +| 3 | `TestQA_IOBackend_String` | qa_iobackend_config_test.go | 84 | | | | +| 4 | `TestQA_ResolveIOBackend` | qa_iobackend_config_test.go | 105 | | | | +| 5 | `TestQA_Config_Validate_IOBackend_AutoOK` | qa_iobackend_config_test.go | 122 | | | | +| 6 | `TestQA_Config_Validate_IOBackend_StandardOK` | qa_iobackend_config_test.go | 130 | | | | +| 7 | `TestQA_Config_Validate_IOBackend_IOURingRejected` | qa_iobackend_config_test.go | 138 | | | | +| 8 | `TestQA_Config_Validate_IOBackend_OutOfRange` | qa_iobackend_config_test.go | 150 | | | | +| 9 | `TestQA_Config_Validate_IOBackend_NegativeValue` | qa_iobackend_config_test.go | 162 | | | | +| 10 | `TestQA_DefaultConfig_IOBackend_IsAuto` | qa_iobackend_config_test.go | 173 | | | | +| 11 | `TestQA_ApplyDefaults_IOBackend_ZeroStaysAuto` | qa_iobackend_config_test.go | 
182 | | | | +| 12 | `TestQA_ApplyDefaults_IOBackend_ExplicitPreserved` | qa_iobackend_config_test.go | 191 | | | | +| 13 | `TestQA_IOBackend_RoundTrip` | qa_iobackend_config_test.go | 201 | | | | +| 14 | `TestQA_IOBackend_IotaValues` | qa_iobackend_config_test.go | 217 | | | | +| 15 | `TestQA_Admission_PressureOscillation` | qa_wal_admission_test.go | 23 | | | | +| 16 | `TestQA_Admission_StarvationUnderSoftPressure` | qa_wal_admission_test.go | 96 | | | | +| 17 | `TestQA_Admission_HardToSoftTransitionNoDeadlock` | qa_wal_admission_test.go | 130 | | | | +| 18 | `TestQA_Admission_SemaphoreFullWithHardPressureDrain` | qa_wal_admission_test.go | 166 | | | | +| 19 | `TestQA_Admission_DoubleReleaseSafety` | qa_wal_admission_test.go | 210 | | | | +| 20 | `TestQA_Admission_SoftDelayScalingBoundary` | qa_wal_admission_test.go | 244 | | | | +| 21 | `TestQA_Admission_CloseRaceBothPaths` | qa_wal_admission_test.go | 286 | | | | +| 22 | `TestQA_Admission_ZeroPressureThroughput` | qa_wal_admission_test.go | 341 | | | | +| 23 | `TestQA_Admission_NotifyFnPanicPropagates` | qa_wal_admission_test.go | 370 | | | | +| 24 | `TestQA_Admission_WALUsedFnReturnsAboveOne` | qa_wal_admission_test.go | 397 | | | | +| 25 | `TestQA_Admission_WriteLBAIntegration` | qa_wal_admission_test.go | 417 | | | | +| 26 | `TestQA_Admission_Metrics_ConcurrentCountersConsistent` | qa_wal_admission_test.go | 473 | | | | +| 27 | `TestQA_Admission_Metrics_SemaphoreWaitPathRecords` | qa_wal_admission_test.go | 560 | | | | +| 28 | `TestQA_Admission_Metrics_SemaphoreTimeoutRecords` | qa_wal_admission_test.go | 606 | | | | +| 29 | `TestQA_Admission_Metrics_CloseDuringSemaphoreRecords` | qa_wal_admission_test.go | 642 | | | | +| 30 | `TestQA_Admission_Metrics_Integration_WriteLBA` | qa_wal_admission_test.go | 681 | | | | +| 31 | `TestQA_CP11A3_SoftMarkEqualsHardMark_NoPanic` | qa_wal_cp11a3_adversarial_test.go | 25 | | | | +| 32 | `TestQA_CP11A3_SoftZoneExactBoundary_DelayIsZero` | qa_wal_cp11a3_adversarial_test.go 
| 64 | | | | +| 33 | `TestQA_CP11A3_ConcurrentHardWaiters_TimeAccumulates` | qa_wal_cp11a3_adversarial_test.go | 102 | | | | +| 34 | `TestQA_CP11A3_PressureStateAndAcquireRace` | qa_wal_cp11a3_adversarial_test.go | 156 | | | | +| 35 | `TestQA_CP11A3_TimeInZoneMonotonicity` | qa_wal_cp11a3_adversarial_test.go | 219 | | | | +| 36 | `TestQA_CP11A3_WALGuidance_ZeroInputs` | qa_wal_cp11a3_adversarial_test.go | 288 | | | | +| 37 | `TestQA_CP11A3_WALGuidance_OverflowSafe` | qa_wal_cp11a3_adversarial_test.go | 324 | | | | +| 38 | `TestQA_CP11A3_WALStatusSnapshot_PartialInit` | qa_wal_cp11a3_adversarial_test.go | 349 | | | | +| 39 | `TestQA_CP11A3_ObserverPanic_DocumentedBehavior` | qa_wal_cp11a3_adversarial_test.go | 403 | | | | +| 40 | `TestQA_CP11A3_ConcurrentWALStatusReads` | qa_wal_cp11a3_adversarial_test.go | 442 | | | | +| 41 | `TestQA_WALHardening_SoftPressureVisibility` | qa_wal_hardening_test.go | 13 | | | | +| 42 | `TestQA_WALHardening_HardPressureVisibility` | qa_wal_hardening_test.go | 51 | | | | +| 43 | `TestQA_WALHardening_PressureStateTransitions` | qa_wal_hardening_test.go | 88 | | | | +| 44 | `TestQA_WALHardening_NilSafe` | qa_wal_hardening_test.go | 124 | | | | +| 45 | `TestQA_WALHardening_ObserverCallbackContract` | qa_wal_hardening_test.go | 144 | | | | +| 46 | `TestQA_WALHardening_ExportSemantics` | qa_wal_hardening_test.go | 178 | | | | +| 47 | `TestWALAdmission_AcquireRelease_Basic` | wal_admission_test.go | 11 | | | | +| 48 | `TestWALAdmission_SoftWatermark_Throttles` | wal_admission_test.go | 48 | | | | +| 49 | `TestWALAdmission_BelowSoft_NoThrottle` | wal_admission_test.go | 75 | | | | +| 50 | `TestWALAdmission_HardWatermark_BlocksUntilDrain` | wal_admission_test.go | 97 | | | | +| 51 | `TestWALAdmission_HardWatermark_Timeout` | wal_admission_test.go | 133 | | | | +| 52 | `TestWALAdmission_ClosedDuringHardWait` | wal_admission_test.go | 153 | | | | +| 53 | `TestWALAdmission_Concurrent_BoundedWriters` | wal_admission_test.go | 174 | | | | +| 54 | 
`TestWALAdmission_FlusherNotified_OnSoftAndHard` | wal_admission_test.go | 221 | | | | +| 55 | `TestWALAdmission_SingleBudget_HardThenSemaphore` | wal_admission_test.go | 268 | | | | +| 56 | `TestWALAdmission_CloseDuringSemaphoreWait` | wal_admission_test.go | 319 | | | | +| 57 | `TestWALAdmission_Metrics_NoPressure` | wal_admission_test.go | 358 | | | | +| 58 | `TestWALAdmission_Metrics_SoftWatermark` | wal_admission_test.go | 389 | | | | +| 59 | `TestWALAdmission_Metrics_HardWatermark` | wal_admission_test.go | 418 | | | | +| 60 | `TestWALAdmission_Metrics_Timeout` | wal_admission_test.go | 457 | | | | +| 61 | `TestWALAdmission_Metrics_NilMetrics` | wal_admission_test.go | 486 | | | | +| 62 | `TestWALAdmission_Metrics_ClosedDuringHard` | wal_admission_test.go | 506 | | | | +| 63 | `TestWALAdmission_PressureState_Normal` | wal_admission_test.go | 537 | | | | +| 64 | `TestWALAdmission_PressureState_Soft` | wal_admission_test.go | 551 | | | | +| 65 | `TestWALAdmission_PressureState_Hard` | wal_admission_test.go | 565 | | | | +| 66 | `TestWALAdmission_SoftPressureWaitTracking` | wal_admission_test.go | 579 | | | | +| 67 | `TestWALAdmission_HardPressureWaitTracking` | wal_admission_test.go | 605 | | | | +| 68 | `TestWALAdmission_Metrics_WaitObserverCalled` | wal_admission_test.go | 636 | | | | +| 69 | `TestWALAdmission_ThresholdAccessors` | wal_admission_test.go | 665 | | | | +| 70 | `TestWALEntry` | wal_entry_test.go | 10 | | | | +| 71 | `TestWALSizingGuidance_AdequateGeneral` | wal_guidance_test.go | 9 | | | | +| 72 | `TestWALSizingGuidance_UndersizedDatabase` | wal_guidance_test.go | 16 | | | | +| 73 | `TestWALSizingGuidance_UndersizedThroughput` | wal_guidance_test.go | 26 | | | | +| 74 | `TestWALSizingGuidance_AbsoluteMinimum` | wal_guidance_test.go | 33 | | | | +| 75 | `TestWALSizingGuidance_UnknownHint` | wal_guidance_test.go | 43 | | | | +| 76 | `TestEvaluateWALConfig_HighConcurrencySmallWAL` | wal_guidance_test.go | 60 | | | | +| 77 | 
`TestEvaluateWALConfig_SaneDefaults` | wal_guidance_test.go | 68 | | | | +| 78 | `TestWALStatus_ReflectsVolumeState` | wal_guidance_test.go | 75 | | | | +| 79 | `TestWALStatus_NilAdmission` | wal_guidance_test.go | 103 | | | | +| 80 | `TestWALStatus_IncludesThresholds` | wal_guidance_test.go | 117 | | | | +| 81 | `TestWALWriter` | wal_writer_test.go | 9 | | | | diff --git a/sw-block/test/test_db_v2.md b/sw-block/test/test_db_v2.md new file mode 100644 index 000000000..bab51388a --- /dev/null +++ b/sw-block/test/test_db_v2.md @@ -0,0 +1,105 @@ +# V2 Test Database + +Date: 2026-03-27 +Status: working subset + +## Purpose + +This is the V2-focused review subset derived from: + +- `sw-block/test/test_db.md` +- `learn/projects/sw-block/phases/phase13_test.md` +- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` + +Use this file to review and track the tests that most directly help: + +- V2 protocol design +- simulator coverage +- V1 / V1.5 / V2 comparison +- V2 acceptance boundaries + +This is intentionally much smaller than the full `test_db.md`. 
+ +## Review Codes + +### Status + +- `picked`: selected into this V2 subset +- `reviewed`: reviewed against the V2 design +- `mapped`: mapped onto simulator coverage + +### Sim + +- `sim_core`: core simulator coverage target +- `sim_reduced`: reduced/supporting simulator shape +- `real_only`: only meaningful against the real system +- `v2_boundary`: V2 acceptance boundary case +- `sim_not_needed_yet`: deferred from simulation for now + +## V2 Boundary Tests + +| # | Test Name | File | Line | Level | Status | Sim | Notes | +|---|---|---|---|---|---|---|---| +| 1 | `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | V1/V1.5 sender identity loss; should become V2 acceptance case | +| 2 | `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | `NeedsRebuild` must remain sticky under stable per-replica sender identity | +| 3 | `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Catch-up correctness depends on identity continuity and proper recovery ownership | +| 4 | `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | | `unit` | `picked` | `v2_boundary` | Multiple reconnect cycles are a V2 sender-loop / recovery-session acceptance target | + +## Core Protocol Tests + +| # | Test Name | File | Line | Level | Status | Sim | Notes | +|---|---|---|---|---|---|---|---| +| 1 | `TestRecovery` | `recovery_test.go` | | `unit` | `picked` | `sim_core` | Crash recovery correctness is fundamental to block protocol reasoning | +| 2 | `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Durable-progress truth; barrier must count flushed progress, not send progress | +| 3 | `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Progress monotonicity invariant | +| 4 | `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Only eligible replica states count for strict durability | +| 5 | `TestBarrier_EpochMismatchRejected` | 
`sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Epoch fencing on barrier path | +| 6 | `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | `sync_all` strictness during degraded state | +| 7 | `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | End-to-end strict replication contract | +| 8 | `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Contrasts best_effort vs strict modes | +| 9 | `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | No false durability from degraded shipper | +| 10 | `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | | `unit` | `picked` | `sim_core` | Availability semantics under strict mode | +| 11 | `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_core` | Recoverability after degraded shipper | +| 12 | `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Short-gap catch-up | +| 13 | `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Catch-up vs rebuild boundary | +| 14 | `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery fencing during catch-up | +| 15 | `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Recovery data correctness | +| 16 | `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Replay idempotence | +| 17 | `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | State-machine correctness during recovery | +| 18 | 
`TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild lifecycle closure | +| 19 | `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Rebuild fencing | +| 20 | `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_core` | Safe rebuild failure behavior | +| 21 | `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention rule under lag | +| 22 | `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention timeout boundary | +| 23 | `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | | `unit` | `picked` | `sim_core` | Retention budget boundary | +| 24 | `TestComponent_FailoverPromote` | `component_test.go` | | `component` | `picked` | `sim_core` | Core failover baseline | +| 25 | `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Strict-mode failover | +| 26 | `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Restart/rejoin lifecycle | +| 27 | `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Promotion safety and stale candidate rejection | +| 28 | `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | | `unit` | `picked` | `sim_core` | Multi-promotion lineage | +| 29 | `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | | `unit` | `picked` | `sim_core` | Mode normalization | +| 30 | `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_core` | Best-effort contract | +| 31 | `CP13-8 T4a: sync_all blocks during outage` | `manual` | | `integration` | `picked` | `sim_core` | Strict outage semantics | 
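Several of the core-protocol rows above circle one invariant cluster: a barrier counts only flushed progress, only in-sync replicas are eligible, and an epoch mismatch is fenced off. A minimal sketch of how those three checks compose, using hypothetical types and names (this is not the real sw-block API):

```go
package main

import "fmt"

// ReplicaState and Replica are illustrative stand-ins for the real types.
type ReplicaState int

const (
	InSync ReplicaState = iota
	CatchingUp
	NeedsRebuild
)

type Replica struct {
	State      ReplicaState
	Epoch      uint64
	FlushedLSN uint64 // durable progress, not merely sent
}

// barrierSatisfied reports whether every replica has durably flushed up to
// targetLSN in the current epoch. Replicas that are catching up or awaiting
// rebuild never count toward strict durability, and any epoch mismatch is
// rejected outright (fencing).
func barrierSatisfied(replicas []Replica, epoch, targetLSN uint64) bool {
	for _, r := range replicas {
		if r.Epoch != epoch {
			return false // epoch fencing
		}
		if r.State != InSync {
			return false // only eligible states count
		}
		if r.FlushedLSN < targetLSN {
			return false // sent-but-not-flushed is not durable
		}
	}
	return true
}

func main() {
	rs := []Replica{
		{State: InSync, Epoch: 3, FlushedLSN: 120},
		{State: CatchingUp, Epoch: 3, FlushedLSN: 90},
	}
	fmt.Println(barrierSatisfied(rs, 3, 100)) // catching-up replica blocks the barrier
	rs[1].State = InSync
	rs[1].FlushedLSN = 100
	fmt.Println(barrierSatisfied(rs, 3, 100)) // now every replica is durably at the target
}
```

The ordering of the checks is not significant; what the tests above pin down is that no single check can be skipped without creating false durability.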
+ +## Reduced / Supporting Tests + +| # | Test Name | File | Line | Level | Status | Sim | Notes | +|---|---|---|---|---|---|---|---| +| 1 | `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Advisory WAL-head recovery shape | +| 2 | `testRecoverNoSuperblockPersist` | `recovery_test.go` | | `unit` | `picked` | `sim_reduced` | Recovery despite optimized persist behavior | +| 3 | `TestQAGroupCommitter` | `blockvol_qa_test.go` | | `unit` | `picked` | `sim_reduced` | Commit batching semantics | +| 4 | `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | | `unit` | `picked` | `sim_reduced` | Backpressure behavior | +| 5 | `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | | `unit` | `picked` | `sim_reduced` | Idempotent flush shape | +| 6 | `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Progress initialization after rebuild | +| 7 | `TestComponent_ManualPromote` | `component_test.go` | | `component` | `picked` | `sim_reduced` | Manual control-path shape | +| 8 | `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Heartbeat observability | +| 9 | `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | | `unit` | `picked` | `sim_reduced` | Control-plane visibility | +| 10 | `TestComponent_ExpandThenFailover` | `component_test.go` | | `component` | `picked` | `sim_reduced` | State continuity across operations | +| 11 | `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | | `component` | `picked` | `sim_reduced` | Default mode behavior | +| 12 | `CP13-8 T4b: recovery after restart` | `manual` | | `integration` | `picked` | `sim_reduced` | Recovery-time shape and control-plane/local-reconnect interaction | + +## Notes + +- This file is the actionable V2 subset, not the master inventory. 
+- If `tester` later finalizes a broader 70-case picked set, expand this file from that selection. +- The 4 V2-boundary tests must remain present even if they fail on V1/V1.5. diff --git a/sw-block/test/v2_selected.md b/sw-block/test/v2_selected.md new file mode 100644 index 000000000..5abe207a2 --- /dev/null +++ b/sw-block/test/v2_selected.md @@ -0,0 +1,115 @@ +# V2-Selected Test Worklist + +Date: 2026-03-27 +Status: working + +## Purpose + +This is the V2-facing subset of the larger block-service test database. + +Sources: + +- `sw-block/test/test_db.md` +- `learn/projects/sw-block/phases/phase13_test.md` +- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md` + +This file is for: + +- tests that should help V2 design and simulator work +- explicit inclusion of the 4 Phase 13 V2-boundary failures +- a working set that `tester`, `sw`, and design can refine further + +## Current Inclusion Rule + +Include tests that are: + +- `sim_core` +- `sim_reduced` +- `v2_boundary` + +Prefer tests that directly inform: + +- barriers and durability truth +- catch-up vs rebuild +- failover / promotion +- WAL retention / tail-chasing +- mode semantics +- endpoint / identity / reassignment behavior + +## Phase 13 V2-Boundary Tests + +These must stay visible in the V2 worklist: + +| Test | File | Why It Matters To V2 | +|---|---|---| +| `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | Sender identity and reconnect ownership | +| `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | `NeedsRebuild` must remain sticky and identity-safe | +| `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | Catch-up must preserve data correctness under identity continuity | +| `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | Multiple reconnect cycles require stable per-replica sender ownership | + +## High-Value V2 Working Set + +This is the current distilled 
working set from `phase13_test.md`. + +| Test | File | Current Result | Mapping | Why It Helps V2 | +|---|---|---|---|---| +| `TestRecovery` | `recovery_test.go` | PASS | `sim_core` | Crash recovery correctness | +| `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | PASS | `sim_core` | Barrier truth / durable progress | +| `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Monotonic progress invariant | +| `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-gated strict durability | +| `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | Epoch fencing | +| `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | `sync_all_bug_test.go` | PASS | `sim_core` | `sync_all` strictness during outage | +| `TestSyncAll_FullRoundTrip_WriteAndFlush` | `sync_all_bug_test.go` | PASS | `sim_core` | End-to-end strict replication | +| `TestBestEffort_FlushSucceeds_ReplicaDown` | `sync_all_protocol_test.go` | PASS | `sim_core` | Mode difference vs strict sync | +| `TestShip_DegradedDoesNotSilentlyCountAsHealthy` | `sync_all_protocol_test.go` | PASS | `sim_core` | No false durability | +| `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | PASS | `sim_core` | Availability semantics | +| `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | PASS | `sim_core` | Recoverability after degraded shipper | +| `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | PASS | `sim_core` | Short-gap catch-up | +| `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Catch-up vs rebuild boundary | +| `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery fencing | +| `TestCatchupReplay_DataIntegrity_AllBlocksMatch` | `sync_all_protocol_test.go` | PASS | `sim_core` | Recovery 
data correctness | +| `TestCatchupReplay_DuplicateEntry_Idempotent` | `sync_all_protocol_test.go` | PASS | `sim_core` | Replay idempotence | +| `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | PASS | `sim_core` | State-machine correctness | +| `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild lifecycle | +| `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | PASS | `sim_core` | Rebuild fencing | +| `TestRebuild_MissingTailRestartsOrFailsCleanly` | `rebuild_v1_test.go` | PASS | `sim_core` | No partial/unsafe rebuild success | +| `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention rule | +| `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention timeout boundary | +| `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | PASS | `sim_core` | Retention budget boundary | +| `TestComponent_FailoverPromote` | `component_test.go` | PASS | `sim_core` | Failover baseline | +| `TestCP13_SyncAll_FailoverPromotesReplica` | `cp13_protocol_test.go` | PASS | `sim_core` | Strict-mode failover | +| `TestCP13_SyncAll_ReplicaRestart_Rejoin` | `cp13_protocol_test.go` | PASS | `sim_core` | Restart/rejoin lifecycle | +| `TestQA_LSNLag_StaleReplicaSkipped` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Promotion safety | +| `TestQA_CascadeFailover_RF3_EpochChain` | `qa_block_edge_cases_test.go` | PASS | `sim_core` | Multi-promotion lineage | +| `TestDurabilityMode_Validate_SyncQuorum_RF2_Rejected` | `durability_mode_test.go` | PASS | `sim_core` | Mode normalization | +| `TestCP13_BestEffort_SurvivesReplicaDeath` | `cp13_protocol_test.go` | PASS | `sim_core` | Best-effort contract | +| `CP13-8 T4a: sync_all blocks during outage` | `manual` | PASS | `sim_core` | Strict outage semantics | +| `CP13-8 T4b: recovery after restart` | `manual` | PASS | `sim_reduced` | 
Recovery-time shape | + +## Reduced / Supporting Cases To Keep In View + +| Test | File | Current Result | Mapping | Why It Helps V2 | +|---|---|---|---|---| +| `testRecoverExtendedScanPastStaleHead` | `recovery_test.go` | PASS | `sim_reduced` | Advisory WAL-head recovery shape | +| `testRecoverNoSuperblockPersist` | `recovery_test.go` | PASS | `sim_reduced` | Recoverability despite optimized persist behavior | +| `TestQAGroupCommitter` | `blockvol_qa_test.go` | PASS | `sim_reduced` | Commit batching semantics | +| `TestQA_Admission_WriteLBAIntegration` | `qa_wal_admission_test.go` | PASS | `sim_reduced` | Backpressure behavior | +| `TestSyncAll_MultipleFlush_NoWritesBetween` | `sync_all_bug_test.go` | PASS | `sim_reduced` | Idempotent flush shape | +| `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Progress initialization | +| `TestComponent_ManualPromote` | `component_test.go` | PASS | `sim_reduced` | Manual control-path shape | +| `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Heartbeat observability | +| `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | PASS | `sim_reduced` | Control-plane visibility | +| `TestComponent_ExpandThenFailover` | `component_test.go` | PASS | `sim_reduced` | Cross-operation state continuity | +| `TestCP13_DurabilityModeDefault` | `cp13_protocol_test.go` | PASS | `sim_reduced` | Default mode behavior | + +## Working Note + +`phase13_test.md` currently contains the mapped subset from the real test inventory. + +This V2 copy is intentionally narrower: + +- preserve the core tests that define the protocol story +- preserve the 4 V2-boundary tests explicitly +- keep a smaller reduced set for supporting invariants + +If `tester` finalizes a broader 70-case working set, extend this file rather than editing the full copied database directly.
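The inclusion rule stated earlier (keep `sim_core`, `sim_reduced`, and `v2_boundary`) can be applied mechanically when extending this file. A minimal sketch with hypothetical types, purely to illustrate the filter (the markdown tables remain the source of truth; nothing here is real tooling):

```go
package main

import "fmt"

// TestEntry is an illustrative stand-in for one inventory row.
type TestEntry struct {
	Name string
	Sim  string // sim_core | sim_reduced | real_only | v2_boundary | sim_not_needed_yet
}

// v2Selected applies the current inclusion rule: keep entries mapped to
// sim_core, sim_reduced, or v2_boundary; drop the rest.
func v2Selected(entries []TestEntry) []TestEntry {
	keep := map[string]bool{
		"sim_core":    true,
		"sim_reduced": true,
		"v2_boundary": true,
	}
	var out []TestEntry
	for _, e := range entries {
		if keep[e.Sim] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	inv := []TestEntry{
		{"TestRecovery", "sim_core"},
		{"TestWALWriter", "real_only"},
		{"TestAdversarial_ReconnectUsesHandshakeNotBootstrap", "v2_boundary"},
	}
	for _, e := range v2Selected(inv) {
		fmt.Println(e.Name)
	}
}
```

If the inclusion rule changes (for example, when the broader 70-case set lands), only the `keep` set needs to change.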