fix: stable ServerID in assignments, fail-closed on missing identity, wire into ProcessAssignments
Finding 1: Identity no longer address-derived

- ReplicaAddr.ServerID field added (stable server identity from registry)
- BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path)
- ControlBridge uses ServerID, NOT address, for ReplicaID
- Missing ServerID → replica skipped (fail closed), logged

Finding 2: Wired into real ProcessAssignments

- BlockService.v2Bridge field initialized in StartBlockService
- ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment BEFORE existing V1 processing (parallel, not replacing yet)
- Logged at glog V(1)

Finding 3: Fail-closed on missing identity

- Empty ServerID in ReplicaAddrs → replica skipped with log
- Empty ReplicaServerID in scalar path → no replica created
- Test: MissingServerID_FailsClosed verifies both paths

7 tests: StableServerID, AddressChange_IdentityPreserved, MultiReplica_StableServerIDs, MissingServerID_FailsClosed, EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Branch: feature/sw-block
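The fail-closed conversion described above can be sketched as follows. This is a minimal illustration, not the real `v2bridge` code: the type and function shapes here are assumptions, keeping only the behavior the commit message states (a replica without a stable ServerID is skipped and logged, never identified by its address).

```go
package main

import "fmt"

// Hypothetical stand-ins for the real assignment types; only the
// ServerID / address split mirrors the commit's description.
type ReplicaAddr struct {
	Address  string // mutable endpoint, NOT identity
	ServerID string // stable server identity from the registry
}

type Assignment struct {
	ReplicaAddrs []ReplicaAddr
}

// convertAssignment keeps only replicas that carry a stable ServerID.
// A missing ServerID fails closed: the replica is skipped with a log
// line instead of falling back to an address-derived identity.
func convertAssignment(a Assignment) []string {
	var replicaIDs []string
	for _, r := range a.ReplicaAddrs {
		if r.ServerID == "" {
			fmt.Printf("skipping replica at %s: missing ServerID (fail closed)\n", r.Address)
			continue
		}
		replicaIDs = append(replicaIDs, r.ServerID)
	}
	return replicaIDs
}

func main() {
	ids := convertAssignment(Assignment{ReplicaAddrs: []ReplicaAddr{
		{Address: "10.0.0.1:8080", ServerID: "srv-a"},
		{Address: "10.0.0.2:8080"}, // no ServerID: skipped, not address-identified
	}})
	fmt.Println(ids)
}
```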
55 changed files with 9024 additions and 270 deletions
 105  sw-block/.private/phase/phase-04-decisions.md
  40  sw-block/.private/phase/phase-04-log.md
  81  sw-block/.private/phase/phase-04.md
  94  sw-block/.private/phase/phase-05-decisions.md
  78  sw-block/.private/phase/phase-05-log.md
 356  sw-block/.private/phase/phase-05.md
  68  sw-block/.private/phase/phase-06-decisions.md
  51  sw-block/.private/phase/phase-06-log.md
 193  sw-block/.private/phase/phase-06.md
 119  sw-block/.private/phase/phase-07-decisions.md
  63  sw-block/.private/phase/phase-07-log.md
 220  sw-block/.private/phase/phase-07.md
  78  sw-block/.private/phase/phase-08-decisions.md
  21  sw-block/.private/phase/phase-08-log.md
 254  sw-block/.private/phase/phase-08.md
  59  sw-block/.private/phase/phase-4.5-decisions.md
  33  sw-block/.private/phase/phase-4.5-log.md
 397  sw-block/.private/phase/phase-4.5-reason.md
 356  sw-block/.private/phase/phase-4.5.md
  18  sw-block/design/README.md
 117  sw-block/design/a5-a8-traceability.md
 304  sw-block/design/agent_dev_process.md
 403  sw-block/design/phase-07-service-slice-plan.md
 686  sw-block/design/v2-algorithm-overview.md
 660  sw-block/design/v2-algorithm-overview.zh.md
1068  sw-block/design/v2-detailed-algorithm.zh.md
 170  sw-block/design/v2-engine-readiness-review.md
 191  sw-block/design/v2-engine-slicing-plan.md
 199  sw-block/design/v2-production-roadmap.md
 561  sw-block/design/v2-protocol-truths.md
  13  sw-block/prototype/distsim/cluster.go
   2  sw-block/prototype/distsim/cluster_test.go
   6  sw-block/prototype/distsim/phase02_candidate_test.go
 219  sw-block/prototype/distsim/phase045_adversarial_test.go
 334  sw-block/prototype/distsim/phase045_crash_test.go
 160  sw-block/prototype/distsim/predicates.go
   7  sw-block/prototype/distsim/simulator.go
 242  sw-block/prototype/distsim/storage.go
 118  weed/server/master_block_failover.go
  56  weed/server/master_block_registry.go
  60  weed/server/master_block_registry_test.go
   7  weed/server/master_grpc_server.go
 481  weed/server/qa_block_edge_cases_test.go
  10  weed/server/volume_grpc_client_to_master.go
   5  weed/server/volume_server.go
  32  weed/server/volume_server_block.go
  77  weed/server/volume_server_block_debug.go
   1  weed/storage/blockvol/block_heartbeat.go
  21  weed/storage/blockvol/blockvol.go
  11  weed/storage/blockvol/shipper_group.go
  58  weed/storage/blockvol/testrunner/scenarios/internal/robust-slow-replica.yaml
  77  weed/storage/blockvol/v2bridge/control.go
 220  weed/storage/blockvol/v2bridge/control_test.go
  25  weed/storage/blockvol/wal_shipper.go
   9  weed/storage/store_blockvol.go
@ -0,0 +1,94 @@
# Phase 05 Decisions

## Decision 1: Real V2 engine work lives under `sw-block/engine/replication/`

The first real engine slice is established under:

- `sw-block/engine/replication/`

This keeps V2 separate from:

- `sw-block/prototype/`
- `weed/storage/blockvol/`

## Decision 2: Slice 1 is accepted

Accepted scope:

1. stable per-replica sender identity
2. stable recovery-session identity
3. stale authority fencing
4. endpoint / epoch invalidation
5. ownership registry

## Decision 3: Stable identity must not be address-shaped

The engine registry is now keyed by stable `ReplicaID`, not mutable endpoint address.

This is a required structural break from the V1/V1.5 identity-loss pattern.

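The identity split in Decision 3 can be sketched as a registry keyed by `ReplicaID` with the endpoint held as mutable state on the entry. This is an illustrative sketch, not the engine's actual registry code; the entry fields and method names are assumptions.

```go
package main

import "fmt"

// ReplicaID is the stable identity; it never changes when a replica moves.
type ReplicaID string

// replicaEntry hangs mutable state off the stable key. Field names here
// are illustrative, not the real engine types.
type replicaEntry struct {
	Endpoint string // may change across restarts / re-registration
	Epoch    uint64 // bumped on endpoint change to fence stale sessions
}

type registry struct {
	entries map[ReplicaID]*replicaEntry
}

// updateEndpoint changes where the replica lives without changing who it
// is; the epoch bump invalidates sessions bound to the old endpoint.
func (r *registry) updateEndpoint(id ReplicaID, addr string) {
	if e, ok := r.entries[id]; ok {
		e.Endpoint = addr
		e.Epoch++
	}
}

func main() {
	r := &registry{entries: map[ReplicaID]*replicaEntry{
		"vol-1/srv-a": {Endpoint: "10.0.0.1:8080"},
	}}
	r.updateEndpoint("vol-1/srv-a", "10.0.0.9:8080")
	e := r.entries["vol-1/srv-a"]
	fmt.Println(e.Endpoint, e.Epoch)
}
```

The key property is that an address change is an update to an existing entry, never the creation of a new identity, which is exactly the V1/V1.5 identity-loss pattern this decision breaks from.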
## Decision 4: Slice 2 is accepted

Accepted scope:

1. connect / handshake / catch-up flow
2. zero-gap / catch-up / needs-rebuild branching
3. stale execution rejection during active recovery
4. bounded catch-up semantics in engine path
5. rebuild execution shell

## Decision 5: Slice 3 owns real recoverability inputs

Slice 3 should be the point where:

1. recoverable vs unrecoverable gap uses real engine inputs
2. trusted-base / rebuild-source decision uses real engine data inputs
3. truncation / safe-boundary handling is tied to real engine state
4. historical correctness at recovery target is validated from engine inputs

## Decision 6: Slice 3 is accepted

Accepted scope:

1. real engine recoverability input path
2. trusted-base / rebuild-source decision from engine data inputs
3. truncation / safe-boundary handling tied to engine state
4. recoverability gating without overclaiming full historical reconstruction in engine

## Decision 7: Slice 3 should replace carried-forward heuristics where appropriate

In particular:

1. simple rebuild-source heuristics carried from the prototype should not become permanent engine policy
2. Slice 3 should tighten these decisions against real engine recoverability inputs

## Decision 8: Slice 4 is the engine integration closure slice

Next focus:

1. real assignment/control intent entry path
2. engine observability / debug surface
3. focused integration tests for V2-boundary cases
4. validation against selected real failure classes from `learn/projects/sw-block/` and `weed/storage/block*`

## Decision 9: Slice 4 is accepted

Accepted scope:

1. real orchestrator entry path
2. assignment/update-driven recovery through that path
3. engine observability / causal recovery logging
4. diagnosable V2-boundary integration tests

## Decision 10: Phase 05 is complete

Reason:

1. ownership core is accepted
2. recovery execution core is accepted
3. data / recoverability core is accepted
4. integration closure is accepted

Next:

- `Phase 06` broader engine implementation stage
@ -0,0 +1,78 @@
# Phase 05 Log

## 2026-03-29

### Opened

`Phase 05` opened as:

- V2 engine planning + Slice 1 ownership core

### Accepted

1. engine module location
   - `sw-block/engine/replication/`

2. Slice 1 ownership core
   - stable per-replica sender identity
   - stable recovery-session identity
   - sender/session fencing
   - endpoint / epoch invalidation
   - ownership registry

3. Slice 1 identity correction
   - registry now keyed by stable `ReplicaID`
   - mutable `Endpoint` separated from identity
   - real changed-`DataAddr` preservation covered by test

4. Slice 1 encapsulation
   - mutable sender/session authority state no longer exposed directly
   - snapshot/read-only inspection path in place

5. Slice 2 recovery execution core
   - connect / handshake / catch-up flow
   - explicit zero-gap / catch-up / needs-rebuild branching
   - stale execution rejection during active recovery
   - bounded catch-up semantics
   - rebuild execution shell

6. Slice 2 validation
   - corrected tester summary accepted
   - `12` ownership tests + `18` recovery tests = `30` total
   - Slice 2 accepted for progression to Slice 3 planning

7. Slice 3 data / recoverability core
   - `RetainedHistory` introduced as engine-level recoverability input
   - history-driven sender APIs added for handshake and rebuild-source selection
   - trusted-base decision now requires both checkpoint trust and replayable tail
   - truncation remains a completion gate / protocol boundary

8. Slice 3 validation
   - corrected tester summary accepted
   - `12` ownership tests + `18` recovery tests + `18` recoverability tests = `48` total
   - accepted boundary:
     - engine proves historical-correctness prerequisites
     - simulator retains stronger historical reconstruction proof
   - Slice 3 accepted for progression to Slice 4 planning

9. Slice 4 integration closure
   - `RecoveryOrchestrator` added as integrated engine entry path
   - assignment/update-driven recovery is exercised through orchestrator
   - observability surface added:
     - `RegistryStatus`
     - `SenderStatus`
     - `SessionSnapshot`
     - `RecoveryLog`
   - causal recovery logging now covers invalidation, escalation, truncation, completion, rebuild transitions

10. Slice 4 validation
    - corrected tester summary accepted
    - `12` ownership tests + `18` recovery tests + `18` recoverability tests + `11` integration tests = `59` total
    - Slice 4 accepted
    - `Phase 05` accepted as complete

### Next

1. `Phase 06` planning
2. broader engine implementation stage
3. real-engine integration against selected `weed/storage/block*` constraints and failure classes
@ -0,0 +1,356 @@
# Phase 05

Date: 2026-03-29
Status: complete
Purpose: begin the real V2 engine track under `sw-block/` by moving from prototype proof to the first engine slice

## Why This Phase Exists

The project has now completed:

1. V2 design/FSM closure
2. V2 protocol/simulator validation
3. Phase 04 prototype closure
4. Phase 4.5 evidence hardening

So the next step is no longer:

- extend prototype breadth

The next step is:

- start disciplined real V2 engine work

## Phase Goal

Start the real V2 engine line under `sw-block/` with:

1. explicit engine module location
2. Slice 1 ownership-core boundaries
3. first engine ownership-core implementation
4. engine-side validation tied back to accepted prototype invariants

## Relationship To Previous Phases

`Phase 05` is built on:

- `sw-block/design/v2-engine-readiness-review.md`
- `sw-block/design/v2-engine-slicing-plan.md`
- `sw-block/.private/phase/phase-04.md`
- `sw-block/.private/phase/phase-4.5.md`

This is a new implementation phase.

It is not:

1. more prototype expansion
2. V1 integration
3. backend redesign

## Scope

### In scope

1. choose real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write short engine ownership-core spec
4. start Slice 1 implementation:
   - stable per-replica sender object
   - stable recovery-session object
   - session identity fencing
   - endpoint / epoch invalidation
   - ownership registry / sender-group equivalent
5. add focused engine-side ownership/fencing tests

### Out of scope

1. Smart WAL expansion
2. full storage/backend redesign
3. full rebuild-source decision logic
4. V1 production integration
5. performance work
6. full product integration

## Planned Slices

### P0: Engine Planning Setup

1. choose real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write ownership-core spec
4. map 3-5 acceptance scenarios to Slice 1 expectations

Status:

- accepted
- engine module location chosen: `sw-block/engine/replication/`
- Slice 1 boundaries are explicit enough to start implementation

### P1: Slice 1 Ownership Core

1. implement stable per-replica sender object
2. implement stable recovery-session object
3. implement sender/session identity fencing
4. implement endpoint / epoch invalidation
5. implement ownership registry

Status:

- accepted
- stable `ReplicaID` is now explicit and separate from mutable `Endpoint`
- engine registry is keyed by stable identity, not address-shaped strings
- real changed-`DataAddr` preservation is covered by test

### P2: Slice 1 Validation

1. engine-side tests for ownership/fencing
2. changed-address case
3. stale-session rejection case
4. epoch-bump invalidation case
5. traceability back to accepted prototype behavior

Status:

- accepted
- Slice 1 ownership/fencing tests are in place and passing
- acceptance/gate mapping is strong enough to move to Slice 2

### P3: Slice 2 Planning Setup

1. define Slice 2 boundaries explicitly
2. distinguish Slice 2 core from carried-forward prototype support
3. map Slice 2 engine expectations from accepted prototype evidence
4. prepare Slice 2 validation targets

Status:

- accepted
- Slice 2 recovery execution core is implemented and validated
- corrected tester summary accepted:
  - `12` ownership tests
  - `18` recovery tests
  - `30` total

### P4: Slice 3 Planning Setup

1. define Slice 3 boundaries explicitly
2. connect recovery decisions to real engine recoverability inputs
3. make trusted-base / rebuild-source decision use real engine data inputs
4. prepare Slice 3 validation targets

Status:

- accepted
- Slice 3 data / recoverability core is implemented and validated
- corrected tester summary accepted:
  - `12` ownership tests
  - `18` recovery tests
  - `18` recoverability tests
  - `48` total
- important boundary preserved:
  - engine proves historical-correctness prerequisites
  - full historical reconstruction proof remains simulator-side

## Slice 3 Guardrails

Slice 3 is the point where V2 must move from:

- recovery automaton is coherent

to:

- recovery basis is provable

So Slice 3 must stay tight.

### Guardrail 1: No optimistic watermark in place of recoverability proof

Do not accept:

- loose head/tail watermarks
- "looks retained enough"
- heuristic recoverability

Slice 3 should prove:

1. why a gap is recoverable
2. why a gap is unrecoverable

### Guardrail 2: No current extent state pretending to be historical correctness

Do not accept:

- current extent image as substitute for target-LSN truth
- checkpoint/base state that leaks newer state into older historical queries

Slice 3 should prove historical correctness at the actual recovery target.

### Guardrail 3: No `snapshot + tail` without trusted-base proof

Do not accept:

- "snapshot exists" as sufficient

Require:

1. trusted base exists
2. trusted base covers the required base state
3. retained tail can be replayed continuously from that base to the target

If not, recovery must use:

- `FullBase`

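Guardrail 3 can be sketched as a single plan-selection predicate. This is a minimal sketch under assumed input names (the real engine derives these from checkpoint trust and WAL retention state); the point is that `snapshot + tail` is only chosen when the trusted base exists AND the retained tail replays continuously from that base to the target, with everything else falling back to `FullBase`.

```go
package main

import "fmt"

// recoveryInputs is an illustrative stand-in for the engine's real
// recoverability inputs; field names are assumptions.
type recoveryInputs struct {
	TrustedBase  bool   // base checkpoint passed trust checks
	BaseLSN      uint64 // position the trusted base covers up to
	RetainedFrom uint64 // oldest replayable WAL position still retained
	TargetLSN    uint64 // recovery target
}

// choosePlan encodes the guardrail: all three requirements must hold,
// otherwise the decision fails closed to FullBase.
func choosePlan(in recoveryInputs) string {
	tailReplayableFromBase := in.RetainedFrom <= in.BaseLSN+1
	if in.TrustedBase && tailReplayableFromBase && in.BaseLSN <= in.TargetLSN {
		return "SnapshotPlusTail"
	}
	return "FullBase"
}

func main() {
	// Trusted base at 100, tail retained from 50: replay 101..120 is possible.
	fmt.Println(choosePlan(recoveryInputs{TrustedBase: true, BaseLSN: 100, RetainedFrom: 50, TargetLSN: 120}))
	// Tail only retained from 110: the gap 101..109 is gone, so fail closed.
	fmt.Println(choosePlan(recoveryInputs{TrustedBase: true, BaseLSN: 100, RetainedFrom: 110, TargetLSN: 120}))
}
```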
### Guardrail 4: Truncation is protocol boundary, not cleanup policy

Do not treat truncation as:

- optional cleanup
- post-recovery tidying

Treat truncation as:

1. divergent tail removal
2. explicit safe-boundary restoration
3. prerequisite for safe `InSync` / recovery completion where applicable

### P5: Slice 4 Planning Setup

1. define Slice 4 boundaries explicitly
2. connect engine control/recovery core to real assignment/control intent entry path
3. add engine observability / debug surface for ownership and recovery failures
4. prepare integration validation against V2-boundary failure classes

Status:

- accepted
- Slice 4 integration closure is implemented and validated
- corrected tester summary accepted:
  - `12` ownership tests
  - `18` recovery tests
  - `18` recoverability tests
  - `11` integration tests
  - `59` total

## Slice 4 Guardrails

Slice 4 should close integration, not just add an entry point and some logs.

### Guardrail 1: Entry path must actually drive recovery

Do not accept:

- tests that manually push sender/session state while only pretending to use integration entry points

Require:

1. real assignment/control intent entry path
2. session creation / invalidation / restart triggered through that path
3. recovery flow driven from that path, not only from unit-level helper calls

### Guardrail 2: Changed-address must survive the real entry path

Do not accept:

- changed-address correctness proven only at local object level

Require:

1. stable `ReplicaID` survives real assignment/update entry path
2. endpoint update invalidates old session correctly
3. new recovery session is created correctly on updated endpoint

### Guardrail 3: Observability must show protocol causality

Do not accept:

- only state snapshots
- only phase dumps

Require observability that can explain:

1. why recovery entered `NeedsRebuild`
2. why a session was superseded
3. why a completion or progress update was rejected
4. why endpoint / epoch change caused invalidation

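One way to read this guardrail is that every recovery-log entry must carry a cause, not just a resulting state. The sketch below is an assumed shape, not the real `RecoveryLog` schema; it only illustrates the "why, not just what" requirement.

```go
package main

import "fmt"

// recoveryEvent pairs each decision with the protocol fact that forced
// it, so the log can answer "why", not only "what". Field names are
// illustrative assumptions.
type recoveryEvent struct {
	ReplicaID string
	Decision  string // e.g. "NeedsRebuild", "SessionSuperseded"
	Cause     string // the triggering protocol fact
}

// explain returns the recorded cause for the most recent occurrence of a
// decision, which is the causality query the guardrail demands.
func explain(events []recoveryEvent, decision string) (string, bool) {
	for i := len(events) - 1; i >= 0; i-- {
		if events[i].Decision == decision {
			return events[i].Cause, true
		}
	}
	return "", false
}

func main() {
	log := []recoveryEvent{
		{ReplicaID: "vol-1/srv-b", Decision: "SessionSuperseded", Cause: "endpoint changed, epoch bumped"},
		{ReplicaID: "vol-1/srv-b", Decision: "NeedsRebuild", Cause: "gap start below retained WAL floor"},
	}
	cause, ok := explain(log, "NeedsRebuild")
	fmt.Println(ok, cause)
}
```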
### Guardrail 4: Failure replay must be explainable

Do not accept:

- a replay that reproduces failure but cannot explain the cause from engine observability

Require:

1. selected failure-class replays through the real entry path
2. observability sufficient to explain the control/recovery decision
3. reviewability against key V2-boundary failures

## Exit Criteria

Phase 05 Slice 1 is done when:

1. the real V2 engine module location is chosen
2. Slice 1 boundaries are explicit
3. engine ownership core exists under `sw-block/`
4. engine-side ownership/fencing tests pass
5. Slice 1 evidence is reviewable against prototype expectations

This bar is now met.

Phase 05 Slice 2 is done when:

1. engine-side recovery execution flow exists
2. zero-gap / catch-up / needs-rebuild branching is explicit
3. stale execution is rejected during active recovery
4. bounded catch-up semantics are enforced in engine path
5. rebuild execution shell is validated

This bar is now met.

Phase 05 Slice 3 is done when:

1. recoverable vs unrecoverable gap uses real engine recoverability inputs
2. trusted-base / rebuild-source decision uses real engine data inputs
3. truncation / safe-boundary handling is tied to real engine state
4. history-driven engine APIs exist for recovery decisions
5. Slice 3 validation is reviewable without overclaiming full historical reconstruction

This bar is now met.

Phase 05 Slice 4 is done when:

1. real assignment/control intent entry path exists
2. changed-address recovery works through the real entry path
3. observability explains protocol causality, not only state snapshots
4. selected V2-boundary failures are replayable and diagnosable through engine integration tests

This bar is now met.

## Assignment For `sw`

Phase 05 is now complete.

Next phase:

- `Phase 06` broader engine implementation stage

## Assignment For `tester`

Phase 05 validation is complete.

Next phase:

- `Phase 06` engine implementation validation against real-engine constraints and failure classes

## Management Rule

`Phase 05` should stay narrow.

It should start the engine line with:

1. ownership
2. fencing
3. validation

It should not try to absorb later slices early.
@ -0,0 +1,68 @@
# Phase 06 Decisions

## Decision 1: Phase 06 is broader engine implementation, not new design

The protocol shape and engine core contracts were already accepted.

Phase 06 implemented around them.

## Decision 2: Phase 06 must connect to real constraints

This phase explicitly used:

1. `learn/projects/sw-block/` for failure gates and test lineage
2. `weed/storage/block*` for real implementation constraints

without importing V1 structure as the V2 design template.

## Decision 3: Phase 06 should replace key synchronous conveniences

The accepted Slice 4 convenience flows were sufficient for closure work, but broader engine work required real step boundaries.

This is now satisfied via planner/executor separation.

## Decision 4: Phase 06 ends with a runnable engine stage decision

Result:

- yes, the project now has a broader runnable engine stage that is ready to proceed to real-system integration / product-path work

## Decision 5: Phase 06 P0 is accepted

Accepted scope:

1. adapter/module boundaries
2. convenience-flow classification
3. initial real-engine stage framing

## Decision 6: Phase 06 P1 is accepted

Accepted scope:

1. storage/control adapter interfaces
2. `RecoveryDriver` planner/resource-acquisition layer
3. full-base and WAL retention resource contracts
4. fail-closed preconditions on planning paths

## Decision 7: Phase 06 P2 is accepted

Accepted scope:

1. explicit planner/executor split on top of `RecoveryPlan`
2. executor-owned cleanup symmetry on success/failure/cancellation
3. plan-bound rebuild execution with no policy re-derivation at execute time
4. synchronous orchestrator completion helpers remain test-only convenience

## Decision 8: Phase 06 P3 is accepted

Accepted scope:

1. selected real failure classes validated through the engine path
2. cross-layer engine/storage proof validation
3. diagnosable failure when proof or resource acquisition cannot be established

## Decision 9: Phase 06 is complete

Next step:

- `Phase 07` real-system integration / product-path decision
@ -0,0 +1,51 @@
# Phase 06 Log

## 2026-03-30

### Opened

`Phase 06` opened as:

- broader engine implementation stage

### Starting basis

1. `Phase 05`: complete
2. engine core and integration closure accepted
3. next work moves from slice proof to broader runnable engine stage

### Accepted

1. Phase 06 P0
   - adapter/module boundaries defined
   - convenience flows explicitly classified

2. Phase 06 P1
   - storage/control adapter surfaces defined
   - `RecoveryDriver` added as planner/resource-acquisition layer
   - full-base rebuild now has explicit resource contract
   - WAL pin contract tied to actual recovery need
   - driver preconditions fail closed

3. Phase 06 P2
   - explicit planner/executor split accepted
   - executor owns release symmetry on success, failure, and cancellation
   - rebuild execution now consumes plan-bound source/target values
   - tester final validation accepted with reduced-but-sufficient rebuild failure-path coverage

4. Phase 06 P3
   - selected real failure classes validated through the engine path
   - changed-address restart now uses plan cancellation and re-plan flow
   - stale execution is caught through the executor-managed loop
   - cross-layer trusted-base / replayable-tail proof path validated end-to-end
   - rebuild planning failures now clean up sessions and remain diagnosable

### Closed

`Phase 06` closed as complete.

### Next

1. Phase 07 real-system integration / product-path decision
2. service-slice integration against real control/storage surroundings
3. first product-path gating decision
@ -0,0 +1,193 @@
# Phase 06

Date: 2026-03-30
Status: complete
Purpose: move from validated engine slices to the first broader runnable V2 engine stage

## Why This Phase Exists

`Phase 05` established and validated:

1. ownership core
2. recovery execution core
3. recoverability/data gating core
4. integration closure

What still does not exist is a broader engine stage that can run with:

1. real control-plane inputs
2. real persistence/backing inputs
3. non-trivial execution loops instead of only synchronous convenience paths

So `Phase 06` exists to turn the accepted engine shape into the first broader runnable engine stage.

Phase 06 must connect the accepted engine core to real control and real storage truth, not just wrap current abstractions with adapters.

## Phase Goal

Build the first broader V2 engine stage without reopening protocol shape.

This phase should focus on:

1. real engine adapters around the accepted core
2. asynchronous or stepwise execution paths where Slice 4 used synchronous helpers
3. real retained-history / checkpoint input plumbing
4. validation against selected real failure classes and real implementation constraints

## Overall Roadmap

Completed:

1. Phase 01-03: design + simulator
2. Phase 04: prototype closure
3. Phase 4.5: evidence hardening
4. Phase 05: engine slice closure
5. Phase 06: broader engine implementation stage

Next:

1. Phase 07: real-system integration / product-path decision

This roadmap should stay strict:

- no return to broad prototype expansion
- no uncontrolled engine sprawl

## Scope

### In scope

1. control-plane adapter into `sw-block/engine/replication/`
2. retained-history / checkpoint adapter into engine recoverability APIs
3. replacement of synchronous convenience flows with explicit engine steps where needed
4. engine error taxonomy and observability tightening
5. validation against selected real failure classes from:
   - `learn/projects/sw-block/`
   - `weed/storage/block*`

### Out of scope

1. Smart WAL expansion
2. full backend redesign
3. performance optimization as primary goal
4. V1 replacement rollout
5. full product integration

## Phase 06 Items

### P0: Engine Stage Plan

Status:

- accepted
- module boundaries now explicit:
  - `adapter.go`
  - `driver.go`
  - `orchestrator.go` classification
- convenience flows are now classified as:
  - test-only convenience wrapper
  - stepwise engine task
  - planner/executor split

### P1: Control / History Adapters

Status:

- accepted
- `StorageAdapter` boundary exists and is exercised by tests
- full-base rebuild now has a real pin/release contract
- WAL pinning is tied to actual recovery contract, not loose watermark use
- planner fails closed on missing sender / missing session / wrong session kind

### P2: Execution Driver

Status:

- accepted
- executor now owns resource lifecycle on success / failure / cancellation
- catch-up execution is stepwise and budget-checked per progress step
- rebuild execution consumes plan-bound source/target values and does not re-derive policy at execute time
- `CompleteCatchUp` / `CompleteRebuild` remain test-only convenience wrappers
- tester validation accepted with reduced-but-sufficient rebuild failure-path coverage

### P3: Validation Against Real Failure Classes

Status:

- accepted
- changed-address restart now validated through planner/executor path with plan cancellation
- stale epoch/session during active execution now validated through the executor-managed loop
- cross-layer trusted-base / replayable-tail proof path validated end-to-end
- rebuild fallback and pin-failure cleanup now fail closed and are diagnosable

## Guardrails

### Guardrail 1: Do not reopen protocol shape

Phase 06 implemented around accepted engine slices and did not reopen:

1. sender/session authority model
2. bounded catch-up contract
3. recoverability/truncation boundary

### Guardrail 2: Do not let adapters smuggle V1 structure back in

V1 code and docs remain:

1. constraints
2. failure gates
3. integration references

not the V2 architecture template.

### Guardrail 3: Prefer explicit engine steps over synchronous convenience

Key convenience helpers remain test-only. Real engine work now has explicit planner/executor boundaries.

### Guardrail 4: Keep evidence quality high

Phase 06 improved:

1. cross-layer traceability
2. diagnosability
3. real-failure validation

without growing protocol surface.

### Guardrail 5: Do not fake storage truth with metadata-only adapters

Phase 06 now requires:

1. trusted base to come from storage-side truth
2. replayable tail to be grounded in retention state
3. observable rejection when those proofs cannot be established

## Exit Criteria

Phase 06 is done when:

1. engine has real control/history adapters into the accepted core
2. engine has real storage/base adapters into the accepted core
3. key synchronous convenience paths are explicitly classified or replaced by real engine steps where necessary
4. selected real failure classes are validated against the engine stage
5. at least one cross-layer storage/engine proof path is validated end-to-end
6. engine observability remains good enough to explain recovery causality

Status:

- met

## Closeout

`Phase 06` is complete.

It established:

1. a broader runnable engine stage around the accepted Phase 05 core
2. real planner/executor/resource contracts
3. validated failure-class behavior through the engine path
4. diagnosable proof rejection and cleanup behavior

Next step:

- `Phase 07` real-system integration / product-path decision
@@ -0,0 +1,119 @@
# Phase 07 Decisions

## Decision 1: Phase 07 is real-system integration, not protocol redesign

The V2 protocol shape, engine core, and broader runnable engine stage are already accepted.

Phase 07 should integrate them into a real-system service slice.

## Decision 2: Phase 07 should make the first product-path decision

This phase should not only integrate a service slice.

It should also decide:

1. what the first product path is
2. what remains before pre-production hardening

## Decision 3: Phase 07 must preserve accepted V2 boundaries

Phase 07 should preserve:

1. narrow catch-up semantics
2. rebuild as the formal recovery path
3. trusted-base / replayable-tail proof boundaries
4. stable identity / fenced execution / diagnosable failure handling

## Decision 4: Phase 07 P0 service-slice direction is set

Current direction:

1. first service slice = `RF=2` block volume primary + one replica
2. engine remains in `sw-block/engine/replication/`
3. current bridge work starts in `sw-block/bridge/blockvol/`
4. deferred real blockvol-side bridge target = `weed/storage/blockvol/v2bridge/`
5. stable identity mapping is explicit:
   - `ReplicaID = <volume-name>/<server-id>`
6. `blockvol` executes I/O but does not own recovery policy
## Decision 5: Phase 07 P1 is accepted with explicit scope limits

Accepted `P1` coverage is:

1. real reader mapping from `BlockVol` state
2. real retention hold / release wiring into the flusher retention floor
3. one real WAL catch-up scan path through `v2bridge`
4. direct real-adapter tests under `weed/storage/blockvol/v2bridge/`

This acceptance means:

1. the real bridge path is now integrated and evidenced
2. `P1` is not yet acceptance proof of general post-checkpoint catch-up viability

Not accepted as part of `P1`:

1. snapshot transfer execution
2. full-base transfer execution
3. WAL truncation execution
4. master-side confirmed failover / control-intent integration

## Decision 6: Interim committed-truth limitation remains active

`Phase 07 P1` is accepted with an explicit carry-forward limitation:

1. interim `CommittedLSN = CheckpointLSN` is a service-slice mapping, not final V2 protocol truth
2. post-checkpoint catch-up semantics are therefore narrower than final V2 intent
3. later `Phase 07` work must not overclaim this limitation as solved until commit truth is separated from checkpoint truth
## Decision 7: Phase 07 P2 is accepted with scoped replay claims

Accepted `P2` coverage is:

1. real service-path replay for changed-address restart
2. stale epoch / stale session invalidation through the integrated path
3. unrecoverable-gap / needs-rebuild replay with diagnosable proof
4. explicit replay of the post-checkpoint boundary under the interim model

Not accepted as part of `P2`:

1. general integrated engine-driven post-checkpoint catch-up semantics
2. real control-plane delivery from master heartbeat into the bridge
3. rebuild execution beyond the already-deferred executor stubs

## Decision 8: Phase 07 now moves to product-path choice, not more bridge-shape proof

With `P0`, `P1`, and `P2` accepted, the next step is:

1. choose the first product path from accepted service-slice evidence
2. define what remains before pre-production hardening
3. keep unresolved limits explicit rather than hiding them behind broader claims

## Decision 9: Phase 07 P2 must replay the interim limitation explicitly

`Phase 07 P2` should not only replay happy-path or ordinary failure-path integration.

It should also include one explicit replay where:

1. the live bridge path is exercised after checkpoint truth has advanced
2. the observed catch-up limitation is diagnosed as a consequence of the interim mapping
3. the result is not overclaimed as proof of final V2 post-checkpoint catch-up semantics

## Decision 10: Phase 07 P3 is accepted and Phase 07 is complete

The first V2 product path is now explicitly chosen as:

1. `RF=2`
2. `sync_all`
3. existing master / volume-server heartbeat path
4. V2 engine owns recovery policy
5. `v2bridge` provides real storage truth

This decision is accepted with explicit non-claims:

1. not production-ready
2. no real master-side control delivery proof yet
3. no full rebuild execution proof yet
4. no general post-checkpoint catch-up proof yet
5. no full integrated engine -> executor -> `v2bridge` catch-up proof yet

Phase 07 is therefore complete, and the next phase is pre-production hardening.
@@ -0,0 +1,63 @@
# Phase 07 Log

## 2026-03-30

### Opened

`Phase 07` opened as:

- real-system integration / product-path decision

### Starting basis

1. `Phase 06`: complete
2. broader runnable engine stage accepted
3. next work moves from engine-stage validation to real-system service-slice integration

### Delivered

1. Phase 07 P0
   - service-slice plan defined
   - implementation slice proposal delivered
   - bridge layer introduced as:
     - `sw-block/bridge/blockvol/` for current bridge work
     - `weed/storage/blockvol/v2bridge/` as the deferred real integration target
   - stable identity mapping made explicit:
     - `ReplicaID = <volume-name>/<server-id>`
   - engine / blockvol policy boundary made explicit
   - initial bridge tests delivered (`8`)
2. Phase 07 P1
   - real blockvol reader integrated via `weed/storage/blockvol/v2bridge/reader.go`
   - real pinner integrated via `weed/storage/blockvol/v2bridge/pinner.go`
   - one real catch-up executor path integrated via `weed/storage/blockvol/v2bridge/executor.go`
   - direct real-adapter tests delivered in:
     - `weed/storage/blockvol/v2bridge/bridge_test.go`
   - accepted with explicit carry-forward:
     - interim `CommittedLSN = CheckpointLSN` limits post-checkpoint catch-up semantics and is not final V2 commit truth
     - acceptance is for the real integrated bridge path, not for general post-checkpoint catch-up viability
3. Phase 07 P2
   - real service-path failure replay accepted
   - accepted replay set includes:
     - changed-address restart
     - stale epoch / stale session invalidation
     - unrecoverable-gap / needs-rebuild replay
     - explicit post-checkpoint boundary replay
   - evidence kept explicitly scoped:
     - real `v2bridge` WAL-scan execution proven
     - general integrated post-checkpoint catch-up semantics not overclaimed under the interim model
4. Phase 07 P3
   - product-path decision accepted
   - first product path chosen as:
     - `RF=2`
     - `sync_all`
     - existing master / volume-server heartbeat path
     - V2 engine recovery ownership with `v2bridge` real storage truth
   - pre-hardening prerequisites made explicit
   - intentional deferrals and non-claims recorded
   - `Phase 07` completed

### Next

1. Phase 08 pre-production hardening
2. real master/control delivery integration
3. integrated catch-up / rebuild execution closure
@@ -0,0 +1,220 @@
# Phase 07

Date: 2026-03-30
Status: complete
Purpose: connect the broader runnable V2 engine stage to a real-system service slice and decide the first product path

## Why This Phase Exists

`Phase 06` completed the broader runnable engine stage:

1. planner/executor/resource contracts are real
2. selected real failure classes are validated through the engine path
3. cross-layer trusted-base / replayable-tail proof path is validated

What still does not exist is a real-system slice where the engine runs inside actual service boundaries with real control/storage surroundings.

So `Phase 07` exists to answer:

1. how the engine runs as a real subsystem
2. what the first product path should be
3. what integration risks remain before pre-production hardening

## Phase Goal

Establish a real-system integration slice for the V2 engine and make the first product-path decision without reopening protocol shape.

## Scope

### In scope

1. service-slice integration around `sw-block/engine/replication/`
2. real control-plane / lifecycle entry path into the engine
3. real storage-side adapter hookup into existing system boundaries
4. selected real-system failure replay and diagnosis
5. explicit product-path decision framing

### Out of scope

1. broad performance optimization
2. Smart WAL expansion
3. full V1 replacement rollout
4. broad backend redesign
5. production rollout itself

## Phase 07 Items

### P0: Service-Slice Plan

1. define the first real-system service slice that will host the engine
2. define adapter/module boundaries at the service boundary
3. choose the concrete integration path to exercise first
4. identify which current adapters are still mock/test-only and must be replaced first
5. make the first-slice identity/epoch mapping explicit
6. treat `blockvol` as execution backend only, not recovery-policy owner

Status:

- delivered
- planning artifact:
  - `sw-block/design/phase-07-service-slice-plan.md`
- implementation slice proposal:
  - engine core: `sw-block/engine/replication/`
  - bridge adapters: `sw-block/bridge/blockvol/`
  - real blockvol integration target: `weed/storage/blockvol/v2bridge/` (`P1`)
- adapter replacement order:
  - `control_adapter.go` (`P0`) done
  - `storage_adapter.go` (`P0`) done
  - `executor_bridge.go` (`P1`) deferred
  - `observe_adapter.go` (`P1`) deferred
- first-slice identity mapping is explicit:
  - `ReplicaID = <volume-name>/<server-id>`
  - not derived from any address field
- engine / blockvol boundary is explicit:
  - bridge maps intent and state
  - `blockvol` executes I/O
  - `blockvol` does not own recovery policy
- service-slice validation gaps called out for `P1`:
  - real blockvol field mapping
  - real pin/release lifecycle against reclaim/GC
  - assignment timing vs engine session lifecycle
  - executor bridge into real WAL/snapshot work

### P1: Real Entry-Path Integration

1. connect real control/lifecycle events into the engine entry path
2. connect real storage/base/recoverability signals into the engine adapters
3. preserve accepted engine authority/execution/recoverability contracts

Status:

- accepted
- real integration now established for:
  - reader via `weed/storage/blockvol/v2bridge/reader.go`
  - pinner via `weed/storage/blockvol/v2bridge/pinner.go`
  - catch-up executor path via `weed/storage/blockvol/v2bridge/executor.go`
- direct real-adapter tests now exist in:
  - `weed/storage/blockvol/v2bridge/bridge_test.go`
- accepted scope is explicit:
  - real reader
  - real retention hold / release
  - real WAL catch-up scan path
  - direct real bridge evidence for the integrated path
- still deferred:
  - `TransferSnapshot`
  - `TransferFullBase`
  - `TruncateWAL`
  - control intent from confirmed failover / master-side integration
- carry-forward limitation:
  - under interim `CommittedLSN = CheckpointLSN`, this slice proves a real bridge path, not general post-checkpoint catch-up viability
  - post-checkpoint catch-up semantics therefore remain narrower than final V2 intent and do not represent final V2 commit semantics
### P2: Real-System Failure Replay

1. replay selected real failure classes against the integrated service slice
2. confirm diagnosability from logs/status
3. identify any remaining mismatch between engine-stage assumptions and real system behavior

Status:

- accepted
- real service-path replay now accepted for:
  - changed-address restart
  - stale epoch / stale session invalidation
  - unrecoverable-gap / needs-rebuild replay
  - explicit post-checkpoint boundary replay under the interim model
- accepted with scoped limitation:
  - real `v2bridge` WAL-scan execution is proven
  - full integrated engine-driven catch-up semantics are not overclaimed under interim `CommittedLSN = CheckpointLSN`
  - control-plane delivery remains simulated via direct `AssignmentIntent` construction
- carry-forward remains explicit:
  - post-checkpoint catch-up semantics are still narrower than final V2 intent

### P3: Product-Path Decision

1. choose the first product path for V2
2. define what remains before pre-production hardening
3. record what is still intentionally deferred

Status:

- accepted
- first product path chosen:
  - `RF=2`
  - `sync_all`
  - existing master / volume-server heartbeat path
  - V2 engine owns recovery policy
  - `v2bridge` provides real storage truth
- proposal is evidence-grounded and explicitly bounded by accepted `P0/P1/P2` evidence
- pre-hardening prerequisites are explicit:
  - real master control delivery
  - full integrated engine -> executor -> `v2bridge` catch-up chain
  - separation of committed truth from checkpoint truth
  - rebuild execution (`snapshot` / `full-base` / `truncation`)
  - pinner / flusher behavior under concurrent load
- intentionally deferred:
  - `RF>2`
  - Smart WAL optimizations
  - `best_effort` background recovery
  - performance tuning
  - full V1 replacement
- non-claims remain explicit:
  - not production-ready
  - no end-to-end rebuild proof yet
  - no general post-checkpoint catch-up proof
  - no real master heartbeat/control delivery proof yet
  - no full integrated engine -> executor -> `v2bridge` catch-up proof yet

## Guardrails

### Guardrail 1: Do not re-import V1 structure as the design owner

Use `weed/storage/block*` and `learn/projects/sw-block/` as constraints and validation sources, not as the architecture template.

### Guardrail 2: Keep catch-up narrow and rebuild explicit

Do not use integration work as an excuse to widen catch-up semantics or blur rebuild as the formal recovery path.

### Guardrail 3: Prefer real entry paths over test-only wrappers

The integrated slice should exercise real service boundaries, not only internal engine helpers.

### Guardrail 4: Observability must explain causality

Integrated logs/status must explain:

1. why rebuild was required
2. why proof was rejected
3. why execution was cancelled or invalidated
4. why a product-path integration failed

### Guardrail 5: Stable identity must not collapse back to address shape

For the first slice, `ReplicaID` must be derived from master/block-registry identity, not current endpoint addresses.

### Guardrail 6: `blockvol` executes I/O but does not own recovery policy

The service bridge may translate engine decisions into concrete blockvol actions, but it must not re-decide:

1. zero-gap / catch-up / rebuild
2. trusted-base validity
3. replayable-tail sufficiency
4. rebuild fallback requirement
## Exit Criteria

Phase 07 is done when:

1. one real-system service slice is integrated with the engine
2. selected real-system failure classes are replayed through that slice
3. diagnosability is sufficient for service-slice debugging
4. the first product path is explicitly chosen
5. the remaining work to pre-production hardening is clear

## Assignment For `sw`

Next tasks move to `Phase 08`.

## Assignment For `tester`

Next tasks move to `Phase 08`.
@@ -0,0 +1,78 @@
# Phase 08 Decisions

## Decision 1: Phase 08 is pre-production hardening, not protocol rediscovery

The accepted V2 product path from `Phase 07` is the basis.

`Phase 08` should harden that path rather than reopen accepted protocol shape.

## Decision 2: The first hardening priorities are control delivery and execution closure

The most important remaining gaps are:

1. real master/control delivery into the bridge/engine path
2. integrated engine -> executor -> `v2bridge` catch-up execution closure
3. first rebuild execution path for the chosen product path

## Decision 3: Carry-forward limitations remain explicit until closed

Phase 08 must keep explicit:

1. committed truth is still not separated from checkpoint truth
2. rebuild execution is still incomplete
3. current control delivery is still simulated

## Decision 4: Phase 08 P0 is accepted

The hardening plan is sufficiently specified to begin implementation work.

In particular, `P0` now fixes:

1. the committed-truth gate decision requirement
2. the unified replay requirement after control and execution closure
3. the need for at least one real failover / reassignment validation target

## Decision 5: The committed-truth limitation must become a hardening gate

Phase 08 must explicitly decide one of:

1. `CommittedLSN != CheckpointLSN` separation is mandatory before a production-candidate phase
2. the first candidate path is intentionally bounded to the currently proven pre-checkpoint replay behavior

It must not remain only a documented carry-forward.

## Decision 6: Unified-path replay is required after control and execution closure

Once real control delivery and integrated execution closure land, `Phase 08` must replay the accepted failure-class set again on the unified live path.

This prevents independent closure of:

1. control delivery
2. execution closure

without proving that they behave correctly together.

## Decision 7: Real failover / reassignment validation is mandatory for the chosen path

Because the chosen product path depends on the existing master / volume-server heartbeat path, at least one real failover / promotion / reassignment cycle must be a named hardening target in `Phase 08`.

## Decision 8: Phase 08 should reuse the existing Seaweed control/runtime path, not invent a new one

For the first hardening path, implementation should preferentially reuse:

1. existing master / heartbeat / assignment delivery
2. existing volume-server assignment receive/apply path
3. existing `blockvol` runtime and `v2bridge` storage/runtime hooks

This reuse is about:

1. control-plane reality
2. storage/runtime reality
3. execution-path reality

It is not permission to inherit old policy semantics as V2 truth.

The hard rule remains:

1. engine owns recovery policy
2. bridge translates confirmed control/storage truth
3. `blockvol` executes I/O
@@ -0,0 +1,21 @@
# Phase 08 Log

## 2026-03-31

### Opened

`Phase 08` opened as:

- pre-production hardening

### Starting basis

1. `Phase 07`: complete
2. first V2 product path chosen
3. remaining gaps are integration and hardening gaps, not protocol-discovery gaps

### Next

1. Phase 08 P0 accepted
2. Phase 08 P1 real master/control delivery integration
3. Phase 08 P2 integrated execution closure
@@ -0,0 +1,254 @@
# Phase 08

Date: 2026-03-31
Status: active
Purpose: convert the accepted Phase 07 product path into a pre-production-hardening program without reopening accepted V2 protocol shape

## Why This Phase Exists

`Phase 07` completed:

1. a real service-slice integration around the V2 engine
2. real storage-truth bridge evidence through `v2bridge`
3. selected real-system failure replay
4. the first explicit product-path decision

What still does not exist is a pre-production-ready system path. The remaining work is no longer protocol discovery. It is closing the operational and integration gaps between the accepted product path and a hardened deployment candidate.

## Phase Goal

Harden the first accepted V2 product path until the remaining gap to a production candidate is explicit, bounded, and implementation-driven.

## Scope

### In scope

1. real master/control delivery into the engine service path
2. integrated engine -> executor -> `v2bridge` execution closure
3. rebuild execution closure for the accepted product path
4. operational/debuggability hardening
5. concurrency/load validation around retention and recovery

### Out of scope

1. new protocol redesign
2. `RF>2` coordination
3. Smart WAL optimization work
4. broad performance tuning beyond validation needed for hardening
5. full V1 replacement rollout

## Phase 08 Items

### P0: Hardening Plan

1. convert the accepted `Phase 07` product path into a hardening plan
2. define the minimum pre-production gates
3. order the remaining integration closures by risk
4. make an explicit gate decision on committed truth vs checkpoint truth:
   - either separate `CommittedLSN` from `CheckpointLSN` before a production-candidate phase
   - or explicitly bound the first candidate path to the currently proven pre-checkpoint replay behavior

Status:

- accepted
- planning package accepted in this phase doc
- first hardening priorities are fixed as:
  - real master/control delivery
  - integrated engine -> executor -> `v2bridge` catch-up execution chain
  - first rebuild execution path
- the committed-truth carry-forward is now a required hardening gate, not just a note:
  - either separate `CommittedLSN` from `CheckpointLSN` before a production-candidate phase
  - or explicitly bound the first candidate path to the currently proven pre-checkpoint replay behavior
- at least one real failover / promotion / reassignment cycle is a required hardening target
- once `P1` and `P2` land, the accepted failure-class set must be replayed again on the newly unified live path
- the validation oracle for `Phase 08` is expected to reject overclaiming around:
  - catch-up semantics
  - rebuild execution
  - master/control delivery
  - candidate-path readiness vs production readiness

### P1: Real Control Delivery

1. connect real master/heartbeat assignment delivery into the bridge
2. replace direct `AssignmentIntent` construction for the first live path
3. preserve stable identity and fenced authority through the real control path
4. include at least one real failover / promotion / reassignment validation target on the chosen `sync_all` path

Technical focus:

- keep the control-path split explicit:
  - master confirms assignment / epoch / role
  - bridge translates confirmed control truth into engine intent
  - engine owns sender/session/recovery policy
  - `blockvol` does not re-decide recovery policy
- preserve the identity rule through the live path:
  - `ReplicaID = <volume>/<server>`
  - endpoint change updates location but must not recreate logical identity
- preserve the fencing rule through the live path:
  - stale epoch must invalidate old authority
  - stale session must not mutate current lineage
  - address change must invalidate the old live session before the new path proceeds
- treat failover / promotion / reassignment as control-truth events first, not storage-side heuristics
Implementation route (`reuse map`):

- reuse directly as the first hardening carrier:
  - `weed/server/master_grpc_server.go`
  - `weed/server/volume_grpc_client_to_master.go`
  - `weed/server/volume_server_block.go`
  - `weed/server/master_block_registry.go`
  - `weed/server/master_block_failover.go`
- reuse as storage/runtime execution reality:
  - `weed/storage/blockvol/blockvol.go`
  - `weed/storage/blockvol/replica_apply.go`
  - `weed/storage/blockvol/replica_barrier.go`
  - `weed/storage/blockvol/v2bridge/`
- preserve the V2 boundary while reusing these files:
  - reuse transport/control/runtime reality
  - do not inherit old policy semantics as V2 truth
  - keep engine as the recovery-policy owner
  - keep `blockvol` as the I/O executor

Expectation note:

- the `P1` tester expectation is already embedded in this phase doc under:
  - `P1 / Validation focus`
  - `P1 / Reject if`
- do not grow a separate long template unless `P1` scope expands materially

Validation focus:

- prove live assignment delivery into the bridge/engine path
- prove stable `ReplicaID` across address refresh on the live path
- prove stale epoch / stale session invalidation through the live path
- prove at least one real failover / promotion / reassignment cycle on the chosen `sync_all` path
- prove the resulting logs explain:
  - why reassignment happened
  - why a session was invalidated
  - which epoch / identity / endpoint drove the transition

Reject if:

- address-shaped identity reappears anywhere in the control path
- bridge starts re-deriving catch-up vs rebuild policy from convenience inputs
- old epoch or old session can still mutate after the new control truth arrives
- failover / reassignment is claimed without a real replay target
- delivery claims general production readiness rather than control-path closure

### P2: Execution Closure

1. close the live engine -> executor -> `v2bridge` execution chain
2. make catch-up execution evidence integrated rather than split across layers
3. close the first rebuild execution path required by the product path

### P3: Hardening Validation

1. validate diagnosability under the live integrated path
2. validate retention/pinner behavior under concurrent load
3. replay the accepted failure-class set again on the newly unified live path after `P1` and `P2` land
4. confirm the remaining gap to a production candidate

Validation focus:

- prove the chosen path through a real control-delivery path
- prove the live engine -> executor -> `v2bridge` execution chain as one path, not split evidence
- prove the first rebuild execution path required by the chosen product path
- prove at least one real failover / promotion / reassignment cycle
- prove concurrent retention/pinner behavior does not break recovery guarantees

Reject if:

- catch-up semantics are overclaimed beyond the currently proven boundary
- rebuild is claimed as supported without real execution closure
- master/control delivery is claimed as real without the live path in place
- `CommittedLSN` vs `CheckpointLSN` remains an unclassified note instead of a gate decision
- `P1` and `P2` land independently but the accepted failure-class set is not replayed again on the unified live path

## Guardrails

### Guardrail 1: Do not reopen accepted V2 protocol truths casually

`Phase 08` is a hardening phase. New work should preserve the accepted protocol truth set unless a real contradiction is demonstrated.

### Guardrail 2: Keep product-path claims evidence-bound

Do not claim more than the hardened path actually proves. Distinguish:

1. live integrated path
2. hardened product path
3. production candidate

### Guardrail 3: Identity and policy boundaries remain hard rules

1. `ReplicaID` must remain stable and never collapse to address shape
2. engine decides recovery policy
3. bridge translates intent/state
4. `blockvol` executes I/O only

### Guardrail 4: Carry-forward limitations must remain explicit until closed

Especially:

1. committed truth vs checkpoint truth
2. rebuild execution coverage
3. real master/control delivery coverage

### Guardrail 5: The committed-truth carry-forward must become a gate, not a note

Before the next phase, `Phase 08` must decide one of:

1. committed-truth separation is mandatory before a production-candidate phase
2. the first candidate path is intentionally bounded to the currently proven pre-checkpoint replay behavior

It must not remain an unclassified carry-forward.

## Exit Criteria

Phase 08 is done when:

1. the first product path runs through a real control delivery path
2. the critical execution chain is integrated and validated
3. rebuild execution for the chosen path is no longer just detected but executed
4. at least one real failover / reassignment cycle is replayed through the live control path
5. the accepted failure-class set is replayed again on the unified live path
6. operational/debug evidence is sufficient for pre-production use
7. the remaining gap to a production candidate is small and explicit

## Assignment For `sw`

Next tasks:

1. drive `Phase 08 P1` as real master/control delivery integration
2. replace direct `AssignmentIntent` construction for the first live path
3. preserve through the real control path:
   - stable `ReplicaID`
   - epoch fencing
   - address-change invalidation
4. include at least one real failover / promotion / reassignment validation target
5. keep acceptance claims scoped:
   - real control delivery path
   - not yet general production readiness
6. keep explicit carry-forwards:
   - `CommittedLSN != CheckpointLSN` still unresolved
   - integrated catch-up execution chain still incomplete
   - rebuild execution still incomplete

## Assignment For `tester`

Next tasks:

1. use the accepted `Phase 08` plan framing as the `P1` validation oracle
2. validate real control delivery for:
   - live assignment delivery
   - stable identity through the control path
   - stale epoch/session invalidation
   - at least one real failover / reassignment cycle
3. keep the no-overclaim rule active around:
   - catch-up semantics
   - rebuild execution
   - master/control delivery
4. keep the committed-truth gate explicit:
   - still unresolved in `P1`
5. prepare `P2` follow-up expectations for:
   - integrated engine -> executor -> `v2bridge` execution closure
   - unified replay after `P1` and `P2`

@ -0,0 +1,59 @@

# Phase 4.5 Decisions

## Decision 1: Phase 4.5 remains a bounded hardening phase

It is not a new architecture line and must not expand into broad feature work.

Purpose:

1. tighten recovery boundaries
2. strengthen the crash-consistency / recoverability proof
3. clear the path for engine planning

## Decision 2: `sw` Phase 4.5 P0 is accepted

Accepted basis:

1. bounded `CatchUp` now changes prototype behavior
2. `FrozenTargetLSN` is intrinsic to the session contract
3. `Rebuild` is a first-class sender-owned execution path
4. rebuild and catch-up are mutually exclusive execution paths

## Decision 3: `tester` crash-consistency simulator strengthening is accepted

Accepted basis:

1. checkpoint semantics are explicit
2. recoverability after restart is no longer collapsed into a single loose watermark
3. crash-consistency invariants are executable and passing

## Decision 4: Remaining Phase 4.5 work is evidence hardening, not primitive-building

Completed focus:

1. `A5-A8` prototype + simulator double evidence
2. predicate exploration for dangerous states
3. adversarial search over crash-consistency / liveness states

Remaining optional work:

4. any low-priority cleanup that improves clarity without reopening design

## Decision 5: After Phase 4.5, the project should move to an engine-planning readiness review

Unless new blocking flaws appear, the next major decision after `4.5` should be:

1. real V2 engine planning
2. an engine slicing plan

not another broad prototype phase.

## Decision 6: Phase 4.5 is complete

Reason:

1. bounded `CatchUp` is semantic in the prototype
2. `Rebuild` is first-class in the prototype
3. crash-consistency / restart-recoverability are materially stronger in the simulator
4. `A5-A8` evidence is materially stronger on both the prototype and simulator sides
5. adversarial search found and helped fix a real correctness bug, validating the proof style

@ -0,0 +1,33 @@

# Phase 4.5 Log

## 2026-03-29

### Accepted

1. `sw` `Phase 4.5 P0`
   - bounded `CatchUp` budget is semantic in `enginev2`
   - `FrozenTargetLSN` is a real session invariant
   - `Rebuild` is wired into sender execution and is exclusive with catch-up
   - rebuild completion goes through `CompleteRebuild`, not generic session completion

2. `tester` crash-consistency simulator strengthening
   - storage-state split introduced and accepted
   - checkpoint/restart boundary made explicit
   - recoverability upgraded from watermark-style logic to a checkpoint + contiguous WAL replayability proof
   - core invariant tests for crash consistency now pass

3. `tester` evidence hardening and adversarial exploration
   - grouped simulator evidence for `A5-A8`
   - danger predicates added
   - adversarial search added and passing
   - adversarial search found a real `StateAt(lsn)` historical-state bug
   - `StateAt(lsn)` corrected so newer checkpoint/base state does not leak into older historical queries

4. `Phase 4.5` closeout judgment
   - prototype and simulator evidence are now strong enough to stop expanding `4.5`
   - the next major step should move to engine-readiness review and engine slicing

### Remaining open work

1. low-priority cleanup
   - remove or consolidate redundant frozen-target bookkeeping if no longer needed

@ -0,0 +1,397 @@

# Phase 4.5 Reason

Date: 2026-03-27
Status: proposal for dev manager decision
Purpose: explain why a narrow V2 fine-tuning step should follow the main Phase 04 slice, without reopening the core ownership/fencing direction

## 1. Why This Note Exists

`Phase 04` has already produced strong progress on the first standalone V2 slice:

- per-replica sender identity
- one active recovery session per replica per epoch
- endpoint / epoch invalidation
- sender-owned execution APIs
- explicit recovery outcome branching
- minimal historical-data prototype

This is good progress and should continue.

However, recent review and discussion show that the next risk is no longer:

- ownership ambiguity
- stale completion acceptance
- scattered local recovery authority

The next risk is different:

- `CatchUp` may become too broad, too long-lived, and too resource-heavy
- simulator proof is still weaker than desired on crash-consistency and recoverability boundaries
- the project may accidentally carry V1.5-style "keep trying to catch up" assumptions into V2 engine work

So this note proposes:

- **do not interrupt the main Phase 04 work**
- **do not reopen core V2 ownership/fencing architecture**
- **add a narrow fine-tuning step immediately after Phase 04 main closure**

This note is for the dev manager to decide implementation sequencing.

## 2. Current Basis

This proposal is grounded in the following current documents:

- `sw-block/.private/phase/phase-04.md`
- `sw-block/design/v2-prototype-roadmap-and-gates.md`
- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-detailed-algorithm.zh.md`

In particular:

- `phase-04.md` shows that Phase 04 is correctly centered on sender/session ownership and recovery execution authority
- `v2-prototype-roadmap-and-gates.md` shows that design proof is high, but data/recovery proof and prototype end-to-end proof are still low
- `v2-acceptance-criteria.md` already requires stronger proof for:
  - `A5` non-convergent catch-up escalation
  - `A6` explicit recoverability boundary
  - `A7` historical correctness
  - `A8` durability-mode correctness
- `v2-detailed-algorithm.zh.md` Section 17 now argues for a direction tightening:
  - keep the V2 core
  - narrow `CatchUp`
  - elevate `Rebuild`
  - defer higher-complexity expansion

## 3. Main Judgment

### 3.1 What should NOT change

The following V2 core should remain stable:

- `CommittedLSN` as the external safe boundary
- durable progress as sync truth
- one sender per replica
- one active recovery session per replica per epoch
- stale epoch / stale endpoint / stale session fencing
- explicit `ZeroGap / CatchUp / NeedsRebuild`

This is the architecture that most clearly separates V2 from V1.5.

### 3.2 What SHOULD be fine-tuned

The following should be tightened before engine planning:

1. `CatchUp` should be narrowed to a short-gap, bounded, budgeted path
2. `Rebuild` should be treated as a formal primary recovery path, not merely an embarrassing fallback
3. the `recover -> keepup` handoff should be made more explicit
4. the simulator should prove recoverability and crash consistency more directly

## 4. Algorithm Thinking Behind The Fine-Tune

This section summarizes the reasoning already captured in:

- `sw-block/design/v2-detailed-algorithm.zh.md`

Especially Section 17:

- `V2` is still the right direction
- but V2 should be tightened from:
  - "make WAL recovery increasingly smart"
- to:
  - "make block truth boundaries hard, keep `CatchUp` cheap and bounded, and use formal `Rebuild` when recovery becomes too complex"

### 4.1 First-principles view

From block first principles, the hardest truths are:

1. when a `write` becomes real
2. what a `flush/fsync ACK` truly promises
3. whether acknowledged boundaries survive failover
4. how replicas rejoin without corrupting lineage

These are more fundamental than:

- volume product shape
- control-plane surface
- recovery cleverness for its own sake

So the project should optimize for:

- clearer truth boundaries
- not maximal catch-up cleverness

### 4.2 Mayastor-style product insight

The useful first-principles lesson from Mayastor-like product thinking is:

- not every lagging replica is worth chasing indefinitely
- `Rebuild` can be a formal product path, not a shameful fallback
- block products benefit from explicit lifecycle objects and a formal rebuild flow

This does NOT replace the V2 core concerns:

- `flush ACK` truth
- committed-prefix failover safety
- stale authority fencing

But it does suggest a correction:

- do not let `CatchUp` become an over-smart general answer to all recovery

### 4.3 Proposed V2 fine-tuned interpretation

The fine-tuned interpretation of V2 should be:

- `CatchUp` is for short-gap, clearly recoverable, bounded recovery
- `Rebuild` is for long-gap, high-cost, unstable, or non-convergent recovery
- a recovery session is a bounded contract, not a long-running rescue thread
- live WAL growth beyond `H0` must not silently turn one recovery session into an endless chase

## 5. Specific Fine-Tune Adjustments

### 5.1 Narrow `CatchUp`

`CatchUp` should explicitly require:

- short outage
- bounded target `H0`
- clear recoverability
- bounded reservation
- bounded time
- bounded resource cost
- bounded convergence expectation

`CatchUp` should explicitly stop when:

- target drifts too long without convergence
- replay progress stalls
- recoverability proof is lost
- retention cost becomes unreasonable
- session budget expires

### 5.2 Elevate `Rebuild`

`Rebuild` should be treated as a first-class path when:

- lag is too large
- catch-up does not converge
- recoverability is no longer stable
- complexity of continued catch-up exceeds its product value

The intended model becomes:

- short gap -> `CatchUp`
- long gap / unstable / non-convergent -> `Rebuild`

This should be interpreted more strictly than a simple routing rule:

- `CatchUp` is not a general recovery framework
- `CatchUp` is a relaxed form of `KeepUp`
- it should stay limited to short-gap, bounded, clearly recoverable WAL replay
- it only makes sense while the replica's current base is still trustworthy enough to continue from

By contrast:

- `Rebuild` is the more general recovery framework
- it restores the replica from a trusted base toward a frozen target boundary
- `full rebuild` and `partial rebuild` are not different protocols; they are different base/transfer choices under the same rebuild contract

So the intended product shape is:

- use `CatchUp` when replay debt is small and clearly cheaper than rebuild
- use `Rebuild` when correctness, boundedness, or product simplicity would otherwise be compromised

And the correctness anchor for both `full` and `partial` rebuild should remain explicit:

- freeze `TargetLSN`
- pin the snapshot/base used for recovery
- only then optimize transfer volume using `snapshot + tail`, `bitmap`, or similar mechanisms
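
That correctness anchor can be sketched as a tiny contract type: the target is frozen and the base is pinned before any transfer planning happens. This is a hedged illustration under assumed names (`rebuildContract`, `planTransfer`), not the prototype's real API:

```go
package main

import "fmt"

// rebuildContract captures the correctness anchor shared by full and
// partial rebuild: freeze the target, pin the base, then optimize transfer.
// All names are illustrative.
type rebuildContract struct {
	FrozenTargetLSN uint64 // fixed at session start; never chases the live head
	PinnedBaseLSN   uint64 // snapshot/checkpoint pinned against GC for the session
}

// planTransfer returns the WAL tail range (base, target] that remains to
// transfer on top of the pinned base; ok is false when the base already
// covers the frozen target and no tail is needed.
func (c rebuildContract) planTransfer() (from, to uint64, ok bool) {
	if c.PinnedBaseLSN >= c.FrozenTargetLSN {
		return 0, 0, false
	}
	return c.PinnedBaseLSN, c.FrozenTargetLSN, true
}

func main() {
	c := rebuildContract{FrozenTargetLSN: 500, PinnedBaseLSN: 420}
	from, to, ok := c.planTransfer()
	fmt.Println(from, to, ok)
}
```

Because both values are fixed before transfer begins, any `bitmap` or `snapshot + tail` optimization only reduces bytes moved; it can never change which state is being rebuilt.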

### 5.3 Clarify `recover -> keepup` handoff

Phase 04 already aims to prove a clean handoff between normal sender and recovery session.

The fine-tune should make the next step more explicit:

- one recovery session only owns `(R, H0]`
- session completion releases recovery debt
- a replica should not silently stay in "quasi-recovery"
- re-entry to `KeepUp` / `InSync` should remain explicit, ideally with `PromotionHold` or equivalent stabilization logic
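
One way to read the handoff rule is as a small explicit state machine: completion never jumps straight to `InSync` but passes through a stabilization hold. The states and transitions below are a speculative sketch of the idea only; `PromotionHold` semantics in the real prototype may differ:

```go
package main

import "fmt"

// Replica lifecycle states for the recover -> keepup handoff.
// This is an illustrative model, not the prototype's actual type set.
type replicaState int

const (
	Recovering replicaState = iota
	PromotionHold
	KeepUp
	InSync
)

// completeSession models explicit session completion: recovery debt is
// released, and the replica enters a stabilization hold rather than
// jumping straight to KeepUp or InSync.
func completeSession(s replicaState) (replicaState, error) {
	if s != Recovering {
		return s, fmt.Errorf("no active recovery session to complete")
	}
	return PromotionHold, nil
}

// stabilize promotes a held replica only once it has demonstrably kept up;
// otherwise it stays held, never drifting into silent "quasi-recovery".
func stabilize(s replicaState, keepingUp bool) replicaState {
	if s == PromotionHold && keepingUp {
		return KeepUp
	}
	return s
}

func main() {
	s, _ := completeSession(Recovering)
	s = stabilize(s, true)
	fmt.Println(s == KeepUp)
}
```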

### 5.4 Keep Smart WAL deferred

No fine-tune should broaden Smart WAL scope at this point.

Reason:

- Smart WAL multiplies recoverability, GC, payload-availability, and reservation complexity
- the current priority is to harden the simpler V2 replication contract first

So the rule remains:

- no Smart WAL expansion beyond what minimal proof work might later require

## 6. Simulation Strengthening Requirements

This is the highest-value part of the fine-tune.

Current simulator strength is already good on:

- epoch fencing
- stale traffic rejection
- promotion candidate rules
- ownership / session invalidation
- basic `CatchUp / NeedsRebuild` classification

Current simulator weakness is still significant on:

- crash consistency around extent / checkpoint / replay boundaries
- `ACK` boundary versus recoverable boundary
- `CatchUp` liveness / convergence

### 6.1 Required new modeling direction

The simulator should stop collapsing these states together:

- received but not durable
- WAL durable but not yet fully materialized
- extent-visible but not yet checkpoint-safe
- checkpoint-safe base image
- restart-recoverable read state

Suggested explicit storage-state split:

- `ReceivedLSN`
- `WALDurableLSN`
- `ExtentAppliedLSN`
- `CheckpointLSN`
- `RecoverableLSNAfterRestart`

### 6.2 Required new invariants

The simulator should explicitly check at least:

1. `AckedFlushLSN <= RecoverableLSNAfterRestart`
2. visible state must have recoverable backing
3. `CatchUp` cannot remain non-convergent indefinitely
4. promotion candidate must still possess recoverable committed prefix
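
The state split and invariant 1 combine naturally into one checkable structure. The sketch below uses the document's five watermark names plus an `AckedFlushLSN` field to express the acked-boundary invariant; the ordering checks beyond those stated above (e.g. checkpoint never ahead of applied extents) are assumptions of this sketch, not simulator facts:

```go
package main

import "fmt"

// storageState is the suggested crash-consistency split: five watermarks
// instead of one collapsed number, plus the acked flush boundary needed
// for invariant 1. Checking logic here is illustrative.
type storageState struct {
	ReceivedLSN                uint64
	WALDurableLSN              uint64
	ExtentAppliedLSN           uint64
	CheckpointLSN              uint64
	RecoverableLSNAfterRestart uint64
	AckedFlushLSN              uint64
}

// violations returns every invariant the state breaks; empty means consistent.
func (s storageState) violations() []string {
	var v []string
	if s.WALDurableLSN > s.ReceivedLSN {
		v = append(v, "durable WAL ahead of received data")
	}
	if s.AckedFlushLSN > s.WALDurableLSN {
		v = append(v, "acked flush ahead of durable WAL")
	}
	if s.AckedFlushLSN > s.RecoverableLSNAfterRestart {
		v = append(v, "AckedFlushLSN > RecoverableLSNAfterRestart")
	}
	if s.CheckpointLSN > s.ExtentAppliedLSN {
		v = append(v, "checkpoint ahead of applied extents")
	}
	return v
}

func main() {
	s := storageState{ReceivedLSN: 100, WALDurableLSN: 90, ExtentAppliedLSN: 80,
		CheckpointLSN: 70, RecoverableLSNAfterRestart: 90, AckedFlushLSN: 90}
	fmt.Println(len(s.violations()) == 0)
}
```

A predicate of this shape is exactly what adversarial exploration can then drive toward: any reachable state with a non-empty `violations()` list is a found bug.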

### 6.3 Required new scenario classes

Priority scenarios to add:

1. `ExtentAheadOfCheckpoint_CrashRestart_ReadBoundary`
2. `AckedFlush_MustBeRecoverableAfterCrash`
3. `UnackedVisibleExtent_MustNotSurviveAsCommittedTruth`
4. `CatchUpChasingMovingHead_EscalatesOrConverges`
5. `CheckpointGCBreaksRecoveryProof`

### 6.4 Required simulator style upgrade

The simulator should move beyond only hand-authored examples and also support:

- dangerous-state predicates
- adversarial random exploration guided by those predicates

Examples:

- `acked_flush_lost`
- `extent_exposes_unrecoverable_state`
- `catchup_livelock`
- `rebuild_required_but_not_escalated`

## 7. Relationship To Acceptance Criteria

This fine-tune is not a separate architecture line.

It is mainly intended to make the project satisfy the existing acceptance set more convincingly:

- `A5` explicit escalation from non-convergent catch-up
- `A6` recoverability boundary as a real rule, not hopeful policy
- `A7` historical correctness against snapshot + tail rebuild
- `A8` strict durability mode semantics

So this fine-tune is a strengthening of the current V2 proof path, not a new branch.

## 8. Recommended Sequencing

### Option A: pause Phase 04 and reopen design now

Not recommended.

Why:

- Phase 04 has strong momentum
- its core ownership/fencing work is correct
- pausing it now would blur scope and waste recent closure

### Option B: finish Phase 04, then add a narrow `4.5`

Recommended.

Why:

- Phase 04 can finish its intended ownership / orchestration / minimal-history closure
- `4.5` can then tighten recovery strategy without destabilizing the slice
- the project avoids carrying "too-smart catch-up" assumptions into later engine planning

Recommended sequence:

1. finish Phase 04 main closure
2. immediately start `Phase 4.5`
3. use `4.5` to tighten:
   - bounded `CatchUp`
   - formal `Rebuild`
   - crash-consistency and recoverability simulator proof
4. then re-evaluate Gate 4 / Gate 5

## 9. Scope Of A Possible Phase 4.5

If the dev manager chooses to implement a `4.5` step, its scope should be:

### In scope

- tighten algorithm wording and boundaries from `v2-detailed-algorithm.zh.md`
- formalize bounded `CatchUp`
- formalize `Rebuild` as a first-class path
- strengthen simulator state model and invariants
- add targeted crash-consistency and liveness scenarios
- improve prototype traceability against `A5-A8`

### Out of scope

- Smart WAL expansion
- real storage engine redesign
- V1 production integration
- frontend/wire protocol
- performance optimization as primary goal

## 10. Decision Requested From Dev Manager

Please decide:

1. whether `Phase 04` should continue to normal closure without interruption
2. whether a narrow `Phase 4.5` should immediately follow
3. whether the simulator strengthening work should be treated as mandatory for Gate 4 / Gate 5 credibility

Recommended decision:

- **Yes**: finish `Phase 04`
- **Yes**: add `Phase 4.5` as a bounded fine-tuning step
- **Yes**: treat crash-consistency / recoverability / liveness simulator strengthening as required, not optional

## 11. Bottom Line

The project does not need a new direction.

It needs:

- a slightly tighter interpretation of V2
- a stronger recoverability/crash-consistency simulator
- a clearer willingness to use formal `Rebuild` instead of over-extending `CatchUp`

So the practical recommendation is:

- **keep the V2 core**
- **finish Phase 04**
- **add a narrow Phase 4.5**
- **strengthen simulator proof before engine planning**

@ -0,0 +1,356 @@

# Phase 4.5

Date: 2026-03-29
Status: complete
Purpose: harden Gate 4 / Gate 5 credibility after Phase 04 by tightening bounded `CatchUp`, elevating `Rebuild` to a first-class path, and strengthening crash-consistency / recoverability proof

## Related Plan

Strategic phase:

- `sw-block/.private/phase/phase-4.5.md`

Simulator implementation plan:

- `learn/projects/sw-block/design/phase-05-crash-consistency-simulation.md`

Use them together:

- `Phase 4.5` defines the gate-hardening purpose and priorities
- `phase-05-crash-consistency-simulation.md` is the detailed simulator implementation plan

## Why This Phase Exists

Phase 04 has already established:

1. per-replica sender identity
2. one active recovery session per replica per epoch
3. stale authority fencing
4. sender-owned execution APIs
5. assignment-intent orchestration
6. minimal historical-data prototype
7. prototype scenario closure

The next risk is no longer ownership structure.

The next risk is:

1. `CatchUp` becoming too broad, too long-lived, or too optimistic
2. `Rebuild` remaining underspecified even though it will likely become a common path
3. simulator proof still being weaker than desired on crash consistency and restart recoverability

So `Phase 4.5` exists to harden the decision gate before real engine planning.

## Relationship To Phase 04

`Phase 4.5` is not a new architecture line.

It is a narrow hardening step after normal Phase 04 closure.

It should:

- keep the V2 core
- not reopen sender/session ownership architecture
- strengthen recovery boundaries and proof quality

## Main Questions

1. how narrow should `CatchUp` be?
2. when must recovery escalate to `Rebuild`?
3. what exactly is the `Rebuild` source of truth?
4. what does restart-recoverable / crash-consistent state mean in the simulator?

## Core Decisions To Drive

### 1. Bounded CatchUp

`CatchUp` should be explicitly bounded by:

1. target range
2. retention proof
3. time budget
4. progress budget
5. resource budget

It should stop and escalate when:

1. the target drifts too long
2. progress stalls
3. the recoverability proof is lost
4. retention cost becomes unreasonable
5. the session budget expires

### 2. Rebuild Is First-Class

`Rebuild` is not a shameful fallback.

It is the formal path for:

1. long gap
2. unstable recoverability
3. non-convergent catch-up
4. excessive replay cost
5. restart-recoverability uncertainty

### 3. Rebuild Source Model

To address the concern that tightening `CatchUp` makes `Rebuild` too dominant:

`Rebuild` should be split conceptually into two modes:

1. **Snapshot + Tail**
   - preferred path
   - use a dated but internally consistent base snapshot/checkpoint
   - then apply the retained WAL tail up to the committed recovery boundary

2. **Full Base Rebuild**
   - fallback path
   - used when no acceptable snapshot/base image exists
   - more expensive and slower

Decision boundary:

- use `Snapshot + Tail` when a trusted snapshot/checkpoint/base exists that covers the required base state
- use `Full Base Rebuild` when no such trusted base exists
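
The decision boundary reduces to a small pure function. A hedged Go sketch, with hypothetical names (`chooseRebuildMode`, `walTailStart`) that are not the prototype's API; it adds one assumption beyond the text, namely that snapshot + tail also requires the retained WAL tail to reach back to the trusted base:

```go
package main

import "fmt"

type rebuildMode int

const (
	SnapshotPlusTail rebuildMode = iota
	FullBaseRebuild
)

// chooseRebuildMode sketches the decision boundary: prefer snapshot + tail
// when a trusted base exists AND the retained WAL tail from that base up to
// the frozen target is still contiguous; otherwise fall back to a full
// base rebuild. All names are illustrative.
func chooseRebuildMode(trustedBaseLSN uint64, haveTrustedBase bool, walTailStart, frozenTargetLSN uint64) rebuildMode {
	if haveTrustedBase && walTailStart <= trustedBaseLSN && trustedBaseLSN <= frozenTargetLSN {
		return SnapshotPlusTail
	}
	return FullBaseRebuild
}

func main() {
	fmt.Println(chooseRebuildMode(400, true, 350, 500) == SnapshotPlusTail)
}
```

Note that a trusted base alone is not sufficient in this sketch: if GC has trimmed the WAL past the base, the tail cannot be replayed and the fallback applies.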

So "rebuild" should not mean only:

- copy everything from scratch

It should usually mean:

- re-establish a trustworthy base image
- then catch up from that base to the committed boundary

This keeps `Rebuild` practical even if `CatchUp` becomes narrower.

### 4. Safe Recovery Truth

The simulator should explicitly separate:

1. `ReceivedLSN`
2. `WALDurableLSN`
3. `ExtentAppliedLSN`
4. `CheckpointLSN`
5. `RecoverableLSNAfterRestart`

This is needed so that:

- `ACK` truth
- visible-state truth
- crash-restart truth

do not collapse into one number.

## Priority

### P0

1. document bounded `CatchUp` rule
2. document `Rebuild` modes:
   - snapshot + tail
   - full base rebuild
3. define escalation conditions from `CatchUp` to `Rebuild`

Status:

- accepted on both prototype and simulator sides
- prototype: bounded `CatchUp` is semantic, target-frozen, budget-enforced, and rebuild is a sender-owned exclusive path
- simulator: crash-consistency state split, checkpoint-safe restart boundary, and core invariants are in place

### P1

4. strengthen simulator state model with crash-consistency split:
   - `ReceivedLSN`
   - `WALDurableLSN`
   - `ExtentAppliedLSN`
   - `CheckpointLSN`
   - `RecoverableLSNAfterRestart`

5. add explicit invariants:
   - `AckedFlushLSN <= RecoverableLSNAfterRestart`
   - visible state must have recoverable backing
   - promotion candidate must possess recoverable committed prefix

Status:

- accepted on the simulator side
- remaining work is no longer basic state split; it is stronger traceability and adversarial exploration

### P2

6. add targeted scenarios:
   - `ExtentAheadOfCheckpoint_CrashRestart_ReadBoundary`
   - `AckedFlush_MustBeRecoverableAfterCrash`
   - `UnackedVisibleExtent_MustNotSurviveAsCommittedTruth`
   - `CatchUpChasingMovingHead_EscalatesOrConverges`
   - `CheckpointGCBreaksRecoveryProof`

Status:

- baseline targeted scenarios accepted
- predicate-guided/adversarial exploration remains open

### P3

7. make prototype traceability stronger for:
   - `A5`
   - `A6`
   - `A7`
   - `A8`

8. decide whether Gate 4 / Gate 5 are now credible enough for engine planning

Status:

- partially complete
- Gate 4 / Gate 5 are materially stronger
- remaining work is to make `A5-A8` double evidence more explicit and reviewable

## Scope

### In scope

1. bounded `CatchUp`
2. first-class `Rebuild`
3. snapshot + tail rebuild model
4. crash-consistency simulator state split
5. targeted liveness / recoverability scenarios

### Out of scope

1. Smart WAL expansion
2. V1 production integration
3. backend/storage engine redesign
4. performance optimization as primary goal
5. frontend/wire protocol work

## Exit Criteria

`Phase 4.5` is done when:

1. the `CatchUp` budget / escalation rule is explicit in docs and simulator
2. `Rebuild` is explicitly modeled as:
   - snapshot + tail preferred
   - full base rebuild fallback
3. the simulator has an explicit crash-consistency state split
4. the simulator has targeted crash / liveness scenarios for the listed risks
5. acceptance items `A5-A8` have stronger executable proof, ideally with explicit prototype + simulator evidence pairs
6. we can make a more credible decision on:
   - real V2 engine planning
   - or a `V2.5` correction

## Review Gates

These are explicit review gates for `Phase 4.5`.

### Gate 1: Bounded CatchUp Must Be Semantic

It is not enough to add budget fields in docs or structs.

To count as complete:

1. a timeout / budget overrun must force exit
2. a moving-head chase must not continue indefinitely
3. escalation to `NeedsRebuild` must be explicit
4. tests must prove those behaviors

### Gate 2: State Split Must Change Decisions

It is not enough to add more state names.

To count as complete, the new crash-consistency state split must materially change:

1. `ACK` legality
2. restart-recoverability judgment
3. visible-state legality
4. promotion-candidate legality

### Gate 3: A5-A8 Need Double Evidence

It is not enough for only the prototype or only the simulator to cover them.

To count as complete, each of:

- `A5`
- `A6`
- `A7`
- `A8`

should have:

1. one prototype-side evidence path
2. one simulator-side evidence path

## Scope Discipline

`Phase 4.5` must remain a bounded gate-hardening phase.

It should stay focused on:

1. tightening boundaries
2. strengthening proof
3. clearing the path for engine planning

It should not turn into a broad new feature-expansion phase.

## Current Status Summary

Accepted now:

1. `sw` `Phase 4.5 P0`
   - bounded `CatchUp` is semantic, not documentary
   - `FrozenTargetLSN` is a real session invariant
   - `Rebuild` is an exclusive sender-owned execution path
2. `tester` crash-consistency simulator strengthening
   - the checkpoint/restart boundary is explicit
   - recoverability is no longer a single collapsed watermark
   - core crash-consistency invariants are executable

Open now:

1. low-priority cleanup such as redundant frozen-target bookkeeping fields

Completed since initial approval:

1. `A5-A8` explicit double-evidence traceability materially strengthened
2. predicate exploration / adversarial search added on the simulator side
3. crash-consistency random/adversarial search found and helped fix a real `StateAt(lsn)` historical-state bug

## Assignment For `sw`

Focus: prototype/control-path formalization

Completed work:

1. updated prototype traceability for:
   - `A5`
   - `A6`
   - `A7`
   - `A8`
2. made rebuild-source decision evidence explicit in prototype tests:
   - snapshot + tail chosen only when a trusted base exists
   - full base chosen when it does not
3. added focused prototype evidence grouping for engine-planning review

Remaining optional cleanup:

4. optionally clean up low-priority redundancy:
   - `TargetLSNAtStart` if superseded by `FrozenTargetLSN`

## Assignment For `tester`

Focus: simulator/crash-consistency proof

Completed work:

1. wired simulator-side evidence explicitly into acceptance traceability for:
   - `A5`
   - `A6`
   - `A7`
   - `A8`
2. added predicate exploration / adversarial search around the new crash-consistency model
3. added danger predicates for major failure classes:
   - acked flush lost
   - visible unrecoverable state
   - catch-up livelock / rebuild-required-but-not-escalated
@ -0,0 +1,117 @@ |
|||
# A5-A8 Acceptance Traceability |
|||
|
|||
Date: 2026-03-29 |
|||
Status: Phase 4.5 evidence-hardening |
|||
|
|||
## Purpose |
|||
|
|||
Map each acceptance criterion to specific executable evidence. |
|||
Two evidence layers: |
|||
- **Simulator** (distsim): protocol-level proof |
|||
- **Prototype** (enginev2): ownership/session-level proof |
|||
|
|||
--- |
|||
|
|||
## A5: Non-Convergent Catch-Up Escalates Explicitly |
|||
|
|||
**Must prove**: tail-chasing or failed catch-up does not pretend success. |
|||
|
|||
**Pass condition**: explicit `CatchingUp → NeedsRebuild` transition. |
|||
|
|||
| Evidence | Test | File | Layer | Status | |
|||
|----------|------|------|-------|--------| |
|||
| Tail-chasing converges or aborts | `TestS6_TailChasing_ConvergesOrAborts` | `cluster_test.go` | distsim | PASS | |
|||
| Tail-chasing non-convergent → NeedsRebuild | `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS | |
|||
| Catch-up timeout → NeedsRebuild | `TestP03_CatchupTimeout_EscalatesToNeedsRebuild` | `phase03_timeout_test.go` | distsim | PASS | |
|||
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS | |
|||
| Flapping budget exceeded → NeedsRebuild | `TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS | |
|||
| Catch-up converges or escalates (I3) | `TestI3_CatchUpConvergesOrEscalates` | `phase045_crash_test.go` | distsim | PASS | |
|||
| Catch-up timeout in enginev2 | `TestE2E_NeedsRebuild_Escalation` | `p2_test.go` | enginev2 | PASS | |
|||
|
|||
**Verdict**: A5 is well-covered. Both simulator and prototype prove explicit escalation. No pretend-success path exists. |
|||

---

## A6: Recoverability Boundary Is Explicit

**Must prove**: the boundary between recoverable and unrecoverable gaps is decided explicitly.

**Pass condition**: recovery aborts when reservation/payload availability is lost; rebuild is the explicit fallback.

| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS |
| WAL GC beyond replica → NeedsRebuild | `TestI5_CheckpointGC_PreservesAckedBoundary` | `phase045_crash_test.go` | distsim | PASS |
| Rebuild from snapshot + tail | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Smart WAL: resolvable → unresolvable | `TestP02_SmartWAL_RecoverableThenUnrecoverable` | `phase02_advanced_test.go` | distsim | PASS |
| Time-varying payload availability | `TestP02_SmartWAL_TimeVaryingAvailability` | `phase02_advanced_test.go` | distsim | PASS |
| RecoverableLSN is the replayability proof | `RecoverableLSN()` in `storage.go` | `storage.go` | distsim | Implemented |
| Handshake outcome: NeedsRebuild | `TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession` | `execution_test.go` | enginev2 | PASS |

**Verdict**: A6 is covered. The recovery boundary is decided by an explicit reservation + recoverability check, not by optimistic assumption. `RecoverableLSN()` verifies contiguous WAL coverage.
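
The verdict leans on `RecoverableLSN()` verifying contiguous WAL coverage. A minimal sketch of that idea, with hypothetical types (the real method lives in the simulator's `storage.go` and its signature may differ): the replayability claim extends exactly as far as an unbroken run of resolvable records, and stops at the first gap or unresolvable payload.

```go
package main

import "fmt"

// walRecord is a simplified WAL entry carrying payload resolvability.
type walRecord struct {
	lsn        uint64
	resolvable bool
}

// recoverableLSN is a sketch of a replayability proof: starting from the
// checkpoint LSN, walk forward and return the highest LSN reachable
// through a contiguous run of resolvable records. Any gap or unresolvable
// payload stops the walk; nothing beyond it may be claimed recoverable.
func recoverableLSN(checkpoint uint64, tail []walRecord) uint64 {
	next := checkpoint + 1
	high := checkpoint
	for _, r := range tail {
		if r.lsn != next || !r.resolvable {
			break // gap or lost payload: the claim stops here
		}
		high = r.lsn
		next++
	}
	return high
}

func main() {
	tail := []walRecord{{11, true}, {12, true}, {14, true}} // gap at 13
	fmt.Println(recoverableLSN(10, tail))
}
```

The fail-closed property: a record past a gap contributes nothing, even if it is itself resolvable.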

---

## A7: Historical Data Correctness Holds

**Must prove**: recovered data for a target LSN is historically correct; the current extent cannot fake old history.

**Pass condition**: a snapshot + tail rebuild matches the reference; reconstructing an old LSN from the current extent fails the correctness check.

| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Snapshot + tail matches reference | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Historical state not reconstructable after GC | `TestA7_HistoricalState_NotReconstructableAfterGC` | `phase045_crash_test.go` | distsim | PASS |
| `CanReconstructAt()` rejects faked history | `CanReconstructAt()` in `storage.go` | `storage.go` | distsim | Implemented |
| Checkpoint does not leak applied state | `TestI2_CheckpointDoesNotLeakAppliedState` | `phase045_crash_test.go` | distsim | PASS |
| Extent-referenced resolvable records | `TestExtentReferencedResolvableRecordsAreRecoverable` | `cluster_test.go` | distsim | PASS |
| Extent-referenced unresolvable → rebuild | `TestExtentReferencedUnresolvableForcesRebuild` | `cluster_test.go` | distsim | PASS |
| ACK'd flush recoverable after crash (I1) | `TestI1_AckedFlush_RecoverableAfterPrimaryCrash` | `phase045_crash_test.go` | distsim | PASS |

**Verdict**: A7 is now covered with the Phase 4.5 crash-consistency additions. The critical gap ("current extent cannot fake old history") is proven by `CanReconstructAt()` + `TestA7_HistoricalState_NotReconstructableAfterGC`.
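
`CanReconstructAt()` is the piece that stops the current extent from faking old history. A simplified sketch of the check, under assumed inputs (a retained-snapshot LSN and a retained WAL window `[walFloor, walHead]`); the real method in `storage.go` may carry more state, but the shape of the decision is the same: a target LSN is reconstructable only when a base at or below it survives and the retained WAL covers the replay span up to it.

```go
package main

import "fmt"

// canReconstructAt is a sketch of the fail-closed historical check:
// a target LSN is reconstructable only if a retained snapshot at or
// below it exists AND retained WAL covers the span from just past the
// snapshot up to the target. The current extent alone can never satisfy
// this for an LSN older than the retained window, so faked old history
// is rejected rather than optimistically served.
func canReconstructAt(target, snapshotLSN, walFloor, walHead uint64) bool {
	if snapshotLSN > target {
		return false // no base at or below the target
	}
	if walFloor > snapshotLSN+1 {
		return false // GC removed part of the replay span
	}
	return target <= walHead
}

func main() {
	// Snapshot at 50, WAL retained for [51, 120]: LSN 100 is provable.
	fmt.Println(canReconstructAt(100, 50, 51, 120))
}
```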

---

## A8: Durability Mode Semantics Are Correct

**Must prove**: `best_effort`, `sync_all`, and `sync_quorum` behave as intended under mixed replica states.

**Pass condition**: `sync_all` is strict, `sync_quorum` commits only with a true durable quorum, and invalid topologies are rejected.

| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| sync_quorum continues with one lagging | `TestSyncQuorumContinuesWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_all blocks with one lagging | `TestSyncAllBlocksWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_quorum mixed states | `TestSyncQuorumWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| sync_all mixed states | `TestSyncAllBlocksWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| Barrier timeout: sync_all blocked | `TestP03_BarrierTimeout_SyncAll_Blocked` | `phase03_timeout_test.go` | distsim | PASS |
| Barrier timeout: sync_quorum commits | `TestP03_BarrierTimeout_SyncQuorum_StillCommits` | `phase03_timeout_test.go` | distsim | PASS |
| Promotion uses RecoverableLSN | `EvaluateCandidateEligibility()` | `cluster.go` | distsim | Implemented |
| Promoted replica has committed prefix (I4) | `TestI4_PromotedReplica_HasCommittedPrefix` | `phase045_crash_test.go` | distsim | PASS |

**Verdict**: A8 is well covered. `sync_all` is strict (blocks on lagging replicas), and `sync_quorum` uses a true durable quorum, not a connection count. Promotion now uses `RecoverableLSN()` for the committed-prefix check.
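
The mode semantics in the table reduce to one decision: count durable replica LSNs, never connections. A hedged sketch of that decision (the names `canCommit`, `durable`, and `rf` are illustrative; `durable` holds each replication-group member's durably persisted LSN and `rf` is the replication factor):

```go
package main

import "fmt"

// canCommit sketches the durability-mode decision: commit truth comes
// from durable replica LSNs, never from how many replicas are connected.
func canCommit(mode string, commitLSN uint64, durable []uint64, rf int) bool {
	n := 0
	for _, lsn := range durable {
		if lsn >= commitLSN {
			n++ // this member has durably persisted up to commitLSN
		}
	}
	switch mode {
	case "sync_all":
		return n == rf // strict: a single lagging member blocks commit
	case "sync_quorum":
		return n >= rf/2+1 // true durable majority
	default: // best_effort
		return true
	}
}

func main() {
	durable := []uint64{10, 10, 5} // one member lagging at LSN 5
	fmt.Println(canCommit("sync_all", 10, durable, 3), canCommit("sync_quorum", 10, durable, 3))
}
```

Under these inputs `sync_all` blocks while `sync_quorum` commits, which is exactly the split the lagging-replica tests assert.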

---

## Summary

| Criterion | Simulator Evidence | Prototype Evidence | Status |
|-----------|-------------------|-------------------|--------|
| A5 (catch-up escalation) | 6 tests | 1 test | **Strong** |
| A6 (recoverability boundary) | 6 tests + `RecoverableLSN()` | 1 test | **Strong** |
| A7 (historical correctness) | 7 tests + `CanReconstructAt()` | — | **Strong** (new in Phase 4.5) |
| A8 (durability modes) | 7 tests + `RecoverableLSN()` | — | **Strong** |

**Total executable evidence**: 26 simulator tests + 2 prototype tests + 2 new storage methods.

All A5-A8 acceptance criteria have direct test evidence. No criterion depends solely on design-doc claims.

---

## Still Open (Not Blocking)

| Item | Priority | Why not blocking |
|------|----------|------------------|
| Predicate exploration / adversarial search | P2 | Manual scenarios already cover the known failure classes |
| Catch-up convergence under sustained load | P2 | I3 proves escalation; load-rate modeling is an optimization |
| A5-A8 in a single grouped runner view | P3 | This traceability doc serves as the grouped evidence for now |

@@ -0,0 +1,304 @@
# Agent Development Process

Date: 2026-03-30
Status: active
Purpose: define the working split between `sw`, `tester`, and the review/management roles so each phase and slice has a clear delivery path

## Why This Exists

The project is now beyond pure exploration.

The expensive part is no longer only writing code.
The expensive part is:

1. delivery
2. review
3. fixes
4. re-review

So the process must reduce repeated full-stack review and make each role responsible for a distinct layer.

## Roles

### `manager`

Primary role:

- phase/plan owner

Responsibilities:

1. define the phase/slice direction and scope
2. accept the planning package before coding starts
3. decide whether a carry-forward is acceptable or must become a gate
4. perform the final-round review for overall logic, omissions, and product-path fit

### `architect`

Primary role:

- plan and technical reviewer

Responsibilities:

1. review the plan before implementation starts
2. tighten algorithm wording, scope edges, and expectation framing
3. review technical correctness during implementation
4. review API/state/resource/fail-closed behavior
5. catch semantic drift, scope drift, and V1/V1.5 leakage

### `sw`

Primary role:

- implementation owner

Responsibilities:

1. implement the accepted slice
2. state changed contracts
3. state fail-closed handling
4. state resources acquired/released
5. state carry-forward items
6. add or update tests

### `tester`

Primary role:

- evidence owner

Responsibilities:

1. define what the slice must prove before implementation starts
2. maintain the failure-class checklist
3. define reject conditions and the required test level
4. confirm that implementation claims are actually covered by evidence

## Default Routine

Each slice should follow this order:

1. `manager` defines the plan direction
2. `architect` reviews and tightens the plan / algorithm / expectation framing
3. `tester` writes the expectation template
4. `manager` accepts the package and records it in the phase docs
5. `sw` implements and submits with the delivery template
6. `architect` reviews the technical layer until it is clean enough
7. `tester` performs validation and evidence closure
8. `manager` performs the round-two review for overall logic and omissions

Urgent exception:

- if early work already shows major scope drift, a protocol contradiction, or V1/V1.5 leakage, architecture review may short-circuit before implementation grows further

## Delivery Template For `sw`

Each delivery should include:

1. changed contracts
2. fail-closed handling added
3. resources acquired/released
4. test inventory
5. known carry-forward notes

This template sits between implementation and the implementation/fail-closed review: it should accompany the delivery before reviewers start detailed review.

Suggested format:

```md
Changed contracts:
- ...

Fail-closed handling:
- ...

Resources acquired/released:
- ...

Test inventory:
- ...

Carry-forward notes:
- ...
```

## Phase Doc Usage

Use the three phase documents differently:

### `phase-xx.md`

Use for:

1. current execution direction
2. current scope
3. current guardrails
4. current accepted status
5. current assignments

Keep it short and execution-oriented.

### `phase-xx-log.md`

Use for:

1. detailed planning evolution
2. review feedback
3. carry-forward discussion
4. open observations
5. why wording or scope changed

This document may be longer and more detailed.

### `phase-xx-decisions.md`

Use for:

1. durable phase-level decisions
2. accepted boundaries that later rounds should inherit
3. gate decisions
4. decisions that should not be re-argued without new evidence

This document should stay compact and hold only the most important global decisions.

## Expectation Template For `tester`

Before or at slice start, `tester` should define:

1. must-pass expectations
2. the failure-class checklist
3. the required test level for each behavior
4. reject conditions

`tester` should re-engage after technical review is mostly clean, to confirm final evidence closure before the manager's second-round review.

Suggested format:

```md
Expectation:
- ...

Required level:
- entry path / engine / unit

Reject if:
- ...

Failure classes covered:
- ...
```

## Review Checklist For `architect`

Review these first:

1. nil handling
2. missing-resource handling
3. wrong-state / wrong-kind rejection
4. stale-ID / stale-authority rejection
5. resource pin / release symmetry
6. plan/execute/complete argument correctness
7. fail-closed cleanup on partial failure

## Failure-Class Checklist

This checklist should be kept active across phases.

Minimum recurring classes:

1. changed-address restart
2. stale epoch / stale session
3. missing resource pin
4. cleanup after a failed plan
5. replay range mis-derived
6. false trusted-base selection
7. truncation missing but completion attempted
8. bounded catch-up not escalating

## Process Rules

### Rule 1: Do not wait until the end to define proof

Each slice should begin with a statement of:

1. what must be proven
2. which failure classes must stay closed

### Rule 2: Do not let convenience wrappers silently become model truth

Any convenience flow must be explicitly classified as:

1. test-only convenience
2. stepwise engine task
3. planner/executor split

### Rule 3: Prefer evidence quality over object growth

New work should preferentially improve:

1. traceability
2. diagnosability
3. failure-class closure
4. adapter contracts

not just add:

1. more structs
2. more states
3. more helper APIs

### Rule 4: Use V1 as a validation source, not an architecture template

Use:

1. `learn/projects/sw-block/`
2. `weed/storage/block*`

for:

1. constraints
2. failure gates
3. implementation reality

Do not use them as the default V2 architecture template.

### Rule 5: Reuse reality, not inherited semantics

When later implementation reuses existing `Seaweed` / `V1` paths:

1. reuse control-plane reality
2. reuse storage/runtime reality
3. reuse execution mechanisms

but do not silently inherit:

1. address-shaped identity
2. old recovery-classification semantics
3. old committed-truth assumptions
4. old failover-authority assumptions

Any such reuse should be reviewed explicitly as one of:

1. safe reuse
2. reuse with an explicit boundary
3. temporary carry-forward
4. hard gate before later phases

## Current Direction

The project has moved from exploration-heavy work to evidence-first engine work.

From `Phase 06` onward, the default is:

1. plan first
2. review the plan before coding
3. implement
4. review the technical layer
5. close evidence
6. do the final manager review

@@ -0,0 +1,403 @@
# Phase 07 Service-Slice Plan

Date: 2026-03-30
Status: draft
Scope: `Phase 07 P0`

## Purpose

Define the first real-system service slice that will host the V2 engine, choose the first concrete integration path in the existing codebase, and map engine adapters onto real modules.

This is a planning document. It does not claim the integration already works.

## Decision

The first service slice should be:

- a single `blockvol` primary on a real volume server
- with one replica target (the `RF=2` path)
- driven by the existing master heartbeat / assignment loop
- using the V2 engine only for replication recovery ownership / planning / execution

This is the narrowest real-system slice that still exercises:

1. real assignment delivery
2. real epoch and failover signals
3. real volume-server lifecycle
4. real WAL/checkpoint/base-image truth
5. real changed-address / reconnect behavior

It is narrow enough to avoid reopening the whole system, but real enough to stop hiding behind engine-local mocks.

## Why This Slice

This slice is the right first integration target because:

1. `weed/server/master_grpc_server.go` already delivers block-volume assignments over heartbeat
2. `weed/server/master_block_failover.go` already owns failover / promotion / pending-rebuild decisions
3. `weed/storage/blockvol/blockvol.go` already owns the current replication runtime (`shipperGroup`, receiver, WAL retention, checkpoint state)
4. the existing V1/V1.5 failure history is concentrated in exactly this master <-> volume-server <-> blockvol path

So this slice gives maximum validation value with minimum new surface.
## First Concrete Integration Path |
|||
|
|||
The first integration path should be: |
|||
|
|||
1. master receives volume-server heartbeat |
|||
2. master updates block registry and emits `BlockVolumeAssignment` |
|||
3. volume server receives assignment |
|||
4. block volume adapter converts assignment + local storage state into V2 engine inputs |
|||
5. V2 engine drives sender/session/recovery state |
|||
6. existing block-volume runtime executes the actual data-path work under engine decisions |
|||
|
|||
In code, that path starts here: |
|||
|
|||
- master side: |
|||
- `weed/server/master_grpc_server.go` |
|||
- `weed/server/master_block_failover.go` |
|||
- `weed/server/master_block_registry.go` |
|||
- volume / storage side: |
|||
- `weed/storage/blockvol/blockvol.go` |
|||
- `weed/storage/blockvol/recovery.go` |
|||
- `weed/storage/blockvol/wal_shipper.go` |
|||
- assignment-handling code under `weed/storage/blockvol/` |
|||
- V2 engine side: |
|||
- `sw-block/engine/replication/` |
|||
|
|||
## Service-Slice Boundaries

### In-process placement

The V2 engine should initially live:

- in-process with the volume server / `blockvol` runtime
- not in master
- not as a separate service yet

Reason:

- the engine needs local access to storage truth and local recovery execution
- master should remain the control-plane authority, not a recovery executor

### Control-plane boundary

Master remains authoritative for:

1. epoch
2. role / assignment
3. promotion / failover decisions
4. replica membership

The engine consumes these as control inputs. It does not replace master failover policy in `Phase 07`.

### Control-Over-Heartbeat Upgrade Path

For the first V2 product path, the recommended direction is:

- reuse the existing master <-> volume-server heartbeat path as the control carrier
- upgrade the block-specific control semantics carried on that path
- do not immediately invent a separate control service or assignment channel

Why:

1. this is the real Seaweed path already carrying block assignments and confirmations today
2. it gives the fastest route to a real integrated control path
3. it preserves compatibility with existing Seaweed master/volume-server semantics while V2 hardens its own control truth

Concretely, the current V1 path already provides:

1. block assignments delivered in heartbeat responses from `weed/server/master_grpc_server.go`
2. assignment application on the volume server in `weed/server/volume_grpc_client_to_master.go` and `weed/server/volume_server_block.go`
3. assignment confirmation and address-change refresh driven by later heartbeats in `weed/server/master_grpc_server.go` and `weed/server/master_block_registry.go`
4. an immediate block heartbeat on selected shipper state changes in `weed/server/volume_grpc_client_to_master.go`

What should be upgraded for V2 is not mainly the transport, but the control contract carried on it:

1. a stable `ReplicaID`
2. an explicit `Epoch`
3. explicit role / assignment authority
4. explicit apply/confirm semantics
5. explicit stale-assignment rejection
6. explicit address-change refresh as an endpoint change, not an identity change
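
The stale-assignment rejection item can be sketched as a fail-closed epoch gate at the apply path. The shapes here (`assignment`, `applyAssignment`, a per-volume confirmed-epoch map) are illustrative, not the actual heartbeat message layout:

```go
package main

import "fmt"

// assignment is a simplified control message carrying explicit epoch.
type assignment struct {
	volume string
	epoch  uint64
}

// applyAssignment accepts an assignment only if its epoch is at least the
// locally confirmed one. A lower epoch is stale authority; a zero epoch
// means the sender never declared authority. Both are dropped (fail closed).
func applyAssignment(confirmed map[string]uint64, a assignment) bool {
	if a.epoch == 0 || a.epoch < confirmed[a.volume] {
		return false // stale or unidentified authority: reject
	}
	confirmed[a.volume] = a.epoch
	return true
}

func main() {
	confirmed := map[string]uint64{}
	fmt.Println(applyAssignment(confirmed, assignment{"vol-7", 5})) // fresh epoch
	fmt.Println(applyAssignment(confirmed, assignment{"vol-7", 3})) // stale: rejected
}
```

Re-delivery at the same epoch stays accepted, so confirmation retries remain idempotent while delayed pre-failover assignments are fenced out.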

Current cadence note:

- the block-volume heartbeat is periodic (`5 * sleepInterval`) with some immediate state-change heartbeats
- this is acceptable as the first hardening carrier
- it should not be assumed to be the final control-responsiveness model

Deferred design decision:

- whether block control should eventually move beyond heartbeat-only carriage into a more explicit control/assignment channel should be decided only after the `Phase 08 P1` real control-delivery path exists and can be measured

That later decision should be based on:

1. failover / reassignment responsiveness
2. assignment-confirmation precision
3. operational complexity
4. whether heartbeat carriage remains too coarse for the block-control path

Until then, the preferred direction is:

- strengthen block control semantics over the existing heartbeat path
- do not prematurely create a second control plane

### Storage boundary

`blockvol` remains authoritative for:

1. WAL head / retention reality
2. checkpoint/base-image reality
3. actual catch-up streaming
4. actual rebuild transfer / restore operations

The engine consumes these as storage truth and recovery-execution capabilities. It does not replace the storage backend in `Phase 07`.

## First-Slice Identity Mapping

This must be explicit in the first integration slice.

For `RF=2` on the existing master / block-registry path:

- the stable engine `ReplicaID` should be derived from:
  - `<volume-name>/<replica-server-id>`
- not from:
  - `DataAddr`
  - `CtrlAddr`
  - the heartbeat transport endpoint

For this slice, the adapter should map:

1. `ReplicaID`
   - from the master/block-registry identity for the replica host entry

2. `Endpoint`
   - from the current replica receiver/data/control addresses reported by the real runtime

3. `Epoch`
   - from the confirmed master assignment for the volume

4. `SessionKind`
   - from the master-driven recovery intent / role-transition outcome

This is a hard first-slice requirement because address refresh must not collapse identity back into endpoint-shaped keys.
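
A minimal sketch of the identity rule, assuming a hypothetical `replicaID` helper (the real bridge code may differ). The important part is the fail-closed branch: a missing server ID produces an error and the replica is skipped, never a silent fallback to `DataAddr` or another address-shaped key.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaID derives the stable engine identity for the first slice:
// <volume-name>/<replica-server-id>. It fails closed: an empty server ID
// is an error, never a fallback to an address-shaped key.
func replicaID(volume, serverID string) (string, error) {
	if serverID == "" {
		return "", errors.New("missing server ID: skip replica (fail closed)")
	}
	return volume + "/" + serverID, nil
}

func main() {
	id, _ := replicaID("vol-7", "srv-42")
	fmt.Println(id) // vol-7/srv-42
}
```

Because the key carries no address, a replica that restarts on a new `DataAddr` keeps the same `ReplicaID` and therefore the same engine-side state.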

## Adapter Mapping

### 1. ControlPlaneAdapter

Engine interface today:

- `HandleHeartbeat(serverID, volumes)`
- `HandleFailover(deadServerID)`

The real mapping should be:

- master-side source:
  - `weed/server/master_grpc_server.go`
  - `weed/server/master_block_failover.go`
  - `weed/server/master_block_registry.go`
- volume-server-side sink:
  - the assignment receive/apply path in `weed/storage/blockvol/`

Recommended real shape:

- do not literally push raw heartbeat messages into the engine
- instead introduce a thin adapter that converts confirmed master assignment state into:
  - a stable `ReplicaID`
  - an endpoint set
  - an epoch
  - a recovery target kind

That keeps master as the control owner and the engine as the execution owner.

Important note:

- the adapter should treat heartbeat as the transport carrier, not as the final protocol shape
- block-control semantics should be made explicit over that carrier
- if a later phase concludes that heartbeat-only carriage is too coarse, that should be a separate design decision made after the real hardening path is measured
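
One consequence of that shape: replicas are keyed by the stable `ReplicaID`, so an address-change refresh mutates an endpoint field and never mints a new identity. A hypothetical sketch (the `replica` struct and `refreshEndpoint` are illustrative, not the bridge's real types):

```go
package main

import "fmt"

// replica is keyed by stable ID; endpoint and epoch are mutable attributes.
type replica struct {
	id       string
	endpoint string
	epoch    uint64
}

// refreshEndpoint applies an address-change refresh as an endpoint update.
// It never inserts under a new key, so identity survives the address change;
// an unknown identity is ignored rather than implicitly created (fail closed).
func refreshEndpoint(replicas map[string]*replica, id, newAddr string) bool {
	r, ok := replicas[id]
	if !ok {
		return false
	}
	r.endpoint = newAddr
	return true
}

func main() {
	rs := map[string]*replica{
		"vol-7/srv-42": {id: "vol-7/srv-42", endpoint: "10.0.0.1:8080", epoch: 5},
	}
	refreshEndpoint(rs, "vol-7/srv-42", "10.0.0.9:8080")
	fmt.Println(len(rs), rs["vol-7/srv-42"].endpoint)
}
```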

### 2. StorageAdapter

Engine interface today:

- `GetRetainedHistory()`
- `PinSnapshot(lsn)` / `ReleaseSnapshot(pin)`
- `PinWALRetention(startLSN)` / `ReleaseWALRetention(pin)`
- `PinFullBase(committedLSN)` / `ReleaseFullBase(pin)`

The real mapping should be:

- retained-history source:
  - the current WAL head/tail/checkpoint state from `weed/storage/blockvol/blockvol.go`
  - recovery helpers in `weed/storage/blockvol/recovery.go`
- WAL retention pin:
  - the existing retention-floor / replica-aware WAL retention machinery around `shipperGroup`
- snapshot pin:
  - existing snapshot/checkpoint artifacts in `blockvol`
- full-base pin:
  - an explicit pinned full-extent export or an equivalent consistent base handle from `blockvol`

Important constraint:

- `Phase 07` must not fake this by reconstructing `RetainedHistory` from tests or metadata alone

### 3. Execution Driver / Executor hookup

The engine side already has:

- the planner/executor split in `sw-block/engine/replication/driver.go`
- stepwise executors in `sw-block/engine/replication/executor.go`

The real mapping should be:

- the engine planner decides:
  - zero-gap / catch-up / rebuild
  - the trusted-base requirement
  - the replayable-tail requirement
- the blockvol runtime performs:
  - actual WAL catch-up transport
  - actual snapshot/base transfer
  - actual truncation / apply operations

Recommended split:

- the engine owns the contract and state transitions
- the blockvol adapter owns the concrete I/O work
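
The planner side of that split can be sketched as a pure decision over retained history versus replica progress. This is illustrative (`planRecovery` is a hypothetical name; the real planner in `driver.go` carries session and base state as well), but it shows the policy that `blockvol` must not re-decide:

```go
package main

import "fmt"

// planRecovery sketches the policy split: the engine decides the recovery
// class from retained WAL history vs replica progress; blockvol only
// executes whichever class the engine picked.
func planRecovery(replicaLSN, walFloor, primaryTail uint64) string {
	switch {
	case replicaLSN >= primaryTail:
		return "zero-gap" // nothing to transfer
	case replicaLSN+1 >= walFloor:
		return "catch-up" // the replayable tail is still retained
	default:
		return "rebuild" // tail is GC'd: trusted base + tail required
	}
}

func main() {
	// Replica at LSN 80, WAL retained from 50, primary tail at 100.
	fmt.Println(planRecovery(80, 50, 100))
}
```

A replica whose next needed record falls below the retention floor cannot be caught up from WAL alone, so the planner escalates to rebuild instead of letting the executor attempt an unreplayable stream.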

## First-Slice Acceptance Rule

For the first integration slice, this is a hard rule:

- `blockvol` may execute recovery I/O
- `blockvol` must not own recovery policy

Concretely, `blockvol` must not decide:

1. zero-gap vs catch-up vs rebuild
2. trusted-base validity
3. replayable-tail sufficiency
4. whether rebuild fallback is required

Those decisions must remain in the V2 engine.

The bridge may translate engine decisions into concrete blockvol actions, but it must not re-decide recovery policy underneath the engine.

## First Product Path

The first product path should be:

- `RF=2` block-volume replication on the existing heartbeat/assignment loop
- primary + one replica
- failover / reconnect / changed-address handling
- rebuild as the formal non-catch-up recovery path

This is the right first path because it exercises the core correctness boundary without introducing N-replica coordination complexity too early.

## What Must Be Replaced First

Current engine-stage pieces that are still mock/test-only or too abstract:

### Replace first

1. `mockStorage` in engine tests
   - replace with a real `blockvol`-backed `StorageAdapter`

2. synthetic control events in engine tests
   - replace with assignment-driven events from the real master/volume-server path

3. convenience recovery-completion wrappers
   - keep them test-only
   - real integration should use the planner + executor + storage work loop

### Can remain temporarily abstract in Phase 07 P0/P1

1. the exact public shape of `ControlPlaneAdapter`
   - can remain thin while the integration path is being chosen

2. async production-scheduler details
   - the executor can still be driven by a service loop before the full background-task architecture is finalized

## Recommended Concrete Modules

### Engine stays here

- `sw-block/engine/replication/`

### First real adapter package should be added near blockvol

Recommended initial location:

- `weed/storage/blockvol/v2bridge/`

Reason:

- keeps the V2 engine independent under `sw-block/`
- keeps the real-system glue close to blockvol storage truth
- avoids copying engine logic into `weed/`

Suggested contents:

1. `control_adapter.go`
   - convert the master assignment / local apply path into engine intents

2. `storage_adapter.go`
   - expose retained history, pin/release, and trusted-base export handles from real blockvol state

3. `executor_bridge.go`
   - translate engine executor steps into actual blockvol recovery actions

4. `observe_adapter.go`
   - map engine status/logs into service-visible diagnostics

## First Failure Replay Set For Phase 07

The first real-system replay set should be:

1. changed-address restart
   - current risk: old identity/address coupling reappears in the service glue

2. stale epoch / stale result after failover
   - current risk: master and engine disagree on authority timing

3. unreplayable-tail rebuild fallback
   - current risk: service glue over-trusts checkpoint/base availability

4. plan/execution cleanup after resource failure
   - current risk: blockvol-side resource failures leave engine or service state dangling

5. primary failover to a replica, with rebuild pending on old-primary reconnect
   - current risk: old V1/V1.5 semantics leak back into reconnect handling

## Non-Goals For This Slice

Do not use `Phase 07` to:

1. widen catch-up semantics
2. add smart rebuild optimizations
3. redesign all blockvol internals
4. replace the full V1 runtime in one move
5. claim production readiness

## Deliverables For Phase 07 P0

A good `P0` delivery should include:

1. the chosen service slice
2. the chosen integration path in the current repo
3. the adapter-to-module mapping
4. the list of test-only adapters to replace first
5. the first failure replay set
6. an explicit note of what remains outside this first slice

## Short Form

`Phase 07 P0` should start with:

- the engine in `sw-block/engine/replication/`
- the bridge in `weed/storage/blockvol/v2bridge/`
- first real slice = blockvol primary + one replica on the existing master heartbeat / assignment path
- `ReplicaID = <volume-name>/<replica-server-id>` for the first slice
- `blockvol` executes I/O but does not own recovery policy
- first product path = `RF=2` failover/reconnect/rebuild correctness

@@ -0,0 +1,686 @@
# V2 Algorithm Overview

Date: 2026-03-27
Status: strategic design overview
Audience: CEO / owner / technical leadership

## Purpose

This document explains the current V2 direction for `sw-block`:

- what V2 is trying to solve
- why V1 and V1.5 are not enough as the long-term architecture
- why a WAL-based design is still worth pursuing
- how V2 compares with major market and paper directions
- how simulation and the real test runner systematically build confidence

This is not a phase report and not a production-commitment document.

It is the high-level technical rationale for the V2 line.

## Relationship To Other Documents

| Document | Role |
|----------|------|
| `v1-v15-v2-comparison.md` | Detailed comparison of the three lines |
| `v2-acceptance-criteria.md` | Protocol validation bar |
| `v2_scenarios.md` | Scenario backlog and simulator mapping |
| `v2-open-questions.md` | Remaining algorithmic questions |
| `protocol-development-process.md` | Method for protocol work |
| `learn/projects/sw-block/algorithm_overview.md` | Current V1/V1.5 system review |
| `learn/projects/sw-block/design/algorithm_survey.md` | Paper and vendor survey |
| `learn/projects/sw-block/test/README.md` | Real test runner overview |
| `learn/projects/sw-block/test/test-platform-review.md` | Test platform maturity and standalone direction |

## 1. Executive Summary

The current judgment is:

- `V1` proved that the basic WAL-based replicated block model can work.
- `V1.5` materially improved real recovery behavior and now has stronger operational evidence on real hardware.
- `V2` exists because the next correctness problems should not be solved by incremental local fixes. They should be made explicit in the protocol itself.

The central V2 idea is simple:

- short-gap recovery should be explicit
- stale authority should be explicitly fenced
- catch-up vs rebuild should be an explicit decision
- recovery ownership should be a protocol object, not an implementation accident

`V2` is not yet a production engine. But it is already the stronger architectural direction.

The correct strategic posture today is:

- continue `V1.5` as the production line
- continue `V2` as the long-term architecture line
- continue the WAL investigation, because we now have a serious validation framework
- if prototype evidence later shows a structural flaw, evolve to `V2.5` before heavy implementation

## 2. The Real Problem V2 Tries To Solve |
|||
|
|||
At the frontend, a block service looks simple: |
|||
|
|||
- `write` |
|||
- `flush` / `sync` |
|||
- failover |
|||
- recovery |
|||
|
|||
But the real difficulty is not the frontend verb set. The real difficulty is the asynchronous distributed boundary between: |
|||
|
|||
- local WAL append on the primary |
|||
- durable progress on replicas |
|||
- client-visible commit / sync truth |
|||
- failover and promotion safety |
|||
- recovery after lag, restart, endpoint change, or timeout |
|||
|
|||
This is the root reason V2 exists. |
|||
|
|||
The project has already learned that correctness problems in block storage do not usually come from the happy path. They come from: |
|||
|
|||
- a replica going briefly down and coming back |
|||
- a replica coming back on a new address |
|||
- a delayed stale barrier or stale reconnect result |
|||
- a lagging node that is almost, but not quite, recoverable |
|||
- a failover decision made on insufficient lineage information |
|||
|
|||
V2 is the attempt to make those cases first-class protocol behavior instead of post-hoc patching. |
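
To make "explicitly fenced" concrete, here is a minimal sketch of a replica-side accept check that fails closed against stale authority and out-of-order records. The names (`ReplicaState`, `AcceptAppend`) are illustrative assumptions, not the actual sw-block API:

```go
package main

import "fmt"

// ReplicaState is a hypothetical sketch of the per-replica fencing state.
type ReplicaState struct {
	CurrentEpoch uint64 // highest primary epoch this replica has acknowledged
	NextLSN      uint64 // next in-order LSN expected from the primary
}

// AcceptAppend rejects traffic from a stale primary (fencing) and
// out-of-order records, so a delayed message from an old authority
// can never mutate the current lineage.
func (r *ReplicaState) AcceptAppend(epoch, lsn uint64) bool {
	if epoch < r.CurrentEpoch {
		return false // stale authority: fail closed
	}
	if epoch > r.CurrentEpoch {
		r.CurrentEpoch = epoch // a new primary has taken over
	}
	if lsn != r.NextLSN {
		return false // gap or replay: must go through explicit recovery
	}
	r.NextLSN++
	return true
}

func main() {
	r := &ReplicaState{CurrentEpoch: 2, NextLSN: 10}
	fmt.Println(r.AcceptAppend(2, 10)) // in-order, current epoch: accepted
	fmt.Println(r.AcceptAppend(1, 11)) // stale epoch: fenced
	fmt.Println(r.AcceptAppend(2, 13)) // LSN gap: rejected, needs catch-up
}
```

The point of the sketch is only the shape of the decision: both rejection branches are explicit protocol outcomes, not timeouts or side effects.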

## 3. Why V1 And V1.5 Are Not Enough

This overview does not need a long retelling of `V1` and `V1.5`.

What matters is their architectural limit.

### What `V1` got right

`V1` proved the basic shape:

- ordered WAL
- primary-replica replication
- extent-backed storage
- epoch and lease as the first fencing model

### Why `V1` is not enough

Its main shortcomings were:

- short-gap recovery was too weak and too implicit
- lagging replicas too easily fell into rebuild or long degraded states
- changed-address restart was fragile
- stale authority and stale results were not modeled as first-class protocol objects
- the system did not cleanly separate:
  - current WAL head
  - committed prefix
  - recoverable retained range
  - stale or divergent replica tail

### Why `V1.5` is still not enough

`V1.5` fixed several real operational problems:

- retained-WAL catch-up
- same-address reconnect
- `sync_all` correctness on real tests
- rebuild fallback after an unrecoverable gap
- control-plane refresh after changed-address restart

Those fixes matter, and they are why `V1.5` is the stronger production line today.

But `V1.5` is still not the long-term architecture because its recovery model remains too incremental:

- reconnect logic is still layered onto an older shipper model
- recovery ownership was discovered as a bug class before it became a protocol object
- catch-up vs rebuild became clearer, but still not clean enough as a top-level protocol contract
- the system still looks too much like "repair V1" rather than "define the next replication model"

### What `V2` changes

`V2` is not trying to invent a completely different storage model.

It is trying to make the critical parts explicit:

- recovery ownership
- lineage-safe recovery boundary
- catch-up vs rebuild classification
- per-replica sender authority
- stale-result rejection
- explicit recovery orchestration

So the correct comparison is still:

- `V1.5` is stronger operationally today
- `V2` is stronger architecturally today

That is not a contradiction. It is the right split between a current production line and the next architecture line.

```mermaid
flowchart TD
    V1[V1]
    V15[V1_5]
    V2[V2]
    realFailures[RealFailures]
    realTests[RealHardwareEvidence]
    simAndProto[SimulationAndPrototype]

    V1 --> V15
    V15 --> V2
    realFailures --> V15
    realFailures --> V2
    V15 --> realTests
    V2 --> simAndProto
```

## 4. How V2 Solves WAL And Extent Synchronization

The core V2 question is not simply "do we keep WAL?"

The real question is:

**How do WAL and extent stay synchronized across primary and replica while preserving both stability and performance?**

This is the center of the V2 design.

### 4.1 The basic separation of roles

V2 treats the storage path as two different but coordinated layers:

- **WAL** is the ordered truth for recent history
- **extent** is the stable materialized image

WAL is used for:

- strict write ordering
- local crash recovery
- short-gap replica catch-up
- durable progress accounting through `LSN`

Extent is used for:

- a stable read image
- long-lived storage
- checkpoint and base-image creation
- long-gap recovery, but only through a real checkpoint/snapshot base, not through guessing from the current live extent

This separation is the first stability rule:

- do not ask the current extent to behave like historical state
- do not ask WAL to be the only long-range recovery mechanism forever

### 4.2 Primary-replica synchronization model

The intended V2 steady-state model is:

1. primary allocates monotonic `LSN`
2. primary appends ordered WAL locally
3. primary enqueues the record to per-replica sender loops
4. replicas receive in order and advance explicit progress
5. barrier/sync uses **durable replica progress**, not optimistic send progress
6. flusher later materializes WAL-backed dirty state into extent

The local WAL-to-extent lifecycle can be understood as:

```mermaid
stateDiagram-v2
    [*] --> WalAppended
    WalAppended --> SenderQueued
    SenderQueued --> ReplicaReceived
    ReplicaReceived --> ReplicaDurable
    ReplicaDurable --> SyncEligible
    SyncEligible --> ExtentMaterialized
    ExtentMaterialized --> CheckpointAdvanced

    note right of WalAppended
        Ordered local WAL exists
        and defines the write LSN
    end note

    note right of ReplicaDurable
        Replica durable progress
        is now explicit
    end note

    note right of ExtentMaterialized
        Flusher moves stable data
        from WAL-backed dirty state
        into extent
    end note
```

The critical synchronization rule is:

- **client-visible sync truth must follow durable replica progress**
  - not local send progress
  - not local WAL head
  - not "replica probably received it"

This is why V2 uses a lineage-safe recovery target such as `CommittedLSN` instead of a looser notion like "current primary head."
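
The rule above can be sketched in a few lines. This is an illustrative sketch under assumed names (`ReplicaProgress`, `committedLSN`), not the real sw-block code: the committed boundary is derived only from durable replica progress, never from send progress or the local WAL head.

```go
package main

import "fmt"

// ReplicaProgress separates the two notions the text warns against blurring.
type ReplicaProgress struct {
	SentLSN    uint64 // optimistic: what the sender loop has pushed
	DurableLSN uint64 // authoritative: what the replica persisted and acked
}

// committedLSN returns the highest LSN durable on the primary and on every
// required replica. Promotion safety and sync truth follow this value.
func committedLSN(primaryDurable uint64, replicas []ReplicaProgress) uint64 {
	committed := primaryDurable
	for _, r := range replicas {
		if r.DurableLSN < committed {
			committed = r.DurableLSN // durable progress, not SentLSN
		}
	}
	return committed
}

func main() {
	replicas := []ReplicaProgress{
		{SentLSN: 120, DurableLSN: 118}, // in-flight records not yet durable
		{SentLSN: 120, DurableLSN: 95},  // lagging replica holds the boundary
	}
	fmt.Println(committedLSN(120, replicas)) // 95, even though 120 was sent everywhere
}
```

Note that `SentLSN` never participates in the computation; it exists only to show which field a looser design would be tempted to use.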

### 4.2.1 Sync mode and result model

V2 also makes the sync-result logic more explicit.

- `best_effort` should succeed after the primary has reached its local durability point, even if replicas are degraded.
- `sync_all` should succeed only when all required replicas are durable through the target boundary.
- `sync_quorum` should succeed only when a true durable quorum exists through the target boundary.

This decision path can be presented as:

```mermaid
flowchart TD
    writeReq[WriteAndSyncRequest]
    localDurable[PrimaryLocalDurable]
    barrierEval[EvaluateReplicaDurableProgress]
    bestEffortAck[best_effortAck]
    syncAllAck[sync_allAck]
    syncQuorumAck[sync_quorumAck]
    rejectOrBlock[RejectOrBlock]

    writeReq --> localDurable
    localDurable --> bestEffortAck
    localDurable --> barrierEval

    barrierEval -->|"allRequiredReplicasDurable"| syncAllAck
    barrierEval -->|"durableQuorumExists"| syncQuorumAck
    barrierEval -->|"notEnoughDurableReplicas"| rejectOrBlock
```

The key point is that sync success is no longer inferred from send progress or socket health. It is derived from explicit durable progress at the right safety boundary.
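
The three mode rules can be expressed as one pure decision function. A minimal sketch, assuming illustrative names (`SyncMode`, `syncSatisfied`) and a simple majority quorum over primary plus replicas; the real engine's quorum definition may differ:

```go
package main

import "fmt"

// SyncMode mirrors the three modes described in the text.
type SyncMode int

const (
	BestEffort SyncMode = iota
	SyncAll
	SyncQuorum
)

// syncSatisfied reports whether a sync at targetLSN can be acknowledged,
// given the primary's local durability and each replica's durable LSN.
func syncSatisfied(mode SyncMode, targetLSN, primaryDurable uint64, replicaDurable []uint64) bool {
	if primaryDurable < targetLSN {
		return false // even best_effort requires local durability first
	}
	durable := 0
	for _, lsn := range replicaDurable {
		if lsn >= targetLSN {
			durable++ // this replica is durable through the target boundary
		}
	}
	switch mode {
	case BestEffort:
		return true // local durability point reached; replicas may lag
	case SyncAll:
		return durable == len(replicaDurable)
	case SyncQuorum:
		// majority over primary + replicas; primary is already durable here
		return 1+durable > (1+len(replicaDurable))/2
	default:
		return false
	}
}

func main() {
	replicas := []uint64{100, 80} // second replica lags behind LSN 100
	fmt.Println(syncSatisfied(BestEffort, 100, 100, replicas)) // true
	fmt.Println(syncSatisfied(SyncAll, 100, 100, replicas))    // false
	fmt.Println(syncSatisfied(SyncQuorum, 100, 100, replicas)) // true
}
```

The inputs are only durable LSNs; nothing about sockets, senders, or timing appears in the signature, which is exactly the point of the section.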

### 4.3 Why this should be stable

This model is designed to be stable because the dangerous ambiguities are separated:

- **write ordering** is carried by WAL and `LSN`
- **durability truth** is carried by barrier / flushed progress
- **recovery ownership** is carried by sender + recovery attempt identity
- **catch-up vs rebuild** is an explicit classification, not an accidental timeout side effect
- **promotion safety** depends on committed prefix and lineage, not on whichever node looks newest

In other words, V2 stability comes from reducing hidden coupling.

The design tries to remove cases where one piece of state silently stands in for another.

### 4.4 Why this can still be high-performance

The performance argument is not that V2 is magically faster in all cases.

The argument is narrower and more realistic:

- keep the primary write path simple:
  - ordered local WAL append
  - enqueue to per-replica sender loops
  - no heavy inline recovery logic in foreground writes
- keep most complexity off the healthy hot path: sender ownership, reconnect classification, catch-up / rebuild decisions, and timeout and stale-result fencing all live mostly in recovery/control paths
- use WAL for what it is good at:
  - recent ordered delta
  - short-gap replay
- stop using WAL as the answer to every lag problem:
  - long-gap recovery should move toward a checkpoint/snapshot base plus tail replay

So the V2 performance thesis is:

- **healthy steady-state should remain close to V1.5**
- **degraded/recovery behavior should become much cleaner**
- **short-gap recovery should be cheaper than rebuild**
- **long-gap recovery should stop forcing an unbounded WAL-retention tax**

That is a much stronger and more believable claim than saying "V2 will just be faster."
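
The "stop using WAL as the answer to every lag problem" point implies that catch-up vs rebuild is a cheap, explicit classification over the retained WAL range. A hypothetical sketch (the names `RecoveryPlan` and `classify` are assumptions, not the real protocol code):

```go
package main

import "fmt"

// RecoveryPlan is the explicit three-way outcome, instead of a timeout side effect.
type RecoveryPlan int

const (
	NoRecovery RecoveryPlan = iota // replica already at the committed head
	CatchUp                        // short gap: replay the retained WAL tail
	Rebuild                        // long gap: checkpoint/snapshot base + tail replay
)

// classify picks the recovery path for a replica given its durable LSN and
// the primary's retained WAL range [retainStart, head].
func classify(replicaDurable, retainStart, head uint64) RecoveryPlan {
	switch {
	case replicaDurable >= head:
		return NoRecovery
	case replicaDurable+1 >= retainStart:
		return CatchUp // every missing record is still retained in WAL
	default:
		return Rebuild // the gap precedes retained history: needs a base image
	}
}

func main() {
	fmt.Println(classify(500, 400, 500)) // 0: no recovery needed
	fmt.Println(classify(450, 400, 500)) // 1: catch-up from retained WAL
	fmt.Println(classify(300, 400, 500)) // 2: rebuild from checkpoint base
}
```

Because the decision is a pure function of explicit boundaries, the WAL retention window can stay bounded: a replica outside the window is classified for rebuild immediately rather than forcing unbounded retention.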

### 4.5 Why WAL is still worth choosing

The reason to keep the WAL-based direction is that it gives the best foundation for this exact synchronization problem:

- explicit order
- explicit history
- explicit committed prefix
- explicit short-gap replay
- explicit failover reasoning

WAL is risky only if the design blurs:

- local write acceptance
- replica durable progress
- committed boundary
- recoverable retained history

V2 exists precisely to stop blurring those things.

So the current project position is:

- WAL is not automatically safe
- but WAL is still the most promising base for this block service
- because the project now has enough real evidence, simulator coverage, and prototype work to investigate it rigorously

## 5. Comparison With Market And Papers

The current V2 direction is not chosen because other vendors are wrong. It is chosen because other directions solve different problems and carry different costs.

### Ceph / RBD style systems

Ceph-style block systems avoid this exact per-volume replicated WAL shape. They gain:

- deep integration with object-backed distributed storage
- mature placement and recovery machinery
- strong cluster-scale distribution logic

But they pay elsewhere:

- more system layers
- more object-store and peering complexity
- a heavier operational and conceptual model

This is not a free simplification. It is a different complexity trade.

For `sw-block`, the design choice is to keep a narrower software block service with more explicit per-volume replication semantics instead of inheriting the full distributed object-backed block complexity stack.

### PolarFS / ParallelRaft style work

These systems explore more aggressive ordering and apply strategies:

- out-of-order or conflict-aware work
- deeper parallelism
- more sophisticated log handling

They are valuable references, especially for:

- LBA conflict reasoning
- recovery and replay cost thinking
- future flusher parallelization ideas

But they also introduce a much heavier correctness surface.

The project does not currently want to buy that complexity before fully proving the simpler strict-order path.

### AWS chain replication / EBS-style lessons

Chain replication and related work are attractive because they address real bandwidth and recovery concerns:

- primary NIC pressure
- forwarding topology
- cleaner scaling for RF=3

This is one of the more plausible directions to borrow from later.

But it changes:

- the latency profile
- failure handling
- barrier semantics
- operational topology

So it belongs to a later architecture stage, not to the current V2 core proof.

### The actual strategic choice

The project is deliberately choosing:

- a narrower software-first block design
- explicit per-volume correctness
- strict reasoning before performance heroics
- validation before feature expansion

That is not conservatism for its own sake. It is how to build a block product that can later be trusted.

## 6. Why This Direction Fits SeaweedFS And Future Standalone sw-block

`sw-block` started inside SeaweedFS, but V2 is already being shaped as the next standalone block service line.

That means the architecture should preserve two things at once:

### What should remain compatible

- placement and topology concepts where they remain useful
- explainable control-plane contracts
- operational continuity with the SeaweedFS ecosystem

### What should become more block-specific

- replication correctness
- recovery ownership
- recoverability classification
- a block-specific test and evidence story

So the current direction is:

- use SeaweedFS as the practical ecosystem and experience base
- but shape V2 as a true block-service architecture, not as a minor sub-feature of `weed/`

This is why the V2 line belongs under `sw-block/` rather than as a direct patch path inside the existing production tree.

## 7. The Systematic Validation Method

The second major reason the current direction is rational is the validation method.

The project is no longer relying on:

- implement first
- discover behavior later
- patch after failure

Instead, the intended ladder is:

- contract and invariants
- scenario backlog
- simulator
- timer/race simulator
- standalone prototype
- real engine test runner

```mermaid
flowchart TD
    contract[ContractAndInvariants]
    scenarios[ScenarioBacklog]
    distsim[distsim]
    eventsim[eventsim]
    prototype[enginev2Prototype]
    runner[RealTestRunner]
    confidence[SystemAndProductConfidence]

    contract --> scenarios
    scenarios --> distsim
    scenarios --> eventsim
    distsim --> prototype
    eventsim --> prototype
    prototype --> runner
    runner --> confidence
```

This is the right shape for a risky block-storage algorithm:

- simulation for protocol truth
- prototype for executable truth
- real runner for product/system truth

## 8. What The Simulation System Proves

The simulation system exists to answer:

- what should happen
- what must never happen
- which V1/V1.5 shapes fail
- why the V2 shape is better

### `distsim`

`distsim` is the main protocol simulator.

It is used for:

- protocol correctness
- state transitions
- stale authority fencing
- promotion and lineage safety
- catch-up vs rebuild
- changed-address restart
- candidate safety
- reference-state checking

### `eventsim`

`eventsim` is the timing/race layer.

It is used for:

- barrier timeout behavior
- catch-up timeout behavior
- reservation timeout behavior
- same-tick and delayed event ordering
- stale timeout effects

### What the simulator is good at

It is especially strong for proving:

- stale traffic rejection
- explicit recovery boundaries
- timeout/race semantics
- failover correctness at the committed prefix
- why old authority must not mutate current lineage

### What the simulator does not prove

It does not prove:

- real TCP behavior
- real OS scheduling behavior
- disk timing
- real `WALShipper` integration
- real frontend behavior under iSCSI or NVMe

So the simulator is not the whole truth.

It is the algorithm/protocol truth layer.

## 9. What The Real Test Runner Proves

The real test runner under `learn/projects/sw-block/test/` is the system and product validation layer.

It is not merely QA support. It is a core part of whether the design can be trusted.

### What it covers

The runner and surrounding test system already span:

- unit tests
- component tests
- integration tests
- distributed scenarios
- real hardware workflows

The environment already includes:

- real nodes
- real block targets
- real fault injection
- benchmark and result capture
- run bundles and scenario traceability

### Why it matters

The runner is what tells us whether:

- the implemented engine behaves as the design says
- the product works under real restart/failover/rejoin conditions
- the operator workflows are credible
- benchmark claims are real rather than accidental

This is why the runner is best thought of as:

- implementation truth
- system truth
- product truth

not just test automation.

## 10. How Simulation And Test Runner Progress Systematically

The intended feedback loop is:

1. V1/V1.5 real failures happen
2. those failures are turned into design requirements
3. scenarios are distilled for simulator use
4. the simulator closes protocol ambiguity
5. the standalone prototype closes execution ambiguity
6. the real test runner validates system behavior on real environments
7. new failures or mismatches feed back into design again

This gives the project two different but complementary truths:

- `simulation -> algorithm / protocol correctness`
- `test runner -> implementation / system / product correctness`

That separation is healthy.

It prevents two common mistakes:

- trusting design without real behavior
- trusting green system tests without understanding the protocol deeply enough

## 11. Current Status And Honest Limits

### What is already strong

- `V1.5` has materially better recovery behavior than `V1` and stronger operational evidence
- `V2` has stronger architectural structure than `V1.5`
- the simulator has serious acceptance coverage
- the prototype line has already started closing ownership and orchestration risk
- the real test runner is large enough to support serious system validation

### What is not yet done

- `V2` is not a production engine
- prototype work is still in early-to-mid stages
- historical-data / recovery-boundary prototype work is not complete
- steady-state performance of `V2` is not yet proven
- real hardware validation of `V2` does not yet exist

So the correct statement is not:

- "V2 is already better in production"

The correct statement is:

- "V2 is the better long-term architecture, but not yet the stronger deployed engine"

## 12. Why The Current Direction Is Rational

The current direction is rational because it keeps the right split:

- `V1.5` continues as the production line today
- `V2` continues as the next architecture line

This lets the project:

- keep shipping and hardening what already works
- explore the better architecture without destabilizing the current engine
- use simulation, prototype work, and the real runner to decide whether V2 should become the next real engine

The final strategic rule should remain:

- continue the WAL investigation, because the project now has a credible validation framework
- continue V2, because the architectural evidence is strong
- if prototype evidence later reveals a structural flaw, redesign to `V2.5` before heavy implementation

That is the disciplined path for a block-storage algorithm.

## Bottom Line

If choosing based on current production proof:

- use `V1.5`

If choosing based on long-term protocol quality:

- choose `V2`

If choosing based on whether WAL should still be investigated:

- yes, because the project now has the right validation stack to investigate it responsibly

That is the current strategic answer.
|||
@ -0,0 +1,660 @@ |
|||
# V2 算法综述 |
|||
|
|||
日期:2026-03-27 |
|||
状态:战略级设计综述 |
|||
读者:CEO / owner / 技术管理层 |
|||
|
|||
## 文档目的 |
|||
|
|||
本文用于说明 `sw-block` 当前 `V2` 方向背后的核心判断: |
|||
|
|||
- `V2` 到底想解决什么问题 |
|||
- 为什么 `V1` / `V1.5` 不足以作为长期架构 |
|||
- 为什么我们仍然认为基于 `WAL` 的方向值得继续走 |
|||
- `V2` 与主要市场方案 / 论文路线相比的取舍是什么 |
|||
- `simulation` 与真实 `test runner` 如何形成系统化验证闭环 |
|||
|
|||
这不是 phase 汇报,也不是对生产可用性的承诺文档。 |
|||
|
|||
它是对 `V2` 这条架构线的高层技术解释。 |
|||
|
|||
## 与其他文档的关系 |
|||
|
|||
| 文档 | 作用 | |
|||
|------|------| |
|||
| `v1-v15-v2-comparison.md` | 三条技术线的详细比较 | |
|||
| `v2-acceptance-criteria.md` | V2 协议验证下限 | |
|||
| `v2_scenarios.md` | 场景清单与 simulator 覆盖 | |
|||
| `v2-open-questions.md` | 仍未关闭的算法问题 | |
|||
| `protocol-development-process.md` | 协议开发方法论 | |
|||
| `learn/projects/sw-block/algorithm_overview.md` | 当前 V1/V1.5 系统级算法综述 | |
|||
| `learn/projects/sw-block/design/algorithm_survey.md` | 论文 / vendor 调研与借鉴项 | |
|||
| `learn/projects/sw-block/test/README.md` | 真实测试系统入口 | |
|||
| `learn/projects/sw-block/test/test-platform-review.md` | test runner 的平台化方向 | |
|||
|
|||
## 1. 执行摘要 |
|||
|
|||
当前最准确的结论是: |
|||
|
|||
- `V1` 证明了基于 `WAL` 的复制块存储基本路径是可行的。 |
|||
- `V1.5` 在真实恢复场景上已经比 `V1` 明显更强,并且有真实硬件上的运行证据。 |
|||
- `V2` 的意义,不是在已有逻辑上继续打补丁,而是把最关键的恢复与一致性问题直接上升为协议对象。 |
|||
|
|||
`V2` 的核心想法可以概括为: |
|||
|
|||
- 短间隙恢复要显式 |
|||
- 过期 authority 要显式 fencing |
|||
- `catch-up` 与 `rebuild` 的边界要显式 |
|||
- 恢复 ownership 要成为协议的一部分,而不是实现细节里的偶然行为 |
|||
|
|||
所以今天正确的策略是: |
|||
|
|||
- 继续用 `V1.5` 作为当前生产线 |
|||
- 继续用 `V2` 作为长期架构线 |
|||
- 继续认真研究 `WAL` 路线,因为现在我们已经具备了可信的验证框架 |
|||
- 如果后续 prototype 证明 `V2` 有结构性缺陷,就应当先演进到 `V2.5`,而不是硬着头皮直接实现 |
|||
|
|||
## 2. V2 真正要解决的问题 |
|||
|
|||
从前端看,块存储似乎只有几个简单动作: |
|||
|
|||
- `write` |
|||
- `flush` / `sync` |
|||
- failover |
|||
- recovery |
|||
|
|||
但真正难的,不是这些前端动作本身,而是异步分布式边界: |
|||
|
|||
- primary 本地 WAL 追加 |
|||
- replica 端 durable progress |
|||
- client 可见的 sync / commit 真值 |
|||
- failover / promote 时的数据边界 |
|||
- lag、restart、address change、timeout 后的恢复正确性 |
|||
|
|||
这才是 `V2` 存在的根因。 |
|||
|
|||
项目已经反复验证过:块存储真正的 bug 通常不出在 happy path,而出在: |
|||
|
|||
- replica 短暂掉线又回来 |
|||
- replica 重启后地址变化 |
|||
- 延迟到达的 stale barrier / stale reconnect 结果 |
|||
- 一个 lagging replica 看起来“差一点点就能恢复” |
|||
- failover 时基于错误 lineage 做了 promote |
|||
|
|||
`V2` 就是要把这些情况变成协议的第一公民,而不是上线后再继续被动修补。 |
|||
|
|||
## 3. 为什么 V1 / V1.5 不够 |
|||
|
|||
这份综述不需要长篇回顾 `V1` 和 `V1.5` 的所有细节。 |
|||
|
|||
只需要讲清它们为什么不足以作为长期架构。 |
|||
|
|||
### `V1` 做对了什么 |
|||
|
|||
`V1` 建立了最重要的基础: |
|||
|
|||
- 严格有序的 `WAL` |
|||
- primary-replica 复制 |
|||
- 基于 `epoch + lease` 的初步 fencing |
|||
- 以 `extent` 作为稳定数据面,而不是一开始就做全日志结构 |
|||
|
|||
### `V1` 的不足 |
|||
|
|||
它的关键短板主要在恢复与退化场景: |
|||
|
|||
- 短 outage 很容易演化成 rebuild 或长期 degraded |
|||
- 恢复结构过于隐式 |
|||
- changed-address restart 脆弱 |
|||
- stale authority / stale result 还不是协议层的显式对象 |
|||
- 系统没有足够清晰地区分: |
|||
- 当前 head |
|||
- committed prefix |
|||
- recoverable retained range |
|||
- stale / divergent tail |
|||
|
|||
### `V1.5` 的不足 |
|||
|
|||
`V1.5` 已经解决了不少真实问题: |
|||
|
|||
- retained-WAL catch-up |
|||
- same-address reconnect |
|||
- `sync_all` 的真实行为 |
|||
- catch-up 失败后的 rebuild fallback |
|||
- changed-address restart 之后的 control-plane 刷新 |
|||
|
|||
所以它今天是更强的生产线。 |
|||
|
|||
但它仍然不是长期架构,因为它本质上仍然是增量修复: |
|||
|
|||
- reconnect 逻辑仍然附着在旧 shipper 模型上 |
|||
- 恢复 ownership 是先作为 bug 暴露出来,再逐步被抽象 |
|||
- `catch-up` vs `rebuild` 更清楚了,但还不够成为协议顶层契约 |
|||
- 整体感觉仍然更像“继续修 V1”,而不是“定义下一代复制协议” |
|||
|
|||
### `V2` 的变化 |
|||
|
|||
`V2` 不是重新发明一个完全不同的存储模型。 |
|||
|
|||
它的目标是把最关键的东西显式化: |
|||
|
|||
- recovery ownership |
|||
- lineage-safe recovery boundary |
|||
- `catch-up` / `rebuild` 分类 |
|||
- per-replica sender authority |
|||
- stale-result rejection |
|||
- 明确的 recovery orchestration |
|||
|
|||
因此最诚实的比较是: |
|||
|
|||
- `V1.5` 今天在运行证据上更强 |
|||
- `V2` 今天在架构质量上更强 |
|||
|
|||
这不是矛盾,而是“当前生产线”和“下一代架构线”应有的分工。 |
|||
|
|||
```mermaid |
|||
flowchart TD |
|||
V1[V1] |
|||
V15[V1_5] |
|||
V2[V2] |
|||
realFailures[真实故障] |
|||
realTests[真实硬件验证] |
|||
simAndProto[仿真与原型] |
|||
|
|||
V1 --> V15 |
|||
V15 --> V2 |
|||
realFailures --> V15 |
|||
realFailures --> V2 |
|||
V15 --> realTests |
|||
V2 --> simAndProto |
|||
``` |
|||
|
|||
## 4. V2 如何解决 WAL 与 Extent 的同步问题 |
|||
|
|||
`V2` 的核心问题不是“还要不要 WAL”。 |
|||
|
|||
真正的问题是: |
|||
|
|||
**primary 与 replica 之间,WAL 和 extent 如何保持同步,同时还能兼顾稳定性与性能。** |
|||
|
|||
这才是 `V2` 的中心。 |
|||
|
|||
### 4.1 基本分工 |
|||
|
|||
`V2` 把数据路径拆成两个既分离又协作的层: |
|||
|
|||
- **WAL**:近期历史的有序真相 |
|||
- **extent**:稳定的物化数据镜像 |
|||
|
|||
WAL 负责: |
|||
|
|||
- 严格写入顺序 |
|||
- 本地崩溃恢复 |
|||
- 短间隙 replica catch-up |
|||
- 基于 `LSN` 的 durable progress 计量 |
|||
|
|||
Extent 负责: |
|||
|
|||
- 稳定读镜像 |
|||
- 长期存储 |
|||
- checkpoint / base image 生成 |
|||
- 长间隙恢复时作为真正 base image 的来源 |
|||
|
|||
第一条稳定性原则就是: |
|||
|
|||
- 不要让当前 extent 冒充历史状态 |
|||
- 不要让 WAL 永远承担所有长距离恢复责任 |
|||
|
|||
### 4.2 Primary-replica 同步模型 |
|||
|
|||
`V2` 理想中的 steady-state 同步模型是: |
|||
|
|||
1. primary 分配单调递增的 `LSN` |
|||
2. primary 本地顺序追加 `WAL` |
|||
3. primary 把记录放入 per-replica sender loop |
|||
4. replica 按顺序接收并推进显式 progress |
|||
5. `barrier/sync` 依赖 replica 的 durable progress,而不是 optimistic send progress |
|||
6. flusher 再把 WAL-backed dirty state 物化到 extent |
|||
|
|||
本地 `WAL -> extent` 生命周期可以理解为: |
|||
|
|||
```mermaid |
|||
stateDiagram-v2 |
|||
[*] --> WalAppended |
|||
WalAppended --> SenderQueued |
|||
SenderQueued --> ReplicaReceived |
|||
ReplicaReceived --> ReplicaDurable |
|||
ReplicaDurable --> SyncEligible |
|||
SyncEligible --> ExtentMaterialized |
|||
ExtentMaterialized --> CheckpointAdvanced |
|||
``` |
|||
|
|||
这里最关键的规则是: |
|||
|
|||
- **client 可见的 sync 真值必须跟随 durable replica progress** |
|||
- 不能跟随 send progress |
|||
- 不能跟随 local WAL head |
|||
- 不能跟随“看起来 replica 应该已经收到了” |
|||
|
|||
这也是为什么 `V2` 使用像 `CommittedLSN` 这样的 lineage-safe 边界,而不是松散的“当前 primary head”。 |
|||
|
|||
### 4.2.1 不同 sync mode 如何判断结果 |
|||
|
|||
`V2` 让不同 sync mode 的成功条件变得更明确: |
|||
|
|||
- `best_effort`:primary 达到本地 durability point 后即可成功,replica 可以后台恢复 |
|||
- `sync_all`:所有 required replica 都要在目标边界上 durable |
|||
- `sync_quorum`:必须存在真实 durable quorum |
|||
|
|||
其判断路径可以表示为: |
|||
|
|||
```mermaid |
|||
flowchart TD |
|||
writeReq[WriteAndSyncRequest] |
|||
localDurable[PrimaryLocalDurable] |
|||
barrierEval[EvaluateReplicaDurableProgress] |
|||
bestEffortAck[best_effort成功] |
|||
syncAllAck[sync_all成功] |
|||
syncQuorumAck[sync_quorum成功] |
|||
rejectOrBlock[阻塞或失败] |
|||
|
|||
writeReq --> localDurable |
|||
localDurable --> bestEffortAck |
|||
localDurable --> barrierEval |
|||
|
|||
barrierEval -->|"allRequiredReplicasDurable"| syncAllAck |
|||
barrierEval -->|"durableQuorumExists"| syncQuorumAck |
|||
barrierEval -->|"notEnoughDurableReplicas"| rejectOrBlock |
|||
``` |
|||
|
|||
这意味着 sync 结果不再依赖: |
|||
|
|||
- socket 看起来还活着 |
|||
- sender 好像还在发 |
|||
- replica 似乎“差不多收到了” |
|||
|
|||
而是依赖显式 durable progress。 |
|||
|
|||
### 4.3 为什么这个设计应该更稳定 |
|||
|
|||
它试图把最危险的模糊边界拆开: |
|||
|
|||
- **写入顺序** 由 `WAL + LSN` 表达 |
|||
- **durability truth** 由 barrier / flushed progress 表达 |
|||
- **recovery ownership** 由 sender + recovery attempt identity 表达 |
|||
- **catch-up vs rebuild** 由显式分类表达 |
|||
- **promotion safety** 由 committed prefix 与 lineage 表达 |
|||
|
|||
也就是说,`V2` 的稳定性来自于减少隐式耦合。 |
|||
|
|||
### 4.4 为什么它仍然可以有高性能 |
|||
|
|||
这里不能夸大说 `V2` 一定在所有情况下都更快。 |
|||
|
|||
更准确的性能论点是: |
|||
|
|||
- 保持 primary 前台写路径简单: |
|||
- 本地顺序 `WAL append` |
|||
- 投递到 per-replica sender loop |
|||
- 不把复杂恢复逻辑塞进前台写路径 |
|||
- 把复杂度主要放在健康热路径之外: |
|||
- sender ownership |
|||
- reconnect classification |
|||
- catch-up / rebuild decision |
|||
- timeout 和 stale-result fencing |
|||
主要都在 recovery / control path |
|||
- 让 WAL 只承担它擅长的工作: |
|||
- 近期 ordered delta |
|||
- 短间隙 replay |
|||
- 不再让 WAL 承担所有长距离恢复: |
|||
- 长间隙恢复转向 checkpoint/snapshot base + tail replay |
|||
|
|||
所以 `V2` 的性能论点应该是: |
|||
|
|||
- **健康 steady-state 应该尽量接近 `V1.5`** |
|||
- **退化与恢复路径会更干净** |
|||
- **短间隙恢复会比 rebuild 更便宜** |
|||
- **长间隙恢复不再逼迫系统支付无上限的 WAL retention 税** |
|||
|
|||
这比“V2 天然更快”要可信得多。 |
|||
|
|||
### 4.5 为什么仍然选择 WAL |
|||
|
|||
之所以还继续走 WAL,是因为它仍然是解决这个同步问题最有力的基础: |
|||
|
|||
- 显式顺序 |
|||
- 显式历史 |
|||
- 显式 committed prefix |
|||
- 显式短间隙 replay |
|||
- 显式 failover reasoning |
|||
|
|||
只有当设计把以下概念混淆时,WAL 才会变得危险: |
|||
|
|||
- 本地写入接受 |
|||
- replica durable progress |
|||
- committed boundary |
|||
- recoverable retained history |
|||
|
|||
而 `V2` 的存在,正是为了不再混淆这些东西。 |
|||
|
|||
## 5. Comparison With Market and Research Directions

Choosing the `V2` direction does not mean other vendors are wrong; they solve different problems and accept different complexity.

### The Ceph / RBD Direction

Ceph/RBD avoids this per-volume replicated WAL shape.

What it gains:

- deep integration with object storage
- a mature placement and recovery framework
- stronger cluster-level distribution

What it pays:

- more system layers
- heavier object-store / peering complexity
- a heavier operational and conceptual model

So it is not "simpler"; the complexity has moved elsewhere.

For `sw-block`, the current choice is:

- keep a narrower software block-service model
- trade explicit per-volume correctness for more controllable complexity

### The PolarFS / ParallelRaft Direction

These systems explore more aggressive ordering and parallelism strategies:

- conflict-aware or out-of-order parallelism
- deeper log parallelism
- more complex apply / replay machinery

They remain worth borrowing from later:

- LBA conflict reasoning
- replay cost versus recovery cost
- flusher parallelism optimization

But they also clearly enlarge the correctness boundary.

At this stage, the project should not buy into this complexity before the strict-ordering model is fully proven.

### The AWS Chain-Replication / EBS-Style Experience

Chain replication is attractive because it solves real problems:

- primary NIC pressure
- forwarding topology
- better scalability at RF=3

It is a promising direction to borrow from later.

But it changes:

- the latency profile
- failure handling
- barrier semantics
- operational topology

So it belongs to a later architectural stage, not the current core V2 proof.

### The Actual Current Choice

The project currently chooses:

- a narrower, software-first block design
- explicit per-volume correctness
- clear logic before performance heroics
- a validation loop before feature expansion

This is not conservatism; it is what makes this block product worth trusting in the future.

## 6. Why This Direction Fits SeaweedFS and a Future Standalone sw-block

`sw-block` started inside SeaweedFS, but `V2` is already taking shape as a next-generation standalone block service.

Architecturally this means preserving two kinds of things at once:

### What must stay compatible

- placement / topology concepts
- an explainable control-plane contract
- operational continuity with the SeaweedFS ecosystem

### What should become more block-specific

- replication correctness
- recovery ownership
- recoverability classification
- a block-specific test / evidence system

So the current direction is not "keep treating V2 as a patch inside weed", but rather:

- use SeaweedFS as the experience and ecosystem base
- gradually shape `V2` into a genuinely standalone block-service architecture

## 7. A Systematic Validation Method

Another reason the current direction is sound is that the validation method itself is now systematic.

The project no longer relies on:

- implement first
- observe later
- fix bugs as they appear

Instead it relies on these layers:

- contract / invariants
- scenario backlog
- simulator
- timer/race simulator
- standalone prototype
- real engine test runner

```mermaid
flowchart TD
    contract[ContractAndInvariants]
    scenarios[ScenarioBacklog]
    distsim[distsim]
    eventsim[eventsim]
    prototype[enginev2Prototype]
    runner[RealTestRunner]
    confidence[SystemAndProductConfidence]

    contract --> scenarios
    scenarios --> distsim
    scenarios --> eventsim
    distsim --> prototype
    eventsim --> prototype
    prototype --> runner
    runner --> confidence
```

This is the right structure for a high-risk block storage algorithm:

- simulation proves protocol logic
- the prototype proves execution semantics
- the real runner proves system and product behavior

## 8. What the Simulation System Proves

The goal of the simulation system is to answer:

- what should happen
- what must never happen
- why the old designs fail
- why V2 is better

### `distsim`

`distsim` is the main protocol simulator, used for:

- protocol correctness
- state transitions
- stale authority fencing
- promotion / lineage safety
- catch-up vs rebuild
- changed-address restart
- candidate safety
- reference-state checking

### `eventsim`

`eventsim` is the time / race layer, used for:

- barrier timeouts
- catch-up timeouts
- reservation timeouts
- same-tick / delayed event ordering
- the impact of stale timeouts
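To make the time/race layer concrete, here is a minimal sketch of the core idea behind a deterministic tick scheduler. This is not the project's actual `eventsim` API; the `sim`, `event`, and `at` names are illustrative. The point is that events fire in `(tick, insertion-order)` order, so a same-tick race between, say, a barrier timeout and a replica ack replays identically on every run.

```go
package main

import (
	"fmt"
	"sort"
)

// event is a scheduled callback. seq (insertion order) breaks
// same-tick ties deterministically.
type event struct {
	tick int
	seq  int
	name string
	fn   func()
}

// sim is a toy deterministic scheduler: all events are registered up
// front, then fired in (tick, seq) order.
type sim struct {
	events  []event
	nextSeq int
	now     int
}

func (s *sim) at(tick int, name string, fn func()) {
	s.events = append(s.events, event{tick, s.nextSeq, name, fn})
	s.nextSeq++
}

func (s *sim) run() {
	sort.Slice(s.events, func(i, j int) bool {
		if s.events[i].tick != s.events[j].tick {
			return s.events[i].tick < s.events[j].tick
		}
		return s.events[i].seq < s.events[j].seq
	})
	for _, e := range s.events {
		s.now = e.tick
		e.fn()
	}
}

func main() {
	s := &sim{}
	var fired []string
	s.at(10, "catchUpTimeout", func() { fired = append(fired, "catchUpTimeout") })
	s.at(5, "barrierTimeout", func() { fired = append(fired, "barrierTimeout") })
	s.at(5, "replicaAck", func() { fired = append(fired, "replicaAck") })
	s.run()
	fmt.Println(fired) // [barrierTimeout replicaAck catchUpTimeout]
}
```

A real timer/race simulator also supports events scheduling further events and explicit stale-timer cancellation; this sketch only shows why replayed ordering is deterministic.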

### What simulation is good at proving

It is especially good at proving:

- stale traffic rejection
- explicit recovery boundaries
- timeout/race semantics
- failover correctness over the committed prefix
- that an old authority cannot mutate a new lineage

### What simulation does not prove

It does not prove:

- real TCP behavior
- real OS scheduling
- disk timing
- real `WALShipper` integration
- real iSCSI / NVMe front-end behavior

So simulation is not the whole truth.

It is the **algorithm / protocol truth layer**.

## 9. What the Real Test Runner Proves

The real test runner under `learn/projects/sw-block/test/` is the system and product validation layer.

It is not just a QA tool; it is a key part of whether the design can be trusted.

### What it covers

The current runner and surrounding test system already cover:

- unit
- component
- integration
- distributed scenario
- real-hardware workflows

And the environment already includes:

- real nodes
- real block targets
- real fault injection
- benchmarking and result collection
- run bundles and scenario traceability

### Why it matters

It helps us judge:

- whether the actual engine behaves as designed
- whether the product is reliable under real restart / failover / rejoin scenarios
- whether operator workflows can be trusted
- whether benchmark results are valid rather than accidental

So the test runner is best understood as:

- implementation truth
- system truth
- product truth

not merely a "test script framework".

## 10. How Simulation and the Test Runner Advance Together

The ideal feedback loop is:

1. real failures appear in `V1` / `V1.5`
2. those failures are turned into design requirements
3. which are distilled into simulator scenarios
4. the simulator closes protocol ambiguity
5. the standalone prototype closes execution ambiguity
6. the real test runner validates system behavior on hardware and in distributed environments
7. new failures or deviations feed back into the design

This yields two complementary kinds of truth:

- `simulation -> algorithm / protocol correctness`
- `test runner -> implementation / system / product correctness`

This layering is healthy because it avoids two common mistakes:

- trusting design reasoning alone, without real behavior
- trusting all-green system tests alone, without truly understanding the protocol

## 11. Current Status and Honest Boundaries

### What is already strong

- `V1.5` recovers markedly better than `V1`, with real operational evidence
- `V2`'s architectural clarity is markedly better than `V1.5`'s
- the simulator already has strong acceptance coverage
- the prototype has begun closing ownership and orchestration risk
- the real test runner is large enough to support serious system validation

### What is not finished

- `V2` is not yet a production engine
- the prototype is still in its early-to-mid stage
- the historical-data / recovery-boundary prototype is not yet closed
- `V2` steady-state performance has no real proof yet
- `V2` has not been validated on real hardware

So the most accurate statement is not:

- "V2 is already stronger in production"

but:

- "V2 is the better long-term architecture, but not yet the stronger deployed engine"

## 12. Why the Current Direction Is Rational

The current direction is rational because it keeps the right division of labor:

- `V1.5` remains today's production line
- `V2` remains the next-generation architecture line

This lets the project:

- keep delivering and hardening on an already-running system
- seriously validate the stronger architecture without disturbing the production line
- use simulation, the prototype, and the real runner to decide whether `V2` can truly become the next-generation engine

The strategic rules should stay unchanged:

- keep researching the WAL, because a credible validation framework now exists
- keep pushing V2, because the architectural evidence is already strong
- if the prototype reveals a structural flaw in V2, evolve to `V2.5` first rather than rushing a reimplementation

## Conclusion

Choosing on today's production evidence:

- choose `V1.5`

Choosing on long-term protocol quality:

- choose `V2`

Asking whether the WAL is still worth researching:

- yes, because the project now has a validation system serious enough to proceed responsibly

That is the most reasonable technical and strategic judgment today.

1068 sw-block/design/v2-detailed-algorithm.zh.md
File diff suppressed because it is too large
@@ -0,0 +1,170 @@
# V2 Engine Readiness Review

Date: 2026-03-29
Status: active
Purpose: record the decision on whether the current V2 design + prototype + simulator stack is strong enough to begin real V2 engine slicing

## Decision

Current judgment:

- proceed to real V2 engine planning
- do not open a `V2.5` redesign track at this time

This is a planning-readiness decision, not a production-readiness claim.

## Why This Review Exists

The project has now completed:

1. design/FSM closure for the V2 line
2. protocol simulation closure for:
   - V1 / V1.5 / V2 comparison
   - timeout/race behavior
   - ownership/session semantics
3. standalone prototype closure for:
   - sender/session ownership
   - execution authority
   - recovery branching
   - minimal historical-data proof
   - prototype scenario closure
4. `Phase 4.5` hardening for:
   - bounded `CatchUp`
   - first-class `Rebuild`
   - crash-consistency / restart-recoverability
   - `A5-A8` stronger evidence

So the question is no longer:

- "can the prototype be made richer?"

The question is:

- "is the evidence now strong enough to begin real engine slicing?"

## Evidence Summary

### 1. Design / Protocol

Primary docs:

- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/v1-v15-v2-comparison.md`
- `sw-block/design/v2-prototype-roadmap-and-gates.md`

Judgment:

- the protocol story is coherent
- an acceptance set exists
- major V1 / V1.5 failures are mapped into V2 scenarios

### 2. Simulator

Primary code/tests:

- `sw-block/prototype/distsim/`
- `sw-block/prototype/distsim/eventsim.go`
- `learn/projects/sw-block/test/results/v2-simulation-review.md`

Judgment:

- strong enough for protocol/design validation
- strong enough to challenge crash-consistency and liveness assumptions
- not a substitute for real engine / hardware proof

### 3. Prototype

Primary code/tests:

- `sw-block/prototype/enginev2/`
- `sw-block/prototype/enginev2/acceptance_test.go`

Judgment:

- ownership is explicit and fenced
- execution authority is explicit and fenced
- bounded `CatchUp` is semantic, not documentary
- `Rebuild` is a first-class sender-owned path
- historical-data and recoverability reasoning are executable

### 4. `A5-A8` Double Evidence

Prototype-side grouped evidence:

- `sw-block/prototype/enginev2/acceptance_test.go`

Simulator-side grouped evidence:

- `sw-block/design/a5-a8-traceability.md`
- `sw-block/prototype/distsim/`

Judgment:

- the critical acceptance items that most affect engine risk now have materially stronger proof on both sides

## What Is Good Enough Now

The following are good enough to begin engine slicing:

1. the sender/session ownership model
2. stale authority fencing
3. the recovery orchestration shape
4. the bounded `CatchUp` contract
5. `Rebuild` as a formal path
6. committed/recoverable boundary thinking
7. the crash-consistency / restart-recoverability proof style

## What Is Still Not Proven

The following still require real engine work and later real-system validation:

1. actual engine lifecycle integration
2. real storage/backend implementation
3. real control-plane integration
4. real durability / fsync behavior under the actual engine
5. real hardware timing / performance
6. final production observability and failure handling

These are expected gaps. They do not block engine planning.

## Open Risks To Carry Forward

These are not blockers, but they should remain explicit:

1. the prototype and simulator are still reduced models
2. rebuild-source quality in the real engine will depend on actual checkpoint/base-image mechanics
3. durability truth in the real engine must still be re-proven against actual persistence behavior
4. predicate exploration can still grow, but should not block engine slicing

## Engine-Planning Decision

Decision:

- start real V2 engine planning

Reason:

1. no current evidence points to a structural flaw requiring `V2.5`
2. the remaining gaps are implementation/system gaps, not prototype ambiguity
3. continuing to extend prototype/simulator breadth would have diminishing returns

## Required Outputs After This Review

1. `sw-block/design/v2-engine-slicing-plan.md`
2. the first real engine slice definition
3. explicit non-goals for the first engine stage
4. an explicit validation plan for engine slices

## Non-Goals Of This Review

This review does not claim:

1. V2 is production-ready
2. V2 should replace V1 immediately
3. all design questions are forever closed

It only claims:

- the project now has enough evidence to begin disciplined real engine slicing

@@ -0,0 +1,191 @@
# V2 Engine Slicing Plan

Date: 2026-03-29
Status: active
Purpose: define the first real V2 engine slices after prototype and `Phase 4.5` closure

## Goal

Move from:

- standalone design/prototype truth under `sw-block/prototype/`

to:

- a real V2 engine core under `sw-block/`

without dragging V1.5 lifecycle assumptions into the implementation.

## Planning Rules

1. reuse V1 ideas and tests selectively, not structurally
2. prefer narrow vertical slices over broad skeletons
3. each slice must preserve the accepted V2 ownership/fencing model
4. keep the simulator/prototype as validation support, not as the implementation itself
5. do not mix V2 engine work into `weed/storage/blockvol/`

## First Engine Stage

The first engine stage should build the control/recovery core, not the full storage engine.

That means:

1. per-replica sender identity
2. one active recovery session per replica per epoch
3. sender-owned execution authority
4. explicit recovery outcomes:
   - zero gap
   - bounded catch-up
   - rebuild
5. a rebuild execution shell only
   - do not hard-code the final snapshot + tail vs full base decision logic yet
   - keep the real rebuild-source choice tied to Slice 3 recoverability inputs

## Recommended Slice Order

### Slice 1: Engine Ownership Core

Purpose:

- carry the accepted `enginev2` ownership/fencing model into the real engine core

Scope:

1. a stable per-replica sender object
2. a stable recovery-session object
3. session identity fencing
4. endpoint / epoch invalidation
5. a sender-group or equivalent ownership registry

Acceptance:

1. stale session results cannot mutate current authority
2. changed-address and epoch-bump invalidation work in engine code
3. the 4 V2-boundary ownership themes remain provable

### Slice 2: Engine Recovery Execution Core

Purpose:

- move the prototype execution APIs into real engine behavior

Scope:

1. the connect / handshake / catch-up flow
2. bounded `CatchUp`
3. explicit `NeedsRebuild`
4. a sender-owned rebuild execution path
5. a rebuild execution shell without the final trusted-base selection policy

Acceptance:

1. bounded catch-up does not chase indefinitely
2. rebuild is mutually exclusive with catch-up
3. session completion rules are explicit and fenced

### Slice 3: Engine Data / Recoverability Core

Purpose:

- connect recovery behavior to real retained-history / checkpoint mechanics

Scope:

1. real recoverability decision inputs
2. the trusted-base decision for the rebuild source
3. minimal real checkpoint/base-image integration
4. real truncation / safe-boundary handling

This is the first slice that should decide, from real engine inputs, between:

1. `snapshot + tail`
2. `full base`

Acceptance:

1. the engine can explain why recovery is allowed
2. the rebuild-source choice is explicit and testable
3. historical correctness and truncation rules remain intact

### Slice 4: Engine Integration Closure

Purpose:

- bind the engine control/recovery core to real orchestration and validation surfaces

Scope:

1. the real assignment/control intent entry path
2. engine-facing observability
3. focused real-engine tests for V2-boundary cases
4. a first integration review against real failure classes

Acceptance:

1. key V2-boundary failures are reproduced and closed in engine tests
2. engine observability is good enough to debug ownership/recovery failures
3. the remaining gaps are system/performance gaps, not control-model ambiguity

## What To Reuse

Good reuse candidates:

1. tests and failure cases from V1 / V1.5
2. narrow utility/data helpers where not coupled to the V1 lifecycle
3. selected WAL/history concepts where they fit V2 ownership boundaries

Do not structurally reuse:

1. the V1/V1.5 shipper lifecycle
2. address-based identity assumptions
3. `SetReplicaAddrs`-style behavior
4. the old recovery control structure

## Where The Work Should Live

Real V2 engine work should continue under:

- `sw-block/`

Recommended next area:

- `sw-block/core/`

or

- `sw-block/engine/`

The exact path can be chosen later, but it should remain separate from:

- `sw-block/prototype/`
- `weed/storage/blockvol/`

## Validation Plan For Engine Slices

Each engine slice should be validated at three levels:

1. prototype alignment
   - does engine behavior preserve the accepted prototype invariant?
2. focused engine tests
   - does the real engine slice enforce the same contract?
3. scenario mapping
   - does at least one important V1/V1.5 failure class remain closed?

## Non-Goals For First Engine Stage

Do not try to do these immediately:

1. full Smart WAL expansion
2. performance optimization
3. a V1 replacement/migration plan
4. full product integration
5. all storage/backend redesign at once

## Immediate Next Assignment

The first concrete engine-planning task should be:

1. choose the real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write a short engine ownership-core spec
4. map 3-5 acceptance scenarios directly onto Slice 1 expectations

@@ -0,0 +1,199 @@
# V2 Production Roadmap

Date: 2026-03-30
Status: active
Purpose: define the path from the accepted V2 engine core to a production candidate

## Current Position

Completed:

1. design / FSM closure
2. simulator / protocol validation
3. prototype closure
4. evidence hardening
5. engine core slices:
   - Slice 1 ownership core
   - Slice 2 recovery execution core
   - Slice 3 data / recoverability core
   - Slice 4 integration closure

Current stage:

- entering broader engine implementation

This means the main risk is no longer:

- whether the V2 idea stands up

The main risk is:

- whether the accepted engine core can be turned into a real system without reintroducing V1/V1.5 structure and semantics

## Roadmap Summary

1. Phase 06: broader engine implementation stage
2. Phase 07: real-system integration / product-path decision
3. Phase 08: pre-production hardening
4. Phase 09: performance / scale / soak validation
5. Phase 10: production candidate and rollout gate

## Phase 06

### Goal

Connect the accepted engine core to:

1. real control truth
2. real storage truth
3. explicit engine execution steps

### Outputs

1. a control-plane adapter into the engine core
2. storage/base/recoverability adapters
3. an explicit execution-driver model where synchronous helpers are no longer sufficient
4. validation against selected real failure classes

### Gate

At the end of Phase 06, the project should be able to say:

- the engine core can live inside a real system shape

## Phase 07

### Goal

Move from engine-local correctness to a real runnable subsystem.

### Outputs

1. a service-style runnable engine slice
2. integration with real control and storage surfaces
3. crash/failover/restart integration tests
4. a decision on the first viable product path

### Gate

At the end of Phase 07, the project should be able to say:

- the engine can run as a real subsystem, not only as an isolated core

## Phase 08

### Goal

Turn correctness into operational safety.

### Outputs

1. observability hardening
2. operator/debug flows
3. recovery/runbook procedures
4. config surface cleanup
5. realistic durability/restart validation

### Gate

At the end of Phase 08, the project should be able to say:

- operators can run, debug, and recover the system safely

## Phase 09

### Goal

Prove viability under load and over time.

### Outputs

1. throughput / latency baselines
2. rebuild / catch-up cost characterization
3. steady-state overhead measurement
4. soak testing
5. scale and failure-under-load validation

### Gate

At the end of Phase 09, the project should be able to say:

- the design is not only correct, but viable at useful scale and duration

## Phase 10

### Goal

Produce a controlled production candidate.

### Outputs

1. a feature-gated production candidate
2. a rollback strategy
3. a migration/coexistence plan with V1
4. a staged rollout plan
5. a production acceptance checklist

### Gate

At the end of Phase 10, the project should be able to say:

- the system is ready for a controlled production rollout

## Cross-Phase Rules

### Rule 1: Do not reopen protocol shape casually

The accepted core should remain stable unless new implementation evidence forces a change.

### Rule 2: Use V1 as a validation source, not a design template

Use:

1. `learn/projects/sw-block/`
2. `weed/storage/block*`

for:

1. failure gates
2. constraints
3. integration references

Do not use them as the default V2 architecture template.

### Rule 3: Keep `CatchUp` narrow

Do not let later implementation phases re-expand `CatchUp` into a broad, optimistic, long-lived recovery mode.

### Rule 4: Keep evidence quality ahead of object growth

New work should preferentially improve:

1. traceability
2. diagnosability
3. real-failure validation
4. operational confidence

not simply add new objects, states, or mechanisms.

## Production Readiness Ladder

The project should move through this ladder explicitly:

1. proof-of-design
2. proof-of-engine-shape
3. proof-of-runnable-engine-stage
4. proof-of-operable-system
5. proof-of-viable-production-candidate

Current ladder position:

- between `2` and `3`
- the engine core is accepted; the broader runnable engine stage is underway

## Next Documents To Maintain

1. `sw-block/.private/phase/phase-06.md`
2. `sw-block/design/v2-engine-readiness-review.md`
3. `sw-block/design/v2-engine-slicing-plan.md`
4. this roadmap

@@ -0,0 +1,561 @@
# V2 Protocol Truths

Date: 2026-03-30
Status: active
Purpose: record the compact, stable truths that later phases must preserve, and provide a conformance reference for implementation reviews

## Why This Document Exists

`FSM`, `simulator`, `prototype`, and `engine` are not a code-production pipeline.
They are an evidence ladder.

So the most important output to carry forward is not only code, but:

1. accepted semantics
2. must-hold boundaries
3. failure classes that must stay closed
4. explicit places where later phases may improve or drift

This document is the compact truth table for the V2 line.

## How To Use It

For each later phase or slice, ask:

1. does the new implementation remain aligned with these truths?
2. if not, is the deviation constructive or risky?
3. which truth is newly strengthened by this phase?

Deviation labels:

- `Aligned`: the implementation preserves the truth
- `Constructive deviation`: the implementation changes shape but strengthens the truth
- `Risky deviation`: the implementation weakens or blurs the truth

## Core Truths

### T1. `CommittedLSN` is the external truth boundary

Short form:

- external promises are anchored at `CommittedLSN`, not `HeadLSN`

Meaning:

- recovery targets
- promotion safety
- flush/visibility reasoning

must all be phrased against `CommittedLSN`.

Prevents:

- using the optimistic WAL head as committed truth
- acknowledging lineage that failover cannot preserve

Evidence anchor:

- strong in design
- strong in simulator
- strong in prototype
- strong in engine

### T2. `ZeroGap <=> ReplicaFlushedLSN == CommittedLSN`

Short form:

- zero-gap requires exact equality with committed truth

Meaning:

- a replica that is ahead is not zero-gap
- a replica that is behind is not zero-gap

Prevents:

- unsafe fast-path completion
- replica-ahead being mistaken for in-sync

Evidence anchor:

- strong in prototype
- strong in engine

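The T2 equality rule can be sketched as a small classifier. This is illustrative Go, not the engine's API (`GapClass` and `Classify` are assumed names); the key point is that zero-gap is exact equality, never `>=`, so a replica that is ahead is classified separately and routed to the truncation path rather than treated as in-sync.

```go
package main

import "fmt"

// GapClass classifies a replica against committed truth (T2).
type GapClass int

const (
	ZeroGap GapClass = iota // exactly at CommittedLSN
	Behind                  // candidate for bounded catch-up or rebuild
	Ahead                   // divergent tail; must be truncated before completion (T9)
)

func (g GapClass) String() string {
	return [...]string{"ZeroGap", "Behind", "Ahead"}[g]
}

// Classify applies the T2 rule: zero-gap is exact equality with
// CommittedLSN; both directions of inequality are distinct states.
func Classify(replicaFlushedLSN, committedLSN uint64) GapClass {
	switch {
	case replicaFlushedLSN == committedLSN:
		return ZeroGap
	case replicaFlushedLSN < committedLSN:
		return Behind
	default:
		return Ahead
	}
}

func main() {
	fmt.Println(Classify(100, 100)) // ZeroGap
	fmt.Println(Classify(90, 100))  // Behind
	fmt.Println(Classify(110, 100)) // Ahead
}
```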
### T3. `CatchUp` is bounded replay on a still-trusted base

Short form:

- `CatchUp = KeepUp with bounded debt`

Meaning:

- catch-up is a short-gap, low-cost, bounded replay path
- it only makes sense while the replica base is still trustworthy enough to continue from

Prevents:

- turning catch-up into an indefinite moving-head chase
- hiding broad recovery complexity in replay logic

Evidence anchor:

- strong in design
- strong in simulator
- strong in prototype
- strong in engine

### T4. `NeedsRebuild` is explicit when replay is not the right answer

Short form:

- `NeedsRebuild <=> replay is unrecoverable, unstable, or no longer worth bounded replay`

Meaning:

- a long gap
- lost recoverability
- no trusted base
- a budget violation

must all escalate explicitly.

Prevents:

- pretending catch-up will eventually succeed
- carrying the V1/V1.5-style unbounded degraded chase forward

Evidence anchor:

- strong in simulator
- strong in prototype
- strong in engine

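The T3/T4 escalation rule, together with the T10 recoverability check, reduces to a single explicit decision function. The following is a hedged sketch, not the engine's real API: `DecideRecovery`, `retainedFloor`, and `maxDebt` are illustrative names for the real retained-history floor and catch-up budget inputs.

```go
package main

import "fmt"

// Decision is the explicit recovery outcome (T4: escalation is never implicit).
type Decision string

const (
	ZeroGap Decision = "zero_gap"
	CatchUp Decision = "catch_up"
	Rebuild Decision = "rebuild"
)

// DecideRecovery sketches the T3/T4/T10 rules:
//   - retainedFloor is the lowest LSN still replayable from retained history (T10)
//   - maxDebt bounds how large a gap bounded catch-up may accept (T3)
//   - every non-catch-up case escalates explicitly to Rebuild (T4)
func DecideRecovery(replicaLSN, committedLSN, retainedFloor, maxDebt uint64, baseTrusted bool) Decision {
	if replicaLSN == committedLSN {
		return ZeroGap
	}
	if replicaLSN > committedLSN {
		// Divergent tail: handled by truncation + rebuild, never catch-up.
		return Rebuild
	}
	gap := committedLSN - replicaLSN
	switch {
	case !baseTrusted:
		return Rebuild // no trusted base to replay onto (T3/T4)
	case replicaLSN < retainedFloor:
		return Rebuild // required replay range no longer recoverable (T10)
	case gap > maxDebt:
		return Rebuild // budget violation: do not chase indefinitely (T3)
	default:
		return CatchUp
	}
}

func main() {
	fmt.Println(DecideRecovery(95, 100, 50, 10, true))  // catch_up
	fmt.Println(DecideRecovery(40, 100, 50, 100, true)) // rebuild (below retained floor)
	fmt.Println(DecideRecovery(95, 100, 50, 3, true))   // rebuild (gap exceeds budget)
}
```

The design point is that every `Rebuild` return has a distinct, explainable cause; nothing degrades into an optimistic open-ended chase.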
### T5. `Rebuild` is the formal primary recovery path

Short form:

- `Rebuild = frozen TargetLSN + trusted base + optional tail + barrier`

Meaning:

- rebuild is not a shameful fallback
- it is the general recovery framework

Prevents:

- overloading catch-up with broad recovery semantics
- treating full/partial rebuild as unrelated protocols

Evidence anchor:

- strong in design
- strong in prototype
- strong in engine

### T6. Full and partial rebuild share one correctness contract

Short form:

- `full rebuild` and `partial rebuild` differ in transfer choice, not in truth model

Meaning:

- both require a frozen `TargetLSN`
- both require a trusted pinned base
- both require explicit durable completion

Prevents:

- optimization layers redefining protocol truth
- bitmap/range paths bypassing trusted-base rules

Evidence anchor:

- strong in design
- partial in engine
- stronger real-system proof still deferred

### T7. No recovery result may outlive its authority

Short form:

- `ValidMutation <=> sender exists && sessionID matches && epoch current && endpoint current`

Meaning:

- a stale session
- a stale epoch
- a stale endpoint
- a stale sender

must all fail closed.

Prevents:

- late results mutating the current lineage
- changed-address stale-completion bugs

Evidence anchor:

- strong in simulator
- strong in prototype
- strong in engine

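The T7 short form is directly executable as a fencing predicate. This is a minimal sketch with illustrative type and field names (`Authority`, `Result` are not the engine's real structs); what matters is the conjunction, each clause checked against the *current* authority, and the fail-closed default.

```go
package main

import "fmt"

// Authority is the current, live authority for a replica's recovery work.
type Authority struct {
	SenderAlive bool
	SessionID   uint64
	Epoch       uint64
	Endpoint    string
}

// Result is a completion arriving from (possibly old) in-flight work.
type Result struct {
	SessionID uint64
	Epoch     uint64
	Endpoint  string
}

// ValidMutation is the T7 predicate. It fails closed: any stale
// component (sender, session, epoch, endpoint) rejects the result.
func ValidMutation(cur Authority, r Result) bool {
	return cur.SenderAlive &&
		r.SessionID == cur.SessionID &&
		r.Epoch == cur.Epoch &&
		r.Endpoint == cur.Endpoint
}

func main() {
	cur := Authority{SenderAlive: true, SessionID: 7, Epoch: 3, Endpoint: "10.0.0.2:9000"}
	fmt.Println(ValidMutation(cur, Result{7, 3, "10.0.0.2:9000"})) // true
	// A result from before an epoch bump must be rejected.
	fmt.Println(ValidMutation(cur, Result{7, 2, "10.0.0.2:9000"})) // false
	// A result from before an address change must be rejected.
	fmt.Println(ValidMutation(cur, Result{7, 3, "10.0.0.9:9000"})) // false
}
```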
### T8. `ReplicaID` is stable identity; `Endpoint` is mutable location

Short form:

- `ReplicaID != address`

Meaning:

- address changes may invalidate sessions
- address changes must not destroy sender identity

Prevents:

- reintroducing address-shaped identity
- a changed-address restart appearing as logical removal + re-add

Evidence anchor:

- strong in prototype
- strong in engine
- strong in bridge P0

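A minimal sketch of what T8 implies for a sender registry, assuming illustrative names (`Registry`, `Observe`; not the engine's API): the registry is keyed by stable identity, an address change bumps the session without recreating the sender, and a missing identity fails closed rather than falling back to address-shaped identity.

```go
package main

import "fmt"

// Replica couples stable identity with mutable location.
type Replica struct {
	ReplicaID string // stable identity (e.g. a registry-issued server ID)
	Endpoint  string // mutable location
	SessionID uint64 // bumped whenever the endpoint changes
}

// Registry owns replicas by identity, never by address.
type Registry struct{ byID map[string]*Replica }

func NewRegistry() *Registry { return &Registry{byID: map[string]*Replica{}} }

// Observe fails closed on a missing identity, and on an address change
// updates location in place: sessions are invalidated, identity survives.
func (g *Registry) Observe(replicaID, endpoint string) (*Replica, error) {
	if replicaID == "" {
		return nil, fmt.Errorf("missing ReplicaID: fail closed, skip replica")
	}
	r, ok := g.byID[replicaID]
	if !ok {
		r = &Replica{ReplicaID: replicaID, Endpoint: endpoint, SessionID: 1}
		g.byID[replicaID] = r
		return r, nil
	}
	if r.Endpoint != endpoint {
		r.Endpoint = endpoint
		r.SessionID++ // invalidate in-flight work, keep identity
	}
	return r, nil
}

func main() {
	g := NewRegistry()
	a, _ := g.Observe("srv-1", "10.0.0.2:9000")
	b, _ := g.Observe("srv-1", "10.0.0.9:9000") // address changed
	fmt.Println(a == b)      // true: same identity object survives the move
	fmt.Println(b.SessionID) // 2: in-flight sessions invalidated
	_, err := g.Observe("", "10.0.0.3:9000")
	fmt.Println(err != nil) // true: missing identity fails closed
}
```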
### T9. Truncation is a protocol boundary, not cleanup

Short form:

- replica-ahead cannot complete until the divergent tail is explicitly truncated

Meaning:

- truncation is part of the recovery contract
- not a side effect or best-effort cleanup

Prevents:

- completing recovery while the replica still contains newer divergent writes

Evidence anchor:

- strong in design
- strong in engine

### T10. Recoverability must be proven from real retained history

Short form:

- `CatchUp allowed <=> the required replay range is recoverable from retained history`

Meaning:

- the engine should consume storage truth
- not test-reconstructed optimism

Prevents:

- replay over missing WAL
- fake recoverability based only on watermarks

Evidence anchor:

- strong in simulator
- strong in engine
- strengthened in driver/adapter phases

### T11. The trusted-base choice must be explicit and causal

Short form:

- `snapshot_tail` requires both a trusted checkpoint and a replayable tail

Meaning:

- snapshot existence alone is insufficient
- fallback to a full base must be explainable

Prevents:

- over-trusting old checkpoints
- silently choosing an invalid rebuild source

Evidence anchor:

- strong in simulator
- strong in engine
- strengthened by Phase 06

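The T11 decision can be sketched as a function whose every fallback carries a reason. This is a simplified model with assumed names (`ChooseRebuildSource`, `retainedFloor`); in particular it approximates "tail is replayable" as the checkpoint LSN sitting at or above the retained-history floor, which is a stand-in for the engine's real recoverability inputs.

```go
package main

import "fmt"

// RebuildSource is the explicit, explainable outcome of the T11 decision.
type RebuildSource struct {
	Kind   string // "snapshot_tail" or "full_base"
	Reason string // the choice must be causal, never silent
}

// ChooseRebuildSource allows snapshot+tail only when the checkpoint is
// trusted AND the tail from the checkpoint up to the frozen TargetLSN is
// still replayable from retained history.
func ChooseRebuildSource(checkpointTrusted bool, checkpointLSN, retainedFloor, targetLSN uint64) RebuildSource {
	if !checkpointTrusted {
		return RebuildSource{"full_base", "checkpoint not trusted"}
	}
	if checkpointLSN > targetLSN {
		return RebuildSource{"full_base", "checkpoint is ahead of frozen TargetLSN"}
	}
	if checkpointLSN < retainedFloor {
		return RebuildSource{"full_base", "tail between checkpoint and target is not replayable"}
	}
	return RebuildSource{"snapshot_tail", "trusted checkpoint with replayable tail"}
}

func main() {
	fmt.Println(ChooseRebuildSource(true, 80, 70, 100).Kind)  // snapshot_tail
	fmt.Println(ChooseRebuildSource(true, 60, 70, 100).Kind)  // full_base
	fmt.Println(ChooseRebuildSource(false, 80, 70, 100).Kind) // full_base
}
```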
### T12. Current extent cannot fake old history

Short form:

- historical correctness requires reconstructable history, not current-state approximation

Meaning:

- live extent state is not sufficient proof of an old target point
- historical reconstruction must be justified by checkpoint + retained history

Prevents:

- using the current extent as fake proof of an older state

Evidence anchor:

- strongest in simulator
- the engine currently proves prerequisites, not the full reconstruction proof

### T13. Promotion requires a recoverable committed prefix

Short form:

- a promoted replica must be able to recover committed truth, not merely advertise a high watermark

Meaning:

- candidate selection is about recoverable lineage, not optimistic flush visibility

Prevents:

- promoting a replica that cannot reconstruct the committed prefix after crash/restart

Evidence anchor:

- strong in simulator
- partially carried into engine semantics
- real-system validation still needed

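A sketch of what T13 implies for candidate selection, with assumed field names (`HeadLSN`, `RecoverableLSN`; not the engine's real structs): the selector filters and ranks by the provably recoverable committed prefix, and the advertised watermark never enters the decision.

```go
package main

import "fmt"

// Candidate distinguishes the optimistic watermark from the lineage
// that is provably reconstructable after crash/restart.
type Candidate struct {
	ReplicaID      string
	HeadLSN        uint64 // advertised watermark (optimistic; ignored here)
	RecoverableLSN uint64 // recoverable committed prefix
}

// PickPromotion chooses by recoverable committed prefix only (T13).
// It returns "" when no candidate can recover committed truth.
func PickPromotion(cands []Candidate, committedLSN uint64) string {
	best := ""
	var bestLSN uint64
	for _, c := range cands {
		if c.RecoverableLSN < committedLSN {
			continue // cannot preserve external promises; ineligible
		}
		if best == "" || c.RecoverableLSN > bestLSN {
			best, bestLSN = c.ReplicaID, c.RecoverableLSN
		}
	}
	return best
}

func main() {
	cands := []Candidate{
		{"r1", 120, 90},  // high head, but cannot recover the committed prefix
		{"r2", 100, 100}, // lower head, fully recoverable
	}
	fmt.Println(PickPromotion(cands, 100)) // r2
}
```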
### T14. `blockvol` executes I/O; engine owns recovery policy |
|||
|
|||
Short form: |
|||
|
|||
- adapters may translate engine decisions into concrete work |
|||
- they must not silently re-decide recovery classification or source choice |
|||
|
|||
Meaning: |
|||
|
|||
- master remains control authority |
|||
- engine remains recovery authority |
|||
- storage remains truth source |
|||
|
|||
Prevents: |
|||
|
|||
- V1/V1.5 policy leakage back into service glue |
|||
|
|||
Evidence anchor: |
|||
|
|||
- strong in Phase 07 service-slice planning |
|||
- initial bridge P0 aligns |
|||
- real-system proof still pending |
|||
|
|||
### T15. Reuse reality, not inherited semantics |
|||
|
|||
Short form: |
|||
|
|||
- V2 may reuse existing Seaweed control/runtime/storage paths |
|||
- it must not inherit old semantics as protocol truth |
|||
|
|||
Meaning: |
|||
|
|||
- reuse existing heartbeat, assignment, `blockvol`, receiver, shipper, retention, and runtime machinery when useful |
|||
- keep `ReplicaID`, epoch authority, recovery classification, committed truth, and rebuild boundaries anchored in accepted V2 semantics |
|||
|
|||
Prevents: |
|||
|
|||
- V1/V1.5 structure silently redefining V2 behavior |
|||
- convenience reuse turning old runtime assumptions into new protocol truth |
|||
|
|||
Evidence anchor: |
|||
|
|||
- strong in Phase 07/08 direction |
|||
- should remain active in later implementation phases |
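The stable-identity rule these truths protect (ReplicaID derived from the registry's stable ServerID, never from the mutable address) pairs with a fail-closed conversion at the bridge boundary. The sketch below uses illustrative stand-ins (`ReplicaAddr`, `convertReplica`) rather than the real bridge types; the point is the shape of the check, not the API.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaAddr is a hypothetical stand-in for the bridge's replica input.
type ReplicaAddr struct {
	ServerID string // stable identity from the registry, NOT the address
	Addr     string // mutable endpoint, never used as identity
}

// Replica is the converted form with identity and endpoint kept separate.
type Replica struct {
	ReplicaID string
	Endpoint  string
}

var errMissingIdentity = errors.New("replica has no ServerID; failing closed")

// convertReplica fails closed: a replica without a stable ServerID is
// skipped rather than falling back to address-derived identity.
func convertReplica(a ReplicaAddr) (Replica, error) {
	if a.ServerID == "" {
		return Replica{}, errMissingIdentity
	}
	return Replica{ReplicaID: a.ServerID, Endpoint: a.Addr}, nil
}

func main() {
	r, err := convertReplica(ReplicaAddr{ServerID: "srv-1", Addr: "10.0.0.5:8080"})
	fmt.Println(r.ReplicaID, err) // srv-1 <nil>

	_, err = convertReplica(ReplicaAddr{Addr: "10.0.0.5:8080"})
	fmt.Println(err != nil) // true: missing identity is rejected, not guessed
}
```

If the address later changes, `ReplicaID` is unaffected, which is what keeps address-shaped identity from reappearing at the boundary.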
## Current Strongest Evidence By Layer

| Layer | Main value |
|------|------------|
| `FSM` / design | define truth and non-goals |
| simulator | prove protocol truth and failure-class closure cheaply |
| prototype | prove implementation-shape and authority semantics cheaply |
| engine | prove the accepted contracts survive real implementation structure |
| service slice / runner | prove truth survives real control/storage/system reality |
## Phase Conformance Notes

### Phase 04

- `Aligned`: T7, T8
- strengthened sender/session ownership and stale rejection

### Phase 4.5

- `Aligned`: T3, T4, T5, T10, T12
- major tightening:
  - bounded catch-up
  - first-class rebuild
  - crash-consistency and recoverability proof style

### Phase 05

- `Aligned`: T1, T2, T3, T4, T5, T7, T8, T9, T10, T11
- engine core slices closed:
  - ownership
  - execution
  - recoverability gating
  - orchestrated entry path

### Phase 06

- `Aligned`: T10, T11, T14
- `Constructive deviation`: planner/executor split replaced convenience wrappers without changing protocol truth
- strengthened:
  - real storage/resource contracts
  - explicit release symmetry
  - failure-class validation against engine path

### Phase 07 P0

- `Aligned`: T8, T10, T14
- bridge now makes stable `ReplicaID` explicit at the service boundary
- bridge states the hard rule that the engine decides policy and `blockvol` executes I/O
- real `weed/storage/blockvol/` integration still pending
## Current Carry-Forward Truths For Later Phases

Later phases must not regress these:

1. `CommittedLSN` remains the external truth boundary
2. `CatchUp` stays narrow and bounded
3. `Rebuild` remains the formal primary recovery path
4. stale authority must fail closed
5. stable identity must remain separate from mutable endpoint
6. trusted-base choice must remain explicit and causal
7. service glue must not silently re-decide recovery policy
8. reuse reality, but do not inherit old semantics as V2 truth
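Truth 4 ("stale authority must fail closed") reduces to a one-way epoch fence: commands carrying an older epoch are rejected outright, never applied "best effort". A minimal sketch with hypothetical names follows; the engine's real fencing is richer, but the monotonicity is the invariant.

```go
package main

import "fmt"

// fencedState is an illustrative stand-in for any epoch-fenced component.
type fencedState struct {
	currentEpoch uint64
}

// admit rejects any command carrying an epoch older than the current fence.
// Equal-or-newer epochs are accepted and advance the fence, so once a newer
// authority has been seen, the old one can never act again.
func (f *fencedState) admit(cmdEpoch uint64) bool {
	if cmdEpoch < f.currentEpoch {
		return false // stale authority: fail closed
	}
	f.currentEpoch = cmdEpoch
	return true
}

func main() {
	f := &fencedState{currentEpoch: 3}
	fmt.Println(f.admit(3)) // true: same epoch is still authoritative
	fmt.Println(f.admit(5)) // true: newer epoch advances the fence
	fmt.Println(f.admit(4)) // false: epoch 4 is stale once 5 has been seen
}
```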
## Review Rule

Every later phase or slice should explicitly answer:

1. which truths are exercised?
2. which truths are strengthened?
3. does this phase introduce any constructive or risky deviation?
4. which evidence layer now carries the truth most strongly?

## Phase Alignment Rule

From `Phase 05` onward, every phase or slice should align explicitly against this document.

Minimum phase-alignment questions:

1. which truths are in scope?
2. which truths are strengthened?
3. which truths are merely carried forward?
4. does the phase introduce any constructive deviation?
5. does the phase introduce any risky deviation?
6. which evidence layer currently carries each in-scope truth most strongly?

Expected output shape for each later phase:

- `In-scope truths`
- `Strengthened truths`
- `Carry-forward truths`
- `Constructive deviations`
- `Risky deviations`
- `Evidence shift`
## Phase 05-07 Alignment

### Phase 05

Primary alignment focus:

- T1 `CommittedLSN` as external truth boundary
- T2 zero-gap exactness
- T3 bounded `CatchUp`
- T4 explicit `NeedsRebuild`
- T5/T6 rebuild correctness contract
- T7 stale authority must fail closed
- T8 stable `ReplicaID`
- T9 truncation as protocol boundary
- T10/T11 recoverability and trusted-base gating

Main strengthening:

- engine core adopted accepted protocol truths as real implementation structure

Main review risk:

- engine structure accidentally collapsing back to address identity or unfenced execution

### Phase 06

Primary alignment focus:

- T10 recoverability from real retained history
- T11 trusted-base choice remains explicit and causal
- T14 engine owns policy, adapters carry truth and execution contracts

Main strengthening:

- planner/executor/resource contracts
- fail-closed cleanup symmetry
- cross-layer proof path through engine execution

Main review risk:

- executor or adapters recomputing policy from convenience inputs
- storage/resource contracts becoming approximate instead of real

### Phase 07+

Primary alignment focus:

- T8 stable identity at the real service boundary
- T10 real storage truth into engine decisions
- T11 trusted-base proof remains explicit through service glue
- T14 `blockvol` executes I/O but does not own recovery policy

Main strengthening:

- real-system service-slice conformance
- real control-plane and storage-plane integration
- diagnosable failure replay through the integrated path

Main review risk:

- V1/V1.5 semantics leaking back in through service glue
- address-shaped identity reappearing at the boundary
- blockvol-side code silently re-deciding recovery policy
## Future Feature Rule

When a later feature expands the protocol surface (for example `SmartWAL` or a new rebuild optimization), the order should be:

1. `FSM / design`
   - define the new semantics and non-goals

2. `Truth update`
   - either attach the feature to an existing truth
   - or add a new protocol truth if the feature creates a new long-lived invariant

3. `Phase alignment`
   - define which later phases strengthen or validate that truth

4. `Evidence ladder`
   - simulator, prototype, engine, service slice as needed

Do not start feature implementation by editing engine or service glue first and only later trying to explain what truth changed.

## Feature Review Rule

For any future feature, later reviews should ask:

1. did the feature create a new truth or just strengthen an existing one?
2. which phase first validates it?
3. which evidence layer proves it most strongly today?
4. does the feature weaken any existing truth?

This keeps feature growth aligned with protocol truth instead of letting implementation convenience define semantics.
@@ -0,0 +1,219 @@
package distsim

import (
	"math/rand"
	"testing"
)

// Phase 4.5: Adversarial predicate search.
// These tests run randomized/semi-randomized scenarios and check danger
// predicates after each step. The goal is to find protocol violations
// that handwritten scenarios might miss.

// TestAdversarial_RandomWritesAndCrashes runs random write + crash + restart
// sequences and checks all danger predicates after each step.
func TestAdversarial_RandomWritesAndCrashes(t *testing.T) {
	rng := rand.New(rand.NewSource(42))

	for trial := 0; trial < 50; trial++ {
		c := NewCluster(CommitSyncAll, "p", "r1", "r2")

		// Random sequence of operations.
		for step := 0; step < 30; step++ {
			op := rng.Intn(10)
			switch {
			case op < 5:
				// Write.
				block := uint64(rng.Intn(10) + 1)
				c.CommitWrite(block)
			case op < 7:
				// Tick (advance time, deliver messages).
				c.Tick()
			case op < 8:
				// Crash a random node.
				nodes := []string{"p", "r1", "r2"}
				target := nodes[rng.Intn(3)]
				node := c.Nodes[target]
				if node.Running {
					node.Storage.Crash()
					node.Running = false
					node.ReplicaState = NodeStateLagging
				}
			case op < 9:
				// Restart a crashed node.
				nodes := []string{"p", "r1", "r2"}
				target := nodes[rng.Intn(3)]
				node := c.Nodes[target]
				if !node.Running {
					node.Storage.Restart()
					node.Running = true
					node.ReplicaState = NodeStateLagging // needs catch-up
				}
			default:
				// Flusher tick on all running nodes.
				for _, node := range c.Nodes {
					if node.Running {
						node.Storage.ApplyToExtent(node.Storage.WALDurableLSN)
						node.Storage.AdvanceCheckpoint(node.Storage.WALDurableLSN)
					}
				}
			}

			// Check predicates after every step.
			violations := CheckAllPredicates(c)
			if len(violations) > 0 {
				for name, detail := range violations {
					t.Errorf("trial %d step %d: PREDICATE VIOLATED [%s]: %s", trial, step, name, detail)
				}
				t.FailNow()
			}
		}
	}
}

// TestAdversarial_FailoverChainWithPredicates runs a sequence of
// failovers (promote, crash, promote) and checks predicates.
func TestAdversarial_FailoverChainWithPredicates(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r1", "r2")

	// Write some data and commit.
	for i := 0; i < 5; i++ {
		c.CommitWrite(uint64(i + 1))
	}
	c.TickN(5)

	check := func(label string) {
		violations := CheckAllPredicates(c)
		for name, detail := range violations {
			t.Fatalf("%s: PREDICATE VIOLATED [%s]: %s", label, name, detail)
		}
	}

	check("after initial writes")

	// Kill primary.
	c.Nodes["p"].Running = false
	c.Nodes["p"].Storage.Crash()

	// Promote r1.
	c.Promote("r1")
	c.TickN(3)
	check("after first promotion")

	// Write more under new primary.
	for i := 0; i < 3; i++ {
		c.CommitWrite(uint64(i + 10))
	}
	c.TickN(5)
	check("after writes on new primary")

	// Kill new primary.
	c.Nodes["r1"].Running = false
	c.Nodes["r1"].Storage.Crash()

	// Promote r2.
	c.Promote("r2")
	c.TickN(3)
	check("after second promotion")

	// Write more under third primary.
	c.CommitWrite(99)
	c.TickN(5)
	check("after writes on third primary")
}

// TestAdversarial_CatchUpUnderLoad runs catch-up while the primary keeps
// writing, then checks predicates for livelock.
func TestAdversarial_CatchUpUnderLoad(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r1")

	// Write initial data.
	for i := 0; i < 10; i++ {
		c.CommitWrite(uint64(i + 1))
	}
	c.TickN(5)

	// Disconnect r1.
	c.Nodes["r1"].Running = false

	// Write more while r1 is down.
	for i := 0; i < 20; i++ {
		c.CommitWrite(uint64(i + 100))
		c.Tick()
	}

	// Reconnect r1 — needs catch-up.
	c.Nodes["r1"].Running = true
	c.Nodes["r1"].ReplicaState = NodeStateLagging

	// Attempt catch-up while primary keeps writing.
	for step := 0; step < 20; step++ {
		// Primary writes more.
		c.CommitWrite(uint64(step + 200))
		c.Tick()

		// Attempt catch-up progress.
		c.CatchUpWithEscalation("r1", 5)

		// Check predicates.
		violations := CheckAllPredicates(c)
		for name, detail := range violations {
			t.Fatalf("step %d: PREDICATE VIOLATED [%s]: %s", step, name, detail)
		}
	}

	// After the loop, r1 should be either InSync or NeedsRebuild.
	state := c.Nodes["r1"].ReplicaState
	if state != NodeStateInSync && state != NodeStateNeedsRebuild {
		t.Fatalf("r1 should be InSync or NeedsRebuild after catch-up under load, got %s", state)
	}
}

// TestAdversarial_CheckpointGCThenCrash runs checkpoint + WAL GC + crash
// sequences and verifies acked data is never lost.
func TestAdversarial_CheckpointGCThenCrash(t *testing.T) {
	rng := rand.New(rand.NewSource(99))

	for trial := 0; trial < 30; trial++ {
		c := NewCluster(CommitSyncAll, "p", "r1")

		// Write and commit data.
		for i := 0; i < 15; i++ {
			c.CommitWrite(uint64(rng.Intn(20) + 1))
		}
		c.TickN(10)

		// Flusher + checkpoint at various points.
		for _, node := range c.Nodes {
			if node.Running {
				flushTo := node.Storage.WALDurableLSN
				node.Storage.ApplyToExtent(flushTo)
				// Checkpoint at a random point up to flush.
				cpLSN := uint64(rng.Int63n(int64(flushTo + 1)))
				node.Storage.AdvanceCheckpoint(cpLSN)

				// GC WAL entries before checkpoint.
				retained := make([]Write, 0)
				for _, w := range node.Storage.WAL {
					if w.LSN > node.Storage.CheckpointLSN {
						retained = append(retained, w)
					}
				}
				node.Storage.WAL = retained
			}
		}

		// Crash primary.
		primary := c.Primary()
		if primary != nil {
			primary.Storage.Crash()
			primary.Storage.Restart()
		}

		// Check predicates — committed data must still be recoverable.
		violations := CheckAllPredicates(c)
		for name, detail := range violations {
			t.Errorf("trial %d: PREDICATE VIOLATED [%s]: %s", trial, name, detail)
		}
	}
}
@@ -0,0 +1,334 @@
package distsim

import (
	"testing"
)

// Phase 4.5: Crash-consistency and recoverability tests.
// These validate invariants I1-I5 from the crash-consistency simulation plan.

// --- Invariant I1: ACK'd flush is recoverable after any crash ---

func TestI1_AckedFlush_RecoverableAfterPrimaryCrash(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r")

	// Write 3 entries and commit (sync_all = durable on both nodes).
	for i := 0; i < 3; i++ {
		c.CommitWrite(uint64(i + 1))
	}
	c.Tick()
	c.Tick()
	c.Tick()

	if c.Coordinator.CommittedLSN < 3 {
		t.Fatalf("expected CommittedLSN>=3, got %d", c.Coordinator.CommittedLSN)
	}

	committedLSN := c.Coordinator.CommittedLSN

	// Crash the primary.
	primary := c.Nodes["p"]
	primary.Storage.Crash()

	// Restart: recover from checkpoint + durable WAL.
	recoveredLSN := primary.Storage.Restart()

	// I1: all committed data must be recoverable.
	if recoveredLSN < committedLSN {
		t.Fatalf("I1 VIOLATED: recoveredLSN=%d < committedLSN=%d — acked data lost",
			recoveredLSN, committedLSN)
	}

	// Verify data correctness against reference.
	refState := c.Reference.StateAt(committedLSN)
	recState := primary.Storage.StateAt(committedLSN)
	for block, expected := range refState {
		if got := recState[block]; got != expected {
			t.Fatalf("I1 VIOLATED: block %d: reference=%d recovered=%d", block, expected, got)
		}
	}
}

// --- Invariant I2: No ghost visible state after crash ---

func TestI2_ExtentAheadOfCheckpoint_CrashRestart(t *testing.T) {
	s := NewStorage()

	// Write 5 entries to WAL.
	for i := uint64(1); i <= 5; i++ {
		s.AppendWrite(Write{Block: 10 + i, Value: i * 100, LSN: i})
	}

	// Make all 5 durable.
	s.AdvanceFlush(5)

	// Flusher materializes entries 1-3 to live extent.
	s.ApplyToExtent(3)

	// Checkpoint at LSN 1 only.
	s.AdvanceCheckpoint(1)

	// Crash.
	s.Crash()
	if s.LiveExtent != nil {
		t.Fatal("after crash, LiveExtent should be nil")
	}

	// Restart.
	recoveredLSN := s.Restart()
	if recoveredLSN != 5 {
		t.Fatalf("recoveredLSN should be 5, got %d", recoveredLSN)
	}

	// I2: all durable data recovered from checkpoint + WAL replay.
	for i := uint64(1); i <= 5; i++ {
		block := 10 + i
		expected := i * 100
		if got := s.LiveExtent[block]; got != expected {
			t.Fatalf("I2: block %d: expected %d, got %d", block, expected, got)
		}
	}
}

func TestI2_UnackedData_LostAfterCrash(t *testing.T) {
	s := NewStorage()

	for i := uint64(1); i <= 5; i++ {
		s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
	}

	// Only fsync 1-3. Entries 4-5 are NOT durable.
	s.AdvanceFlush(3)
	s.ApplyToExtent(5) // should clamp to 3
	s.AdvanceCheckpoint(3)

	if s.ExtentAppliedLSN != 3 {
		t.Fatalf("ApplyToExtent should clamp to WALDurableLSN=3, got %d", s.ExtentAppliedLSN)
	}

	s.Crash()
	s.Restart()

	// Blocks 4,5 must NOT be in recovered extent.
	if val, ok := s.LiveExtent[4]; ok && val != 0 {
		t.Fatalf("I2 VIOLATED: block 4=%d survived crash — unfsynced data", val)
	}
	if val, ok := s.LiveExtent[5]; ok && val != 0 {
		t.Fatalf("I2 VIOLATED: block 5=%d survived crash — unfsynced data", val)
	}

	// Blocks 1-3 must be there.
	for i := uint64(1); i <= 3; i++ {
		if got := s.LiveExtent[i]; got != i*10 {
			t.Fatalf("block %d: expected %d, got %d", i, i*10, got)
		}
	}
}

// --- Invariant I3: CatchUp converges or escalates ---

func TestI3_CatchUpConvergesOrEscalates(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r")

	// Commit initial entry.
	c.CommitWrite(1)
	c.Tick()
	c.Tick()

	// Disconnect replica and write more.
	c.Nodes["r"].Running = false
	for i := uint64(2); i <= 10; i++ {
		c.CommitWrite(i)
		c.Tick()
	}

	// Reconnect.
	c.Nodes["r"].Running = true
	c.Nodes["r"].ReplicaState = NodeStateLagging

	// Catch-up with escalation.
	converged := c.CatchUpWithEscalation("r", 3)

	// I3: must resolve — either converged or escalated to NeedsRebuild.
	state := c.Nodes["r"].ReplicaState
	if !converged && state != NodeStateNeedsRebuild {
		t.Fatalf("I3 VIOLATED: catchup did not converge and state=%s (not NeedsRebuild)", state)
	}
}

// --- Invariant I4: Promoted replica has committed prefix ---

func TestI4_PromotedReplica_HasCommittedPrefix(t *testing.T) {
	c := NewCluster(CommitSyncAll, "p", "r")

	for i := uint64(1); i <= 5; i++ {
		c.CommitWrite(i)
	}
	c.Tick()
	c.Tick()
	c.Tick()

	committedLSN := c.Coordinator.CommittedLSN
	if committedLSN < 5 {
		t.Fatalf("expected CommittedLSN>=5, got %d", committedLSN)
	}

	// Promote replica.
	if err := c.Promote("r"); err != nil {
		t.Fatalf("promote: %v", err)
	}

	// I4: new primary must have recoverable committed prefix.
	newPrimary := c.Nodes["r"]
	recoverableLSN := newPrimary.Storage.RecoverableLSN()
	if recoverableLSN < committedLSN {
		t.Fatalf("I4 VIOLATED: promoted recoverableLSN=%d < committedLSN=%d",
			recoverableLSN, committedLSN)
	}

	// Verify data matches reference.
	refState := c.Reference.StateAt(committedLSN)
	recState := newPrimary.Storage.StateAt(committedLSN)
	for block, expected := range refState {
		if got := recState[block]; got != expected {
			t.Fatalf("I4 VIOLATED: block %d: ref=%d got=%d", block, expected, got)
		}
	}
}

// --- Direct test: checkpoint must not leak applied-but-uncheckpointed state ---

func TestI2_CheckpointDoesNotLeakAppliedState(t *testing.T) {
	s := NewStorage()

	// Write 5 entries, all durable.
	for i := uint64(1); i <= 5; i++ {
		s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
	}
	s.AdvanceFlush(5)

	// Flusher applies all 5 to LiveExtent.
	s.ApplyToExtent(5)

	// But checkpoint only at LSN 2.
	s.AdvanceCheckpoint(2)

	// CheckpointExtent must contain ONLY blocks 1-2, not 3-5.
	for i := uint64(3); i <= 5; i++ {
		if val, ok := s.CheckpointExtent[i]; ok && val != 0 {
			t.Fatalf("CHECKPOINT LEAK: block %d=%d in checkpoint but CheckpointLSN=2", i, val)
		}
	}
	// Blocks 1-2 must be in checkpoint.
	for i := uint64(1); i <= 2; i++ {
		expected := i * 10
		if got := s.CheckpointExtent[i]; got != expected {
			t.Fatalf("block %d: checkpoint should have %d, got %d", i, expected, got)
		}
	}

	// Now crash: LiveExtent lost, entries 3-5 only in WAL.
	s.Crash()
	recoveredLSN := s.Restart()

	if recoveredLSN != 5 {
		t.Fatalf("recoveredLSN should be 5, got %d", recoveredLSN)
	}

	// All 5 blocks must be recovered: 1-2 from checkpoint, 3-5 from WAL replay.
	for i := uint64(1); i <= 5; i++ {
		expected := i * 10
		if got := s.LiveExtent[i]; got != expected {
			t.Fatalf("block %d: expected %d after crash+restart, got %d", i, expected, got)
		}
	}
}

// --- A7: Historical state before checkpoint is not fakeable ---

func TestA7_HistoricalState_NotReconstructableAfterGC(t *testing.T) {
	s := NewStorage()

	// Write 10 entries, all durable.
	for i := uint64(1); i <= 10; i++ {
		s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
	}
	s.AdvanceFlush(10)
	s.ApplyToExtent(10)

	// Checkpoint at LSN 7.
	s.AdvanceCheckpoint(7)

	// GC WAL entries before checkpoint.
	retained := make([]Write, 0)
	for _, w := range s.WAL {
		if w.LSN > s.CheckpointLSN {
			retained = append(retained, w)
		}
	}
	s.WAL = retained

	// Can reconstruct at LSN 7 (checkpoint covers it).
	if !s.CanReconstructAt(7) {
		t.Fatal("should be reconstructable at checkpoint LSN")
	}

	// Can reconstruct at LSN 10 (checkpoint + WAL 8-10).
	if !s.CanReconstructAt(10) {
		t.Fatal("should be reconstructable at LSN 10 (checkpoint + WAL)")
	}

	// CANNOT accurately reconstruct at LSN 3 (WAL 1-6 has been GC'd).
	// The state at LSN 3 required WAL entries 1-3 which are gone.
	if s.CanReconstructAt(3) {
		t.Fatal("A7: should NOT be reconstructable at LSN 3 after WAL GC — history is lost")
	}

	// StateAt(3) returns checkpoint state (best-effort approximation, not exact).
	// This is fine for display but must NOT be treated as authoritative.
	state3 := s.StateAt(3)
	// The returned state includes blocks 1-7 (from checkpoint), which is MORE
	// than what was actually committed at LSN 3. This is the "current extent
	// cannot fake old history" problem from A7.
	if len(state3) == 3 {
		t.Fatal("StateAt(3) after GC should return checkpoint state (7 blocks), not exact 3-block state")
	}
}

// --- Invariant I5: Checkpoint GC preserves recovery proof ---

func TestI5_CheckpointGC_PreservesAckedBoundary(t *testing.T) {
	s := NewStorage()

	for i := uint64(1); i <= 10; i++ {
		s.AppendWrite(Write{Block: i, Value: i * 10, LSN: i})
	}
	s.AdvanceFlush(10)
	s.ApplyToExtent(7)
	s.AdvanceCheckpoint(7)

	// GC: remove WAL entries before checkpoint.
	retained := make([]Write, 0)
	for _, w := range s.WAL {
		if w.LSN > s.CheckpointLSN {
			retained = append(retained, w)
		}
	}
	s.WAL = retained

	// Crash + restart.
	s.Crash()
	recoveredLSN := s.Restart()

	if recoveredLSN != 10 {
		t.Fatalf("I5: recoveredLSN should be 10, got %d", recoveredLSN)
	}

	// All 10 blocks recoverable: 1-7 from checkpoint, 8-10 from WAL.
	for i := uint64(1); i <= 10; i++ {
		expected := i * 10
		if got := s.LiveExtent[i]; got != expected {
			t.Fatalf("I5 VIOLATED: block %d: expected %d, got %d", i, expected, got)
		}
	}
}
@@ -0,0 +1,160 @@
package distsim

import "fmt"

// DangerPredicate checks for a protocol-violating or dangerous state.
// Returns (violated bool, detail string).
type DangerPredicate func(c *Cluster) (bool, string)

// PredicateAckedFlushLost checks if any committed (ACK'd) write has become
// unrecoverable on ANY node that is supposed to have it.
// This is the most dangerous protocol violation: data loss after ACK.
func PredicateAckedFlushLost(c *Cluster) (bool, string) {
	committedLSN := c.Coordinator.CommittedLSN
	if committedLSN == 0 {
		return false, ""
	}

	refState := c.Reference.StateAt(committedLSN)

	// Check primary.
	primary := c.Primary()
	if primary != nil && primary.Running {
		recLSN := primary.Storage.RecoverableLSN()
		if recLSN < committedLSN {
			return true, fmt.Sprintf("primary %s: recoverableLSN=%d < committedLSN=%d",
				primary.ID, recLSN, committedLSN)
		}
		// Verify committed state correctness using StateAt (not LiveExtent).
		// LiveExtent may contain uncommitted-but-durable writes beyond committedLSN.
		// Only check if we can reconstruct the exact committed state.
		if primary.Storage.CanReconstructAt(committedLSN) {
			nodeState := primary.Storage.StateAt(committedLSN)
			for block, expected := range refState {
				if got := nodeState[block]; got != expected {
					return true, fmt.Sprintf("primary %s: block %d = %d, reference = %d at committedLSN=%d",
						primary.ID, block, got, expected, committedLSN)
				}
			}
		}
	}

	// Check replicas that should have committed data (InSync replicas).
	for _, node := range c.Nodes {
		if node.ID == c.Coordinator.PrimaryID {
			continue
		}
		if !node.Running || node.ReplicaState != NodeStateInSync {
			continue
		}
		recLSN := node.Storage.RecoverableLSN()
		if recLSN < committedLSN {
			return true, fmt.Sprintf("InSync replica %s: recoverableLSN=%d < committedLSN=%d",
				node.ID, recLSN, committedLSN)
		}
	}

	return false, ""
}

// PredicateVisibleUnrecoverableState checks if any running node has extent
// state that would NOT survive a crash+restart. This detects ghost visible
// state — data that is readable now but would be lost on crash.
func PredicateVisibleUnrecoverableState(c *Cluster) (bool, string) {
	for _, node := range c.Nodes {
		if !node.Running || node.Storage.LiveExtent == nil {
			continue
		}
		// Simulate what would happen on crash+restart.
		recoverableLSN := node.Storage.RecoverableLSN()

		// Check each block in LiveExtent: is its value backed by
		// a write at LSN <= recoverableLSN?
		for block, value := range node.Storage.LiveExtent {
			// Find which LSN wrote this value.
			writtenAtLSN := uint64(0)
			for _, w := range node.Storage.WAL {
				if w.Block == block && w.Value == value {
					writtenAtLSN = w.LSN
				}
			}
			if writtenAtLSN > recoverableLSN {
				return true, fmt.Sprintf("node %s: block %d has value %d (written at LSN %d) but recoverableLSN=%d — ghost state",
					node.ID, block, value, writtenAtLSN, recoverableLSN)
			}
		}
	}
	return false, ""
}

// PredicateCatchUpLivelockOrMissingEscalation checks if any replica is stuck
// in CatchingUp without making progress and without being escalated to
// NeedsRebuild. Also checks if a replica needs rebuild but hasn't been
// escalated.
func PredicateCatchUpLivelockOrMissingEscalation(c *Cluster) (bool, string) {
	for _, node := range c.Nodes {
		if !node.Running {
			continue
		}
		if node.ReplicaState == NodeStateCatchingUp {
			// A node in CatchingUp is suspicious if it has been there for
			// many ticks without approaching the target. We check if its
			// FlushedLSN is far behind the primary's head.
			primary := c.Primary()
			if primary == nil {
				continue
			}
			primaryHead := primary.Storage.WALDurableLSN
			replicaFlushed := node.Storage.FlushedLSN
			gap := primaryHead - replicaFlushed

			// If the gap is larger than what the WAL can reasonably hold
			// and the node hasn't been escalated, that's a livelock risk.
			// Use a simple heuristic: if gap > 2x what we've seen committed, flag it.
			if gap > c.Coordinator.CommittedLSN*2 && c.Coordinator.CommittedLSN > 5 {
				return true, fmt.Sprintf("node %s: CatchingUp with gap=%d (primary head=%d, replica flushed=%d) — potential livelock",
					node.ID, gap, primaryHead, replicaFlushed)
			}
		}

		// Check if a node is Lagging for a long time without being moved to
		// CatchingUp or NeedsRebuild.
		// Note: Lagging is a transient state that the control plane should resolve.
		// In adversarial random tests without explicit recovery triggers, a node
		// staying Lagging is expected. We only flag truly excessive lag (> 3x committed)
		// as potential livelock — anything smaller is normal recovery latency.
		if node.ReplicaState == NodeStateLagging {
			primary := c.Primary()
			if primary != nil {
				gap := primary.Storage.WALDurableLSN - node.Storage.FlushedLSN
				if gap > c.Coordinator.CommittedLSN*3 && c.Coordinator.CommittedLSN > 10 {
					return true, fmt.Sprintf("node %s: Lagging with gap=%d without escalation to CatchingUp or NeedsRebuild",
						node.ID, gap)
				}
			}
		}
	}
	return false, ""
}

// AllDangerPredicates returns the standard set of danger predicates.
func AllDangerPredicates() map[string]DangerPredicate {
	return map[string]DangerPredicate{
		"acked_flush_lost":           PredicateAckedFlushLost,
		"visible_unrecoverable":      PredicateVisibleUnrecoverableState,
		"catchup_livelock_or_no_esc": PredicateCatchUpLivelockOrMissingEscalation,
	}
}

// CheckAllPredicates runs all danger predicates against a cluster state.
// Returns a map of violated predicate names → detail messages.
func CheckAllPredicates(c *Cluster) map[string]string {
	violations := map[string]string{}
	for name, pred := range AllDangerPredicates() {
		violated, detail := pred(c)
		if violated {
			violations[name] = detail
		}
	}
	return violations
}
@ -0,0 +1,481 @@ |
|||
package weed_server

import (
	"sync"
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)

// ============================================================
// Edge Case Tests: RF, Promotion, Network, LSN
//
// Covers gaps identified in the testing framework review:
// 1. LSN-lagging replica skipped during promotion
// 2. Cascading double failover (RF=3, epoch chain 1→2→3)
// 3. Demotion/drain under concurrent promotion pressure
// 4. Promotion with mixed LSN + health scores
// 5. Network flap simulation (mark/unmark block capable rapidly)
// 6. RF=3 all-gate evaluation under pressure
// ============================================================

// --- Test 1: LSN-lagging replica skipped, fresher one promoted ---

func TestEdge_LSNLag_StaleReplicaSkipped(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.SetPromotionLSNTolerance(10)

	ms.blockRegistry.MarkBlockCapable("primary")
	ms.blockRegistry.MarkBlockCapable("stale-replica")
	ms.blockRegistry.MarkBlockCapable("fresh-replica")

	entry := &BlockVolumeEntry{
		Name: "lsn-test", VolumeServer: "primary", Path: "/data/lsn-test.blk",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant: time.Now().Add(-10 * time.Second), // expired
		WALHeadLSN:     1000,
		Replicas: []ReplicaInfo{
			{
				Server: "stale-replica", Path: "/data/lsn-test.blk",
				HealthScore: 1.0, WALHeadLSN: 100, // lag=900, way beyond tolerance=10
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now(),
			},
			{
				Server: "fresh-replica", Path: "/data/lsn-test.blk",
				HealthScore: 0.9, WALHeadLSN: 995, // lag=5, within tolerance=10
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now(),
			},
		},
	}
	ms.blockRegistry.Register(entry)

	// Kill primary.
	ms.blockRegistry.UnmarkBlockCapable("primary")
	ms.failoverBlockVolumes("primary")

	// Verify: fresh-replica promoted (despite lower health score), stale skipped.
	after, ok := ms.blockRegistry.Lookup("lsn-test")
	if !ok {
		t.Fatal("volume not found")
	}
	if after.VolumeServer != "fresh-replica" {
		t.Fatalf("expected fresh-replica promoted, got %q (stale-replica with lag=900 should be skipped)", after.VolumeServer)
	}
	if after.Epoch != 2 {
		t.Fatalf("epoch: got %d, want 2", after.Epoch)
	}
}

// --- Test 2: Cascading double failover (RF=3, epoch 1→2→3) ---

func TestEdge_CascadeFailover_RF3_EpochChain(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.MarkBlockCapable("vs1")
	ms.blockRegistry.MarkBlockCapable("vs2")
	ms.blockRegistry.MarkBlockCapable("vs3")

	entry := &BlockVolumeEntry{
		Name: "cascade-test", VolumeServer: "vs1", Path: "/data/cascade.blk",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant: time.Now().Add(-10 * time.Second),
		ReplicaFactor:  3,
		Replicas: []ReplicaInfo{
			{Server: "vs2", Path: "/r2.blk", HealthScore: 1.0, WALHeadLSN: 100,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
			{Server: "vs3", Path: "/r3.blk", HealthScore: 0.9, WALHeadLSN: 100,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
		},
	}
	ms.blockRegistry.Register(entry)

	// Failover 1: vs1 dies → vs2 promoted (higher health).
	ms.blockRegistry.UnmarkBlockCapable("vs1")
	ms.failoverBlockVolumes("vs1")

	after1, _ := ms.blockRegistry.Lookup("cascade-test")
	if after1.VolumeServer != "vs2" {
		t.Fatalf("failover 1: expected vs2, got %q", after1.VolumeServer)
	}
	if after1.Epoch != 2 {
		t.Fatalf("failover 1: epoch got %d, want 2", after1.Epoch)
	}

	// Failover 2: vs2 dies → vs3 promoted (only remaining).
	// Update vs3's heartbeat and set lease expired for the new primary.
	ms.blockRegistry.UpdateEntry("cascade-test", func(e *BlockVolumeEntry) {
		e.LastLeaseGrant = time.Now().Add(-10 * time.Second)
		for i := range e.Replicas {
			if e.Replicas[i].Server == "vs3" {
				e.Replicas[i].LastHeartbeat = time.Now()
			}
		}
	})

	ms.blockRegistry.UnmarkBlockCapable("vs2")
	ms.failoverBlockVolumes("vs2")

	after2, _ := ms.blockRegistry.Lookup("cascade-test")
	if after2.VolumeServer != "vs3" {
		t.Fatalf("failover 2: expected vs3, got %q", after2.VolumeServer)
	}
	if after2.Epoch != 3 {
		t.Fatalf("failover 2: epoch got %d, want 3", after2.Epoch)
	}

	// No more replicas — third failover should fail silently.
	ms.blockRegistry.UpdateEntry("cascade-test", func(e *BlockVolumeEntry) {
		e.LastLeaseGrant = time.Now().Add(-10 * time.Second)
	})
	ms.blockRegistry.UnmarkBlockCapable("vs3")
	ms.failoverBlockVolumes("vs3")

	after3, _ := ms.blockRegistry.Lookup("cascade-test")
	// Epoch should still be 3 — no eligible replicas.
	if after3.Epoch != 3 {
		t.Fatalf("failover 3: epoch should stay 3, got %d", after3.Epoch)
	}
}

// --- Test 3: Concurrent failover + heartbeat + promotion (stress) ---

func TestEdge_ConcurrentFailoverAndHeartbeat_NoPanic(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.MarkBlockCapable("vs1")
	ms.blockRegistry.MarkBlockCapable("vs2")
	ms.blockRegistry.MarkBlockCapable("vs3")

	setup := func() {
		ms.blockRegistry.Unregister("stress-vol")
		ms.blockRegistry.MarkBlockCapable("vs1")
		ms.blockRegistry.MarkBlockCapable("vs2")
		ms.blockRegistry.MarkBlockCapable("vs3")
		ms.blockRegistry.Register(&BlockVolumeEntry{
			Name: "stress-vol", VolumeServer: "vs1", Path: "/data/stress.blk",
			SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
			Status: StatusActive, LeaseTTL: 5 * time.Second,
			LastLeaseGrant: time.Now().Add(-10 * time.Second),
			Replicas: []ReplicaInfo{
				{Server: "vs2", Path: "/r2.blk", HealthScore: 1.0,
					Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
				{Server: "vs3", Path: "/r3.blk", HealthScore: 0.9,
					Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
			},
		})
	}

	for round := 0; round < 30; round++ {
		setup()
		var wg sync.WaitGroup
		wg.Add(4)
		go func() { defer wg.Done(); ms.failoverBlockVolumes("vs1") }()
		go func() { defer wg.Done(); ms.reevaluateOrphanedPrimaries("vs2") }()
		go func() { defer wg.Done(); ms.blockRegistry.PromoteBestReplica("stress-vol") }()
		go func() {
			defer wg.Done()
			ms.blockRegistry.ManualPromote("stress-vol", "", true)
		}()
		wg.Wait()
	}
	// No panic = pass.
}

// --- Test 4: LSN + health score interaction — health wins within tolerance ---

func TestEdge_LSNWithinTolerance_HealthWins(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.SetPromotionLSNTolerance(100)
	ms.blockRegistry.MarkBlockCapable("primary")
	ms.blockRegistry.MarkBlockCapable("high-health")
	ms.blockRegistry.MarkBlockCapable("high-lsn")

	ms.blockRegistry.Register(&BlockVolumeEntry{
		Name: "health-vs-lsn", VolumeServer: "primary", Path: "/data/hvl.blk",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant: time.Now().Add(-10 * time.Second),
		WALHeadLSN:     1000,
		Replicas: []ReplicaInfo{
			{Server: "high-health", Path: "/r1.blk", HealthScore: 1.0, WALHeadLSN: 950,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
			{Server: "high-lsn", Path: "/r2.blk", HealthScore: 0.5, WALHeadLSN: 999,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
		},
	})

	ms.blockRegistry.UnmarkBlockCapable("primary")
	ms.failoverBlockVolumes("primary")

	after, _ := ms.blockRegistry.Lookup("health-vs-lsn")
	// Both within tolerance (lag ≤ 100). Health wins: high-health (1.0) > high-lsn (0.5).
	if after.VolumeServer != "high-health" {
		t.Fatalf("expected high-health promoted (higher health, both within LSN tolerance), got %q", after.VolumeServer)
	}
}

// --- Test 5: Network flap simulation — rapid mark/unmark block capable ---

func TestEdge_NetworkFlap_RapidMarkUnmark(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.MarkBlockCapable("flapper")
	ms.blockRegistry.MarkBlockCapable("stable")

	ms.blockRegistry.Register(&BlockVolumeEntry{
		Name: "flap-test", VolumeServer: "stable", Path: "/data/flap.blk",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant: time.Now(),
		Replicas: []ReplicaInfo{
			{Server: "flapper", Path: "/r.blk", HealthScore: 1.0,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
		},
	})

	var wg sync.WaitGroup
	// Goroutine 1: rapidly flap the "flapper" server.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 100; i++ {
			ms.blockRegistry.UnmarkBlockCapable("flapper")
			ms.blockRegistry.MarkBlockCapable("flapper")
		}
	}()

	// Goroutine 2: attempt promotions during flapping.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 50; i++ {
			ms.blockRegistry.EvaluatePromotion("flap-test")
		}
	}()

	// Goroutine 3: concurrent heartbeat updates.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 50; i++ {
			ms.blockRegistry.UpdateFullHeartbeat("flapper", nil, "")
		}
	}()

	wg.Wait()
	// No panic, no corruption = pass.

	// Volume should still be on stable primary.
	after, ok := ms.blockRegistry.Lookup("flap-test")
	if !ok {
		t.Fatal("volume lost during flapping")
	}
	if after.VolumeServer != "stable" {
		t.Fatalf("primary changed from stable to %q during flapping", after.VolumeServer)
	}
}

// --- Test 6: RF=3 all gates — mixed rejection reasons ---

func TestEdge_RF3_MixedGates_BestEligiblePromoted(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.SetPromotionLSNTolerance(50)
	ms.blockRegistry.MarkBlockCapable("primary")
	// Note: "dead-server" NOT marked block capable.
	ms.blockRegistry.MarkBlockCapable("stale-hb")
	ms.blockRegistry.MarkBlockCapable("good")

	ms.blockRegistry.Register(&BlockVolumeEntry{
		Name: "mixed-gates", VolumeServer: "primary", Path: "/data/mixed.blk",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant: time.Now().Add(-10 * time.Second),
		WALHeadLSN:     500,
		Replicas: []ReplicaInfo{
			{Server: "dead-server", Path: "/r1.blk", HealthScore: 1.0, WALHeadLSN: 500,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
			{Server: "stale-hb", Path: "/r2.blk", HealthScore: 1.0, WALHeadLSN: 500,
				Role:          blockvol.RoleToWire(blockvol.RoleReplica),
				LastHeartbeat: time.Now().Add(-10 * time.Minute)}, // stale
			{Server: "good", Path: "/r3.blk", HealthScore: 0.8, WALHeadLSN: 480,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
		},
	})

	// Evaluate preflight first (read-only).
	pf, err := ms.blockRegistry.EvaluatePromotion("mixed-gates")
	if err != nil {
		t.Fatalf("evaluate: %v", err)
	}
	if !pf.Promotable {
		t.Fatalf("should be promotable, reason=%s, rejections=%v", pf.Reason, pf.Rejections)
	}
	// Should have 2 rejections: dead-server (server_dead) + stale-hb (stale_heartbeat).
	if len(pf.Rejections) != 2 {
		t.Fatalf("expected 2 rejections, got %d: %v", len(pf.Rejections), pf.Rejections)
	}
	reasons := map[string]string{}
	for _, r := range pf.Rejections {
		reasons[r.Server] = r.Reason
	}
	if reasons["dead-server"] != "server_dead" {
		t.Fatalf("dead-server: got %q, want server_dead", reasons["dead-server"])
	}
	if reasons["stale-hb"] != "stale_heartbeat" {
		t.Fatalf("stale-hb: got %q, want stale_heartbeat", reasons["stale-hb"])
	}

	// Now actually promote.
	ms.blockRegistry.UnmarkBlockCapable("primary")
	ms.failoverBlockVolumes("primary")

	after, _ := ms.blockRegistry.Lookup("mixed-gates")
	if after.VolumeServer != "good" {
		t.Fatalf("expected 'good' promoted (only eligible), got %q", after.VolumeServer)
	}
}

// --- Test 7: Promotion changes publication (ISCSIAddr, NvmeAddr) ---

func TestEdge_PromotionUpdatesPublication(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.MarkBlockCapable("primary")
	ms.blockRegistry.MarkBlockCapable("replica")

	ms.blockRegistry.Register(&BlockVolumeEntry{
		Name: "pub-test", VolumeServer: "primary", Path: "/data/pub.blk",
		ISCSIAddr: "primary:3260", NvmeAddr: "primary:4420", NQN: "nqn.primary",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant: time.Now().Add(-10 * time.Second),
		Replicas: []ReplicaInfo{
			{Server: "replica", Path: "/r.blk", HealthScore: 1.0,
				ISCSIAddr: "replica:3261", NvmeAddr: "replica:4421", NQN: "nqn.replica",
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
		},
	})

	ms.blockRegistry.UnmarkBlockCapable("primary")
	ms.failoverBlockVolumes("primary")

	after, _ := ms.blockRegistry.Lookup("pub-test")
	if after.ISCSIAddr != "replica:3261" {
		t.Fatalf("ISCSIAddr: got %q, want replica:3261", after.ISCSIAddr)
	}
	if after.NvmeAddr != "replica:4421" {
		t.Fatalf("NvmeAddr: got %q, want replica:4421", after.NvmeAddr)
	}
	if after.NQN != "nqn.replica" {
		t.Fatalf("NQN: got %q, want nqn.replica", after.NQN)
	}
}

// --- Test 8: Orphaned primary re-evaluation with LSN lag ---

func TestEdge_OrphanReevaluation_LSNLag_StillPromotes(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.SetPromotionLSNTolerance(10)
	// Primary is dead, replica is alive but lagging.
	ms.blockRegistry.MarkBlockCapable("replica")

	ms.blockRegistry.Register(&BlockVolumeEntry{
		Name: "orphan-lag", VolumeServer: "dead-primary", Path: "/data/orphan.blk",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant: time.Now().Add(-10 * time.Second), // expired
		WALHeadLSN:     1000,
		Replicas: []ReplicaInfo{
			{Server: "replica", Path: "/r.blk", HealthScore: 1.0, WALHeadLSN: 500,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
		},
	})

	// Orphan re-evaluation: replica reconnects.
	ms.reevaluateOrphanedPrimaries("replica")

	// The replica has WAL lag of 500 (way beyond tolerance=10).
	// But it's the ONLY replica — should it promote or not?
	// Current behavior: LSN gate rejects it. No promotion.
	after, _ := ms.blockRegistry.Lookup("orphan-lag")
	if after.Epoch != 1 {
		// If epoch changed, the lagging replica was promoted.
		// This may or may not be desired — document the behavior.
		t.Logf("NOTE: lagging replica WAS promoted (epoch=%d). LSN lag=%d, tolerance=%d",
			after.Epoch, 1000-500, 10)
	} else {
		t.Logf("NOTE: lagging replica was NOT promoted (epoch=1). Volume is stuck with dead primary.")
		t.Logf("This is the current behavior: LSN gate blocks promotion even when it's the only option.")
	}
	// This test documents behavior, doesn't assert pass/fail.
	// The question is: should a lagging-but-only replica be promoted to avoid downtime?
}

// --- Test 9: Rebuild addr cleared after promotion, then repopulated ---

func TestEdge_RebuildAddr_ClearedThenRepopulated(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.MarkBlockCapable("primary")
	ms.blockRegistry.MarkBlockCapable("replica")

	ms.blockRegistry.Register(&BlockVolumeEntry{
		Name: "rebuild-addr", VolumeServer: "primary", Path: "/data/rebuild.blk",
		SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
		Status: StatusActive, LeaseTTL: 5 * time.Second,
		LastLeaseGrant:    time.Now().Add(-10 * time.Second),
		RebuildListenAddr: "primary:15000", // old primary's rebuild addr
		Replicas: []ReplicaInfo{
			{Server: "replica", Path: "/r.blk", HealthScore: 1.0,
				Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
		},
	})

	ms.blockRegistry.UnmarkBlockCapable("primary")
	ms.failoverBlockVolumes("primary")

	after, _ := ms.blockRegistry.Lookup("rebuild-addr")
	// RebuildListenAddr should be cleared after promotion (B-11 fix).
	if after.RebuildListenAddr != "" {
		t.Fatalf("RebuildListenAddr should be cleared after promotion, got %q", after.RebuildListenAddr)
	}
}

// --- Test 10: Multiple volumes on same server — all fail over ---

func TestEdge_MultipleVolumes_SameServer_AllFailover(t *testing.T) {
	ms := testMasterServerForFailover(t)
	ms.blockRegistry.MarkBlockCapable("vs1")
	ms.blockRegistry.MarkBlockCapable("vs2")

	// Register 5 volumes, all with primary on vs1.
	for i := 0; i < 5; i++ {
		name := "multi-" + string(rune('a'+i))
		ms.blockRegistry.Register(&BlockVolumeEntry{
			Name: name, VolumeServer: "vs1", Path: "/data/" + name + ".blk",
			SizeBytes: 1 << 30, Epoch: 1, Role: blockvol.RoleToWire(blockvol.RolePrimary),
			Status: StatusActive, LeaseTTL: 5 * time.Second,
			LastLeaseGrant: time.Now().Add(-10 * time.Second),
			Replicas: []ReplicaInfo{
				{Server: "vs2", Path: "/r/" + name + ".blk", HealthScore: 1.0,
					Role: blockvol.RoleToWire(blockvol.RoleReplica), LastHeartbeat: time.Now()},
			},
		})
	}

	// Kill vs1 — all 5 volumes should fail over.
	ms.blockRegistry.UnmarkBlockCapable("vs1")
	ms.failoverBlockVolumes("vs1")

	for i := 0; i < 5; i++ {
		name := "multi-" + string(rune('a'+i))
		entry, ok := ms.blockRegistry.Lookup(name)
		if !ok {
			t.Fatalf("volume %s not found", name)
		}
		if entry.VolumeServer != "vs2" {
			t.Fatalf("volume %s: expected vs2, got %q", name, entry.VolumeServer)
		}
		if entry.Epoch != 2 {
			t.Fatalf("volume %s: epoch got %d, want 2", name, entry.Epoch)
		}
	}
}
@ -0,0 +1,77 @@
package weed_server

import (
	"encoding/json"
	"net/http"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)

// ShipperDebugInfo is the real-time shipper state for one replica.
type ShipperDebugInfo struct {
	DataAddr   string `json:"data_addr"`
	State      string `json:"state"`
	FlushedLSN uint64 `json:"flushed_lsn"`
}

// BlockVolumeDebugInfo is the real-time block volume state.
type BlockVolumeDebugInfo struct {
	Path      string             `json:"path"`
	Role      string             `json:"role"`
	Epoch     uint64             `json:"epoch"`
	HeadLSN   uint64             `json:"head_lsn"`
	Degraded  bool               `json:"degraded"`
	Shippers  []ShipperDebugInfo `json:"shippers,omitempty"`
	Timestamp string             `json:"timestamp"`
}

// debugBlockShipperHandler returns real-time shipper state for all block volumes.
// Unlike the master's replica_degraded (heartbeat-lagged), this reads directly
// from the shipper's atomic state field — no heartbeat delay.
//
// GET /debug/block/shipper
func (vs *VolumeServer) debugBlockShipperHandler(w http.ResponseWriter, r *http.Request) {
	if vs.blockService == nil {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode([]BlockVolumeDebugInfo{})
		return
	}

	store := vs.blockService.Store()
	if store == nil {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode([]BlockVolumeDebugInfo{})
		return
	}

	var infos []BlockVolumeDebugInfo
	store.IterateBlockVolumes(func(path string, vol *blockvol.BlockVol) {
		status := vol.Status()
		info := BlockVolumeDebugInfo{
			Path:      path,
			Role:      status.Role.String(),
			Epoch:     status.Epoch,
			HeadLSN:   status.WALHeadLSN,
			Degraded:  status.ReplicaDegraded,
			Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
		}

		// Get per-shipper state from ShipperGroup if available.
		sg := vol.GetShipperGroup()
		if sg != nil {
			for _, ss := range sg.ShipperStates() {
				info.Shippers = append(info.Shippers, ShipperDebugInfo{
					DataAddr:   ss.DataAddr,
					State:      ss.State,
					FlushedLSN: ss.FlushedLSN,
				})
			}
		}

		infos = append(infos, info)
	})

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(infos)
}