chore: archive superseded V2 design docs
Copies of design docs removed in Phase 09, preserved in `sw-block/docs/archive/` for historical reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10 changed files with 2001 additions and 0 deletions
- sw-block/docs/archive/design/README.md (+28)
- sw-block/docs/archive/design/a5-a8-traceability.md (+117)
- sw-block/docs/archive/design/phase-07-service-slice-plan.md (+403)
- sw-block/docs/archive/design/phase-08-engine-skeleton-map.md (+301)
- sw-block/docs/archive/design/v2-engine-readiness-review.md (+170)
- sw-block/docs/archive/design/v2-engine-slicing-plan.md (+191)
- sw-block/docs/archive/design/v2-first-slice-sender-ownership.md (+159)
- sw-block/docs/archive/design/v2-first-slice-session-ownership.md (+194)
- sw-block/docs/archive/design/v2-production-roadmap.md (+199)
- sw-block/docs/archive/design/v2-prototype-roadmap-and-gates.md (+239)
sw-block/docs/archive/design/README.md
@@ -0,0 +1,28 @@
# Design Archive

This directory contains historical `sw-block` design and planning documents that are worth keeping as references but are no longer the main entry points for current work.

Use `sw-block/design/` for active design and process documents.
Use `sw-block/.private/phase/` for current phase contracts, logs, and slice-level execution packages.

## Archived Here

- `v2-production-roadmap.md`
- `v2-engine-readiness-review.md`
- `v2-engine-slicing-plan.md`
- `v2-prototype-roadmap-and-gates.md`
- `phase-07-service-slice-plan.md`
- `phase-08-engine-skeleton-map.md`
- `v2-first-slice-session-ownership.md`
- `v2-first-slice-sender-ownership.md`
- `a5-a8-traceability.md`

## Why Archived

These documents are useful for:

1. historical decision context
2. earlier slice/phase rationale
3. traceability for passed reviews and planning gates

They are not the canonical source for the current phase roadmap.
sw-block/docs/archive/design/a5-a8-traceability.md
@@ -0,0 +1,117 @@
# A5-A8 Acceptance Traceability

Date: 2026-03-29
Status: historical evidence traceability

## Purpose

Map each acceptance criterion to specific executable evidence.
Two evidence layers:

- **Simulator** (distsim): protocol-level proof
- **Prototype** (enginev2): ownership/session-level proof

---

## A5: Non-Convergent Catch-Up Escalates Explicitly

**Must prove**: tail-chasing or failed catch-up does not pretend success.

**Pass condition**: explicit `CatchingUp → NeedsRebuild` transition.

| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Tail-chasing converges or aborts | `TestS6_TailChasing_ConvergesOrAborts` | `cluster_test.go` | distsim | PASS |
| Tail-chasing non-convergent → NeedsRebuild | `TestS6_TailChasing_NonConvergent_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS |
| Catch-up timeout → NeedsRebuild | `TestP03_CatchupTimeout_EscalatesToNeedsRebuild` | `phase03_timeout_test.go` | distsim | PASS |
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS |
| Flapping budget exceeded → NeedsRebuild | `TestP02_S5_FlappingExceedsBudget_EscalatesToNeedsRebuild` | `phase02_advanced_test.go` | distsim | PASS |
| Catch-up converges or escalates (I3) | `TestI3_CatchUpConvergesOrEscalates` | `phase045_crash_test.go` | distsim | PASS |
| Catch-up timeout in enginev2 | `TestE2E_NeedsRebuild_Escalation` | `p2_test.go` | enginev2 | PASS |

**Verdict**: A5 is well covered. Both simulator and prototype prove explicit escalation; no pretend-success path exists.
---

## A6: Recoverability Boundary Is Explicit

**Must prove**: the recoverable-vs-unrecoverable gap is decided explicitly.

**Pass condition**: recovery aborts when reservation or payload availability is lost; rebuild is the explicit fallback.

| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Reservation expiry aborts catch-up | `TestReservationExpiryAbortsCatchup` | `cluster_test.go` | distsim | PASS |
| WAL GC beyond replica → NeedsRebuild | `TestI5_CheckpointGC_PreservesAckedBoundary` | `phase045_crash_test.go` | distsim | PASS |
| Rebuild from snapshot + tail | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Smart WAL: resolvable → unresolvable | `TestP02_SmartWAL_RecoverableThenUnrecoverable` | `phase02_advanced_test.go` | distsim | PASS |
| Time-varying payload availability | `TestP02_SmartWAL_TimeVaryingAvailability` | `phase02_advanced_test.go` | distsim | PASS |
| RecoverableLSN is replayability proof | `RecoverableLSN()` in `storage.go` | `storage.go` | distsim | Implemented |
| Handshake outcome: NeedsRebuild | `TestExec_HandshakeOutcome_NeedsRebuild_InvalidatesSession` | `execution_test.go` | enginev2 | PASS |

**Verdict**: A6 is covered. The recovery boundary is decided by an explicit reservation plus recoverability check, not by optimistic assumption. `RecoverableLSN()` verifies contiguous WAL coverage.
---

## A7: Historical Data Correctness Holds

**Must prove**: recovered data for a target LSN is historically correct; the current extent cannot fake old history.

**Pass condition**: snapshot + tail rebuild matches the reference; current-extent reconstruction of an old LSN fails the correctness check.

| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| Snapshot + tail matches reference | `TestReplicaRebuildFromSnapshotAndTail` | `cluster_test.go` | distsim | PASS |
| Historical state not reconstructable after GC | `TestA7_HistoricalState_NotReconstructableAfterGC` | `phase045_crash_test.go` | distsim | PASS |
| `CanReconstructAt()` rejects faked history | `CanReconstructAt()` in `storage.go` | `storage.go` | distsim | Implemented |
| Checkpoint does not leak applied state | `TestI2_CheckpointDoesNotLeakAppliedState` | `phase045_crash_test.go` | distsim | PASS |
| Extent-referenced resolvable records | `TestExtentReferencedResolvableRecordsAreRecoverable` | `cluster_test.go` | distsim | PASS |
| Extent-referenced unresolvable → rebuild | `TestExtentReferencedUnresolvableForcesRebuild` | `cluster_test.go` | distsim | PASS |
| ACK'd flush recoverable after crash (I1) | `TestI1_AckedFlush_RecoverableAfterPrimaryCrash` | `phase045_crash_test.go` | distsim | PASS |

**Verdict**: A7 is now covered with the Phase 4.5 crash-consistency additions. The critical gap ("current extent cannot fake old history") is proven by `CanReconstructAt()` plus `TestA7_HistoricalState_NotReconstructableAfterGC`.
---

## A8: Durability Mode Semantics Are Correct

**Must prove**: best_effort, sync_all, and sync_quorum behave as intended under mixed replica states.

**Pass condition**: sync_all is strict, sync_quorum commits only with a true durable quorum, and invalid topologies are rejected.

| Evidence | Test | File | Layer | Status |
|----------|------|------|-------|--------|
| sync_quorum continues with one lagging | `TestSyncQuorumContinuesWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_all blocks with one lagging | `TestSyncAllBlocksWithOneLaggingReplica` | `cluster_test.go` | distsim | PASS |
| sync_quorum mixed states | `TestSyncQuorumWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| sync_all mixed states | `TestSyncAllBlocksWithMixedReplicaStates` | `cluster_test.go` | distsim | PASS |
| Barrier timeout: sync_all blocked | `TestP03_BarrierTimeout_SyncAll_Blocked` | `phase03_timeout_test.go` | distsim | PASS |
| Barrier timeout: sync_quorum commits | `TestP03_BarrierTimeout_SyncQuorum_StillCommits` | `phase03_timeout_test.go` | distsim | PASS |
| Promotion uses RecoverableLSN | `EvaluateCandidateEligibility()` | `cluster.go` | distsim | Implemented |
| Promoted replica has committed prefix (I4) | `TestI4_PromotedReplica_HasCommittedPrefix` | `phase045_crash_test.go` | distsim | PASS |

**Verdict**: A8 is well covered. sync_all is strict (blocks on lagging); sync_quorum uses a true durable quorum, not a connection count. Promotion now uses `RecoverableLSN()` for the committed-prefix check.
---

## Summary

| Criterion | Simulator Evidence | Prototype Evidence | Status |
|-----------|-------------------|-------------------|--------|
| A5 (catch-up escalation) | 6 tests | 1 test | **Strong** |
| A6 (recoverability boundary) | 6 tests + RecoverableLSN() | 1 test | **Strong** |
| A7 (historical correctness) | 7 tests + CanReconstructAt() | — | **Strong** (new in Phase 4.5) |
| A8 (durability modes) | 7 tests + RecoverableLSN() | — | **Strong** |

**Total executable evidence**: 26 simulator tests + 2 prototype tests + 2 new storage methods.

All A5-A8 acceptance criteria have direct test evidence. No criterion depends solely on design-doc claims.

---

## Still Open (Not Blocking)

| Item | Priority | Why not blocking |
|------|----------|-----------------|
| Predicate exploration / adversarial search | P2 | Manual scenarios already cover known failure classes |
| Catch-up convergence under sustained load | P2 | I3 proves escalation; load-rate modeling is optimization |
| A5-A8 in a single grouped runner view | P3 | Traceability doc serves as grouped evidence for now |
sw-block/docs/archive/design/phase-07-service-slice-plan.md
@@ -0,0 +1,403 @@
# Phase 07 Service-Slice Plan

Date: 2026-03-30
Status: historical phase-planning artifact
Scope: `Phase 07 P0`

## Purpose

Define the first real-system service slice that will host the V2 engine, choose the first concrete integration path in the existing codebase, and map engine adapters onto real modules.

This is a planning document. It does not claim the integration already works.

## Decision

The first service slice should be:

- a single `blockvol` primary on a real volume server
- with one replica target (`RF=2` path)
- driven by the existing master heartbeat / assignment loop
- using the V2 engine only for replication recovery ownership / planning / execution

This is the narrowest real-system slice that still exercises:

1. real assignment delivery
2. real epoch and failover signals
3. real volume-server lifecycle
4. real WAL/checkpoint/base-image truth
5. real changed-address / reconnect behavior

It is narrow enough to avoid reopening the whole system, but real enough to stop hiding behind engine-local mocks.

## Why This Slice

This slice is the right first integration target because:

1. `weed/server/master_grpc_server.go` already delivers block-volume assignments over heartbeat
2. `weed/server/master_block_failover.go` already owns failover / promotion / pending-rebuild decisions
3. `weed/storage/blockvol/blockvol.go` already owns the current replication runtime (`shipperGroup`, receiver, WAL retention, checkpoint state)
4. the existing V1/V1.5 failure history is concentrated in exactly this master <-> volume-server <-> blockvol path

So this slice gives maximum validation value with minimum new surface.

## First Concrete Integration Path

The first integration path should be:

1. master receives a volume-server heartbeat
2. master updates the block registry and emits `BlockVolumeAssignment`
3. volume server receives the assignment
4. the block volume adapter converts the assignment plus local storage state into V2 engine inputs
5. the V2 engine drives sender/session/recovery state
6. the existing block-volume runtime executes the actual data-path work under engine decisions

In code, that path starts here:

- master side:
  - `weed/server/master_grpc_server.go`
  - `weed/server/master_block_failover.go`
  - `weed/server/master_block_registry.go`
- volume / storage side:
  - `weed/storage/blockvol/blockvol.go`
  - `weed/storage/blockvol/recovery.go`
  - `weed/storage/blockvol/wal_shipper.go`
  - assignment-handling code under `weed/storage/blockvol/`
- V2 engine side:
  - `sw-block/engine/replication/`

## Service-Slice Boundaries

### In-process placement

The V2 engine should initially live:

- in-process with the volume server / `blockvol` runtime
- not in master
- not as a separate service yet

Reason:

- the engine needs local access to storage truth and local recovery execution
- master should remain the control-plane authority, not the recovery executor

### Control-plane boundary

Master remains authoritative for:

1. epoch
2. role / assignment
3. promotion / failover decisions
4. replica membership

The engine consumes these as control inputs. It does not replace master failover policy in `Phase 07`.

### Control-Over-Heartbeat Upgrade Path

For the first V2 product path, the recommended direction is:

- reuse the existing master <-> volume-server heartbeat path as the control carrier
- upgrade the block-specific control semantics carried on that path
- do not immediately invent a separate control service or assignment channel

Why:

1. this is the real Seaweed path already carrying block assignments and confirmations today
2. it gives the fastest route to a real integrated control path
3. it preserves compatibility with existing Seaweed master/volume-server semantics while V2 hardens its own control truth

Concretely, the current V1 path already provides:

1. block assignments delivered in heartbeat responses from `weed/server/master_grpc_server.go`
2. assignment application on the volume server in `weed/server/volume_grpc_client_to_master.go` and `weed/server/volume_server_block.go`
3. assignment confirmation and address-change refresh driven by later heartbeats in `weed/server/master_grpc_server.go` and `weed/server/master_block_registry.go`
4. immediate block heartbeat on selected shipper state changes in `weed/server/volume_grpc_client_to_master.go`

What should be upgraded for V2 is not mainly the transport, but the control contract carried on it:

1. a stable `ReplicaID`
2. an explicit `Epoch`
3. explicit role / assignment authority
4. explicit apply/confirm semantics
5. explicit stale-assignment rejection
6. explicit address-change refresh as an endpoint change, not an identity change

Current cadence note:

- the block volume heartbeat is periodic (`5 * sleepInterval`) with some immediate state-change heartbeats
- this is acceptable as the first hardening carrier
- it should not be assumed to be the final control responsiveness model

Deferred design decision:

- whether block control should eventually move beyond heartbeat-only carriage into a more explicit control/assignment channel should be decided only after the `Phase 08 P1` real control-delivery path exists and can be measured

That later decision should be based on:

1. failover / reassignment responsiveness
2. assignment confirmation precision
3. operational complexity
4. whether heartbeat carriage remains too coarse for the block-control path

Until then, the preferred direction is:

- strengthen block control semantics over the existing heartbeat path
- do not prematurely create a second control plane

### Storage boundary

`blockvol` remains authoritative for:

1. WAL head / retention reality
2. checkpoint/base-image reality
3. actual catch-up streaming
4. actual rebuild transfer / restore operations

The engine consumes these as storage truth and recovery execution capabilities. It does not replace the storage backend in `Phase 07`.

## First-Slice Identity Mapping

This must be explicit in the first integration slice.

For `RF=2` on the existing master / block-registry path:

- the stable engine `ReplicaID` should be derived from:
  - `<volume-name>/<replica-server-id>`
- not from:
  - `DataAddr`
  - `CtrlAddr`
  - the heartbeat transport endpoint

For this slice, the adapter should map:

1. `ReplicaID`
   - from the master/block-registry identity for the replica host entry

2. `Endpoint`
   - from the current replica receiver/data/control addresses reported by the real runtime

3. `Epoch`
   - from the confirmed master assignment for the volume

4. `SessionKind`
   - from master-driven recovery intent / role-transition outcome

This is a hard first-slice requirement because address refresh must not collapse identity back into endpoint-shaped keys.
## Adapter Mapping

### 1. ControlPlaneAdapter

Engine interface today:

- `HandleHeartbeat(serverID, volumes)`
- `HandleFailover(deadServerID)`

The real mapping should be:

- master-side source:
  - `weed/server/master_grpc_server.go`
  - `weed/server/master_block_failover.go`
  - `weed/server/master_block_registry.go`
- volume-server side sink:
  - the assignment receive/apply path in `weed/storage/blockvol/`

Recommended real shape:

- do not literally push raw heartbeat messages into the engine
- instead introduce a thin adapter that converts confirmed master assignment state into:
  - a stable `ReplicaID`
  - an endpoint set
  - an epoch
  - a recovery target kind

That keeps master as the control owner and the engine as the execution owner.

Important note:

- the adapter should treat heartbeat as the transport carrier, not as the final protocol shape
- block-control semantics should be made explicit over that carrier
- if a later phase concludes that heartbeat-only carriage is too coarse, that should be a separate design decision made after the real hardening path is measured
### 2. StorageAdapter

Engine interface today:

- `GetRetainedHistory()`
- `PinSnapshot(lsn)` / `ReleaseSnapshot(pin)`
- `PinWALRetention(startLSN)` / `ReleaseWALRetention(pin)`
- `PinFullBase(committedLSN)` / `ReleaseFullBase(pin)`

The real mapping should be:

- retained-history source:
  - current WAL head/tail/checkpoint state from `weed/storage/blockvol/blockvol.go`
  - recovery helpers in `weed/storage/blockvol/recovery.go`
- WAL retention pin:
  - the existing retention-floor / replica-aware WAL retention machinery around `shipperGroup`
- snapshot pin:
  - existing snapshot/checkpoint artifacts in `blockvol`
- full-base pin:
  - an explicit pinned full-extent export or an equivalent consistent base handle from `blockvol`

Important constraint:

- `Phase 07` must not fake this by reconstructing `RetainedHistory` from tests or metadata alone
### 3. Execution Driver / Executor hookup

The engine side already has:

- the planner/executor split in `sw-block/engine/replication/driver.go`
- stepwise executors in `sw-block/engine/replication/executor.go`

The real mapping should be:

- the engine planner decides:
  - zero-gap / catch-up / rebuild
  - the trusted-base requirement
  - the replayable-tail requirement
- the blockvol runtime performs:
  - actual WAL catch-up transport
  - actual snapshot/base transfer
  - actual truncation / apply operations

Recommended split:

- the engine owns the contract and state transitions
- the blockvol adapter owns concrete I/O work
## First-Slice Acceptance Rule

For the first integration slice, this is a hard rule:

- `blockvol` may execute recovery I/O
- `blockvol` must not own recovery policy

Concretely, `blockvol` must not decide:

1. zero-gap vs catch-up vs rebuild
2. trusted-base validity
3. replayable-tail sufficiency
4. whether rebuild fallback is required

Those decisions must remain in the V2 engine.

The bridge may translate engine decisions into concrete blockvol actions, but it must not re-decide recovery policy underneath the engine.

## First Product Path

The first product path should be:

- `RF=2` block volume replication on the existing heartbeat/assignment loop
- primary + one replica
- failover / reconnect / changed-address handling
- rebuild as the formal non-catch-up recovery path

This is the right first path because it exercises the core correctness boundary without introducing N-replica coordination complexity too early.

## What Must Be Replaced First

Current engine-stage pieces that are still mock/test-only or too abstract:

### Replace first

1. `mockStorage` in engine tests
   - replace with a real `blockvol`-backed `StorageAdapter`

2. synthetic control events in engine tests
   - replace with assignment-driven events from the real master/volume-server path

3. convenience recovery-completion wrappers
   - keep them test-only
   - real integration should use the planner + executor + storage work loop

### Can remain temporarily abstract in Phase 07 P0/P1

1. the exact public shape of `ControlPlaneAdapter`
   - can remain thin while the integration path is being chosen

2. async production scheduler details
   - the executor can still be driven by a service loop before the full background-task architecture is finalized

## Recommended Concrete Modules

### Engine stays here

- `sw-block/engine/replication/`

### First real adapter package should be added near blockvol

Recommended initial location:

- `weed/storage/blockvol/v2bridge/`

Reason:

- keeps the V2 engine independent under `sw-block/`
- keeps real-system glue close to blockvol storage truth
- avoids copying engine logic into `weed/`

Suggested contents:

1. `control_adapter.go`
   - convert the master assignment / local apply path into engine intents

2. `storage_adapter.go`
   - expose retained history, pin/release, and trusted-base export handles from real blockvol state

3. `executor_bridge.go`
   - translate engine executor steps into actual blockvol recovery actions

4. `observe_adapter.go`
   - map engine status/logs into service-visible diagnostics

## First Failure Replay Set For Phase 07

The first real-system replay set should be:

1. changed-address restart
   - current risk: old identity/address coupling reappears in service glue

2. stale epoch / stale result after failover
   - current risk: master and engine disagree on authority timing

3. unreplayable-tail rebuild fallback
   - current risk: service glue over-trusts checkpoint/base availability

4. plan/execution cleanup after resource failure
   - current risk: blockvol-side resource failures leave engine or service state dangling

5. primary failover to the replica, with rebuild pending on old-primary reconnect
   - current risk: old V1/V1.5 semantics leak back into reconnect handling

## Non-Goals For This Slice

Do not use `Phase 07` to:

1. widen catch-up semantics
2. add smart rebuild optimizations
3. redesign all blockvol internals
4. replace the full V1 runtime in one move
5. claim production readiness

## Deliverables For Phase 07 P0

A good `P0` delivery should include:

1. the chosen service slice
2. the chosen integration path in the current repo
3. the adapter-to-module mapping
4. the list of test-only adapters to replace first
5. the first failure replay set
6. an explicit note of what remains outside this first slice

## Short Form

`Phase 07 P0` should start with:

- engine in `sw-block/engine/replication/`
- bridge in `weed/storage/blockvol/v2bridge/`
- first real slice = blockvol primary + one replica on the existing master heartbeat / assignment path
- `ReplicaID = <volume-name>/<replica-server-id>` for the first slice
- `blockvol` executes I/O but does not own recovery policy
- first product path = `RF=2` failover/reconnect/rebuild correctness
sw-block/docs/archive/design/phase-08-engine-skeleton-map.md
@@ -0,0 +1,301 @@
# Phase 08 Engine Skeleton Map

Date: 2026-03-31
Status: historical phase map
Purpose: provide a short structural map for the `Phase 08` hardening path so implementation can move faster without reopening accepted V2 boundaries

## Scope

This is not the final standalone `sw-block` architecture.

It is the shortest useful engine skeleton for the accepted `Phase 08` hardening path:

- `RF=2`
- `sync_all`
- the existing `Seaweed` master / volume-server heartbeat path
- the V2 engine owns recovery policy
- `blockvol` remains the execution backend

## Module Map

### 1. Control plane

Role:

- authoritative control truth

Primary sources:

- `weed/server/master_grpc_server.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/server/volume_grpc_client_to_master.go`

What it produces:

- confirmed assignments
- `Epoch`
- target `Role`
- failover / promotion / reassignment results
- stable server identity

### 2. Control bridge

Role:

- translate real control truth into V2 engine intent

Primary files:

- `weed/storage/blockvol/v2bridge/control.go`
- `sw-block/bridge/blockvol/control_adapter.go`
- the entry path in `weed/server/volume_server_block.go`

What it produces:

- `AssignmentIntent`
- a stable `ReplicaID`
- `Endpoint`
- `SessionKind`

### 3. Engine runtime

Role:

- recovery-policy core

Primary files:

- `sw-block/engine/replication/orchestrator.go`
- `sw-block/engine/replication/driver.go`
- `sw-block/engine/replication/executor.go`
- `sw-block/engine/replication/sender.go`
- `sw-block/engine/replication/history.go`

What it decides:

- zero-gap / catch-up / needs-rebuild
- sender/session ownership
- stale-authority rejection
- resource acquisition / release
- rebuild source selection

### 4. Storage bridge

Role:

- translate real blockvol storage truth and execution capability into engine-facing adapters

Primary files:

- `weed/storage/blockvol/v2bridge/reader.go`
- `weed/storage/blockvol/v2bridge/pinner.go`
- `weed/storage/blockvol/v2bridge/executor.go`
- `sw-block/bridge/blockvol/storage_adapter.go`

What it provides:

- `RetainedHistory`
- WAL retention pin / release
- snapshot pin / release
- full-base pin / release
- WAL scan execution

### 5. Block runtime

Role:

- execute real I/O

Primary files:

- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/recovery.go`
- `weed/storage/blockvol/rebuild.go`
- `weed/storage/blockvol/wal_shipper.go`

What it owns:

- WAL
- extent
- flusher
- checkpoint / superblock
- receiver / shipper
- rebuild server

## Execution Order

### Control path

```text
master heartbeat / failover truth
-> BlockVolumeAssignment
-> volume server ProcessAssignments
-> v2bridge control conversion
-> engine ProcessAssignment
-> sender/session state updated
```

### Catch-up path

```text
assignment accepted
-> engine reads retained history
-> engine plans catch-up
-> storage bridge pins WAL retention
-> engine executor drives v2bridge executor
-> blockvol scans WAL / ships entries
-> engine completes session
```

### Rebuild path

```text
assignment accepted
-> engine detects NeedsRebuild
-> engine selects rebuild source
-> storage bridge pins snapshot/full-base/tail
-> executor drives transfer path
-> blockvol performs restore / replay work
-> engine completes rebuild
```

### Local durability path

```text
WriteLBA / Trim
-> WAL append
-> shipping / barrier
-> client-visible durability decision
-> flusher writes extent
-> checkpoint advances
-> retention floor decides WAL reclaimability
```
## Interim Fields

These are currently acceptable only as explicit hardening carry-forwards:

### `localServerID`

Current source:

- `BlockService.listenAddr`

Meaning:

- temporary local identity source for replica/rebuild-side assignment translation

Status:

- interim only
- should become a registry-assigned stable server identity later

### `CommittedLSN = CheckpointLSN`

Current source:

- `v2bridge.Reader` / `BlockVol.StatusSnapshot()`

Meaning:

- the current V1-style interim mapping, where committed truth collapses to local checkpoint truth

Status:

- not final V2 truth
- must become a gate decision before a production-candidate phase

### heartbeat as control carrier

Current source:

- the existing master <-> volume-server heartbeat path

Meaning:

- the current transport for assignment/control delivery

Status:

- acceptable as the current carrier
- not yet proof that no separate control channel will ever be needed

## Hard Gates

These should remain explicit in `Phase 08`:

### Gate 1: committed truth

Before production-candidate:

- either separate `CommittedLSN` from `CheckpointLSN`
- or explicitly bound the first candidate path to currently proven pre-checkpoint replay behavior

### Gate 2: live control delivery

Required:

- real assignment delivery must reach the engine on the live path
- not only converter-level proof

### Gate 3: integrated catch-up closure

Required:

- engine -> executor -> `v2bridge` -> blockvol must be proven as one live chain
- not planner proof plus direct WAL-scan proof as separate pieces of evidence

### Gate 4: first rebuild execution path

Required:

- rebuild must not remain only a detection outcome
- the chosen product path needs one real executable rebuild closure

### Gate 5: unified replay

Required:

- after control and execution closure land, rerun the accepted failure-class set on the unified live path

## Reuse Map

### Reuse directly

- `weed/server/master_grpc_server.go`
- `weed/server/volume_grpc_client_to_master.go`
- `weed/server/volume_server_block.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/v2bridge/`

### Reuse as implementation reality, not truth

- `shipperGroup`
- `RetentionFloorFn`
- `ReplicaReceiver`
- checkpoint/superblock machinery
- existing failover heuristics

### Do not inherit as V2 semantics

- address-shaped identity
- old degraded/catch-up intuition from V1/V1.5
- `CommittedLSN = CheckpointLSN` as final truth
- blockvol-side recovery policy decisions

## Short Rule

Use this skeleton as:

- a hardening map for the current product path

Do not mistake it for:

- the final standalone `sw-block` architecture

---

# V2 Engine Readiness Review

Date: 2026-03-29
Status: historical readiness review
Purpose: record the decision on whether the current V2 design + prototype + simulator stack is strong enough to begin real V2 engine slicing

## Decision

Current judgment:

- proceed to real V2 engine planning
- do not open a `V2.5` redesign track at this time

This is a planning-readiness decision, not a production-readiness claim.

## Why This Review Exists

The project has now completed:

1. design/FSM closure for the V2 line
2. protocol simulation closure for:
   - V1 / V1.5 / V2 comparison
   - timeout/race behavior
   - ownership/session semantics
3. standalone prototype closure for:
   - sender/session ownership
   - execution authority
   - recovery branching
   - minimal historical-data proof
   - prototype scenario closure
4. `Phase 4.5` hardening for:
   - bounded `CatchUp`
   - first-class `Rebuild`
   - crash-consistency / restart-recoverability
   - `A5-A8` stronger evidence

So the question is no longer:

- "can the prototype be made richer?"

The question is:

- "is the evidence now strong enough to begin real engine slicing?"

## Evidence Summary

### 1. Design / Protocol

Primary docs:

- `sw-block/design/v2-acceptance-criteria.md`
- `sw-block/design/v2-open-questions.md`
- `sw-block/design/v2_scenarios.md`
- `sw-block/design/v1-v15-v2-comparison.md`
- `sw-block/docs/archive/design/v2-prototype-roadmap-and-gates.md`

Judgment:

- the protocol story is coherent
- an acceptance set exists
- major V1 / V1.5 failures are mapped into V2 scenarios

### 2. Simulator

Primary code/tests:

- `sw-block/prototype/distsim/`
- `sw-block/prototype/distsim/eventsim.go`
- `learn/projects/sw-block/test/results/v2-simulation-review.md`

Judgment:

- strong enough for protocol/design validation
- strong enough to challenge crash-consistency and liveness assumptions
- not a substitute for real engine / hardware proof

### 3. Prototype

Primary code/tests:

- `sw-block/prototype/enginev2/`
- `sw-block/prototype/enginev2/acceptance_test.go`

Judgment:

- ownership is explicit and fenced
- execution authority is explicit and fenced
- bounded `CatchUp` is enforced semantically, not merely documented
- `Rebuild` is a first-class sender-owned path
- historical-data and recoverability reasoning are executable

### 4. `A5-A8` Double Evidence

Prototype-side grouped evidence:

- `sw-block/prototype/enginev2/acceptance_test.go`

Simulator-side grouped evidence:

- `sw-block/docs/archive/design/a5-a8-traceability.md`
- `sw-block/prototype/distsim/`

Judgment:

- the critical acceptance items that most affect engine risk now have materially stronger proof on both sides

## What Is Good Enough Now

The following are good enough to begin engine slicing:

1. sender/session ownership model
2. stale authority fencing
3. recovery orchestration shape
4. bounded `CatchUp` contract
5. `Rebuild` as a formal path
6. committed/recoverable boundary thinking
7. crash-consistency / restart-recoverability proof style

## What Is Still Not Proven

The following still require real engine work and later real-system validation:

1. actual engine lifecycle integration
2. real storage/backend implementation
3. real control-plane integration
4. real durability / fsync behavior under the actual engine
5. real hardware timing / performance
6. final production observability and failure handling

These are expected gaps. They do not block engine planning.

## Open Risks To Carry Forward

These are not blockers, but they should remain explicit:

1. the prototype and simulator are still reduced models
2. rebuild-source quality in the real engine will depend on actual checkpoint/base-image mechanics
3. durability truth in the real engine must still be re-proven against actual persistence behavior
4. predicate exploration can still grow, but should not block engine slicing

## Engine-Planning Decision

Decision:

- start real V2 engine planning

Reason:

1. no current evidence points to a structural flaw requiring `V2.5`
2. the remaining gaps are implementation/system gaps, not prototype ambiguity
3. continuing to extend prototype/simulator breadth would have diminishing returns

## Required Outputs After This Review

1. `sw-block/docs/archive/design/v2-engine-slicing-plan.md`
2. first real engine slice definition
3. explicit non-goals for the first engine stage
4. explicit validation plan for engine slices

## Non-Goals Of This Review

This review does not claim:

1. V2 is production-ready
2. V2 should replace V1 immediately
3. all design questions are closed forever

It only claims:

- the project now has enough evidence to begin disciplined real engine slicing

---

# V2 Engine Slicing Plan

Date: 2026-03-29
Status: historical slicing plan
Purpose: define the first real V2 engine slices after prototype and `Phase 4.5` closure

## Goal

Move from:

- standalone design/prototype truth under `sw-block/prototype/`

to:

- a real V2 engine core under `sw-block/`

without dragging V1.5 lifecycle assumptions into the implementation.

## Planning Rules

1. reuse V1 ideas and tests selectively, not structurally
2. prefer narrow vertical slices over broad skeletons
3. each slice must preserve the accepted V2 ownership/fencing model
4. keep the simulator/prototype as validation support, not as the implementation itself
5. do not mix V2 engine work into `weed/storage/blockvol/`

## First Engine Stage

The first engine stage should build the control/recovery core, not the full storage engine.

That means:

1. per-replica sender identity
2. one active recovery session per replica per epoch
3. sender-owned execution authority
4. explicit recovery outcomes:
   - zero gap
   - bounded catch-up
   - rebuild
5. rebuild execution shell only
   - do not hard-code the final `snapshot + tail` vs `full base` decision logic yet
   - keep the real rebuild-source choice tied to Slice 3 recoverability inputs

## Recommended Slice Order

### Slice 1: Engine Ownership Core

Purpose:

- carry the accepted `enginev2` ownership/fencing model into the real engine core

Scope:

1. stable per-replica sender object
2. stable recovery-session object
3. session identity fencing
4. endpoint / epoch invalidation
5. sender-group or equivalent ownership registry

Acceptance:

1. stale session results cannot mutate current authority
2. changed-address and epoch-bump invalidation work in engine code
3. the four V2-boundary ownership themes remain provable

### Slice 2: Engine Recovery Execution Core

Purpose:

- move the prototype execution APIs into real engine behavior

Scope:

1. connect / handshake / catch-up flow
2. bounded `CatchUp`
3. explicit `NeedsRebuild`
4. sender-owned rebuild execution path
5. rebuild execution shell without final trusted-base selection policy

Acceptance:

1. bounded catch-up does not chase indefinitely
2. rebuild is mutually exclusive with catch-up
3. session completion rules are explicit and fenced

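The bounded `CatchUp` contract above can be sketched as a loop with a target LSN fixed at session start and an explicit attempt budget. This is an illustrative sketch, not the engine's code; `boundedCatchUp`, `Outcome`, and the round budget are invented names for this example.

```go
package main

import "fmt"

// Outcome of one recovery session attempt. These names are illustrative,
// not the engine's real identifiers.
type Outcome int

const (
	InSync Outcome = iota
	NeedsRebuild
)

// boundedCatchUp replays entries toward a target LSN that is fixed when
// the session starts. It does not chase the primary's live tail: if the
// replica cannot reach targetLSN within maxRounds, the session escalates
// to NeedsRebuild instead of retrying indefinitely.
func boundedCatchUp(replicaLSN, targetLSN uint64, advance func(uint64) uint64, maxRounds int) (uint64, Outcome) {
	for round := 0; round < maxRounds; round++ {
		if replicaLSN >= targetLSN {
			return replicaLSN, InSync
		}
		replicaLSN = advance(replicaLSN)
	}
	if replicaLSN >= targetLSN {
		return replicaLSN, InSync
	}
	return replicaLSN, NeedsRebuild
}

func main() {
	// A replica 100 entries behind, shipping 40 entries per round:
	// it crosses the fixed target within the budget.
	lsn, out := boundedCatchUp(0, 100, func(l uint64) uint64 { return l + 40 }, 5)
	fmt.Println(lsn, out == InSync)

	// A replica that makes no progress exhausts the budget and escalates.
	_, out = boundedCatchUp(0, 100, func(l uint64) uint64 { return l }, 5)
	fmt.Println(out == NeedsRebuild)
}
```

The key property is that the target does not move during the session, which is what makes "does not chase indefinitely" checkable.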
### Slice 3: Engine Data / Recoverability Core

Purpose:

- connect recovery behavior to real retained-history / checkpoint mechanics

Scope:

1. real recoverability decision inputs
2. trusted-base decision for the rebuild source
3. minimal real checkpoint/base-image integration
4. real truncation / safe-boundary handling

This is the first slice that should decide, from real engine inputs, between:

1. `snapshot + tail`
2. `full base`

Acceptance:

1. the engine can explain why recovery is allowed
2. the rebuild-source choice is explicit and testable
3. historical correctness and truncation rules remain intact

### Slice 4: Engine Integration Closure

Purpose:

- bind the engine control/recovery core to real orchestration and validation surfaces

Scope:

1. real assignment/control intent entry path
2. engine-facing observability
3. focused real-engine tests for V2-boundary cases
4. first integration review against real failure classes

Acceptance:

1. key V2-boundary failures are reproduced and closed in engine tests
2. engine observability is good enough to debug ownership/recovery failures
3. the remaining gaps are system/performance gaps, not control-model ambiguity

## What To Reuse

Good reuse candidates:

1. tests and failure cases from V1 / V1.5
2. narrow utility/data helpers where they are not coupled to the V1 lifecycle
3. selected WAL/history concepts if they fit V2 ownership boundaries

Do not structurally reuse:

1. the V1/V1.5 shipper lifecycle
2. address-based identity assumptions
3. `SetReplicaAddrs`-style behavior
4. the old recovery control structure

## Where The Work Should Live

Real V2 engine work should continue under:

- `sw-block/`

Recommended next area:

- `sw-block/core/` or `sw-block/engine/`

The exact path can be chosen later, but it should remain separate from:

- `sw-block/prototype/`
- `weed/storage/blockvol/`

## Validation Plan For Engine Slices

Each engine slice should be validated at three levels:

1. prototype alignment
   - does engine behavior preserve the accepted prototype invariant?

2. focused engine tests
   - does the real engine slice enforce the same contract?

3. scenario mapping
   - does at least one important V1/V1.5 failure class remain closed?

## Non-Goals For First Engine Stage

Do not try to do these immediately:

1. full Smart WAL expansion
2. performance optimization
3. V1 replacement/migration plan
4. full product integration
5. all storage/backend redesign at once

## Immediate Next Assignment

The first concrete engine-planning task should be:

1. choose the real V2 engine module location under `sw-block/`
2. define Slice 1 file/module boundaries
3. write a short engine ownership-core spec
4. map 3-5 acceptance scenarios directly onto Slice 1 expectations

---

# V2 First Slice: Per-Replica Sender/Session Ownership

Date: 2026-03-27
Status: historical first-slice note
Depends-on: Q1 (recovery session), Q6 (orchestrator scope), Q7 (first slice)

## Problem

`SetReplicaAddrs()` replaces the entire `ShipperGroup` atomically. This causes:

1. **State loss on topology change.** All shippers are destroyed and recreated.
   Recovery state (`replicaFlushedLSN`, `lastContactTime`, catch-up progress) is lost.
   After a changed-address restart, the new shipper starts from scratch.

2. **No per-replica identity.** Shippers are identified by array index. The master
   cannot target a specific replica for rebuild/catch-up — it must re-issue the
   entire address set.

3. **Background reconnect races.** A reconnect cycle may be in progress when
   `SetReplicaAddrs` replaces the group. The in-progress reconnect's connection
   objects become orphaned.

## Design

### Per-replica sender identity

`ShipperGroup` changes from `[]*WALShipper` to `map[string]*WALShipper`, keyed by
the replica's canonical data address. Each shipper stores its own `ReplicaID`.

```go
type WALShipper struct {
	ReplicaID string // canonical data address — identity across reconnects
	// ... existing fields
}

type ShipperGroup struct {
	mu       sync.RWMutex
	shippers map[string]*WALShipper // keyed by ReplicaID
}
```

### ReconcileReplicas replaces SetReplicaAddrs

Instead of replacing the entire group, `ReconcileReplicas` diffs old vs new:

```
ReconcileReplicas(newAddrs []ReplicaAddr):
    for each existing shipper:
        if NOT in newAddrs → Stop and remove
    for each newAddr:
        if matching shipper exists → keep (preserve state)
        if no match → create new shipper
```
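The diff step above can be sketched in runnable Go. This is a reduced illustration of the reconcile shape, not the real `shipper_group.go`: `Shipper` stands in for `WALShipper`, the factory and `Stop()` hooks are collapsed to plain fields, and locking is omitted.

```go
package main

import "fmt"

// Shipper is a stand-in for WALShipper; only identity and a bit of
// recovery state are modeled here.
type Shipper struct {
	ReplicaID  string // canonical data address
	FlushedLSN uint64 // recovery state preserved across reconciles
	stopped    bool
}

type ShipperGroup struct {
	shippers map[string]*Shipper
}

// ReconcileReplicas keeps shippers whose address is still in newAddrs
// (preserving their state), stops and removes shippers whose address is
// gone, and creates fresh shippers for new addresses.
func (g *ShipperGroup) ReconcileReplicas(newAddrs []string) {
	want := make(map[string]bool, len(newAddrs))
	for _, a := range newAddrs {
		want[a] = true
	}
	for id, s := range g.shippers {
		if !want[id] {
			s.stopped = true // Stop() in the real code
			delete(g.shippers, id)
		}
	}
	for _, a := range newAddrs {
		if _, ok := g.shippers[a]; !ok {
			g.shippers[a] = &Shipper{ReplicaID: a} // fresh shipper bootstraps
		}
	}
}

func main() {
	g := &ShipperGroup{shippers: map[string]*Shipper{
		"r1:8080": {ReplicaID: "r1:8080", FlushedLSN: 42},
		"r2:8080": {ReplicaID: "r2:8080", FlushedLSN: 17},
	}}
	// r1 stays; r2 restarts on a new port.
	g.ReconcileReplicas([]string{"r1:8080", "r2:9090"})
	fmt.Println(g.shippers["r1:8080"].FlushedLSN) // stable replica keeps state
	fmt.Println(g.shippers["r2:9090"].FlushedLSN) // fresh shipper starts at zero
}
```

Only the changed replica loses its shipper state, which is the whole point of diffing rather than replacing the group.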

This preserves `replicaFlushedLSN`, `lastContactTime`, catch-up progress, and
background reconnect goroutines for replicas that stay in the set.

`SetReplicaAddrs` becomes a wrapper:
```go
func (v *BlockVol) SetReplicaAddrs(addrs []ReplicaAddr) {
	if v.shipperGroup == nil {
		v.shipperGroup = NewShipperGroup(nil)
	}
	v.shipperGroup.ReconcileReplicas(addrs, v.makeShipperFactory())
}
```

### Changed-address restart flow

1. The replica restarts on a new port. Its heartbeat reports the new address.
2. The master detects the endpoint change (address differs, same volume).
3. The master sends an assignment update to the primary with the new replica address.
4. The primary's `ReconcileReplicas` receives `[oldAddr1, newAddr2]`.
5. The old shipper for the changed replica is stopped (its old address is gone from the set).
6. A new shipper is created with the new address — but this is a fresh shipper.
7. The new shipper bootstraps: Disconnected → Connecting → CatchingUp → InSync.

The improvement over V1.5: the **other** replicas in the set are NOT disturbed.
Only the changed replica gets a fresh shipper. Recovery state for stable replicas
is preserved.

### Recovery session

Each `WALShipper` already contains the recovery state machine:
- `state` (Disconnected → Connecting → CatchingUp → InSync → Degraded → NeedsRebuild)
- `replicaFlushedLSN` (authoritative progress)
- `lastContactTime` (retention budget)
- `catchupFailures` (escalation counter)
- background reconnect goroutine

No separate `RecoverySession` object is needed. The `WALShipper` IS the per-replica
recovery session. The state machine already tracks the session lifecycle.

What changes: the session is no longer destroyed on topology change (unless the
replica itself is removed from the set).

### Coordinator vs primary responsibilities

| Responsibility | Owner |
|---------------|-------|
| Endpoint truth (canonical address) | Coordinator (master) |
| Assignment updates (add/remove replicas) | Coordinator |
| Epoch authority | Coordinator |
| Session creation trigger | Coordinator (via assignment) |
| Session execution (reconnect, catch-up, barrier) | Primary (via WALShipper) |
| Timeout enforcement | Primary |
| Ordered receive/apply | Replica |
| Barrier ack | Replica |
| Heartbeat reporting | Replica |

### Migration from current code

| Current | V2 |
|---------|-----|
| `ShipperGroup.shippers []*WALShipper` | `ShipperGroup.shippers map[string]*WALShipper` |
| `SetReplicaAddrs()` creates all new | `ReconcileReplicas()` diffs and preserves |
| `StopAll()` in demote | `StopAll()` unchanged (stops all) |
| `ShipAll(entry)` iterates slice | `ShipAll(entry)` iterates map values |
| `BarrierAll(lsn)` parallel slice | `BarrierAll(lsn)` parallel map values |
| `MinReplicaFlushedLSN()` iterates slice | Same, iterates map values |
| `ShipperStates()` iterates slice | Same, iterates map values |
| No per-shipper identity | `WALShipper.ReplicaID` = canonical data addr |

### Files changed

| File | Change |
|------|--------|
| `wal_shipper.go` | Add `ReplicaID` field, pass in constructor |
| `shipper_group.go` | `map[string]*WALShipper`, `ReconcileReplicas`, update iterators |
| `blockvol.go` | `SetReplicaAddrs` calls `ReconcileReplicas`, shipper factory |
| `promotion.go` | No change (`StopAll` unchanged) |
| `dist_group_commit.go` | No change (uses `ShipperGroup` API) |
| `block_heartbeat.go` | No change (uses `ShipperStates`) |

### Acceptance bar

The following existing tests must continue to pass:
- All CP13-1 through CP13-7 protocol tests (`sync_all_protocol_test.go`)
- All adversarial tests (`sync_all_adversarial_test.go`)
- All baseline tests (`sync_all_bug_test.go`)
- All rebuild tests (`rebuild_v1_test.go`)

The following CP13-8 tests validate the V2 improvement:
- `TestCP13_SyncAll_ReplicaRestart_Rejoin` — changed-address recovery
- `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` — V2 reconnect protocol
- `TestAdversarial_CatchupMultipleDisconnects` — state preservation across reconnects

New tests to add:
- `TestReconcileReplicas_PreservesExistingShipper` — stable replica keeps state
- `TestReconcileReplicas_RemovesStaleShipper` — removed replica stopped
- `TestReconcileReplicas_AddsNewShipper` — new replica bootstraps
- `TestReconcileReplicas_MixedUpdate` — one kept, one removed, one added

## Non-goals for this slice

- Smart WAL payload classes
- Recovery reservation protocol
- Full coordinator orchestration
- New transport layer

---

# V2 First Slice: Per-Replica Sender and Recovery Session Ownership

Date: 2026-03-27
Status: historical first-slice note

## Purpose

This document defines the first real V2 implementation slice.

The slice is intentionally narrow:

- per-replica sender ownership
- explicit recovery session ownership
- clear coordinator vs primary responsibility

This is the first step toward a standalone V2 block engine under `sw-block/`.

## Why This Slice First

It directly addresses the clearest V1.5 structural limits:

- sender identity loss when replica sets are refreshed
- changed-address restart recovery complexity
- repeated reconnect cycles without stable per-replica ownership
- adversarial Phase 13 boundary tests that V1.5 cannot cleanly satisfy

It also avoids jumping too early into:

- Smart WAL
- new backend storage layout
- full production transport redesign

## Core Decision

Use:

- **one sender owner per replica**
- **at most one active recovery session per replica per epoch**

Healthy replicas need only their steady sender object.

Degraded / reconnecting replicas gain an explicit recovery session owned by the primary.

## Ownership Split

### Coordinator

Owns:

- replica identity / endpoint truth
- assignment updates
- epoch authority
- session creation / destruction intent

Does not own:

- byte-by-byte catch-up execution
- local sender loop scheduling

### Primary

Owns:

- per-replica sender objects
- per-replica recovery session execution
- reconnect / catch-up progress
- timeout enforcement for the active session
- the transition from normal sender to recovery session and back

### Replica

Owns:

- receive/apply path
- barrier ack
- heartbeat/reporting

The replica remains passive from the recovery-orchestration point of view.

## Data Model

### Sender Owner

Per replica, maintain a stable sender owner with:

- replica logical ID
- current endpoint
- current epoch view
- steady-state health/status
- optional active recovery session reference

### Recovery Session

Per replica, per epoch:

- `ReplicaID`
- `Epoch`
- `EndpointVersion` or equivalent endpoint truth
- `State`
  - `connecting`
  - `catching_up`
  - `in_sync`
  - `needs_rebuild`
- `StartLSN`
- `TargetLSN`
- timeout / deadline metadata

## Session Rules

1. only one active session per replica per epoch
2. a new assignment for the same replica:
   - supersedes the old session only if its epoch/session generation is newer
3. a stale session must not continue after:
   - an epoch bump
   - an endpoint truth change
   - explicit coordinator replacement

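The rules above can be sketched as a small fencing check. This is a minimal sketch with invented names (`SenderOwner`, `RecoverySession`, `StartSession`, `Complete`); the real engine's identifiers and state handling will differ. It shows the one-active-session invariant and how a stale session's result is fenced by its epoch and session ID.

```go
package main

import "fmt"

// SenderOwner is the stable per-replica owner; RecoverySession carries
// the authority token (Epoch, SessionID) it was created under.
type SenderOwner struct {
	ReplicaID string
	Epoch     uint64
	nextSess  uint64
	Active    *RecoverySession
	State     string
}

type RecoverySession struct {
	SessionID uint64
	Epoch     uint64
}

// StartSession supersedes any active session, so at most one session is
// active per replica at a time.
func (o *SenderOwner) StartSession() *RecoverySession {
	o.nextSess++
	o.Active = &RecoverySession{SessionID: o.nextSess, Epoch: o.Epoch}
	return o.Active
}

// Complete applies a session result only if the session still carries
// current authority; a superseded session or an older epoch is dropped.
func (o *SenderOwner) Complete(s *RecoverySession, state string) bool {
	if o.Active == nil || s.SessionID != o.Active.SessionID || s.Epoch != o.Epoch {
		return false // stale: must not mutate current sender state
	}
	o.State = state
	o.Active = nil
	return true
}

func main() {
	o := &SenderOwner{ReplicaID: "r1", Epoch: 7, State: "degraded"}
	old := o.StartSession()
	o.Epoch++             // coordinator bumps the epoch mid-recovery
	cur := o.StartSession() // new session under the new epoch
	fmt.Println(o.Complete(old, "in_sync")) // stale result is fenced out
	fmt.Println(o.Complete(cur, "in_sync")) // current result applies
	fmt.Println(o.State)
}
```

The design choice this illustrates: superseding is decided by comparing authority tokens, never by timing, so a slow stale session can finish late without corrupting current state.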
## Minimal State Transitions

### Healthy path

1. the replica sender exists
2. the sender ships normally
3. the replica remains `InSync`

### Recovery path

1. the sender detects, or is told, that the replica is not healthy
2. the coordinator provides valid assignment/endpoint truth
3. the primary creates a recovery session
4. the session connects
5. the session catches up if the replica is recoverable
6. on success:
   - the session closes
   - the steady sender resumes its normal state

### Rebuild path

1. the session determines catch-up is not sufficient
2. the session transitions to `needs_rebuild`
3. a higher-layer rebuild flow takes over

## What This Slice Does Not Include

Not in the first slice:

- Smart WAL payload classes in production
- snapshot pinning / GC logic
- new on-disk engine
- frontend publication changes
- full production event scheduler

## Proposed V2 Workspace Target

Do this under `sw-block/`, not `weed/storage/blockvol/`.

Suggested area:

- `sw-block/prototype/enginev2/`

Suggested first files:

- `sw-block/prototype/enginev2/session.go`
- `sw-block/prototype/enginev2/sender.go`
- `sw-block/prototype/enginev2/group.go`
- `sw-block/prototype/enginev2/session_test.go`

The first code does not need full storage I/O.
It should prove ownership and transition shape first.

## Acceptance For This Slice

The slice is good enough when:

1. sender identity is stable per replica
2. changed-address reassignment updates the right sender owner
3. multiple reconnect cycles do not lose recovery ownership
4. a stale session does not survive an epoch bump
5. the four Phase 13 V2-boundary tests have a clear path to becoming satisfiable

## Relationship To Existing Simulator

This slice should align with:

- `v2-acceptance-criteria.md`
- `v2-open-questions.md`
- `v1-v15-v2-comparison.md`
- `distsim` / `eventsim` behavior

The simulator remains the design oracle.
The first implementation slice should not contradict it.

---

# V2 Production Roadmap

Date: 2026-03-30
Status: historical roadmap
Purpose: define the path from the accepted V2 engine core to a production candidate

## Current Position

Completed:

1. design / FSM closure
2. simulator / protocol validation
3. prototype closure
4. evidence hardening
5. engine core slices:
   - Slice 1: ownership core
   - Slice 2: recovery execution core
   - Slice 3: data / recoverability core
   - Slice 4: integration closure

Current stage:

- entering broader engine implementation

This means the main risk is no longer:

- whether the V2 idea stands up

The main risk is:

- whether the accepted engine core can be turned into a real system without reintroducing V1/V1.5 structure and semantics

## Roadmap Summary

1. Phase 06: broader engine implementation stage
2. Phase 07: real-system integration / product-path decision
3. Phase 08: pre-production hardening
4. Phase 09: performance / scale / soak validation
5. Phase 10: production candidate and rollout gate

## Phase 06

### Goal

Connect the accepted engine core to:

1. real control truth
2. real storage truth
3. explicit engine execution steps

### Outputs

1. control-plane adapter into the engine core
2. storage/base/recoverability adapters
3. explicit execution-driver model where synchronous helpers are no longer sufficient
4. validation against selected real failure classes

### Gate

At the end of Phase 06, the project should be able to say:

- the engine core can live inside a real system shape

## Phase 07

### Goal

Move from engine-local correctness to a real runnable subsystem.

### Outputs

1. service-style runnable engine slice
2. integration with real control and storage surfaces
3. crash/failover/restart integration tests
4. decision on the first viable product path

### Gate

At the end of Phase 07, the project should be able to say:

- the engine can run as a real subsystem, not only as an isolated core

## Phase 08

### Goal

Turn correctness into operational safety.

### Outputs

1. observability hardening
2. operator/debug flows
3. recovery/runbook procedures
4. config surface cleanup
5. realistic durability/restart validation

### Gate

At the end of Phase 08, the project should be able to say:

- operators can run, debug, and recover the system safely

## Phase 09

### Goal

Prove viability under load and over time.

### Outputs

1. throughput / latency baselines
2. rebuild / catch-up cost characterization
3. steady-state overhead measurement
4. soak testing
5. scale and failure-under-load validation

### Gate

At the end of Phase 09, the project should be able to say:

- the design is not only correct, but viable at useful scale and duration

## Phase 10

### Goal

Produce a controlled production candidate.

### Outputs

1. feature-gated production candidate
2. rollback strategy
3. migration/coexistence plan with V1
4. staged rollout plan
5. production acceptance checklist

### Gate

At the end of Phase 10, the project should be able to say:

- the system is ready for a controlled production rollout

## Cross-Phase Rules

### Rule 1: Do not reopen protocol shape casually

The accepted core should remain stable unless new implementation evidence forces a change.

### Rule 2: Use V1 as a validation source, not a design template

Use:

1. `learn/projects/sw-block/`
2. `weed/storage/block*`

for:

1. failure gates
2. constraints
3. integration references

Do not use them as the default V2 architecture template.

### Rule 3: Keep `CatchUp` narrow

Do not let later implementation phases re-expand `CatchUp` into a broad, optimistic, long-lived recovery mode.

### Rule 4: Keep evidence quality ahead of object growth

New work should preferentially improve:

1. traceability
2. diagnosability
3. real-failure validation
4. operational confidence

rather than simply adding new objects, states, or mechanisms.

## Production Readiness Ladder

The project should move through this ladder explicitly:

1. proof-of-design
2. proof-of-engine-shape
3. proof-of-runnable-engine-stage
4. proof-of-operable-system
5. proof-of-viable-production-candidate

Current ladder position:

- between `2` and `3`
- engine core accepted; broader runnable engine stage underway

## Next Documents To Maintain

1. `sw-block/.private/phase/phase-06.md`
2. `sw-block/docs/archive/design/v2-engine-readiness-review.md`
3. `sw-block/docs/archive/design/v2-engine-slicing-plan.md`
4. this roadmap
@ -0,0 +1,239 @@ |
|||
# V2 Prototype Roadmap And Gates

Date: 2026-03-27
Status: historical prototype roadmap
Purpose: define the remaining prototype roadmap, the validation gates between stages, and the decision point between real V2 engine work and a possible `V2.5` redesign

## Current Position

V2 design/FSM/simulator work is sufficiently closed for serious prototyping, but not frozen against later `V2.5` adjustments.

Current state:

- design proof: high
- execution proof: medium
- data/recovery proof: low
- prototype end-to-end proof: low

Rough prototype progress:

- `25%` to `35%`

This is an early executable prototype, not an engine-ready prototype.

## Roadmap Goal

Answer these questions with prototype evidence:

- can V2 become a real engine path?
- or should it become `V2.5` before real implementation begins?
|||
## Step 1: Execution Authority Closure

Purpose:

- finish the sender / recovery-session authority model so stale work is unambiguously rejected

Scope:

1. ownership-only `AttachSession()` / `SupersedeSession()`
2. execution begins only through execution APIs
3. stale handshake / progress / completion fenced by `sessionID`
4. endpoint bump / epoch bump invalidate execution authority
5. sender-group preserve-or-kill behavior is explicit

Done when:

1. all execution APIs are sender-gated and reject stale `sessionID`
2. session creation is separated from execution start
3. phase ordering is enforced
4. endpoint bump / epoch bump invalidate execution authority correctly
5. mixed add/remove/update reconciliation preserves or kills state exactly as intended

Main files:

- `sw-block/prototype/enginev2/`
- `sw-block/prototype/distsim/`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`

Key gate:

- old recovery work cannot mutate current sender state at any execution stage
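
The fencing rules in Step 1 can be sketched as a minimal model. This is an illustrative Python sketch, not the prototype's actual API: `Sender`, `attach_session`, `bump_endpoint`, and `accept_progress` are assumed names standing in for the ownership-only `AttachSession()` / `SupersedeSession()` split and the `sessionID` fence described above.

```python
from dataclasses import dataclass

@dataclass
class Sender:
    """Minimal sender-side authority model (illustrative names only)."""
    endpoint_epoch: int = 0   # bumped on endpoint/epoch change
    session_id: int = 0       # 0 means no recovery session attached
    _next_session: int = 1

    def attach_session(self) -> int:
        # Ownership-only: claims (or supersedes) the session slot,
        # but starts no execution work by itself.
        self.session_id = self._next_session
        self._next_session += 1
        return self.session_id

    def bump_endpoint(self) -> None:
        # An endpoint/epoch bump invalidates any outstanding execution authority.
        self.endpoint_epoch += 1
        self.session_id = 0

    def accept_progress(self, session_id: int) -> bool:
        # Execution APIs are sender-gated: handshake/progress/completion
        # carrying a stale sessionID must be rejected, never applied.
        return session_id != 0 and session_id == self.session_id
```

The point of the sketch is the invariant, not the mechanism: once the endpoint epoch bumps or a newer session attaches, every execution-path call carrying the old `sessionID` is rejected rather than silently applied.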
|||
## Step 2: Orchestrated Recovery Prototype

Purpose:

- move from good local sender APIs to an actual prototype recovery flow driven by assignment/update intent

Scope:

1. assignment/update intent creates or supersedes recovery attempts
2. reconnect / reassignment / catch-up / rebuild decision path
3. sender-group becomes the orchestration entry point
4. explicit outcome branching:
   - zero-gap fast completion
   - positive-gap catch-up
   - unrecoverable gap -> `NeedsRebuild`

Done when:

1. the prototype expresses a realistic recovery flow from topology/control intent
2. sender-group drives recovery creation, not only unit helpers
3. recovery outcomes are explicit and testable
4. orchestrator responsibility is clear enough to narrow `v2-open-questions.md` item 6

Key gate:

- recovery control is no longer scattered across helper calls; it has one clear orchestration path
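
The outcome branching above can be sketched as a single classification step. This is a hedged Python sketch with assumed names (`classify_recovery`, `retained_from` are illustrative, not prototype identifiers); the real decision path also covers reconnect/reassignment state not modeled here.

```python
from enum import Enum

class Outcome(Enum):
    FAST_COMPLETE = "zero-gap fast completion"
    CATCH_UP = "positive-gap catch-up"
    NEEDS_REBUILD = "unrecoverable gap"

def classify_recovery(receiver_lsn: int, sender_lsn: int, retained_from: int) -> Outcome:
    """Branch a recovery attempt by gap size and history retention.

    receiver_lsn: last LSN the receiver holds
    sender_lsn:   the sender's current tip
    retained_from: oldest LSN still replayable from history
    """
    gap = sender_lsn - receiver_lsn
    if gap == 0:
        return Outcome.FAST_COMPLETE
    if receiver_lsn + 1 >= retained_from:
        return Outcome.CATCH_UP        # the whole gap is still replayable
    return Outcome.NEEDS_REBUILD       # history no longer covers the gap
```

Making the branch a pure function of observable state is what keeps the outcomes "explicit and testable" in the Done-when sense: each branch can be asserted directly instead of being inferred from helper side effects.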
|||
## Step 3: Minimal Historical Data Prototype

Purpose:

- prove the recovery model against real data-history assumptions, not only control logic

Scope:

1. minimal WAL/history model, not a full engine
2. enough to exercise:
   - catch-up range
   - retained prefix/window
   - rebuild fallback
   - historical correctness at target LSN
3. enough reservation/recoverability state to make recovery explicit

Done when:

1. the prototype can prove why a gap is recoverable or unrecoverable
2. catch-up and rebuild decisions are backed by minimal data/history state
3. `v2-open-questions.md` items 3, 4, 5 are closed or sharply narrowed
4. prototype evidence strengthens acceptance criteria `A5`, `A6`, and `A7`

Key gate:

- the prototype must explain why recovery is allowed, not just that policy says it is
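
A history model of this minimal shape is enough to make the "why" explicit. The Python sketch below is illustrative (`MiniWAL`, `truncate_before`, and `catch_up_range` are assumed names, not prototype APIs): recoverability falls out of whether the retained window still covers the gap, and an unrecoverable answer carries its reason instead of a bare policy verdict.

```python
class MiniWAL:
    """Minimal history model: a retained suffix of records keyed by LSN."""

    def __init__(self):
        self.records: dict[int, str] = {}
        self.next_lsn = 1
        self.retained_from = 1

    def append(self, payload: str) -> int:
        lsn = self.next_lsn
        self.records[lsn] = payload
        self.next_lsn += 1
        return lsn

    def truncate_before(self, lsn: int) -> None:
        # Drop the prefix; only [lsn, tip] remains replayable afterwards.
        for old in [l for l in self.records if l < lsn]:
            del self.records[old]
        self.retained_from = lsn

    def catch_up_range(self, from_lsn: int, target_lsn: int):
        """Return the replay slice, or (None, reason) when unrecoverable."""
        if from_lsn + 1 < self.retained_from:
            return None, (f"gap starts at {from_lsn + 1}, "
                          f"history retained from {self.retained_from}")
        slice_ = [self.records[l] for l in range(from_lsn + 1, target_lsn + 1)]
        return slice_, "recoverable"
```

The reason string is the load-bearing part: a `NeedsRebuild` decision in Step 2 can then point at concrete retention state rather than an opaque policy flag.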
|||
## Step 4: Prototype Scenario Closure

Purpose:

- make the prototype itself demonstrate the V2 story end-to-end

Scope:

1. map key V2 scenarios onto the prototype
2. express the 4 V2-boundary cases against prototype behavior
3. add one small end-to-end harness inside `sw-block/prototype/`
4. align prototype evidence with acceptance criteria

Done when:

1. prototype behavior can be reviewed scenario-by-scenario
2. key V1/V1.5 failures have prototype equivalents
3. prototype outcomes match intended V2 design claims
4. remaining gaps are clearly real-engine gaps, not protocol/prototype ambiguity

Key gate:

- a reviewer can trace acceptance criteria -> scenario -> prototype behavior without hand-waving
|||
## Gates

### Gate 1: Design Closed Enough

Status:

- mostly passed

Meaning:

1. acceptance criteria exist
2. core simulator exists
3. the ownership gap from V1.5 is understood

### Gate 2: Execution Authority Closed

Passes after Step 1.

Meaning:

- stale execution results cannot mutate current authority

### Gate 3: Orchestrated Recovery Closed

Passes after Step 2.

Meaning:

- recovery flow is controlled by one coherent orchestration model

### Gate 4: Historical Data Model Closed

Passes after Step 3.

Meaning:

- catch-up vs rebuild is backed by executable data-history logic

### Gate 5: Prototype Convincing

Passes after Step 4.

Meaning:

- enough evidence exists to choose:
  - the real V2 engine path
  - or a `V2.5` redesign
|||
## Decision Gate After Step 4

### Path A: Real V2 Engine Planning

Choose this if:

1. prototype control logic is coherent
2. the recovery boundary is explicit
3. boundary cases are convincing
4. no major structural flaw remains

Outputs:

1. real engine slicing plan
2. migration/integration plan into the future standalone `sw-block`
3. explicit non-goals for the first production version

### Path B: V2.5 Redesign

Choose this if the prototype reveals:

1. ownership/orchestration is still too fragile
2. the recovery boundary is still too implicit
3. the historical correctness model is too costly or too unclear
4. too much complexity leaks into the hot path

Output:

- write `V2.5` as a design/prototype correction before engine work begins
|||
## What Not To Do Yet

1. no Smart WAL expansion beyond what Step 3 minimally needs
2. no backend/storage-engine redesign
3. no V1 production integration
4. no frontend/wire protocol work
5. no performance optimization as a primary goal

## Practical Summary

Current sequence:

1. finish execution authority
2. build orchestrated recovery
3. add minimal historical-data proof
4. close key scenarios against the prototype
5. decide:
   - V2 engine
   - or `V2.5`