feat: Phase 09 P0 — production execution closure plan
Execution-closure targets:
- P1: TransferFullBase - reuse rebuild.go TCP protocol
- P2: TransferSnapshot - checkpoint image + WAL tail
- P3: TruncateWAL - AdvanceTail + superblock update
- P4: Runtime ownership - V2 orchestrator drives execution

Key reuse sources identified:
- rebuild.go: rebuildFullExtent (client), RebuildServer (server)
- wal_writer.go: AdvanceTail
- flusher.go: updateSuperblockCheckpoint
- blockvol.go: ScanWALEntries (already wired)

Slice order: full-base first (highest value), then snapshot, then truncation, then runtime ownership.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feature/sw-block
11 changed files with 2092 additions and 60 deletions
109 sw-block/.private/phase/phase-08-decisions.md
397 sw-block/.private/phase/phase-08-log.md
397 sw-block/.private/phase/phase-08.md
26 sw-block/.private/phase/phase-09-decisions.md
269 sw-block/.private/phase/phase-09-log.md
127 sw-block/.private/phase/phase-09.md
3 sw-block/design/README.md
25 sw-block/design/agent_dev_process.md
301 sw-block/design/phase-08-engine-skeleton-map.md
231 sw-block/design/v2-phase-development-plan.md
267 sw-block/design/v2-product-completion-overview.md
@@ -0,0 +1,26 @@
# Phase 09 Decisions

## Decision 1: Phase 09 is production execution closure, not packaging

The candidate-path packaging/judgment work remains inside `Phase 08 P4`.

`Phase 09` starts directly with substantial backend engineering closure.

## Decision 2: The first Phase 09 targets are real transfer, truncation, and stronger runtime ownership

The initial heavy execution blockers are:

1. real `TransferFullBase`
2. real `TransferSnapshot`
3. real `TruncateWAL`
4. stronger live runtime execution ownership

## Decision 3: Phase 09 remains bounded to the chosen candidate path unless evidence forces expansion

Default scope remains:

1. `RF=2`
2. `sync_all`
3. existing master / volume-server heartbeat path

Future paths or durability modes should not be absorbed casually into this phase.
@@ -0,0 +1,269 @@
# Phase 09 Log

## 2026-03-31

### Opened

`Phase 09` opened as:

- production execution closure

### Starting basis

1. `Phase 08`: closed
2. chosen candidate path exists for `RF=2 sync_all`
3. main remaining heavy engineering work is backend production-grade execution

### Next

1. `Phase 09 P0` planning for:
   - real `TransferFullBase`
   - real `TransferSnapshot`
   - real `TruncateWAL`
   - stronger live runtime execution ownership

### P0 Technical Pack

Purpose:

- provide the minimum design/algo/test detail needed to start `Phase 09`
- keep the work centered on backend production execution closure
- avoid broad scope growth into control-plane redesign or product-surface work

#### Execution-closure target

`Phase 09` should close this gap:

- current path is candidate-safe but still partially validation-grade
- next path must be backend-production-grade for the chosen `RF=2 sync_all` path

This phase should not try to make every surrounding product surface complete.
It should make the backend execution path real enough that later phases can build on it.

#### What "real" means in this phase

Use these definitions.

1. Real `TransferFullBase`
   - not only "extent is accessible"
   - must read and transfer real block/base contents through the execution path
   - completion must depend on the transfer actually occurring

2. Real `TransferSnapshot`
   - not only "checkpoint exists and is readable"
   - must read and transfer the snapshot/base image through the execution path
   - tail replay must remain aligned with the transferred snapshot boundary

3. Real `TruncateWAL`
   - not only "replica ahead detected"
   - must execute the physical correction required by the chosen path
   - completion must depend on truncation having actually happened

4. Stronger live runtime execution ownership
   - V2 recovery execution should be driven by a stronger live runtime path than test-only orchestration
   - the volume-server path should own plan / execute / cancel / cleanup more directly
   - avoid split ownership where tests prove the path but the running service still does not drive it coherently
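
The first definition, "completion must depend on the transfer actually occurring", can be illustrated with a minimal Go sketch. Everything here is an assumption made for illustration: `transferFullBase`, `demoTransfer`, and `errShortTransfer` are hypothetical names, and the in-memory reader and buffer stand in for the real extent and the rebuild transport. It is not the `v2bridge` implementation.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// errShortTransfer marks a transfer that moved fewer bytes than the base holds.
var errShortTransfer = errors.New("full-base transfer incomplete")

// transferFullBase is a hypothetical stand-in, not the real v2bridge method.
// It streams exactly size bytes from src to dst and reports success only
// when every byte was copied, so completion depends on the transfer itself.
func transferFullBase(dst io.Writer, src io.Reader, size int64) (int64, error) {
	n, err := io.CopyN(dst, src, size)
	if errors.Is(err, io.EOF) {
		return n, fmt.Errorf("%w: copied %d of %d bytes", errShortTransfer, n, size)
	}
	if err != nil {
		return n, fmt.Errorf("full-base transfer failed after %d bytes: %w", n, err)
	}
	return n, nil
}

// demoTransfer copies base into an in-memory replica, claiming it holds size bytes.
func demoTransfer(base []byte, size int64) (int64, []byte, error) {
	var replica bytes.Buffer
	n, err := transferFullBase(&replica, bytes.NewReader(base), size)
	return n, replica.Bytes(), err
}

func main() {
	// A complete transfer: all 12 bytes arrive.
	n, got, err := demoTransfer([]byte("extent-bytes"), 12)
	fmt.Println(n, string(got), err)

	// A short source: the call fails instead of reporting completion.
	_, _, err = demoTransfer([]byte("short"), 12)
	fmt.Println(errors.Is(err, errShortTransfer))
}
```

An accessibility-only check would return success in both cases; the sketch returns an error whenever the byte count falls short of the declared size.
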

#### Recommended slice order inside Phase 09

Keep the phase substantial, but still ordered by dependency:

1. `P1` full-base execution closure
   - make `TransferFullBase` real
   - prove rebuild path no longer depends on accessibility-only validation

2. `P2` snapshot execution closure
   - make `TransferSnapshot` real
   - prove snapshot/tail rebuild path can use a real transferred base

3. `P3` truncation execution closure
   - make `TruncateWAL` real
   - prove replica-ahead path is executable, not only detectable

4. `P4` stronger live runtime ownership
   - move the accepted execution logic closer to the real volume-server/runtime loop
   - prove cleanup / cancel / replacement under the stronger live path

This order is recommended because:

1. transfer closure is the largest production blocker
2. truncation depends on a clearer execution contract
3. runtime ownership should build on real execution, not on validation-grade stubs

#### Design rules

1. engine still owns policy
   - do not move catch-up / rebuild / truncation decision logic into `v2bridge` or `blockvol`

2. `v2bridge` owns real execution translation
   - implement real transfer/truncate behavior there or through bounded runtime hooks
   - keep it as the execution adapter, not the policy owner

3. `blockvol` owns storage/runtime reality
   - WAL
   - checkpoint
   - extent/snapshot data
   - low-level execution primitives

4. chosen-path bounds remain explicit
   - `RF=2`
   - `sync_all`
   - existing master / volume-server heartbeat path
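
The ownership split in the design rules can be sketched in Go as follows. This is only an illustration under stated assumptions: `Action`, `Storage`, `Executor`, and `fakeStore` are hypothetical types, not the actual engine, `v2bridge`, or `blockvol` APIs. The adapter receives a decision that was already made elsewhere and maps it to one storage primitive; it contains no decision logic of its own.

```go
package main

import "fmt"

// Action is a recovery decision. In this sketch only the engine would produce it.
type Action int

const (
	ActionNone Action = iota
	ActionTransferFullBase
	ActionTransferSnapshot
	ActionTruncateWAL
)

// Storage is a hypothetical view of the low-level primitives blockvol owns.
type Storage interface {
	CopyBase() error
	CopySnapshot() error
	TruncateTo(lsn uint64) error
}

// Executor plays the role of the execution adapter (v2bridge in the text).
// It translates a decision into a storage call and decides nothing itself.
type Executor struct{ store Storage }

// Execute runs the storage primitive that corresponds to the given action.
func (e Executor) Execute(a Action, lsn uint64) error {
	switch a {
	case ActionNone:
		return nil
	case ActionTransferFullBase:
		return e.store.CopyBase()
	case ActionTransferSnapshot:
		return e.store.CopySnapshot()
	case ActionTruncateWAL:
		return e.store.TruncateTo(lsn)
	default:
		return fmt.Errorf("unknown action %d", a)
	}
}

// fakeStore records which primitives were invoked.
type fakeStore struct{ calls []string }

func (s *fakeStore) CopyBase() error     { s.calls = append(s.calls, "copy-base"); return nil }
func (s *fakeStore) CopySnapshot() error { s.calls = append(s.calls, "copy-snapshot"); return nil }
func (s *fakeStore) TruncateTo(lsn uint64) error {
	s.calls = append(s.calls, fmt.Sprintf("truncate@%d", lsn))
	return nil
}

func main() {
	store := &fakeStore{}
	exec := Executor{store: store}
	// The engine (not shown) made this decision; the adapter only carries it out.
	if err := exec.Execute(ActionTruncateWAL, 42); err != nil {
		fmt.Println("execute failed:", err)
	}
	fmt.Println(store.calls)
}
```

Keeping `Execute` free of comparisons on LSNs or roles is the point of the sketch: any such comparison would be policy and belongs to the engine.
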

#### Primary reuse / update targets

For each target, state whether the expected action is:

1. `update in place`
2. `reference only`
3. `copy is allowed`

For `Phase 09`, the main targets are:

1. `weed/storage/blockvol/v2bridge/executor.go`
   - action: `update in place`
   - why:
     - this is the direct V2 execution adapter
     - real transfer/truncate closure belongs here first
   - boundary:
     - add real execution behavior
     - do not add policy decisions here

2. `weed/storage/blockvol/blockvol.go`
   - action: `update in place`
   - why:
     - authoritative runtime/storage hooks live here
     - transfer/truncate/recovery primitives may need to be exposed or tightened here
   - boundary:
     - expose/execute real runtime behavior
     - do not let old replication semantics redefine V2 truth

3. `weed/storage/blockvol/rebuild.go`
   - action: `reference first`, then `update in place if needed`
   - why:
     - existing rebuild transport/runtime reality may be reused
   - boundary:
     - reuse transfer reality
     - keep rebuild-source choice and recovery policy in the V2 engine

4. `weed/server/volume_server_block.go`
   - action: `update in place`
   - why:
     - stronger live runtime execution ownership will likely terminate here
   - boundary:
     - strengthen live orchestration/runtime handoff
     - do not collapse V2 boundaries into server-local convenience semantics

5. `weed/storage/blockvol/v2bridge/control.go`
   - action: `reference only` for `Phase 09` unless execution work exposes a real gap
   - why:
     - `Phase 09` is not the main control-plane closure phase

6. product surfaces (`CSI`, `NVMe`, `iSCSI`)
   - action: `reference only`
   - why:
     - not in scope for this phase
     - avoid accidental scope growth

7. old V1 shipper/rebuild execution paths
   - action: `reference only`, `copy is not default`
   - why:
     - they are reality/integration references
     - not the default semantic template for V2

When `sw` submits a slice package in this phase, include a short reuse note:

- files updated in place
- files used as references only
- any file copied from older code and why copy was safer than in-place update

#### Validation expectations

For each execution closure target, require:

1. one-chain proof
   - engine plan
   - engine executor/runtime driver
   - `v2bridge`
   - real `blockvol` operation
   - completion
   - cleanup

2. physical-effect proof
   - the operation must do real transfer/truncate work, not only validation

3. fail-closed behavior
   - partial failure
   - cancellation
   - replacement
   - all release resources correctly

4. observability
   - logs explain:
     - why the execution started
     - what exact execution path ran
     - why it completed / failed / cancelled
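
The fail-closed expectation, that success, partial failure, and cancellation all release resources, might look like this in Go. It is a sketch under assumptions: `pin`, `runPinned`, and `runDemo` are hypothetical helpers, and a boolean flag stands in for a real WAL, snapshot, or full-base pin.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// pin is a hypothetical retention pin; released records that Release ran.
type pin struct{ released bool }

func (p *pin) Release() { p.released = true }

// runPinned runs work while holding p and always releases p afterwards,
// whether work succeeds, fails partway, or is cancelled. The returned
// outcome string is what a log line would report.
func runPinned(ctx context.Context, p *pin, work func(context.Context) error) (string, error) {
	defer p.Release()
	err := work(ctx)
	switch {
	case err == nil:
		return "completed", nil
	case errors.Is(err, context.Canceled):
		return "cancelled", err
	default:
		return "failed", err
	}
}

// runDemo drives runPinned through one of three scenarios: "ok", "fail", "cancel".
func runDemo(kind string) (string, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	p := &pin{}
	work := func(ctx context.Context) error {
		switch kind {
		case "fail":
			return errors.New("partial transfer")
		case "cancel":
			cancel() // simulate an external cancellation mid-execution
			return ctx.Err()
		}
		return nil
	}
	outcome, _ := runPinned(ctx, p, work)
	return outcome, p.released
}

func main() {
	for _, kind := range []string{"ok", "fail", "cancel"} {
		outcome, released := runDemo(kind)
		fmt.Printf("scenario=%s outcome=%s released=%v\n", kind, outcome, released)
	}
}
```

A cleanup assertion in a test then reduces to checking that the pin is released in every scenario, not just the successful one.
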

#### Suggested validation package

Keep the package focused:

1. one full-base execution test
2. one snapshot execution test
3. one truncation execution test
4. one live runtime ownership test
5. one cleanup/adversarial package covering:
   - cancel
   - replacement
   - partial failure

Avoid:

1. broad matrix growth before these closures are real
2. product-surface tests (`CSI` / `NVMe` / `iSCSI`) in this phase
3. new protocol exploration

#### Assignment template for `sw`

1. Goal
   - Build the `Phase 09` production-execution closure package for the chosen candidate path.

2. Required outputs
   - explicit definition of "real" for each target:
     - `TransferFullBase`
     - `TransferSnapshot`
     - `TruncateWAL`
     - stronger runtime ownership
   - slice/package order inside `Phase 09`
   - implementation plan for the first heavy closure target
   - expected tests/evidence for each target

3. Hard rules
   - no protocol redesign
   - no broad scope growth into product surfaces
   - no moving policy logic into `v2bridge` / `blockvol`
   - keep the chosen-path bound explicit

4. Delivery order
   - first hand to architect review
   - only after architect review passes, hand to tester validation

5. Reject before handoff if
   - "real" is still defined as accessibility-only validation
   - target order ignores execution dependency
   - runtime ownership remains too vague to verify

#### Assignment template for `tester`

1. Goal
   - Validate that the `Phase 09` execution-closure plan is concrete enough to support substantial engineering work.

2. Validate
   - each target has a real/physical-effect definition
   - each target has a one-chain proof expectation
   - cleanup and fail-closed expectations are explicit
   - the phase remains bounded to the chosen path

3. Output
   - pass/fail on execution-target clarity
   - findings on vague "real" definitions, missing cleanup proofs, or hidden scope growth
@@ -0,0 +1,127 @@
# Phase 09

Date: 2026-03-31
Status: active
Purpose: turn the accepted candidate-safe backend path into a production-grade execution path without reopening accepted V2 recovery semantics

## Why This Phase Exists

`Phase 08` closed:

1. real control delivery on the chosen path
2. real one-chain catch-up and rebuild closure on the chosen path
3. unified hardening replay on the accepted live path
4. one bounded candidate package for `RF=2 sync_all`

What still does not exist is production-grade execution completeness.

The main remaining gap is no longer:

1. whether the path is candidate-safe

It is now:

1. whether the backend execution path is production-grade rather than validation-grade

## Phase Goal

Close the main backend execution gaps so the chosen path is no longer blocked by validation-grade transfer/truncation behavior.

## Scope

### In scope

1. real `TransferFullBase`
2. real `TransferSnapshot`
3. real `TruncateWAL`
4. stronger live runtime execution ownership on the volume-server path

### Out of scope

1. broad control-plane redesign
2. `RF>2`
3. `best_effort` / `sync_quorum` recovery semantics
4. product-surface rebinding (`CSI` / `NVMe` / `iSCSI`)
5. broad performance optimization

## Phase 09 Items

### P0: Production Execution Closure Plan

1. convert the accepted candidate package into a production-execution closure plan
2. define the minimum execution blockers that must be closed in this phase
3. order the execution work by dependency and risk
4. keep the chosen-path bound explicit while making the backend path production-grade

Goal:

- start `Phase 09` with one substantial execution-closure plan, not another light packaging round

Must prove:

1. the phase is centered on real backend execution work
2. the required closures are explicit:
   - `TransferFullBase`
   - `TransferSnapshot`
   - `TruncateWAL`
   - stronger runtime ownership
3. the phase remains bounded to the chosen candidate path unless new evidence expands it

Verification mechanism:

1. architect review:
   - phase shape is substantial and outcome-based
   - work is ordered by real engineering dependency
2. tester review:
   - validation expectations are explicit for each execution closure target
3. manager review:
   - the phase is large enough to justify a full engineering round

Output artifacts:

1. explicit execution-closure target list
2. explicit execution blocker list
3. initial slice/package order inside `Phase 09`

Execution note:

- use `phase-09-log.md` as the technical pack for:
  - the definition of "real" for each execution target
  - recommended slice order
  - validation expectations
  - assignment templates for `sw` and `tester`

Reject if:

1. `Phase 09` is framed as another packaging/documentation phase
2. execution blockers remain implicit
3. the phase quietly expands into product surfaces or unrelated control-plane work
4. the phase has no clear verification mechanism

## Assignment For `sw`

Current next tasks:

1. define the concrete execution-closure package for `Phase 09`
2. specify what "real" means for:
   - `TransferFullBase`
   - `TransferSnapshot`
   - `TruncateWAL`
3. specify how stronger live runtime execution ownership should work on the volume-server path
4. keep the phase bounded to the chosen candidate path unless new evidence forces expansion
5. hand the package to architect review before tester work begins

## Assignment For `tester`

Current next tasks:

1. prepare the validation oracle for production execution closure
2. require explicit validation targets for:
   - real transfer behavior
   - truncation execution
   - cleanup on success/failure/cancel
   - stronger live runtime ownership
3. keep no-overclaim active around:
   - validation-grade vs production-grade execution
   - chosen path vs future paths/modes
4. review only after architect pre-review passes
@@ -0,0 +1,301 @@
# Phase 08 Engine Skeleton Map

Date: 2026-03-31
Status: active
Purpose: provide a short structural map for the `Phase 08` hardening path so implementation can move faster without reopening accepted V2 boundaries

## Scope

This is not the final standalone `sw-block` architecture.

It is the shortest useful engine skeleton for the accepted `Phase 08` hardening path:

- `RF=2`
- `sync_all`
- existing `Seaweed` master / volume-server heartbeat path
- V2 engine owns recovery policy
- `blockvol` remains the execution backend

## Module Map

### 1. Control plane

Role:

- authoritative control truth

Primary sources:

- `weed/server/master_grpc_server.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/server/volume_grpc_client_to_master.go`

What it produces:

- confirmed assignment
- `Epoch`
- target `Role`
- failover / promotion / reassignment result
- stable server identity

### 2. Control bridge

Role:

- translate real control truth into V2 engine intent

Primary files:

- `weed/storage/blockvol/v2bridge/control.go`
- `sw-block/bridge/blockvol/control_adapter.go`
- entry path in `weed/server/volume_server_block.go`

What it produces:

- `AssignmentIntent`
- stable `ReplicaID`
- `Endpoint`
- `SessionKind`

### 3. Engine runtime

Role:

- recovery-policy core

Primary files:

- `sw-block/engine/replication/orchestrator.go`
- `sw-block/engine/replication/driver.go`
- `sw-block/engine/replication/executor.go`
- `sw-block/engine/replication/sender.go`
- `sw-block/engine/replication/history.go`

What it decides:

- zero-gap / catch-up / needs-rebuild
- sender/session ownership
- stale authority rejection
- resource acquisition / release
- rebuild source selection
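
The first decision in this list can be sketched as a pure function. The map does not spell out the engine's actual criteria, so the comparison rules below are assumptions chosen for illustration: a replica is classified from its own LSN, the primary's committed LSN, and the oldest LSN still retained in the primary's WAL. The `replica-ahead` outcome is added here because the Phase 09 truncation target refers to it; `classify` and `Outcome` are hypothetical names.

```go
package main

import "fmt"

// Outcome is the recovery class the engine would assign to a replica.
type Outcome string

const (
	ZeroGap      Outcome = "zero-gap"
	CatchUp      Outcome = "catch-up"
	NeedsRebuild Outcome = "needs-rebuild"
	ReplicaAhead Outcome = "replica-ahead"
)

// classify is a hypothetical decision rule, not the engine's real logic.
//   replicaLSN   - last LSN the replica holds
//   committedLSN - last committed LSN on the primary
//   walTailLSN   - oldest LSN still retained in the primary's WAL
func classify(replicaLSN, committedLSN, walTailLSN uint64) Outcome {
	switch {
	case replicaLSN > committedLSN:
		// The replica holds entries the primary never committed.
		return ReplicaAhead
	case replicaLSN == committedLSN:
		return ZeroGap
	case replicaLSN+1 >= walTailLSN:
		// The next entry the replica needs is still in the retained WAL.
		return CatchUp
	default:
		// The gap reaches below the retained WAL; only a base transfer can close it.
		return NeedsRebuild
	}
}

func main() {
	fmt.Println(classify(100, 100, 50))
	fmt.Println(classify(80, 100, 50))
	fmt.Println(classify(10, 100, 50))
	fmt.Println(classify(120, 100, 50))
}
```

Because the function takes only plain values and touches no storage, it fits the rule that policy stays in the engine while `blockvol` supplies the facts.
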

### 4. Storage bridge

Role:

- translate real blockvol storage truth and execution capability into engine-facing adapters

Primary files:

- `weed/storage/blockvol/v2bridge/reader.go`
- `weed/storage/blockvol/v2bridge/pinner.go`
- `weed/storage/blockvol/v2bridge/executor.go`
- `sw-block/bridge/blockvol/storage_adapter.go`

What it provides:

- `RetainedHistory`
- WAL retention pin / release
- snapshot pin / release
- full-base pin / release
- WAL scan execution

### 5. Block runtime

Role:

- execute real I/O

Primary files:

- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/recovery.go`
- `weed/storage/blockvol/rebuild.go`
- `weed/storage/blockvol/wal_shipper.go`

What it owns:

- WAL
- extent
- flusher
- checkpoint / superblock
- receiver / shipper
- rebuild server

## Execution Order

### Control path

```text
master heartbeat / failover truth
-> BlockVolumeAssignment
-> volume server ProcessAssignments
-> v2bridge control conversion
-> engine ProcessAssignment
-> sender/session state updated
```
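
The last two steps of the control path, where the engine accepts an assignment and updates session state, are also where stale authority has to be rejected. A minimal Go sketch of epoch-based rejection follows. All names (`assignment`, `engine`, `processAssignment`) are hypothetical, and the rule "reject any epoch lower than the last accepted one" is an assumption about how `Epoch` is used, not a description of the real engine.

```go
package main

import "fmt"

// assignment is a hypothetical, reduced form of the intent the control
// bridge hands to the engine.
type assignment struct {
	ReplicaID string
	Epoch     uint64
	Role      string
}

// engine tracks the highest epoch accepted per replica.
type engine struct {
	lastEpoch map[string]uint64
}

func newEngine() *engine {
	return &engine{lastEpoch: make(map[string]uint64)}
}

// processAssignment accepts an assignment unless its epoch is older than
// the last one accepted for the same replica.
func (e *engine) processAssignment(a assignment) error {
	if last, seen := e.lastEpoch[a.ReplicaID]; seen && a.Epoch < last {
		return fmt.Errorf("stale assignment for %s: epoch %d < %d", a.ReplicaID, a.Epoch, last)
	}
	e.lastEpoch[a.ReplicaID] = a.Epoch
	return nil
}

func main() {
	e := newEngine()
	fmt.Println(e.processAssignment(assignment{ReplicaID: "replica-1", Epoch: 5, Role: "replica"}))
	// A delayed heartbeat carrying an older epoch is refused.
	fmt.Println(e.processAssignment(assignment{ReplicaID: "replica-1", Epoch: 4, Role: "replica"}))
	fmt.Println(e.processAssignment(assignment{ReplicaID: "replica-1", Epoch: 6, Role: "primary"}))
}
```

Keying the map by a stable replica identifier rather than by address mirrors the map's warning against address-shaped identity.
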

### Catch-up path

```text
assignment accepted
-> engine reads retained history
-> engine plans catch-up
-> storage bridge pins WAL retention
-> engine executor drives v2bridge executor
-> blockvol scans WAL / ships entries
-> engine completes session
```

### Rebuild path

```text
assignment accepted
-> engine detects NeedsRebuild
-> engine selects rebuild source
-> storage bridge pins snapshot/full-base/tail
-> executor drives transfer path
-> blockvol performs restore / replay work
-> engine completes rebuild
```

### Local durability path

```text
WriteLBA / Trim
-> WAL append
-> shipping / barrier
-> client-visible durability decision
-> flusher writes extent
-> checkpoint advances
-> retention floor decides WAL reclaimability
```
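
The final step of the local durability path, where a retention floor decides which WAL entries can be reclaimed, can be sketched as below. The exact rule is not given in this map, so the sketch assumes one: the floor is the lowest LSN that must still be kept, computed from the checkpoint and from any active recovery pins. `retentionFloor` and `reclaimable` are hypothetical helpers and do not describe the real `RetentionFloorFn`.

```go
package main

import "fmt"

// retentionFloor returns the lowest LSN that must stay in the WAL.
//   checkpointLSN - highest LSN already flushed to the extent (assumed meaning)
//   pins          - lowest LSN each active recovery session still needs
// Without pins, everything up to and including the checkpoint may go.
func retentionFloor(checkpointLSN uint64, pins []uint64) uint64 {
	floor := checkpointLSN + 1
	for _, p := range pins {
		if p < floor {
			floor = p
		}
	}
	return floor
}

// reclaimable reports whether a WAL entry lies below the floor.
func reclaimable(entryLSN, floor uint64) bool {
	return entryLSN < floor
}

func main() {
	// No recovery in progress: the checkpoint alone sets the floor.
	fmt.Println(retentionFloor(100, nil))

	// A catch-up session still needs entries from LSN 40 onward.
	floor := retentionFloor(100, []uint64{40, 70})
	fmt.Println(floor, reclaimable(39, floor), reclaimable(40, floor))
}
```

Under this assumption a pinned catch-up session holds the floor down even after the checkpoint advances, which is what keeps the WAL scan in the catch-up path safe.
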

## Interim Fields

These are currently acceptable only as explicit hardening carry-forwards:

### `localServerID`

Current source:

- `BlockService.listenAddr`

Meaning:

- temporary local identity source for replica/rebuild-side assignment translation

Status:

- interim only
- should become registry-assigned stable server identity later

### `CommittedLSN = CheckpointLSN`

Current source:

- `v2bridge.Reader` / `BlockVol.StatusSnapshot()`

Meaning:

- current V1-style interim mapping where committed truth collapses to local checkpoint truth

Status:

- not final V2 truth
- must become a gate decision before a production-candidate phase

### heartbeat as control carrier

Current source:

- existing master <-> volume-server heartbeat path

Meaning:

- current transport for assignment/control delivery

Status:

- acceptable as current carrier
- not yet a final proof that no separate control channel will ever be needed

## Hard Gates

These should remain explicit in `Phase 08`:

### Gate 1: committed truth

Before production-candidate:

- either separate `CommittedLSN` from `CheckpointLSN`
- or explicitly bound the first candidate path to currently proven pre-checkpoint replay behavior

### Gate 2: live control delivery

Required:

- real assignment delivery must reach the engine on the live path
- not only converter-level proof

### Gate 3: integrated catch-up closure

Required:

- engine -> executor -> `v2bridge` -> blockvol must be proven as one live chain
- not planner proof plus direct WAL-scan proof as separate evidence

### Gate 4: first rebuild execution path

Required:

- rebuild must not remain only a detection outcome
- the chosen product path needs one real executable rebuild closure

### Gate 5: unified replay

Required:

- after control and execution closure land, rerun the accepted failure-class set on the unified live path

## Reuse Map

### Reuse directly

- `weed/server/master_grpc_server.go`
- `weed/server/volume_grpc_client_to_master.go`
- `weed/server/volume_server_block.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/v2bridge/`

### Reuse as implementation reality, not truth

- `shipperGroup`
- `RetentionFloorFn`
- `ReplicaReceiver`
- checkpoint/superblock machinery
- existing failover heuristics

### Do not inherit as V2 semantics

- address-shaped identity
- old degraded/catch-up intuition from V1/V1.5
- `CommittedLSN = CheckpointLSN` as final truth
- blockvol-side recovery policy decisions

## Short Rule

Use this skeleton as:

- a hardening map for the current product path

Do not mistake it for:

- the final standalone `sw-block` architecture
@@ -0,0 +1,231 @@
# V2 Phase Development Plan

Date: 2026-03-31
Status: active
Purpose: define the execution-oriented phase plan after the current candidate-path work, with explicit module status and target phase ownership

## Why This Document Exists

The project now needs a development plan that is:

1. phase-oriented
2. execution-oriented
3. large enough to avoid overhead-heavy micro-slices
4. explicit about which module belongs to which future phase

This document is the planning bridge between:

1. `v2-product-completion-overview.md`
2. `../.private/phase/phase-08.md`
3. future implementation phases

## Planning Rules

Use these rules for all later phases:

1. one phase should close one meaningful product/engineering outcome
2. every phase must have a clear verification mechanism
3. phases should prefer real code/test/evidence over wording-only progress
4. later phases may reuse V1 engineering reality, but must not inherit V1 recovery semantics as truth
5. a phase is too small if it does not move the overall product-completion state clearly

## Current Baseline

Current accepted/closing path through `Phase 08`:

1. protocol/algo truth set is strong
2. engine recovery core is strong
3. real control delivery exists on the chosen path
4. real one-chain catch-up and rebuild closure exist on the chosen path
5. unified hardening validation exists on the chosen path
6. one bounded candidate statement exists for:
   - `RF=2`
   - `sync_all`
   - existing master / volume-server heartbeat path

Phase-accounting note:

1. this document assumes the current `Phase 08` path and bookkeeping are being closed consistently
2. if `Phase 08` bookkeeping is still open, read the candidate statement items above as the current accepted/closing path, not as a fully closed phase label

This means the next phases should focus mainly on:

1. production-grade execution completeness
2. stronger runtime ownership
3. stronger control-plane closure
4. later product-surface rebinding
5. production hardening

## Phase Roadmap

### Phase 09: Production Execution Closure

Goal:

1. turn validation-grade backend execution into production-grade backend execution

Must prove:

1. full-base rebuild performs real data transfer
2. snapshot rebuild performs real image transfer
3. replica-ahead path is physically executable, not only detected
4. runtime execution ownership is stronger than the current bounded candidate path

Typical outputs:

1. real `TransferFullBase`
2. real `TransferSnapshot`
3. real `TruncateWAL`
4. stronger executor/runtime integration in the live volume-server path

Verification mechanism:

1. one-chain execution tests on real backend paths
2. cleanup assertions after success/failure/cancel
3. focused adversarial tests for truncation and rebuild execution

Workload:

1. large
2. this is likely the single biggest remaining engineering phase

### Phase 10: Real Control-Plane Closure

Goal:

1. strengthen from accepted assignment-entry closure to fuller end-to-end control-plane closure

Must prove:

1. heartbeat/gRPC-level delivery is real for the chosen path
2. failover / reassignment state converges through the real control path
3. local and remote identity are consistent enough for product use

Typical outputs:

1. stronger heartbeat/gRPC delivery proof
2. stronger result/reporting convergence
3. cleaner local identity than transport-shaped `listenAddr`

Verification mechanism:

1. real failover/reassignment tests at the fuller control-plane level
2. identity/fencing assertions through the end-to-end path

Workload:

1. medium-large

### Phase 11: Product Surface Rebinding

Goal:

1. bind product-facing surfaces onto the V2-backed block path after backend closure is strong enough

Must prove:

1. the V2-backed backend can support selected product surfaces without semantic drift
2. reuse of V1 surfaces does not reintroduce V1 recovery truth

Candidate areas:

1. snapshot product path
2. `CSI`
3. `NVMe`
4. `iSCSI`

Verification mechanism:

1. selected surface integration tests
2. product-surface contract checks
3. no-overclaim review that the surface does not imply unsupported backend capability

Workload:

1. medium-large
2. can be split by product surface if needed, but only after backend closure is strong

### Phase 12: Production Hardening

Goal:

1. move from candidate-safe to production-safe

Must prove:

1. restart/recovery stability under repeated disturbance
2. long-run/soak viability
3. operational diagnosability
4. acceptable production blockers list or production-ready gate

Verification mechanism:

1. soak/adversarial runs
2. failover/restart under disturbance
3. runbook/debug validation

Workload:

1. large

## Module Status Map

| Module area | Current status | Current owner phase | Next target phase | Notes |
| --- | --- | --- | --- | --- |
| `sw-block/engine/replication` core FSM/orchestrator/driver | Strong | `Phase 08` accepted | `Phase 09` | Main later work is runtime/product execution closure, not new core semantics. |
| Engine executor real I/O boundary (`CatchUpIO` / `RebuildIO`) | Strong on chosen path | `Phase 08 P2/P3` | `Phase 09` | Keep the boundary; make underlying transfer/truncate production-grade. |
| `weed/storage/blockvol/v2bridge/control.go` | Strong on chosen path | `Phase 08 P1` | `Phase 10` | Next step is fuller control-plane closure, not new mapping semantics. |
| `weed/storage/blockvol/v2bridge/reader.go` | Strong | `Phase 08 P2/P3` | `Phase 09/10` | Keep comments/status aligned with candidate-path committed-truth decision. |
| `weed/storage/blockvol/v2bridge/pinner.go` | Strong | `Phase 08 P1/P3` | `Phase 09` | Retention safety proven; later work is product-grade execution under that safety. |
| `weed/storage/blockvol/v2bridge/executor.go` WAL scan | Strong | `Phase 08 P2` | `Phase 09` | Real scan is good; later work is real transfer/truncate completeness. |
| `v2bridge` `TransferFullBase` | Partial | `Phase 08 P2/P4` | `Phase 09` | Validation-grade now; target is real production streaming. |
| `v2bridge` `TransferSnapshot` | Partial | `Phase 08 P2/P4` | `Phase 09` | Validation-grade now; target is real image transfer. |
| `v2bridge` `TruncateWAL` | Weak/stub | `Phase 08 P4` bound | `Phase 09` | Must become a real executable path. |
| `weed/server/volume_server_block.go` V2 assignment intake | Medium-strong | `Phase 08 P1` | `Phase 09/10` | Real intake exists; later work is stronger runtime ownership + fuller control-plane proof. |
| `blockvol` WAL/flusher/checkpoint runtime | Reuse reality | Existing production code | `Phase 09` | Reuse implementation; do not let old semantics redefine V2 truth. |
| `blockvol` rebuild transport/server reality | Reuse with redesign boundary | Existing production code | `Phase 09` | Good area for production execution closure work. |
| local server identity (`localServerID`) | Partial | `Phase 08` bounded | `Phase 10` | Still transport-shaped; should become cleaner under control-plane closure. |
| Snapshot product path | Partial/reuse candidate | not core in `Phase 08` | `Phase 11` | Reuse implementation, but V2 semantics own placement and claims. |
| `CSI` integration | Deferred reuse candidate | not core in `Phase 08` | `Phase 11` | Product surface, not next core closure target. |
| `NVMe` / `iSCSI` front-ends | Deferred reuse candidate | not core in `Phase 08` | `Phase 11` | Rebind after backend path is stronger. |
| Testrunner / infra / metrics | Strong support layer | existing | `Phase 10-12` | Reuse to validate later control-plane and hardening phases. |

## Completion-State Targets

Use these rough targets to judge whether a phase is moving the product meaningfully.

| Phase | Expected completion move |
| --- | --- |
| `Phase 09` | from validation-grade backend execution to production-grade backend execution |
| `Phase 10` | from bounded control-entry proof to stronger end-to-end control-plane closure |
| `Phase 11` | from backend-ready path to selected product-surface readiness |
| `Phase 12` | from candidate-safe to production-safe |

## Near-Term Execution Direction
|||
|
|||
If the goal is to maximize product completion efficiently, the recommended order is: |
|||
|
|||
1. finish `Phase 08` bookkeeping cleanly |
|||
2. `Phase 09` production execution closure |
|||
3. `Phase 10` real control-plane closure |
|||
4. `Phase 11` product surface rebinding |
|||
5. `Phase 12` production hardening |
|||
|
|||
The most important near-term engineering weight should go to `Phase 09`. |
|||
|
|||
## Short Summary |
|||
|
|||
The V2 line already has a real bounded candidate path. |
|||
The next development plan should treat later work as product-completion phases, not more protocol discovery. |
|||
|
|||
The main heavy engineering work still ahead is: |
|||
|
|||
1. production-grade execution |
|||
2. stronger runtime/control closure |
|||
3. later product-surface rebinding |
|||
4. production hardening |
|||
|
|||
@ -0,0 +1,267 @@ |
|||
# V2 Product Completion Overview |
|||
|
|||
Date: 2026-03-31 |
|||
Status: active |
|||
Purpose: provide one product-level overview of current V2 engineering completion, V1 reuse strategy, and the roadmap from the accepted candidate path to a production-ready block engine |
|||
|
|||
## Why This Document Exists |
|||
|
|||
The project now has enough accepted V2 algorithm, engine, and hardening evidence that the next question is no longer only: |
|||
|
|||
1. is the protocol correct |
|||
|
|||
It is also: |
|||
|
|||
1. how complete is the product path |
|||
2. which parts are already strong |
|||
3. which parts can reuse V1 engineering |
|||
4. which parts still require major implementation work |
|||
5. which future phases actually move product completion |
|||
|
|||
This document is the product-completion view. |
|||
|
|||
It complements: |
|||
|
|||
1. `v2-protocol-truths.md` for accepted semantics |
|||
2. `v2-production-roadmap.md` for the older roadmap ladder |
|||
3. `../.private/phase/phase-08.md` for the current phase contract |
|||
|
|||
## Current Position |
|||
|
|||
The accepted first candidate path is: |
|||
|
|||
1. `RF=2` |
|||
2. `sync_all` |
|||
3. existing master / volume-server heartbeat path |
|||
4. V2 engine owns recovery policy |
|||
5. `v2bridge` translates real storage/control truth |
|||
6. `blockvol` remains the execution backend |
|||
|
|||
This means the project is no longer at "algorithm only". |
|||
It already has: |
|||
|
|||
1. accepted protocol truths |
|||
2. accepted engine execution closure |
|||
3. accepted hardening replay on a real integrated path |
|||
4. one bounded candidate statement |
|||
|
|||
## Engineering Completion Snapshot |
|||
|
|||
These levels are rough engineering estimates, not exact percentages. |
|||
|
|||
| Area | Current level | Notes | |
|||
|------|---------------|-------| |
|||
| Algorithm / protocol truths | Strong | Core V2 semantics are accepted and should remain stable unless contradicted by live evidence. | |
|||
| Simulator / prototype evidence | Strong | Main failure classes and protocol boundaries are already well-exercised. | |
|||
| Engine recovery core | Strong | Sender/session/orchestrator/driver/executor are substantially implemented. | |
|||
| Weed bridge integration | Strong | Reader / pinner / control / executor are real and tested on the chosen path. | |
|||
| Integrated candidate path | Medium-strong | `P1` + `P2` + `P3` prove one bounded candidate path. | |
|||
| Runtime ownership inside live server loop | Medium | Real intake exists, but full product-grade recovery ownership is not yet fully closed. | |
|||
| Production-grade data transfer | Medium-weak | Validation-grade transfer exists; full production byte streaming is still incomplete. | |
|||
| Truncation / replica-ahead execution | Weak | Detection exists; full execution path is still incomplete. | |
|||
| End-to-end control-plane closure | Medium | `ProcessAssignments()` is real; full heartbeat/gRPC proof is still bounded. | |
|||
| Product surfaces (`CSI`, `NVMe`, `iSCSI`, snapshot productization) | Partial | Mostly reuse candidates, but not the current core closure target. | |
|||
| Production hardening / ops | Partial | Candidate-level evidence exists; production-grade hardening is still ahead. | |
|||
|
|||
## Reuse Strategy |
|||
|
|||
Use this rule: |
|||
|
|||
1. if a component decides truth, V2 must own it |
|||
2. if a component consumes truth, V1 engineering can often be reused |
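
The rule above can be pictured as an interface split. The following is a minimal sketch only: `Policy`, `Backend`, `gapPolicy`, `recordingBackend`, and `recoverReplica` are hypothetical names invented for illustration and do not appear in the V2 engine or `v2bridge`. The deciding side stays V2-owned; the executing side is where reused V1 engineering could sit.

```go
package main

import "fmt"

// Policy decides truth, so under the rule above it stays V2-owned.
// Hypothetical interface for illustration only.
type Policy interface {
	Decide(replicaLSN, headLSN uint64) string
}

// Backend consumes truth: it executes whatever it is told.
// A reused V1-style runtime could sit behind an interface like this.
type Backend interface {
	Apply(action string) error
}

// gapPolicy is a toy V2-side policy with only two outcomes.
type gapPolicy struct{}

func (gapPolicy) Decide(replicaLSN, headLSN uint64) string {
	if replicaLSN == headLSN {
		return "zero_gap"
	}
	return "catchup"
}

// recordingBackend stands in for reused execution code and records calls.
type recordingBackend struct{ applied []string }

func (b *recordingBackend) Apply(action string) error {
	b.applied = append(b.applied, action)
	return nil
}

// recoverReplica wires the two sides: the policy chooses, the backend executes.
func recoverReplica(p Policy, b Backend, replicaLSN, headLSN uint64) error {
	return b.Apply(p.Decide(replicaLSN, headLSN))
}

func main() {
	b := &recordingBackend{}
	if err := recoverReplica(gapPolicy{}, b, 90, 100); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(b.applied)
}
```

The point of the split is that the backend never inspects LSNs to form its own opinion; swapping the backend cannot change which recovery outcome is chosen.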
|||
|
|||
### V2-owned semantics |
|||
|
|||
These should not inherit V1 semantics casually: |
|||
|
|||
1. recovery choice: `zero_gap` / `catchup` / `needs_rebuild` |
|||
2. sender/session ownership and fencing |
|||
3. stable `ReplicaID` and stale-authority rejection |
|||
4. committed/checkpoint interpretation |
|||
5. rebuild-source choice and recovery outcome meaning |
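
As an illustration of item 1, the sketch below classifies a replica into the three named outcomes. It is an assumption-laden toy, not the engine's decision code: `chooseRecovery` and its parameters are hypothetical, `walTailLSN` is assumed to be the oldest LSN still retained in the primary WAL, and `walHeadLSN` the newest. The replica-ahead case is deliberately rejected rather than classified, since it belongs to the truncation path.

```go
package main

import "fmt"

// RecoveryChoice names the three outcomes listed above.
type RecoveryChoice string

const (
	ZeroGap      RecoveryChoice = "zero_gap"
	CatchUp      RecoveryChoice = "catchup"
	NeedsRebuild RecoveryChoice = "needs_rebuild"
)

// chooseRecovery is a hypothetical classifier for illustration.
// replicaLSN is assumed to be the last LSN the replica holds.
func chooseRecovery(replicaLSN, walTailLSN, walHeadLSN uint64) (RecoveryChoice, error) {
	switch {
	case replicaLSN > walHeadLSN:
		// Replica-ahead is a truncation concern, not one of the three choices.
		return "", fmt.Errorf("replica LSN %d is ahead of WAL head %d", replicaLSN, walHeadLSN)
	case replicaLSN == walHeadLSN:
		return ZeroGap, nil
	case replicaLSN+1 >= walTailLSN:
		// The next entry the replica needs is still retained in the WAL.
		return CatchUp, nil
	default:
		// The needed entries were already recycled; only a rebuild can help.
		return NeedsRebuild, nil
	}
}

func main() {
	for _, lsn := range []uint64{100, 70, 10, 120} {
		choice, err := chooseRecovery(lsn, 50, 100)
		if err != nil {
			fmt.Println(lsn, "->", "error:", err)
			continue
		}
		fmt.Println(lsn, "->", choice)
	}
}
```

Keeping this function on the V2 side means the retained-WAL boundary is an input to the decision, never the decider.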
|||
|
|||
### V1 engineering that is usually reusable |
|||
|
|||
These are implementation/reality layers, not protocol truth: |
|||
|
|||
1. `blockvol` storage runtime |
|||
2. WAL / flusher / checkpoint machinery |
|||
3. real assignment receive/apply path |
|||
4. front-end adapters such as `NVMe` / `iSCSI` |
|||
5. much of `CSI` lifecycle integration |
|||
6. monitoring / metrics / test harness infrastructure |
|||
|
|||
### Reuse with explicit bounds |
|||
|
|||
These can reuse implementation, but their semantic placement must remain V2-owned: |
|||
|
|||
1. snapshot export / checkpoint plumbing |
|||
2. rebuild transport / extent read path |
|||
3. master/heartbeat/control delivery path |
|||
|
|||
## Module Treatment Overview |
|||
|
|||
| Module area | Current treatment | Near-term plan | |
|||
|-------------|-------------------|----------------| |
|||
| Recovery engine | V2-owned | Continue closing runtime/product path under accepted semantics. | |
|||
| `v2bridge` | V2 boundary adapter | Keep expanding real I/O/runtime closure without leaking policy downward. | |
|||
| `blockvol` WAL/flusher/runtime | Reuse reality | Reuse implementation, but do not let V1 replication semantics redefine V2 truth. | |
|||
| Snapshot capability | Reuse implementation, V2-owned semantics | Do not make this a main near-term phase goal until core execution/runtime closure is stronger. | |
|||
| `CSI` | Later product surface | Rebind after the V2-backed candidate path is stable enough. | |
|||
| `NVMe` / `iSCSI` | Later product surface | Reuse as front-end adapters once the backend candidate path is stronger. | |
|||
| Rebuild server / transfer mechanisms | Reuse with redesign boundary | Good candidate for later production execution closure work. | |
|||
| Control plane | Reuse existing path | Continue from `ProcessAssignments()` toward stronger end-to-end closure. | |
|||
|
|||
## What The Candidate Path Already Proves |
|||
|
|||
For the chosen `RF=2 sync_all` path, the project can already claim: |
|||
|
|||
1. stable remote identity across address change when `ServerID` is present |
|||
2. stale epoch/session fencing through the integrated path |
|||
3. real catch-up one-chain closure on the chosen path |
|||
4. rebuild control/execution chain proven on the chosen path |
|||
- validation-grade execution closure |
|||
- not yet production-grade block/image streaming |
|||
5. replay of accepted failure classes on the unified live path |
|||
6. one real failover / reassignment cycle |
|||
7. one true simultaneous-overlap retention safety proof |
|||
8. committed/checkpoint separation accepted for this candidate path: |
|||
- `CommittedLSN = WALHeadLSN` |
|||
- `CheckpointLSN` remains the durable base-image boundary |
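
Item 8 can be made concrete with a small sketch. The field names mirror the identifiers above, but the `VolumeState` type and the `WALTail` helper are hypothetical and only show how the two boundaries relate: the committed point follows the WAL head, while the checkpoint marks where the durable base image ends and WAL replay would begin.

```go
package main

import "fmt"

// VolumeState is a hypothetical holder for the two boundaries named above.
type VolumeState struct {
	WALHeadLSN    uint64 // newest LSN in the WAL
	CheckpointLSN uint64 // last LSN folded into the durable base image
}

// CommittedLSN follows the candidate-path decision: CommittedLSN = WALHeadLSN.
func (s VolumeState) CommittedLSN() uint64 { return s.WALHeadLSN }

// WALTail returns the inclusive LSN range that must be replayed on top of
// the base image. When from > to, the range is empty (fully checkpointed).
func (s VolumeState) WALTail() (from, to uint64, err error) {
	if s.CheckpointLSN > s.WALHeadLSN {
		return 0, 0, fmt.Errorf("checkpoint %d is ahead of WAL head %d",
			s.CheckpointLSN, s.WALHeadLSN)
	}
	return s.CheckpointLSN + 1, s.CommittedLSN(), nil
}

func main() {
	s := VolumeState{WALHeadLSN: 100, CheckpointLSN: 60}
	from, to, err := s.WALTail()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("base image covers <= %d, replay WAL %d..%d\n", s.CheckpointLSN, from, to)
}
```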
|||
|
|||
## What Is Still Missing For Product Completion |
|||
|
|||
The biggest remaining product-completion gaps are: |
|||
|
|||
1. production-grade rebuild data transfer |
|||
- `TransferFullBase` must become real streaming, not only accessibility validation |
|||
- `TransferSnapshot` must become real image streaming, not only checkpoint validation |
|||
2. replica-ahead physical correction |
|||
- `TruncateWAL` must stop being a stub |
|||
3. stronger live runtime ownership |
|||
- the V2 recovery driver/executors should become a more complete live runtime path, not only a bounded hardening path |
|||
4. stronger control-plane closure |
|||
- current proof reaches `ProcessAssignments()` |
|||
- full heartbeat/gRPC-level closure is still bounded |
|||
5. product-surface rebinding |
|||
- `CSI` |
|||
- `NVMe` |
|||
- `iSCSI` |
|||
- snapshot product path |
|||
6. production hardening |
|||
- restart / soak / repeated disturbance / diagnosis quality |
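
For gap 2 (replica-ahead physical correction), the sketch below shows the general shape of dropping entries beyond a target LSN. It is an in-memory toy under stated assumptions: `entry` and `truncateAfter` are hypothetical, entries are assumed sorted by ascending LSN, and nothing here reflects how the real `TruncateWAL` will touch the on-disk WAL or superblock.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a WAL record.
type entry struct{ LSN uint64 }

// truncateAfter drops every entry whose LSN is greater than targetLSN and
// reports how many were removed. It assumes wal is sorted by ascending LSN.
func truncateAfter(wal []entry, targetLSN uint64) ([]entry, int) {
	keep := len(wal)
	for keep > 0 && wal[keep-1].LSN > targetLSN {
		keep--
	}
	return wal[:keep], len(wal) - keep
}

func main() {
	// The replica holds LSN 1..5 while the accepted boundary is 3,
	// so entries 4 and 5 are ahead and must be discarded.
	wal := []entry{{1}, {2}, {3}, {4}, {5}}
	kept, removed := truncateAfter(wal, 3)
	fmt.Println(len(kept), removed)
}
```

A production path would additionally need to make the truncation durable before the replica rejoins; that step is outside this sketch.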
|||
|
|||
## Recommended Completion Roadmap |
|||
|
|||
### Stage 1: Finish Phase 08 cleanly |
|||
|
|||
Target: |
|||
|
|||
1. close candidate-path judgment with explicit bounds and package it cleanly inside `Phase 08 P4` |
|||
|
|||
Main output: |
|||
|
|||
1. one accepted candidate package for the chosen path |
|||
|
|||
### Stage 2: Phase 09 Production Execution Closure |
|||
|
|||
Target: |
|||
|
|||
1. turn validation-grade execution into production-grade execution |
|||
|
|||
Main work: |
|||
|
|||
1. real `TransferFullBase` |
|||
2. real `TransferSnapshot` |
|||
3. real `TruncateWAL` |
|||
4. stronger runtime ownership of recovery execution |
|||
|
|||
Why it matters: |
|||
|
|||
This is the largest remaining engineering block between "candidate-safe-with-bounds" and a serious product path. |
|||
|
|||
### Stage 3: Phase 10 Real Control-Plane Closure |
|||
|
|||
Target: |
|||
|
|||
1. strengthen from accepted assignment-entry closure to fuller end-to-end control-path closure |
|||
|
|||
Main work: |
|||
|
|||
1. heartbeat/gRPC-level proof |
|||
2. stronger control/result convergence |
|||
3. better identity completeness for local and remote server roles |
|||
|
|||
### Stage 4: Phase 11 Product Surface Rebinding |
|||
|
|||
Target: |
|||
|
|||
1. connect product-facing surfaces to the V2-backed block path |
|||
|
|||
Candidate areas: |
|||
|
|||
1. snapshot product path |
|||
2. `CSI` |
|||
3. `NVMe` |
|||
4. `iSCSI` |
|||
|
|||
Rule: |
|||
|
|||
Do this after the backend engine/runtime path is strong enough, not before. |
|||
|
|||
### Stage 5: Phase 12 Production Hardening |
|||
|
|||
Target: |
|||
|
|||
1. move from candidate-safe to production-safe |
|||
|
|||
Main work: |
|||
|
|||
1. soak / restart / repeated failover |
|||
2. operational diagnosis quality |
|||
3. performance floor and cost characterization |
|||
4. explicit production blockers / rollout gates |
|||
|
|||
## Completion Gates |
|||
|
|||
The most important gates from here are: |
|||
|
|||
1. execution gate |
|||
- validation-grade transfer/truncation must become production-grade |
|||
2. runtime ownership gate |
|||
- V2 recovery must be a stronger live runtime path, not only a bounded tested path |
|||
3. control-plane gate |
|||
- stronger end-to-end control delivery proof |
|||
4. product-surface gate |
|||
- front-end surfaces should only rebind after backend correctness is strong enough |
|||
5. production-hardening gate |
|||
- restart, soak, diagnosis, and repeated disturbance must be acceptable |
|||
|
|||
## Near-Term Planning Guidance |
|||
|
|||
If the goal is to maximize product completion efficiently: |
|||
|
|||
1. do not make `CSI`, `NVMe`, or broad snapshot productization the immediate next heavy phase |
|||
2. first close production execution gaps in the backend path |
|||
3. then strengthen control-plane closure |
|||
4. then rebind product surfaces |
|||
|
|||
In short: |
|||
|
|||
1. backend truth and execution first |
|||
2. product surfaces second |
|||
3. production hardening last |
|||
|
|||
## Short Summary |
|||
|
|||
The V2 line is already beyond "algorithm only". |
|||
It has a real bounded candidate path. |
|||
|
|||
But the remaining work is still substantial, and it is mostly engineering work: |
|||
|
|||
1. production-grade execution |
|||
2. stronger runtime/control closure |
|||
3. product-surface rebinding |
|||
4. production hardening |
|||
|
|||
That is the practical path from the current candidate-safe engine to a production-ready block product. |
|||