feat: Phase 09 P0 — production execution closure plan

Execution-closure targets:
- P1: TransferFullBase — reuse rebuild.go TCP protocol
- P2: TransferSnapshot — checkpoint image + WAL tail
- P3: TruncateWAL — AdvanceTail + superblock update
- P4: Runtime ownership — V2 orchestrator drives execution

Key reuse sources identified:
- rebuild.go: rebuildFullExtent (client), RebuildServer (server)
- wal_writer.go: AdvanceTail
- flusher.go: updateSuperblockCheckpoint
- blockvol.go: ScanWALEntries (already wired)

Slice order: full-base first (highest value), then snapshot,
then truncation, then runtime ownership.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feature/sw-block
pingqiu 22 hours ago
parent
commit
46faf0f7e3
  1. sw-block/.private/phase/phase-08-decisions.md (109)
  2. sw-block/.private/phase/phase-08-log.md (397)
  3. sw-block/.private/phase/phase-08.md (397)
  4. sw-block/.private/phase/phase-09-decisions.md (26)
  5. sw-block/.private/phase/phase-09-log.md (269)
  6. sw-block/.private/phase/phase-09.md (127)
  7. sw-block/design/README.md (3)
  8. sw-block/design/agent_dev_process.md (25)
  9. sw-block/design/phase-08-engine-skeleton-map.md (301)
  10. sw-block/design/v2-phase-development-plan.md (231)
  11. sw-block/design/v2-product-completion-overview.md (267)

109
sw-block/.private/phase/phase-08-decisions.md

@ -76,3 +76,112 @@ The hard rule remains:
1. engine owns recovery policy
2. bridge translates confirmed control/storage truth
3. `blockvol` executes I/O
## Decision 9: Phase 08 P1 is accepted with explicit scope limits
Accepted `P1` coverage is:
1. real `ProcessAssignments()` path drives V2 engine sender/session state change
2. stable remote `ReplicaID` is derived from `ServerID`, not address
3. address change preserves sender identity through the live control path
4. stale epoch/session invalidation occurs through the live control path
5. missing `ServerID` fails closed
Not accepted as part of `P1`:
1. full end-to-end gRPC heartbeat delivery proof
2. integrated catch-up execution through the live path
3. rebuild execution through the live path
4. final local stable identity beyond transport-shaped `listenAddr`
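A minimal sketch of the accepted identity rule, assuming the `<volume>/<ServerID>` format recorded in the `P1` status notes. The `Assignment` type, the `ReplicaID` helper, and `ErrMissingServerID` are hypothetical names used for illustration only; they are not the real `ProcessAssignments()` types.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMissingServerID is a hypothetical error for assignments that carry no ServerID.
var ErrMissingServerID = errors.New("assignment has no ServerID")

// Assignment is an illustrative stand-in for one replica assignment.
type Assignment struct {
	Volume   string
	ServerID string
	Addr     string // transport address; may change across restarts
}

// ReplicaID derives the logical identity from volume and ServerID only.
// The address is deliberately ignored, and a missing ServerID fails closed.
func ReplicaID(a Assignment) (string, error) {
	if a.ServerID == "" {
		return "", ErrMissingServerID
	}
	return a.Volume + "/" + a.ServerID, nil
}

func main() {
	before, _ := ReplicaID(Assignment{Volume: "vol1", ServerID: "srv-a", Addr: "10.0.0.1:8080"})
	after, _ := ReplicaID(Assignment{Volume: "vol1", ServerID: "srv-a", Addr: "10.0.0.9:8080"})
	fmt.Println("identity preserved across address change:", before == after)

	_, err := ReplicaID(Assignment{Volume: "vol1", Addr: "10.0.0.1:8080"})
	fmt.Println("missing ServerID:", err)
}
```

Because the address never enters the derivation, an address refresh cannot create a second sender for the same replica.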
## Decision 10: Phase 08 P2 is accepted as real execution closure
Accepted `P2` coverage is:
1. `CommittedLSN` is separated from `CheckpointLSN` on the chosen `sync_all` path
2. catch-up is proven as one live chain:
- engine plan
- engine executor
- `v2bridge`
- real `blockvol` I/O
- completion
- cleanup
3. rebuild is proven as one live chain for the delivered path
4. cleanup/pin release is asserted after execution
Residual non-blocking scope notes:
1. `CatchUpStartLSN` is not directly asserted in tests
2. rebuild source variants are not all forced and individually asserted
## Decision 11: Phase 08 now moves to unified hardening validation
With `P1` and `P2` accepted, the next required step is:
1. replay the accepted failure-class set again on the unified live path
2. validate at least one real failover / reassignment cycle
3. validate concurrent retention/pinner behavior
4. make the committed-truth gate decision explicit for the chosen candidate path
## Decision 12: Phase 08 P3 is accepted as unified hardening validation
Accepted `P3` coverage is:
1. replay of the accepted failure-class set on the unified `P1` + `P2` live path
2. at least one real failover / reassignment cycle through the live control path
3. one true simultaneous-overlap retention/pinner safety proof
4. stronger causality assertions for invalidation, escalation, catch-up, and completion
## Decision 13: The committed-truth gate is decided for the chosen candidate path
For the chosen `RF=2 sync_all` candidate path:
1. `CommittedLSN = WALHeadLSN`
2. `CheckpointLSN` remains the durable base-image boundary
3. this separation is accepted as sufficient for the candidate-path hardening boundary
This decision is intentionally scoped:
1. it is accepted for the chosen candidate path
2. it is not yet a blanket truth for every future path or durability mode
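The separation in Decision 13 can be pictured with a small sketch. The `Boundaries` type and its methods are hypothetical, and the invariant that the checkpoint never runs ahead of the committed boundary is an assumption made for illustration, not something the decision record states.

```go
package main

import (
	"errors"
	"fmt"
)

// Boundaries is a hypothetical view of the two LSN boundaries on the
// chosen RF=2 sync_all candidate path.
type Boundaries struct {
	WALHeadLSN    uint64
	CheckpointLSN uint64 // durable base-image boundary
}

// CommittedLSN follows Decision 13: on this path it equals WALHeadLSN.
func (b Boundaries) CommittedLSN() uint64 { return b.WALHeadLSN }

// Validate checks the assumed invariant CheckpointLSN <= CommittedLSN.
func (b Boundaries) Validate() error {
	if b.CheckpointLSN > b.CommittedLSN() {
		return errors.New("checkpoint is ahead of the committed boundary")
	}
	return nil
}

// TailLength is the number of committed entries beyond the base image.
// Call Validate first; the subtraction assumes the invariant holds.
func (b Boundaries) TailLength() uint64 { return b.CommittedLSN() - b.CheckpointLSN }

func main() {
	b := Boundaries{WALHeadLSN: 120, CheckpointLSN: 100}
	if err := b.Validate(); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("committed:", b.CommittedLSN(), "tail entries:", b.TailLength())
}
```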
## Decision 14: Phase 08 P4 is candidate-path judgment, not broad new engineering expansion
`P4` should close `Phase 08` by producing one explicit candidate-path judgment.
Its main output is not more isolated engineering progress, but:
1. a bounded candidate-path statement
2. an evidence-to-claim mapping from accepted `P1` / `P2` / `P3` results
3. an explicit list of accepted bounds, remaining deferrals, and production blockers
`P4` may include small closure work if needed to make the candidate statement coherent, but it should not reopen protocol design or grow into another broad hardening slice.
## Decision 15: Phase 08 P4 is accepted as candidate package closure
Accepted `P4` coverage is:
1. one explicit candidate package for the chosen `RF=2 sync_all` path
2. candidate-safe claims mapped to accepted `P1` / `P2` / `P3` evidence
3. explicit bounds, deferred items, and production blockers
4. committed-truth decision scoped to the chosen candidate path
5. module/package boundary summary for the next heavy engineering phase
Accepted judgment:
1. candidate-safe-with-bounds
2. not production-ready
## Decision 16: Phase 08 is closed and the next heavy phase is production execution closure
With `P0` through `P4` accepted, `Phase 08` is closed.
The next phase should not be a light packaging-only round.
It should begin with:
1. `Phase 09: Production Execution Closure`
2. `P0` planning for:
- real `TransferFullBase`
- real `TransferSnapshot`
- real `TruncateWAL`
- stronger live runtime execution ownership

397
sw-block/.private/phase/phase-08-log.md

@ -17,5 +17,398 @@
### Next
1. Phase 08 P0 accepted
-2. Phase 08 P1 real master/control delivery integration
-3. Phase 08 P2 integrated execution closure
+2. Phase 08 P1 accepted
+3. Phase 08 P2 accepted
4. Phase 08 P3 hardening validation on the unified live path
5. Phase 08 P4 candidate package closure accepted
6. Phase 08 closeout bookkeeping complete
7. next: open Phase 09 P0 for production execution closure planning
### P3 Technical Pack
Purpose:
- provide the minimum design/algo/test detail needed to execute `P3`
- reuse accepted `P1` / `P2` live-path closure
- avoid broad scenario growth or repeated proof of already accepted mechanics
#### Design / algo focus
`P3` is not another execution-closure slice.
It assumes these are already accepted on the chosen path:
- real control delivery
- real catch-up one-chain closure
- real rebuild one-chain closure
What `P3` adds is hardening evidence on top of that live path:
1. replay accepted failure classes again on the unified path
2. prove one real failover / reassignment cycle
3. prove one overlapping retention/pinner safety case
4. produce one explicit committed-truth gate decision
Key algorithm rules for `P3`:
- control truth remains primary:
- failover / reassignment is driven by new assignment / epoch truth
- storage/runtime must not invent role changes
- recovery choice remains engine-owned:
- engine chooses `zero_gap` / `catchup` / `needs_rebuild`
- bridge and `blockvol` execute what the engine already decided
- overlapping recovery must remain fail-closed:
- retained floor = minimum active retention requirement
- stale or cancelled plan must release its hold
- a new authoritative plan must not inherit leaked resources from an old one
- the committed-truth gate must be recorded as an explicit decision, not discussed informally:
- either the chosen candidate path is accepted with current committed/checkpoint semantics
- or the next phase is blocked on further separation/bounding work
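The engine-owned recovery choice named in the rules above could look roughly like the sketch below. The three outcome names come from this pack; the `Classify` function, its inputs, and the exact thresholds are assumptions for illustration and may differ from the real engine logic.

```go
package main

import "fmt"

// Outcome is the recovery choice owned by the engine.
type Outcome string

const (
	ZeroGap      Outcome = "zero_gap"
	CatchUp      Outcome = "catchup"
	NeedsRebuild Outcome = "needs_rebuild"
)

// Classify is a hypothetical decision function.
//   replicaLSN   : last LSN the replica has applied
//   committedLSN : committed boundary on the primary
//   walTailLSN   : oldest LSN still retained in the primary WAL
// Assumed rule: catch-up is possible only if the first missing entry
// (replicaLSN+1) is still retained; otherwise the replica must rebuild.
func Classify(replicaLSN, committedLSN, walTailLSN uint64) Outcome {
	switch {
	case replicaLSN >= committedLSN:
		return ZeroGap
	case replicaLSN+1 >= walTailLSN:
		return CatchUp
	default:
		return NeedsRebuild
	}
}

func main() {
	fmt.Println(Classify(100, 100, 50)) // replica is current
	fmt.Println(Classify(80, 100, 50))  // gap is still inside retained WAL
	fmt.Println(Classify(30, 100, 50))  // gap starts before retained WAL
}
```

The bridge and `blockvol` would only receive the returned outcome; they never compute it themselves.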
#### Validation matrix
Use one compact replay matrix rather than many near-duplicate tests.
1. Changed-address restart
- trigger: address refresh / reassignment while prior identity is preserved
- expected: old session invalidated, same logical `ReplicaID`, new recovery starts cleanly
- assert:
- no stale session mutation
- no leaked pins
- logs show why identity stayed and session changed
2. Stale epoch / stale session
- trigger: epoch bump during or before recovery continuation
- expected: stale execution loses authority immediately
- assert:
- old session cannot mutate
- replacement assignment/session becomes the only live authority
- logs show invalidation reason
3. Unrecoverable gap / needs-rebuild
- trigger: replica falls behind retained WAL
- expected: engine chooses `needs_rebuild`, rebuild path executes or is prepared according to accepted boundary
- assert:
- no catch-up overclaim
- correct rebuild source/result logged
- no leaked pins after completion/failure
4. Post-checkpoint boundary behavior
- trigger: replica state around checkpoint / committed boundary
- expected: classification and execution match the chosen candidate-path semantics
- assert:
- chosen path does not overclaim beyond the accepted boundary
- committed/checkpoint truth used here matches the explicit gate decision
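Case 2 of the matrix (stale epoch / stale session) can be illustrated with a small authority check. This is a sketch under assumptions: `Authority`, `StartSession`, `RecordProgress`, and `ErrStale` are hypothetical names, not the engine's sender/session API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStale is a hypothetical error for callers that have lost authority.
var ErrStale = errors.New("stale epoch or session: authority lost")

// Authority tracks the only live epoch/session pair for one replica.
type Authority struct {
	Epoch       uint64
	SessionID   uint64
	ProgressLSN uint64
}

// StartSession installs a new session for the given epoch.
// An older epoch is rejected; any previous session is implicitly invalidated.
func (a *Authority) StartSession(epoch uint64) (uint64, error) {
	if epoch < a.Epoch {
		return 0, ErrStale
	}
	a.Epoch = epoch
	a.SessionID++
	return a.SessionID, nil
}

// RecordProgress mutates state only for the current epoch and session.
func (a *Authority) RecordProgress(epoch, sessionID, lsn uint64) error {
	if epoch != a.Epoch || sessionID != a.SessionID {
		return ErrStale
	}
	a.ProgressLSN = lsn
	return nil
}

func main() {
	var a Authority
	old, _ := a.StartSession(1)
	_ = a.RecordProgress(1, old, 10)

	cur, _ := a.StartSession(2) // epoch bump during recovery
	fmt.Println("old session:", a.RecordProgress(1, old, 20))
	fmt.Println("new session:", a.RecordProgress(2, cur, 20))
}
```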
#### Required extra cases
Besides the replay matrix, `P3` should add only two new validation cases:
1. One real failover / promotion / reassignment cycle
- primary change or reassignment through the live control path
- verify old authority dies, new authority starts, recovery resumes/starts correctly
2. One true simultaneous-overlap retention/pinner case
- two live recovery holds coexist before the earlier one is released
- verify:
- minimum retention floor is respected while both are live
- releasing one hold leaves the other hold still contributing the correct floor
- released/cancelled plan stops contributing to retention floor
- final hold count returns to zero
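The overlap case above can be checked against a toy pinner. This is a sketch only: `ActiveHoldCount` is the name used in this log, while `Pinner`, `Acquire`, `Release`, and `Floor` are hypothetical and do not describe the real `v2bridge` pinner.

```go
package main

import (
	"fmt"
	"sync"
)

// Pinner tracks live recovery holds; the retention floor is the
// minimum start LSN across all active holds.
type Pinner struct {
	mu    sync.Mutex
	next  int
	holds map[int]uint64
}

func NewPinner() *Pinner { return &Pinner{holds: make(map[int]uint64)} }

// Acquire registers a hold that requires WAL from startLSN onward.
func (p *Pinner) Acquire(startLSN uint64) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.next++
	p.holds[p.next] = startLSN
	return p.next
}

// Release drops a hold; releasing an unknown id is a no-op.
func (p *Pinner) Release(id int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.holds, id)
}

// Floor returns the minimum pinned LSN, or false if nothing is pinned.
func (p *Pinner) Floor() (uint64, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	var floor uint64
	found := false
	for _, lsn := range p.holds {
		if !found || lsn < floor {
			floor, found = lsn, true
		}
	}
	return floor, found
}

// ActiveHoldCount reports how many holds are still live.
func (p *Pinner) ActiveHoldCount() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.holds)
}

func main() {
	p := NewPinner()
	first := p.Acquire(40)
	second := p.Acquire(70)
	floor, _ := p.Floor()
	fmt.Println("floor with both holds live:", floor)
	p.Release(first)
	floor, _ = p.Floor()
	fmt.Println("floor after releasing the earlier hold:", floor)
	p.Release(second)
	fmt.Println("remaining holds:", p.ActiveHoldCount())
}
```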
#### Expected evidence
For each accepted `P3` case, prefer explicit evidence blocks:
- entry truth:
- assignment / epoch / role that started the case
- engine result:
- selected outcome or invalidation result
- execution result:
- completion / cancel / failure
- cleanup result:
- `ActiveHoldCount() == 0`
- no surviving active session when case should be closed
- observability result:
- logs explain:
- why control truth changed
- why session changed
- why catch-up vs rebuild happened
- why execution completed / failed / cancelled
#### Efficient test plan
Keep `P3` small and high-signal:
- one unified replay test package or compact matrix
- one real failover-cycle test
- one overlapping-retention test
- one explicit gate-decision record in delivery / phase status
Avoid:
- re-proving isolated `P2` one-chain mechanics
- broad combinatorial growth across many replicas / roles / timing permutations
- turning `P3` into another protocol-design slice
### P4 Technical Pack
Purpose:
- provide the minimum design/algo/test detail needed to close `Phase 08`
- convert accepted `P1` / `P2` / `P3` evidence into one candidate-path judgment
- keep `P4` as a closure slice, not another broad engineering slice
#### Delivery sequence
Use this order:
1. `sw` develops the candidate package
2. `architect` reviews code/claim shape before tester time is spent
3. `tester` validates the evidence-to-claim mapping
4. `manager` records the final phase/accounting decision
Do not collapse these roles:
- `sw` builds the candidate statement and supporting artifacts
- `architect` checks whether the resulting package has obvious semantic, scope, or evidence-shape problems before tester validation
- `tester` checks whether every claim is actually supported
- `manager` decides acceptance/bookkeeping after architect + tester feedback
Recommended handoff gate before tester:
- if architect finds obvious overclaim, missing evidence mapping, or broken candidate shape, return to `sw` first
- do not spend tester time on a package that is clearly not ready
#### Design / algo focus
`P4` should not introduce new protocol shape.
It consumes already accepted results:
- `P1`: real control delivery
- `P2`: real execution closure
- `P3`: unified hardening validation
The main design task is to classify the chosen path into three buckets:
1. candidate-safe
- supported by accepted evidence
- allowed to appear in the candidate statement
2. intentionally bounded
- accepted only within narrow limits
- must appear as explicit candidate bounds
3. deferred or blocking
- not yet supported enough
- must not be implied as candidate-ready
Algorithmically, `P4` is a classification/output slice:
- no new recovery FSM
- no new identity model
- no new rebuild policy
- no new durability model
It should only:
- map accepted evidence to accepted candidate claims
- map residual limitations to explicit bounds or blockers
- separate candidate readiness from production readiness
#### Required output artifacts
`sw` should produce exactly these artifacts:
1. Candidate statement
- what the chosen `RF=2 sync_all` path is allowed to claim
2. Evidence-to-claim map
- each candidate claim points to accepted evidence from `P1` / `P2` / `P3`
3. Bound list
- explicit candidate-safe bounds, for example:
- chosen path only
- chosen durability mode only
- accepted rebuild coverage only
4. Deferred / blocking list
- what remains outside the candidate path
- what still blocks production readiness
#### Candidate statement shape
Keep the candidate statement short and structured.
It should answer only:
1. What path is the candidate?
2. What is proven for that path?
3. What is intentionally bounded for that path?
4. What is still deferred or blocking?
Good pattern:
- candidate path:
- `RF=2 sync_all` on the accepted master/heartbeat control path
- proven:
- real control delivery
- real catch-up closure
- real rebuild closure for accepted coverage
- unified replay and failover validation
- bounded:
- only the chosen path / mode
- only accepted rebuild/source coverage
- not yet claimed:
- general future path/mode truth
- production readiness
#### Candidate statement template
Use this exact structure for the `P4` delivery statement:
1. Candidate path
- The first candidate path is:
- `<path / topology / durability mode>`
2. Candidate-safe claims
- The candidate path is supported for:
- `<claim 1>` — evidence: `<P1/P2/P3 reference>`
- `<claim 2>` — evidence: `<P1/P2/P3 reference>`
- `<claim 3>` — evidence: `<P1/P2/P3 reference>`
3. Explicit bounds
- This candidate statement is intentionally bounded to:
- `<bound 1>`
- `<bound 2>`
- `<bound 3>`
4. Deferred or blocking items
- Not yet claimed as candidate-safe:
- `<deferred item 1>`
- `<deferred item 2>`
- Still blocking production readiness:
- `<blocker 1>`
- `<blocker 2>`
5. Committed-truth decision
- For this candidate path:
- `<committed-truth decision>`
- Scope:
- `<why this does not automatically generalize>`
6. Overall judgment
- Judgment:
- `<candidate-safe / candidate-safe-with-bounds / not-yet-candidate>`
- Reason:
- `<one short paragraph tying evidence to judgment>`
When `sw` fills this template:
- every positive claim must carry an evidence reference
- every important missing area must appear either under:
- explicit bounds
- deferred
- blockers
- avoid prose that mixes candidate judgment with production-readiness language
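The filling rules above lend themselves to a mechanical pre-check before architect review. The sketch below is one possible shape, assuming a hypothetical `CandidateStatement` struct that mirrors the template; none of these types exist in the repository.

```go
package main

import (
	"fmt"
	"strings"
)

// Claim is one candidate-safe claim with its evidence reference (e.g. "P2").
type Claim struct {
	Text     string
	Evidence string
}

// CandidateStatement mirrors the template sections in a hypothetical struct.
type CandidateStatement struct {
	Path     string
	Claims   []Claim
	Bounds   []string
	Deferred []string
	Blockers []string
	Judgment string
}

// allowedJudgments lists the three judgments named in the template.
var allowedJudgments = map[string]bool{
	"candidate-safe":             true,
	"candidate-safe-with-bounds": true,
	"not-yet-candidate":          true,
}

// Problems returns one message per rule violation; empty means the shape is acceptable.
func (s CandidateStatement) Problems() []string {
	var problems []string
	if strings.TrimSpace(s.Path) == "" {
		problems = append(problems, "candidate path is empty")
	}
	for i, c := range s.Claims {
		if strings.TrimSpace(c.Evidence) == "" {
			problems = append(problems, fmt.Sprintf("claim %d (%q) has no evidence reference", i+1, c.Text))
		}
	}
	if !allowedJudgments[s.Judgment] {
		problems = append(problems, fmt.Sprintf("unknown judgment %q", s.Judgment))
	}
	return problems
}

func main() {
	s := CandidateStatement{
		Path: "RF=2 sync_all",
		Claims: []Claim{
			{Text: "real control delivery", Evidence: "P1"},
			{Text: "real rebuild closure"}, // missing evidence on purpose
		},
		Judgment: "candidate-safe-with-bounds",
	}
	for _, p := range s.Problems() {
		fmt.Println(p)
	}
}
```

A check like this only covers shape; whether the referenced evidence actually supports each claim remains a tester judgment.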
#### Assignment template
Use this template when assigning `P4` work to `sw`:
1. Goal
- Build the `P4` candidate package for the chosen path.
2. Required outputs
- candidate statement
- evidence-to-claim mapping
- explicit bounds list
- deferred / blocking list
- committed-truth decision statement
3. Hard rules
- no new protocol redesign
- no broad scope growth without candidate impact
- every positive claim must map to accepted `P1` / `P2` / `P3` evidence
- do not mix candidate readiness with production readiness
4. Delivery order
- first hand to architect review
- only after architect review passes, hand to tester validation
- manager records final acceptance/bookkeeping last
5. Reject before handoff if
- evidence-to-claim mapping is incomplete
- important limitations are not classified as bounded / deferred / blocking
- claims exceed accepted evidence
Use this template when assigning `P4` validation to `tester`:
1. Goal
- Validate that the candidate package is fully supported by accepted evidence.
2. Validate
- each claim has accepted evidence
- each bound/deferred/blocker is explicit
- committed-truth decision stays scoped correctly
- no candidate-to-production overclaim exists
3. Output
- pass/fail on each candidate claim group
- findings on unsupported claims, missing bounds, or hidden blockers
#### Tester validation checklist
`tester` should validate:
1. every positive candidate claim has accepted evidence
2. every important limitation appears in either:
- bounded
- deferred
- blocking
3. no accepted evidence is stretched into a broader product claim
4. committed-truth decision stays scoped to the chosen candidate path
5. candidate readiness is not confused with production readiness
#### Architect review focus
`architect` should review only:
1. semantic correctness of the candidate statement
2. whether the evidence-to-claim mapping is honest
3. whether bounds are explicit enough to prevent future drift
4. whether any hidden overclaim remains
This review should not reopen already accepted `P1` / `P2` / `P3` mechanics unless the candidate statement contradicts them.
#### Efficient test / evidence plan
`P4` should mostly reuse accepted evidence rather than add new broad tests.
Preferred work:
- collect accepted evidence references
- compress them into candidate-safe claims
- write one explicit residual-gap list
Only add new code/tests if a small missing blocker prevents a coherent candidate statement.
Avoid:
- large new replay matrices
- new protocol experiments
- broad implementation growth without candidate impact
### Closeout bookkeeping
Manager follow-up after `P4` acceptance found only a minor bookkeeping concern:
- ensure `phase-08.md` is explicitly closed before treating `Phase 09` as opened
Closeout check:
1. `phase-08.md` is `Status: complete`
2. `P4` is recorded as accepted
3. `Phase-close note` points to `Phase 09: Production Execution Closure`
4. `phase-08-decisions.md` records `Decision 16`
Final bookkeeping judgment:
- `Phase 08` is closed
- `Phase 09 P0` is the active next planning/engineering package

397
sw-block/.private/phase/phase-08.md

@ -1,7 +1,7 @@
# Phase 08
Date: 2026-03-31
-Status: active
+Status: complete
Purpose: convert the accepted Phase 07 product path into a pre-production-hardening program without reopening accepted V2 protocol shape
## Why This Phase Exists
@ -19,6 +19,15 @@ What still does not exist is a pre-production-ready system path. The remaining w
Harden the first accepted V2 product path until the remaining gap to a production candidate is explicit, bounded, and implementation-driven.
This phase doc is the canonical hardening contract for `sw` and `tester`.
Use `phase-08-log.md` for deeper engineering process, alternatives, and implementation detail.
Algorithm note:
- the accepted V2 algorithm / protocol shape is treated as fixed for this phase
- remaining work is engineering closure over real Seaweed/V1 runtime paths under V2 boundaries
- do not reopen protocol design unless a live contradiction is found
## Scope
### In scope
@ -67,6 +76,11 @@ Status:
- candidate-path readiness vs production readiness
- accepted
Reference:
- `sw-block/design/phase-08-engine-skeleton-map.md` is the implementation-side skeleton map for this phase
- it is subordinate to `sw-block/design/v2-protocol-truths.md` and this `phase-08.md`; use it for module layout, execution order, interim fields, hard gates, and reuse guidance
### P1: Real Control Delivery
1. connect real master/heartbeat assignment delivery into the bridge
@ -109,13 +123,6 @@ Implementation route (`reuse map`):
- keep engine as the recovery-policy owner
- keep `blockvol` as the I/O executor
Expectation note:
- the `P1` tester expectation is already embedded in this phase doc under:
- `P1 / Validation focus`
- `P1 / Reject if`
- do not grow a separate long template unless `P1` scope expands materially
Validation focus:
- prove live assignment delivery into the bridge/engine path
@ -135,34 +142,312 @@ Reject if:
- failover / reassignment is claimed without a real replay target
- delivery claims general production readiness rather than control-path closure
Status:
- accepted
- real assignment delivery into the V2 path is now proven through `ProcessAssignments()`
- accepted evidence includes:
- live assignment -> engine sender/session creation
- stable remote `ReplicaID = <volume>/<ServerID>`
- address-change identity preservation through the live path
- stale epoch/session invalidation through the live path
- fail-closed skip on missing `ServerID`
- accepted with explicit carry-forwards:
- `localServerID = listenAddr` remains transport-shaped for local identity
- heartbeat -> `ProcessAssignments()` is proven, but not full end-to-end gRPC delivery
- integrated catch-up execution is not yet proven through the live path
- rebuild execution remains deferred
- the `CommittedLSN = CheckpointLSN` conflation remains unresolved
### P2: Execution Closure
1. close the live engine -> executor -> `v2bridge` execution chain
2. make catch-up execution evidence integrated rather than split across layers
3. close the first rebuild execution path required by the product path
Technical focus:
- keep execution ownership explicit:
- engine plans and owns recovery state transitions
- engine executor drives stepwise execution
- `v2bridge` translates execution requests into real blockvol work
- `blockvol` performs I/O only
- prove catch-up as one real path:
- accepted control delivery
- real retained-history input
- real WAL retention pin
- real WAL scan / progress return
- real session completion
- choose the narrowest rebuild closure required by the current product path:
- first real `full-base` rebuild path is preferred
- `snapshot + tail` can remain later unless needed by the chosen path
- keep resource ownership fail-closed:
- pin acquisition before execution
- release on success
- release on cancel / invalidation
- release on partial failure
- keep observability causal:
- execution start
- execution progress
- execution cancel / invalidation
- execution failure
- completion
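The fail-closed resource rule in the technical focus above (pin before execution, release on every exit path) maps naturally onto a `defer`. The sketch below uses hypothetical names (`Pins`, `RunPinned`, `errExecFailed`) and stands in for the real executor/pinner pair, whose signatures are not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// errExecFailed is a demo error standing in for a partial execution failure.
var errExecFailed = errors.New("execution failed")

// Pins is a toy hold counter standing in for a WAL retention pinner.
type Pins struct{ active int }

// Acquire takes one hold and returns an idempotent release function.
func (p *Pins) Acquire() (release func(), err error) {
	p.active++
	released := false
	return func() {
		if !released {
			released = true
			p.active--
		}
	}, nil
}

// RunPinned acquires the pin before execution and releases it on every
// exit path: success, failure, and panic (via defer). If the pin cannot
// be acquired, execution is skipped entirely.
func RunPinned(acquire func() (func(), error), exec func() error) error {
	release, err := acquire()
	if err != nil {
		return fmt.Errorf("pin not acquired, execution skipped: %w", err)
	}
	defer release()
	return exec()
}

func main() {
	pins := &Pins{}
	err := RunPinned(pins.Acquire, func() error { return errExecFailed })
	fmt.Println("result:", err, "active holds:", pins.active)
}
```

Cancel and invalidation would reach the same `defer` as long as they surface as an error return from the execution function.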
Implementation route:
- reuse engine-side execution core:
- `sw-block/engine/replication/driver.go`
- `sw-block/engine/replication/executor.go`
- `sw-block/engine/replication/orchestrator.go`
- reuse storage/runtime execution bridge:
- `weed/storage/blockvol/v2bridge/executor.go`
- `weed/storage/blockvol/v2bridge/pinner.go`
- `weed/storage/blockvol/v2bridge/reader.go`
- reuse block runtime execution reality:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- rebuild-side files under `weed/storage/blockvol/`
- preserve the boundary:
- do not move zero-gap / catch-up / rebuild classification into `blockvol`
- do not let executor convenience paths redefine protocol semantics
Validation focus:
- prove one live integrated catch-up chain:
- assignment/control arrives through accepted `P1` path
- engine plans
- executor drives `v2bridge`
- `blockvol` executes
- progress returns
- session completes
- prove one real rebuild execution path for the chosen product path
- prove retention pin / release symmetry on the live path
- prove rebuild resource pin / release symmetry on the live path
- prove invalidation / cancel cleanup on the live path
- prove execution logs explain:
- why catch-up started
- why rebuild started
- why execution failed
- why execution was cancelled
- why completion succeeded
Reject if:
- catch-up is still only proven by split evidence
- rebuild remains only a detection outcome
- `blockvol` starts deciding recovery mode or rebuild fallback
- resources leak on cancel / invalidation / partial failure
- execution logs are too weak to replay causality offline
- the slice quietly broadens protocol semantics beyond the current accepted boundary
Recommended first cut:
1. close the live catch-up chain first
2. close the first real `full-base` rebuild path second
3. leave unified replay to `P3`
Minimum closure threshold:
- do not accept `P2` on glue code + partial chain tests alone
- at least one accepted catch-up proof must drive the real engine executor path:
- `PlanRecovery(...)`
- `NewCatchUpExecutor(...)`
- executor-managed progress / completion
- real `v2bridge` / `blockvol` execution underneath
- at least one accepted rebuild proof must drive the real engine executor path:
- rebuild assignment
- `PlanRebuild(...)`
- `NewRebuildExecutor(...)`
- executor-managed completion
- real `TransferFullBase(...)` underneath
- resource-cleanup proof must include live-path assertions, not only logs:
- active holds released
- retention floor no longer pinned after release
- no surviving session/plan ownership after cancel / invalidation / failure
- observability proof should include executor-generated events, not only planner-side events
- if these thresholds are not met, record `P2` as partial execution progress, not execution closure
Carry-forward note:
- on the chosen `RF=2 sync_all` path, `CommittedLSN` separation is resolved in this slice:
- `CommittedLSN = WALHeadLSN`
- `CheckpointLSN` remains the durable base-image boundary
- this is not yet a blanket truth for every future path or durability mode
- post-checkpoint catch-up remains bounded unless explicitly closed
- rebuild coverage is limited to the first chosen executable path if that is all that lands
Status:
- accepted
- real one-chain execution is now proven for:
- catch-up
- rebuild
- accepted evidence includes:
- `CommittedLSN` separated from `CheckpointLSN` on the chosen `sync_all` path
- live engine plan -> executor -> `v2bridge` -> `blockvol` catch-up chain
- live engine plan -> executor -> `v2bridge` -> `blockvol` rebuild chain
- explicit pin cleanup assertions after execution
- accepted with explicit residual scope:
- `CatchUpStartLSN` is not directly asserted in tests
- rebuild source is not yet forced/verified per source variant
- broader rebuild-source coverage can remain follow-up work
Review checklist:
- is there one accepted catch-up proof from real `P1` control path to real session completion, using `CatchUpExecutor`
- is there one accepted first rebuild proof on the chosen path, using `RebuildExecutor`
- do live-path assertions prove pin/hold release on success, cancel, invalidation, and failure
- do logs/status explain start, cancel, failure, and completion without hidden transitions
- does the delivery avoid overclaiming general post-checkpoint catch-up, broad rebuild coverage, or production readiness
### P3: Hardening Validation
-1. validate diagnosability under the live integrated path
-2. validate retention/pinner behavior under concurrent load
-3. replay the accepted failure-class set again on the newly unified live path after `P1` and `P2` land
-4. confirm the remaining gap to a production candidate
+1. replay the accepted failure-class set again on the unified live path after `P1` + `P2`
+2. validate at least one real failover / promotion / reassignment cycle through the live control path
+3. validate concurrent retention/pinner behavior under overlapping recovery activity
+4. make the committed-truth gate decision explicit for the chosen candidate path
Slice adjustment note:
- if `P2` lands only partially, `P3` should first close the missing execution outcome:
- real catch-up closure if still missing
- real first rebuild closure if still missing
- only after both are real should `P3` spend most of its weight on unified replay, failover / reassignment validation, and concurrent retention / cleanup hardening
Efficiency note:
- `P3` is a hardening-validation slice, not another execution-closure slice
- reuse the accepted `P1` / `P2` live path as the base; do not re-prove already accepted chain mechanics in isolation
- prefer one compact replay matrix over many near-duplicate tests
- prefer one real failover cycle and one true simultaneous-overlap retention case over broad scenario expansion
- the required new outputs are:
- unified replay evidence
- one real failover / reassignment replay
- one concurrent retention/pinner safety result
- one explicit committed-truth gate decision
Validation focus:
- prove the chosen path through a real control-delivery path
- prove the live engine -> executor -> `v2bridge` execution chain as one path, not split evidence
- prove the first rebuild execution path required by the chosen product path
- prove at least one real failover / promotion / reassignment cycle
- prove concurrent retention/pinner behavior does not break recovery guarantees
- unified replay for:
- changed-address restart
- stale epoch / stale session
- unrecoverable gap / needs-rebuild
- post-checkpoint boundary behavior
- at least one real failover / promotion / reassignment cycle
- concurrent retention/pinner safety under at least one true simultaneous-overlap hold case
- logs explain:
- why control truth changed
- why a session was invalidated
- why catch-up vs rebuild was chosen
- why execution completed, failed, or was cancelled
Reject if:
- catch-up semantics are overclaimed beyond the currently proven boundary
- rebuild is claimed as supported without real execution closure
- master/control delivery is claimed as real without the live path in place
- `CommittedLSN` vs `CheckpointLSN` remains an unclassified note instead of a gate decision
- `P1` and `P2` land independently but the accepted failure-class set is not replayed again on the unified live path
- accepted failure classes are still only partially replayed on the unified path
- failover / reassignment is claimed without a real live-path replay
- concurrent retention/pinner behavior leaks pins or violates recovery safety
- logs are too weak to replay causality offline
- the committed-truth gate is still just a note instead of an explicit decision
Status:
- accepted
- unified hardening replay is now proven on the accepted live path
- accepted evidence includes:
- replay of the accepted failure-class set on the unified `P1` + `P2` path
- at least one real failover / reassignment cycle through the live control path
- one true simultaneous-overlap retention/pinner safety proof
- stronger causality assertions for invalidation, escalation, catch-up, and completion
- committed-truth gate decision for the chosen candidate path:
- for the chosen `RF=2 sync_all` candidate path, `CommittedLSN = WALHeadLSN` with `CheckpointLSN` kept separate is accepted as sufficient for the candidate-path hardening boundary
- this is not yet a blanket truth for every future path or durability mode
### P4: Candidate Package Closure
1. classify what is truly ready for a first candidate path
2. package the accepted `P1` / `P2` / `P3` evidence into one bounded candidate package
3. turn carry-forwards into explicit candidate bounds or hard gates
4. state clearly what still remains before production readiness
Goal:
- finish `Phase 08` with one explicit candidate package, not just a collection of accepted slices
Verification mechanism:
- evidence map:
- every candidate claim must point to accepted evidence from `P1` / `P2` / `P3`
- tester validation:
- verify each candidate claim is supported by accepted evidence
- reject any claim that exceeds the proven boundary
- manager validation:
- verify the candidate statement is explicit, bounded, and not confused with production readiness
Output artifacts:
1. candidate-path statement in `phase-08.md`
2. candidate/gate decision record in `phase-08-decisions.md`
3. concise candidate package summary:
- candidate-safe capabilities
- explicit bounds
- deferred / blocking items
4. concise residual-gap summary:
- candidate-safe
- intentionally bounded
- still deferred / still blocking
5. short module/package boundary summary for later phases:
- what is already strong enough
- what moves to the next heavy engineering phase
Efficiency note:
- `P4` should mostly consume already accepted evidence, not create broad new engineering work
- only add implementation work if a small remaining blocker must be closed to make the candidate statement coherent
- if a gap is real but not worth closing in `Phase 08`, classify it explicitly rather than expanding scope implicitly
- `P4` exists inside `Phase 08` so the next phase can begin with substantial engineering work, not a light packaging-only round
Validation focus:
- make the candidate-path boundary explicit:
- what is proven
- what is intentionally bounded
- what is still deferred
- make the candidate package explicit:
- candidate-safe capability list
- evidence-to-claim mapping
- short module/package boundary summary
- make the committed-truth decision explicit:
- accepted for the chosen `RF=2 sync_all` candidate path
- still unclassified for future paths / durability modes unless separately proven
- prove the accepted product path can be described as an engineering candidate, not only as a set of slice-local proofs
- provide one explicit residual-gap list that separates:
- candidate-safe bounds
- future hardening work
- production blockers
Reject if:
- `P4` reopens protocol design instead of closing engineering gaps
- candidate claims are broader than the proven path
- carry-forwards remain informal notes rather than bounds or gates
- production readiness is implied from candidate readiness
- `P4` produces only prose summary without an evidence-to-claim mapping
- `P4` is too thin to leave the next phase with substantial engineering closure work
Status:
- accepted
- the first candidate package is now explicit for the chosen path
- accepted evidence includes:
- candidate-safe claims mapped to accepted `P1` / `P2` / `P3` evidence
- explicit bounds for `RF=2 sync_all`
- explicit deferred / blocking items before production use
- committed-truth decision scoped to the chosen candidate path
- short module/package boundary summary for the next heavy engineering phase
- accepted judgment:
- candidate-safe-with-bounds
- not production-ready
## Guardrails
@ -195,12 +480,13 @@ Especially:
### Guardrail 5: The committed-truth carry-forward must become a gate, not a note
Before the next phase, `Phase 08` must decide one of:
For the chosen `RF=2 sync_all` candidate path, this gate is now decided:
1. committed-truth separation is mandatory before a production-candidate phase
2. the first candidate path is intentionally bounded to the currently proven pre-checkpoint replay behavior
1. `CommittedLSN = WALHeadLSN`
2. `CheckpointLSN` remains the durable base-image boundary
3. this separation is accepted as sufficient for the candidate-path hardening boundary
It must not remain an unclassified carry-forward.
For future paths or durability modes, the gate must still be classified explicitly rather than carried forward informally.
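A minimal Go sketch of how this gate could be encoded so that unclassified paths fail closed. `PathKey`, `LSNView`, `CommittedLSN`, and `ErrUnclassifiedPath` are hypothetical names introduced for illustration; they are not taken from the repository.

```go
package main

import (
	"errors"
	"fmt"
)

// PathKey names a replication path. Hypothetical type, for illustration only.
type PathKey struct {
	ReplicationFactor int
	Durability        string
}

// LSNView carries the two positions the gate keeps separate.
type LSNView struct {
	WALHeadLSN    uint64 // newest LSN in the primary WAL
	CheckpointLSN uint64 // durable base-image boundary
}

// ErrUnclassifiedPath is returned for any path the gate has not decided.
var ErrUnclassifiedPath = errors.New("committed-truth gate not classified for this path")

// CommittedLSN applies the accepted rule for RF=2 sync_all and fails closed
// for every other path instead of guessing a mapping.
func CommittedLSN(p PathKey, v LSNView) (uint64, error) {
	if p.ReplicationFactor == 2 && p.Durability == "sync_all" {
		return v.WALHeadLSN, nil
	}
	return 0, fmt.Errorf("%w: RF=%d %s", ErrUnclassifiedPath, p.ReplicationFactor, p.Durability)
}

func main() {
	v := LSNView{WALHeadLSN: 120, CheckpointLSN: 80}
	lsn, err := CommittedLSN(PathKey{ReplicationFactor: 2, Durability: "sync_all"}, v)
	fmt.Println(lsn, err, "checkpoint stays at", v.CheckpointLSN)
	_, err = CommittedLSN(PathKey{ReplicationFactor: 3, Durability: "sync_quorum"}, v)
	fmt.Println(err)
}
```

Returning an error for every other path keeps the "classify explicitly" rule enforceable in code rather than only in review.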
## Exit Criteria
@ -214,41 +500,36 @@ Phase 08 is done when:
6. operational/debug evidence is sufficient for pre-production use
7. the remaining gap to a production candidate is small and explicit
Phase-close note:
- `Phase 08` is now closed
- next phase:
- `Phase 09: Production Execution Closure`
- start with `P0` planning for real execution completeness:
- real `TransferFullBase`
- real `TransferSnapshot`
- real `TruncateWAL`
- stronger live runtime execution ownership
## Assignment For `sw`
Next tasks:
Current next tasks:
1. drive `Phase 08 P1` as real master/control delivery integration
2. replace direct `AssignmentIntent` construction for the first live path
3. preserve through the real control path:
- stable `ReplicaID`
- epoch fencing
- address-change invalidation
4. include at least one real failover / promotion / reassignment validation target
5. keep acceptance claims scoped:
- real control delivery path
- not yet general production readiness
6. keep explicit carry-forwards:
- `CommittedLSN != CheckpointLSN` still unresolved
- integrated catch-up execution chain still incomplete
- rebuild execution still incomplete
1. close out `Phase 08` bookkeeping only if any wording drift remains
2. move to `Phase 09 P0` planning for production execution closure
3. focus the next heavy engineering package on:
- real `TransferFullBase`
- real `TransferSnapshot`
- real `TruncateWAL`
- stronger live runtime execution ownership
## Assignment For `tester`
Next tasks:
1. use the accepted `Phase 08` plan framing as the `P1` validation oracle
2. validate real control delivery for:
- live assignment delivery
- stable identity through the control path
- stale epoch/session invalidation
- at least one real failover / reassignment cycle
3. keep the no-overclaim rule active around:
- catch-up semantics
- rebuild execution
- master/control delivery
4. keep the committed-truth gate explicit:
- still unresolved in `P1`
5. prepare `P2` follow-up expectations for:
- integrated engine -> executor -> `v2bridge` execution closure
- unified replay after `P1` and `P2`
Current next tasks:
1. treat `Phase 08` as closed after any final wording/bookkeeping sync
2. prepare the `Phase 09 P0` validation oracle for production execution closure
3. keep no-overclaim active around:
- validation-grade transfer vs production-grade transfer
- truncation execution
- stronger runtime ownership vs current bounded path

26
sw-block/.private/phase/phase-09-decisions.md

@ -0,0 +1,26 @@
# Phase 09 Decisions
## Decision 1: Phase 09 is production execution closure, not packaging
The candidate-path packaging/judgment work remains inside `Phase 08 P4`.
`Phase 09` starts directly with substantial backend engineering closure.
## Decision 2: The first Phase 09 targets are real transfer, truncation, and stronger runtime ownership
The initial heavy execution blockers are:
1. real `TransferFullBase`
2. real `TransferSnapshot`
3. real `TruncateWAL`
4. stronger live runtime execution ownership
## Decision 3: Phase 09 remains bounded to the chosen candidate path unless evidence forces expansion
Default scope remains:
1. `RF=2`
2. `sync_all`
3. existing master / volume-server heartbeat path
Future paths or durability modes should not be absorbed casually into this phase.

269
sw-block/.private/phase/phase-09-log.md

@ -0,0 +1,269 @@
# Phase 09 Log
## 2026-03-31
### Opened
`Phase 09` opened as:
- production execution closure
### Starting basis
1. `Phase 08`: closed
2. chosen candidate path exists for `RF=2 sync_all`
3. main remaining heavy engineering work is backend production-grade execution
### Next
1. `Phase 09 P0` planning for:
- real `TransferFullBase`
- real `TransferSnapshot`
- real `TruncateWAL`
- stronger live runtime execution ownership
### P0 Technical Pack
Purpose:
- provide the minimum design/algo/test detail needed to start `Phase 09`
- keep the work centered on backend production execution closure
- avoid broad scope growth into control-plane redesign or product-surface work
#### Execution-closure target
`Phase 09` should close this gap:
- current path is candidate-safe but still partially validation-grade
- next path must be backend-production-grade for the chosen `RF=2 sync_all` path
This phase should not try to make every surrounding product surface complete.
It should make the backend execution path real enough that later phases can build on it.
#### What "real" means in this phase
Use these definitions.
1. Real `TransferFullBase`
- not only "extent is accessible"
- must read and transfer real block/base contents through the execution path
- completion must depend on the transfer actually occurring
2. Real `TransferSnapshot`
- not only "checkpoint exists and is readable"
- must read and transfer the snapshot/base image through the execution path
- tail replay must remain aligned with the transferred snapshot boundary
3. Real `TruncateWAL`
- not only "replica ahead detected"
- must execute the physical correction required by the chosen path
- completion must depend on truncation having actually happened
4. Stronger live runtime execution ownership
- V2 recovery execution should be driven by a stronger live runtime path than test-only orchestration
- the volume-server path should own plan / execute / cancel / cleanup more directly
- avoid split ownership where tests prove the path but the running service still does not drive it coherently
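As a rough illustration of "completion must depend on the transfer actually occurring", the sketch below streams a base image and reports success only when the expected byte count was physically copied. It is an assumption-laden model, not the `v2bridge` implementation: `transferFullBase`, `TransferResult`, `ErrShortTransfer`, and `copyBase` are hypothetical names, and an in-memory buffer stands in for the replica.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
	"io"
)

// TransferResult records the physical effect of a transfer (hypothetical type).
type TransferResult struct {
	Bytes  int64
	Digest [sha256.Size]byte
}

// ErrShortTransfer marks a transfer that moved fewer or more bytes than planned.
var ErrShortTransfer = errors.New("transfer incomplete")

// transferFullBase streams src into dst while hashing the stream. It succeeds
// only if exactly wantBytes were copied, so "accessible" alone never completes.
func transferFullBase(dst io.Writer, src io.Reader, wantBytes int64) (TransferResult, error) {
	h := sha256.New()
	n, err := io.Copy(io.MultiWriter(dst, h), src)
	if err != nil {
		return TransferResult{}, fmt.Errorf("copy base extent: %w", err)
	}
	if n != wantBytes {
		return TransferResult{}, fmt.Errorf("%w: copied %d of %d bytes", ErrShortTransfer, n, wantBytes)
	}
	res := TransferResult{Bytes: n}
	copy(res.Digest[:], h.Sum(nil))
	return res, nil
}

// copyBase is a demo driver: it builds an in-memory base of the given size,
// transfers it, and checks that the replica content matches.
func copyBase(size int, expect int64) (int64, error) {
	base := bytes.Repeat([]byte{0xAB}, size)
	var replica bytes.Buffer
	res, err := transferFullBase(&replica, bytes.NewReader(base), expect)
	if err == nil && !bytes.Equal(replica.Bytes(), base) {
		return 0, errors.New("replica content mismatch")
	}
	return res.Bytes, err
}

func main() {
	n, err := copyBase(4096, 4096)
	fmt.Println("full transfer:", n, err)
	_, err = copyBase(4096, 8192)
	fmt.Println("short transfer:", err)
}
```

The same shape would apply to a snapshot transfer, with the digest giving the receiver something concrete to verify before tail replay starts.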
#### Recommended slice order inside Phase 09
Keep the phase substantial, but still ordered by dependency:
1. `P1` full-base execution closure
- make `TransferFullBase` real
- prove rebuild path no longer depends on accessibility-only validation
2. `P2` snapshot execution closure
- make `TransferSnapshot` real
- prove snapshot/tail rebuild path can use a real transferred base
3. `P3` truncation execution closure
- make `TruncateWAL` real
- prove replica-ahead path is executable, not only detectable
4. `P4` stronger live runtime ownership
- move the accepted execution logic closer to the real volume-server/runtime loop
- prove cleanup / cancel / replacement under the stronger live path
This order is recommended because:
1. transfer closure is the largest production blocker
2. truncation depends on a clearer execution contract
3. runtime ownership should build on real execution, not on validation-grade stubs
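To make the `P3` requirement concrete, here is a small Go model of truncation that is executed and then verified, rather than only detected. `memWAL`, `newWAL`, and `TruncateAfter` are hypothetical names; the in-memory slice is an assumption standing in for the real WAL and superblock state.

```go
package main

import (
	"errors"
	"fmt"
)

// memWAL is a hypothetical in-memory stand-in for a replica WAL.
// LSNs are stored in ascending order.
type memWAL struct {
	lsns []uint64
}

// newWAL builds a WAL holding LSNs 1..n.
func newWAL(n int) *memWAL {
	w := &memWAL{}
	for i := 1; i <= n; i++ {
		w.lsns = append(w.lsns, uint64(i))
	}
	return w
}

// Head returns the newest LSN, or 0 for an empty WAL.
func (w *memWAL) Head() uint64 {
	if len(w.lsns) == 0 {
		return 0
	}
	return w.lsns[len(w.lsns)-1]
}

// ErrTruncateIncomplete signals that the WAL head is still beyond the target.
var ErrTruncateIncomplete = errors.New("truncation did not reach target")

// TruncateAfter drops every entry with LSN > target, then re-checks the head.
// In a real implementation the re-check would read persisted state.
func (w *memWAL) TruncateAfter(target uint64) (int, error) {
	keep := 0
	for keep < len(w.lsns) && w.lsns[keep] <= target {
		keep++
	}
	dropped := len(w.lsns) - keep
	w.lsns = w.lsns[:keep]
	if w.Head() > target {
		return dropped, ErrTruncateIncomplete
	}
	return dropped, nil
}

func main() {
	replica := newWAL(10) // replica is ahead: head LSN 10
	dropped, err := replica.TruncateAfter(7)
	fmt.Println("dropped:", dropped, "head:", replica.Head(), "err:", err)
}
```

Making the operation idempotent (a second call drops nothing) is one way to keep retries after partial failure safe.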
#### Design rules
1. engine still owns policy
- do not move catch-up / rebuild / truncation decision logic into `v2bridge` or `blockvol`
2. `v2bridge` owns real execution translation
- implement real transfer/truncate behavior there or through bounded runtime hooks
- keep it as the execution adapter, not the policy owner
3. `blockvol` owns storage/runtime reality
- WAL
- checkpoint
- extent/snapshot data
- low-level execution primitives
4. chosen-path bounds remain explicit
- `RF=2`
- `sync_all`
- existing master / volume-server heartbeat path
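The split between policy and execution in these rules can be sketched in Go as follows. Everything here is illustrative: `Outcome`, `decide`, `Executor`, `execute`, and `recorder` are hypothetical names, and the comparison rules inside `decide` are an assumed simplification of the engine's real decision logic.

```go
package main

import "fmt"

// Outcome is the engine-side recovery decision (hypothetical enum).
type Outcome int

const (
	ZeroGap Outcome = iota
	CatchUp
	NeedsRebuild
	TruncateReplica
)

// Executor is what an execution adapter exposes: actions, never decisions.
type Executor interface {
	ShipWAL(from, to uint64) error
	TransferFullBase() error
	TruncateWAL(to uint64) error
}

// decide is policy and stays in the engine. retainedFrom is the oldest LSN
// still present in the primary WAL (assumed rule, for illustration).
func decide(replicaLSN, committedLSN, retainedFrom uint64) Outcome {
	switch {
	case replicaLSN > committedLSN:
		return TruncateReplica
	case replicaLSN == committedLSN:
		return ZeroGap
	case replicaLSN+1 >= retainedFrom:
		return CatchUp
	default:
		return NeedsRebuild
	}
}

// execute maps a decision onto the adapter without re-deciding anything.
func execute(o Outcome, ex Executor, replicaLSN, committedLSN uint64) error {
	switch o {
	case CatchUp:
		return ex.ShipWAL(replicaLSN+1, committedLSN)
	case NeedsRebuild:
		return ex.TransferFullBase()
	case TruncateReplica:
		return ex.TruncateWAL(committedLSN)
	default: // ZeroGap: nothing to execute
		return nil
	}
}

// recorder is a fake adapter that only records which action was requested.
type recorder struct{ calls []string }

func (r *recorder) ShipWAL(from, to uint64) error {
	r.calls = append(r.calls, fmt.Sprintf("ship %d..%d", from, to))
	return nil
}
func (r *recorder) TransferFullBase() error {
	r.calls = append(r.calls, "full-base")
	return nil
}
func (r *recorder) TruncateWAL(to uint64) error {
	r.calls = append(r.calls, fmt.Sprintf("truncate to %d", to))
	return nil
}

func main() {
	r := &recorder{}
	for _, replicaLSN := range []uint64{10, 7, 2, 12} {
		if err := execute(decide(replicaLSN, 10, 5), r, replicaLSN, 10); err != nil {
			fmt.Println("execute failed:", err)
		}
	}
	fmt.Println(r.calls)
}
```

Because the adapter interface has no method that returns a decision, policy cannot drift into it without an explicit interface change that a reviewer would see.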
#### Primary reuse / update targets
For each target, state whether the expected action is:
1. `update in place`
2. `reference only`
3. `copy is allowed`
For `Phase 09`, the main targets are:
1. `weed/storage/blockvol/v2bridge/executor.go`
- action: `update in place`
- why:
- this is the direct V2 execution adapter
- real transfer/truncate closure belongs here first
- boundary:
- add real execution behavior
- do not add policy decisions here
2. `weed/storage/blockvol/blockvol.go`
- action: `update in place`
- why:
- authoritative runtime/storage hooks live here
- transfer/truncate/recovery primitives may need to be exposed or tightened here
- boundary:
- expose/execute real runtime behavior
- do not let old replication semantics redefine V2 truth
3. `weed/storage/blockvol/rebuild.go`
- action: `reference only` first, then `update in place` if needed
- why:
- existing rebuild transport/runtime reality may be reused
- boundary:
- reuse transfer reality
- keep rebuild-source choice and recovery policy in the V2 engine
4. `weed/server/volume_server_block.go`
- action: `update in place`
- why:
- stronger live runtime execution ownership will likely terminate here
- boundary:
- strengthen live orchestration/runtime handoff
- do not collapse V2 boundaries into server-local convenience semantics
5. `weed/storage/blockvol/v2bridge/control.go`
- action: `reference only` for `Phase 09` unless execution work exposes a real gap
- why:
- `Phase 09` is not the main control-plane closure phase
6. product surfaces (`CSI`, `NVMe`, `iSCSI`)
- action: `reference only`
- why:
- not in scope for this phase
- avoid accidental scope growth
7. old V1 shipper/rebuild execution paths
- action: `reference only`; copying is not the default
- why:
- they are reality/integration references
- not the default semantic template for V2
When `sw` submits a slice package in this phase, include a short reuse note:
- files updated in place
- files used as references only
- any file copied from older code and why copy was safer than in-place update
#### Validation expectations
For each execution closure target, require:
1. one-chain proof
- engine plan
- engine executor/runtime driver
- `v2bridge`
- real `blockvol` operation
- completion
- cleanup
2. physical-effect proof
- the operation must do real transfer/truncate work, not only validation
3. fail-closed behavior
- partial failure
- cancellation
- replacement
- each case must release its resources correctly
4. observability
- logs explain:
- why the execution started
- what exact execution path ran
- why it completed / failed / cancelled
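The fail-closed expectation can be checked with a test shaped like the Go sketch below, which asserts that a pin is released on success, failure, and cancellation alike. `pinTable`, `runSession`, and `runScenarios` are hypothetical names; the counter is a single-goroutine model, not the real pinner.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// pinTable counts outstanding resource pins (hypothetical, not goroutine-safe).
type pinTable struct{ held int }

// acquire takes one pin and returns an idempotent release function.
func (p *pinTable) acquire() func() {
	p.held++
	released := false
	return func() {
		if !released {
			released = true
			p.held--
		}
	}
}

// runSession pins, runs one execution step, and releases on every exit path.
func runSession(ctx context.Context, pins *pinTable, step func(context.Context) error) error {
	release := pins.acquire()
	defer release()
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("session cancelled before execution: %w", err)
	}
	if err := step(ctx); err != nil {
		return fmt.Errorf("session failed: %w", err)
	}
	return nil
}

// result captures the error and the pin count observed after a session ends.
type result struct {
	Err  error
	Held int
}

// runScenarios drives success, failure, and cancellation through runSession.
func runScenarios() map[string]result {
	pins := &pinTable{}
	out := map[string]result{}
	run := func(ctx context.Context, name string, step func(context.Context) error) {
		err := runSession(ctx, pins, step)
		out[name] = result{Err: err, Held: pins.held}
	}
	ok := func(context.Context) error { return nil }
	bad := func(context.Context) error { return errors.New("transfer interrupted") }

	run(context.Background(), "success", ok)
	run(context.Background(), "failure", bad)
	cancelled, cancel := context.WithCancel(context.Background())
	cancel()
	run(cancelled, "cancel", ok)
	return out
}

func main() {
	for name, r := range runScenarios() {
		fmt.Printf("%s: err=%v pins held=%d\n", name, r.Err, r.Held)
	}
}
```

A replacement scenario would follow the same pattern: end the old session, then assert the pin count before the new session acquires its own.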
#### Suggested validation package
Keep the package focused:
1. one full-base execution test
2. one snapshot execution test
3. one truncation execution test
4. one live runtime ownership test
5. one cleanup/adversarial package covering:
- cancel
- replacement
- partial failure
Avoid:
1. broad matrix growth before these closures are real
2. product-surface tests (`CSI` / `NVMe` / `iSCSI`) in this phase
3. new protocol exploration
#### Assignment template for `sw`
1. Goal
- Build the `Phase 09` production-execution closure package for the chosen candidate path.
2. Required outputs
- explicit definition of "real" for each target:
- `TransferFullBase`
- `TransferSnapshot`
- `TruncateWAL`
- stronger runtime ownership
- slice/package order inside `Phase 09`
- implementation plan for the first heavy closure target
- expected tests/evidence for each target
3. Hard rules
- no protocol redesign
- no broad scope growth into product surfaces
- no moving policy logic into `v2bridge` / `blockvol`
- keep the chosen-path bound explicit
4. Delivery order
- first hand to architect review
- only after architect review passes, hand to tester validation
5. Reject before handoff if
- "real" is still defined as accessibility-only validation
- target order ignores execution dependency
- runtime ownership remains too vague to verify
#### Assignment template for `tester`
1. Goal
- Validate that the `Phase 09` execution-closure plan is concrete enough to support substantial engineering work.
2. Validate
- each target has a real/physical-effect definition
- each target has a one-chain proof expectation
- cleanup and fail-closed expectations are explicit
- the phase remains bounded to the chosen path
3. Output
- pass/fail on execution-target clarity
- findings on vague "real" definitions, missing cleanup proofs, or hidden scope growth

127
sw-block/.private/phase/phase-09.md

@ -0,0 +1,127 @@
# Phase 09
Date: 2026-03-31
Status: active
Purpose: turn the accepted candidate-safe backend path into a production-grade execution path without reopening accepted V2 recovery semantics
## Why This Phase Exists
`Phase 08` closed:
1. real control delivery on the chosen path
2. real one-chain catch-up and rebuild closure on the chosen path
3. unified hardening replay on the accepted live path
4. one bounded candidate package for `RF=2 sync_all`
What still does not exist is production-grade execution completeness.
The main remaining gap is no longer:
1. whether the path is candidate-safe
It is now:
1. whether the backend execution path is production-grade rather than validation-grade
## Phase Goal
Close the main backend execution gaps so the chosen path is no longer blocked by validation-grade transfer/truncation behavior.
## Scope
### In scope
1. real `TransferFullBase`
2. real `TransferSnapshot`
3. real `TruncateWAL`
4. stronger live runtime execution ownership on the volume-server path
### Out of scope
1. broad control-plane redesign
2. `RF>2`
3. `best_effort` / `sync_quorum` recovery semantics
4. product-surface rebinding (`CSI` / `NVMe` / `iSCSI`)
5. broad performance optimization
## Phase 09 Items
### P0: Production Execution Closure Plan
1. convert the accepted candidate package into a production-execution closure plan
2. define the minimum execution blockers that must be closed in this phase
3. order the execution work by dependency and risk
4. keep the chosen-path bound explicit while making the backend path production-grade
Goal:
- start `Phase 09` with one substantial execution-closure plan, not another light packaging round
Must prove:
1. the phase is centered on real backend execution work
2. the required closures are explicit:
- `TransferFullBase`
- `TransferSnapshot`
- `TruncateWAL`
- stronger runtime ownership
3. the phase remains bounded to the chosen candidate path unless new evidence expands it
Verification mechanism:
1. architect review:
- phase shape is substantial and outcome-based
- work is ordered by real engineering dependency
2. tester review:
- validation expectations are explicit for each execution closure target
3. manager review:
- the phase is large enough to justify a full engineering round
Output artifacts:
1. explicit execution-closure target list
2. explicit execution blocker list
3. initial slice/package order inside `Phase 09`
Execution note:
- use `phase-09-log.md` as the technical pack for:
- the definition of "real" for each execution target
- recommended slice order
- validation expectations
- assignment templates for `sw` and `tester`
Reject if:
1. `Phase 09` is framed as another packaging/documentation phase
2. execution blockers remain implicit
3. the phase quietly expands into product surfaces or unrelated control-plane work
4. the phase has no clear verification mechanism
## Assignment For `sw`
Current next tasks:
1. define the concrete execution-closure package for `Phase 09`
2. specify what "real" means for:
- `TransferFullBase`
- `TransferSnapshot`
- `TruncateWAL`
3. specify how stronger live runtime execution ownership should work on the volume-server path
4. keep the phase bounded to the chosen candidate path unless new evidence forces expansion
5. hand the package to architect review before tester work begins
## Assignment For `tester`
Current next tasks:
1. prepare the validation oracle for production execution closure
2. require explicit validation targets for:
- real transfer behavior
- truncation execution
- cleanup on success/failure/cancel
- stronger live runtime ownership
3. keep no-overclaim active around:
- validation-grade vs production-grade execution
- chosen path vs future paths/modes
4. review only after architect pre-review passes

3
sw-block/design/README.md

@ -22,7 +22,10 @@ Current WAL V2 design set:
- `v2-engine-slicing-plan.md`
- `v2-protocol-truths.md`
- `v2-production-roadmap.md`
- `v2-product-completion-overview.md`
- `v2-phase-development-plan.md`
- `phase-07-service-slice-plan.md`
- `phase-08-engine-skeleton-map.md`
- `agent_dev_process.md`
These documents are the working design home for the V2 line.

25
sw-block/design/agent_dev_process.md

@ -101,6 +101,10 @@ Each delivery should include:
3. resources acquired/released
4. test inventory
5. known carry-forward notes
6. reuse note:
- files updated in place
- files used as references only
- files copied and why
This template is required between:
@ -126,6 +130,9 @@ Test inventory:
Carry-forward notes:
- ...
Reuse note:
- ...
```
## Phase Doc Usage
@ -153,6 +160,10 @@ Use for:
3. carry-forward discussion
4. open observations
5. why wording or scope changed
6. slice-level reuse instructions:
- `update in place`
- `reference only`
- `copy is allowed`
This document may be longer and more detailed.
@ -290,6 +301,20 @@ Any such reuse should be reviewed explicitly as:
3. temporary carry-forward
4. hard gate before later phases
### Rule 6: Every substantial slice should declare reuse instructions
Before implementation grows, the slice package should state:
1. which existing files are expected to be updated in place
2. which existing files are reference-only
3. whether any copying is allowed and why
This helps prevent:
1. accidental scope growth
2. unclear ownership of old files
3. hidden semantic inheritance from V1/V1.5 paths
## Current Direction
The project has moved from exploration-heavy work to evidence-first engine work.

301
sw-block/design/phase-08-engine-skeleton-map.md

@ -0,0 +1,301 @@
# Phase 08 Engine Skeleton Map
Date: 2026-03-31
Status: active
Purpose: provide a short structural map for the `Phase 08` hardening path so implementation can move faster without reopening accepted V2 boundaries
## Scope
This is not the final standalone `sw-block` architecture.
It is the shortest useful engine skeleton for the accepted `Phase 08` hardening path:
- `RF=2`
- `sync_all`
- existing `Seaweed` master / volume-server heartbeat path
- V2 engine owns recovery policy
- `blockvol` remains the execution backend
## Module Map
### 1. Control plane
Role:
- authoritative control truth
Primary sources:
- `weed/server/master_grpc_server.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/server/volume_grpc_client_to_master.go`
What it produces:
- confirmed assignment
- `Epoch`
- target `Role`
- failover / promotion / reassignment result
- stable server identity
### 2. Control bridge
Role:
- translate real control truth into V2 engine intent
Primary files:
- `weed/storage/blockvol/v2bridge/control.go`
- `sw-block/bridge/blockvol/control_adapter.go`
- entry path in `weed/server/volume_server_block.go`
What it produces:
- `AssignmentIntent`
- stable `ReplicaID`
- `Endpoint`
- `SessionKind`
### 3. Engine runtime
Role:
- recovery-policy core
Primary files:
- `sw-block/engine/replication/orchestrator.go`
- `sw-block/engine/replication/driver.go`
- `sw-block/engine/replication/executor.go`
- `sw-block/engine/replication/sender.go`
- `sw-block/engine/replication/history.go`
What it decides:
- zero-gap / catch-up / needs-rebuild
- sender/session ownership
- stale authority rejection
- resource acquisition / release
- rebuild source selection
### 4. Storage bridge
Role:
- translate real blockvol storage truth and execution capability into engine-facing adapters
Primary files:
- `weed/storage/blockvol/v2bridge/reader.go`
- `weed/storage/blockvol/v2bridge/pinner.go`
- `weed/storage/blockvol/v2bridge/executor.go`
- `sw-block/bridge/blockvol/storage_adapter.go`
What it provides:
- `RetainedHistory`
- WAL retention pin / release
- snapshot pin / release
- full-base pin / release
- WAL scan execution
### 5. Block runtime
Role:
- execute real I/O
Primary files:
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/recovery.go`
- `weed/storage/blockvol/rebuild.go`
- `weed/storage/blockvol/wal_shipper.go`
What it owns:
- WAL
- extent
- flusher
- checkpoint / superblock
- receiver / shipper
- rebuild server
## Execution Order
### Control path
```text
master heartbeat / failover truth
-> BlockVolumeAssignment
-> volume server ProcessAssignments
-> v2bridge control conversion
-> engine ProcessAssignment
-> sender/session state updated
```
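The conversion step in this path can be pictured with the Go sketch below. It assumes simplified, hypothetical types (`assignmentMsg`, `intent`) rather than the real `BlockVolumeAssignment` and `AssignmentIntent`, and the `ReplicaID` format and `acceptEpoch` rule are assumptions chosen only to show identity that survives an address change and a missing `ServerID` failing closed.

```go
package main

import (
	"errors"
	"fmt"
)

// assignmentMsg is a simplified stand-in for a control-plane assignment.
type assignmentMsg struct {
	VolumeID string
	ServerID string
	Addr     string
	Epoch    uint64
}

// intent is a simplified stand-in for what the engine consumes.
type intent struct {
	ReplicaID string
	Endpoint  string
	Epoch     uint64
}

// ErrMissingServerID makes the conversion fail closed.
var ErrMissingServerID = errors.New("assignment has no ServerID")

// toIntent derives the replica identity from ServerID, never from the address,
// so an address change keeps the same ReplicaID. The ID format is an assumption.
func toIntent(a assignmentMsg) (intent, error) {
	if a.ServerID == "" {
		return intent{}, ErrMissingServerID
	}
	return intent{
		ReplicaID: a.VolumeID + "/" + a.ServerID,
		Endpoint:  a.Addr,
		Epoch:     a.Epoch,
	}, nil
}

// acceptEpoch rejects stale authority: only an equal or newer epoch passes.
func acceptEpoch(current, incoming uint64) bool {
	return incoming >= current
}

func main() {
	before, _ := toIntent(assignmentMsg{VolumeID: "vol1", ServerID: "srv-a", Addr: "10.0.0.1:9000", Epoch: 3})
	after, _ := toIntent(assignmentMsg{VolumeID: "vol1", ServerID: "srv-a", Addr: "10.0.0.9:9000", Epoch: 4})
	fmt.Println("same identity:", before.ReplicaID == after.ReplicaID)
	fmt.Println("endpoint changed:", before.Endpoint != after.Endpoint)
	_, err := toIntent(assignmentMsg{VolumeID: "vol1", Addr: "10.0.0.1:9000", Epoch: 4})
	fmt.Println("missing ServerID:", err)
	fmt.Println("stale epoch accepted:", acceptEpoch(4, 3))
}
```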
### Catch-up path
```text
assignment accepted
-> engine reads retained history
-> engine plans catch-up
-> storage bridge pins WAL retention
-> engine executor drives v2bridge executor
-> blockvol scans WAL / ships entries
-> engine completes session
```
### Rebuild path
```text
assignment accepted
-> engine detects NeedsRebuild
-> engine selects rebuild source
-> storage bridge pins snapshot/full-base/tail
-> executor drives transfer path
-> blockvol performs restore / replay work
-> engine completes rebuild
```
### Local durability path
```text
WriteLBA / Trim
-> WAL append
-> shipping / barrier
-> client-visible durability decision
-> flusher writes extent
-> checkpoint advances
-> retention floor decides WAL reclaimability
```
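The last step of this path, where the retention floor decides reclaimability, can be sketched as a minimum over the checkpoint and any active pins. This is an assumed convention for illustration only: `retentionFloor` and `reclaimable` are hypothetical helpers, each pin is taken to be the first LSN a recovery session still needs, and entries strictly below the floor are treated as reclaimable.

```go
package main

import "fmt"

// retentionFloor returns the lowest LSN that must stay in the WAL.
// flushedThrough is the checkpoint LSN: entries up to it are already in the
// extent. Each pin is the first LSN that a recovery session still needs.
func retentionFloor(flushedThrough uint64, pins []uint64) uint64 {
	floor := flushedThrough + 1
	for _, p := range pins {
		if p < floor {
			floor = p
		}
	}
	return floor
}

// reclaimable reports whether the WAL entry at lsn may be dropped.
func reclaimable(lsn, floor uint64) bool {
	return lsn < floor
}

func main() {
	floor := retentionFloor(100, []uint64{60, 90})
	fmt.Println("floor with pins:", floor)
	fmt.Println("LSN 59 reclaimable:", reclaimable(59, floor))
	fmt.Println("LSN 60 reclaimable:", reclaimable(60, floor))
	fmt.Println("floor without pins:", retentionFloor(100, nil))
}
```

Under this convention a catch-up pin can hold WAL entries that the checkpoint alone would already allow to be reclaimed, which is the safety property the pinner exists to provide.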
## Interim Fields
These are currently acceptable only as explicit hardening carry-forwards:
### `localServerID`
Current source:
- `BlockService.listenAddr`
Meaning:
- temporary local identity source for replica/rebuild-side assignment translation
Status:
- interim only
- should become registry-assigned stable server identity later
### `CommittedLSN = CheckpointLSN`
Current source:
- `v2bridge.Reader` / `BlockVol.StatusSnapshot()`
Meaning:
- current V1-style interim mapping where committed truth collapses to local checkpoint truth
Status:
- not final V2 truth
- must become a gate decision before a production-candidate phase
### heartbeat as control carrier
Current source:
- existing master <-> volume-server heartbeat path
Meaning:
- current transport for assignment/control delivery
Status:
- acceptable as current carrier
- not yet a final proof that no separate control channel will ever be needed
## Hard Gates
These should remain explicit in `Phase 08`:
### Gate 1: committed truth
Before production-candidate:
- either separate `CommittedLSN` from `CheckpointLSN`
- or explicitly bound the first candidate path to currently proven pre-checkpoint replay behavior
### Gate 2: live control delivery
Required:
- real assignment delivery must reach the engine on the live path
- not only converter-level proof
### Gate 3: integrated catch-up closure
Required:
- engine -> executor -> `v2bridge` -> blockvol must be proven as one live chain
- not planner proof plus direct WAL-scan proof as separate evidence
### Gate 4: first rebuild execution path
Required:
- rebuild must not remain only a detection outcome
- the chosen product path needs one real executable rebuild closure
### Gate 5: unified replay
Required:
- after control and execution closure land, rerun the accepted failure-class set on the unified live path
## Reuse Map
### Reuse directly
- `weed/server/master_grpc_server.go`
- `weed/server/volume_grpc_client_to_master.go`
- `weed/server/volume_server_block.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/replica_barrier.go`
- `weed/storage/blockvol/v2bridge/`
### Reuse as implementation reality, not truth
- `shipperGroup`
- `RetentionFloorFn`
- `ReplicaReceiver`
- checkpoint/superblock machinery
- existing failover heuristics
### Do not inherit as V2 semantics
- address-shaped identity
- old degraded/catch-up intuition from V1/V1.5
- `CommittedLSN = CheckpointLSN` as final truth
- blockvol-side recovery policy decisions
## Short Rule
Use this skeleton as:
- a hardening map for the current product path
Do not mistake it for:
- the final standalone `sw-block` architecture

231
sw-block/design/v2-phase-development-plan.md

@ -0,0 +1,231 @@
# V2 Phase Development Plan
Date: 2026-03-31
Status: active
Purpose: define the execution-oriented phase plan after the current candidate-path work, with explicit module status and target phase ownership
## Why This Document Exists
The project now needs a development plan that is:
1. phase-oriented
2. execution-oriented
3. large enough to avoid overhead-heavy micro-slices
4. explicit about which module belongs to which future phase
This document is the planning bridge between:
1. `v2-product-completion-overview.md`
2. `../.private/phase/phase-08.md`
3. future implementation phases
## Planning Rules
Use these rules for all later phases:
1. one phase should close one meaningful product/engineering outcome
2. every phase must have a clear verification mechanism
3. phases should prefer real code/test/evidence over wording-only progress
4. later phases may reuse V1 engineering reality, but must not inherit V1 recovery semantics as truth
5. a phase is too small if it does not move the overall product-completion state clearly
## Current Baseline
Current accepted/closing path through `Phase 08`:
1. protocol/algo truth set is strong
2. engine recovery core is strong
3. real control delivery exists on the chosen path
4. real one-chain catch-up and rebuild closure exist on the chosen path
5. unified hardening validation exists on the chosen path
6. one bounded candidate statement exists for:
- `RF=2`
- `sync_all`
- existing master / volume-server heartbeat path
Phase-accounting note:
1. this document assumes the current `Phase 08` path and bookkeeping are being closed consistently
2. if `Phase 08` bookkeeping is still open, read the candidate statement items above as the current accepted/closing path, not as a fully closed phase label
This means the next phases should focus mainly on:
1. production-grade execution completeness
2. stronger runtime ownership
3. stronger control-plane closure
4. later product-surface rebinding
5. production hardening
## Phase Roadmap
### Phase 09: Production Execution Closure
Goal:
1. turn validation-grade backend execution into production-grade backend execution
Must prove:
1. full-base rebuild performs real data transfer
2. snapshot rebuild performs real image transfer
3. replica-ahead path is physically executable, not only detected
4. runtime execution ownership is stronger than the current bounded candidate path
Typical outputs:
1. real `TransferFullBase`
2. real `TransferSnapshot`
3. real `TruncateWAL`
4. stronger executor/runtime integration in the live volume-server path
Verification mechanism:
1. one-chain execution tests on real backend paths
2. cleanup assertions after success/failure/cancel
3. focused adversarial tests for truncation and rebuild execution
Workload:
1. large
2. this is likely the single biggest remaining engineering phase
### Phase 10: Real Control-Plane Closure
Goal:
1. move from the accepted assignment-entry closure to fuller end-to-end control-plane closure
Must prove:
1. heartbeat/gRPC-level delivery is real for the chosen path
2. failover / reassignment state converges through the real control path
3. local and remote identity are consistent enough for product use
Typical outputs:
1. stronger heartbeat/gRPC delivery proof
2. stronger result/reporting convergence
3. cleaner local identity than transport-shaped `listenAddr`
Verification mechanism:
1. real failover/reassignment tests at the fuller control-plane level
2. identity/fencing assertions through the end-to-end path
Workload:
1. medium-large
### Phase 11: Product Surface Rebinding
Goal:
1. bind product-facing surfaces onto the V2-backed block path after backend closure is strong enough
Must prove:
1. the V2-backed backend can support selected product surfaces without semantic drift
2. reuse of V1 surfaces does not reintroduce V1 recovery truth
Candidate areas:
1. snapshot product path
2. `CSI`
3. `NVMe`
4. `iSCSI`
Verification mechanism:
1. selected surface integration tests
2. product-surface contract checks
3. no-overclaim review that the surface does not imply unsupported backend capability
Workload:
1. medium-large
2. can be split by product surface if needed, but only after backend closure is strong
### Phase 12: Production Hardening
Goal:
1. move from candidate-safe to production-safe
Must prove:
1. restart/recovery stability under repeated disturbance
2. long-run/soak viability
3. operational diagnosability
4. an explicit, accepted production-blocker list or a production-ready gate decision
Verification mechanism:
1. soak/adversarial runs
2. failover/restart under disturbance
3. runbook/debug validation
Workload:
1. large
## Module Status Map
| Module area | Current status | Current owner phase | Next target phase | Notes |
| ------------------------------------------------------------- | ---------------------------- | ------------------------ | ----------------- | ------------------------------------------------------------------------------------------ |
| `sw-block/engine/replication` core FSM/orchestrator/driver | Strong | `Phase 08` accepted | `Phase 09` | Main later work is runtime/product execution closure, not new core semantics. |
| Engine executor real I/O boundary (`CatchUpIO` / `RebuildIO`) | Strong on chosen path | `Phase 08 P2/P3` | `Phase 09` | Keep the boundary; make underlying transfer/truncate production-grade. |
| `weed/storage/blockvol/v2bridge/control.go` | Strong on chosen path | `Phase 08 P1` | `Phase 10` | Next step is fuller control-plane closure, not new mapping semantics. |
| `weed/storage/blockvol/v2bridge/reader.go` | Strong | `Phase 08 P2/P3` | `Phase 09/10` | Keep comments/status aligned with candidate-path committed-truth decision. |
| `weed/storage/blockvol/v2bridge/pinner.go` | Strong | `Phase 08 P1/P3` | `Phase 09` | Retention safety proven; later work is product-grade execution under that safety. |
| `weed/storage/blockvol/v2bridge/executor.go` WAL scan | Strong | `Phase 08 P2` | `Phase 09` | Real scan is good; later work is real transfer/truncate completeness. |
| `v2bridge` `TransferFullBase` | Partial | `Phase 08 P2/P4` | `Phase 09` | Validation-grade now; target is real production streaming. |
| `v2bridge` `TransferSnapshot` | Partial | `Phase 08 P2/P4` | `Phase 09` | Validation-grade now; target is real image transfer. |
| `v2bridge` `TruncateWAL` | Weak/stub | `Phase 08 P4` bound | `Phase 09` | Must become a real executable path. |
| `weed/server/volume_server_block.go` V2 assignment intake | Medium-strong | `Phase 08 P1` | `Phase 09/10` | Real intake exists; later work is stronger runtime ownership + fuller control-plane proof. |
| `blockvol` WAL/flusher/checkpoint runtime | Reuse reality | Existing production code | `Phase 09` | Reuse implementation; do not let old semantics redefine V2 truth. |
| `blockvol` rebuild transport/server reality | Reuse with redesign boundary | Existing production code | `Phase 09` | Good area for production execution closure work. |
| local server identity (`localServerID`) | Partial | `Phase 08` bounded | `Phase 10` | Still transport-shaped; should become cleaner under control-plane closure. |
| Snapshot product path | Partial/reuse candidate | not core in `Phase 08` | `Phase 11` | Reuse implementation, but V2 semantics own placement and claims. |
| `CSI` integration | Deferred reuse candidate | not core in `Phase 08` | `Phase 11` | Product surface, not next core closure target. |
| `NVMe` / `iSCSI` front-ends | Deferred reuse candidate | not core in `Phase 08` | `Phase 11` | Rebind after backend path is stronger. |
| Testrunner / infra / metrics | Strong support layer | existing | `Phase 10-12` | Reuse to validate later control-plane and hardening phases. |
## Completion-State Targets
Use these rough targets to judge whether a phase is moving the product meaningfully.
| Phase | Expected completion move |
| ---------- | ----------------------------------------------------------------------------- |
| `Phase 09` | from validation-grade backend execution to production-grade backend execution |
| `Phase 10` | from bounded control-entry proof to stronger end-to-end control-plane closure |
| `Phase 11` | from backend-ready path to selected product-surface readiness |
| `Phase 12` | from candidate-safe to production-safe |
## Near-Term Execution Direction
If the goal is to maximize product completion efficiently, the recommended order is:
1. finish `Phase 08` bookkeeping cleanly
2. `Phase 09` production execution closure
3. `Phase 10` real control-plane closure
4. `Phase 11` product surface rebinding
5. `Phase 12` production hardening
The most important near-term engineering weight should go to `Phase 09`.
## Short Summary
The V2 line already has a real bounded candidate path.
The next development plan should treat later work as product-completion phases, not more protocol discovery.
The main heavy engineering work still ahead is:
1. production-grade execution
2. stronger runtime/control closure
3. later product-surface rebinding
4. production hardening

267
sw-block/design/v2-product-completion-overview.md

@ -0,0 +1,267 @@
# V2 Product Completion Overview
Date: 2026-03-31
Status: active
Purpose: provide one product-level overview of current V2 engineering completion, V1 reuse strategy, and the roadmap from the accepted candidate path to a production-ready block engine
## Why This Document Exists
The project now has enough accepted V2 algorithm, engine, and hardening evidence that the next question is no longer only:
1. is the protocol correct
It is also:
1. how complete is the product path
2. which parts are already strong
3. which parts can reuse V1 engineering
4. which parts still require major implementation work
5. which future phases actually move product completion
This document is the product-completion view.
It complements:
1. `v2-protocol-truths.md` for accepted semantics
2. `v2-production-roadmap.md` for the older roadmap ladder
3. `../.private/phase/phase-08.md` for current phase contract
## Current Position
The accepted first candidate path is:
1. `RF=2`
2. `sync_all`
3. existing master / volume-server heartbeat path
4. V2 engine owns recovery policy
5. `v2bridge` translates real storage/control truth
6. `blockvol` remains the execution backend
This means the project is no longer at "algorithm only".
It already has:
1. accepted protocol truths
2. accepted engine execution closure
3. accepted hardening replay on a real integrated path
4. one bounded candidate statement
## Engineering Completion Snapshot
These levels are rough engineering estimates, not exact percentages.
| Area | Current level | Notes |
|------|---------------|-------|
| Algorithm / protocol truths | Strong | Core V2 semantics are accepted and should remain stable unless contradicted by live evidence. |
| Simulator / prototype evidence | Strong | Main failure classes and protocol boundaries are already well-exercised. |
| Engine recovery core | Strong | Sender/session/orchestrator/driver/executor are substantially implemented. |
| Weed bridge integration | Strong | Reader / pinner / control / executor are real and tested on the chosen path. |
| Integrated candidate path | Medium-strong | `Phase 08` `P1` + `P2` + `P3` prove one bounded candidate path. |
| Runtime ownership inside live server loop | Medium | Real intake exists, but full product-grade recovery ownership is not yet fully closed. |
| Production-grade data transfer | Medium-weak | Validation-grade transfer exists; full production byte streaming is still incomplete. |
| Truncation / replica-ahead execution | Weak | Detection exists; full execution path is still incomplete. |
| End-to-end control-plane closure | Medium | `ProcessAssignments()` is real; full heartbeat/gRPC proof is still bounded. |
| Product surfaces (`CSI`, `NVMe`, `iSCSI`, snapshot productization) | Partial | Mostly reuse candidates, but not the current core closure target. |
| Production hardening / ops | Partial | Candidate-level evidence exists; production-grade hardening is still ahead. |
## Reuse Strategy
Use this rule:
1. if a component decides truth, V2 must own it
2. if a component consumes truth, V1 engineering can often be reused
### V2-owned semantics
These should not inherit V1 semantics casually:
1. recovery choice: `zero_gap` / `catchup` / `needs_rebuild`
2. sender/session ownership and fencing
3. stable `ReplicaID` and stale-authority rejection
4. committed/checkpoint interpretation
5. rebuild-source choice and recovery outcome meaning
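The recovery choice in item 1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `ChooseRecovery`, its parameters, and the comparison rules are hypothetical and are not the engine's actual API. The replica-ahead case is returned as an error because it needs truncation, which this sketch does not cover.

```go
package main

import (
	"errors"
	"fmt"
)

// RecoveryChoice names the three outcomes listed in the document.
type RecoveryChoice string

const (
	ZeroGap      RecoveryChoice = "zero_gap"
	CatchUp      RecoveryChoice = "catchup"
	NeedsRebuild RecoveryChoice = "needs_rebuild"
)

// ErrReplicaAhead marks the replica-ahead case, which requires truncation and
// is deliberately left outside this sketch.
var ErrReplicaAhead = errors.New("replica ahead of committed LSN")

// ChooseRecovery is a hypothetical decision function (assumed, not the real API).
//   replicaLSN:   highest LSN the replica holds
//   walTailLSN:   oldest LSN the primary still retains in its WAL
//   committedLSN: the primary's committed LSN
func ChooseRecovery(replicaLSN, walTailLSN, committedLSN uint64) (RecoveryChoice, error) {
	switch {
	case replicaLSN > committedLSN:
		return "", ErrReplicaAhead
	case replicaLSN == committedLSN:
		return ZeroGap, nil
	case replicaLSN+1 >= walTailLSN:
		// Every missing entry (replicaLSN+1 .. committedLSN) is still retained.
		return CatchUp, nil
	default:
		// Part of the gap has been recycled; only a base transfer can close it.
		return NeedsRebuild, nil
	}
}

func main() {
	// Primary retains LSNs 50..100; try replicas at several positions.
	for _, replicaLSN := range []uint64{100, 60, 10, 120} {
		choice, err := ChooseRecovery(replicaLSN, 50, 100)
		fmt.Println(replicaLSN, choice, err)
	}
}
```

The point of keeping this decision in one V2-owned function is that lower layers only receive the outcome; they never re-derive it from their own view of the WAL.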
### V1 engineering that is usually reusable
These are implementation/reality layers, not protocol truth:
1. `blockvol` storage runtime
2. WAL / flusher / checkpoint machinery
3. real assignment receive/apply path
4. front-end adapters such as `NVMe` / `iSCSI`
5. much of `CSI` lifecycle integration
6. monitoring / metrics / test harness infrastructure
### Reuse with explicit bounds
These can reuse implementation, but their semantic placement must remain V2-owned:
1. snapshot export / checkpoint plumbing
2. rebuild transport / extent read path
3. master/heartbeat/control delivery path
## Module Treatment Overview
| Module area | Current treatment | Near-term plan |
|-------------|-------------------|----------------|
| Recovery engine | V2-owned | Continue closing runtime/product path under accepted semantics. |
| `v2bridge` | V2 boundary adapter | Keep expanding real I/O/runtime closure without leaking policy downward. |
| `blockvol` WAL/flusher/runtime | Reuse reality | Reuse implementation, but do not let V1 replication semantics redefine V2 truth. |
| Snapshot capability | Reuse implementation, V2-owned semantics | Do not make this a main near-term phase goal until core execution/runtime closure is stronger. |
| `CSI` | Later product surface | Rebind after the V2-backed candidate path is stable enough. |
| `NVMe` / `iSCSI` | Later product surface | Reuse as front-end adapters once the backend candidate path is stronger. |
| Rebuild server / transfer mechanisms | Reuse with redesign boundary | Good candidate for later production execution closure work. |
| Control plane | Reuse existing path | Continue from `ProcessAssignments()` toward stronger end-to-end closure. |
## What The Candidate Path Already Proves
For the chosen `RF=2 sync_all` path, the project can already claim:
1. stable remote identity across address change when `ServerID` is present
2. stale epoch/session fencing through the integrated path
3. real catch-up one-chain closure on the chosen path
4. rebuild control/execution chain proven on the chosen path
   - validation-grade execution closure
   - not yet production-grade block/image streaming
5. replay of accepted failure classes on the unified live path
6. one real failover / reassignment cycle
7. one true simultaneous-overlap retention safety proof
8. committed/checkpoint separation accepted for this candidate path:
   - `CommittedLSN = WALHeadLSN`
   - `CheckpointLSN` remains the durable base-image boundary
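The committed/checkpoint separation in item 8 can be sketched as a small value type. The type `LSNView`, its methods, and the invariant that the checkpoint never runs past the committed boundary are assumptions made for illustration; only the two rules themselves (`CommittedLSN = WALHeadLSN`, and `CheckpointLSN` as the base-image boundary) come from the text above.

```go
package main

import "fmt"

// LSNView is a hypothetical snapshot of the boundaries a storage reader reports.
type LSNView struct {
	CheckpointLSN uint64 // durable base-image boundary
	WALHeadLSN    uint64 // newest LSN in the WAL
}

// CommittedLSN applies the candidate-path rule CommittedLSN = WALHeadLSN.
func (v LSNView) CommittedLSN() uint64 { return v.WALHeadLSN }

// Validate rejects a view whose checkpoint lies beyond the committed boundary
// (an assumed invariant, not one stated in the document).
func (v LSNView) Validate() error {
	if v.CheckpointLSN > v.CommittedLSN() {
		return fmt.Errorf("checkpoint %d is beyond committed %d",
			v.CheckpointLSN, v.CommittedLSN())
	}
	return nil
}

// TailAfterCheckpoint returns the inclusive LSN range that would have to be
// replayed on top of a base image taken at CheckpointLSN.
// ok is false when the base image already covers everything committed.
func (v LSNView) TailAfterCheckpoint() (from, to uint64, ok bool) {
	if v.CheckpointLSN >= v.CommittedLSN() {
		return 0, 0, false
	}
	return v.CheckpointLSN + 1, v.CommittedLSN(), true
}

func main() {
	v := LSNView{CheckpointLSN: 40, WALHeadLSN: 100}
	from, to, ok := v.TailAfterCheckpoint()
	fmt.Println(v.Validate(), from, to, ok)
}
```

Keeping the two boundaries as separate fields makes it harder for a caller to treat "checkpointed" as "committed" by accident.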
## What Is Still Missing For Product Completion
The biggest remaining product-completion gaps are:
1. production-grade rebuild data transfer
   - `TransferFullBase` must become real streaming, not only accessibility validation
   - `TransferSnapshot` must become real image streaming, not only checkpoint validation
2. replica-ahead physical correction
   - `TruncateWAL` must stop being a stub
3. stronger live runtime ownership
   - the V2 recovery driver/executors should become a more complete live runtime path, not only a bounded hardening path
4. stronger control-plane closure
   - current proof reaches `ProcessAssignments()`
   - full heartbeat/gRPC-level closure is still bounded
5. product-surface rebinding
   - `CSI`
   - `NVMe`
   - `iSCSI`
   - snapshot product path
6. production hardening
   - restart / soak / repeated disturbance / diagnosis quality
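The difference in gap 1 between accessibility validation and real streaming can be sketched as follows. This is a minimal illustration under assumptions: `StreamBase` and `CopyImage` are hypothetical names, the CRC-32 check is a swapped-in integrity mechanism, and nothing here reflects the actual `rebuild.go` wire protocol. The sketch moves every byte of a base image in bounded chunks and fails on a short transfer, rather than only confirming that the source can be opened.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
	"io"
)

// StreamBase copies exactly size bytes from src to dst using a chunk-sized
// buffer and returns the CRC-32 (IEEE) of the bytes written.
func StreamBase(dst io.Writer, src io.Reader, size int64, chunk int) (uint32, error) {
	if chunk <= 0 {
		return 0, fmt.Errorf("chunk size must be positive, got %d", chunk)
	}
	sum := crc32.NewIEEE()
	// Each chunk goes to both the destination and the running checksum.
	w := io.MultiWriter(dst, sum)
	n, err := io.CopyBuffer(w, io.LimitReader(src, size), make([]byte, chunk))
	if err != nil {
		return 0, err
	}
	if n != size {
		return 0, fmt.Errorf("short transfer: got %d of %d bytes", n, size)
	}
	return sum.Sum32(), nil
}

// CopyImage is an in-memory convenience wrapper used only for the demo.
func CopyImage(image []byte, chunk int) ([]byte, uint32, error) {
	var out bytes.Buffer
	sum, err := StreamBase(&out, bytes.NewReader(image), int64(len(image)), chunk)
	return out.Bytes(), sum, err
}

func main() {
	image := bytes.Repeat([]byte{0xAB}, 1<<20) // 1 MiB stand-in for a base image
	out, sum, err := CopyImage(image, 64<<10)
	fmt.Println(len(out), sum == crc32.ChecksumIEEE(image), err)
}
```

In a real path `src` and `dst` would be a volume extent reader and a network connection; the bounded buffer keeps memory use independent of image size.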
## Recommended Completion Roadmap
### Stage 1: Finish Phase 08 cleanly
Target:
1. close candidate-path judgment with explicit bounds and package it cleanly inside `Phase 08 P4`
Main output:
1. one accepted candidate package for the chosen path
### Stage 2: Phase 09 Production Execution Closure
Target:
1. turn validation-grade execution into production-grade execution
Main work:
1. real `TransferFullBase`
2. real `TransferSnapshot`
3. real `TruncateWAL`
4. stronger runtime ownership of recovery execution
Why it matters:
This is the largest remaining engineering block between "candidate-safe-with-bounds" and a serious product path.
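To make the `TruncateWAL` item concrete, here is a minimal in-memory sketch of replica-ahead correction. It assumes that correction means discarding entries beyond a boundary the primary confirms; the `WAL` and `Entry` types and the `TruncateAfter` method are hypothetical, and the sketch does not model on-disk layout, persistence, or superblock updates.

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a hypothetical WAL record; only the LSN matters here.
type Entry struct {
	LSN  uint64
	Data []byte
}

// WAL is an in-memory stand-in for a replica's log, ordered by ascending LSN.
type WAL struct {
	entries []Entry
}

// Append adds an entry; LSNs must be strictly increasing.
func (w *WAL) Append(e Entry) error {
	if n := len(w.entries); n > 0 && e.LSN <= w.entries[n-1].LSN {
		return fmt.Errorf("LSN %d is not after head %d", e.LSN, w.entries[n-1].LSN)
	}
	w.entries = append(w.entries, e)
	return nil
}

// HeadLSN returns the newest LSN, or 0 for an empty log.
func (w *WAL) HeadLSN() uint64 {
	if len(w.entries) == 0 {
		return 0
	}
	return w.entries[len(w.entries)-1].LSN
}

// TruncateAfter drops every entry with LSN > keepLSN and reports how many
// entries were removed. Calling it again with the same keepLSN removes nothing.
func (w *WAL) TruncateAfter(keepLSN uint64) int {
	// Binary search for the first entry beyond the boundary.
	cut := sort.Search(len(w.entries), func(i int) bool {
		return w.entries[i].LSN > keepLSN
	})
	removed := len(w.entries) - cut
	w.entries = w.entries[:cut]
	return removed
}

func main() {
	var w WAL
	for lsn := uint64(1); lsn <= 10; lsn++ {
		if err := w.Append(Entry{LSN: lsn}); err != nil {
			panic(err)
		}
	}
	// The replica holds 1..10 but the confirmed boundary is 7.
	fmt.Println(w.TruncateAfter(7), w.HeadLSN())
}
```

Idempotence matters for a real executor: a truncation that is retried after a crash or a repeated assignment should leave the log unchanged.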
### Stage 3: Phase 10 Real Control-Plane Closure
Target:
1. strengthen the accepted assignment-entry closure into fuller end-to-end control-path closure
Main work:
1. heartbeat/gRPC-level proof
2. stronger control/result convergence
3. better identity completeness for local and remote server roles
### Stage 4: Phase 11 Product Surface Rebinding
Target:
1. connect product-facing surfaces to the V2-backed block path
Candidate areas:
1. snapshot product path
2. `CSI`
3. `NVMe`
4. `iSCSI`
Rule:
Do this after the backend engine/runtime path is strong enough, not before.
### Stage 5: Phase 12 Production Hardening
Target:
1. move from candidate-safe to production-safe
Main work:
1. soak / restart / repeated failover
2. operational diagnosis quality
3. performance floor and cost characterization
4. explicit production blockers / rollout gates
## Completion Gates
The most important gates from here are:
1. execution gate
   - validation-grade transfer/truncation must become production-grade
2. runtime ownership gate
   - V2 recovery must be a stronger live runtime path, not only a bounded tested path
3. control-plane gate
   - stronger end-to-end control delivery proof
4. product-surface gate
   - front-end surfaces should only rebind after backend correctness is strong enough
5. production-hardening gate
   - restart, soak, diagnosis, and repeated disturbance must be acceptable
## Near-Term Planning Guidance
If the goal is to maximize product completion efficiently:
1. do not make `CSI`, `NVMe`, or broad snapshot productization the immediate next heavy phase
2. first close production execution gaps in the backend path
3. then strengthen control-plane closure
4. then rebind product surfaces
In short:
1. backend truth and execution first
2. product surfaces second
3. production hardening last
## Short Summary
The V2 line is already beyond "algorithm only".
It has a real bounded candidate path.
But the remaining work is still substantial, and it is mostly engineering work:
1. production-grade execution
2. stronger runtime/control closure
3. product-surface rebinding
4. production hardening
That is the practical path from the current candidate-safe engine to a production-ready block product.