Browse Source
feat: Phase 12 P3+P4 — diagnosability surfaces, perf floor, rollout gates
feat: Phase 12 P3+P4 — diagnosability surfaces, perf floor, rollout gates
P3: Add explicit bounded read-only diagnosis surfaces for all symptom classes: - FailoverDiagnostic: volume-oriented failover state with per-volume DeferredPromotion/PendingRebuild entries and proper timer lifecycle - PublicationDiagnostic: two-read coherence check (LookupBlockVolume vs registry authority) with computed Coherent verdict - RecoveryDiagnostic: minimal ActiveTasks surface (Path A) - Blocker ledger: 3 diagnosed + 3 unresolved, finite, from actual file - Runbook references only exposed surfaces, no internal state P4: Add bounded performance floor + rollout-gate package: - Engine-local floor measurement with explicit IOPS gates per workload - Cost characterization: WAL 2x write amp, -56% replication tax - Rollout gates with semantic cross-checks against cited evidence (baseline numbers, transport/network matrix, blocker counts) - Launch envelope tightened to actually measured combinations only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>feature/sw-block
9 changed files with 1978 additions and 12 deletions
-
29sw-block/.private/phase/phase-12-p3-blockers.md
-
70sw-block/.private/phase/phase-12-p3-runbook.md
-
101sw-block/.private/phase/phase-12-p4-floor.md
-
64sw-block/.private/phase/phase-12-p4-rollout-gates.md
-
374weed/server/block_recovery.go
-
192weed/server/master_block_failover.go
-
270weed/server/qa_block_diagnosability_test.go
-
582weed/server/qa_block_perf_test.go
-
308weed/server/qa_block_soak_test.go
@ -0,0 +1,29 @@ |
|||
# Phase 12 P3 — Blocker Ledger |
|||
|
|||
Date: 2026-04-02 |
|||
Scope: bounded diagnosability / blocker accounting for the accepted RF=2 sync_all chosen path |
|||
|
|||
## Diagnosed and Bounded |
|||
|
|||
| ID | Symptom | Evidence Surface | Owning Truth | Status | |
|||
|----|---------|-----------------|--------------|--------| |
|||
| B1 | Failover does not converge | failover logs + registry Lookup epoch/primary | registry authority | Diagnosed: convergence depends on lease expiry + heartbeat cycle; bounded by lease TTL | |
|||
| B2 | Lookup publication stale after failover | LookupBlockVolume response vs registry entry | registry ISCSIAddr/VolumeServer | Diagnosed: publication updates on failover assignment delivery; bounded by assignment queue delivery | |
|||
| B3 | Recovery tasks remain after volume delete | RecoveryManager.DiagnosticSnapshot | RecoveryManager task map | Diagnosed: tasks drain on shutdown/cancel; bounded by RecoveryManager lifecycle | |
|||
|
|||
## Unresolved but Explicit |
|||
|
|||
| ID | Symptom | Current Evidence | Why Unresolved | Blocks P4/Rollout? | |
|||
|----|---------|-----------------|----------------|-------------------| |
|||
| U1 | V2 engine accepts stale-epoch assignments at orchestrator level | V2 idempotence check skips only same-epoch; lower epoch creates new sender | Engine ApplyAssignment does not check epoch monotonicity on Reconcile | No — V1 HandleAssignment rejects epoch regression; V2 is secondary | |
|||
| U2 | Single-process test cannot exercise Primary→Rebuilding role transition | HandleAssignment rejects transition in shared store | Test harness limitation, not production bug | No — production VS has separate stores | |
|||
| U3 | gRPC stream transport not exercised in control-loop tests | All logic above/below stream is real; stream itself bypassed | Would require live master+VS gRPC servers in test | Blocks full integration test, not correctness | |
|||
|
|||
## Out of Scope for P3 |
|||
|
|||
- Performance floor characterization |
|||
- Rollout-gate criteria |
|||
- Hours/days soak |
|||
- RF>2 topology |
|||
- NVMe runtime transport proof |
|||
- CSI snapshot/expand |
|||
@ -0,0 +1,70 @@ |
|||
# Phase 12 P3 — Bounded Runbook |
|||
|
|||
Scope: diagnosis of three symptom classes on the accepted RF=2 sync_all chosen path. |
|||
|
|||
All diagnosis steps reference ONLY explicit bounded read-only surfaces: |
|||
- `LookupBlockVolume` — gRPC RPC returning current primary VS + iSCSI address |
|||
- `FailoverDiagnostic` — volume-oriented failover state snapshot |
|||
- `PublicationDiagnostic` — lookup vs authority coherence snapshot |
|||
- `RecoveryDiagnostic` — active recovery task set snapshot |
|||
- Blocker ledger — finite file at `phase-12-p3-blockers.md` |
|||
|
|||
## S1: Failover/Recovery Convergence Stall |
|||
|
|||
**Visible symptom:** Volume remains unavailable after a VS death; lookup still returns the old primary. |
|||
|
|||
**Diagnosis surfaces:** |
|||
- `LookupBlockVolume(volumeName)` — check if `VolumeServer` is still the dead server |
|||
- `FailoverDiagnostic` — check `Volumes[]` for the affected volume |
|||
|
|||
**Diagnosis steps:** |
|||
1. Call `LookupBlockVolume(volumeName)`. If `VolumeServer` changed from the dead server, failover succeeded. |
|||
2. If unchanged: read `FailoverDiagnostic`. Find the volume by name in `Volumes[]`. |
|||
3. If found with `DeferredPromotion=true`: lease-wait — failover is deferred until lease expires. |
|||
4. If found with `PendingRebuild=true`: failover completed, rebuild is pending for the dead server. |
|||
5. If `DeferredPromotionCount[deadServer] > 0` in the aggregate: deferred promotions are queued. |
|||
6. If the volume does not appear in either lookup change or `FailoverDiagnostic`: escalate. |
|||
|
|||
**Conclusion classes (from surfaces only):** |
|||
- **Lease-wait:** `FailoverDiagnostic.DeferredPromotionCount[deadServer] > 0` — normal, bounded by lease TTL. |
|||
- **Rebuild-pending:** `FailoverDiagnostic.Volumes[].PendingRebuild=true` — failover done, rebuild queued. |
|||
- **Converged:** `LookupBlockVolume` shows new primary, no failover entries — resolved. |
|||
- **Unresolved:** None of the above — escalate. |
|||
|
|||
## S2: Publication/Lookup Mismatch |
|||
|
|||
**Visible symptom:** `LookupBlockVolume` returns an iSCSI address or volume server that doesn't match expected state. |
|||
|
|||
**Diagnosis surfaces:** |
|||
- `LookupBlockVolume(volumeName)` — operator-visible publication |
|||
- `PublicationDiagnostic` — explicit coherence check (lookup vs authority) |
|||
|
|||
**Diagnosis steps:** |
|||
1. Call `PublicationDiagnosticFor(volumeName)`. Check `Coherent` field. |
|||
2. If `Coherent=true`: lookup matches registry authority — no mismatch. |
|||
3. If `Coherent=false`: read `Reason` for explanation. Compare `LookupVolumeServer` vs `AuthorityVolumeServer` and `LookupIscsiAddr` vs `AuthorityIscsiAddr`. |
|||
4. Cross-check with `LookupBlockVolume` directly: repeated lookups should be self-consistent. |
|||
|
|||
**Conclusion classes (from surfaces only):** |
|||
- **Coherent:** `PublicationDiagnostic.Coherent=true` — no mismatch. |
|||
- **Stale client:** Coherent but client sees old value — bounded by client re-query. |
|||
- **Unresolved:** `PublicationDiagnostic.Coherent=false` with no transient cause — escalate. |
|||
|
|||
## S3: Leftover Runtime Work After Convergence |
|||
|
|||
**Visible symptom:** After volume deletion or steady-state convergence, recovery tasks should have drained. |
|||
|
|||
**Diagnosis surfaces:** |
|||
- `RecoveryDiagnostic` — `ActiveTasks` list (replicaIDs with active recovery work) |
|||
|
|||
**Diagnosis steps:** |
|||
1. Call `RecoveryManager.DiagnosticSnapshot()`. Read `ActiveTasks`. |
|||
2. If `ActiveTasks` is empty: clean — no leftover work. |
|||
3. If non-empty: check whether any task replicaID contains the deleted volume's path. |
|||
4. If a deleted volume's replicaID is present in `ActiveTasks`: residue — escalate. |
|||
5. If all tasks are for live volumes: non-empty but expected — normal in-flight work. |
|||
|
|||
**Conclusion classes (from surfaces only):** |
|||
- **Clean:** `RecoveryDiagnostic.ActiveTasks` is empty — runtime converged. |
|||
- **Non-empty, no residue:** Tasks present but none for the deleted/converged volume — normal. |
|||
- **Residue:** Deleted volume's replicaID still in `ActiveTasks` — escalate. |
|||
@ -0,0 +1,101 @@ |
|||
# Phase 12 P4 — Performance Floor Summary |
|||
|
|||
Date: 2026-04-02 |
|||
Scope: bounded performance floor for the accepted RF=2, sync_all chosen path. |
|||
|
|||
## Workload Envelope |
|||
|
|||
| Parameter | Value | |
|||
|-----------|-------| |
|||
| Topology | RF=2, sync_all | |
|||
| Operations | 4K random write, 4K random read, sequential write, sequential read | |
|||
| Runtime | Steady-state, no failover, no disturbance | |
|||
| Path | Accepted chosen path (same as P1/P2/P3) | |
|||
|
|||
## Environment |
|||
|
|||
### Unit Test Harness (engine-local) |
|||
|
|||
| Parameter | Value | |
|||
|-----------|-------| |
|||
| Name | `TestP12P4_PerformanceFloor_Bounded` | |
|||
| Location | `weed/server/qa_block_perf_test.go` | |
|||
| Platform | Single-process, local disk | |
|||
| Volume | 64MB, 4K blocks, 16MB WAL | |
|||
| Writer | Single-threaded (worst-case for group commit) | |
|||
| Replication | Not exercised (engine-local only) | |
|||
| Measurement | Worst of 3 iterations (floor, not peak) | |
|||
|
|||
### Production Baseline (cross-machine) |
|||
|
|||
| Parameter | Value | |
|||
|-----------|-------| |
|||
| Name | `baseline-roce-20260401` | |
|||
| Location | `learn/projects/sw-block/test/results/baseline-roce-20260401.md` | |
|||
| Hardware | m01 (10.0.0.1) - M02 (10.0.0.3), 25Gbps RoCE | |
|||
| Protocol | NVMe-TCP | |
|||
| Volume | 2GB, RF=2, sync_all, cross-machine replication | |
|||
| Writer | fio, QD1-128, j=4 | |
|||
|
|||
## Floor Table: Production (RF=2, sync_all, NVMe-TCP, 25Gbps RoCE) |
|||
|
|||
These are measured floor values from the production baseline, not the unit test. |
|||
|
|||
| Workload | Floor IOPS | Notes | |
|||
|----------|-----------|-------| |
|||
| 4K random write QD1 | 28,347 | Barrier round-trip limited (flat across QD) | |
|||
| 4K random write QD32 | 28,453 | Same barrier ceiling | |
|||
| 4K random read QD32 | 136,648 | No replication overhead | |
|||
| Mixed 70/30 QD32 | 28,423 | Write-side limited | |
|||
|
|||
Latency: Write latency is bounded by sync_all barrier round-trip (~35us at QD1). |
|||
Read latency: sub-microsecond for cached, single-digit microseconds for extent. |
|||
|
|||
## Floor Table: Engine-Local (unit test harness) |
|||
|
|||
These values are measured by `TestP12P4_PerformanceFloor_Bounded` on the dev machine. |
|||
They characterize the engine I/O floor WITHOUT transport or replication. |
|||
Actual values vary by hardware; the test produces them on each run. |
|||
|
|||
| Workload | Metric | Method | Gate | |
|||
|----------|--------|--------|------| |
|||
| 4K random write | Floor IOPS, Avg/P50/P99/Max latency | Worst of 3 iterations | >= 1,000 IOPS, P99 <= 100ms | |
|||
| 4K random read | Floor IOPS, Avg/P50/P99/Max latency | Worst of 3 iterations | >= 5,000 IOPS | |
|||
| 4K sequential write | Floor IOPS, Avg/P50/P99/Max latency | Worst of 3 iterations | >= 2,000 IOPS, P99 <= 100ms | |
|||
| 4K sequential read | Floor IOPS, Avg/P50/P99/Max latency | Worst of 3 iterations | >= 10,000 IOPS | |
|||
|
|||
Gate thresholds are regression gates enforced in code (`perfFloorGates` in `qa_block_perf_test.go`). |
|||
Set at ~10% of measured values to tolerate slow CI/VM hardware while catching catastrophic regressions. |
|||
|
|||
## Cost Summary |
|||
|
|||
| Cost | Value | Source | |
|||
|------|-------|--------| |
|||
| WAL write amplification | 2x minimum | Engine design: each write → WAL + eventual extent flush | |
|||
| Replication tax (RF=2 sync_all vs RF=1) | -56% | baseline-roce-20260401.md (NVMe-TCP, 25Gbps RoCE) | |
|||
| Replication tax (RF=2 sync_all vs RF=1, iSCSI 1Gbps) | -56% | baseline-roce-20260401.md | |
|||
| Degraded mode penalty (sync_all RF=2, one replica dead) | -66% | baseline-roce-20260401.md (barrier timeout) | |
|||
| Group commit | 1 fdatasync per batch | Amortizes sync cost across concurrent writers | |
|||
|
|||
## Acceptance Evidence |
|||
|
|||
| Item | Evidence | Type | |
|||
|------|----------|------| |
|||
| Floor gates pass | `perfFloorGates` thresholds enforced per workload | Acceptance | |
|||
| Workload runs repeatably | `TestP12P4_PerformanceFloor_Bounded` passes | Acceptance | |
|||
| Cost statement is bounded | `TestP12P4_CostCharacterization_Bounded` passes | Acceptance | |
|||
| Production baseline exists | `baseline-roce-20260401.md` with measured values | Acceptance | |
|||
| Floor is worst-of-N, not peak | Test takes minimum IOPS across 3 iterations | Method | |
|||
| Regression-safe | Test fails if floor drops below gate (blocks rollout) | Acceptance | |
|||
| Replication tax documented | -56% from measured production baseline | Support telemetry | |
|||
|
|||
## What P4 does NOT claim |
|||
|
|||
- This is not a claim that the measured floor is "good enough" for any specific application. |
|||
- This does not claim readiness for failover-under-load scenarios. |
|||
- This does not claim readiness for hours/days soak under load. |
|||
- This does not claim readiness for RF>2 topologies. |
|||
- This does not claim readiness for all transport combinations (iSCSI + NVMe + kernel versions). |
|||
- This does not claim readiness for production rollout beyond the explicitly named launch envelope. |
|||
- Engine-local floor numbers are not production floor numbers. |
|||
- The replication tax is measured on one specific hardware configuration and may differ on other hardware. |
|||
@ -0,0 +1,64 @@ |
|||
# Phase 12 P4 — Rollout Gates |
|||
|
|||
Date: 2026-04-02 |
|||
Scope: bounded first-launch envelope for the accepted RF=2, sync_all chosen path. |
|||
|
|||
This is a bounded first-launch envelope, not general readiness. |
|||
|
|||
## Supported Launch Envelope |
|||
|
|||
Only the transport/network combinations with measured baselines are included. |
|||
|
|||
| Parameter | Value | |
|||
|-----------|-------| |
|||
| Topology | RF=2, sync_all | |
|||
| Transport + Network | NVMe-TCP @ 25Gbps RoCE (measured), iSCSI @ 25Gbps RoCE (measured), iSCSI @ 1Gbps (measured) | |
|||
| NOT included | NVMe-TCP @ 1Gbps (not measured) | |
|||
| Volume size | Up to 2GB (tested baseline) | |
|||
| Failover | Lease-based, bounded by TTL (30s default) | |
|||
| Recovery | Catch-up-first, rebuild fallback | |
|||
| Degraded mode | Documented -66% write penalty (sync_all RF=2, one replica dead) | |
|||
|
|||
## Cleared Gates |
|||
|
|||
| Gate | Evidence | Status | Notes | |
|||
|------|----------|--------|-------| |
|||
| G1 | P1 disturbance tests pass | Cleared | Restart/reconnect correctness under disturbance | |
|||
| G2 | P2 soak tests pass | Cleared | Repeated create/failover/recover cycles, no drift | |
|||
| G3 | P3 diagnosability tests pass | Cleared | Explicit bounded diagnosis surfaces for all symptom classes | |
|||
| G4 | P4 floor gates pass | Cleared | Explicit IOPS thresholds + P99 ceilings enforced per workload in code | |
|||
| G5 | P4 cost characterization bounded | Cleared | WAL 2x write amp, -56% replication tax documented | |
|||
| G6 | Production baseline exists | Cleared | baseline-roce-20260401.md: 28.4K write IOPS, 136.6K read IOPS | |
|||
| G8 | Floor gates are regression-safe | Cleared | Test fails if any workload drops below defined minimum IOPS or exceeds P99 ceiling | |
|||
| G7 | Blocker ledger finite | Cleared | 3 diagnosed (B1-B3) + 3 unresolved (U1-U3), all explicit | |
|||
|
|||
## Remaining Blockers / Exclusions |
|||
|
|||
| Exclusion | Why | Impact | |
|||
|-----------|-----|--------| |
|||
| E1 | Failover-under-load perf not measured | Cannot claim bounded perf during failover | |
|||
| E2 | Hours/days soak not run | Cannot claim long-run stability under sustained load | |
|||
| E3 | RF>2 not measured | Cannot claim perf floor for RF=3+ | |
|||
| E4 | Broad transport matrix not tested | Cannot claim parity across all kernel/NVMe/iSCSI versions | |
|||
| E5 | Degraded mode is severe (-66%) | sync_all RF=2 has sharp write cliff on replica death | |
|||
| E6 | V2 stale-epoch at orchestrator level (U1 from P3) | V1 guards suffice; V2 is secondary path | |
|||
| E7 | gRPC stream transport not exercised in unit tests (U3 from P3) | Blocks full integration test, not correctness | |
|||
|
|||
## Reject Conditions |
|||
|
|||
This launch envelope should be REJECTED if: |
|||
|
|||
1. Any P1/P2/P3 test regresses (correctness/stability/diagnosability gate violated) |
|||
2. Production baseline numbers are not reproducible on the target hardware |
|||
3. Degraded mode behavior (-66% cliff) is not acceptable for the deployment scenario |
|||
4. The deployment requires RF>2, failover-under-load guarantees, or long soak proof |
|||
5. The deployment requires transport combinations not covered by the baseline |
|||
|
|||
## What P4 does NOT claim |
|||
|
|||
- This does not claim general production readiness. |
|||
- This does not claim readiness for any deployment outside the named launch envelope. |
|||
- This does not claim that the performance floor is optimal or final. |
|||
- This does not claim that the degraded-mode penalty is acceptable (deployment-specific decision). |
|||
- This does not claim hours/days stability under sustained load. |
|||
- This is a bounded first-launch gate, not a broad rollout approval. |
|||
@ -0,0 +1,374 @@ |
|||
package weed_server |
|||
|
|||
import ( |
|||
"context" |
|||
"sync" |
|||
|
|||
bridge "github.com/seaweedfs/seaweedfs/sw-block/bridge/blockvol" |
|||
engine "github.com/seaweedfs/seaweedfs/sw-block/engine/replication" |
|||
"github.com/seaweedfs/seaweedfs/weed/glog" |
|||
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol" |
|||
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol/v2bridge" |
|||
) |
|||
|
|||
// recoveryTask tracks a live recovery goroutine for one replica target.
|
|||
// The task pointer serves as identity token — only the goroutine that owns
|
|||
// THIS pointer may mark it as done.
|
|||
type recoveryTask struct { |
|||
replicaID string |
|||
cancel context.CancelFunc |
|||
done chan struct{} // closed when the goroutine exits
|
|||
} |
|||
|
|||
// RecoveryManager owns live recovery execution for all replica targets.
|
|||
//
|
|||
// Ownership model:
|
|||
// - At most one recovery goroutine per replicaID at any time.
|
|||
// - On supersede/replace: the old goroutine is cancelled AND drained
|
|||
// before the replacement starts. No overlap.
|
|||
// - Cancellation: context cancel + session invalidation (for removal/shutdown).
|
|||
// For supersede: context cancel only (engine already attached replacement session).
|
|||
type RecoveryManager struct { |
|||
bs *BlockService |
|||
|
|||
mu sync.Mutex |
|||
tasks map[string]*recoveryTask |
|||
wg sync.WaitGroup |
|||
|
|||
// TestHook: if set, called before execution starts. Tests use this
|
|||
// to hold the goroutine alive for serialized-replacement proofs.
|
|||
OnBeforeExecute func(replicaID string) |
|||
} |
|||
|
|||
func NewRecoveryManager(bs *BlockService) *RecoveryManager { |
|||
return &RecoveryManager{ |
|||
bs: bs, |
|||
tasks: make(map[string]*recoveryTask), |
|||
} |
|||
} |
|||
|
|||
// HandleAssignmentResult processes the engine's assignment result.
|
|||
//
|
|||
// Engine result semantics:
|
|||
// - SessionsCreated: new session, start goroutine
|
|||
// - SessionsSuperseded: old replaced by new — cancel+drain old, start new
|
|||
// - Removed: sender gone — cancel+drain, invalidate session
|
|||
func (rm *RecoveryManager) HandleAssignmentResult(result engine.AssignmentResult, assignments []blockvol.BlockVolumeAssignment) { |
|||
// Removed: cancel + invalidate + drain.
|
|||
for _, replicaID := range result.Removed { |
|||
rm.cancelAndDrain(replicaID, true) |
|||
} |
|||
|
|||
// Superseded: cancel + drain (no invalidate — engine has replacement session),
|
|||
// then start new.
|
|||
for _, replicaID := range result.SessionsSuperseded { |
|||
rm.cancelAndDrain(replicaID, false) |
|||
rm.startTask(replicaID, assignments) |
|||
} |
|||
|
|||
// Created: start new (cancel stale defensively).
|
|||
for _, replicaID := range result.SessionsCreated { |
|||
rm.cancelAndDrain(replicaID, false) |
|||
rm.startTask(replicaID, assignments) |
|||
} |
|||
} |
|||
|
|||
// cancelAndDrain cancels a running task and WAITS for it to exit.
|
|||
// This ensures no overlap between old and new owners.
|
|||
func (rm *RecoveryManager) cancelAndDrain(replicaID string, invalidateSession bool) { |
|||
rm.mu.Lock() |
|||
task, ok := rm.tasks[replicaID] |
|||
if !ok { |
|||
rm.mu.Unlock() |
|||
return |
|||
} |
|||
glog.V(1).Infof("recovery: cancelling+draining task for %s (invalidate=%v)", replicaID, invalidateSession) |
|||
task.cancel() |
|||
if invalidateSession && rm.bs.v2Orchestrator != nil { |
|||
if s := rm.bs.v2Orchestrator.Registry.Sender(replicaID); s != nil { |
|||
s.InvalidateSession("recovery_removed", engine.StateDisconnected) |
|||
} |
|||
} |
|||
delete(rm.tasks, replicaID) |
|||
doneCh := task.done |
|||
rm.mu.Unlock() |
|||
|
|||
// Wait for the old goroutine to exit OUTSIDE the lock.
|
|||
// This serializes replacement: new task cannot start until old is fully drained.
|
|||
<-doneCh |
|||
} |
|||
|
|||
// startTask creates and starts a new recovery goroutine. Caller must ensure
|
|||
// no existing task for this replicaID (call cancelAndDrain first).
|
|||
func (rm *RecoveryManager) startTask(replicaID string, assignments []blockvol.BlockVolumeAssignment) { |
|||
rm.mu.Lock() |
|||
defer rm.mu.Unlock() |
|||
|
|||
rebuildAddr := rm.deriveRebuildAddr(replicaID, assignments) |
|||
|
|||
ctx, cancel := context.WithCancel(context.Background()) |
|||
task := &recoveryTask{ |
|||
replicaID: replicaID, |
|||
cancel: cancel, |
|||
done: make(chan struct{}), |
|||
} |
|||
rm.tasks[replicaID] = task |
|||
|
|||
rm.wg.Add(1) |
|||
go rm.runRecovery(ctx, task, rebuildAddr) |
|||
} |
|||
|
|||
// Shutdown cancels all active recovery tasks and waits for drain.
|
|||
func (rm *RecoveryManager) Shutdown() { |
|||
rm.mu.Lock() |
|||
for _, task := range rm.tasks { |
|||
task.cancel() |
|||
if rm.bs.v2Orchestrator != nil { |
|||
if s := rm.bs.v2Orchestrator.Registry.Sender(task.replicaID); s != nil { |
|||
s.InvalidateSession("recovery_shutdown", engine.StateDisconnected) |
|||
} |
|||
} |
|||
} |
|||
rm.tasks = make(map[string]*recoveryTask) |
|||
rm.mu.Unlock() |
|||
rm.wg.Wait() |
|||
} |
|||
|
|||
// ActiveTaskCount returns the number of active recovery tasks (for testing).
|
|||
func (rm *RecoveryManager) ActiveTaskCount() int { |
|||
rm.mu.Lock() |
|||
defer rm.mu.Unlock() |
|||
return len(rm.tasks) |
|||
} |
|||
|
|||
// DiagnosticSnapshot returns a bounded read-only snapshot of active recovery
|
|||
// tasks for operator-visible diagnosis. Each entry shows the replicaID being
|
|||
// recovered. This is the P3 diagnosability surface — read-only, no semantics.
|
|||
type RecoveryDiagnostic struct { |
|||
ActiveTasks []string // replicaIDs with active recovery work
|
|||
} |
|||
|
|||
func (rm *RecoveryManager) DiagnosticSnapshot() RecoveryDiagnostic { |
|||
rm.mu.Lock() |
|||
defer rm.mu.Unlock() |
|||
diag := RecoveryDiagnostic{} |
|||
for id := range rm.tasks { |
|||
diag.ActiveTasks = append(diag.ActiveTasks, id) |
|||
} |
|||
return diag |
|||
} |
|||
|
|||
// runRecovery is the recovery goroutine for one replica target.
|
|||
func (rm *RecoveryManager) runRecovery(ctx context.Context, task *recoveryTask, rebuildAddr string) { |
|||
defer rm.wg.Done() |
|||
defer close(task.done) // signal drain completion
|
|||
defer func() { |
|||
rm.mu.Lock() |
|||
// Only delete if we're still the active task (pointer comparison).
|
|||
if rm.tasks[task.replicaID] == task { |
|||
delete(rm.tasks, task.replicaID) |
|||
} |
|||
rm.mu.Unlock() |
|||
}() |
|||
|
|||
replicaID := task.replicaID |
|||
|
|||
if ctx.Err() != nil { |
|||
return |
|||
} |
|||
|
|||
orch := rm.bs.v2Orchestrator |
|||
s := orch.Registry.Sender(replicaID) |
|||
if s == nil { |
|||
glog.V(1).Infof("recovery: sender %s not found, skipping", replicaID) |
|||
return |
|||
} |
|||
|
|||
sessSnap := s.SessionSnapshot() |
|||
if sessSnap == nil { |
|||
glog.V(1).Infof("recovery: sender %s has no active session, skipping", replicaID) |
|||
return |
|||
} |
|||
|
|||
glog.V(0).Infof("recovery: starting %s session for %s (rebuildAddr=%s)", |
|||
sessSnap.Kind, replicaID, rebuildAddr) |
|||
|
|||
if rm.OnBeforeExecute != nil { |
|||
rm.OnBeforeExecute(replicaID) |
|||
} |
|||
|
|||
switch sessSnap.Kind { |
|||
case engine.SessionCatchUp: |
|||
rm.runCatchUp(ctx, replicaID, rebuildAddr) |
|||
case engine.SessionRebuild: |
|||
rm.runRebuild(ctx, replicaID, rebuildAddr) |
|||
default: |
|||
glog.V(1).Infof("recovery: unknown session kind %s for %s", sessSnap.Kind, replicaID) |
|||
} |
|||
} |
|||
|
|||
func (rm *RecoveryManager) runCatchUp(ctx context.Context, replicaID, rebuildAddr string) { |
|||
bs := rm.bs |
|||
volPath := rm.volumePathForReplica(replicaID) |
|||
if volPath == "" { |
|||
glog.Warningf("recovery: cannot determine volume path for %s", replicaID) |
|||
return |
|||
} |
|||
|
|||
var sa engine.StorageAdapter |
|||
var replicaFlushedLSN uint64 |
|||
var executor *v2bridge.Executor |
|||
|
|||
if err := bs.blockStore.WithVolume(volPath, func(vol *blockvol.BlockVol) error { |
|||
reader := v2bridge.NewReader(vol) |
|||
pinner := v2bridge.NewPinner(vol) |
|||
sa = bridge.NewStorageAdapter( |
|||
&readerShimForRecovery{reader}, |
|||
&pinnerShimForRecovery{pinner}, |
|||
) |
|||
if s := bs.v2Orchestrator.Registry.Sender(replicaID); s != nil { |
|||
if snap := s.SessionSnapshot(); snap != nil { |
|||
replicaFlushedLSN = snap.StartLSN |
|||
} |
|||
} |
|||
executor = v2bridge.NewExecutor(vol, rebuildAddr) |
|||
return nil |
|||
}); err != nil { |
|||
glog.Warningf("recovery: cannot access volume %s: %v", volPath, err) |
|||
return |
|||
} |
|||
|
|||
if ctx.Err() != nil { |
|||
return |
|||
} |
|||
|
|||
driver := &engine.RecoveryDriver{Orchestrator: bs.v2Orchestrator, Storage: sa} |
|||
|
|||
plan, err := driver.PlanRecovery(replicaID, replicaFlushedLSN) |
|||
if err != nil { |
|||
glog.Warningf("recovery: plan failed for %s: %v", replicaID, err) |
|||
return |
|||
} |
|||
|
|||
if ctx.Err() != nil { |
|||
driver.CancelPlan(plan, "context_cancelled") |
|||
return |
|||
} |
|||
|
|||
exec := engine.NewCatchUpExecutor(driver, plan) |
|||
exec.IO = executor |
|||
|
|||
if execErr := exec.Execute(nil, 0); execErr != nil { |
|||
if ctx.Err() != nil { |
|||
glog.V(1).Infof("recovery: catch-up cancelled for %s: %v", replicaID, execErr) |
|||
} else { |
|||
glog.Warningf("recovery: catch-up execution failed for %s: %v", replicaID, execErr) |
|||
} |
|||
return |
|||
} |
|||
|
|||
glog.V(0).Infof("recovery: catch-up completed for %s", replicaID) |
|||
} |
|||
|
|||
func (rm *RecoveryManager) runRebuild(ctx context.Context, replicaID, rebuildAddr string) { |
|||
bs := rm.bs |
|||
volPath := rm.volumePathForReplica(replicaID) |
|||
if volPath == "" { |
|||
glog.Warningf("recovery: cannot determine volume path for %s", replicaID) |
|||
return |
|||
} |
|||
|
|||
var sa engine.StorageAdapter |
|||
var executor *v2bridge.Executor |
|||
|
|||
if err := bs.blockStore.WithVolume(volPath, func(vol *blockvol.BlockVol) error { |
|||
reader := v2bridge.NewReader(vol) |
|||
pinner := v2bridge.NewPinner(vol) |
|||
sa = bridge.NewStorageAdapter( |
|||
&readerShimForRecovery{reader}, |
|||
&pinnerShimForRecovery{pinner}, |
|||
) |
|||
executor = v2bridge.NewExecutor(vol, rebuildAddr) |
|||
return nil |
|||
}); err != nil { |
|||
glog.Warningf("recovery: cannot access volume %s: %v", volPath, err) |
|||
return |
|||
} |
|||
|
|||
if ctx.Err() != nil { |
|||
return |
|||
} |
|||
|
|||
driver := &engine.RecoveryDriver{Orchestrator: bs.v2Orchestrator, Storage: sa} |
|||
|
|||
plan, err := driver.PlanRebuild(replicaID) |
|||
if err != nil { |
|||
glog.Warningf("recovery: rebuild plan failed for %s: %v", replicaID, err) |
|||
return |
|||
} |
|||
|
|||
if ctx.Err() != nil { |
|||
driver.CancelPlan(plan, "context_cancelled") |
|||
return |
|||
} |
|||
|
|||
exec := engine.NewRebuildExecutor(driver, plan) |
|||
exec.IO = executor |
|||
|
|||
if execErr := exec.Execute(); execErr != nil { |
|||
if ctx.Err() != nil { |
|||
glog.V(1).Infof("recovery: rebuild cancelled for %s: %v", replicaID, execErr) |
|||
} else { |
|||
glog.Warningf("recovery: rebuild execution failed for %s: %v", replicaID, execErr) |
|||
} |
|||
return |
|||
} |
|||
|
|||
glog.V(0).Infof("recovery: rebuild completed for %s", replicaID) |
|||
} |
|||
|
|||
func (rm *RecoveryManager) deriveRebuildAddr(replicaID string, assignments []blockvol.BlockVolumeAssignment) string { |
|||
volPath := rm.volumePathForReplica(replicaID) |
|||
for _, a := range assignments { |
|||
if a.Path == volPath && a.RebuildAddr != "" { |
|||
return a.RebuildAddr |
|||
} |
|||
} |
|||
return "" |
|||
} |
|||
|
|||
func (rm *RecoveryManager) volumePathForReplica(replicaID string) string { |
|||
for i := len(replicaID) - 1; i >= 0; i-- { |
|||
if replicaID[i] == '/' { |
|||
return replicaID[:i] |
|||
} |
|||
} |
|||
return "" |
|||
} |
|||
|
|||
// --- Bridge shims ---
|
|||
|
|||
type readerShimForRecovery struct{ r *v2bridge.Reader } |
|||
|
|||
func (s *readerShimForRecovery) ReadState() bridge.BlockVolState { |
|||
rs := s.r.ReadState() |
|||
return bridge.BlockVolState{ |
|||
WALHeadLSN: rs.WALHeadLSN, |
|||
WALTailLSN: rs.WALTailLSN, |
|||
CommittedLSN: rs.CommittedLSN, |
|||
CheckpointLSN: rs.CheckpointLSN, |
|||
CheckpointTrusted: rs.CheckpointTrusted, |
|||
} |
|||
} |
|||
|
|||
type pinnerShimForRecovery struct{ p *v2bridge.Pinner } |
|||
|
|||
func (s *pinnerShimForRecovery) HoldWALRetention(startLSN uint64) (func(), error) { |
|||
return s.p.HoldWALRetention(startLSN) |
|||
} |
|||
func (s *pinnerShimForRecovery) HoldSnapshot(checkpointLSN uint64) (func(), error) { |
|||
return s.p.HoldSnapshot(checkpointLSN) |
|||
} |
|||
func (s *pinnerShimForRecovery) HoldFullBase(committedLSN uint64) (func(), error) { |
|||
return s.p.HoldFullBase(committedLSN) |
|||
} |
|||
@ -0,0 +1,270 @@ |
|||
package weed_server |
|||
|
|||
import ( |
|||
"context" |
|||
"os" |
|||
"strings" |
|||
"testing" |
|||
"time" |
|||
|
|||
"github.com/seaweedfs/seaweedfs/weed/pb/master_pb" |
|||
) |
|||
|
|||
// ============================================================
|
|||
// Phase 12 P3: Diagnosability / Blocker Accounting
|
|||
//
|
|||
// All diagnosis conclusions use ONLY explicit bounded read-only
|
|||
// diagnosis surfaces:
|
|||
// - LookupBlockVolume (product-visible publication)
|
|||
// - FailoverDiagnostic (volume-oriented failover state)
|
|||
// - PublicationDiagnostic (lookup vs authority coherence)
|
|||
// - RecoveryDiagnostic (active recovery task set)
|
|||
// - phase-12-p3-blockers.md (finite blocker ledger)
|
|||
//
|
|||
// NOT performance, NOT rollout readiness.
|
|||
// ============================================================
|
|||
|
|||
// --- S1: Failover convergence diagnosable via FailoverDiagnostic ---
|
|||
|
|||
func TestP12P3_FailoverConvergence_Diagnosable(t *testing.T) { |
|||
s := newSoakSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: "diag-vol-1", SizeBytes: 1 << 20, |
|||
}) |
|||
|
|||
entry, _ := s.ms.blockRegistry.Lookup("diag-vol-1") |
|||
s.bs.localServerID = entry.VolumeServer |
|||
s.deliver(entry.VolumeServer) |
|||
time.Sleep(100 * time.Millisecond) |
|||
|
|||
// Surface 1: LookupBlockVolume shows current primary before failover.
|
|||
lookupBefore, _ := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: "diag-vol-1"}) |
|||
if lookupBefore.VolumeServer != entry.VolumeServer { |
|||
t.Fatal("lookup should show original primary before failover") |
|||
} |
|||
|
|||
// Surface 2: FailoverDiagnostic — no volumes in failover state yet.
|
|||
failoverBefore := s.ms.blockFailover.DiagnosticSnapshot() |
|||
for _, v := range failoverBefore.Volumes { |
|||
if v.VolumeName == "diag-vol-1" { |
|||
t.Fatalf("S1: diag-vol-1 should not appear in failover diagnostic before failover, got %+v", v) |
|||
} |
|||
} |
|||
|
|||
// Trigger failover: expire lease, then failover.
|
|||
s.ms.blockRegistry.UpdateEntry("diag-vol-1", func(e *BlockVolumeEntry) { |
|||
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute) |
|||
}) |
|||
s.ms.failoverBlockVolumes(entry.VolumeServer) |
|||
|
|||
// Surface 1 after: LookupBlockVolume shows NEW primary.
|
|||
lookupAfter, _ := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: "diag-vol-1"}) |
|||
if lookupAfter.VolumeServer == entry.VolumeServer { |
|||
t.Fatal("S1 stall: lookup still shows old primary after failover") |
|||
} |
|||
|
|||
// Surface 2 after: FailoverDiagnostic shows volume-level failover state.
|
|||
failoverAfter := s.ms.blockFailover.DiagnosticSnapshot() |
|||
|
|||
// Classify via explicit diagnosis surface: find diag-vol-1 in failover volumes.
|
|||
var found *FailoverVolumeState |
|||
for i := range failoverAfter.Volumes { |
|||
if failoverAfter.Volumes[i].VolumeName == "diag-vol-1" { |
|||
found = &failoverAfter.Volumes[i] |
|||
break |
|||
} |
|||
} |
|||
if found == nil { |
|||
t.Fatal("S1: diag-vol-1 not found in FailoverDiagnostic after failover") |
|||
} |
|||
|
|||
// Diagnosis conclusion from explicit surfaces only:
|
|||
// - Lookup changed (old → new primary)
|
|||
// - FailoverDiagnostic classifies state as rebuild_pending
|
|||
// - AffectedServer identifies the dead server
|
|||
if !found.PendingRebuild { |
|||
t.Fatal("S1: FailoverDiagnostic should show PendingRebuild=true") |
|||
} |
|||
if found.Reason != "rebuild_pending" { |
|||
t.Fatalf("S1: expected reason=rebuild_pending, got %q", found.Reason) |
|||
} |
|||
if found.AffectedServer != entry.VolumeServer { |
|||
t.Fatalf("S1: AffectedServer should be dead server %s, got %s", entry.VolumeServer, found.AffectedServer) |
|||
} |
|||
if found.CurrentPrimary != lookupAfter.VolumeServer { |
|||
t.Fatalf("S1: CurrentPrimary should match lookup %s, got %s", lookupAfter.VolumeServer, found.CurrentPrimary) |
|||
} |
|||
|
|||
t.Logf("P12P3 S1: diagnosed via LookupBlockVolume(%s→%s) + FailoverDiagnostic(vol=%s, reason=%s, affected=%s)", |
|||
lookupBefore.VolumeServer, lookupAfter.VolumeServer, found.VolumeName, found.Reason, found.AffectedServer) |
|||
} |
|||
|
|||
// --- S2: Publication mismatch diagnosable via PublicationDiagnostic ---
|
|||
|
|||
func TestP12P3_PublicationMismatch_Diagnosable(t *testing.T) { |
|||
s := newSoakSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: "diag-vol-2", SizeBytes: 1 << 20, |
|||
}) |
|||
|
|||
entry, _ := s.ms.blockRegistry.Lookup("diag-vol-2") |
|||
s.bs.localServerID = entry.VolumeServer |
|||
s.deliver(entry.VolumeServer) |
|||
time.Sleep(100 * time.Millisecond) |
|||
|
|||
// Surface 1: PublicationDiagnostic before failover — should be coherent.
|
|||
pubBefore, ok := s.ms.PublicationDiagnosticFor("diag-vol-2") |
|||
if !ok { |
|||
t.Fatal("S2: PublicationDiagnosticFor should find diag-vol-2") |
|||
} |
|||
if !pubBefore.Coherent { |
|||
t.Fatalf("S2: publication should be coherent before failover, got reason=%q", pubBefore.Reason) |
|||
} |
|||
|
|||
// Surface 2: LookupBlockVolume — repeated lookups self-consistent.
|
|||
lookup1, _ := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: "diag-vol-2"}) |
|||
lookup2, _ := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: "diag-vol-2"}) |
|||
if lookup1.IscsiAddr != lookup2.IscsiAddr || lookup1.VolumeServer != lookup2.VolumeServer { |
|||
t.Fatalf("S2: repeated lookup mismatch: %s/%s vs %s/%s", |
|||
lookup1.VolumeServer, lookup1.IscsiAddr, lookup2.VolumeServer, lookup2.IscsiAddr) |
|||
} |
|||
|
|||
// Trigger failover.
|
|||
s.ms.blockRegistry.UpdateEntry("diag-vol-2", func(e *BlockVolumeEntry) { |
|||
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute) |
|||
}) |
|||
s.ms.failoverBlockVolumes(entry.VolumeServer) |
|||
|
|||
// Surface 1 after: PublicationDiagnostic after failover — still coherent.
|
|||
pubAfter, ok := s.ms.PublicationDiagnosticFor("diag-vol-2") |
|||
if !ok { |
|||
t.Fatal("S2: PublicationDiagnosticFor should find diag-vol-2 after failover") |
|||
} |
|||
if !pubAfter.Coherent { |
|||
t.Fatalf("S2: publication should be coherent after failover, got reason=%q", pubAfter.Reason) |
|||
} |
|||
|
|||
// Diagnosis conclusion from explicit surfaces only:
|
|||
// - Pre-failover: coherent, lookup matches authority
|
|||
// - Post-failover: coherent, lookup updated to new primary
|
|||
// - Publication switched: post != pre
|
|||
if pubAfter.LookupVolumeServer == pubBefore.LookupVolumeServer { |
|||
t.Fatal("S2: LookupVolumeServer unchanged after failover — publication did not switch") |
|||
} |
|||
if pubAfter.LookupIscsiAddr == pubBefore.LookupIscsiAddr { |
|||
t.Fatal("S2: LookupIscsiAddr unchanged after failover") |
|||
} |
|||
|
|||
// Post-failover repeated lookup still self-consistent (via diagnostic).
|
|||
pubAfter2, _ := s.ms.PublicationDiagnosticFor("diag-vol-2") |
|||
if pubAfter2.LookupVolumeServer != pubAfter.LookupVolumeServer || |
|||
pubAfter2.LookupIscsiAddr != pubAfter.LookupIscsiAddr { |
|||
t.Fatal("S2: post-failover publication diagnostics inconsistent") |
|||
} |
|||
|
|||
t.Logf("P12P3 S2: diagnosed via PublicationDiagnostic — pre(vs=%s, iscsi=%s, coherent=%v) → post(vs=%s, iscsi=%s, coherent=%v)", |
|||
pubBefore.LookupVolumeServer, pubBefore.LookupIscsiAddr, pubBefore.Coherent, |
|||
pubAfter.LookupVolumeServer, pubAfter.LookupIscsiAddr, pubAfter.Coherent) |
|||
} |
|||
|
|||
// --- S3: Runtime residue diagnosable via RecoveryDiagnostic ---
|
|||
|
|||
func TestP12P3_RuntimeResidue_Diagnosable(t *testing.T) { |
|||
s := newSoakSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: "diag-vol-3", SizeBytes: 1 << 20, |
|||
}) |
|||
|
|||
entry, _ := s.ms.blockRegistry.Lookup("diag-vol-3") |
|||
s.bs.localServerID = entry.VolumeServer |
|||
s.deliver(entry.VolumeServer) |
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
// Surface: RecoveryDiagnostic shows active task set.
|
|||
diagBefore := s.bs.v2Recovery.DiagnosticSnapshot() |
|||
t.Logf("S3 before delete: %d active tasks: %v", len(diagBefore.ActiveTasks), diagBefore.ActiveTasks) |
|||
|
|||
// Delete the volume.
|
|||
s.ms.DeleteBlockVolume(ctx, &master_pb.DeleteBlockVolumeRequest{Name: "diag-vol-3"}) |
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
// Surface after: RecoveryDiagnostic — no tasks for deleted volume.
|
|||
diagAfter := s.bs.v2Recovery.DiagnosticSnapshot() |
|||
for _, task := range diagAfter.ActiveTasks { |
|||
if strings.Contains(task, "diag-vol-3") { |
|||
t.Fatalf("S3 residue: task %s active after delete", task) |
|||
} |
|||
} |
|||
|
|||
// Diagnosis conclusion from explicit surface only:
|
|||
// - RecoveryDiagnostic.ActiveTasks does not contain deleted volume's tasks
|
|||
// - Conclusion: clean (no residue) or non-empty but unrelated to deleted volume
|
|||
if len(diagAfter.ActiveTasks) == 0 { |
|||
t.Log("P12P3 S3: diagnosed via RecoveryDiagnostic — clean (0 active tasks after delete)") |
|||
} else { |
|||
t.Logf("P12P3 S3: diagnosed via RecoveryDiagnostic — %d active tasks (none for deleted vol)", |
|||
len(diagAfter.ActiveTasks)) |
|||
} |
|||
} |
|||
|
|||
// --- Blocker ledger: reads and validates the actual file ---
|
|||
|
|||
func TestP12P3_BlockerLedger_Bounded(t *testing.T) { |
|||
ledgerPath := "../../sw-block/.private/phase/phase-12-p3-blockers.md" |
|||
|
|||
data, err := os.ReadFile(ledgerPath) |
|||
if err != nil { |
|||
t.Fatalf("blocker ledger must exist at %s: %v", ledgerPath, err) |
|||
} |
|||
|
|||
content := string(data) |
|||
|
|||
// Must contain diagnosed items.
|
|||
for _, id := range []string{"B1", "B2", "B3"} { |
|||
if !strings.Contains(content, id) { |
|||
t.Fatalf("ledger missing diagnosed item %s", id) |
|||
} |
|||
} |
|||
|
|||
// Must contain unresolved items.
|
|||
for _, id := range []string{"U1", "U2", "U3"} { |
|||
if !strings.Contains(content, id) { |
|||
t.Fatalf("ledger missing unresolved item %s", id) |
|||
} |
|||
} |
|||
|
|||
// Must contain out-of-scope section.
|
|||
if !strings.Contains(content, "Out of Scope") { |
|||
t.Fatal("ledger must have 'Out of Scope' section") |
|||
} |
|||
|
|||
// Must NOT overclaim perf or rollout.
|
|||
lines := strings.Split(content, "\n") |
|||
diagnosedCount := 0 |
|||
unresolvedCount := 0 |
|||
for _, line := range lines { |
|||
if strings.HasPrefix(strings.TrimSpace(line), "| B") { |
|||
diagnosedCount++ |
|||
} |
|||
if strings.HasPrefix(strings.TrimSpace(line), "| U") { |
|||
unresolvedCount++ |
|||
} |
|||
} |
|||
|
|||
total := diagnosedCount + unresolvedCount |
|||
if total == 0 { |
|||
t.Fatal("ledger has no blocker items") |
|||
} |
|||
if total > 20 { |
|||
t.Fatalf("ledger should be finite, got %d items", total) |
|||
} |
|||
|
|||
t.Logf("P12P3 blockers: %d diagnosed + %d unresolved = %d total (from actual file, finite)", |
|||
diagnosedCount, unresolvedCount, total) |
|||
} |
|||
@ -0,0 +1,582 @@ |
|||
package weed_server |
|||
|
|||
import ( |
|||
"crypto/rand" |
|||
"fmt" |
|||
"math" |
|||
mrand "math/rand" |
|||
"os" |
|||
"path/filepath" |
|||
"sort" |
|||
"strings" |
|||
"testing" |
|||
"time" |
|||
|
|||
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol" |
|||
) |
|||
|
|||
// ============================================================
|
|||
// Phase 12 P4: Performance Floor — Bounded Measurement Package
|
|||
//
|
|||
// Workload envelope:
|
|||
// Topology: RF=2 sync_all accepted chosen path
|
|||
// Operations: 4K random write, 4K random read, 4K sequential write, 4K sequential read
|
|||
// Runtime: no failover, no disturbance, steady-state
|
|||
// Environment: unit test harness (single-process, local disk, engine-local I/O)
|
|||
//
|
|||
// What this measures:
|
|||
// Engine I/O floor for the accepted chosen path. WriteLBA/ReadLBA through
|
|||
// the full fencing path (epoch, role, lease, writeGate, WAL, dirtyMap).
|
|||
// No transport layer (iSCSI/NVMe). No cross-machine replication.
|
|||
//
|
|||
// What this does NOT measure:
|
|||
// Transport throughput, cross-machine replication tax, multi-client concurrency,
|
|||
// failover-under-load, degraded mode. Production floor with replication is
|
|||
// documented in baseline-roce-20260401.md.
|
|||
//
|
|||
// NOT performance tuning. NOT broad benchmark.
|
|||
// ============================================================
|
|||
|
|||
const ( |
|||
perfBlockSize = 4096 |
|||
perfVolumeSize = 64 * 1024 * 1024 // 64MB
|
|||
perfWALSize = 16 * 1024 * 1024 // 16MB
|
|||
perfOps = 1000 // ops per measurement run
|
|||
perfWarmupOps = 200 // warmup ops (discarded from measurement)
|
|||
perfIterations = 3 // run N times, report worst as floor
|
|||
) |
|||
|
|||
// Minimum acceptable floor thresholds (engine-local, single-writer).
|
|||
//
|
|||
// These are regression gates, not performance targets. Set conservatively
|
|||
// so any reasonable hardware passes, but catastrophic regressions
|
|||
// (accidental serialization, O(n^2) scan, broken WAL path) are caught.
|
|||
//
|
|||
// Rationale for values:
|
|||
// Measured on dev SSD: rand-write ~10K, rand-read ~80K, seq-write ~30K, seq-read ~180K.
|
|||
// Thresholds set at ~10% of measured to tolerate slow CI machines and VMs.
|
|||
// Write P99 ceiling at 100ms catches deadlocks/stalls without false-positiving
|
|||
// on slow storage.
|
|||
var perfFloorGates = map[string]struct { |
|||
MinIOPS float64 |
|||
MaxWriteP99 time.Duration // 0 = no ceiling (reads)
|
|||
}{ |
|||
"rand-write": {MinIOPS: 1000, MaxWriteP99: 100 * time.Millisecond}, |
|||
"rand-read": {MinIOPS: 5000}, |
|||
"seq-write": {MinIOPS: 2000, MaxWriteP99: 100 * time.Millisecond}, |
|||
"seq-read": {MinIOPS: 10000}, |
|||
} |
|||
|
|||
// perfResult holds measurements for one workload run.
|
|||
type perfResult struct { |
|||
Workload string |
|||
Ops int |
|||
Elapsed time.Duration |
|||
IOPS float64 |
|||
MBps float64 |
|||
LatSamples []int64 // per-op latency in nanoseconds
|
|||
} |
|||
|
|||
func (r *perfResult) latPct(pct float64) time.Duration { |
|||
if len(r.LatSamples) == 0 { |
|||
return 0 |
|||
} |
|||
sorted := make([]int64, len(r.LatSamples)) |
|||
copy(sorted, r.LatSamples) |
|||
sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] }) |
|||
idx := int(math.Ceil(pct/100.0*float64(len(sorted)))) - 1 |
|||
if idx < 0 { |
|||
idx = 0 |
|||
} |
|||
if idx >= len(sorted) { |
|||
idx = len(sorted) - 1 |
|||
} |
|||
return time.Duration(sorted[idx]) |
|||
} |
|||
|
|||
func (r *perfResult) latAvg() time.Duration { |
|||
if len(r.LatSamples) == 0 { |
|||
return 0 |
|||
} |
|||
var sum int64 |
|||
for _, s := range r.LatSamples { |
|||
sum += s |
|||
} |
|||
return time.Duration(sum / int64(len(r.LatSamples))) |
|||
} |
|||
|
|||
// setupPerfVolume creates a BlockVol configured as Primary for perf measurement.
|
|||
func setupPerfVolume(t *testing.T) *blockvol.BlockVol { |
|||
t.Helper() |
|||
dir := t.TempDir() |
|||
volPath := filepath.Join(dir, "perf.blk") |
|||
vol, err := blockvol.CreateBlockVol(volPath, blockvol.CreateOptions{ |
|||
VolumeSize: perfVolumeSize, |
|||
BlockSize: perfBlockSize, |
|||
WALSize: perfWALSize, |
|||
}) |
|||
if err != nil { |
|||
t.Fatal(err) |
|||
} |
|||
// Set up as Primary with long lease so writes are allowed.
|
|||
if err := vol.HandleAssignment(1, blockvol.RolePrimary, 10*time.Minute); err != nil { |
|||
vol.Close() |
|||
t.Fatal(err) |
|||
} |
|||
t.Cleanup(func() { vol.Close() }) |
|||
return vol |
|||
} |
|||
|
|||
// maxLBAs returns the number of addressable 4K blocks in the extent area.
|
|||
func maxLBAs() uint64 { |
|||
// Volume size minus WAL, divided by block size, with safety margin.
|
|||
return (perfVolumeSize - perfWALSize) / perfBlockSize / 2 |
|||
} |
|||
|
|||
// runPerfWorkload executes one workload measurement and returns the result.
|
|||
func runPerfWorkload(t *testing.T, vol *blockvol.BlockVol, workload string, ops int) perfResult { |
|||
t.Helper() |
|||
data := make([]byte, perfBlockSize) |
|||
rand.Read(data) |
|||
max := maxLBAs() |
|||
|
|||
samples := make([]int64, 0, ops) |
|||
start := time.Now() |
|||
|
|||
for i := 0; i < ops; i++ { |
|||
var lba uint64 |
|||
switch { |
|||
case strings.HasPrefix(workload, "rand"): |
|||
lba = uint64(mrand.Int63n(int64(max))) |
|||
default: // sequential
|
|||
lba = uint64(i) % max |
|||
} |
|||
|
|||
opStart := time.Now() |
|||
switch { |
|||
case strings.HasSuffix(workload, "write"): |
|||
if err := vol.WriteLBA(lba, data); err != nil { |
|||
t.Fatalf("%s op %d: WriteLBA(%d): %v", workload, i, lba, err) |
|||
} |
|||
case strings.HasSuffix(workload, "read"): |
|||
if _, err := vol.ReadLBA(lba, perfBlockSize); err != nil { |
|||
t.Fatalf("%s op %d: ReadLBA(%d): %v", workload, i, lba, err) |
|||
} |
|||
} |
|||
samples = append(samples, time.Since(opStart).Nanoseconds()) |
|||
} |
|||
|
|||
elapsed := time.Since(start) |
|||
iops := float64(ops) / elapsed.Seconds() |
|||
mbps := iops * float64(perfBlockSize) / (1024 * 1024) |
|||
|
|||
return perfResult{ |
|||
Workload: workload, |
|||
Ops: ops, |
|||
Elapsed: elapsed, |
|||
IOPS: iops, |
|||
MBps: mbps, |
|||
LatSamples: samples, |
|||
} |
|||
} |
|||
|
|||
// floorOf returns the worst (lowest) IOPS and worst (highest) P99 across iterations.
|
|||
type perfFloor struct { |
|||
Workload string |
|||
FloorIOPS float64 |
|||
FloorMBps float64 |
|||
WorstAvg time.Duration |
|||
WorstP50 time.Duration |
|||
WorstP99 time.Duration |
|||
WorstMax time.Duration |
|||
} |
|||
|
|||
func computeFloor(results []perfResult) perfFloor { |
|||
f := perfFloor{ |
|||
Workload: results[0].Workload, |
|||
FloorIOPS: math.MaxFloat64, |
|||
FloorMBps: math.MaxFloat64, |
|||
} |
|||
for _, r := range results { |
|||
if r.IOPS < f.FloorIOPS { |
|||
f.FloorIOPS = r.IOPS |
|||
} |
|||
if r.MBps < f.FloorMBps { |
|||
f.FloorMBps = r.MBps |
|||
} |
|||
avg := r.latAvg() |
|||
if avg > f.WorstAvg { |
|||
f.WorstAvg = avg |
|||
} |
|||
p50 := r.latPct(50) |
|||
if p50 > f.WorstP50 { |
|||
f.WorstP50 = p50 |
|||
} |
|||
p99 := r.latPct(99) |
|||
if p99 > f.WorstP99 { |
|||
f.WorstP99 = p99 |
|||
} |
|||
pmax := r.latPct(100) |
|||
if pmax > f.WorstMax { |
|||
f.WorstMax = pmax |
|||
} |
|||
} |
|||
return f |
|||
} |
|||
|
|||
// --- Test 1: PerformanceFloor_Bounded ---
|
|||
|
|||
func TestP12P4_PerformanceFloor_Bounded(t *testing.T) { |
|||
vol := setupPerfVolume(t) |
|||
|
|||
workloads := []string{"rand-write", "rand-read", "seq-write", "seq-read"} |
|||
floors := make([]perfFloor, 0, len(workloads)) |
|||
|
|||
for _, wl := range workloads { |
|||
// Warmup: populate volume with data (needed for reads).
|
|||
if strings.HasSuffix(wl, "read") { |
|||
warmupData := make([]byte, perfBlockSize) |
|||
rand.Read(warmupData) |
|||
for i := 0; i < int(maxLBAs()); i++ { |
|||
if err := vol.WriteLBA(uint64(i), warmupData); err != nil { |
|||
break // WAL full is acceptable during warmup
|
|||
} |
|||
} |
|||
time.Sleep(200 * time.Millisecond) // let flusher drain
|
|||
} |
|||
|
|||
// Warmup ops (discarded).
|
|||
runPerfWorkload(t, vol, wl, perfWarmupOps) |
|||
|
|||
// Measurement: N iterations, take floor.
|
|||
var results []perfResult |
|||
for iter := 0; iter < perfIterations; iter++ { |
|||
r := runPerfWorkload(t, vol, wl, perfOps) |
|||
results = append(results, r) |
|||
} |
|||
|
|||
floor := computeFloor(results) |
|||
floors = append(floors, floor) |
|||
} |
|||
|
|||
// Report structured floor table.
|
|||
t.Log("") |
|||
t.Log("=== P12P4 Performance Floor (engine-local, single-writer) ===") |
|||
t.Log("") |
|||
t.Logf("%-12s %10s %8s %10s %10s %10s %10s", |
|||
"Workload", "Floor IOPS", "MB/s", "Avg Lat", "P50 Lat", "P99 Lat", "Max Lat") |
|||
t.Logf("%-12s %10s %8s %10s %10s %10s %10s", |
|||
"--------", "----------", "------", "-------", "-------", "-------", "-------") |
|||
for _, f := range floors { |
|||
t.Logf("%-12s %10.0f %8.2f %10s %10s %10s %10s", |
|||
f.Workload, f.FloorIOPS, f.FloorMBps, f.WorstAvg, f.WorstP50, f.WorstP99, f.WorstMax) |
|||
} |
|||
t.Log("") |
|||
t.Logf("Config: volume=%dMB WAL=%dMB block=%dB ops=%d warmup=%d iterations=%d", |
|||
perfVolumeSize/(1024*1024), perfWALSize/(1024*1024), perfBlockSize, perfOps, perfWarmupOps, perfIterations) |
|||
t.Log("Method: worst of N iterations (floor, not peak)") |
|||
t.Log("Scope: engine-local only; production RF=2 floor in baseline-roce-20260401.md") |
|||
|
|||
// Gate: floor values must meet minimum acceptable thresholds.
|
|||
// These are regression gates — if any floor drops below the gate,
|
|||
// the test fails, blocking rollout.
|
|||
t.Log("") |
|||
t.Log("=== Floor Gate Validation ===") |
|||
allGatesPassed := true |
|||
for _, f := range floors { |
|||
gate, ok := perfFloorGates[f.Workload] |
|||
if !ok { |
|||
t.Fatalf("no floor gate defined for workload %s", f.Workload) |
|||
} |
|||
passed := true |
|||
if f.FloorIOPS < gate.MinIOPS { |
|||
t.Errorf("GATE FAIL: %s floor IOPS %.0f < minimum %.0f", f.Workload, f.FloorIOPS, gate.MinIOPS) |
|||
passed = false |
|||
} |
|||
if gate.MaxWriteP99 > 0 && f.WorstP99 > gate.MaxWriteP99 { |
|||
t.Errorf("GATE FAIL: %s worst P99 %s > ceiling %s", f.Workload, f.WorstP99, gate.MaxWriteP99) |
|||
passed = false |
|||
} |
|||
status := "PASS" |
|||
if !passed { |
|||
status = "FAIL" |
|||
allGatesPassed = false |
|||
} |
|||
t.Logf(" %-12s min=%6.0f IOPS → floor=%6.0f [%s]", f.Workload, gate.MinIOPS, f.FloorIOPS, status) |
|||
} |
|||
|
|||
if !allGatesPassed { |
|||
t.Fatal("P12P4 PerformanceFloor: FAIL — one or more floor gates not met") |
|||
} |
|||
t.Log("P12P4 PerformanceFloor: PASS — all floor gates met") |
|||
} |
|||
|
|||
// --- Test 2: CostCharacterization_Bounded ---
|
|||
|
|||
func TestP12P4_CostCharacterization_Bounded(t *testing.T) { |
|||
vol := setupPerfVolume(t) |
|||
|
|||
// Measure write latency breakdown: WriteLBA includes WAL append + group commit.
|
|||
data := make([]byte, perfBlockSize) |
|||
rand.Read(data) |
|||
max := maxLBAs() |
|||
|
|||
const costOps = 500 |
|||
var writeLatSum int64 |
|||
for i := 0; i < costOps; i++ { |
|||
lba := uint64(mrand.Int63n(int64(max))) |
|||
start := time.Now() |
|||
if err := vol.WriteLBA(lba, data); err != nil { |
|||
t.Fatalf("write op %d: %v", i, err) |
|||
} |
|||
writeLatSum += time.Since(start).Nanoseconds() |
|||
} |
|||
avgWriteLat := time.Duration(writeLatSum / costOps) |
|||
|
|||
// Measure read latency for comparison.
|
|||
// Populate first.
|
|||
for i := 0; i < int(max/2); i++ { |
|||
vol.WriteLBA(uint64(i), data) |
|||
} |
|||
time.Sleep(200 * time.Millisecond) // let flusher drain
|
|||
|
|||
var readLatSum int64 |
|||
for i := 0; i < costOps; i++ { |
|||
lba := uint64(mrand.Int63n(int64(max / 2))) |
|||
start := time.Now() |
|||
if _, err := vol.ReadLBA(lba, perfBlockSize); err != nil { |
|||
t.Fatalf("read op %d: %v", i, err) |
|||
} |
|||
readLatSum += time.Since(start).Nanoseconds() |
|||
} |
|||
avgReadLat := time.Duration(readLatSum / costOps) |
|||
|
|||
// Cost statement.
|
|||
t.Log("") |
|||
t.Log("=== P12P4 Cost Characterization (engine-local) ===") |
|||
t.Log("") |
|||
t.Logf("Average write latency: %s (includes WAL append + group commit sync)", avgWriteLat) |
|||
t.Logf("Average read latency: %s (dirtyMap lookup + WAL/extent read)", avgReadLat) |
|||
t.Log("") |
|||
t.Log("Bounded cost statement:") |
|||
t.Log(" WAL write amplification: 2x minimum (WAL write + eventual extent flush)") |
|||
t.Log(" Group commit: amortizes fdatasync across batched writers (1 sync per batch)") |
|||
t.Log(" Replication tax (production RF=2 sync_all): -56% vs RF=1 (barrier round-trip)") |
|||
t.Log(" Replication tax source: baseline-roce-20260401.md, measured on 25Gbps RoCE") |
|||
t.Log("") |
|||
t.Logf("Write/read ratio: %.1fx (write is %.1fx slower than read)", |
|||
float64(avgWriteLat)/float64(avgReadLat), |
|||
float64(avgWriteLat)/float64(avgReadLat)) |
|||
t.Log("") |
|||
t.Logf("Config: volume=%dMB WAL=%dMB block=%dB ops=%d", |
|||
perfVolumeSize/(1024*1024), perfWALSize/(1024*1024), perfBlockSize, costOps) |
|||
|
|||
// Proof: cost values are finite and positive.
|
|||
if avgWriteLat <= 0 || avgReadLat <= 0 { |
|||
t.Fatal("latency values must be positive") |
|||
} |
|||
if avgWriteLat < avgReadLat { |
|||
t.Log("Note: write faster than read in this run (possible due to WAL cache hits)") |
|||
} |
|||
|
|||
t.Log("P12P4 CostCharacterization: PASS — bounded cost statement produced") |
|||
} |
|||
|
|||
// --- Test 3: RolloutGate_Bounded ---
|
|||
|
|||
func TestP12P4_RolloutGate_Bounded(t *testing.T) { |
|||
floorPath := "../../sw-block/.private/phase/phase-12-p4-floor.md" |
|||
gatesPath := "../../sw-block/.private/phase/phase-12-p4-rollout-gates.md" |
|||
baselinePath := "../../learn/projects/sw-block/test/results/baseline-roce-20260401.md" |
|||
blockerPath := "../../sw-block/.private/phase/phase-12-p3-blockers.md" |
|||
|
|||
// --- Read all cited evidence sources ---
|
|||
|
|||
floorData, err := os.ReadFile(floorPath) |
|||
if err != nil { |
|||
t.Fatalf("floor doc must exist at %s: %v", floorPath, err) |
|||
} |
|||
floorContent := string(floorData) |
|||
|
|||
gatesData, err := os.ReadFile(gatesPath) |
|||
if err != nil { |
|||
t.Fatalf("rollout-gates doc must exist at %s: %v", gatesPath, err) |
|||
} |
|||
gatesContent := string(gatesData) |
|||
|
|||
baselineData, err := os.ReadFile(baselinePath) |
|||
if err != nil { |
|||
t.Fatalf("cited baseline must exist at %s: %v", baselinePath, err) |
|||
} |
|||
baselineContent := string(baselineData) |
|||
|
|||
blockerData, err := os.ReadFile(blockerPath) |
|||
if err != nil { |
|||
t.Fatalf("cited blocker ledger must exist at %s: %v", blockerPath, err) |
|||
} |
|||
blockerContent := string(blockerData) |
|||
|
|||
// --- Structural validation (shape) ---
|
|||
|
|||
// Floor doc: workload envelope, floor table, non-claims.
|
|||
for _, required := range []string{"RF=2", "sync_all", "4K random write", "4K random read", "sequential write", "sequential read"} { |
|||
if !strings.Contains(floorContent, required) { |
|||
t.Fatalf("floor doc missing required content: %q", required) |
|||
} |
|||
} |
|||
if !strings.Contains(floorContent, "Floor") || !strings.Contains(floorContent, "IOPS") { |
|||
t.Fatal("floor doc must contain floor table with IOPS") |
|||
} |
|||
if !strings.Contains(floorContent, "does NOT") { |
|||
t.Fatal("floor doc must contain explicit non-claims") |
|||
} |
|||
|
|||
// Gates doc: gates table, launch envelope, exclusions, non-claims.
|
|||
if !strings.Contains(gatesContent, "Gate") || !strings.Contains(gatesContent, "Status") { |
|||
t.Fatal("rollout-gates doc must contain gates table") |
|||
} |
|||
if !strings.Contains(gatesContent, "Launch Envelope") { |
|||
t.Fatal("rollout-gates doc must contain launch envelope") |
|||
} |
|||
if !strings.Contains(gatesContent, "Exclusion") { |
|||
t.Fatal("rollout-gates doc must contain exclusions") |
|||
} |
|||
if !strings.Contains(gatesContent, "does NOT") { |
|||
t.Fatal("rollout-gates doc must contain explicit non-claims") |
|||
} |
|||
|
|||
// Count gates — must be finite.
|
|||
gateLines := 0 |
|||
for _, line := range strings.Split(gatesContent, "\n") { |
|||
trimmed := strings.TrimSpace(line) |
|||
if strings.HasPrefix(trimmed, "| G") || strings.HasPrefix(trimmed, "| E") { |
|||
gateLines++ |
|||
} |
|||
} |
|||
if gateLines == 0 { |
|||
t.Fatal("rollout-gates doc has no gate items") |
|||
} |
|||
if gateLines > 20 { |
|||
t.Fatalf("rollout-gates doc should be finite, got %d items", gateLines) |
|||
} |
|||
|
|||
// --- Semantic cross-checks (evidence alignment) ---
|
|||
|
|||
// 1. G6 cites "28.4K write IOPS" — baseline must contain this number.
|
|||
if strings.Contains(gatesContent, "28.4K write IOPS") || strings.Contains(gatesContent, "28,4") { |
|||
// The gates doc cites write IOPS from baseline. Verify the baseline has it.
|
|||
if !strings.Contains(baselineContent, "28,") { |
|||
t.Fatal("G6 cites write IOPS but baseline does not contain matching value") |
|||
} |
|||
} |
|||
// More precise: baseline must contain the specific numbers cited in G6.
|
|||
if !strings.Contains(baselineContent, "28,347") && !strings.Contains(baselineContent, "28,429") && |
|||
!strings.Contains(baselineContent, "28,453") { |
|||
t.Fatal("baseline must contain RF=2 sync_all write IOPS data (28,3xx-28,4xx range)") |
|||
} |
|||
if !strings.Contains(baselineContent, "136,648") { |
|||
t.Fatal("baseline must contain RF=2 read IOPS data (136,648)") |
|||
} |
|||
|
|||
// 2. G5 cites "-56% replication tax" — baseline must contain this.
|
|||
if strings.Contains(gatesContent, "-56%") { |
|||
if !strings.Contains(baselineContent, "-56%") { |
|||
t.Fatal("G5 cites -56% replication tax but baseline does not contain -56%") |
|||
} |
|||
} |
|||
|
|||
// 3. Launch envelope claims specific transport/network combos — verify against baseline.
|
|||
// Claimed: NVMe-TCP @ 25Gbps RoCE
|
|||
if strings.Contains(gatesContent, "NVMe-TCP @ 25Gbps RoCE") { |
|||
if !strings.Contains(baselineContent, "NVMe-TCP") || !strings.Contains(baselineContent, "RoCE") { |
|||
t.Fatal("launch envelope claims NVMe-TCP @ RoCE but baseline has no such data") |
|||
} |
|||
} |
|||
// Claimed: iSCSI @ 25Gbps RoCE
|
|||
if strings.Contains(gatesContent, "iSCSI @ 25Gbps RoCE") { |
|||
if !strings.Contains(baselineContent, "iSCSI") || !strings.Contains(baselineContent, "RoCE") { |
|||
t.Fatal("launch envelope claims iSCSI @ RoCE but baseline has no such data") |
|||
} |
|||
} |
|||
// Claimed: iSCSI @ 1Gbps
|
|||
if strings.Contains(gatesContent, "iSCSI @ 1Gbps") { |
|||
if !strings.Contains(baselineContent, "iSCSI") || !strings.Contains(baselineContent, "1Gbps") { |
|||
t.Fatal("launch envelope claims iSCSI @ 1Gbps but baseline has no such data") |
|||
} |
|||
} |
|||
// Exclusion: NVMe-TCP @ 1Gbps must NOT be claimed as supported.
|
|||
if strings.Contains(gatesContent, "NOT included") { |
|||
// Verify baseline indeed lacks NVMe-TCP @ 1Gbps.
|
|||
hasNvme1g := strings.Contains(baselineContent, "NVMe-TCP") && strings.Contains(baselineContent, "| NVMe-TCP | 1Gbps") |
|||
if hasNvme1g { |
|||
t.Fatal("baseline contains NVMe-TCP @ 1Gbps data but gates doc excludes it — resolve mismatch") |
|||
} |
|||
} |
|||
|
|||
// 4. G7 cites blocker ledger counts — verify against actual ledger.
|
|||
if strings.Contains(gatesContent, "3 diagnosed") { |
|||
diagCount := 0 |
|||
for _, line := range strings.Split(blockerContent, "\n") { |
|||
if strings.HasPrefix(strings.TrimSpace(line), "| B") { |
|||
diagCount++ |
|||
} |
|||
} |
|||
if diagCount != 3 { |
|||
t.Fatalf("G7 claims 3 diagnosed blockers but ledger has %d", diagCount) |
|||
} |
|||
} |
|||
if strings.Contains(gatesContent, "3 unresolved") { |
|||
unresCount := 0 |
|||
for _, line := range strings.Split(blockerContent, "\n") { |
|||
if strings.HasPrefix(strings.TrimSpace(line), "| U") { |
|||
unresCount++ |
|||
} |
|||
} |
|||
if unresCount != 3 { |
|||
t.Fatalf("G7 claims 3 unresolved blockers but ledger has %d", unresCount) |
|||
} |
|||
} |
|||
|
|||
// 5. Floor doc gate thresholds must match code-defined gates.
|
|||
for workload, gate := range perfFloorGates { |
|||
// The doc uses comma-formatted numbers (e.g., "1,000" or "5,000").
|
|||
minInt := int(gate.MinIOPS) |
|||
// Check for both comma-formatted and plain forms.
|
|||
found := false |
|||
for _, form := range []string{ |
|||
fmt.Sprintf("%d", minInt), // "1000"
|
|||
fmt.Sprintf("%d,%03d", minInt/1000, minInt%1000), // "1,000"
|
|||
} { |
|||
if strings.Contains(floorContent, form) { |
|||
found = true |
|||
break |
|||
} |
|||
} |
|||
if !found { |
|||
t.Errorf("floor doc gate for %s should cite minimum %d IOPS but doesn't", workload, minInt) |
|||
} |
|||
} |
|||
|
|||
t.Logf("P12P4 RolloutGate: floor doc %d bytes, gates doc %d bytes, %d gate items", |
|||
len(floorData), len(gatesData), gateLines) |
|||
t.Log("P12P4 RolloutGate: semantic cross-checks passed (baseline, blocker ledger, gate thresholds)") |
|||
t.Log("P12P4 RolloutGate: PASS — bounded launch envelope with verified evidence alignment") |
|||
} |
|||
|
|||
// --- Helpers ---
|
|||
|
|||
func init() { |
|||
// Seed random for reproducible LBA patterns within a test run.
|
|||
mrand.Seed(time.Now().UnixNano()) |
|||
} |
|||
|
|||
// formatDuration formats a duration for table display.
|
|||
func formatDuration(d time.Duration) string { |
|||
if d < time.Microsecond { |
|||
return fmt.Sprintf("%dns", d.Nanoseconds()) |
|||
} |
|||
if d < time.Millisecond { |
|||
return fmt.Sprintf("%.1fus", float64(d.Nanoseconds())/1000.0) |
|||
} |
|||
return fmt.Sprintf("%.2fms", float64(d.Nanoseconds())/1e6) |
|||
} |
|||
@ -0,0 +1,308 @@ |
|||
package weed_server |
|||
|
|||
import ( |
|||
"context" |
|||
"fmt" |
|||
"os" |
|||
"path/filepath" |
|||
"strings" |
|||
"testing" |
|||
"time" |
|||
|
|||
"github.com/seaweedfs/seaweedfs/weed/pb/master_pb" |
|||
engine "github.com/seaweedfs/seaweedfs/sw-block/engine/replication" |
|||
"github.com/seaweedfs/seaweedfs/weed/storage" |
|||
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol" |
|||
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol/v2bridge" |
|||
) |
|||
|
|||
// ============================================================
|
|||
// Phase 12 P2: Soak / Long-Run Stability Hardening
|
|||
//
|
|||
// Proves: repeated chosen-path cycles return to bounded truth
|
|||
// without hidden state drift or unbounded runtime artifacts.
|
|||
//
|
|||
// NOT diagnosability, NOT performance-floor, NOT rollout readiness.
|
|||
// ============================================================
|
|||
|
|||
const soakCycles = 5 |
|||
|
|||
type soakSetup struct { |
|||
ms *MasterServer |
|||
bs *BlockService |
|||
store *storage.BlockVolumeStore |
|||
dir string |
|||
} |
|||
|
|||
func newSoakSetup(t *testing.T) *soakSetup { |
|||
t.Helper() |
|||
dir := t.TempDir() |
|||
store := storage.NewBlockVolumeStore() |
|||
|
|||
ms := &MasterServer{ |
|||
blockRegistry: NewBlockVolumeRegistry(), |
|||
blockAssignmentQueue: NewBlockAssignmentQueue(), |
|||
blockFailover: newBlockFailoverState(), |
|||
} |
|||
ms.blockRegistry.MarkBlockCapable("vs1:9333") |
|||
ms.blockRegistry.MarkBlockCapable("vs2:9333") |
|||
|
|||
ms.blockVSAllocate = func(ctx context.Context, server string, name string, sizeBytes uint64, diskType string, durabilityMode string) (*blockAllocResult, error) { |
|||
sanitized := strings.ReplaceAll(server, ":", "_") |
|||
serverDir := filepath.Join(dir, sanitized) |
|||
os.MkdirAll(serverDir, 0755) |
|||
volPath := filepath.Join(serverDir, fmt.Sprintf("%s.blk", name)) |
|||
vol, err := blockvol.CreateBlockVol(volPath, blockvol.CreateOptions{ |
|||
VolumeSize: 1 * 1024 * 1024, |
|||
BlockSize: 4096, |
|||
WALSize: 256 * 1024, |
|||
}) |
|||
if err != nil { |
|||
return nil, err |
|||
} |
|||
vol.Close() |
|||
if _, err := store.AddBlockVolume(volPath, ""); err != nil { |
|||
return nil, err |
|||
} |
|||
host := server |
|||
if idx := strings.LastIndex(server, ":"); idx >= 0 { |
|||
host = server[:idx] |
|||
} |
|||
return &blockAllocResult{ |
|||
Path: volPath, |
|||
IQN: fmt.Sprintf("iqn.2024.test:%s", name), |
|||
ISCSIAddr: host + ":3260", |
|||
ReplicaDataAddr: server + ":14260", |
|||
ReplicaCtrlAddr: server + ":14261", |
|||
RebuildListenAddr: server + ":15000", |
|||
}, nil |
|||
} |
|||
ms.blockVSDelete = func(ctx context.Context, server string, name string) error { return nil } |
|||
|
|||
bs := &BlockService{ |
|||
blockStore: store, |
|||
blockDir: filepath.Join(dir, "vs1_9333"), |
|||
listenAddr: "127.0.0.1:3260", |
|||
localServerID: "vs1:9333", |
|||
v2Bridge: v2bridge.NewControlBridge(), |
|||
v2Orchestrator: engine.NewRecoveryOrchestrator(), |
|||
replStates: make(map[string]*volReplState), |
|||
} |
|||
bs.v2Recovery = NewRecoveryManager(bs) |
|||
|
|||
t.Cleanup(func() { |
|||
bs.v2Recovery.Shutdown() |
|||
store.Close() |
|||
}) |
|||
|
|||
return &soakSetup{ms: ms, bs: bs, store: store, dir: dir} |
|||
} |
|||
|
|||
func (s *soakSetup) deliver(server string) int { |
|||
pending := s.ms.blockAssignmentQueue.Peek(server) |
|||
if len(pending) == 0 { |
|||
return 0 |
|||
} |
|||
protoAssignments := blockvol.AssignmentsToProto(pending) |
|||
goAssignments := blockvol.AssignmentsFromProto(protoAssignments) |
|||
s.bs.ProcessAssignments(goAssignments) |
|||
return len(goAssignments) |
|||
} |
|||
|
|||
// --- Repeated create/failover/recover cycles with end-of-cycle truth checks ---
|
|||
|
|||
func TestP12P2_RepeatedCycles_NoDrift(t *testing.T) { |
|||
s := newSoakSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
for cycle := 1; cycle <= soakCycles; cycle++ { |
|||
volName := fmt.Sprintf("soak-vol-%d", cycle) |
|||
|
|||
// Step 1: Create.
|
|||
createResp, err := s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: volName, SizeBytes: 1 << 20, |
|||
}) |
|||
if err != nil { |
|||
t.Fatalf("cycle %d create: %v", cycle, err) |
|||
} |
|||
primaryVS := createResp.VolumeServer |
|||
|
|||
// Deliver initial assignment.
|
|||
s.bs.localServerID = primaryVS |
|||
s.deliver(primaryVS) |
|||
time.Sleep(100 * time.Millisecond) |
|||
|
|||
entry, ok := s.ms.blockRegistry.Lookup(volName) |
|||
if !ok { |
|||
t.Fatalf("cycle %d: volume not in registry", cycle) |
|||
} |
|||
|
|||
// Step 2: Failover.
|
|||
s.ms.blockRegistry.UpdateEntry(volName, func(e *BlockVolumeEntry) { |
|||
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute) |
|||
}) |
|||
s.ms.failoverBlockVolumes(primaryVS) |
|||
|
|||
entryAfter, _ := s.ms.blockRegistry.Lookup(volName) |
|||
if entryAfter.Epoch <= entry.Epoch { |
|||
t.Fatalf("cycle %d: epoch did not increase: %d <= %d", cycle, entryAfter.Epoch, entry.Epoch) |
|||
} |
|||
|
|||
// Deliver failover assignment.
|
|||
newPrimary := entryAfter.VolumeServer |
|||
s.bs.localServerID = newPrimary |
|||
s.deliver(newPrimary) |
|||
time.Sleep(100 * time.Millisecond) |
|||
|
|||
// Step 3: Reconnect.
|
|||
s.ms.recoverBlockVolumes(primaryVS) |
|||
s.bs.localServerID = primaryVS |
|||
s.deliver(primaryVS) |
|||
s.bs.localServerID = newPrimary |
|||
s.deliver(newPrimary) |
|||
time.Sleep(100 * time.Millisecond) |
|||
|
|||
// === End-of-cycle truth checks ===
|
|||
|
|||
// Registry: volume exists, epoch monotonic.
|
|||
finalEntry, ok := s.ms.blockRegistry.Lookup(volName) |
|||
if !ok { |
|||
t.Fatalf("cycle %d: volume missing from registry at end", cycle) |
|||
} |
|||
if finalEntry.Epoch < entryAfter.Epoch { |
|||
t.Fatalf("cycle %d: registry epoch regressed: %d < %d", cycle, finalEntry.Epoch, entryAfter.Epoch) |
|||
} |
|||
|
|||
// VS-visible: promoted vol epoch matches.
|
|||
var volEpoch uint64 |
|||
if err := s.store.WithVolume(entryAfter.Path, func(vol *blockvol.BlockVol) error { |
|||
volEpoch = vol.Epoch() |
|||
return nil |
|||
}); err != nil { |
|||
t.Fatalf("cycle %d: promoted vol access failed: %v", cycle, err) |
|||
} |
|||
if volEpoch < entryAfter.Epoch { |
|||
t.Fatalf("cycle %d: vol epoch=%d < registry=%d", cycle, volEpoch, entryAfter.Epoch) |
|||
} |
|||
|
|||
// Publication: lookup matches registry truth (not just non-empty).
|
|||
lookupResp, err := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: volName}) |
|||
if err != nil { |
|||
t.Fatalf("cycle %d: lookup failed: %v", cycle, err) |
|||
} |
|||
if lookupResp.IscsiAddr != finalEntry.ISCSIAddr { |
|||
t.Fatalf("cycle %d: lookup iSCSI=%q != registry=%q", cycle, lookupResp.IscsiAddr, finalEntry.ISCSIAddr) |
|||
} |
|||
if lookupResp.VolumeServer != finalEntry.VolumeServer { |
|||
t.Fatalf("cycle %d: lookup VS=%q != registry=%q", cycle, lookupResp.VolumeServer, finalEntry.VolumeServer) |
|||
} |
|||
|
|||
t.Logf("cycle %d: registry=%d vol=%d lookup=registry ✓", |
|||
cycle, finalEntry.Epoch, volEpoch) |
|||
} |
|||
|
|||
t.Logf("P12P2 repeated cycles: %d cycles, all end-of-cycle truth checks passed", soakCycles) |
|||
} |
|||
|
|||
// --- Runtime state hygiene: no unbounded leftovers after cycles ---
|
|||
|
|||
func TestP12P2_RuntimeHygiene_NoLeftovers(t *testing.T) { |
|||
s := newSoakSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
// Create and delete several volumes to exercise lifecycle.
|
|||
for i := 1; i <= soakCycles; i++ { |
|||
name := fmt.Sprintf("hygiene-vol-%d", i) |
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: name, SizeBytes: 1 << 20, |
|||
}) |
|||
entry, _ := s.ms.blockRegistry.Lookup(name) |
|||
s.bs.localServerID = entry.VolumeServer |
|||
s.deliver(entry.VolumeServer) |
|||
time.Sleep(50 * time.Millisecond) |
|||
} |
|||
|
|||
// Delete all volumes.
|
|||
for i := 1; i <= soakCycles; i++ { |
|||
name := fmt.Sprintf("hygiene-vol-%d", i) |
|||
s.ms.DeleteBlockVolume(ctx, &master_pb.DeleteBlockVolumeRequest{Name: name}) |
|||
} |
|||
|
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
// Check: no stale recovery tasks.
|
|||
activeTasks := s.bs.v2Recovery.ActiveTaskCount() |
|||
if activeTasks > 0 { |
|||
t.Fatalf("stale recovery tasks: %d (expected 0 after all volumes deleted)", activeTasks) |
|||
} |
|||
|
|||
// Check: registry should have no entries for deleted volumes.
|
|||
for i := 1; i <= soakCycles; i++ { |
|||
name := fmt.Sprintf("hygiene-vol-%d", i) |
|||
if _, ok := s.ms.blockRegistry.Lookup(name); ok { |
|||
t.Fatalf("stale registry entry: %s (should be deleted)", name) |
|||
} |
|||
} |
|||
|
|||
// Check: assignment queue should not have unbounded stale entries.
|
|||
for _, server := range []string{"vs1:9333", "vs2:9333"} { |
|||
pending := s.ms.blockAssignmentQueue.Peek(server) |
|||
// Some pending entries may exist (lease grants etc), but check for bounded size.
|
|||
if len(pending) > soakCycles*2 { |
|||
t.Fatalf("unbounded stale assignments for %s: %d", server, len(pending)) |
|||
} |
|||
} |
|||
|
|||
t.Logf("P12P2 hygiene: %d volumes created+deleted, 0 stale tasks, 0 stale registry, bounded queue", soakCycles) |
|||
} |
|||
|
|||
// --- Steady-state repeated delivery: idempotence holds over many cycles ---
|
|||
|
|||
func TestP12P2_SteadyState_IdempotenceHolds(t *testing.T) { |
|||
s := newSoakSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: "steady-vol-1", SizeBytes: 1 << 20, |
|||
}) |
|||
|
|||
entry, _ := s.ms.blockRegistry.Lookup("steady-vol-1") |
|||
s.bs.localServerID = entry.VolumeServer |
|||
s.deliver(entry.VolumeServer) |
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
replicaID := entry.Path + "/" + entry.Replicas[0].Server |
|||
eventsAfterFirst := len(s.bs.v2Orchestrator.Log.EventsFor(replicaID)) |
|||
if eventsAfterFirst == 0 { |
|||
t.Fatal("first delivery must create events") |
|||
} |
|||
|
|||
// Deliver the same assignment many times.
|
|||
for i := 0; i < soakCycles*2; i++ { |
|||
s.deliver(entry.VolumeServer) |
|||
time.Sleep(20 * time.Millisecond) |
|||
} |
|||
|
|||
eventsAfterSoak := len(s.bs.v2Orchestrator.Log.EventsFor(replicaID)) |
|||
if eventsAfterSoak != eventsAfterFirst { |
|||
t.Fatalf("idempotence drift: events %d → %d after %d repeated deliveries", |
|||
eventsAfterFirst, eventsAfterSoak, soakCycles*2) |
|||
} |
|||
|
|||
// Verify: registry, vol, and lookup are still coherent.
|
|||
finalEntry, _ := s.ms.blockRegistry.Lookup("steady-vol-1") |
|||
if finalEntry.Epoch != entry.Epoch { |
|||
t.Fatalf("epoch drifted: %d → %d", entry.Epoch, finalEntry.Epoch) |
|||
} |
|||
|
|||
lookupResp, _ := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: "steady-vol-1"}) |
|||
if lookupResp.IscsiAddr != finalEntry.ISCSIAddr { |
|||
t.Fatalf("lookup iSCSI=%q != registry=%q after soak", lookupResp.IscsiAddr, finalEntry.ISCSIAddr) |
|||
} |
|||
if lookupResp.VolumeServer != finalEntry.VolumeServer { |
|||
t.Fatalf("lookup VS=%q != registry=%q after soak", lookupResp.VolumeServer, finalEntry.VolumeServer) |
|||
} |
|||
|
|||
t.Logf("P12P2 steady state: %d repeated deliveries, events stable at %d, lookup=registry ✓", |
|||
soakCycles*2, eventsAfterFirst) |
|||
} |
|||
Write
Preview
Loading…
Cancel
Save
Reference in new issue