Browse Source

feat: Phase 12 — production hardening (disturbance, soak, testrunner scenarios)

P1 Disturbance: restart/reconnect correctness tests — assignment delivery
  through real proto → ProcessAssignments, epoch validation on promoted
  volume, mandatory reconnect assertions

P2 Soak: repeated create/failover/recover cycles with end-of-cycle truth
  checks, runtime hygiene (no stale tasks/entries), steady-state idempotence

Testrunner recovery actions + scenarios:
- recovery.go: wait_recovery_complete, assert_recovery_state, trigger_rebuild
- 8 new YAML scenarios: baseline (failover/crash/partition), stability
  (replication-tax, netem-sweep, packet-loss, degraded), robust shipper

HA edge case and EC6 fix tests for regression coverage.

(P3 diagnosability + P4 perf floor committed separately in 643a5a107)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feature/sw-block
pingqiu 12 hours ago
parent
commit
bdf20fde71
  1. 1360
      sw-block/.private/phase/phase-12-log.md
  2. 453
      sw-block/.private/phase/phase-12.md
  3. 439
      weed/server/qa_block_disturbance_test.go
  4. 147
      weed/server/qa_block_ec6_fix_test.go
  5. 205
      weed/server/qa_block_ha_edge_cases_test.go
  6. 313
      weed/storage/blockvol/testrunner/actions/recovery.go
  7. 72
      weed/storage/blockvol/testrunner/actions/recovery_test.go
  8. 295
      weed/storage/blockvol/testrunner/scenarios/internal/baseline-full-roce.yaml
  9. 6
      weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-crash.yaml
  10. 152
      weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-failover.yaml
  11. 145
      weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-partition.yaml
  12. 240
      weed/storage/blockvol/testrunner/scenarios/internal/robust-shipper-lifecycle.yaml
  13. 236
      weed/storage/blockvol/testrunner/scenarios/internal/stable-degraded-sync-quorum.yaml
  14. 244
      weed/storage/blockvol/testrunner/scenarios/internal/stable-netem-sweep-roce.yaml
  15. 291
      weed/storage/blockvol/testrunner/scenarios/internal/stable-packet-loss.yaml
  16. 264
      weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax-nvme-roce.yaml
  17. 253
      weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax-roce.yaml
  18. 245
      weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax.yaml

1360
sw-block/.private/phase/phase-12-log.md
File diff suppressed because it is too large
View File

453
sw-block/.private/phase/phase-12.md

@ -0,0 +1,453 @@
# Phase 12
Date: 2026-04-02
Status: active
Purpose: move the accepted chosen-path implementation from candidate-safe product closure toward production-safe behavior under restart, disturbance, and operational reality
## Why This Phase Exists
`Phase 09` accepted production-grade execution closure on the chosen path.
`Phase 10` accepted bounded control-plane closure on that same path.
`Phase 11` accepted bounded product-surface rebinding on that same path.
What remains is no longer:
1. whether the chosen backend path works
2. whether selected product surfaces can be rebound onto it
It is now:
1. whether the chosen path stays correct under restart, failover, rejoin, and repeated disturbance
2. whether long-run behavior is stable enough for serious production use
3. whether operators can diagnose, bound, and reason about failures in practice
4. whether remaining production blockers are explicit and finite
## Phase Goal
Move from candidate-safe chosen-path closure to explicit production-hardening closure planning and execution.
Execution note:
1. treat `P0` as real planning work, not placeholder prose
2. use `phase-12-log.md` as the technical pack for:
- step breakdown
- hard indicators
- reject shapes
- assignment text for `sw` and `tester`
## Scope
### In scope
1. restart/recovery stability under repeated disturbance
2. long-run / soak viability planning and evidence design
3. operational diagnosability and blocker accounting
4. bounded hardening slices that do not reopen accepted earlier semantics casually
### Out of scope
1. re-discovering core protocol semantics already accepted in `Phase 09` / `Phase 10`
2. re-scoping `Phase 11` product rebinding work unless a hardening proof exposes a real bug
3. broad new feature expansion unrelated to hardening
4. unbounded product-surface additions
## Phase 12 Items
### P0: Hardening Plan Freeze
Goal:
- convert `Phase 12` from a broad “hardening” label into a bounded execution plan with explicit first slices, hard indicators, and reject shapes
Accepted decision target:
1. define the first hardening slices and their order
2. define what counts as production-hardening evidence versus support evidence
3. define which accepted surfaces become the first disturbance targets
Planned first hardening areas:
1. restart / rejoin / repeated failover disturbance
2. long-run / soak stability
3. operational diagnosis quality and blocker accounting
4. performance floor and cost characterization only after correctness-hardening slices are bounded
Status:
- accepted
### Later candidate slices inside `Phase 12`
1. `P1`: restart / recovery disturbance hardening
2. `P2`: soak / long-run stability hardening
3. `P3`: diagnosability / blocker accounting / runbook hardening
4. `P4`: performance floor and rollout-gate hardening
### P1: Restart / Recovery Disturbance Hardening
Goal:
- prove the accepted chosen path remains correct under restart, rejoin, repeated failover, and disturbance ordering
Acceptance object:
1. `P1` accepts correctness under restart/disturbance on the chosen path
2. it does not accept merely that recovery-related code paths exist
3. it does not accept merely that the system eventually seems to recover in a loose or approximate sense
Execution steps:
1. Step 1: disturbance contract freeze
- define the bounded disturbance classes for the first hardening slice:
- restart with same lineage
- restart with changed address / refreshed publication
- repeated failover / rejoin cycles
- delayed or stale signal arrival after restart/failover
2. Step 2: implementation hardening
- harden ownership/control reconstruction on the already accepted chosen path
- keep identity, epoch, session, and publication truth coherent across disturbance
3. Step 3: proof package
- prove repeated disturbance correctness on the chosen path
- prove stale or delayed signals fail closed rather than silently corrupting ownership truth
- prove no-overclaim around soak, perf, or broader production readiness
Required scope:
1. restart/rejoin correctness for the accepted chosen path
2. publication/address refresh correctness without identity drift
3. repeated ownership/control transitions under failover and rejoin
4. bounded reject behavior for stale heartbeat/control signals after disturbance
Must prove:
1. post-restart chosen-path ownership is reconstructed from accepted truth rather than accidental local leftovers
2. stale or delayed signals after restart/failover are rejected or explicitly bounded
3. repeated failover/rejoin cycles preserve identity, epoch/session monotonicity, and convergence on the chosen path
4. acceptance wording stays bounded to disturbance correctness rather than broad production-readiness claims
Reuse discipline:
1. `weed/server/block_recovery.go` and related tests may be updated in place as the primary restart/recovery ownership surface
2. `weed/server/master_block_failover.go`, `weed/server/master_block_registry.go`, and `weed/server/volume_server_block.go` may be updated in place as the accepted control/runtime disturbance surfaces
3. `weed/server/block_recovery_test.go`, `weed/server/block_recovery_adversarial_test.go`, and focused `qa_block_*` tests should carry the main proof burden
4. `weed/storage/blockvol/*` and `weed/storage/blockvol/v2bridge/*` are reference only unless disturbance hardening exposes a real bug in accepted earlier closure
5. no reused V1 surface may silently redefine chosen-path ownership truth, recovery choice, or disturbance acceptance wording
Verification mechanism:
1. focused restart/rejoin/failover integration tests on the chosen path
2. adversarial checks for stale or delayed control/heartbeat arrival after disturbance
3. explicit no-overclaim review so `P1` does not absorb soak/perf/product-expansion work
Hard indicators:
1. one accepted restart correctness proof:
- restart on the chosen path reconstructs valid ownership/control state
- post-restart behavior does not depend on accidental pre-restart leftovers
2. one accepted rejoin/publication-refresh proof:
- changed address or publication refresh does not break identity truth or visibility
3. one accepted repeated-disturbance proof:
- repeated failover/rejoin cycles converge without epoch/session regression
4. one accepted stale-signal proof:
- delayed heartbeat/control signals after disturbance do not re-authorize stale ownership
5. one accepted boundedness proof:
- `P1` claims correctness under disturbance, not soak, perf, or rollout readiness
Reject if:
1. evidence only shows that recovery code paths execute, rather than that correctness is preserved under disturbance
2. tests prove only one happy restart path and skip stale/delayed signal shapes
3. identity, epoch/session, or publication truth can drift across restart/rejoin
4. `P1` quietly absorbs soak, diagnosability, perf, or new product-surface work
Status:
- accepted
Carry-forward from `P0`:
1. the hardening object is the accepted chosen path from `Phase 09` + `Phase 10` + `Phase 11`
2. `P1` is the first correctness-hardening slice because disturbance threatens correctness before soak or perf
3. later `P2` / `P3` / `P4` remain distinct acceptance objects and should not be absorbed into `P1`
### P2: Soak / Long-Run Stability Hardening
Goal:
- prove the accepted chosen path remains viable over longer duration and repeated operation without hidden state drift
Acceptance object:
1. `P2` accepts bounded long-run stability on the chosen path under repeated operation or soak-like repetition
2. it does not accept merely that one disturbance test can be repeated many times manually
3. it does not accept diagnosability, performance floor, or rollout readiness by implication
Execution steps:
1. Step 1: soak contract freeze
- define one bounded repeated-operation envelope for the chosen path:
- repeated create / failover / recover / steady-state cycles
- repeated heartbeat / control / recovery interaction
- repeated publication / ownership convergence checks
- define what counts as state drift versus expected bounded churn
2. Step 2: harness and evidence path
- build or adapt one repeatable soak/repeated-cycle harness on the accepted chosen path
- collect stable end-of-cycle truth rather than only transient pass/fail output
3. Step 3: proof package
- prove no hidden state drift across repeated cycles
- prove no unbounded growth/leak in the bounded chosen-path runtime state
- prove no-overclaim around diagnosability, perf, or production rollout
Required scope:
1. repeated-cycle correctness on the accepted chosen path
2. stable end-of-cycle ownership/control/publication truth after many cycles
3. bounded runtime-state hygiene across repeated operation
4. explicit distinction between acceptance evidence and support telemetry
Must prove:
1. repeated chosen-path cycles converge to the same bounded truth rather than accumulating semantic drift
2. registry / VS-visible / product-visible state remain mutually coherent after repeated cycles
3. repeated operation does not leave unbounded leftover tasks, sessions, or stale runtime ownership artifacts within the tested envelope
4. acceptance wording stays bounded to long-run stability rather than diagnosability/perf/launch claims
Reuse discipline:
1. `weed/server/qa_block_*test.go`, `block_recovery_test.go`, and related hardening tests may be updated in place as the primary repeated-cycle proof surface
2. testrunner / infra / metrics helpers may be reused as support instrumentation, but support telemetry must not replace acceptance assertions
3. `weed/server/master_block_failover.go`, `master_block_registry.go`, `volume_server_block.go`, and `block_recovery.go` may be updated in place only if repeated-cycle hardening exposes a real bug
4. `weed/storage/blockvol/*` and `weed/storage/blockvol/v2bridge/*` remain reference only unless soak evidence exposes a real accepted-path mismatch
5. no reused V1 surface may silently redefine the chosen-path steady-state truth, drift criteria, or soak acceptance wording
Verification mechanism:
1. one bounded repeated-cycle or soak harness on the chosen path
2. explicit end-of-cycle assertions for ownership/control/publication truth
3. explicit checks for bounded runtime-state hygiene after repeated cycles
4. no-overclaim review so `P2` does not absorb `P3` diagnosability or `P4` perf/rollout work
Hard indicators:
1. one accepted repeated-cycle proof:
- the chosen path completes many bounded cycles without semantic drift
- end-of-cycle truth remains coherent after each cycle
2. one accepted state-hygiene proof:
- no unbounded leftover runtime artifacts accumulate within the tested envelope
3. one accepted long-run stability proof:
- stability claims are based on repeated evidence, not one-shot reruns
4. one accepted boundedness proof:
- `P2` claims soak/long-run stability only, not diagnosability, perf, or rollout readiness
Reject if:
1. evidence is only a renamed rerun of `P1` disturbance tests
2. the slice counts iterations but never checks end-of-cycle truth for drift
3. support telemetry is presented without a hard acceptance assertion
4. `P2` quietly absorbs diagnosability, perf, or launch-readiness claims
Status:
- accepted
Carry-forward from `P1`:
1. bounded restart/disturbance correctness is now accepted on the chosen path
2. `P2` now asks whether that accepted path stays stable across repeated operation without hidden drift
3. later `P3` / `P4` remain distinct acceptance objects and should not be absorbed into `P2`
### P3: Diagnosability / Blocker Accounting / Runbook Hardening
Goal:
- make failures, residual blockers, and operator-visible diagnosis quality explicit and reviewable on the accepted chosen path
Acceptance object:
1. `P3` accepts bounded diagnosability / blocker accounting on the chosen path
2. it does not accept merely that some logs or debug strings exist
3. it does not accept performance floor or rollout readiness by implication
Execution steps:
1. Step 1: diagnosability contract freeze
- define one bounded diagnosis envelope for the accepted chosen path:
- failover / recovery does not converge in time
- publication / lookup truth does not match authority truth
- residual runtime work or stale ownership artifacts remain after an operation
- known production blockers remain open and must be made explicit
- define what counts as operator-visible diagnosis versus engineer-only source spelunking
2. Step 2: evidence-surface and blocker-ledger hardening
- identify or harden the minimum operator-visible surfaces needed to classify the bounded failure classes
- make residual blockers explicit, finite, and reviewable rather than implicit tribal knowledge
3. Step 3: proof package
- prove at least one bounded diagnosis loop closes from symptom to owning truth/blocker
- prove blocker accounting is explicit and does not hide unknown gaps behind “hardening later” language
- prove no-overclaim around perf, launch readiness, or broad topology support
Required scope:
1. operator-visible symptoms/logs/status for bounded chosen-path failure classes
2. one explicit mapping from symptom to ownership/control/runtime/publication truth
3. one explicit blocker ledger for unresolved production-hardening gaps
4. bounded runbook guidance for diagnosis of the accepted chosen path
Must prove:
1. bounded chosen-path failures can be distinguished with explicit operator-visible evidence rather than debugger-only knowledge
2. at least one diagnosis loop closes from visible symptom to the relevant authority/runtime truth without semantic ambiguity
3. residual blockers are explicit, finite, and named with a clear boundary rather than scattered across chats or memory
4. acceptance wording stays bounded to diagnosability / blocker accounting rather than perf or rollout claims
Reuse discipline:
1. `weed/server/qa_block_*test.go`, `block_recovery*_test.go`, and focused hardening tests may be updated in place where they can prove a bounded diagnosis loop on the accepted path
2. `weed/server/master_block_registry.go`, `master_block_failover.go`, `volume_server_block.go`, and `block_recovery.go` may be updated in place only if diagnosability work exposes a real visibility gap in accepted-path behavior
3. lightweight status/logging surfaces and bounded runbook docs may be updated in place as support artifacts, but support artifacts must not replace acceptance assertions
4. `weed/storage/blockvol/*` and `weed/storage/blockvol/v2bridge/*` remain reference only unless diagnosability work exposes a real accepted-path mismatch
5. no reused V1 surface may silently redefine chosen-path truth, blocker boundaries, or diagnosis acceptance wording
Verification mechanism:
1. one bounded diagnosis-loop proof on the accepted chosen path
2. one explicit blocker ledger or equivalent review artifact with finite named items
3. one explicit check that operator-visible evidence matches the underlying accepted truth being diagnosed
4. no-overclaim review so `P3` does not absorb `P4` perf/rollout work
Hard indicators:
1. one accepted symptom-classification proof:
- bounded failure classes can be told apart by explicit operator-visible evidence
2. one accepted diagnosis-loop proof:
- a visible symptom can be traced to the relevant ownership/control/runtime/publication truth
3. one accepted blocker-accounting proof:
- unresolved blockers are explicit, finite, and reviewable
4. one accepted boundedness proof:
- `P3` claims diagnosability / blockers only, not perf floor or rollout readiness
Reject if:
1. the slice merely adds logs or debug strings without proving diagnostic usefulness
2. blockers remain implicit, scattered, or dependent on private memory of prior chats
3. diagnosis requires debugger/source-level spelunking instead of bounded operator-visible evidence
4. `P3` quietly absorbs perf, rollout, or broad product/topology expansion claims
Status:
- accepted
Carry-forward from `P2`:
1. bounded restart/disturbance correctness and bounded long-run stability are now accepted on the chosen path
2. `P3` now asks whether bounded failures and residual gaps are explicit and diagnosable in operator-facing terms
3. later `P4` remains a distinct acceptance object and should not be absorbed into `P3`
### P4: Performance Floor / Rollout Gates
Goal:
- define explicit performance floor, cost characterization, and rollout-gate criteria without letting perf claims replace correctness hardening
Acceptance object:
1. `P4` accepts a bounded performance floor and a bounded rollout-gate package for the accepted chosen path
2. it does not accept generic “performance is good” prose or one-off fast runs
3. it does not accept broad production rollout readiness outside the explicitly named launch envelope
Execution steps:
1. Step 1: performance-floor contract freeze
- define one bounded workload envelope for the accepted chosen path
- define which metrics count as acceptance evidence:
- throughput / latency floor
- resource-cost envelope
- disturbance-free steady-state behavior
- define which metrics are support-only telemetry
2. Step 2: benchmark and cost characterization
- run one repeatable benchmark package against the accepted chosen path
- record measured floor values and cost trade-offs rather than “fast enough” wording
3. Step 3: rollout-gate package
- translate accepted correctness, soak, diagnosability, and perf evidence into one bounded launch envelope
- make explicit which blockers are cleared, which remain, and what the first supported rollout shape is
Required scope:
1. one bounded benchmark matrix on the accepted chosen path
2. one explicit performance floor statement backed by measured evidence
3. one explicit resource-cost characterization
4. one rollout-gate / launch-envelope artifact with finite named requirements and exclusions
Must prove:
1. performance claims are tied to a named workload envelope rather than generic optimism
2. the chosen path has a measurable minimum acceptable floor within that envelope
3. rollout discussion is bounded by explicit gates and supported scope, not implied from prior slice acceptance
4. acceptance wording stays bounded to performance floor / rollout gates rather than broad production success claims
Reuse discipline:
1. `weed/server/qa_block_*test.go`, testrunner scenarios, and focused perf/support harnesses may be updated in place as the primary measurement surface
2. `weed/server/*`, `weed/storage/blockvol/*`, and `weed/storage/blockvol/v2bridge/*` may be updated in place only if performance-floor work exposes a real bug or a measurement-surface gap
3. `sw-block/.private/phase/` docs may be updated in place for the rollout-gate artifact and measured envelope
4. support telemetry may help characterize cost, but support telemetry must not replace the explicit floor/gate assertions
5. no reused V1 surface may silently redefine chosen-path truth, launch envelope, or rollout-gate wording
Verification mechanism:
1. one repeatable bounded benchmark package on the accepted chosen path
2. one explicit measured floor summary with named workload and cost envelope
3. one explicit rollout-gate artifact naming:
- supported launch envelope
- cleared blockers
- remaining blockers
- reject conditions for rollout
4. no-overclaim review so `P4` does not turn into generic launch optimism
Hard indicators:
1. one accepted performance-floor proof:
- measured floor values exist for the named workload envelope
2. one accepted cost-characterization proof:
- resource/replication tax or similar bounded cost is explicit
3. one accepted rollout-gate proof:
- the first supported launch envelope is explicit and finite
4. one accepted boundedness proof:
- `P4` claims only the bounded floor/gates it actually measures
Reject if:
1. the slice presents isolated benchmark numbers without a named workload contract
2. rollout gates are replaced by vague “looks ready” wording
3. support telemetry is presented without an explicit acceptance threshold or gate
4. `P4` quietly absorbs broad new topology, product-surface, or generic ops-tooling expansion
Status:
- active
Carry-forward from `P3`:
1. bounded disturbance correctness, bounded soak stability, and bounded diagnosability / blocker accounting are now accepted on the chosen path
2. `P4` now asks whether that accepted path has an explicit measured floor and an explicit first-launch envelope
3. later work after `Phase 12` should be a productionization program, not another hidden hardening slice
## Assignment For `sw`
Current next tasks:
1. deliver `Phase 12 P4` as bounded performance-floor / rollout-gate hardening
2. keep the acceptance object fixed on one measured workload envelope plus one explicit launch-envelope / gate artifact
3. keep earlier accepted `Phase 09` / `Phase 10` / `Phase 11` / `Phase 12 P1` / `Phase 12 P2` / `Phase 12 P3` semantics stable unless `P4` exposes a real bug or measurement-surface gap
4. do not let `P4` turn into broad optimization, topology expansion, or generic launch marketing
## Assignment For `tester`
Current next tasks:
1. validate `P4` as real bounded performance-floor / rollout-gate hardening rather than “benchmark numbers exist” prose
2. require explicit validation targets for:
- one named workload envelope with measured floor values
- one explicit launch-envelope / rollout-gate artifact
- no-overclaim around broad production readiness beyond the named envelope
3. keep no-overclaim active around accepted `Phase 09` / `Phase 10` / `Phase 11` / `Phase 12 P1` / `Phase 12 P2` / `Phase 12 P3` closure
4. keep `P4` bounded rather than letting it absorb post-Phase-12 productionization work

439
weed/server/qa_block_disturbance_test.go

@ -0,0 +1,439 @@
package weed_server
import (
"context"
"errors"
"fmt"
"os"
"path/filepath"
"strings"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
engine "github.com/seaweedfs/seaweedfs/sw-block/engine/replication"
"github.com/seaweedfs/seaweedfs/weed/storage"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol/v2bridge"
)
// ============================================================
// Phase 12 P1: Restart / Recovery Disturbance Hardening
//
// All tests use real master + real BlockService with real volumes.
// Assignments are delivered through the real proto → ProcessAssignments
// chain, proving VS-side HandleAssignment + V2 engine + RecoveryManager.
//
// Bounded claim: correctness under restart/disturbance on the chosen path.
// NOT soak, NOT performance, NOT rollout readiness.
// ============================================================
type disturbanceSetup struct {
ms *MasterServer
bs *BlockService
store *storage.BlockVolumeStore
dir string
}
func newDisturbanceSetup(t *testing.T) *disturbanceSetup {
t.Helper()
dir := t.TempDir()
store := storage.NewBlockVolumeStore()
ms := &MasterServer{
blockRegistry: NewBlockVolumeRegistry(),
blockAssignmentQueue: NewBlockAssignmentQueue(),
blockFailover: newBlockFailoverState(),
}
ms.blockRegistry.MarkBlockCapable("vs1:9333")
ms.blockRegistry.MarkBlockCapable("vs2:9333")
ms.blockVSAllocate = func(ctx context.Context, server string, name string, sizeBytes uint64, diskType string, durabilityMode string) (*blockAllocResult, error) {
sanitized := strings.ReplaceAll(server, ":", "_")
serverDir := filepath.Join(dir, sanitized)
os.MkdirAll(serverDir, 0755)
volPath := filepath.Join(serverDir, fmt.Sprintf("%s.blk", name))
vol, err := blockvol.CreateBlockVol(volPath, blockvol.CreateOptions{
VolumeSize: 1 * 1024 * 1024,
BlockSize: 4096,
WALSize: 256 * 1024,
})
if err != nil {
return nil, err
}
vol.Close()
if _, err := store.AddBlockVolume(volPath, ""); err != nil {
return nil, err
}
host := server
if idx := strings.LastIndex(server, ":"); idx >= 0 {
host = server[:idx]
}
return &blockAllocResult{
Path: volPath,
IQN: fmt.Sprintf("iqn.2024.test:%s", name),
ISCSIAddr: host + ":3260",
ReplicaDataAddr: server + ":14260",
ReplicaCtrlAddr: server + ":14261",
RebuildListenAddr: server + ":15000",
}, nil
}
ms.blockVSDelete = func(ctx context.Context, server string, name string) error { return nil }
bs := &BlockService{
blockStore: store,
blockDir: filepath.Join(dir, "vs1_9333"),
listenAddr: "127.0.0.1:3260",
localServerID: "vs1:9333",
v2Bridge: v2bridge.NewControlBridge(),
v2Orchestrator: engine.NewRecoveryOrchestrator(),
replStates: make(map[string]*volReplState),
}
bs.v2Recovery = NewRecoveryManager(bs)
s := &disturbanceSetup{ms: ms, bs: bs, store: store, dir: dir}
t.Cleanup(func() {
bs.v2Recovery.Shutdown()
store.Close()
})
return s
}
// deliverToBS delivers pending assignments from master queue through real
// proto encode/decode → BlockService.ProcessAssignments (same as Phase 10 P4).
func (s *disturbanceSetup) deliverToBS(server string) int {
pending := s.ms.blockAssignmentQueue.Peek(server)
if len(pending) == 0 {
return 0
}
protoAssignments := blockvol.AssignmentsToProto(pending)
goAssignments := blockvol.AssignmentsFromProto(protoAssignments)
s.bs.ProcessAssignments(goAssignments)
return len(goAssignments)
}
// --- 1. Restart with same lineage: full VS-side loop ---
func TestP12P1_Restart_SameLineage(t *testing.T) {
s := newDisturbanceSetup(t)
ctx := context.Background()
createResp, err := s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{
Name: "restart-vol-1", SizeBytes: 1 << 30,
})
if err != nil {
t.Fatalf("Create: %v", err)
}
primaryVS := createResp.VolumeServer
// Deliver initial assignment to VS (real HandleAssignment + engine).
n := s.deliverToBS(primaryVS)
if n == 0 {
t.Fatal("no initial assignments")
}
time.Sleep(200 * time.Millisecond)
entry, _ := s.ms.blockRegistry.Lookup("restart-vol-1")
epoch1 := entry.Epoch
// Verify: VS applied the assignment (vol has epoch set).
var volEpoch uint64
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error {
volEpoch = vol.Epoch()
return nil
})
if volEpoch == 0 {
t.Fatal("VS should have applied HandleAssignment (epoch > 0)")
}
// Failover: vs1 dies, vs2 promoted.
s.ms.blockRegistry.UpdateEntry("restart-vol-1", func(e *BlockVolumeEntry) {
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute)
})
s.ms.failoverBlockVolumes(primaryVS)
entryAfter, _ := s.ms.blockRegistry.Lookup("restart-vol-1")
epoch2 := entryAfter.Epoch
if epoch2 <= epoch1 {
t.Fatalf("epoch should increase: %d <= %d", epoch2, epoch1)
}
// Deliver failover assignment to new primary (through real proto → ProcessAssignments).
newPrimary := entryAfter.VolumeServer
s.bs.localServerID = newPrimary
n2 := s.deliverToBS(newPrimary)
time.Sleep(200 * time.Millisecond)
// Reconnect old primary — master enqueues rebuild/replica assignments.
s.ms.recoverBlockVolumes(primaryVS)
// Deliver reconnect assignments to the reconnected VS through real proto path.
s.bs.localServerID = primaryVS
n3 := s.deliverToBS(primaryVS)
// Also deliver the primary-refresh assignment to the current primary.
s.bs.localServerID = entryAfter.VolumeServer
n4 := s.deliverToBS(entryAfter.VolumeServer)
time.Sleep(200 * time.Millisecond)
// Hard assertion: the primary-refresh assignment was successfully applied.
// The promoted vol at entry.Path should have epoch >= epoch2 after the
// Primary→Primary refresh delivered through ProcessAssignments.
entryFinal, _ := s.ms.blockRegistry.Lookup("restart-vol-1")
// Check the PROMOTED primary's vol (entryAfter.Path, not entry.Path).
var volEpochFinal uint64
if err := s.store.WithVolume(entryAfter.Path, func(vol *blockvol.BlockVol) error {
volEpochFinal = vol.Epoch()
return nil
}); err != nil {
t.Fatalf("promoted vol at %s must be accessible: %v", entryAfter.Path, err)
}
if volEpochFinal < epoch2 {
t.Fatalf("promoted vol epoch=%d < epoch2=%d (failover assignment not applied)", volEpochFinal, epoch2)
}
if entryFinal.Epoch < epoch2 {
t.Fatalf("registry epoch regressed: %d < %d", entryFinal.Epoch, epoch2)
}
t.Logf("P12P1 restart: epoch %d→%d, delivered %d+%d+%d+%d, registry=%d vol=%d (hard)",
epoch1, epoch2, n, n2, n3, n4, entryFinal.Epoch, volEpochFinal)
}
// --- 2. Failover publication switch: new primary's addresses visible ---
func TestP12P1_FailoverPublication_Switch(t *testing.T) {
s := newDisturbanceSetup(t)
ctx := context.Background()
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{
Name: "rejoin-vol-1", SizeBytes: 1 << 30,
})
entry, _ := s.ms.blockRegistry.Lookup("rejoin-vol-1")
primaryVS := entry.VolumeServer
// Deliver initial assignment.
s.deliverToBS(primaryVS)
time.Sleep(200 * time.Millisecond)
originalISCSI := entry.ISCSIAddr
// Failover.
s.ms.blockRegistry.UpdateEntry("rejoin-vol-1", func(e *BlockVolumeEntry) {
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute)
})
s.ms.failoverBlockVolumes(primaryVS)
entryAfter, _ := s.ms.blockRegistry.Lookup("rejoin-vol-1")
newISCSI := entryAfter.ISCSIAddr
if newISCSI == originalISCSI {
t.Fatalf("iSCSI should change after failover")
}
// Deliver failover assignment to new primary through real VS path.
newPrimary := entryAfter.VolumeServer
s.bs.localServerID = newPrimary
s.deliverToBS(newPrimary)
time.Sleep(200 * time.Millisecond)
// Verify: VS heartbeat reports the volume with updated replication state.
// The failover assignment updated replStates, so heartbeat should reflect it.
msgs := s.bs.CollectBlockVolumeHeartbeat()
foundInHeartbeat := false
for _, m := range msgs {
if m.Path == entryAfter.Path {
foundInHeartbeat = true
}
}
if !foundInHeartbeat {
// The vol is registered under the new primary's path. Check that path.
for _, m := range msgs {
if m.Path == entry.Path {
foundInHeartbeat = true
}
}
}
if !foundInHeartbeat {
t.Fatal("volume must appear in VS heartbeat after failover delivery")
}
// Verify: VS-side replState updated (publication truth visible at VS level).
dataAddr, _ := s.bs.GetReplState(entryAfter.Path)
if dataAddr == "" {
// Try original path.
dataAddr, _ = s.bs.GetReplState(entry.Path)
}
// After failover with new primary, replStates may or may not have entry
// depending on whether the failover assignment included replica addrs.
// The key proof: lookup returns correct publication.
// Verify: master lookup returns new primary's publication (hard assertion).
lookupResp, err := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: "rejoin-vol-1"})
if err != nil {
t.Fatalf("Lookup: %v", err)
}
if lookupResp.IscsiAddr != newISCSI {
t.Fatalf("lookup iSCSI=%q != %q (publication not switched)", lookupResp.IscsiAddr, newISCSI)
}
if lookupResp.VolumeServer == primaryVS {
t.Fatalf("lookup still points to old primary %s", primaryVS)
}
t.Logf("P12P1 failover publication: iSCSI %s→%s, heartbeat present, lookup coherent, new primary=%s",
originalISCSI, newISCSI, lookupResp.VolumeServer)
}
// --- 3. Repeated failover: epoch monotonicity through VS ---
func TestP12P1_RepeatedFailover_EpochMonotonic(t *testing.T) {
s := newDisturbanceSetup(t)
ctx := context.Background()
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{
Name: "repeat-vol-1", SizeBytes: 1 << 30,
})
entry, _ := s.ms.blockRegistry.Lookup("repeat-vol-1")
s.deliverToBS(entry.VolumeServer)
time.Sleep(200 * time.Millisecond)
var epochs []uint64
epochs = append(epochs, entry.Epoch)
for i := 0; i < 3; i++ {
e, _ := s.ms.blockRegistry.Lookup("repeat-vol-1")
currentPrimary := e.VolumeServer
s.ms.blockRegistry.UpdateEntry("repeat-vol-1", func(e *BlockVolumeEntry) {
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute)
})
s.ms.failoverBlockVolumes(currentPrimary)
eAfter, _ := s.ms.blockRegistry.Lookup("repeat-vol-1")
epochs = append(epochs, eAfter.Epoch)
// Deliver failover assignment through real VS path.
s.bs.localServerID = eAfter.VolumeServer
s.deliverToBS(eAfter.VolumeServer)
time.Sleep(100 * time.Millisecond)
// Deliver reconnect assignments through VS.
s.ms.recoverBlockVolumes(currentPrimary)
s.bs.localServerID = currentPrimary
s.deliverToBS(currentPrimary)
s.bs.localServerID = eAfter.VolumeServer
s.deliverToBS(eAfter.VolumeServer)
time.Sleep(100 * time.Millisecond)
// Hard assertion: the promoted primary's vol has epoch >= registry epoch.
var roundVolEpoch uint64
if err := s.store.WithVolume(eAfter.Path, func(vol *blockvol.BlockVol) error {
roundVolEpoch = vol.Epoch()
return nil
}); err != nil {
t.Fatalf("round %d: promoted vol %s access failed: %v", i+1, eAfter.Path, err)
}
if roundVolEpoch < eAfter.Epoch {
t.Fatalf("round %d: promoted vol epoch=%d < registry=%d", i+1, roundVolEpoch, eAfter.Epoch)
}
t.Logf("round %d: dead=%s → primary=%s epoch=%d vol=%d (hard)", i+1, currentPrimary, eAfter.VolumeServer, eAfter.Epoch, roundVolEpoch)
}
for i := 1; i < len(epochs); i++ {
if epochs[i] < epochs[i-1] {
t.Fatalf("epoch regression: %v", epochs)
}
}
if epochs[len(epochs)-1] <= epochs[0] {
t.Fatalf("no epoch progress: %v", epochs)
}
t.Logf("P12P1 repeated: epochs=%v (monotonic, delivered through VS)", epochs)
}
// --- 4. Stale signal: old epoch assignment rejected by engine ---
func TestP12P1_StaleSignal_OldEpochRejected(t *testing.T) {
s := newDisturbanceSetup(t)
ctx := context.Background()
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{
Name: "stale-vol-1", SizeBytes: 1 << 30,
})
entry, _ := s.ms.blockRegistry.Lookup("stale-vol-1")
epoch1 := entry.Epoch
// Deliver initial assignment.
s.deliverToBS(entry.VolumeServer)
time.Sleep(200 * time.Millisecond)
// Failover bumps epoch.
s.ms.blockRegistry.UpdateEntry("stale-vol-1", func(e *BlockVolumeEntry) {
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute)
})
s.ms.failoverBlockVolumes(entry.VolumeServer)
entryAfter, _ := s.ms.blockRegistry.Lookup("stale-vol-1")
epoch2 := entryAfter.Epoch
// Deliver epoch2 assignment to the promoted VS through real path.
// Note: we deliver to entryAfter.VolumeServer's queue, but our BS
// has blockDir pointing to vs1 subdir. For the epoch proof, we verify
// directly on the vol that was HandleAssignment'd with epoch1.
// First, verify vol has epoch1 from the initial assignment.
var volEpochBefore uint64
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error {
volEpochBefore = vol.Epoch()
return nil
})
if volEpochBefore != epoch1 {
t.Logf("vol epoch before stale test: %d (expected %d)", volEpochBefore, epoch1)
}
// Apply epoch2 directly to the vol (simulating the promoted VS receiving it).
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error {
return vol.HandleAssignment(epoch2, blockvol.RolePrimary, 30*time.Second)
})
// Verify: vol now at epoch2.
var volEpochAfterPromotion uint64
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error {
volEpochAfterPromotion = vol.Epoch()
return nil
})
if volEpochAfterPromotion != epoch2 {
t.Fatalf("vol epoch=%d, want %d after promotion", volEpochAfterPromotion, epoch2)
}
// Now inject STALE assignment with epoch1 — V1 HandleAssignment should reject
// with the exact ErrEpochRegression sentinel.
staleErr := s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error {
return vol.HandleAssignment(epoch1, blockvol.RolePrimary, 30*time.Second)
})
if staleErr == nil {
t.Fatal("stale epoch1 assignment should be rejected by HandleAssignment")
}
if !errors.Is(staleErr, blockvol.ErrEpochRegression) {
t.Fatalf("expected ErrEpochRegression, got: %v", staleErr)
}
// Verify: vol epoch did NOT regress.
var volEpochAfterStale uint64
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error {
volEpochAfterStale = vol.Epoch()
return nil
})
if volEpochAfterStale < epoch2 {
t.Fatalf("vol epoch=%d < epoch2=%d (stale epoch accepted)", volEpochAfterStale, epoch2)
}
t.Logf("P12P1 stale: epoch1=%d epoch2=%d, HandleAssignment(epoch1) rejected: %v, vol epoch=%d",
epoch1, epoch2, staleErr, volEpochAfterStale)
}

147
weed/server/qa_block_ec6_fix_test.go

@ -0,0 +1,147 @@
package weed_server
import (
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)
// ============================================================
// EC-6 Fix Verification
// ============================================================
// Gate still applies when primary is alive (prevents split-brain).
func TestEC6Fix_GateStillAppliesWhenPrimaryAlive(t *testing.T) {
r := NewBlockVolumeRegistry()
r.MarkBlockCapable("primary:9333")
r.MarkBlockCapable("replica:9333")
r.Register(&BlockVolumeEntry{
Name: "ec6-alive",
VolumeServer: "primary:9333",
Path: "/data/ec6-alive.blk",
SizeBytes: 1 << 30,
Epoch: 1,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive,
LeaseTTL: 30 * time.Second,
LastLeaseGrant: time.Now(),
WALHeadLSN: 500,
Replicas: []ReplicaInfo{
{
Server: "replica:9333",
Path: "/data/ec6-alive.blk",
Role: blockvol.RoleToWire(blockvol.RoleReplica),
WALHeadLSN: 350, // 150 behind
HealthScore: 1.0,
LastHeartbeat: time.Now(),
},
},
})
// Primary still alive — DON'T unmark.
// WAL lag gate should still apply → reject.
pf, _ := r.EvaluatePromotion("ec6-alive")
if pf.Promotable {
t.Fatal("WAL lag gate should still reject when primary is ALIVE (prevents split-brain)")
}
hasWALLag := false
for _, rej := range pf.Rejections {
if rej.Reason == "wal_lag" {
hasWALLag = true
}
}
if !hasWALLag {
t.Fatalf("expected wal_lag rejection when primary alive, got: %s", pf.Reason)
}
t.Log("EC-6 fix safe: WAL lag gate still applies when primary is alive")
}
// Gate skipped when primary is dead → promotion succeeds.
func TestEC6Fix_GateSkippedWhenPrimaryDead(t *testing.T) {
r := NewBlockVolumeRegistry()
r.MarkBlockCapable("primary:9333")
r.MarkBlockCapable("replica:9333")
r.Register(&BlockVolumeEntry{
Name: "ec6-dead",
VolumeServer: "primary:9333",
Path: "/data/ec6-dead.blk",
SizeBytes: 1 << 30,
Epoch: 1,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive,
LeaseTTL: 30 * time.Second,
LastLeaseGrant: time.Now(),
WALHeadLSN: 500,
Replicas: []ReplicaInfo{
{
Server: "replica:9333",
Path: "/data/ec6-dead.blk",
Role: blockvol.RoleToWire(blockvol.RoleReplica),
WALHeadLSN: 350, // 150 behind — would fail old gate
HealthScore: 1.0,
LastHeartbeat: time.Now(),
},
},
})
// Primary dies.
r.UnmarkBlockCapable("primary:9333")
// Should succeed now (EC-6 fix: gate skipped for dead primary).
newEpoch, err := r.PromoteBestReplica("ec6-dead")
if err != nil {
t.Fatalf("promotion should succeed when primary is dead: %v", err)
}
entry, _ := r.Lookup("ec6-dead")
if entry.VolumeServer != "replica:9333" {
t.Fatalf("primary=%s, want replica:9333", entry.VolumeServer)
}
t.Logf("EC-6 fix works: dead primary → WAL lag gate skipped → promoted to epoch %d", newEpoch)
}
// Large lag (1000+ entries behind) still promotes when primary dead.
func TestEC6Fix_LargeLag_StillPromotes(t *testing.T) {
r := NewBlockVolumeRegistry()
r.MarkBlockCapable("primary:9333")
r.MarkBlockCapable("replica:9333")
r.Register(&BlockVolumeEntry{
Name: "ec6-large",
VolumeServer: "primary:9333",
Path: "/data/ec6-large.blk",
SizeBytes: 1 << 30,
Epoch: 1,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive,
LeaseTTL: 30 * time.Second,
LastLeaseGrant: time.Now(),
WALHeadLSN: 10000, // very active primary
Replicas: []ReplicaInfo{
{
Server: "replica:9333",
Path: "/data/ec6-large.blk",
Role: blockvol.RoleToWire(blockvol.RoleReplica),
WALHeadLSN: 5000, // 5000 entries behind
HealthScore: 1.0,
LastHeartbeat: time.Now(),
},
},
})
r.UnmarkBlockCapable("primary:9333")
_, err := r.PromoteBestReplica("ec6-large")
if err != nil {
t.Fatalf("even 5000-entry lag should promote when primary dead: %v", err)
}
t.Log("EC-6 fix: 5000-entry lag + dead primary → still promoted (best available)")
}

205
weed/server/qa_block_ha_edge_cases_test.go

@ -0,0 +1,205 @@
package weed_server
import (
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)
// ============================================================
// HA Edge Case Tests (from edge-cases-failover.md)
// ============================================================
// --- EC-6: WAL LSN blocks promotion when primary dies under load ---
func TestEC6_WALLagBlocksPromotion(t *testing.T) {
r := NewBlockVolumeRegistry()
r.MarkBlockCapable("primary:9333")
r.MarkBlockCapable("replica:9333")
r.Register(&BlockVolumeEntry{
Name: "ec6-vol",
VolumeServer: "primary:9333",
Path: "/data/ec6-vol.blk",
SizeBytes: 1 << 30,
Epoch: 1,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive,
LeaseTTL: 30 * time.Second,
LastLeaseGrant: time.Now(),
WALHeadLSN: 500,
Replicas: []ReplicaInfo{
{
Server: "replica:9333",
Path: "/data/ec6-vol.blk",
Role: blockvol.RoleToWire(blockvol.RoleReplica),
WALHeadLSN: 350, // 150 behind — exceeds tolerance of 100
HealthScore: 1.0,
LastHeartbeat: time.Now(),
},
},
})
// Primary dies.
r.UnmarkBlockCapable("primary:9333")
// Auto-promote should FAIL.
_, err := r.PromoteBestReplica("ec6-vol")
if err == nil {
t.Log("EC-6 NOT reproduced: promotion succeeded despite 150-entry lag")
return
}
t.Logf("EC-6 REPRODUCED: %v", err)
// Verify rejection is specifically "wal_lag".
pf, _ := r.EvaluatePromotion("ec6-vol")
hasWALLag := false
for _, rej := range pf.Rejections {
t.Logf(" rejected %s: %s", rej.Server, rej.Reason)
if rej.Reason == "wal_lag" {
hasWALLag = true
}
}
if !hasWALLag {
t.Fatal("expected wal_lag rejection")
}
t.Log("EC-6: volume stuck — replica has 70% of data but promotion blocked by stale primary LSN")
}
func TestEC6_ForcePromote_Bypasses(t *testing.T) {
r := NewBlockVolumeRegistry()
r.MarkBlockCapable("primary:9333")
r.MarkBlockCapable("replica:9333")
r.Register(&BlockVolumeEntry{
Name: "ec6-force",
VolumeServer: "primary:9333",
Path: "/data/ec6-force.blk",
SizeBytes: 1 << 30,
Epoch: 1,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive,
LeaseTTL: 30 * time.Second,
LastLeaseGrant: time.Now(),
WALHeadLSN: 500,
Replicas: []ReplicaInfo{
{
Server: "replica:9333",
Path: "/data/ec6-force.blk",
Role: blockvol.RoleToWire(blockvol.RoleReplica),
WALHeadLSN: 350,
HealthScore: 1.0,
LastHeartbeat: time.Now(),
},
},
})
r.UnmarkBlockCapable("primary:9333")
// Force promote bypasses wal_lag gate.
newEpoch, _, _, _, err := r.ManualPromote("ec6-force", "", true)
if err != nil {
t.Fatalf("force promote should succeed: %v", err)
}
entry, _ := r.Lookup("ec6-force")
if entry.VolumeServer != "replica:9333" {
t.Fatalf("primary=%s, want replica:9333", entry.VolumeServer)
}
t.Logf("EC-6 workaround: force promote succeeded — epoch=%d, new primary=%s", newEpoch, entry.VolumeServer)
}
func TestEC6_WithinTolerance_Succeeds(t *testing.T) {
r := NewBlockVolumeRegistry()
r.MarkBlockCapable("primary:9333")
r.MarkBlockCapable("replica:9333")
r.Register(&BlockVolumeEntry{
Name: "ec6-ok",
VolumeServer: "primary:9333",
Path: "/data/ec6-ok.blk",
SizeBytes: 1 << 30,
Epoch: 1,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
Status: StatusActive,
LeaseTTL: 30 * time.Second,
LastLeaseGrant: time.Now(),
WALHeadLSN: 500,
Replicas: []ReplicaInfo{
{
Server: "replica:9333",
Path: "/data/ec6-ok.blk",
Role: blockvol.RoleToWire(blockvol.RoleReplica),
WALHeadLSN: 450, // 50 behind — within tolerance
HealthScore: 1.0,
LastHeartbeat: time.Now(),
},
},
})
r.UnmarkBlockCapable("primary:9333")
_, err := r.PromoteBestReplica("ec6-ok")
if err != nil {
t.Fatalf("within tolerance should succeed: %v", err)
}
t.Log("EC-6 within tolerance: lag=50 (< 100) → promotion succeeded")
}
// --- EC-5: Wrong primary after master restart ---
func TestEC5_FirstHeartbeatWins(t *testing.T) {
r := NewBlockVolumeRegistry()
r.MarkBlockCapable("replica:9333")
r.MarkBlockCapable("primary:9333")
// Replica heartbeats FIRST after master restart.
r.UpdateFullHeartbeat("replica:9333", []*master_pb.BlockVolumeInfoMessage{
{
Path: "/data/ec5-vol.blk",
VolumeSize: 1 << 30,
Epoch: 0, // no assignment yet
Role: 0, // unknown
WalHeadLsn: 100,
},
}, "")
entry1, ok := r.Lookup("ec5-vol")
if !ok {
t.Fatal("volume should be auto-registered from first heartbeat")
}
t.Logf("after replica heartbeat: VolumeServer=%s epoch=%d", entry1.VolumeServer, entry1.Epoch)
firstPrimary := entry1.VolumeServer
// Real primary heartbeats SECOND with higher epoch and WAL.
r.UpdateFullHeartbeat("primary:9333", []*master_pb.BlockVolumeInfoMessage{
{
Path: "/data/ec5-vol.blk",
VolumeSize: 1 << 30,
Epoch: 1,
Role: blockvol.RoleToWire(blockvol.RolePrimary),
WalHeadLsn: 200,
},
}, "")
entry2, _ := r.Lookup("ec5-vol")
t.Logf("after primary heartbeat: VolumeServer=%s epoch=%d WALHeadLSN=%d",
entry2.VolumeServer, entry2.Epoch, entry2.WALHeadLSN)
if entry2.VolumeServer == "replica:9333" && firstPrimary == "replica:9333" {
t.Log("EC-5 REPRODUCED: first-heartbeat-wins race — replica is still primary")
t.Log("Real primary (epoch=1, WAL=200) was ignored or reconciled incorrectly")
} else if entry2.VolumeServer == "primary:9333" {
t.Log("EC-5 mitigated: reconcileOnRestart corrected the primary via epoch/WAL comparison")
} else {
t.Logf("EC-5 unknown: VolumeServer=%s", entry2.VolumeServer)
}
}

313
weed/storage/blockvol/testrunner/actions/recovery.go

@ -16,6 +16,8 @@ import (
func RegisterRecoveryActions(r *tr.Registry) {
r.RegisterFunc("measure_recovery", tr.TierBlock, measureRecovery)
r.RegisterFunc("validate_recovery_regression", tr.TierBlock, validateRecoveryRegression)
r.RegisterFunc("poll_shipper_state", tr.TierBlock, pollShipperState)
r.RegisterFunc("measure_rebuild", tr.TierBlock, measureRebuild)
}
// RecoveryProfile captures the full recovery profile from fault to InSync.
@ -268,6 +270,317 @@ func profileToVars(p RecoveryProfile) map[string]string {
return vars
}
// ShipperStateInfo mirrors the /debug/block/shipper JSON response.
type ShipperStateInfo struct {
Path string `json:"path"`
Role string `json:"role"`
Epoch uint64 `json:"epoch"`
HeadLSN uint64 `json:"head_lsn"`
Degraded bool `json:"degraded"`
Shippers []ShipperReplicaInfo `json:"shippers"`
Timestamp string `json:"timestamp"`
}
// ShipperReplicaInfo is one shipper's state from the debug endpoint.
type ShipperReplicaInfo struct {
DataAddr string `json:"data_addr"`
State string `json:"state"`
FlushedLSN uint64 `json:"flushed_lsn"`
}
// pollShipperState polls the VS's /debug/block/shipper endpoint for
// real-time shipper state. Unlike master-reported replica_degraded
// (heartbeat-lagged), this reads directly from the shipper's atomic
// state field — zero delay.
//
// Params:
// - host: VS host (required, or from var)
// - port: VS HTTP port (required, or from var)
// - timeout: max wait (default: 60s)
// - poll_interval: polling interval (default: 1s)
// - expected_state: wait until shipper reaches this state (e.g., "in_sync", "degraded")
// If empty, returns immediately with current state.
// - volume_path: filter to specific volume (optional, matches path substring)
//
// save_as outputs:
// - {save_as}_state: current shipper state (e.g., "in_sync", "degraded", "disconnected")
// - {save_as}_head_lsn: WAL head LSN
// - {save_as}_degraded: true/false
// - {save_as}_flushed_lsn: replica's flushed LSN
// - {save_as}_duration_ms: time to reach expected state (if waiting)
// - {save_as}_json: full response JSON
func pollShipperState(ctx context.Context, actx *tr.ActionContext, act tr.Action) (map[string]string, error) {
host := act.Params["host"]
if host == "" {
host = actx.Vars["vs_host"]
}
port := act.Params["port"]
if port == "" {
port = actx.Vars["vs_port"]
}
if host == "" || port == "" {
return nil, fmt.Errorf("poll_shipper_state: host and port params required")
}
node, err := GetNode(actx, act.Node)
if err != nil {
return nil, fmt.Errorf("poll_shipper_state: %w", err)
}
expectedState := act.Params["expected_state"]
volumePath := act.Params["volume_path"]
timeoutStr := paramDefault(act.Params, "timeout", "60s")
timeout, err := time.ParseDuration(timeoutStr)
if err != nil {
return nil, fmt.Errorf("poll_shipper_state: invalid timeout: %w", err)
}
intervalStr := paramDefault(act.Params, "poll_interval", "1s")
interval, err := time.ParseDuration(intervalStr)
if err != nil {
return nil, fmt.Errorf("poll_shipper_state: invalid poll_interval: %w", err)
}
start := time.Now()
deadline := time.After(timeout)
ticker := time.NewTicker(interval)
defer ticker.Stop()
for {
// Query the debug endpoint via SSH curl.
cmd := fmt.Sprintf("curl -s http://%s:%s/debug/block/shipper 2>&1", host, port)
stdout, _, code, err := node.Run(ctx, cmd)
if err != nil || code != 0 {
if expectedState == "" {
return nil, fmt.Errorf("poll_shipper_state: curl failed: code=%d err=%v", code, err)
}
// Waiting mode: VS might not be up yet, keep polling.
select {
case <-deadline:
return nil, fmt.Errorf("poll_shipper_state: timeout waiting for %s (VS unreachable)", expectedState)
case <-ticker.C:
continue
case <-ctx.Done():
return nil, ctx.Err()
}
}
var infos []ShipperStateInfo
if err := json.Unmarshal([]byte(strings.TrimSpace(stdout)), &infos); err != nil {
if expectedState == "" {
return nil, fmt.Errorf("poll_shipper_state: parse JSON: %w", err)
}
select {
case <-deadline:
return nil, fmt.Errorf("poll_shipper_state: timeout (bad JSON)")
case <-ticker.C:
continue
case <-ctx.Done():
return nil, ctx.Err()
}
}
// Find the matching volume.
var target *ShipperStateInfo
for i := range infos {
if volumePath == "" || strings.Contains(infos[i].Path, volumePath) {
target = &infos[i]
break
}
}
if target == nil {
if expectedState == "" {
return map[string]string{
"state": "no_volume",
"json": stdout,
}, nil
}
select {
case <-deadline:
return nil, fmt.Errorf("poll_shipper_state: timeout (volume not found)")
case <-ticker.C:
continue
case <-ctx.Done():
return nil, ctx.Err()
}
}
// Extract shipper state.
shipperState := "no_shippers"
var flushedLSN uint64
if len(target.Shippers) > 0 {
shipperState = target.Shippers[0].State
flushedLSN = target.Shippers[0].FlushedLSN
}
vars := map[string]string{
"state": shipperState,
"head_lsn": strconv.FormatUint(target.HeadLSN, 10),
"degraded": strconv.FormatBool(target.Degraded),
"flushed_lsn": strconv.FormatUint(flushedLSN, 10),
"duration_ms": strconv.FormatInt(time.Since(start).Milliseconds(), 10),
"json": stdout,
}
// If not waiting for a specific state, return immediately.
if expectedState == "" {
return vars, nil
}
// Check if expected state reached.
if shipperState == expectedState {
actx.Log(" poll_shipper_state: %s reached after %dms",
expectedState, time.Since(start).Milliseconds())
return vars, nil
}
actx.Log(" poll_shipper_state: state=%s (waiting for %s)", shipperState, expectedState)
select {
case <-deadline:
return nil, fmt.Errorf("poll_shipper_state: timeout waiting for %s (last=%s)",
expectedState, shipperState)
case <-ticker.C:
continue
case <-ctx.Done():
return nil, ctx.Err()
}
}
}
// RebuildProfile captures the full rebuild measurement.
type RebuildProfile struct {
RebuildDurationMs int64 `json:"rebuild_duration_ms"`
SourceType string `json:"source_type"` // full_base, snapshot, resync, unknown
SourceReason string `json:"source_reason"` // why this source (from logs or N/A)
FallbackOccurred bool `json:"fallback_occurred"`
DataIntegrity string `json:"data_integrity"` // pass, fail, skipped
RecoveryObservable bool `json:"recovery_observable"`
PostRebuildStableMs int64 `json:"post_rebuild_stable_ms"`
Topology string `json:"topology,omitempty"`
SyncMode string `json:"sync_mode,omitempty"`
CommitID string `json:"commit_id,omitempty"`
}
// measureRebuild measures a full rebuild cycle: from degraded/no-replica
// state to healthy. Polls via LookupVolume (master API).
//
// This is different from measure_recovery: measure_recovery starts from
// a healthy state and waits for recovery after a fault. measure_rebuild
// starts from a state where the replica needs full rebuild (not catch-up).
//
// Params:
// - name: block volume name (required)
// - master_url: master API (or from var)
// - timeout: max wait (default: 300s — rebuilds can be slow)
// - poll_interval: polling interval (default: 2s)
//
// save_as outputs:
// - {save_as}_duration_ms: time to reach healthy
// - {save_as}_source_type: full_base / snapshot / unknown
// - {save_as}_integrity: pass / fail / skipped
// - {save_as}_json: full profile
func measureRebuild(ctx context.Context, actx *tr.ActionContext, act tr.Action) (map[string]string, error) {
client, err := blockAPIClient(actx, act)
if err != nil {
return nil, fmt.Errorf("measure_rebuild: %w", err)
}
name := act.Params["name"]
if name == "" {
name = actx.Vars["volume_name"]
}
if name == "" {
return nil, fmt.Errorf("measure_rebuild: name param required")
}
timeoutStr := paramDefault(act.Params, "timeout", "300s")
timeout, err := time.ParseDuration(timeoutStr)
if err != nil {
return nil, fmt.Errorf("measure_rebuild: invalid timeout: %w", err)
}
intervalStr := paramDefault(act.Params, "poll_interval", "2s")
interval, err := time.ParseDuration(intervalStr)
if err != nil {
return nil, fmt.Errorf("measure_rebuild: invalid poll_interval: %w", err)
}
profile := RebuildProfile{
SourceType: "unknown",
DataIntegrity: "skipped",
Topology: actx.Vars["__topology"],
SyncMode: actx.Vars["__sync_mode"],
CommitID: actx.Vars["__git_sha"],
}
start := time.Now()
deadline := time.After(timeout)
ticker := time.NewTicker(interval)
defer ticker.Stop()
poll := 0
var lastState string
for {
select {
case <-deadline:
profile.RebuildDurationMs = time.Since(start).Milliseconds()
actx.Log(" measure_rebuild: TIMEOUT after %dms (%d polls, last=%s)",
profile.RebuildDurationMs, poll, lastState)
return nil, fmt.Errorf("measure_rebuild: %q not healthy after %s (%d polls, last=%s)",
name, timeout, poll, lastState)
case <-ctx.Done():
return nil, fmt.Errorf("measure_rebuild: context cancelled")
case <-ticker.C:
poll++
info, err := client.LookupVolume(ctx, name)
if err != nil {
lastState = "unreachable"
actx.Log(" rebuild poll %d: unreachable", poll)
continue
}
currentState := classifyVolumeState(info)
if currentState != lastState {
actx.Log(" rebuild poll %d (%dms): %s → %s",
poll, time.Since(start).Milliseconds(), lastState, currentState)
}
lastState = currentState
// Try to detect rebuild source from status field.
status := strings.ToLower(info.Status)
if strings.Contains(status, "rebuild") {
if profile.SourceType == "unknown" {
profile.SourceType = "full_base" // default assumption
profile.RecoveryObservable = true
}
}
if currentState == "healthy" {
profile.RebuildDurationMs = time.Since(start).Milliseconds()
actx.Log(" measure_rebuild: healthy after %dms (%d polls) source=%s",
profile.RebuildDurationMs, poll, profile.SourceType)
vars := map[string]string{
"duration_ms": strconv.FormatInt(profile.RebuildDurationMs, 10),
"source_type": profile.SourceType,
"integrity": profile.DataIntegrity,
"polls": strconv.Itoa(poll),
}
jsonBytes, _ := json.Marshal(profile)
vars["json"] = string(jsonBytes)
return vars, nil
}
}
}
}
// validateRecoveryRegression checks a recovery profile against baseline expectations.
//
// Params:

72
weed/storage/blockvol/testrunner/actions/recovery_test.go

@ -130,3 +130,75 @@ func TestClassifyPath_RebuildPrecedence(t *testing.T) {
t.Fatalf("both catch-up and rebuild → %q, want rebuild", got)
}
}
func TestShipperStateInfo_ParseJSON(t *testing.T) {
raw := `[{"path":"/tmp/blocks/vol1.blk","role":"primary","epoch":3,"head_lsn":150,"degraded":true,"shippers":[{"data_addr":"10.0.0.2:4295","state":"degraded","flushed_lsn":120}],"timestamp":"2026-03-31T00:00:00Z"}]`
var infos []ShipperStateInfo
if err := json.Unmarshal([]byte(raw), &infos); err != nil {
t.Fatalf("parse: %v", err)
}
if len(infos) != 1 {
t.Fatalf("count=%d", len(infos))
}
info := infos[0]
if info.Role != "primary" {
t.Fatalf("role=%s", info.Role)
}
if info.HeadLSN != 150 {
t.Fatalf("head_lsn=%d", info.HeadLSN)
}
if !info.Degraded {
t.Fatal("should be degraded")
}
if len(info.Shippers) != 1 {
t.Fatalf("shippers=%d", len(info.Shippers))
}
if info.Shippers[0].State != "degraded" {
t.Fatalf("shipper state=%s", info.Shippers[0].State)
}
if info.Shippers[0].FlushedLSN != 120 {
t.Fatalf("flushed_lsn=%d", info.Shippers[0].FlushedLSN)
}
}
func TestRebuildProfile_JSON(t *testing.T) {
p := RebuildProfile{
RebuildDurationMs: 45000,
SourceType: "full_base",
SourceReason: "untrusted_checkpoint",
DataIntegrity: "pass",
RecoveryObservable: true,
}
data, err := json.Marshal(p)
if err != nil {
t.Fatal(err)
}
var decoded RebuildProfile
if err := json.Unmarshal(data, &decoded); err != nil {
t.Fatal(err)
}
if decoded.RebuildDurationMs != 45000 {
t.Fatalf("duration=%d", decoded.RebuildDurationMs)
}
if decoded.SourceType != "full_base" {
t.Fatalf("source=%s", decoded.SourceType)
}
if decoded.SourceReason != "untrusted_checkpoint" {
t.Fatalf("reason=%s", decoded.SourceReason)
}
}
func TestShipperStateInfo_NoShippers(t *testing.T) {
raw := `[{"path":"/tmp/blocks/vol1.blk","role":"primary","epoch":1,"head_lsn":0,"degraded":false,"timestamp":"2026-03-31T00:00:00Z"}]`
var infos []ShipperStateInfo
if err := json.Unmarshal([]byte(raw), &infos); err != nil {
t.Fatal(err)
}
if len(infos[0].Shippers) != 0 {
t.Fatalf("should have 0 shippers, got %d", len(infos[0].Shippers))
}
}

295
weed/storage/blockvol/testrunner/scenarios/internal/baseline-full-roce.yaml

@ -0,0 +1,295 @@
name: baseline-full-roce
timeout: 20m
# Full baseline capture on 25Gbps RoCE.
# Captures: write IOPS, read IOPS, mixed 70/30, QD sweep.
# Results printed as structured table for baseline storage.
env:
master_url: "http://10.0.0.3:9433"
volume_name: baseline
vol_size: "2147483648"
topology:
nodes:
m01:
host: 192.168.1.181
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
user: testdev
key: "/opt/work/testdev_key"
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-bl-master /tmp/sw-bl-vs1 && mkdir -p /tmp/sw-bl-master /tmp/sw-bl-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-bl-vs2 && mkdir -p /tmp/sw-bl-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-bl-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
duration: 3s
- action: start_weed_volume
node: m02
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-bl-vs1
extra_args: "-block.dir=/tmp/sw-bl-vs1/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.3:4420 -ip=10.0.0.3"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-bl-vs2
extra_args: "-block.dir=/tmp/sw-bl-vs2/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.1:4420 -ip=10.0.0.1"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
- name: create-volume
actions:
- action: create_block_volume
name: "{{ volume_name }}"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
- action: validate_replication
volume_name: "{{ volume_name }}"
expected_rf: "2"
expected_durability: "sync_all"
require_not_degraded: "true"
require_cross_machine: "true"
- name: connect-nvme
actions:
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
- action: exec
node: m01
cmd: "sh -c 'modprobe nvme_tcp; nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ vol_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.{{ volume_name }} >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
root: "true"
save_as: dev
- action: print
msg: "NVMe device: {{ dev }}"
# === Write IOPS at different queue depths ===
- name: write-qd1
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "1"
runtime: "15"
time_based: "true"
name: w-qd1
save_as: fio_w1
- action: fio_parse
json_var: fio_w1
metric: iops
save_as: w_qd1
- action: print
msg: "Write QD1: {{ w_qd1 }} IOPS"
- name: write-qd8
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "8"
runtime: "15"
time_based: "true"
name: w-qd8
save_as: fio_w8
- action: fio_parse
json_var: fio_w8
metric: iops
save_as: w_qd8
- action: print
msg: "Write QD8: {{ w_qd8 }} IOPS"
- name: write-qd32
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "15"
time_based: "true"
name: w-qd32
save_as: fio_w32
- action: fio_parse
json_var: fio_w32
metric: iops
save_as: w_qd32
- action: print
msg: "Write QD32: {{ w_qd32 }} IOPS"
- name: write-qd128
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "128"
runtime: "15"
time_based: "true"
name: w-qd128
save_as: fio_w128
- action: fio_parse
json_var: fio_w128
metric: iops
save_as: w_qd128
- action: print
msg: "Write QD128: {{ w_qd128 }} IOPS"
# === Read IOPS ===
- name: read-qd32
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randread
bs: 4k
iodepth: "32"
runtime: "15"
time_based: "true"
name: r-qd32
save_as: fio_r32
- action: fio_parse
json_var: fio_r32
metric: iops
direction: read
save_as: r_qd32
- action: print
msg: "Read QD32: {{ r_qd32 }} IOPS"
# === Mixed 70/30 read/write ===
- name: mixed-qd32
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randrw
bs: 4k
iodepth: "32"
runtime: "15"
time_based: "true"
rwmixread: "70"
name: rw-qd32
save_as: fio_rw32
- action: fio_parse
json_var: fio_rw32
metric: iops
save_as: rw_qd32
- action: print
msg: "Mixed 70/30 QD32: {{ rw_qd32 }} IOPS (write component)"
# === Sequential write ===
- name: seq-write
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: write
bs: 1M
iodepth: "8"
runtime: "15"
time_based: "true"
name: seq-w
save_as: fio_sw
- action: fio_parse
json_var: fio_sw
metric: bw
save_as: seq_w_bw
- action: print
msg: "Seq Write 1M: {{ seq_w_bw }} KB/s"
- name: results
actions:
- action: print
msg: "=========================================="
- action: print
msg: "BASELINE: RF=2 sync_all NVMe-TCP 25Gbps RoCE"
- action: print
msg: "=========================================="
- action: print
msg: "Write 4K QD1: {{ w_qd1 }} IOPS"
- action: print
msg: "Write 4K QD8: {{ w_qd8 }} IOPS"
- action: print
msg: "Write 4K QD32: {{ w_qd32 }} IOPS"
- action: print
msg: "Write 4K QD128: {{ w_qd128 }} IOPS"
- action: print
msg: "Read 4K QD32: {{ r_qd32 }} IOPS"
- action: print
msg: "Mixed 70/30 QD32: {{ rw_qd32 }} IOPS"
- action: print
msg: "Seq Write 1M QD8: {{ seq_w_bw }} KB/s"
- action: print
msg: "=========================================="
- action: collect_results
title: "Full Baseline — RF=2 sync_all NVMe-TCP 25Gbps RoCE"
volume_name: "{{ volume_name }}"
- name: cleanup
always: true
actions:
- action: exec
node: m01
cmd: "nvme disconnect-all 2>/dev/null; true"
root: "true"
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true

6
weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-crash.yaml

@ -24,12 +24,14 @@ phases:
actions:
- action: exec
node: m02
cmd: "rm -rf /tmp/sw-rb-master /tmp/sw-rb-vs1 /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-master /tmp/sw-rb-vs1/blocks /tmp/sw-rb-vs2/blocks"
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-rb-master /tmp/sw-rb-vs1 /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-master /tmp/sw-rb-vs1/blocks /tmp/sw-rb-vs2/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "rm -rf /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-vs2/blocks"
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02

152
weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-failover.yaml

@ -1,21 +1,34 @@
name: recovery-baseline-failover
timeout: 10m
# Robust dimension: automatic failover after primary death.
#
# Flow:
# 1. Create RF=2 sync_all volume, record primary
# 2. Write data, disconnect iSCSI
# 3. Kill primary VS (SIGKILL)
# 4. Wait for lease expiry (30s TTL + margin)
# 5. Verify: master auto-promotes replica to primary (no manual promote)
# 6. Reconnect iSCSI to new primary, verify I/O works
#
# This tests the master's automatic failover path via
# evaluatePromotionLocked() in master_block_failover.go.
env:
master_url: "http://192.168.1.184:9433"
master_url: "http://10.0.0.3:9433"
volume_name: rb-failover
vol_size: "1073741824"
__topology: "m02-primary_m01-replica"
__sync_mode: "sync_all"
topology:
nodes:
m01:
host: 192.168.1.181
alt_ips: ["10.0.0.1"]
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
alt_ips: ["10.0.0.3"]
user: testdev
key: "/opt/work/testdev_key"
@ -24,17 +37,20 @@ phases:
actions:
- action: exec
node: m02
cmd: "rm -rf /tmp/sw-rb-master /tmp/sw-rb-vs1 /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-master /tmp/sw-rb-vs1/blocks /tmp/sw-rb-vs2/blocks"
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-fo-master /tmp/sw-fo-vs1 && mkdir -p /tmp/sw-fo-master /tmp/sw-fo-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "rm -rf /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-vs2/blocks"
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-fo-vs2 && mkdir -p /tmp/sw-fo-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-rb-master
dir: /tmp/sw-fo-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
@ -43,17 +59,17 @@ phases:
- action: start_weed_volume
node: m02
port: "18480"
master: "localhost:9433"
dir: /tmp/sw-rb-vs1
extra_args: "-block.dir=/tmp/sw-rb-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
master: "10.0.0.3:9433"
dir: /tmp/sw-fo-vs1
extra_args: "-block.dir=/tmp/sw-fo-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "192.168.1.184:9433"
dir: /tmp/sw-rb-vs2
extra_args: "-block.dir=/tmp/sw-rb-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
master: "10.0.0.3:9433"
dir: /tmp/sw-fo-vs2
extra_args: "-block.dir=/tmp/sw-fo-vs2/blocks -block.listen=:3295 -ip=10.0.0.1"
save_as: vs2_pid
- action: sleep
@ -78,15 +94,30 @@ phases:
name: "{{ volume_name }}"
timeout: 60s
- action: validate_replication
volume_name: "{{ volume_name }}"
expected_rf: "2"
expected_durability: "sync_all"
require_not_degraded: "true"
require_cross_machine: "true"
# Force primary to m02 so we have a known node to kill.
- action: block_promote
name: "{{ volume_name }}"
target_server: "10.0.0.3:18480"
force: "true"
reason: "failover-setup"
- action: sleep
duration: 5s
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
- name: write-data
- name: record-before
actions:
- action: discover_primary
name: "{{ volume_name }}"
save_as: before
- action: print
msg: "Before: primary={{ before }} ({{ before_server }}), replica={{ before_replica_node }}"
# Write data so volume has real state.
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
@ -104,39 +135,94 @@ phases:
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
runtime: "10"
time_based: "true"
name: pre-fault-write
name: pre-write
- action: iscsi_cleanup
node: m01
ignore_error: true
- name: fault-failover
- name: kill-primary
actions:
- action: print
msg: "=== Killing primary on m02 ({{ before_server }}) ==="
- action: exec
node: m02
cmd: "kill -9 {{ vs1_pid }}"
root: "true"
ignore_error: true
- action: measure_recovery
- action: print
msg: "Primary killed. Waiting for lease expiry (45s)..."
# Lease TTL is 30s. Wait 45s for expiry + master failover cycle.
- action: sleep
duration: 45s
- name: verify-auto-failover
actions:
# Master should auto-promote m01 (the surviving replica) to primary.
# Wait for primary to change from m02 to something else.
- action: wait_block_primary
name: "{{ volume_name }}"
not: "{{ before_server }}"
timeout: 60s
save_as: after
- action: discover_primary
name: "{{ volume_name }}"
timeout: 120s
poll_interval: 1s
fault_type: failover
save_as: rp
save_as: new_pri
- action: print
msg: "After auto-failover: primary={{ new_pri }} ({{ new_pri_server }})"
# Verify primary actually changed.
- action: assert_block_field
name: "{{ volume_name }}"
field: volume_server
expected: "10.0.0.1:18480"
- action: print
msg: "AUTO-FAILOVER VERIFIED: {{ before_server }} → {{ new_pri_server }}"
- name: verify
- name: verify-io-after
actions:
# Reconnect iSCSI to the new primary and verify I/O.
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol_after
save_as: vol2
- action: iscsi_login_direct
node: m01
host: "{{ vol2_iscsi_host }}"
port: "{{ vol2_iscsi_port }}"
iqn: "{{ vol2_iqn }}"
save_as: device2
- action: fio_json
node: m01
device: "{{ device2 }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "10"
time_based: "true"
name: post-failover-write
save_as: fio_after
- action: collect_results
title: "Recovery Baseline: Failover"
volume_name: "{{ volume_name }}"
recovery_profile: rp
- action: fio_parse
json_var: fio_after
metric: iops
save_as: iops_after
- action: print
msg: "Post-failover write IOPS: {{ iops_after }}"
- action: iscsi_cleanup
node: m01
ignore_error: true
- name: cleanup
always: true

145
weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-partition.yaml

@ -1,21 +1,30 @@
name: recovery-baseline-partition
timeout: 10m
# Robust dimension: replication recovery after network partition heal.
#
# Flow:
# 1. Create RF=2 sync_all, write data
# 2. Block replication ports (iptables DROP) → shipper degrades
# 3. Remove iptables rules → network heals
# 4. Write continuously to trigger barrier → shipper reconnects
# 5. Verify replication healthy again
env:
master_url: "http://192.168.1.184:9433"
master_url: "http://10.0.0.3:9433"
volume_name: rb-partition
vol_size: "1073741824"
__topology: "m02-primary_m01-replica"
__sync_mode: "sync_all"
topology:
nodes:
m01:
host: 192.168.1.181
alt_ips: ["10.0.0.1"]
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
alt_ips: ["10.0.0.3"]
user: testdev
key: "/opt/work/testdev_key"
@ -24,17 +33,20 @@ phases:
actions:
- action: exec
node: m02
cmd: "rm -rf /tmp/sw-rb-master /tmp/sw-rb-vs1 /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-master /tmp/sw-rb-vs1/blocks /tmp/sw-rb-vs2/blocks"
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-part-master /tmp/sw-part-vs1 && mkdir -p /tmp/sw-part-master /tmp/sw-part-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "rm -rf /tmp/sw-rb-vs2 && mkdir -p /tmp/sw-rb-vs2/blocks"
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-part-vs2 && mkdir -p /tmp/sw-part-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-rb-master
dir: /tmp/sw-part-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
@ -43,17 +55,17 @@ phases:
- action: start_weed_volume
node: m02
port: "18480"
master: "localhost:9433"
dir: /tmp/sw-rb-vs1
extra_args: "-block.dir=/tmp/sw-rb-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
master: "10.0.0.3:9433"
dir: /tmp/sw-part-vs1
extra_args: "-block.dir=/tmp/sw-part-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "192.168.1.184:9433"
dir: /tmp/sw-rb-vs2
extra_args: "-block.dir=/tmp/sw-rb-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
master: "10.0.0.3:9433"
dir: /tmp/sw-part-vs2
extra_args: "-block.dir=/tmp/sw-part-vs2/blocks -block.listen=:3295 -ip=10.0.0.1"
save_as: vs2_pid
- action: sleep
@ -78,19 +90,29 @@ phases:
name: "{{ volume_name }}"
timeout: 60s
- action: validate_replication
volume_name: "{{ volume_name }}"
expected_rf: "2"
expected_durability: "sync_all"
require_not_degraded: "true"
require_cross_machine: "true"
# Force primary to m02 for deterministic port targeting.
- action: block_promote
name: "{{ volume_name }}"
target_server: "10.0.0.3:18480"
force: "true"
reason: "partition-setup"
- action: sleep
duration: 5s
- name: write-data
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
- name: connect-and-write
actions:
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
- action: print
msg: "repl_data={{ vol_replica_data_addr }} repl_port={{ vol_replica_data_port }}"
- action: iscsi_login_direct
node: m01
host: "{{ vol_iscsi_host }}"
@ -104,38 +126,88 @@ phases:
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
runtime: "10"
time_based: "true"
name: pre-fault-write
- action: print
msg: "Pre-fault write complete, replication healthy"
- name: fault-partition
actions:
- action: inject_partition
- action: print
msg: "=== Blocking replication ports ==="
# Block replication data + ctrl ports between primary (m02) and replica (m01).
# Use management IPs since replication goes over 192.168.1.x network.
- action: exec
node: m02
target_ip: "192.168.1.181"
ports: "18480,3295"
cmd: "iptables -A OUTPUT -d 192.168.1.181 -p tcp --dport {{ vol_replica_data_port }} -j DROP; iptables -A OUTPUT -d 192.168.1.181 -p tcp --dport {{ vol_replica_ctrl_port }} -j DROP; iptables -A INPUT -s 192.168.1.181 -p tcp --sport {{ vol_replica_data_port }} -j DROP"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "iptables -A OUTPUT -d 192.168.1.184 -p tcp --dport {{ vol_replica_data_port }} -j DROP; iptables -A OUTPUT -d 192.168.1.184 -p tcp --dport {{ vol_replica_ctrl_port }} -j DROP; iptables -A INPUT -s 192.168.1.184 -p tcp --sport {{ vol_replica_data_port }} -j DROP"
root: "true"
ignore_error: true
# Write during partition — forces barrier to fail, shipper goes degraded.
- action: exec
node: m01
cmd: "dd if=/dev/urandom of={{ device }} bs=4k count=100 oflag=direct 2>/dev/null; true"
root: "true"
ignore_error: true
- action: sleep
duration: 10s
- action: clear_fault
- action: print
msg: "Partition active for 10s, shipper should be degraded"
- name: heal-and-recover
actions:
- action: print
msg: "=== Healing partition ==="
# Remove iptables rules.
- action: exec
node: m02
type: partition
cmd: "iptables -F OUTPUT 2>/dev/null; iptables -F INPUT 2>/dev/null; true"
root: "true"
- action: exec
node: m01
cmd: "iptables -F OUTPUT 2>/dev/null; iptables -F INPUT 2>/dev/null; true"
root: "true"
- action: print
msg: "Partition healed. Writing to trigger barrier reconnect..."
# Write continuously to trigger barrier cycles → shipper reconnect.
# The shipper only reconnects during a barrier (sync_all).
# Multiple writes ensure multiple barrier cycles fire.
- action: exec
node: m01
cmd: "for i in $(seq 1 30); do dd if=/dev/urandom of={{ device }} bs=4k count=10 oflag=direct 2>/dev/null; sleep 1; done"
root: "true"
ignore_error: true
- action: measure_recovery
name: "{{ volume_name }}"
timeout: 120s
poll_interval: 1s
poll_interval: 2s
fault_type: partition
save_as: rp
- name: verify
actions:
- action: validate_replication
volume_name: "{{ volume_name }}"
expected_rf: "2"
expected_durability: "sync_all"
require_not_degraded: "true"
- action: assert_block_field
name: "{{ volume_name }}"
field: replica_degraded
expected: "false"
- action: print
msg: "Partition recovery complete"
- action: collect_results
title: "Recovery Baseline: Partition"
@ -145,9 +217,16 @@ phases:
- name: cleanup
always: true
actions:
- action: clear_fault
# Always clear iptables.
- action: exec
node: m02
type: partition
cmd: "iptables -F OUTPUT 2>/dev/null; iptables -F INPUT 2>/dev/null; true"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "iptables -F OUTPUT 2>/dev/null; iptables -F INPUT 2>/dev/null; true"
root: "true"
ignore_error: true
- action: iscsi_cleanup
node: m01

240
weed/storage/blockvol/testrunner/scenarios/internal/robust-shipper-lifecycle.yaml

@ -0,0 +1,240 @@
name: robust-shipper-lifecycle
timeout: 5m
# Robust dimension: Shipper degradation → recovery lifecycle.
# Uses /debug/block/shipper for real-time state (no heartbeat lag).
# Uses exact replication ports from lookup_block_volume (not hardcoded range).
env:
master_url: "http://192.168.1.184:9433"
volume_name: robust-ship
vol_size: "1073741824"
topology:
nodes:
m01:
host: 192.168.1.181
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
user: testdev
key: "/opt/work/testdev_key"
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-ship-master /tmp/sw-ship-vs1 && mkdir -p /tmp/sw-ship-master /tmp/sw-ship-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-ship-vs2 && mkdir -p /tmp/sw-ship-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-ship-master
save_as: master_pid
- action: sleep
duration: 3s
- action: start_weed_volume
node: m02
port: "18480"
master: "localhost:9433"
dir: /tmp/sw-ship-vs1
extra_args: "-block.dir=/tmp/sw-ship-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "192.168.1.184:9433"
dir: /tmp/sw-ship-vs2
extra_args: "-block.dir=/tmp/sw-ship-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
- name: create-and-write
actions:
- action: create_block_volume
name: "{{ volume_name }}"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
# Lookup to get exact replication ports + primary host.
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
- action: print
msg: "primary={{ vol_primary_host }} repl_data={{ vol_replica_data_addr }} repl_port={{ vol_replica_data_port }}"
- action: iscsi_login_direct
node: m01
host: "{{ vol_iscsi_host }}"
port: "{{ vol_iscsi_port }}"
iqn: "{{ vol_iqn }}"
save_as: device
# Wait for master assignment to set up shipper (heartbeat cycle ≈ 25s).
- action: sleep
duration: 30s
# Write to trigger shipper connection (sync_all → barrier).
- action: exec
node: m01
cmd: "dd if=/dev/urandom of={{ device }} bs=4k count=10 oflag=direct,sync 2>/dev/null"
root: "true"
ignore_error: true
- action: sleep
duration: 5s
- name: verify-in-sync
actions:
# Instant shipper state check.
- action: poll_shipper_state
node: m01
host: "{{ vol_primary_host }}"
port: "18480"
save_as: ship_initial
ignore_error: true
- action: print
msg: "=== Initial: state={{ ship_initial_state }} flushed={{ ship_initial_flushed_lsn }} ==="
- name: block-repl-ports
actions:
- action: print
msg: "=== Blocking replication port {{ vol_replica_data_port }} ==="
# Block exact replication data + ctrl ports from primary to replica.
- action: exec
node: m01
cmd: "iptables -A OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset; iptables -A OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset"
root: "true"
ignore_error: true
- action: exec
node: m02
cmd: "iptables -A OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset; iptables -A OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset"
root: "true"
ignore_error: true
# Write to trigger Ship() failure → degradation.
- action: exec
node: m01
cmd: "dd if=/dev/urandom of={{ device }} bs=4k count=10 oflag=direct,sync 2>/dev/null"
root: "true"
ignore_error: true
- action: sleep
duration: 3s
# Poll — should be degraded now.
- action: poll_shipper_state
node: m01
host: "{{ vol_primary_host }}"
port: "18480"
expected_state: degraded
timeout: 10s
save_as: ship_degraded
- action: print
msg: "=== Degraded: state={{ ship_degraded_state }} flushed={{ ship_degraded_flushed_lsn }} ==="
- name: unblock-and-recover
actions:
- action: print
msg: "=== Unblocking replication ports ==="
- action: exec
node: m01
cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
root: "true"
- action: exec
node: m02
cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
root: "true"
# Write with sync to trigger barrier → reconnect.
- action: exec
node: m01
cmd: "dd if=/dev/urandom of={{ device }} bs=4k count=10 oflag=direct,sync 2>/dev/null"
root: "true"
ignore_error: true
- action: sleep
duration: 5s
# Poll — should recover via barrier-triggered reconnect.
- action: poll_shipper_state
node: m01
host: "{{ vol_primary_host }}"
port: "18480"
expected_state: in_sync
timeout: 30s
save_as: ship_recovered
- action: print
msg: "=== Recovered: state={{ ship_recovered_state }} flushed={{ ship_recovered_flushed_lsn }} duration={{ ship_recovered_duration_ms }}ms ==="
- name: results
actions:
- action: print
msg: "=== Shipper Lifecycle ==="
- action: print
msg: "Initial: {{ ship_initial_state }} (flushed={{ ship_initial_flushed_lsn }})"
- action: print
msg: "Degraded: {{ ship_degraded_state }} (flushed={{ ship_degraded_flushed_lsn }})"
- action: print
msg: "Recovered: {{ ship_recovered_state }} (flushed={{ ship_recovered_flushed_lsn }}, {{ ship_recovered_duration_ms }}ms)"
- name: cleanup
always: true
actions:
- action: exec
node: m01
cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
root: "true"
ignore_error: true
- action: exec
node: m02
cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
root: "true"
ignore_error: true
- action: iscsi_cleanup
node: m01
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true

236
weed/storage/blockvol/testrunner/scenarios/internal/stable-degraded-sync-quorum.yaml

@ -0,0 +1,236 @@
name: stable-degraded-sync-quorum
timeout: 12m
# Stable dimension: RF=3 sync_quorum with one replica down.
#
# sync_quorum requires quorum (RF/2+1 = 2) of 3 nodes durable.
# With one replica dead, 2 nodes remain — quorum met, writes succeed.
# Expected: near-zero IOPS penalty (unlike sync_all which drops 66%).
#
# Comparison:
# sync_all degraded: 27,863 → 9,442 IOPS (−66%)
# best_effort degraded: 35,544 → 37,653 IOPS (+6%)
# sync_quorum degraded: ? → ? (this test)
env:
master_url: "http://10.0.0.3:9433"
volume_name: stable-quorum
vol_size: "1073741824"
topology:
nodes:
m01:
host: 192.168.1.181
alt_ips: ["10.0.0.1"]
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
alt_ips: ["10.0.0.3"]
user: testdev
key: "/opt/work/testdev_key"
# NOTE: RF=3 requires 3 volume servers. With only 2 physical nodes,
# we start 2 VS on m02 (ports 18480 and 18481) and 1 on m01.
# This tests the quorum logic, not the physical redundancy.
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 18481/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-quorum-master /tmp/sw-quorum-vs1 /tmp/sw-quorum-vs3 && mkdir -p /tmp/sw-quorum-master /tmp/sw-quorum-vs1/blocks /tmp/sw-quorum-vs3/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-quorum-vs2 && mkdir -p /tmp/sw-quorum-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-quorum-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
duration: 3s
- action: start_weed_volume
node: m02
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-quorum-vs1
extra_args: "-block.dir=/tmp/sw-quorum-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-quorum-vs2
extra_args: "-block.dir=/tmp/sw-quorum-vs2/blocks -block.listen=:3296 -ip=10.0.0.1"
save_as: vs2_pid
# Third VS on m02, different port.
- action: start_weed_volume
node: m02
port: "18481"
master: "10.0.0.3:9433"
dir: /tmp/sw-quorum-vs3
extra_args: "-block.dir=/tmp/sw-quorum-vs3/blocks -block.listen=:3297 -ip=10.0.0.3"
save_as: vs3_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "3"
timeout: 30s
- name: create-volume
actions:
- action: create_block_volume
name: "{{ volume_name }}"
size_bytes: "{{ vol_size }}"
replica_factor: "3"
durability_mode: "sync_quorum"
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
# Force primary to m02:18480 for deterministic setup.
- action: block_promote
name: "{{ volume_name }}"
target_server: "10.0.0.3:18480"
force: "true"
reason: "quorum-setup"
- action: sleep
duration: 5s
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
- action: print
msg: "RF=3 sync_quorum volume created, primary on m02"
- name: connect
actions:
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
- action: iscsi_login_direct
node: m01
host: "{{ vol_iscsi_host }}"
port: "{{ vol_iscsi_port }}"
iqn: "{{ vol_iqn }}"
save_as: device
- name: baseline-healthy
actions:
- action: print
msg: "=== Baseline: RF=3 sync_quorum, all 3 replicas healthy ==="
- action: fio_json
node: m01
device: "{{ device }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: baseline-healthy
save_as: fio_healthy
- action: fio_parse
json_var: fio_healthy
metric: iops
save_as: iops_healthy
- action: print
msg: "Healthy RF=3: {{ iops_healthy }} IOPS"
- name: kill-one-replica
actions:
- action: print
msg: "=== Killing one replica (m01 VS) — quorum still met ==="
- action: exec
node: m01
cmd: "kill -9 {{ vs2_pid }}"
root: "true"
ignore_error: true
# Wait for shipper to detect dead replica.
- action: sleep
duration: 10s
- name: degraded-one-dead
actions:
- action: print
msg: "=== Degraded: 1 of 2 replicas dead, quorum=2 still met ==="
- action: fio_json
node: m01
device: "{{ device }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: degraded-one
save_as: fio_deg1
- action: fio_parse
json_var: fio_deg1
metric: iops
save_as: iops_deg1
- action: print
msg: "Degraded (1 dead): {{ iops_deg1 }} IOPS"
- name: results
actions:
- action: print
msg: "=== Stable: Degraded Mode (sync_quorum RF=3, iSCSI/RoCE) ==="
- action: print
msg: "Healthy (3/3): {{ iops_healthy }} IOPS"
- action: print
msg: "Degraded (2/3): {{ iops_deg1 }} IOPS"
- action: collect_results
title: "Stable: Degraded Mode (sync_quorum RF=3)"
volume_name: "{{ volume_name }}"
- name: cleanup
always: true
actions:
- action: iscsi_cleanup
node: m01
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs3_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true

244
weed/storage/blockvol/testrunner/scenarios/internal/stable-netem-sweep-roce.yaml

@ -0,0 +1,244 @@
name: stable-netem-sweep-roce
timeout: 15m
# Stable dimension on 25Gbps RoCE: write IOPS under latency injection.
# Uses NVMe-TCP for client path, netem on replication link.
env:
master_url: "http://10.0.0.3:9433"
volume_name: stable-roce
vol_size: "2147483648"
topology:
nodes:
m01:
host: 192.168.1.181
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
user: testdev
key: "/opt/work/testdev_key"
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nroce-master /tmp/sw-nroce-vs1 && mkdir -p /tmp/sw-nroce-master /tmp/sw-nroce-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nroce-vs2 && mkdir -p /tmp/sw-nroce-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-nroce-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
duration: 3s
- action: start_weed_volume
node: m02
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-nroce-vs1
extra_args: "-block.dir=/tmp/sw-nroce-vs1/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.3:4420 -ip=10.0.0.3"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-nroce-vs2
extra_args: "-block.dir=/tmp/sw-nroce-vs2/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.1:4420 -ip=10.0.0.1"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
- name: create-volume
actions:
- action: create_block_volume
name: "{{ volume_name }}"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
- name: connect-nvme
actions:
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
- action: exec
node: m01
cmd: "sh -c 'modprobe nvme_tcp; nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ vol_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.{{ volume_name }} >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
root: "true"
save_as: dev
# Baseline
- name: baseline
actions:
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: baseline
save_as: fio_0
- action: fio_parse
json_var: fio_0
metric: iops
save_as: iops_0
- action: print
msg: "0ms: {{ iops_0 }} IOPS"
# 1ms
- name: netem-1ms
actions:
- action: inject_netem
node: m02
target_ip: "10.0.0.1"
delay_ms: "1"
- action: sleep
duration: 2s
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: netem-1ms
save_as: fio_1
- action: fio_parse
json_var: fio_1
metric: iops
save_as: iops_1
- action: print
msg: "1ms: {{ iops_1 }} IOPS"
- action: clear_fault
node: m02
type: netem
- action: sleep
duration: 2s
# 5ms
- name: netem-5ms
actions:
- action: inject_netem
node: m02
target_ip: "10.0.0.1"
delay_ms: "5"
- action: sleep
duration: 2s
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: netem-5ms
save_as: fio_5
- action: fio_parse
json_var: fio_5
metric: iops
save_as: iops_5
- action: print
msg: "5ms: {{ iops_5 }} IOPS"
- action: clear_fault
node: m02
type: netem
- action: sleep
duration: 2s
# 20ms
- name: netem-20ms
actions:
- action: inject_netem
node: m02
target_ip: "10.0.0.1"
delay_ms: "20"
- action: sleep
duration: 2s
- action: fio_json
node: m01
device: "{{ dev }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: netem-20ms
save_as: fio_20
- action: fio_parse
json_var: fio_20
metric: iops
save_as: iops_20
- action: print
msg: "20ms: {{ iops_20 }} IOPS"
- action: clear_fault
node: m02
type: netem
- name: results
actions:
- action: print
msg: "=== Stable: Netem Sweep NVMe/RoCE (sync_all RF=2) ==="
- action: print
msg: "0ms: {{ iops_0 }} IOPS"
- action: print
msg: "1ms: {{ iops_1 }} IOPS"
- action: print
msg: "5ms: {{ iops_5 }} IOPS"
- action: print
msg: "20ms: {{ iops_20 }} IOPS"
- name: cleanup
always: true
actions:
- action: clear_fault
node: m02
type: netem
ignore_error: true
- action: exec
node: m01
cmd: "nvme disconnect-all 2>/dev/null; true"
root: "true"
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true

291
weed/storage/blockvol/testrunner/scenarios/internal/stable-packet-loss.yaml

@ -0,0 +1,291 @@
name: stable-packet-loss
timeout: 15m
# Stable dimension: measure write IOPS under increasing packet loss
# on the replication link. Uses netem loss (not delay).
#
# Loss levels: 0% (baseline from netem-sweep), 0.1%, 1%, 5%
# Workload: 4K random write, QD16, 30s per level
env:
master_url: "http://192.168.1.184:9433"
volume_name: stable-loss
vol_size: "1073741824"
topology:
nodes:
m01:
host: 192.168.1.181
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
user: testdev
key: "/opt/work/testdev_key"
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-loss-master /tmp/sw-loss-vs1 && mkdir -p /tmp/sw-loss-master /tmp/sw-loss-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-loss-vs2 && mkdir -p /tmp/sw-loss-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-loss-master
save_as: master_pid
- action: sleep
duration: 3s
- action: start_weed_volume
node: m02
port: "18480"
master: "localhost:9433"
dir: /tmp/sw-loss-vs1
extra_args: "-block.dir=/tmp/sw-loss-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "192.168.1.184:9433"
dir: /tmp/sw-loss-vs2
extra_args: "-block.dir=/tmp/sw-loss-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
- name: create-volume
actions:
- action: create_block_volume
name: "{{ volume_name }}"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
- action: validate_replication
volume_name: "{{ volume_name }}"
expected_rf: "2"
expected_durability: "sync_all"
require_not_degraded: "true"
require_cross_machine: "true"
- name: connect
actions:
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
- action: iscsi_login_direct
node: m01
host: "{{ vol_iscsi_host }}"
port: "{{ vol_iscsi_port }}"
iqn: "{{ vol_iqn }}"
save_as: device
# === Baseline: 0% loss ===
- name: baseline
actions:
- action: print
msg: "=== Baseline: 0% packet loss ==="
- action: fio_json
node: m01
device: "{{ device }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: baseline
save_as: fio_0
- action: fio_parse
json_var: fio_0
metric: iops
save_as: iops_0
- action: print
msg: "0% loss: {{ iops_0 }} IOPS"
# === 0.1% packet loss ===
- name: loss-01pct
actions:
- action: print
msg: "=== Injecting 0.1% packet loss ==="
- action: exec
node: m02
cmd: "tc qdisc add dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root netem loss 0.1%"
root: "true"
- action: sleep
duration: 2s
- action: fio_json
node: m01
device: "{{ device }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: loss-01pct
save_as: fio_01
- action: fio_parse
json_var: fio_01
metric: iops
save_as: iops_01
- action: print
msg: "0.1% loss: {{ iops_01 }} IOPS"
- action: exec
node: m02
cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
root: "true"
- action: sleep
duration: 2s
# === 1% packet loss ===
- name: loss-1pct
actions:
- action: print
msg: "=== Injecting 1% packet loss ==="
- action: exec
node: m02
cmd: "tc qdisc add dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root netem loss 1%"
root: "true"
- action: sleep
duration: 2s
- action: fio_json
node: m01
device: "{{ device }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: loss-1pct
save_as: fio_1
- action: fio_parse
json_var: fio_1
metric: iops
save_as: iops_1
- action: print
msg: "1% loss: {{ iops_1 }} IOPS"
- action: exec
node: m02
cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
root: "true"
- action: sleep
duration: 2s
# === 5% packet loss ===
- name: loss-5pct
actions:
- action: print
msg: "=== Injecting 5% packet loss ==="
- action: exec
node: m02
cmd: "tc qdisc add dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root netem loss 5%"
root: "true"
- action: sleep
duration: 2s
- action: fio_json
node: m01
device: "{{ device }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: loss-5pct
save_as: fio_5
- action: fio_parse
json_var: fio_5
metric: iops
save_as: iops_5
- action: print
msg: "5% loss: {{ iops_5 }} IOPS"
- action: exec
node: m02
cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
root: "true"
- name: results
actions:
- action: print
msg: "=== Stable Dimension: Packet Loss Sweep (V1 sync_all RF=2) ==="
- action: print
msg: "0% (baseline): {{ iops_0 }} IOPS"
- action: print
msg: "0.1% loss: {{ iops_01 }} IOPS"
- action: print
msg: "1% loss: {{ iops_1 }} IOPS"
- action: print
msg: "5% loss: {{ iops_5 }} IOPS"
- action: collect_results
title: "Stable: Packet Loss Sweep (V1 sync_all RF=2)"
volume_name: "{{ volume_name }}"
- name: cleanup
always: true
actions:
- action: exec
node: m02
cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
root: "true"
ignore_error: true
- action: iscsi_cleanup
node: m01
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true

264
weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax-nvme-roce.yaml

@ -0,0 +1,264 @@
name: stable-replication-tax-nvme-roce
timeout: 10m
# Replication tax on 25Gbps RoCE via NVMe-TCP (not iSCSI).
# Previous results (iSCSI over RoCE):
# RF=1: 54,785 | RF=2 be: 28,408 | RF=2 sa: 29,893
#
# NVMe-TCP should have lower protocol overhead than iSCSI,
# potentially showing higher IOPS and a different replication tax ratio.
env:
master_url: "http://10.0.0.3:9433"
vol_size: "2147483648"
topology:
nodes:
m01:
host: 192.168.1.181
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
user: testdev
key: "/opt/work/testdev_key"
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nvme-master /tmp/sw-nvme-vs1 && mkdir -p /tmp/sw-nvme-master /tmp/sw-nvme-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nvme-vs2 && mkdir -p /tmp/sw-nvme-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-nvme-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
duration: 3s
# VS1 on m02 with NVMe-TCP enabled on RoCE IP.
- action: start_weed_volume
node: m02
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-nvme-vs1
extra_args: "-block.dir=/tmp/sw-nvme-vs1/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.3:4420 -ip=10.0.0.3"
save_as: vs1_pid
# VS2 on m01 with NVMe-TCP on RoCE IP.
- action: start_weed_volume
node: m01
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-nvme-vs2
extra_args: "-block.dir=/tmp/sw-nvme-vs2/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.1:4420 -ip=10.0.0.1"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
# === RF=1 via NVMe-TCP ===
- name: rf1-nvme
actions:
- action: create_block_volume
name: "nvme-rf1"
size_bytes: "{{ vol_size }}"
replica_factor: "1"
durability_mode: "best_effort"
- action: sleep
duration: 2s
- action: lookup_block_volume
name: "nvme-rf1"
save_as: rf1
# Connect via NVMe-TCP.
- action: exec
node: m01
cmd: "sh -c 'modprobe nvme_tcp; nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ rf1_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.nvme-rf1 >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
root: "true"
save_as: nvme_dev_rf1
- action: print
msg: "RF=1 NVMe device: {{ nvme_dev_rf1 }}"
- action: fio_json
node: m01
device: "{{ nvme_dev_rf1 }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: rf1-nvme
save_as: fio_rf1
- action: fio_parse
json_var: fio_rf1
metric: iops
save_as: iops_rf1
- action: print
msg: "RF=1 NVMe (RoCE): {{ iops_rf1 }} IOPS"
- action: exec
node: m01
cmd: "nvme disconnect-all 2>/dev/null; true"
root: "true"
ignore_error: true
# === RF=2 best_effort via NVMe-TCP ===
- name: rf2-be-nvme
actions:
- action: create_block_volume
name: "nvme-rf2be"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "best_effort"
- action: wait_volume_healthy
name: "nvme-rf2be"
timeout: 60s
- action: lookup_block_volume
name: "nvme-rf2be"
save_as: rf2be
- action: exec
node: m01
cmd: "sh -c 'nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ rf2be_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.nvme-rf2be >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
root: "true"
save_as: nvme_dev_rf2be
- action: print
msg: "RF=2 BE NVMe device: {{ nvme_dev_rf2be }}"
- action: fio_json
node: m01
device: "{{ nvme_dev_rf2be }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: rf2-be-nvme
save_as: fio_rf2be
- action: fio_parse
json_var: fio_rf2be
metric: iops
save_as: iops_rf2be
- action: print
msg: "RF=2 best_effort NVMe (RoCE): {{ iops_rf2be }} IOPS"
- action: exec
node: m01
cmd: "nvme disconnect-all 2>/dev/null; true"
root: "true"
ignore_error: true
# === RF=2 sync_all via NVMe-TCP ===
- name: rf2-sa-nvme
actions:
- action: create_block_volume
name: "nvme-rf2sa"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
- action: wait_volume_healthy
name: "nvme-rf2sa"
timeout: 60s
- action: lookup_block_volume
name: "nvme-rf2sa"
save_as: rf2sa
- action: exec
node: m01
cmd: "sh -c 'nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ rf2sa_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.nvme-rf2sa >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
root: "true"
save_as: nvme_dev_rf2sa
- action: print
msg: "RF=2 SA NVMe device: {{ nvme_dev_rf2sa }}"
- action: fio_json
node: m01
device: "{{ nvme_dev_rf2sa }}"
rw: randwrite
bs: 4k
iodepth: "32"
runtime: "30"
time_based: "true"
name: rf2-sa-nvme
save_as: fio_rf2sa
- action: fio_parse
json_var: fio_rf2sa
metric: iops
save_as: iops_rf2sa
- action: print
msg: "RF=2 sync_all NVMe (RoCE): {{ iops_rf2sa }} IOPS"
- action: exec
node: m01
cmd: "nvme disconnect-all 2>/dev/null; true"
root: "true"
ignore_error: true
- name: results
actions:
- action: print
msg: "=== Replication Tax — NVMe-TCP over 25Gbps RoCE (4K randwrite QD32) ==="
- action: print
msg: "RF=1 (no repl): {{ iops_rf1 }} IOPS"
- action: print
msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"
- action: print
msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"
- action: collect_results
title: "Replication Tax — NVMe-TCP over 25Gbps RoCE"
- name: cleanup
always: true
actions:
- action: exec
node: m01
cmd: "nvme disconnect-all 2>/dev/null; true"
root: "true"
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true

253
weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax-roce.yaml

@ -0,0 +1,253 @@
name: stable-replication-tax-roce
timeout: 10m
# Replication tax on 25Gbps RoCE network (10.0.0.x).
# Compares RF=1, RF=2 best_effort, RF=2 sync_all.
#
# Previous run on 1Gbps (192.168.1.x):
# RF=1: 54,650 IOPS | RF=2 be: 22,773 | RF=2 sa: 23,846
# Replication tax: ~58% — mostly 1Gbps wire limit
#
# RoCE (25Gbps) should remove the wire bottleneck and show
# the true protocol overhead.
env:
master_url: "http://10.0.0.3:9433"
vol_size: "1073741824"
topology:
nodes:
m01:
host: 192.168.1.181
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
user: testdev
key: "/opt/work/testdev_key"
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-roce-master /tmp/sw-roce-vs1 && mkdir -p /tmp/sw-roce-master /tmp/sw-roce-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-roce-vs2 && mkdir -p /tmp/sw-roce-vs2/blocks"
root: "true"
ignore_error: true
# Master on m02, listening on RoCE IP.
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-roce-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
duration: 3s
# VS1 on m02, advertising RoCE IP.
- action: start_weed_volume
node: m02
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-roce-vs1
extra_args: "-block.dir=/tmp/sw-roce-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
save_as: vs1_pid
# VS2 on m01, advertising RoCE IP.
- action: start_weed_volume
node: m01
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-roce-vs2
extra_args: "-block.dir=/tmp/sw-roce-vs2/blocks -block.listen=:3295 -ip=10.0.0.1"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
# === RF=1: No replication ===
- name: rf1-test
actions:
- action: create_block_volume
name: "roce-rf1"
size_bytes: "{{ vol_size }}"
replica_factor: "1"
durability_mode: "best_effort"
- action: sleep
duration: 2s
- action: lookup_block_volume
name: "roce-rf1"
save_as: rf1
- action: iscsi_login_direct
node: m01
host: "{{ rf1_iscsi_host }}"
port: "{{ rf1_iscsi_port }}"
iqn: "{{ rf1_iqn }}"
save_as: dev_rf1
- action: fio_json
node: m01
device: "{{ dev_rf1 }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: rf1
save_as: fio_rf1
- action: fio_parse
json_var: fio_rf1
metric: iops
save_as: iops_rf1
- action: print
msg: "RF=1 (RoCE): {{ iops_rf1 }} IOPS"
- action: iscsi_cleanup
node: m01
ignore_error: true
# === RF=2 best_effort ===
- name: rf2-best-effort
actions:
- action: create_block_volume
name: "roce-rf2be"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "best_effort"
- action: wait_volume_healthy
name: "roce-rf2be"
timeout: 60s
- action: lookup_block_volume
name: "roce-rf2be"
save_as: rf2be
- action: iscsi_login_direct
node: m01
host: "{{ rf2be_iscsi_host }}"
port: "{{ rf2be_iscsi_port }}"
iqn: "{{ rf2be_iqn }}"
save_as: dev_rf2be
- action: fio_json
node: m01
device: "{{ dev_rf2be }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: rf2-best-effort
save_as: fio_rf2be
- action: fio_parse
json_var: fio_rf2be
metric: iops
save_as: iops_rf2be
- action: print
msg: "RF=2 best_effort (RoCE): {{ iops_rf2be }} IOPS"
- action: iscsi_cleanup
node: m01
ignore_error: true
# === RF=2 sync_all ===
- name: rf2-sync-all
actions:
- action: create_block_volume
name: "roce-rf2sa"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
- action: wait_volume_healthy
name: "roce-rf2sa"
timeout: 60s
- action: lookup_block_volume
name: "roce-rf2sa"
save_as: rf2sa
- action: iscsi_login_direct
node: m01
host: "{{ rf2sa_iscsi_host }}"
port: "{{ rf2sa_iscsi_port }}"
iqn: "{{ rf2sa_iqn }}"
save_as: dev_rf2sa
- action: fio_json
node: m01
device: "{{ dev_rf2sa }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: rf2-sync-all
save_as: fio_rf2sa
- action: fio_parse
json_var: fio_rf2sa
metric: iops
save_as: iops_rf2sa
- action: print
msg: "RF=2 sync_all (RoCE): {{ iops_rf2sa }} IOPS"
- action: iscsi_cleanup
node: m01
ignore_error: true
- name: results
actions:
- action: print
msg: "=== Replication Tax — 25Gbps RoCE (4K randwrite QD16) ==="
- action: print
msg: "RF=1 (no repl): {{ iops_rf1 }} IOPS"
- action: print
msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"
- action: print
msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"
- action: collect_results
title: "Replication Tax — 25Gbps RoCE"
- name: cleanup
always: true
actions:
- action: iscsi_cleanup
node: m01
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true

245
weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax.yaml

@ -0,0 +1,245 @@
name: stable-replication-tax
timeout: 10m
# Stable dimension: replication tax comparison.
# RF=1 (no replication) vs RF=2 sync_all vs RF=2 best_effort.
# All on the same hardware, same workload.
#
# Answers: what does replication cost in IOPS?
env:
master_url: "http://192.168.1.184:9433"
vol_size: "1073741824"
topology:
nodes:
m01:
host: 192.168.1.181
user: testdev
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
user: testdev
key: "/opt/work/testdev_key"
phases:
- name: cluster-start
actions:
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-tax-master /tmp/sw-tax-vs1 && mkdir -p /tmp/sw-tax-master /tmp/sw-tax-vs1/blocks"
root: "true"
ignore_error: true
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-tax-vs2 && mkdir -p /tmp/sw-tax-vs2/blocks"
root: "true"
ignore_error: true
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-tax-master
save_as: master_pid
- action: sleep
duration: 3s
- action: start_weed_volume
node: m02
port: "18480"
master: "localhost:9433"
dir: /tmp/sw-tax-vs1
extra_args: "-block.dir=/tmp/sw-tax-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "192.168.1.184:9433"
dir: /tmp/sw-tax-vs2
extra_args: "-block.dir=/tmp/sw-tax-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
# === RF=1: No replication ===
- name: rf1-test
actions:
- action: create_block_volume
name: "tax-rf1"
size_bytes: "{{ vol_size }}"
replica_factor: "1"
durability_mode: "best_effort"
- action: sleep
duration: 2s
- action: lookup_block_volume
name: "tax-rf1"
save_as: rf1
- action: iscsi_login_direct
node: m01
host: "{{ rf1_iscsi_host }}"
port: "{{ rf1_iscsi_port }}"
iqn: "{{ rf1_iqn }}"
save_as: dev_rf1
- action: fio_json
node: m01
device: "{{ dev_rf1 }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: rf1
save_as: fio_rf1
- action: fio_parse
json_var: fio_rf1
metric: iops
save_as: iops_rf1
- action: print
msg: "RF=1: {{ iops_rf1 }} IOPS"
- action: iscsi_cleanup
node: m01
ignore_error: true
# === RF=2 best_effort ===
- name: rf2-best-effort
actions:
- action: create_block_volume
name: "tax-rf2be"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "best_effort"
- action: wait_volume_healthy
name: "tax-rf2be"
timeout: 60s
- action: lookup_block_volume
name: "tax-rf2be"
save_as: rf2be
- action: iscsi_login_direct
node: m01
host: "{{ rf2be_iscsi_host }}"
port: "{{ rf2be_iscsi_port }}"
iqn: "{{ rf2be_iqn }}"
save_as: dev_rf2be
- action: fio_json
node: m01
device: "{{ dev_rf2be }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: rf2-best-effort
save_as: fio_rf2be
- action: fio_parse
json_var: fio_rf2be
metric: iops
save_as: iops_rf2be
- action: print
msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"
- action: iscsi_cleanup
node: m01
ignore_error: true
# === RF=2 sync_all ===
- name: rf2-sync-all
actions:
- action: create_block_volume
name: "tax-rf2sa"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
- action: wait_volume_healthy
name: "tax-rf2sa"
timeout: 60s
- action: lookup_block_volume
name: "tax-rf2sa"
save_as: rf2sa
- action: iscsi_login_direct
node: m01
host: "{{ rf2sa_iscsi_host }}"
port: "{{ rf2sa_iscsi_port }}"
iqn: "{{ rf2sa_iqn }}"
save_as: dev_rf2sa
- action: fio_json
node: m01
device: "{{ dev_rf2sa }}"
rw: randwrite
bs: 4k
iodepth: "16"
runtime: "30"
time_based: "true"
name: rf2-sync-all
save_as: fio_rf2sa
- action: fio_parse
json_var: fio_rf2sa
metric: iops
save_as: iops_rf2sa
- action: print
msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"
- action: iscsi_cleanup
node: m01
ignore_error: true
- name: results
actions:
- action: print
msg: "=== Replication Tax (4K randwrite QD16) ==="
- action: print
msg: "RF=1 (no repl): {{ iops_rf1 }} IOPS"
- action: print
msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"
- action: print
msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"
- action: collect_results
title: "Replication Tax"
- name: cleanup
always: true
actions:
- action: iscsi_cleanup
node: m01
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true
Loading…
Cancel
Save