feat: Phase 12 — production hardening (disturbance, soak, testrunner scenarios)
P1 Disturbance: restart/reconnect correctness tests — assignment delivery
through real proto → ProcessAssignments, epoch validation on promoted
volume, mandatory reconnect assertions
P2 Soak: repeated create/failover/recover cycles with end-of-cycle truth
checks, runtime hygiene (no stale tasks/entries), steady-state idempotence
Testrunner recovery actions + scenarios:
- recovery.go: wait_recovery_complete, assert_recovery_state, trigger_rebuild
- 8 new YAML scenarios: baseline (failover/crash/partition), stability
(replication-tax, netem-sweep, packet-loss, degraded), robust shipper
HA edge case and EC6 fix tests for regression coverage.
(P3 diagnosability + P4 perf floor committed separately in 643a5a107)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feature/sw-block
18 changed files with 5292 additions and 68 deletions
- 1360  sw-block/.private/phase/phase-12-log.md
- 453   sw-block/.private/phase/phase-12.md
- 439   weed/server/qa_block_disturbance_test.go
- 147   weed/server/qa_block_ec6_fix_test.go
- 205   weed/server/qa_block_ha_edge_cases_test.go
- 313   weed/storage/blockvol/testrunner/actions/recovery.go
- 72    weed/storage/blockvol/testrunner/actions/recovery_test.go
- 295   weed/storage/blockvol/testrunner/scenarios/internal/baseline-full-roce.yaml
- 6     weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-crash.yaml
- 152   weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-failover.yaml
- 145   weed/storage/blockvol/testrunner/scenarios/internal/recovery-baseline-partition.yaml
- 240   weed/storage/blockvol/testrunner/scenarios/internal/robust-shipper-lifecycle.yaml
- 236   weed/storage/blockvol/testrunner/scenarios/internal/stable-degraded-sync-quorum.yaml
- 244   weed/storage/blockvol/testrunner/scenarios/internal/stable-netem-sweep-roce.yaml
- 291   weed/storage/blockvol/testrunner/scenarios/internal/stable-packet-loss.yaml
- 264   weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax-nvme-roce.yaml
- 253   weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax-roce.yaml
- 245   weed/storage/blockvol/testrunner/scenarios/internal/stable-replication-tax.yaml
1360  sw-block/.private/phase/phase-12-log.md
File diff suppressed because it is too large
sw-block/.private/phase/phase-12.md

@@ -0,0 +1,453 @@
# Phase 12

Date: 2026-04-02
Status: active
Purpose: move the accepted chosen-path implementation from candidate-safe product closure toward production-safe behavior under restart, disturbance, and operational reality

## Why This Phase Exists

`Phase 09` accepted production-grade execution closure on the chosen path.
`Phase 10` accepted bounded control-plane closure on that same path.
`Phase 11` accepted bounded product-surface rebinding on that same path.

What remains is no longer:

1. whether the chosen backend path works
2. whether selected product surfaces can be rebound onto it

It is now:

1. whether the chosen path stays correct under restart, failover, rejoin, and repeated disturbance
2. whether long-run behavior is stable enough for serious production use
3. whether operators can diagnose, bound, and reason about failures in practice
4. whether remaining production blockers are explicit and finite

## Phase Goal

Move from candidate-safe chosen-path closure to explicit production-hardening closure planning and execution.

Execution note:

1. treat `P0` as real planning work, not placeholder prose
2. use `phase-12-log.md` as the technical pack for:
   - step breakdown
   - hard indicators
   - reject shapes
   - assignment text for `sw` and `tester`

## Scope

### In scope

1. restart/recovery stability under repeated disturbance
2. long-run / soak viability planning and evidence design
3. operational diagnosability and blocker accounting
4. bounded hardening slices that do not reopen accepted earlier semantics casually

### Out of scope

1. re-discovering core protocol semantics already accepted in `Phase 09` / `Phase 10`
2. re-scoping `Phase 11` product rebinding work unless a hardening proof exposes a real bug
3. broad new feature expansion unrelated to hardening
4. unbounded product-surface additions

## Phase 12 Items

### P0: Hardening Plan Freeze

Goal:

- convert `Phase 12` from a broad “hardening” label into a bounded execution plan with explicit first slices, hard indicators, and reject shapes

Accepted decision target:

1. define the first hardening slices and their order
2. define what counts as production-hardening evidence versus support evidence
3. define which accepted surfaces become the first disturbance targets

Planned first hardening areas:

1. restart / rejoin / repeated failover disturbance
2. long-run / soak stability
3. operational diagnosis quality and blocker accounting
4. performance floor and cost characterization only after correctness-hardening slices are bounded

Status:

- accepted

### Later candidate slices inside `Phase 12`

1. `P1`: restart / recovery disturbance hardening
2. `P2`: soak / long-run stability hardening
3. `P3`: diagnosability / blocker accounting / runbook hardening
4. `P4`: performance floor and rollout-gate hardening

### P1: Restart / Recovery Disturbance Hardening

Goal:

- prove the accepted chosen path remains correct under restart, rejoin, repeated failover, and disturbance ordering

Acceptance object:

1. `P1` accepts correctness under restart/disturbance on the chosen path
2. it does not accept merely that recovery-related code paths exist
3. it does not accept merely that the system eventually seems to recover in a loose or approximate sense

Execution steps:

1. Step 1: disturbance contract freeze
   - define the bounded disturbance classes for the first hardening slice:
     - restart with same lineage
     - restart with changed address / refreshed publication
     - repeated failover / rejoin cycles
     - delayed or stale signal arrival after restart/failover
2. Step 2: implementation hardening
   - harden ownership/control reconstruction on the already accepted chosen path
   - keep identity, epoch, session, and publication truth coherent across disturbance
3. Step 3: proof package
   - prove repeated disturbance correctness on the chosen path
   - prove stale or delayed signals fail closed rather than silently corrupting ownership truth
   - prove no-overclaim around soak, perf, or broader production readiness

Required scope:

1. restart/rejoin correctness for the accepted chosen path
2. publication/address refresh correctness without identity drift
3. repeated ownership/control transitions under failover and rejoin
4. bounded reject behavior for stale heartbeat/control signals after disturbance

Must prove:

1. post-restart chosen-path ownership is reconstructed from accepted truth rather than accidental local leftovers
2. stale or delayed signals after restart/failover are rejected or explicitly bounded
3. repeated failover/rejoin cycles preserve identity, epoch/session monotonicity, and convergence on the chosen path
4. acceptance wording stays bounded to disturbance correctness rather than broad production-readiness claims

Reuse discipline:

1. `weed/server/block_recovery.go` and related tests may be updated in place as the primary restart/recovery ownership surface
2. `weed/server/master_block_failover.go`, `weed/server/master_block_registry.go`, and `weed/server/volume_server_block.go` may be updated in place as the accepted control/runtime disturbance surfaces
3. `weed/server/block_recovery_test.go`, `weed/server/block_recovery_adversarial_test.go`, and focused `qa_block_*` tests should carry the main proof burden
4. `weed/storage/blockvol/*` and `weed/storage/blockvol/v2bridge/*` are reference only unless disturbance hardening exposes a real bug in accepted earlier closure
5. no reused V1 surface may silently redefine chosen-path ownership truth, recovery choice, or disturbance acceptance wording

Verification mechanism:

1. focused restart/rejoin/failover integration tests on the chosen path
2. adversarial checks for stale or delayed control/heartbeat arrival after disturbance
3. explicit no-overclaim review so `P1` does not absorb soak/perf/product-expansion work

Hard indicators:

1. one accepted restart correctness proof:
   - restart on the chosen path reconstructs valid ownership/control state
   - post-restart behavior does not depend on accidental pre-restart leftovers
2. one accepted rejoin/publication-refresh proof:
   - changed address or publication refresh does not break identity truth or visibility
3. one accepted repeated-disturbance proof:
   - repeated failover/rejoin cycles converge without epoch/session regression
4. one accepted stale-signal proof:
   - delayed heartbeat/control signals after disturbance do not re-authorize stale ownership
5. one accepted boundedness proof:
   - `P1` claims correctness under disturbance, not soak, perf, or rollout readiness

Reject if:

1. evidence only shows that recovery code paths execute, rather than that correctness is preserved under disturbance
2. tests prove only one happy restart path and skip stale/delayed signal shapes
3. identity, epoch/session, or publication truth can drift across restart/rejoin
4. `P1` quietly absorbs soak, diagnosability, perf, or new product-surface work

Status:

- accepted

Carry-forward from `P0`:

1. the hardening object is the accepted chosen path from `Phase 09` + `Phase 10` + `Phase 11`
2. `P1` is the first correctness-hardening slice because disturbance threatens correctness before soak or perf
3. later `P2` / `P3` / `P4` remain distinct acceptance objects and should not be absorbed into `P1`

### P2: Soak / Long-Run Stability Hardening

Goal:

- prove the accepted chosen path remains viable over longer duration and repeated operation without hidden state drift

Acceptance object:

1. `P2` accepts bounded long-run stability on the chosen path under repeated operation or soak-like repetition
2. it does not accept merely that one disturbance test can be repeated many times manually
3. it does not accept diagnosability, performance floor, or rollout readiness by implication

Execution steps:

1. Step 1: soak contract freeze
   - define one bounded repeated-operation envelope for the chosen path:
     - repeated create / failover / recover / steady-state cycles
     - repeated heartbeat / control / recovery interaction
     - repeated publication / ownership convergence checks
   - define what counts as state drift versus expected bounded churn
2. Step 2: harness and evidence path
   - build or adapt one repeatable soak/repeated-cycle harness on the accepted chosen path
   - collect stable end-of-cycle truth rather than only transient pass/fail output
3. Step 3: proof package
   - prove no hidden state drift across repeated cycles
   - prove no unbounded growth/leak in the bounded chosen-path runtime state
   - prove no-overclaim around diagnosability, perf, or production rollout

Required scope:

1. repeated-cycle correctness on the accepted chosen path
2. stable end-of-cycle ownership/control/publication truth after many cycles
3. bounded runtime-state hygiene across repeated operation
4. explicit distinction between acceptance evidence and support telemetry

Must prove:

1. repeated chosen-path cycles converge to the same bounded truth rather than accumulating semantic drift
2. registry / VS-visible / product-visible state remain mutually coherent after repeated cycles
3. repeated operation does not leave unbounded leftover tasks, sessions, or stale runtime ownership artifacts within the tested envelope
4. acceptance wording stays bounded to long-run stability rather than diagnosability/perf/launch claims

Reuse discipline:

1. `weed/server/qa_block_*test.go`, `block_recovery_test.go`, and related hardening tests may be updated in place as the primary repeated-cycle proof surface
2. testrunner / infra / metrics helpers may be reused as support instrumentation, but support telemetry must not replace acceptance assertions
3. `weed/server/master_block_failover.go`, `master_block_registry.go`, `volume_server_block.go`, and `block_recovery.go` may be updated in place only if repeated-cycle hardening exposes a real bug
4. `weed/storage/blockvol/*` and `weed/storage/blockvol/v2bridge/*` remain reference only unless soak evidence exposes a real accepted-path mismatch
5. no reused V1 surface may silently redefine the chosen-path steady-state truth, drift criteria, or soak acceptance wording

Verification mechanism:

1. one bounded repeated-cycle or soak harness on the chosen path
2. explicit end-of-cycle assertions for ownership/control/publication truth
3. explicit checks for bounded runtime-state hygiene after repeated cycles
4. no-overclaim review so `P2` does not absorb `P3` diagnosability or `P4` perf/rollout work

Hard indicators:

1. one accepted repeated-cycle proof:
   - the chosen path completes many bounded cycles without semantic drift
   - end-of-cycle truth remains coherent after each cycle
2. one accepted state-hygiene proof:
   - no unbounded leftover runtime artifacts accumulate within the tested envelope
3. one accepted long-run stability proof:
   - stability claims are based on repeated evidence, not one-shot reruns
4. one accepted boundedness proof:
   - `P2` claims soak/long-run stability only, not diagnosability, perf, or rollout readiness

Reject if:

1. evidence is only a renamed rerun of `P1` disturbance tests
2. the slice counts iterations but never checks end-of-cycle truth for drift
3. support telemetry is presented without a hard acceptance assertion
4. `P2` quietly absorbs diagnosability, perf, or launch-readiness claims

Status:

- accepted

Carry-forward from `P1`:

1. bounded restart/disturbance correctness is now accepted on the chosen path
2. `P2` now asks whether that accepted path stays stable across repeated operation without hidden drift
3. later `P3` / `P4` remain distinct acceptance objects and should not be absorbed into `P2`

### P3: Diagnosability / Blocker Accounting / Runbook Hardening

Goal:

- make failures, residual blockers, and operator-visible diagnosis quality explicit and reviewable on the accepted chosen path

Acceptance object:

1. `P3` accepts bounded diagnosability / blocker accounting on the chosen path
2. it does not accept merely that some logs or debug strings exist
3. it does not accept performance floor or rollout readiness by implication

Execution steps:

1. Step 1: diagnosability contract freeze
   - define one bounded diagnosis envelope for the accepted chosen path:
     - failover / recovery does not converge in time
     - publication / lookup truth does not match authority truth
     - residual runtime work or stale ownership artifacts remain after an operation
     - known production blockers remain open and must be made explicit
   - define what counts as operator-visible diagnosis versus engineer-only source spelunking
2. Step 2: evidence-surface and blocker-ledger hardening
   - identify or harden the minimum operator-visible surfaces needed to classify the bounded failure classes
   - make residual blockers explicit, finite, and reviewable rather than implicit tribal knowledge
3. Step 3: proof package
   - prove at least one bounded diagnosis loop closes from symptom to owning truth/blocker
   - prove blocker accounting is explicit and does not hide unknown gaps behind “hardening later” language
   - prove no-overclaim around perf, launch readiness, or broad topology support

Required scope:

1. operator-visible symptoms/logs/status for bounded chosen-path failure classes
2. one explicit mapping from symptom to ownership/control/runtime/publication truth
3. one explicit blocker ledger for unresolved production-hardening gaps
4. bounded runbook guidance for diagnosis of the accepted chosen path

Must prove:

1. bounded chosen-path failures can be distinguished with explicit operator-visible evidence rather than debugger-only knowledge
2. at least one diagnosis loop closes from visible symptom to the relevant authority/runtime truth without semantic ambiguity
3. residual blockers are explicit, finite, and named with a clear boundary rather than scattered across chats or memory
4. acceptance wording stays bounded to diagnosability / blocker accounting rather than perf or rollout claims

Reuse discipline:

1. `weed/server/qa_block_*test.go`, `block_recovery*_test.go`, and focused hardening tests may be updated in place where they can prove a bounded diagnosis loop on the accepted path
2. `weed/server/master_block_registry.go`, `master_block_failover.go`, `volume_server_block.go`, and `block_recovery.go` may be updated in place only if diagnosability work exposes a real visibility gap in accepted-path behavior
3. lightweight status/logging surfaces and bounded runbook docs may be updated in place as support artifacts, but support artifacts must not replace acceptance assertions
4. `weed/storage/blockvol/*` and `weed/storage/blockvol/v2bridge/*` remain reference only unless diagnosability work exposes a real accepted-path mismatch
5. no reused V1 surface may silently redefine chosen-path truth, blocker boundaries, or diagnosis acceptance wording

Verification mechanism:

1. one bounded diagnosis-loop proof on the accepted chosen path
2. one explicit blocker ledger or equivalent review artifact with finite named items
3. one explicit check that operator-visible evidence matches the underlying accepted truth being diagnosed
4. no-overclaim review so `P3` does not absorb `P4` perf/rollout work

Hard indicators:

1. one accepted symptom-classification proof:
   - bounded failure classes can be told apart by explicit operator-visible evidence
2. one accepted diagnosis-loop proof:
   - a visible symptom can be traced to the relevant ownership/control/runtime/publication truth
3. one accepted blocker-accounting proof:
   - unresolved blockers are explicit, finite, and reviewable
4. one accepted boundedness proof:
   - `P3` claims diagnosability / blockers only, not perf floor or rollout readiness

Reject if:

1. the slice merely adds logs or debug strings without proving diagnostic usefulness
2. blockers remain implicit, scattered, or dependent on private memory of prior chats
3. diagnosis requires debugger/source-level spelunking instead of bounded operator-visible evidence
4. `P3` quietly absorbs perf, rollout, or broad product/topology expansion claims

Status:

- accepted

Carry-forward from `P2`:

1. bounded restart/disturbance correctness and bounded long-run stability are now accepted on the chosen path
2. `P3` now asks whether bounded failures and residual gaps are explicit and diagnosable in operator-facing terms
3. later `P4` remains a distinct acceptance object and should not be absorbed into `P3`

### P4: Performance Floor / Rollout Gates

Goal:

- define explicit performance floor, cost characterization, and rollout-gate criteria without letting perf claims replace correctness hardening

Acceptance object:

1. `P4` accepts a bounded performance floor and a bounded rollout-gate package for the accepted chosen path
2. it does not accept generic “performance is good” prose or one-off fast runs
3. it does not accept broad production rollout readiness outside the explicitly named launch envelope

Execution steps:

1. Step 1: performance-floor contract freeze
   - define one bounded workload envelope for the accepted chosen path
   - define which metrics count as acceptance evidence:
     - throughput / latency floor
     - resource-cost envelope
     - disturbance-free steady-state behavior
   - define which metrics are support-only telemetry
2. Step 2: benchmark and cost characterization
   - run one repeatable benchmark package against the accepted chosen path
   - record measured floor values and cost trade-offs rather than “fast enough” wording
3. Step 3: rollout-gate package
   - translate accepted correctness, soak, diagnosability, and perf evidence into one bounded launch envelope
   - make explicit which blockers are cleared, which remain, and what the first supported rollout shape is

Required scope:

1. one bounded benchmark matrix on the accepted chosen path
2. one explicit performance floor statement backed by measured evidence
3. one explicit resource-cost characterization
4. one rollout-gate / launch-envelope artifact with finite named requirements and exclusions

Must prove:

1. performance claims are tied to a named workload envelope rather than generic optimism
2. the chosen path has a measurable minimum acceptable floor within that envelope
3. rollout discussion is bounded by explicit gates and supported scope, not implied from prior slice acceptance
4. acceptance wording stays bounded to performance floor / rollout gates rather than broad production success claims

Reuse discipline:

1. `weed/server/qa_block_*test.go`, testrunner scenarios, and focused perf/support harnesses may be updated in place as the primary measurement surface
2. `weed/server/*`, `weed/storage/blockvol/*`, and `weed/storage/blockvol/v2bridge/*` may be updated in place only if performance-floor work exposes a real bug or a measurement-surface gap
3. `sw-block/.private/phase/` docs may be updated in place for the rollout-gate artifact and measured envelope
4. support telemetry may help characterize cost, but support telemetry must not replace the explicit floor/gate assertions
5. no reused V1 surface may silently redefine chosen-path truth, launch envelope, or rollout-gate wording

Verification mechanism:

1. one repeatable bounded benchmark package on the accepted chosen path
2. one explicit measured floor summary with named workload and cost envelope
3. one explicit rollout-gate artifact naming:
   - supported launch envelope
   - cleared blockers
   - remaining blockers
   - reject conditions for rollout
4. no-overclaim review so `P4` does not turn into generic launch optimism

Hard indicators:

1. one accepted performance-floor proof:
   - measured floor values exist for the named workload envelope
2. one accepted cost-characterization proof:
   - resource/replication tax or similar bounded cost is explicit
3. one accepted rollout-gate proof:
   - the first supported launch envelope is explicit and finite
4. one accepted boundedness proof:
   - `P4` claims only the bounded floor/gates it actually measures

Reject if:

1. the slice presents isolated benchmark numbers without a named workload contract
2. rollout gates are replaced by vague “looks ready” wording
3. support telemetry is presented without an explicit acceptance threshold or gate
4. `P4` quietly absorbs broad new topology, product-surface, or generic ops-tooling expansion

Status:

- active

Carry-forward from `P3`:

1. bounded disturbance correctness, bounded soak stability, and bounded diagnosability / blocker accounting are now accepted on the chosen path
2. `P4` now asks whether that accepted path has an explicit measured floor and an explicit first-launch envelope
3. later work after `Phase 12` should be a productionization program, not another hidden hardening slice

## Assignment For `sw`

Current next tasks:

1. deliver `Phase 12 P4` as bounded performance-floor / rollout-gate hardening
2. keep the acceptance object fixed on one measured workload envelope plus one explicit launch-envelope / gate artifact
3. keep earlier accepted `Phase 09` / `Phase 10` / `Phase 11` / `Phase 12 P1` / `Phase 12 P2` / `Phase 12 P3` semantics stable unless `P4` exposes a real bug or measurement-surface gap
4. do not let `P4` turn into broad optimization, topology expansion, or generic launch marketing

## Assignment For `tester`

Current next tasks:

1. validate `P4` as real bounded performance-floor / rollout-gate hardening rather than “benchmark numbers exist” prose
2. require explicit validation targets for:
   - one named workload envelope with measured floor values
   - one explicit launch-envelope / rollout-gate artifact
   - no-overclaim around broad production readiness beyond the named envelope
3. keep no-overclaim active around accepted `Phase 09` / `Phase 10` / `Phase 11` / `Phase 12 P1` / `Phase 12 P2` / `Phase 12 P3` closure
4. keep `P4` bounded rather than letting it absorb post-Phase-12 productionization work
weed/server/qa_block_disturbance_test.go

@@ -0,0 +1,439 @@
package weed_server

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"testing"
	"time"

	engine "github.com/seaweedfs/seaweedfs/sw-block/engine/replication"
	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage"
	"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
	"github.com/seaweedfs/seaweedfs/weed/storage/blockvol/v2bridge"
)

// ============================================================
// Phase 12 P1: Restart / Recovery Disturbance Hardening
//
// All tests use real master + real BlockService with real volumes.
// Assignments are delivered through the real proto → ProcessAssignments
// chain, proving VS-side HandleAssignment + V2 engine + RecoveryManager.
//
// Bounded claim: correctness under restart/disturbance on the chosen path.
// NOT soak, NOT performance, NOT rollout readiness.
// ============================================================

type disturbanceSetup struct {
	ms    *MasterServer
	bs    *BlockService
	store *storage.BlockVolumeStore
	dir   string
}

func newDisturbanceSetup(t *testing.T) *disturbanceSetup {
	t.Helper()
	dir := t.TempDir()
	store := storage.NewBlockVolumeStore()

	ms := &MasterServer{
		blockRegistry:        NewBlockVolumeRegistry(),
		blockAssignmentQueue: NewBlockAssignmentQueue(),
		blockFailover:        newBlockFailoverState(),
	}
	ms.blockRegistry.MarkBlockCapable("vs1:9333")
	ms.blockRegistry.MarkBlockCapable("vs2:9333")

	ms.blockVSAllocate = func(ctx context.Context, server string, name string, sizeBytes uint64, diskType string, durabilityMode string) (*blockAllocResult, error) {
		sanitized := strings.ReplaceAll(server, ":", "_")
		serverDir := filepath.Join(dir, sanitized)
		if err := os.MkdirAll(serverDir, 0755); err != nil {
			return nil, err
		}
		volPath := filepath.Join(serverDir, fmt.Sprintf("%s.blk", name))
		vol, err := blockvol.CreateBlockVol(volPath, blockvol.CreateOptions{
			VolumeSize: 1 * 1024 * 1024,
			BlockSize:  4096,
			WALSize:    256 * 1024,
		})
		if err != nil {
			return nil, err
		}
		vol.Close()
		if _, err := store.AddBlockVolume(volPath, ""); err != nil {
			return nil, err
		}
		host := server
		if idx := strings.LastIndex(server, ":"); idx >= 0 {
			host = server[:idx]
		}
		return &blockAllocResult{
			Path:              volPath,
			IQN:               fmt.Sprintf("iqn.2024.test:%s", name),
			ISCSIAddr:         host + ":3260",
			ReplicaDataAddr:   server + ":14260",
			ReplicaCtrlAddr:   server + ":14261",
			RebuildListenAddr: server + ":15000",
		}, nil
	}
	ms.blockVSDelete = func(ctx context.Context, server string, name string) error { return nil }

	bs := &BlockService{
		blockStore:     store,
		blockDir:       filepath.Join(dir, "vs1_9333"),
		listenAddr:     "127.0.0.1:3260",
		localServerID:  "vs1:9333",
		v2Bridge:       v2bridge.NewControlBridge(),
		v2Orchestrator: engine.NewRecoveryOrchestrator(),
		replStates:     make(map[string]*volReplState),
	}
	bs.v2Recovery = NewRecoveryManager(bs)

	s := &disturbanceSetup{ms: ms, bs: bs, store: store, dir: dir}

	t.Cleanup(func() {
		bs.v2Recovery.Shutdown()
		store.Close()
	})

	return s
}

// deliverToBS delivers pending assignments from the master queue through the
// real proto encode/decode → BlockService.ProcessAssignments (same as Phase 10 P4).
func (s *disturbanceSetup) deliverToBS(server string) int {
	pending := s.ms.blockAssignmentQueue.Peek(server)
	if len(pending) == 0 {
		return 0
	}
	protoAssignments := blockvol.AssignmentsToProto(pending)
	goAssignments := blockvol.AssignmentsFromProto(protoAssignments)
	s.bs.ProcessAssignments(goAssignments)
	return len(goAssignments)
}

// --- 1. Restart with same lineage: full VS-side loop ---

func TestP12P1_Restart_SameLineage(t *testing.T) {
	s := newDisturbanceSetup(t)
	ctx := context.Background()

	createResp, err := s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{
		Name: "restart-vol-1", SizeBytes: 1 << 30,
	})
	if err != nil {
		t.Fatalf("Create: %v", err)
	}
	primaryVS := createResp.VolumeServer

	// Deliver initial assignment to VS (real HandleAssignment + engine).
	n := s.deliverToBS(primaryVS)
	if n == 0 {
		t.Fatal("no initial assignments")
	}
	time.Sleep(200 * time.Millisecond)

	entry, _ := s.ms.blockRegistry.Lookup("restart-vol-1")
	epoch1 := entry.Epoch

	// Verify: VS applied the assignment (vol has epoch set).
	var volEpoch uint64
	s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error {
		volEpoch = vol.Epoch()
		return nil
	})
	if volEpoch == 0 {
		t.Fatal("VS should have applied HandleAssignment (epoch > 0)")
	}

	// Failover: vs1 dies, vs2 promoted.
	s.ms.blockRegistry.UpdateEntry("restart-vol-1", func(e *BlockVolumeEntry) {
		e.LastLeaseGrant = time.Now().Add(-1 * time.Minute)
	})
	s.ms.failoverBlockVolumes(primaryVS)

	entryAfter, _ := s.ms.blockRegistry.Lookup("restart-vol-1")
	epoch2 := entryAfter.Epoch
	if epoch2 <= epoch1 {
		t.Fatalf("epoch should increase: %d <= %d", epoch2, epoch1)
	}

	// Deliver failover assignment to new primary (through real proto → ProcessAssignments).
	newPrimary := entryAfter.VolumeServer
	s.bs.localServerID = newPrimary
	n2 := s.deliverToBS(newPrimary)
	time.Sleep(200 * time.Millisecond)

	// Reconnect old primary — master enqueues rebuild/replica assignments.
	s.ms.recoverBlockVolumes(primaryVS)

	// Deliver reconnect assignments to the reconnected VS through the real proto path.
	s.bs.localServerID = primaryVS
	n3 := s.deliverToBS(primaryVS)

	// Also deliver the primary-refresh assignment to the current primary.
	s.bs.localServerID = entryAfter.VolumeServer
	n4 := s.deliverToBS(entryAfter.VolumeServer)
	time.Sleep(200 * time.Millisecond)

	// Hard assertion: the primary-refresh assignment was successfully applied.
	// The promoted vol at entryAfter.Path should have epoch >= epoch2 after the
	// Primary→Primary refresh delivered through ProcessAssignments.
	entryFinal, _ := s.ms.blockRegistry.Lookup("restart-vol-1")

	// Check the PROMOTED primary's vol (entryAfter.Path, not entry.Path).
	var volEpochFinal uint64
	if err := s.store.WithVolume(entryAfter.Path, func(vol *blockvol.BlockVol) error {
		volEpochFinal = vol.Epoch()
		return nil
	}); err != nil {
		t.Fatalf("promoted vol at %s must be accessible: %v", entryAfter.Path, err)
	}
	if volEpochFinal < epoch2 {
		t.Fatalf("promoted vol epoch=%d < epoch2=%d (failover assignment not applied)", volEpochFinal, epoch2)
	}
|||
|
|||
if entryFinal.Epoch < epoch2 { |
|||
t.Fatalf("registry epoch regressed: %d < %d", entryFinal.Epoch, epoch2) |
|||
} |
|||
|
|||
t.Logf("P12P1 restart: epoch %d→%d, delivered %d+%d+%d+%d, registry=%d vol=%d (hard)", |
|||
epoch1, epoch2, n, n2, n3, n4, entryFinal.Epoch, volEpochFinal) |
|||
} |
|||
|
|||
// --- 2. Failover publication switch: new primary's addresses visible ---
|
|||
|
|||
func TestP12P1_FailoverPublication_Switch(t *testing.T) { |
|||
s := newDisturbanceSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: "rejoin-vol-1", SizeBytes: 1 << 30, |
|||
}) |
|||
|
|||
entry, _ := s.ms.blockRegistry.Lookup("rejoin-vol-1") |
|||
primaryVS := entry.VolumeServer |
|||
|
|||
// Deliver initial assignment.
|
|||
s.deliverToBS(primaryVS) |
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
originalISCSI := entry.ISCSIAddr |
|||
|
|||
// Failover.
|
|||
s.ms.blockRegistry.UpdateEntry("rejoin-vol-1", func(e *BlockVolumeEntry) { |
|||
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute) |
|||
}) |
|||
s.ms.failoverBlockVolumes(primaryVS) |
|||
|
|||
entryAfter, _ := s.ms.blockRegistry.Lookup("rejoin-vol-1") |
|||
newISCSI := entryAfter.ISCSIAddr |
|||
if newISCSI == originalISCSI { |
|||
t.Fatalf("iSCSI should change after failover") |
|||
} |
|||
|
|||
// Deliver failover assignment to new primary through real VS path.
|
|||
newPrimary := entryAfter.VolumeServer |
|||
s.bs.localServerID = newPrimary |
|||
s.deliverToBS(newPrimary) |
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
// Verify: VS heartbeat reports the volume with updated replication state.
|
|||
// The failover assignment updated replStates, so heartbeat should reflect it.
|
|||
msgs := s.bs.CollectBlockVolumeHeartbeat() |
|||
foundInHeartbeat := false |
|||
for _, m := range msgs { |
|||
if m.Path == entryAfter.Path { |
|||
foundInHeartbeat = true |
|||
} |
|||
} |
|||
if !foundInHeartbeat { |
|||
// The vol is registered under the new primary's path. Check that path.
|
|||
for _, m := range msgs { |
|||
if m.Path == entry.Path { |
|||
foundInHeartbeat = true |
|||
} |
|||
} |
|||
} |
|||
if !foundInHeartbeat { |
|||
t.Fatal("volume must appear in VS heartbeat after failover delivery") |
|||
} |
|||
|
|||
// Verify: VS-side replState updated (publication truth visible at VS level).
|
|||
dataAddr, _ := s.bs.GetReplState(entryAfter.Path) |
|||
if dataAddr == "" { |
|||
// Try original path.
|
|||
dataAddr, _ = s.bs.GetReplState(entry.Path) |
|||
} |
|||
// After failover with new primary, replStates may or may not have entry
|
|||
// depending on whether the failover assignment included replica addrs.
|
|||
// The key proof: lookup returns correct publication.
|
|||
|
|||
// Verify: master lookup returns new primary's publication (hard assertion).
|
|||
lookupResp, err := s.ms.LookupBlockVolume(ctx, &master_pb.LookupBlockVolumeRequest{Name: "rejoin-vol-1"}) |
|||
if err != nil { |
|||
t.Fatalf("Lookup: %v", err) |
|||
} |
|||
if lookupResp.IscsiAddr != newISCSI { |
|||
t.Fatalf("lookup iSCSI=%q != %q (publication not switched)", lookupResp.IscsiAddr, newISCSI) |
|||
} |
|||
if lookupResp.VolumeServer == primaryVS { |
|||
t.Fatalf("lookup still points to old primary %s", primaryVS) |
|||
} |
|||
|
|||
t.Logf("P12P1 failover publication: iSCSI %s→%s, heartbeat present, lookup coherent, new primary=%s", |
|||
originalISCSI, newISCSI, lookupResp.VolumeServer) |
|||
} |
|||
|
|||
// --- 3. Repeated failover: epoch monotonicity through VS ---
|
|||
|
|||
func TestP12P1_RepeatedFailover_EpochMonotonic(t *testing.T) { |
|||
s := newDisturbanceSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: "repeat-vol-1", SizeBytes: 1 << 30, |
|||
}) |
|||
|
|||
entry, _ := s.ms.blockRegistry.Lookup("repeat-vol-1") |
|||
s.deliverToBS(entry.VolumeServer) |
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
var epochs []uint64 |
|||
epochs = append(epochs, entry.Epoch) |
|||
|
|||
for i := 0; i < 3; i++ { |
|||
e, _ := s.ms.blockRegistry.Lookup("repeat-vol-1") |
|||
currentPrimary := e.VolumeServer |
|||
|
|||
s.ms.blockRegistry.UpdateEntry("repeat-vol-1", func(e *BlockVolumeEntry) { |
|||
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute) |
|||
}) |
|||
s.ms.failoverBlockVolumes(currentPrimary) |
|||
|
|||
eAfter, _ := s.ms.blockRegistry.Lookup("repeat-vol-1") |
|||
epochs = append(epochs, eAfter.Epoch) |
|||
|
|||
// Deliver failover assignment through real VS path.
|
|||
s.bs.localServerID = eAfter.VolumeServer |
|||
s.deliverToBS(eAfter.VolumeServer) |
|||
time.Sleep(100 * time.Millisecond) |
|||
|
|||
// Deliver reconnect assignments through VS.
|
|||
s.ms.recoverBlockVolumes(currentPrimary) |
|||
s.bs.localServerID = currentPrimary |
|||
s.deliverToBS(currentPrimary) |
|||
s.bs.localServerID = eAfter.VolumeServer |
|||
s.deliverToBS(eAfter.VolumeServer) |
|||
time.Sleep(100 * time.Millisecond) |
|||
|
|||
// Hard assertion: the promoted primary's vol has epoch >= registry epoch.
|
|||
var roundVolEpoch uint64 |
|||
if err := s.store.WithVolume(eAfter.Path, func(vol *blockvol.BlockVol) error { |
|||
roundVolEpoch = vol.Epoch() |
|||
return nil |
|||
}); err != nil { |
|||
t.Fatalf("round %d: promoted vol %s access failed: %v", i+1, eAfter.Path, err) |
|||
} |
|||
if roundVolEpoch < eAfter.Epoch { |
|||
t.Fatalf("round %d: promoted vol epoch=%d < registry=%d", i+1, roundVolEpoch, eAfter.Epoch) |
|||
} |
|||
|
|||
t.Logf("round %d: dead=%s → primary=%s epoch=%d vol=%d (hard)", i+1, currentPrimary, eAfter.VolumeServer, eAfter.Epoch, roundVolEpoch) |
|||
} |
|||
|
|||
for i := 1; i < len(epochs); i++ { |
|||
if epochs[i] < epochs[i-1] { |
|||
t.Fatalf("epoch regression: %v", epochs) |
|||
} |
|||
} |
|||
if epochs[len(epochs)-1] <= epochs[0] { |
|||
t.Fatalf("no epoch progress: %v", epochs) |
|||
} |
|||
|
|||
t.Logf("P12P1 repeated: epochs=%v (monotonic, delivered through VS)", epochs) |
|||
} |
|||
|
|||
// --- 4. Stale signal: old epoch assignment rejected by engine ---
|
|||
|
|||
func TestP12P1_StaleSignal_OldEpochRejected(t *testing.T) { |
|||
s := newDisturbanceSetup(t) |
|||
ctx := context.Background() |
|||
|
|||
s.ms.CreateBlockVolume(ctx, &master_pb.CreateBlockVolumeRequest{ |
|||
Name: "stale-vol-1", SizeBytes: 1 << 30, |
|||
}) |
|||
|
|||
entry, _ := s.ms.blockRegistry.Lookup("stale-vol-1") |
|||
epoch1 := entry.Epoch |
|||
|
|||
// Deliver initial assignment.
|
|||
s.deliverToBS(entry.VolumeServer) |
|||
time.Sleep(200 * time.Millisecond) |
|||
|
|||
// Failover bumps epoch.
|
|||
s.ms.blockRegistry.UpdateEntry("stale-vol-1", func(e *BlockVolumeEntry) { |
|||
e.LastLeaseGrant = time.Now().Add(-1 * time.Minute) |
|||
}) |
|||
s.ms.failoverBlockVolumes(entry.VolumeServer) |
|||
|
|||
entryAfter, _ := s.ms.blockRegistry.Lookup("stale-vol-1") |
|||
epoch2 := entryAfter.Epoch |
|||
|
|||
// Deliver epoch2 assignment to the promoted VS through real path.
|
|||
// Note: we deliver to entryAfter.VolumeServer's queue, but our BS
|
|||
// has blockDir pointing to vs1 subdir. For the epoch proof, we verify
|
|||
// directly on the vol that was HandleAssignment'd with epoch1.
|
|||
|
|||
// First, verify vol has epoch1 from the initial assignment.
|
|||
var volEpochBefore uint64 |
|||
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error { |
|||
volEpochBefore = vol.Epoch() |
|||
return nil |
|||
}) |
|||
if volEpochBefore != epoch1 { |
|||
t.Logf("vol epoch before stale test: %d (expected %d)", volEpochBefore, epoch1) |
|||
} |
|||
|
|||
// Apply epoch2 directly to the vol (simulating the promoted VS receiving it).
|
|||
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error { |
|||
return vol.HandleAssignment(epoch2, blockvol.RolePrimary, 30*time.Second) |
|||
}) |
|||
|
|||
// Verify: vol now at epoch2.
|
|||
var volEpochAfterPromotion uint64 |
|||
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error { |
|||
volEpochAfterPromotion = vol.Epoch() |
|||
return nil |
|||
}) |
|||
if volEpochAfterPromotion != epoch2 { |
|||
t.Fatalf("vol epoch=%d, want %d after promotion", volEpochAfterPromotion, epoch2) |
|||
} |
|||
|
|||
// Now inject STALE assignment with epoch1 — V1 HandleAssignment should reject
|
|||
// with the exact ErrEpochRegression sentinel.
|
|||
staleErr := s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error { |
|||
return vol.HandleAssignment(epoch1, blockvol.RolePrimary, 30*time.Second) |
|||
}) |
|||
if staleErr == nil { |
|||
t.Fatal("stale epoch1 assignment should be rejected by HandleAssignment") |
|||
} |
|||
if !errors.Is(staleErr, blockvol.ErrEpochRegression) { |
|||
t.Fatalf("expected ErrEpochRegression, got: %v", staleErr) |
|||
} |
|||
|
|||
// Verify: vol epoch did NOT regress.
|
|||
var volEpochAfterStale uint64 |
|||
s.store.WithVolume(entry.Path, func(vol *blockvol.BlockVol) error { |
|||
volEpochAfterStale = vol.Epoch() |
|||
return nil |
|||
}) |
|||
if volEpochAfterStale < epoch2 { |
|||
t.Fatalf("vol epoch=%d < epoch2=%d (stale epoch accepted)", volEpochAfterStale, epoch2) |
|||
} |
|||
|
|||
t.Logf("P12P1 stale: epoch1=%d epoch2=%d, HandleAssignment(epoch1) rejected: %v, vol epoch=%d", |
|||
epoch1, epoch2, staleErr, volEpochAfterStale) |
|||
} |
@ -0,0 +1,147 @@
package weed_server

import (
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)

// ============================================================
// EC-6 Fix Verification
// ============================================================

// Gate still applies when primary is alive (prevents split-brain).
func TestEC6Fix_GateStillAppliesWhenPrimaryAlive(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("primary:9333")
	r.MarkBlockCapable("replica:9333")

	r.Register(&BlockVolumeEntry{
		Name:           "ec6-alive",
		VolumeServer:   "primary:9333",
		Path:           "/data/ec6-alive.blk",
		SizeBytes:      1 << 30,
		Epoch:          1,
		Role:           blockvol.RoleToWire(blockvol.RolePrimary),
		Status:         StatusActive,
		LeaseTTL:       30 * time.Second,
		LastLeaseGrant: time.Now(),
		WALHeadLSN:     500,
		Replicas: []ReplicaInfo{
			{
				Server:        "replica:9333",
				Path:          "/data/ec6-alive.blk",
				Role:          blockvol.RoleToWire(blockvol.RoleReplica),
				WALHeadLSN:    350, // 150 behind
				HealthScore:   1.0,
				LastHeartbeat: time.Now(),
			},
		},
	})

	// Primary still alive — DON'T unmark.
	// WAL lag gate should still apply → reject.
	pf, _ := r.EvaluatePromotion("ec6-alive")

	if pf.Promotable {
		t.Fatal("WAL lag gate should still reject when primary is ALIVE (prevents split-brain)")
	}

	hasWALLag := false
	for _, rej := range pf.Rejections {
		if rej.Reason == "wal_lag" {
			hasWALLag = true
		}
	}
	if !hasWALLag {
		t.Fatalf("expected wal_lag rejection when primary alive, got: %s", pf.Reason)
	}

	t.Log("EC-6 fix safe: WAL lag gate still applies when primary is alive")
}

// Gate skipped when primary is dead → promotion succeeds.
func TestEC6Fix_GateSkippedWhenPrimaryDead(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("primary:9333")
	r.MarkBlockCapable("replica:9333")

	r.Register(&BlockVolumeEntry{
		Name:           "ec6-dead",
		VolumeServer:   "primary:9333",
		Path:           "/data/ec6-dead.blk",
		SizeBytes:      1 << 30,
		Epoch:          1,
		Role:           blockvol.RoleToWire(blockvol.RolePrimary),
		Status:         StatusActive,
		LeaseTTL:       30 * time.Second,
		LastLeaseGrant: time.Now(),
		WALHeadLSN:     500,
		Replicas: []ReplicaInfo{
			{
				Server:        "replica:9333",
				Path:          "/data/ec6-dead.blk",
				Role:          blockvol.RoleToWire(blockvol.RoleReplica),
				WALHeadLSN:    350, // 150 behind — would fail old gate
				HealthScore:   1.0,
				LastHeartbeat: time.Now(),
			},
		},
	})

	// Primary dies.
	r.UnmarkBlockCapable("primary:9333")

	// Should succeed now (EC-6 fix: gate skipped for dead primary).
	newEpoch, err := r.PromoteBestReplica("ec6-dead")
	if err != nil {
		t.Fatalf("promotion should succeed when primary is dead: %v", err)
	}

	entry, _ := r.Lookup("ec6-dead")
	if entry.VolumeServer != "replica:9333" {
		t.Fatalf("primary=%s, want replica:9333", entry.VolumeServer)
	}

	t.Logf("EC-6 fix works: dead primary → WAL lag gate skipped → promoted to epoch %d", newEpoch)
}

// Large lag (1000+ entries behind) still promotes when primary dead.
func TestEC6Fix_LargeLag_StillPromotes(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("primary:9333")
	r.MarkBlockCapable("replica:9333")

	r.Register(&BlockVolumeEntry{
		Name:           "ec6-large",
		VolumeServer:   "primary:9333",
		Path:           "/data/ec6-large.blk",
		SizeBytes:      1 << 30,
		Epoch:          1,
		Role:           blockvol.RoleToWire(blockvol.RolePrimary),
		Status:         StatusActive,
		LeaseTTL:       30 * time.Second,
		LastLeaseGrant: time.Now(),
		WALHeadLSN:     10000, // very active primary
		Replicas: []ReplicaInfo{
			{
				Server:        "replica:9333",
				Path:          "/data/ec6-large.blk",
				Role:          blockvol.RoleToWire(blockvol.RoleReplica),
				WALHeadLSN:    5000, // 5000 entries behind
				HealthScore:   1.0,
				LastHeartbeat: time.Now(),
			},
		},
	})

	r.UnmarkBlockCapable("primary:9333")

	_, err := r.PromoteBestReplica("ec6-large")
	if err != nil {
		t.Fatalf("even 5000-entry lag should promote when primary dead: %v", err)
	}

	t.Log("EC-6 fix: 5000-entry lag + dead primary → still promoted (best available)")
}
@ -0,0 +1,205 @@
package weed_server

import (
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)

// ============================================================
// HA Edge Case Tests (from edge-cases-failover.md)
// ============================================================

// --- EC-6: WAL LSN blocks promotion when primary dies under load ---

func TestEC6_WALLagBlocksPromotion(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("primary:9333")
	r.MarkBlockCapable("replica:9333")

	r.Register(&BlockVolumeEntry{
		Name:           "ec6-vol",
		VolumeServer:   "primary:9333",
		Path:           "/data/ec6-vol.blk",
		SizeBytes:      1 << 30,
		Epoch:          1,
		Role:           blockvol.RoleToWire(blockvol.RolePrimary),
		Status:         StatusActive,
		LeaseTTL:       30 * time.Second,
		LastLeaseGrant: time.Now(),
		WALHeadLSN:     500,
		Replicas: []ReplicaInfo{
			{
				Server:        "replica:9333",
				Path:          "/data/ec6-vol.blk",
				Role:          blockvol.RoleToWire(blockvol.RoleReplica),
				WALHeadLSN:    350, // 150 behind — exceeds tolerance of 100
				HealthScore:   1.0,
				LastHeartbeat: time.Now(),
			},
		},
	})

	// Primary dies.
	r.UnmarkBlockCapable("primary:9333")

	// Auto-promote should FAIL.
	_, err := r.PromoteBestReplica("ec6-vol")

	if err == nil {
		t.Log("EC-6 NOT reproduced: promotion succeeded despite 150-entry lag")
		return
	}

	t.Logf("EC-6 REPRODUCED: %v", err)

	// Verify the rejection is specifically "wal_lag".
	pf, _ := r.EvaluatePromotion("ec6-vol")
	hasWALLag := false
	for _, rej := range pf.Rejections {
		t.Logf(" rejected %s: %s", rej.Server, rej.Reason)
		if rej.Reason == "wal_lag" {
			hasWALLag = true
		}
	}
	if !hasWALLag {
		t.Fatal("expected wal_lag rejection")
	}

	t.Log("EC-6: volume stuck — replica has 70% of data but promotion blocked by stale primary LSN")
}

func TestEC6_ForcePromote_Bypasses(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("primary:9333")
	r.MarkBlockCapable("replica:9333")

	r.Register(&BlockVolumeEntry{
		Name:           "ec6-force",
		VolumeServer:   "primary:9333",
		Path:           "/data/ec6-force.blk",
		SizeBytes:      1 << 30,
		Epoch:          1,
		Role:           blockvol.RoleToWire(blockvol.RolePrimary),
		Status:         StatusActive,
		LeaseTTL:       30 * time.Second,
		LastLeaseGrant: time.Now(),
		WALHeadLSN:     500,
		Replicas: []ReplicaInfo{
			{
				Server:        "replica:9333",
				Path:          "/data/ec6-force.blk",
				Role:          blockvol.RoleToWire(blockvol.RoleReplica),
				WALHeadLSN:    350,
				HealthScore:   1.0,
				LastHeartbeat: time.Now(),
			},
		},
	})

	r.UnmarkBlockCapable("primary:9333")

	// Force promote bypasses the wal_lag gate.
	newEpoch, _, _, _, err := r.ManualPromote("ec6-force", "", true)
	if err != nil {
		t.Fatalf("force promote should succeed: %v", err)
	}

	entry, _ := r.Lookup("ec6-force")
	if entry.VolumeServer != "replica:9333" {
		t.Fatalf("primary=%s, want replica:9333", entry.VolumeServer)
	}

	t.Logf("EC-6 workaround: force promote succeeded — epoch=%d, new primary=%s", newEpoch, entry.VolumeServer)
}

func TestEC6_WithinTolerance_Succeeds(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("primary:9333")
	r.MarkBlockCapable("replica:9333")

	r.Register(&BlockVolumeEntry{
		Name:           "ec6-ok",
		VolumeServer:   "primary:9333",
		Path:           "/data/ec6-ok.blk",
		SizeBytes:      1 << 30,
		Epoch:          1,
		Role:           blockvol.RoleToWire(blockvol.RolePrimary),
		Status:         StatusActive,
		LeaseTTL:       30 * time.Second,
		LastLeaseGrant: time.Now(),
		WALHeadLSN:     500,
		Replicas: []ReplicaInfo{
			{
				Server:        "replica:9333",
				Path:          "/data/ec6-ok.blk",
				Role:          blockvol.RoleToWire(blockvol.RoleReplica),
				WALHeadLSN:    450, // 50 behind — within tolerance
				HealthScore:   1.0,
				LastHeartbeat: time.Now(),
			},
		},
	})

	r.UnmarkBlockCapable("primary:9333")

	_, err := r.PromoteBestReplica("ec6-ok")
	if err != nil {
		t.Fatalf("within tolerance should succeed: %v", err)
	}

	t.Log("EC-6 within tolerance: lag=50 (< 100) → promotion succeeded")
}
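Taken together, the three EC-6 tests pin down a simple decision rule for the WAL-lag gate. A self-contained sketch of that rule, assuming a 100-entry lag tolerance as the tests do (the function name, signature, and constant are illustrative, not the registry's actual API):

```go
package main

import "fmt"

// walLagTolerance mirrors the tolerance the tests above assume (100 entries);
// the name and value here are illustrative stand-ins.
const walLagTolerance = 100

// promotable applies the EC-6 rule these tests exercise: the WAL-lag gate
// only binds while the primary is alive; once the primary is dead, the best
// available replica is promoted regardless of lag.
func promotable(primaryAlive bool, primaryLSN, replicaLSN uint64) bool {
	if !primaryAlive {
		return true // EC-6 fix: dead primary skips the lag gate
	}
	if replicaLSN >= primaryLSN {
		return true // replica caught up (or ahead)
	}
	return primaryLSN-replicaLSN <= walLagTolerance
}

func main() {
	fmt.Println(promotable(true, 500, 350))  // false: 150 behind, primary alive (split-brain risk)
	fmt.Println(promotable(false, 500, 350)) // true: primary dead, gate skipped
	fmt.Println(promotable(true, 500, 450))  // true: 50 behind, within tolerance
}
```

The asymmetry is the point: with a live primary the gate prevents split-brain, while with a dead primary refusing to promote would only trade bounded data loss for an outage.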

// --- EC-5: Wrong primary after master restart ---

func TestEC5_FirstHeartbeatWins(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("replica:9333")
	r.MarkBlockCapable("primary:9333")

	// Replica heartbeats FIRST after master restart.
	r.UpdateFullHeartbeat("replica:9333", []*master_pb.BlockVolumeInfoMessage{
		{
			Path:       "/data/ec5-vol.blk",
			VolumeSize: 1 << 30,
			Epoch:      0, // no assignment yet
			Role:       0, // unknown
			WalHeadLsn: 100,
		},
	}, "")

	entry1, ok := r.Lookup("ec5-vol")
	if !ok {
		t.Fatal("volume should be auto-registered from first heartbeat")
	}
	t.Logf("after replica heartbeat: VolumeServer=%s epoch=%d", entry1.VolumeServer, entry1.Epoch)

	firstPrimary := entry1.VolumeServer

	// Real primary heartbeats SECOND with higher epoch and WAL.
	r.UpdateFullHeartbeat("primary:9333", []*master_pb.BlockVolumeInfoMessage{
		{
			Path:       "/data/ec5-vol.blk",
			VolumeSize: 1 << 30,
			Epoch:      1,
			Role:       blockvol.RoleToWire(blockvol.RolePrimary),
			WalHeadLsn: 200,
		},
	}, "")

	entry2, _ := r.Lookup("ec5-vol")
	t.Logf("after primary heartbeat: VolumeServer=%s epoch=%d WALHeadLSN=%d",
		entry2.VolumeServer, entry2.Epoch, entry2.WALHeadLSN)

	if entry2.VolumeServer == "replica:9333" && firstPrimary == "replica:9333" {
		t.Log("EC-5 REPRODUCED: first-heartbeat-wins race — replica is still primary")
		t.Log("Real primary (epoch=1, WAL=200) was ignored or reconciled incorrectly")
	} else if entry2.VolumeServer == "primary:9333" {
		t.Log("EC-5 mitigated: reconcileOnRestart corrected the primary via epoch/WAL comparison")
	} else {
		t.Logf("EC-5 unknown: VolumeServer=%s", entry2.VolumeServer)
	}
}
@ -0,0 +1,295 @@
name: baseline-full-roce
timeout: 20m

# Full baseline capture on 25Gbps RoCE.
# Captures: write IOPS, read IOPS, mixed 70/30, QD sweep.
# Results printed as a structured table for baseline storage.

env:
  master_url: "http://10.0.0.3:9433"
  volume_name: baseline
  vol_size: "2147483648"

topology:
  nodes:
    m01:
      host: 192.168.1.181
      user: testdev
      key: "/opt/work/testdev_key"
    m02:
      host: 192.168.1.184
      user: testdev
      key: "/opt/work/testdev_key"

phases:
  - name: cluster-start
    actions:
      - action: exec
        node: m02
        cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-bl-master /tmp/sw-bl-vs1 && mkdir -p /tmp/sw-bl-master /tmp/sw-bl-vs1/blocks"
        root: "true"
        ignore_error: true
      - action: exec
        node: m01
        cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-bl-vs2 && mkdir -p /tmp/sw-bl-vs2/blocks"
        root: "true"
        ignore_error: true

      - action: start_weed_master
        node: m02
        port: "9433"
        dir: /tmp/sw-bl-master
        extra_args: "-ip=10.0.0.3"
        save_as: master_pid

      - action: sleep
        duration: 3s

      - action: start_weed_volume
        node: m02
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-bl-vs1
        extra_args: "-block.dir=/tmp/sw-bl-vs1/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.3:4420 -ip=10.0.0.3"
        save_as: vs1_pid

      - action: start_weed_volume
        node: m01
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-bl-vs2
        extra_args: "-block.dir=/tmp/sw-bl-vs2/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.1:4420 -ip=10.0.0.1"
        save_as: vs2_pid

      - action: sleep
        duration: 3s

      - action: wait_cluster_ready
        node: m02
        master_url: "{{ master_url }}"

      - action: wait_block_servers
        count: "2"

  - name: create-volume
    actions:
      - action: create_block_volume
        name: "{{ volume_name }}"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "sync_all"

      - action: wait_volume_healthy
        name: "{{ volume_name }}"
        timeout: 60s

      - action: validate_replication
        volume_name: "{{ volume_name }}"
        expected_rf: "2"
        expected_durability: "sync_all"
        require_not_degraded: "true"
        require_cross_machine: "true"

  - name: connect-nvme
    actions:
      - action: lookup_block_volume
        name: "{{ volume_name }}"
        save_as: vol

      - action: exec
        node: m01
        cmd: "sh -c 'modprobe nvme_tcp; nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ vol_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.{{ volume_name }} >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
        root: "true"
        save_as: dev

      - action: print
        msg: "NVMe device: {{ dev }}"

  # === Write IOPS at different queue depths ===
  - name: write-qd1
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "1"
        runtime: "15"
        time_based: "true"
        name: w-qd1
        save_as: fio_w1
      - action: fio_parse
        json_var: fio_w1
        metric: iops
        save_as: w_qd1
      - action: print
        msg: "Write QD1: {{ w_qd1 }} IOPS"

  - name: write-qd8
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "8"
        runtime: "15"
        time_based: "true"
        name: w-qd8
        save_as: fio_w8
      - action: fio_parse
        json_var: fio_w8
        metric: iops
        save_as: w_qd8
      - action: print
        msg: "Write QD8: {{ w_qd8 }} IOPS"

  - name: write-qd32
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "15"
        time_based: "true"
        name: w-qd32
        save_as: fio_w32
      - action: fio_parse
        json_var: fio_w32
        metric: iops
        save_as: w_qd32
      - action: print
        msg: "Write QD32: {{ w_qd32 }} IOPS"

  - name: write-qd128
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "128"
        runtime: "15"
        time_based: "true"
        name: w-qd128
        save_as: fio_w128
      - action: fio_parse
        json_var: fio_w128
        metric: iops
        save_as: w_qd128
      - action: print
        msg: "Write QD128: {{ w_qd128 }} IOPS"

  # === Read IOPS ===
  - name: read-qd32
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randread
        bs: 4k
        iodepth: "32"
        runtime: "15"
        time_based: "true"
        name: r-qd32
        save_as: fio_r32
      - action: fio_parse
        json_var: fio_r32
        metric: iops
        direction: read
        save_as: r_qd32
      - action: print
        msg: "Read QD32: {{ r_qd32 }} IOPS"

  # === Mixed 70/30 read/write ===
  - name: mixed-qd32
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randrw
        bs: 4k
        iodepth: "32"
        runtime: "15"
        time_based: "true"
        rwmixread: "70"
        name: rw-qd32
        save_as: fio_rw32
      - action: fio_parse
        json_var: fio_rw32
        metric: iops
        save_as: rw_qd32
      - action: print
        msg: "Mixed 70/30 QD32: {{ rw_qd32 }} IOPS (write component)"

  # === Sequential write ===
  - name: seq-write
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: write
        bs: 1M
        iodepth: "8"
        runtime: "15"
        time_based: "true"
        name: seq-w
        save_as: fio_sw
      - action: fio_parse
        json_var: fio_sw
        metric: bw
        save_as: seq_w_bw
      - action: print
        msg: "Seq Write 1M: {{ seq_w_bw }} KB/s"

  - name: results
    actions:
      - action: print
        msg: "=========================================="
      - action: print
        msg: "BASELINE: RF=2 sync_all NVMe-TCP 25Gbps RoCE"
      - action: print
        msg: "=========================================="
      - action: print
        msg: "Write 4K QD1: {{ w_qd1 }} IOPS"
      - action: print
        msg: "Write 4K QD8: {{ w_qd8 }} IOPS"
      - action: print
        msg: "Write 4K QD32: {{ w_qd32 }} IOPS"
      - action: print
        msg: "Write 4K QD128: {{ w_qd128 }} IOPS"
      - action: print
        msg: "Read 4K QD32: {{ r_qd32 }} IOPS"
      - action: print
        msg: "Mixed 70/30 QD32: {{ rw_qd32 }} IOPS"
      - action: print
        msg: "Seq Write 1M QD8: {{ seq_w_bw }} KB/s"
      - action: print
        msg: "=========================================="

      - action: collect_results
        title: "Full Baseline — RF=2 sync_all NVMe-TCP 25Gbps RoCE"
        volume_name: "{{ volume_name }}"

  - name: cleanup
    always: true
    actions:
      - action: exec
        node: m01
        cmd: "nvme disconnect-all 2>/dev/null; true"
        root: "true"
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
@ -0,0 +1,240 @@
name: robust-shipper-lifecycle
timeout: 5m

# Robust dimension: Shipper degradation → recovery lifecycle.
# Uses /debug/block/shipper for real-time state (no heartbeat lag).
# Uses exact replication ports from lookup_block_volume (not hardcoded range).

env:
  master_url: "http://192.168.1.184:9433"
  volume_name: robust-ship
  vol_size: "1073741824"

topology:
  nodes:
    m01:
      host: 192.168.1.181
      user: testdev
      key: "/opt/work/testdev_key"
    m02:
      host: 192.168.1.184
      user: testdev
      key: "/opt/work/testdev_key"

phases:
  - name: cluster-start
    actions:
      - action: exec
        node: m02
        cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-ship-master /tmp/sw-ship-vs1 && mkdir -p /tmp/sw-ship-master /tmp/sw-ship-vs1/blocks"
        root: "true"
        ignore_error: true
      - action: exec
        node: m01
        cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-ship-vs2 && mkdir -p /tmp/sw-ship-vs2/blocks"
        root: "true"
        ignore_error: true

      - action: start_weed_master
        node: m02
        port: "9433"
        dir: /tmp/sw-ship-master
        save_as: master_pid

      - action: sleep
        duration: 3s

      - action: start_weed_volume
        node: m02
        port: "18480"
        master: "localhost:9433"
        dir: /tmp/sw-ship-vs1
        extra_args: "-block.dir=/tmp/sw-ship-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
        save_as: vs1_pid

      - action: start_weed_volume
        node: m01
        port: "18480"
        master: "192.168.1.184:9433"
        dir: /tmp/sw-ship-vs2
        extra_args: "-block.dir=/tmp/sw-ship-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
        save_as: vs2_pid

      - action: sleep
        duration: 3s

      - action: wait_cluster_ready
        node: m02
        master_url: "{{ master_url }}"

      - action: wait_block_servers
        count: "2"

  - name: create-and-write
    actions:
      - action: create_block_volume
        name: "{{ volume_name }}"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "sync_all"

      - action: wait_volume_healthy
        name: "{{ volume_name }}"
        timeout: 60s

      # Lookup to get exact replication ports + primary host.
      - action: lookup_block_volume
        name: "{{ volume_name }}"
        save_as: vol

      - action: print
        msg: "primary={{ vol_primary_host }} repl_data={{ vol_replica_data_addr }} repl_port={{ vol_replica_data_port }}"

      - action: iscsi_login_direct
        node: m01
        host: "{{ vol_iscsi_host }}"
        port: "{{ vol_iscsi_port }}"
        iqn: "{{ vol_iqn }}"
        save_as: device

      # Wait for master assignment to set up shipper (heartbeat cycle ≈ 25s).
      - action: sleep
        duration: 30s

      # Write to trigger shipper connection (sync_all → barrier).
      - action: exec
        node: m01
        cmd: "dd if=/dev/urandom of={{ device }} bs=4k count=10 oflag=direct,sync 2>/dev/null"
        root: "true"
        ignore_error: true

      - action: sleep
        duration: 5s

  - name: verify-in-sync
    actions:
      # Instant shipper state check.
      - action: poll_shipper_state
        node: m01
        host: "{{ vol_primary_host }}"
        port: "18480"
        save_as: ship_initial
        ignore_error: true

      - action: print
        msg: "=== Initial: state={{ ship_initial_state }} flushed={{ ship_initial_flushed_lsn }} ==="

  - name: block-repl-ports
    actions:
      - action: print
        msg: "=== Blocking replication port {{ vol_replica_data_port }} ==="

      # Block exact replication data + ctrl ports from primary to replica.
      - action: exec
        node: m01
        cmd: "iptables -A OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset; iptables -A OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset"
        root: "true"
        ignore_error: true
      - action: exec
        node: m02
        cmd: "iptables -A OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset; iptables -A OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset"
        root: "true"
        ignore_error: true

      # Write to trigger Ship() failure → degradation.
      - action: exec
        node: m01
        cmd: "dd if=/dev/urandom of={{ device }} bs=4k count=10 oflag=direct,sync 2>/dev/null"
        root: "true"
        ignore_error: true

      - action: sleep
        duration: 3s

      # Poll — should be degraded now.
      - action: poll_shipper_state
        node: m01
        host: "{{ vol_primary_host }}"
        port: "18480"
        expected_state: degraded
        timeout: 10s
        save_as: ship_degraded

      - action: print
        msg: "=== Degraded: state={{ ship_degraded_state }} flushed={{ ship_degraded_flushed_lsn }} ==="

  - name: unblock-and-recover
    actions:
      - action: print
        msg: "=== Unblocking replication ports ==="

      - action: exec
        node: m01
        cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
        root: "true"
      - action: exec
        node: m02
        cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
        root: "true"

      # Write with sync to trigger barrier → reconnect.
      - action: exec
        node: m01
        cmd: "dd if=/dev/urandom of={{ device }} bs=4k count=10 oflag=direct,sync 2>/dev/null"
        root: "true"
        ignore_error: true

      - action: sleep
        duration: 5s

      # Poll — should recover via barrier-triggered reconnect.
      - action: poll_shipper_state
        node: m01
        host: "{{ vol_primary_host }}"
        port: "18480"
        expected_state: in_sync
        timeout: 30s
        save_as: ship_recovered

      - action: print
        msg: "=== Recovered: state={{ ship_recovered_state }} flushed={{ ship_recovered_flushed_lsn }} duration={{ ship_recovered_duration_ms }}ms ==="

  - name: results
    actions:
      - action: print
        msg: "=== Shipper Lifecycle ==="
      - action: print
        msg: "Initial: {{ ship_initial_state }} (flushed={{ ship_initial_flushed_lsn }})"
      - action: print
        msg: "Degraded: {{ ship_degraded_state }} (flushed={{ ship_degraded_flushed_lsn }})"
      - action: print
        msg: "Recovered: {{ ship_recovered_state }} (flushed={{ ship_recovered_flushed_lsn }}, {{ ship_recovered_duration_ms }}ms)"

  - name: cleanup
    always: true
    actions:
      - action: exec
        node: m01
        cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
        root: "true"
        ignore_error: true
      - action: exec
        node: m02
        cmd: "iptables -D OUTPUT -p tcp --dport {{ vol_replica_data_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; iptables -D OUTPUT -p tcp --dport {{ vol_replica_ctrl_port }} -j REJECT --reject-with tcp-reset 2>/dev/null; true"
        root: "true"
        ignore_error: true
      - action: iscsi_cleanup
        node: m01
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
@ -0,0 +1,236 @@
name: stable-degraded-sync-quorum
timeout: 12m

# Stable dimension: RF=3 sync_quorum with one replica down.
#
# sync_quorum requires quorum (RF/2+1 = 2) of 3 nodes durable.
# With one replica dead, 2 nodes remain — quorum met, writes succeed.
# Expected: near-zero IOPS penalty (unlike sync_all which drops 66%).
#
# Comparison:
#   sync_all degraded:    27,863 → 9,442 IOPS (−66%)
#   best_effort degraded: 35,544 → 37,653 IOPS (+6%)
#   sync_quorum degraded: ? → ? (this test)

env:
  master_url: "http://10.0.0.3:9433"
  volume_name: stable-quorum
  vol_size: "1073741824"

topology:
  nodes:
    m01:
      host: 192.168.1.181
      alt_ips: ["10.0.0.1"]
      user: testdev
      key: "/opt/work/testdev_key"
    m02:
      host: 192.168.1.184
      alt_ips: ["10.0.0.3"]
      user: testdev
      key: "/opt/work/testdev_key"

# NOTE: RF=3 requires 3 volume servers. With only 2 physical nodes,
# we start 2 VS on m02 (ports 18480 and 18481) and 1 on m01.
# This tests the quorum logic, not the physical redundancy.

phases:
  - name: cluster-start
    actions:
      - action: exec
        node: m02
        cmd: "fuser -k 9433/tcp 18480/tcp 18481/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-quorum-master /tmp/sw-quorum-vs1 /tmp/sw-quorum-vs3 && mkdir -p /tmp/sw-quorum-master /tmp/sw-quorum-vs1/blocks /tmp/sw-quorum-vs3/blocks"
        root: "true"
        ignore_error: true
      - action: exec
        node: m01
        cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-quorum-vs2 && mkdir -p /tmp/sw-quorum-vs2/blocks"
        root: "true"
        ignore_error: true

      - action: start_weed_master
        node: m02
        port: "9433"
        dir: /tmp/sw-quorum-master
        extra_args: "-ip=10.0.0.3"
        save_as: master_pid

      - action: sleep
        duration: 3s

      - action: start_weed_volume
        node: m02
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-quorum-vs1
        extra_args: "-block.dir=/tmp/sw-quorum-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
        save_as: vs1_pid

      - action: start_weed_volume
        node: m01
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-quorum-vs2
        extra_args: "-block.dir=/tmp/sw-quorum-vs2/blocks -block.listen=:3296 -ip=10.0.0.1"
        save_as: vs2_pid

      # Third VS on m02, different port.
      - action: start_weed_volume
        node: m02
        port: "18481"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-quorum-vs3
        extra_args: "-block.dir=/tmp/sw-quorum-vs3/blocks -block.listen=:3297 -ip=10.0.0.3"
        save_as: vs3_pid

      - action: sleep
        duration: 3s

      - action: wait_cluster_ready
        node: m02
        master_url: "{{ master_url }}"

      - action: wait_block_servers
        count: "3"
        timeout: 30s

  - name: create-volume
    actions:
      - action: create_block_volume
        name: "{{ volume_name }}"
        size_bytes: "{{ vol_size }}"
        replica_factor: "3"
        durability_mode: "sync_quorum"

      - action: wait_volume_healthy
        name: "{{ volume_name }}"
        timeout: 60s

      # Force primary to m02:18480 for deterministic setup.
      - action: block_promote
        name: "{{ volume_name }}"
        target_server: "10.0.0.3:18480"
        force: "true"
        reason: "quorum-setup"

      - action: sleep
        duration: 5s

      - action: wait_volume_healthy
        name: "{{ volume_name }}"
        timeout: 60s

      - action: print
        msg: "RF=3 sync_quorum volume created, primary on m02"

  - name: connect
    actions:
      - action: lookup_block_volume
        name: "{{ volume_name }}"
        save_as: vol

      - action: iscsi_login_direct
        node: m01
        host: "{{ vol_iscsi_host }}"
        port: "{{ vol_iscsi_port }}"
        iqn: "{{ vol_iqn }}"
        save_as: device

  - name: baseline-healthy
    actions:
      - action: print
        msg: "=== Baseline: RF=3 sync_quorum, all 3 replicas healthy ==="

      - action: fio_json
        node: m01
        device: "{{ device }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: baseline-healthy
        save_as: fio_healthy

      - action: fio_parse
        json_var: fio_healthy
        metric: iops
        save_as: iops_healthy

      - action: print
        msg: "Healthy RF=3: {{ iops_healthy }} IOPS"

  - name: kill-one-replica
    actions:
      - action: print
        msg: "=== Killing one replica (m01 VS) — quorum still met ==="

      - action: exec
        node: m01
        cmd: "kill -9 {{ vs2_pid }}"
        root: "true"
        ignore_error: true

      # Wait for shipper to detect dead replica.
      - action: sleep
        duration: 10s

  - name: degraded-one-dead
    actions:
      - action: print
        msg: "=== Degraded: 1 of 3 replicas dead, quorum=2 still met ==="

      - action: fio_json
        node: m01
        device: "{{ device }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: degraded-one
        save_as: fio_deg1

      - action: fio_parse
        json_var: fio_deg1
        metric: iops
        save_as: iops_deg1

      - action: print
        msg: "Degraded (1 dead): {{ iops_deg1 }} IOPS"

  - name: results
    actions:
      - action: print
        msg: "=== Stable: Degraded Mode (sync_quorum RF=3, iSCSI/RoCE) ==="
      - action: print
        msg: "Healthy (3/3): {{ iops_healthy }} IOPS"
      - action: print
        msg: "Degraded (2/3): {{ iops_deg1 }} IOPS"

      - action: collect_results
        title: "Stable: Degraded Mode (sync_quorum RF=3)"
        volume_name: "{{ volume_name }}"

  - name: cleanup
    always: true
    actions:
      - action: iscsi_cleanup
        node: m01
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs3_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
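The quorum arithmetic the scenario above relies on (quorum = RF/2 + 1, so RF=3 tolerates one dead replica while RF=2 tolerates none) can be sketched as a few lines of Go:

```go
package main

import "fmt"

// quorum returns the majority count sync_quorum needs durable,
// matching the RF/2+1 rule in the scenario comment.
func quorum(rf int) int {
	return rf/2 + 1
}

func main() {
	for _, rf := range []int{2, 3, 5} {
		q := quorum(rf)
		// With RF replicas, up to rf-q failures still leave the quorum intact.
		fmt.Printf("RF=%d quorum=%d tolerates %d failure(s)\n", rf, q, rf-q)
	}
}
```

Note that RF=2 gives quorum=2, i.e. sync_quorum at RF=2 behaves like sync_all; RF=3 is the smallest factor where the degraded case in this scenario is expected to keep its IOPS.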
@ -0,0 +1,244 @@
name: stable-netem-sweep-roce
timeout: 15m

# Stable dimension on 25Gbps RoCE: write IOPS under latency injection.
# Uses NVMe-TCP for client path, netem on replication link.

env:
  master_url: "http://10.0.0.3:9433"
  volume_name: stable-roce
  vol_size: "2147483648"

topology:
  nodes:
    m01:
      host: 192.168.1.181
      user: testdev
      key: "/opt/work/testdev_key"
    m02:
      host: 192.168.1.184
      user: testdev
      key: "/opt/work/testdev_key"

phases:
  - name: cluster-start
    actions:
      - action: exec
        node: m02
        cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nroce-master /tmp/sw-nroce-vs1 && mkdir -p /tmp/sw-nroce-master /tmp/sw-nroce-vs1/blocks"
        root: "true"
        ignore_error: true
      - action: exec
        node: m01
        cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nroce-vs2 && mkdir -p /tmp/sw-nroce-vs2/blocks"
        root: "true"
        ignore_error: true

      - action: start_weed_master
        node: m02
        port: "9433"
        dir: /tmp/sw-nroce-master
        extra_args: "-ip=10.0.0.3"
        save_as: master_pid

      - action: sleep
        duration: 3s

      - action: start_weed_volume
        node: m02
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-nroce-vs1
        extra_args: "-block.dir=/tmp/sw-nroce-vs1/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.3:4420 -ip=10.0.0.3"
        save_as: vs1_pid

      - action: start_weed_volume
        node: m01
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-nroce-vs2
        extra_args: "-block.dir=/tmp/sw-nroce-vs2/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.1:4420 -ip=10.0.0.1"
        save_as: vs2_pid

      - action: sleep
        duration: 3s

      - action: wait_cluster_ready
        node: m02
        master_url: "{{ master_url }}"

      - action: wait_block_servers
        count: "2"

  - name: create-volume
    actions:
      - action: create_block_volume
        name: "{{ volume_name }}"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "sync_all"

      - action: wait_volume_healthy
        name: "{{ volume_name }}"
        timeout: 60s

  - name: connect-nvme
    actions:
      - action: lookup_block_volume
        name: "{{ volume_name }}"
        save_as: vol

      - action: exec
        node: m01
        cmd: "sh -c 'modprobe nvme_tcp; nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ vol_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.{{ volume_name }} >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
        root: "true"
        save_as: dev

  # Baseline
  - name: baseline
    actions:
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: baseline
        save_as: fio_0
      - action: fio_parse
        json_var: fio_0
        metric: iops
        save_as: iops_0
      - action: print
        msg: "0ms: {{ iops_0 }} IOPS"

  # 1ms
  - name: netem-1ms
    actions:
      - action: inject_netem
        node: m02
        target_ip: "10.0.0.1"
        delay_ms: "1"
      - action: sleep
        duration: 2s
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: netem-1ms
        save_as: fio_1
      - action: fio_parse
        json_var: fio_1
        metric: iops
        save_as: iops_1
      - action: print
        msg: "1ms: {{ iops_1 }} IOPS"
      - action: clear_fault
        node: m02
        type: netem
      - action: sleep
        duration: 2s

  # 5ms
  - name: netem-5ms
    actions:
      - action: inject_netem
        node: m02
        target_ip: "10.0.0.1"
        delay_ms: "5"
      - action: sleep
        duration: 2s
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: netem-5ms
        save_as: fio_5
      - action: fio_parse
        json_var: fio_5
        metric: iops
        save_as: iops_5
      - action: print
        msg: "5ms: {{ iops_5 }} IOPS"
      - action: clear_fault
        node: m02
        type: netem
      - action: sleep
        duration: 2s

  # 20ms
  - name: netem-20ms
    actions:
      - action: inject_netem
        node: m02
        target_ip: "10.0.0.1"
        delay_ms: "20"
      - action: sleep
        duration: 2s
      - action: fio_json
        node: m01
        device: "{{ dev }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: netem-20ms
        save_as: fio_20
      - action: fio_parse
        json_var: fio_20
        metric: iops
        save_as: iops_20
      - action: print
        msg: "20ms: {{ iops_20 }} IOPS"
      - action: clear_fault
        node: m02
        type: netem

  - name: results
    actions:
      - action: print
        msg: "=== Stable: Netem Sweep NVMe/RoCE (sync_all RF=2) ==="
      - action: print
        msg: "0ms: {{ iops_0 }} IOPS"
      - action: print
        msg: "1ms: {{ iops_1 }} IOPS"
      - action: print
        msg: "5ms: {{ iops_5 }} IOPS"
      - action: print
        msg: "20ms: {{ iops_20 }} IOPS"

  - name: cleanup
    always: true
    actions:
      - action: clear_fault
        node: m02
        type: netem
        ignore_error: true
      - action: exec
        node: m01
        cmd: "nvme disconnect-all 2>/dev/null; true"
        root: "true"
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
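The sweep above measures the IOPS cost of added replication latency under sync_all, where each write waits for a replica ack. A back-of-envelope Go model (not a measurement — `baseLatMs` is an assumed no-delay service time, and real pipelining may beat this ceiling) helps sanity-check the shape of the results: one-way netem delay `d` adds roughly one round trip (`2d`) to write latency, and at fixed queue depth the IOPS ceiling is `qd / latency`:

```go
package main

import "fmt"

// iopsCeiling estimates the queue-depth-limited IOPS ceiling for a
// sync_all write path where each write pays one extra round trip (2x
// the injected one-way delay) on the replication link.
func iopsCeiling(qd int, baseLatMs, delayMs float64) float64 {
	latSec := (baseLatMs + 2*delayMs) / 1000.0
	return float64(qd) / latSec
}

func main() {
	const qd = 32          // matches the fio iodepth in the scenario
	const baseLatMs = 1.0  // illustrative assumption, not measured
	for _, d := range []float64{0, 1, 5, 20} {
		fmt.Printf("delay=%2.0fms ceiling=%8.0f IOPS\n", d, iopsCeiling(qd, baseLatMs, d))
	}
}
```

The model predicts a steep drop from 1ms onward, which is why the sweep steps 0 → 1 → 5 → 20ms rather than sampling finer at the high end.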
@ -0,0 +1,291 @@
name: stable-packet-loss
timeout: 15m

# Stable dimension: measure write IOPS under increasing packet loss
# on the replication link. Uses netem loss (not delay).
#
# Loss levels: 0% (baseline), 0.1%, 1%, 5%
# Workload: 4K random write, QD16, 30s per level

env:
  master_url: "http://192.168.1.184:9433"
  volume_name: stable-loss
  vol_size: "1073741824"

topology:
  nodes:
    m01:
      host: 192.168.1.181
      user: testdev
      key: "/opt/work/testdev_key"
    m02:
      host: 192.168.1.184
      user: testdev
      key: "/opt/work/testdev_key"

phases:
  - name: cluster-start
    actions:
      - action: exec
        node: m02
        cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-loss-master /tmp/sw-loss-vs1 && mkdir -p /tmp/sw-loss-master /tmp/sw-loss-vs1/blocks"
        root: "true"
        ignore_error: true
      - action: exec
        node: m01
        cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-loss-vs2 && mkdir -p /tmp/sw-loss-vs2/blocks"
        root: "true"
        ignore_error: true

      - action: start_weed_master
        node: m02
        port: "9433"
        dir: /tmp/sw-loss-master
        save_as: master_pid

      - action: sleep
        duration: 3s

      - action: start_weed_volume
        node: m02
        port: "18480"
        master: "localhost:9433"
        dir: /tmp/sw-loss-vs1
        extra_args: "-block.dir=/tmp/sw-loss-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
        save_as: vs1_pid

      - action: start_weed_volume
        node: m01
        port: "18480"
        master: "192.168.1.184:9433"
        dir: /tmp/sw-loss-vs2
        extra_args: "-block.dir=/tmp/sw-loss-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
        save_as: vs2_pid

      - action: sleep
        duration: 3s

      - action: wait_cluster_ready
        node: m02
        master_url: "{{ master_url }}"

      - action: wait_block_servers
        count: "2"

  - name: create-volume
    actions:
      - action: create_block_volume
        name: "{{ volume_name }}"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "sync_all"

      - action: wait_volume_healthy
        name: "{{ volume_name }}"
        timeout: 60s

      - action: validate_replication
        volume_name: "{{ volume_name }}"
        expected_rf: "2"
        expected_durability: "sync_all"
        require_not_degraded: "true"
        require_cross_machine: "true"

  - name: connect
    actions:
      - action: lookup_block_volume
        name: "{{ volume_name }}"
        save_as: vol

      - action: iscsi_login_direct
        node: m01
        host: "{{ vol_iscsi_host }}"
        port: "{{ vol_iscsi_port }}"
        iqn: "{{ vol_iqn }}"
        save_as: device

  # === Baseline: 0% loss ===
  - name: baseline
    actions:
      - action: print
        msg: "=== Baseline: 0% packet loss ==="

      - action: fio_json
        node: m01
        device: "{{ device }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: baseline
        save_as: fio_0

      - action: fio_parse
        json_var: fio_0
        metric: iops
        save_as: iops_0

      - action: print
        msg: "0% loss: {{ iops_0 }} IOPS"

  # === 0.1% packet loss ===
  - name: loss-01pct
    actions:
      - action: print
        msg: "=== Injecting 0.1% packet loss ==="

      - action: exec
        node: m02
        cmd: "tc qdisc add dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root netem loss 0.1%"
        root: "true"

      - action: sleep
        duration: 2s

      - action: fio_json
        node: m01
        device: "{{ device }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: loss-01pct
        save_as: fio_01

      - action: fio_parse
        json_var: fio_01
        metric: iops
        save_as: iops_01

      - action: print
        msg: "0.1% loss: {{ iops_01 }} IOPS"

      - action: exec
        node: m02
        cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
        root: "true"

      - action: sleep
        duration: 2s

  # === 1% packet loss ===
  - name: loss-1pct
    actions:
      - action: print
        msg: "=== Injecting 1% packet loss ==="

      - action: exec
        node: m02
        cmd: "tc qdisc add dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root netem loss 1%"
        root: "true"

      - action: sleep
        duration: 2s

      - action: fio_json
        node: m01
        device: "{{ device }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: loss-1pct
        save_as: fio_1

      - action: fio_parse
        json_var: fio_1
        metric: iops
        save_as: iops_1

      - action: print
        msg: "1% loss: {{ iops_1 }} IOPS"

      - action: exec
        node: m02
        cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
        root: "true"

      - action: sleep
        duration: 2s

  # === 5% packet loss ===
  - name: loss-5pct
    actions:
      - action: print
        msg: "=== Injecting 5% packet loss ==="

      - action: exec
        node: m02
        cmd: "tc qdisc add dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root netem loss 5%"
        root: "true"

      - action: sleep
        duration: 2s

      - action: fio_json
        node: m01
        device: "{{ device }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: loss-5pct
        save_as: fio_5

      - action: fio_parse
        json_var: fio_5
        metric: iops
        save_as: iops_5

      - action: print
        msg: "5% loss: {{ iops_5 }} IOPS"

      - action: exec
        node: m02
        cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
        root: "true"

  - name: results
    actions:
      - action: print
        msg: "=== Stable Dimension: Packet Loss Sweep (V1 sync_all RF=2) ==="
      - action: print
        msg: "0% (baseline): {{ iops_0 }} IOPS"
      - action: print
        msg: "0.1% loss: {{ iops_01 }} IOPS"
      - action: print
        msg: "1% loss: {{ iops_1 }} IOPS"
      - action: print
        msg: "5% loss: {{ iops_5 }} IOPS"

      - action: collect_results
        title: "Stable: Packet Loss Sweep (V1 sync_all RF=2)"
        volume_name: "{{ volume_name }}"

  - name: cleanup
    always: true
    actions:
      - action: exec
        node: m02
        cmd: "tc qdisc del dev $(ip route get 192.168.1.181 | head -1 | awk '{for(i=1;i<=NF;i++) if($i==\"dev\") print $(i+1)}') root 2>/dev/null; true"
        root: "true"
        ignore_error: true
      - action: iscsi_cleanup
        node: m01
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
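The `tc` one-liners in the scenario above all share the same trick: resolve the egress interface for the peer's IP from `ip route get` output by taking the token after `dev`. A small Go sketch of that extraction (the interface name in the sample is illustrative; real hosts will differ):

```go
package main

import (
	"fmt"
	"strings"
)

// egressDev extracts the interface name following the "dev" token from
// the first line of `ip route get <ip>` output — the same parsing the
// scenario's awk one-liner does to pick the netem target device.
func egressDev(routeOutput string) string {
	line := strings.SplitN(routeOutput, "\n", 2)[0]
	fields := strings.Fields(line)
	for i, f := range fields {
		if f == "dev" && i+1 < len(fields) {
			return fields[i+1]
		}
	}
	return ""
}

func main() {
	// Sample output shape for `ip route get 192.168.1.181` on m02.
	out := "192.168.1.181 dev ens1f0 src 192.168.1.184 uid 0\n    cache"
	fmt.Println(egressDev(out)) // prints "ens1f0"
}
```

Deriving the device at run time keeps the scenario portable across hosts whose NIC names differ, at the cost of assuming the route's first line always carries a `dev` field.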
@ -0,0 +1,264 @@ |
|||
name: stable-replication-tax-nvme-roce |
|||
timeout: 10m |
|||
|
|||
# Replication tax on 25Gbps RoCE via NVMe-TCP (not iSCSI). |
|||
# Previous results (iSCSI over RoCE): |
|||
# RF=1: 54,785 | RF=2 be: 28,408 | RF=2 sa: 29,893 |
|||
# |
|||
# NVMe-TCP should have lower protocol overhead than iSCSI, |
|||
# potentially showing higher IOPS and a different replication tax ratio. |
|||
|
|||
env: |
|||
master_url: "http://10.0.0.3:9433" |
|||
vol_size: "2147483648" |
|||
|
|||
topology: |
|||
nodes: |
|||
m01: |
|||
host: 192.168.1.181 |
|||
user: testdev |
|||
key: "/opt/work/testdev_key" |
|||
m02: |
|||
host: 192.168.1.184 |
|||
user: testdev |
|||
key: "/opt/work/testdev_key" |
|||
|
|||
phases: |
|||
- name: cluster-start |
|||
actions: |
|||
- action: exec |
|||
node: m02 |
|||
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nvme-master /tmp/sw-nvme-vs1 && mkdir -p /tmp/sw-nvme-master /tmp/sw-nvme-vs1/blocks" |
|||
root: "true" |
|||
ignore_error: true |
|||
- action: exec |
|||
node: m01 |
|||
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-nvme-vs2 && mkdir -p /tmp/sw-nvme-vs2/blocks" |
|||
root: "true" |
|||
ignore_error: true |
|||
|
|||
- action: start_weed_master |
|||
node: m02 |
|||
port: "9433" |
|||
dir: /tmp/sw-nvme-master |
|||
extra_args: "-ip=10.0.0.3" |
|||
save_as: master_pid |
|||
|
|||
- action: sleep |
|||
duration: 3s |
|||
|
|||
# VS1 on m02 with NVMe-TCP enabled on RoCE IP. |
|||
- action: start_weed_volume |
|||
node: m02 |
|||
port: "18480" |
|||
master: "10.0.0.3:9433" |
|||
dir: /tmp/sw-nvme-vs1 |
|||
extra_args: "-block.dir=/tmp/sw-nvme-vs1/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.3:4420 -ip=10.0.0.3" |
|||
save_as: vs1_pid |
|||
|
|||
# VS2 on m01 with NVMe-TCP on RoCE IP. |
|||
- action: start_weed_volume |
|||
node: m01 |
|||
port: "18480" |
|||
master: "10.0.0.3:9433" |
|||
dir: /tmp/sw-nvme-vs2 |
|||
extra_args: "-block.dir=/tmp/sw-nvme-vs2/blocks -block.listen=:3295 -block.nvme.enable=true -block.nvme.listen=10.0.0.1:4420 -ip=10.0.0.1" |
|||
save_as: vs2_pid |
|||
|
|||
- action: sleep |
|||
duration: 3s |
|||
|
|||
- action: wait_cluster_ready |
|||
node: m02 |
|||
master_url: "{{ master_url }}" |
|||
|
|||
- action: wait_block_servers |
|||
count: "2" |
|||
|
|||
# === RF=1 via NVMe-TCP === |
|||
- name: rf1-nvme |
|||
actions: |
|||
- action: create_block_volume |
|||
name: "nvme-rf1" |
|||
size_bytes: "{{ vol_size }}" |
|||
replica_factor: "1" |
|||
durability_mode: "best_effort" |
|||
|
|||
- action: sleep |
|||
duration: 2s |
|||
|
|||
- action: lookup_block_volume |
|||
name: "nvme-rf1" |
|||
save_as: rf1 |
|||
|
|||
# Connect via NVMe-TCP. |
|||
- action: exec |
|||
node: m01 |
|||
cmd: "sh -c 'modprobe nvme_tcp; nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ rf1_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.nvme-rf1 >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'" |
|||
root: "true" |
|||
save_as: nvme_dev_rf1 |
|||
|
|||
- action: print |
|||
msg: "RF=1 NVMe device: {{ nvme_dev_rf1 }}" |
|||
|
|||
- action: fio_json |
|||
node: m01 |
|||
device: "{{ nvme_dev_rf1 }}" |
|||
rw: randwrite |
|||
bs: 4k |
|||
iodepth: "32" |
|||
runtime: "30" |
|||
time_based: "true" |
|||
name: rf1-nvme |
|||
save_as: fio_rf1 |
|||
|
|||
- action: fio_parse |
|||
json_var: fio_rf1 |
|||
metric: iops |
|||
save_as: iops_rf1 |
|||
|
|||
- action: print |
|||
        msg: "RF=1 NVMe (RoCE): {{ iops_rf1 }} IOPS"

      - action: exec
        node: m01
        cmd: "nvme disconnect-all 2>/dev/null; true"
        root: "true"
        ignore_error: true

  # === RF=2 best_effort via NVMe-TCP ===
  - name: rf2-be-nvme
    actions:
      - action: create_block_volume
        name: "nvme-rf2be"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "best_effort"

      - action: wait_volume_healthy
        name: "nvme-rf2be"
        timeout: 60s

      - action: lookup_block_volume
        name: "nvme-rf2be"
        save_as: rf2be

      - action: exec
        node: m01
        cmd: "sh -c 'nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ rf2be_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.nvme-rf2be >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
        root: "true"
        save_as: nvme_dev_rf2be

      - action: print
        msg: "RF=2 BE NVMe device: {{ nvme_dev_rf2be }}"

      - action: fio_json
        node: m01
        device: "{{ nvme_dev_rf2be }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: rf2-be-nvme
        save_as: fio_rf2be

      - action: fio_parse
        json_var: fio_rf2be
        metric: iops
        save_as: iops_rf2be

      - action: print
        msg: "RF=2 best_effort NVMe (RoCE): {{ iops_rf2be }} IOPS"

      - action: exec
        node: m01
        cmd: "nvme disconnect-all 2>/dev/null; true"
        root: "true"
        ignore_error: true

  # === RF=2 sync_all via NVMe-TCP ===
  - name: rf2-sa-nvme
    actions:
      - action: create_block_volume
        name: "nvme-rf2sa"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "sync_all"

      - action: wait_volume_healthy
        name: "nvme-rf2sa"
        timeout: 60s

      - action: lookup_block_volume
        name: "nvme-rf2sa"
        save_as: rf2sa

      - action: exec
        node: m01
        cmd: "sh -c 'nvme disconnect-all >/dev/null 2>&1; nvme connect -t tcp -a {{ rf2sa_iscsi_host }} -s 4420 -n nqn.2024-01.com.seaweedfs:vol.nvme-rf2sa >/dev/null 2>&1; sleep 2; lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d\" \" -f1'"
        root: "true"
        save_as: nvme_dev_rf2sa

      - action: print
        msg: "RF=2 SA NVMe device: {{ nvme_dev_rf2sa }}"

      - action: fio_json
        node: m01
        device: "{{ nvme_dev_rf2sa }}"
        rw: randwrite
        bs: 4k
        iodepth: "32"
        runtime: "30"
        time_based: "true"
        name: rf2-sa-nvme
        save_as: fio_rf2sa

      - action: fio_parse
        json_var: fio_rf2sa
        metric: iops
        save_as: iops_rf2sa

      - action: print
        msg: "RF=2 sync_all NVMe (RoCE): {{ iops_rf2sa }} IOPS"

      - action: exec
        node: m01
        cmd: "nvme disconnect-all 2>/dev/null; true"
        root: "true"
        ignore_error: true

  - name: results
    actions:
      - action: print
        msg: "=== Replication Tax — NVMe-TCP over 25Gbps RoCE (4K randwrite QD32) ==="
      - action: print
        msg: "RF=1 (no repl): {{ iops_rf1 }} IOPS"
      - action: print
        msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"
      - action: print
        msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"

      - action: collect_results
        title: "Replication Tax — NVMe-TCP over 25Gbps RoCE"

  - name: cleanup
    always: true
    actions:
      - action: exec
        node: m01
        cmd: "nvme disconnect-all 2>/dev/null; true"
        root: "true"
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
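The `nvme connect` exec steps above find the freshly attached namespace by grepping `lsblk` output for the 2G size string. A minimal Python sketch of that selection logic (the sample `lsblk` output is hypothetical; the semantics mirror `grep 2G | head -1 | cut -d" " -f1`, including the fragility that any size containing "2G", such as "32G", would also match):

```python
def pick_device(lsblk_output):
    # Mimics: lsblk -dpno NAME,SIZE | grep 2G | head -1 | cut -d" " -f1
    for line in lsblk_output.splitlines():
        if "2G" in line:               # grep 2G (substring match, like grep)
            return line.split(" ")[0]  # head -1 + cut -d" " -f1
    return None

# Hypothetical lsblk -dpno NAME,SIZE output after nvme connect.
sample = """/dev/sda 120G
/dev/nvme0n1 2G
/dev/nvme1n1 2G"""

print(pick_device(sample))  # → /dev/nvme0n1
```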
@@ -0,0 +1,253 @@
name: stable-replication-tax-roce
timeout: 10m

# Replication tax on 25Gbps RoCE network (10.0.0.x).
# Compares RF=1, RF=2 best_effort, RF=2 sync_all.
#
# Previous run on 1Gbps (192.168.1.x):
#   RF=1: 54,650 IOPS | RF=2 be: 22,773 | RF=2 sa: 23,846
#   Replication tax: ~58% — mostly 1Gbps wire limit
#
# RoCE (25Gbps) should remove the wire bottleneck and show
# the true protocol overhead.

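The ~58% figure in the comment follows directly from the quoted 1Gbps numbers; a quick arithmetic check:

```python
# IOPS figures from the 1Gbps baseline quoted in the comment above.
rf1 = 54650      # RF=1, no replication
rf2_be = 22773   # RF=2 best_effort
rf2_sa = 23846   # RF=2 sync_all

# Replication tax: fraction of RF=1 IOPS lost when going to RF=2.
tax_be = (1 - rf2_be / rf1) * 100
tax_sa = (1 - rf2_sa / rf1) * 100
print(f"best_effort tax: {tax_be:.0f}%")  # → best_effort tax: 58%
print(f"sync_all tax: {tax_sa:.0f}%")     # → sync_all tax: 56%
```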
env:
  master_url: "http://10.0.0.3:9433"
  vol_size: "1073741824"

topology:
  nodes:
    m01:
      host: 192.168.1.181
      user: testdev
      key: "/opt/work/testdev_key"
    m02:
      host: 192.168.1.184
      user: testdev
      key: "/opt/work/testdev_key"

phases:
  - name: cluster-start
    actions:
      - action: exec
        node: m02
        cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-roce-master /tmp/sw-roce-vs1 && mkdir -p /tmp/sw-roce-master /tmp/sw-roce-vs1/blocks"
        root: "true"
        ignore_error: true
      - action: exec
        node: m01
        cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-roce-vs2 && mkdir -p /tmp/sw-roce-vs2/blocks"
        root: "true"
        ignore_error: true

      # Master on m02, listening on RoCE IP.
      - action: start_weed_master
        node: m02
        port: "9433"
        dir: /tmp/sw-roce-master
        extra_args: "-ip=10.0.0.3"
        save_as: master_pid

      - action: sleep
        duration: 3s

      # VS1 on m02, advertising RoCE IP.
      - action: start_weed_volume
        node: m02
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-roce-vs1
        extra_args: "-block.dir=/tmp/sw-roce-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
        save_as: vs1_pid

      # VS2 on m01, advertising RoCE IP.
      - action: start_weed_volume
        node: m01
        port: "18480"
        master: "10.0.0.3:9433"
        dir: /tmp/sw-roce-vs2
        extra_args: "-block.dir=/tmp/sw-roce-vs2/blocks -block.listen=:3295 -ip=10.0.0.1"
        save_as: vs2_pid

      - action: sleep
        duration: 3s

      - action: wait_cluster_ready
        node: m02
        master_url: "{{ master_url }}"

      - action: wait_block_servers
        count: "2"

  # === RF=1: No replication ===
  - name: rf1-test
    actions:
      - action: create_block_volume
        name: "roce-rf1"
        size_bytes: "{{ vol_size }}"
        replica_factor: "1"
        durability_mode: "best_effort"

      - action: sleep
        duration: 2s

      - action: lookup_block_volume
        name: "roce-rf1"
        save_as: rf1

      - action: iscsi_login_direct
        node: m01
        host: "{{ rf1_iscsi_host }}"
        port: "{{ rf1_iscsi_port }}"
        iqn: "{{ rf1_iqn }}"
        save_as: dev_rf1

      - action: fio_json
        node: m01
        device: "{{ dev_rf1 }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: rf1
        save_as: fio_rf1

      - action: fio_parse
        json_var: fio_rf1
        metric: iops
        save_as: iops_rf1

      - action: print
        msg: "RF=1 (RoCE): {{ iops_rf1 }} IOPS"

      - action: iscsi_cleanup
        node: m01
        ignore_error: true

  # === RF=2 best_effort ===
  - name: rf2-best-effort
    actions:
      - action: create_block_volume
        name: "roce-rf2be"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "best_effort"

      - action: wait_volume_healthy
        name: "roce-rf2be"
        timeout: 60s

      - action: lookup_block_volume
        name: "roce-rf2be"
        save_as: rf2be

      - action: iscsi_login_direct
        node: m01
        host: "{{ rf2be_iscsi_host }}"
        port: "{{ rf2be_iscsi_port }}"
        iqn: "{{ rf2be_iqn }}"
        save_as: dev_rf2be

      - action: fio_json
        node: m01
        device: "{{ dev_rf2be }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: rf2-best-effort
        save_as: fio_rf2be

      - action: fio_parse
        json_var: fio_rf2be
        metric: iops
        save_as: iops_rf2be

      - action: print
        msg: "RF=2 best_effort (RoCE): {{ iops_rf2be }} IOPS"

      - action: iscsi_cleanup
        node: m01
        ignore_error: true

  # === RF=2 sync_all ===
  - name: rf2-sync-all
    actions:
      - action: create_block_volume
        name: "roce-rf2sa"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "sync_all"

      - action: wait_volume_healthy
        name: "roce-rf2sa"
        timeout: 60s

      - action: lookup_block_volume
        name: "roce-rf2sa"
        save_as: rf2sa

      - action: iscsi_login_direct
        node: m01
        host: "{{ rf2sa_iscsi_host }}"
        port: "{{ rf2sa_iscsi_port }}"
        iqn: "{{ rf2sa_iqn }}"
        save_as: dev_rf2sa

      - action: fio_json
        node: m01
        device: "{{ dev_rf2sa }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: rf2-sync-all
        save_as: fio_rf2sa

      - action: fio_parse
        json_var: fio_rf2sa
        metric: iops
        save_as: iops_rf2sa

      - action: print
        msg: "RF=2 sync_all (RoCE): {{ iops_rf2sa }} IOPS"

      - action: iscsi_cleanup
        node: m01
        ignore_error: true

  - name: results
    actions:
      - action: print
        msg: "=== Replication Tax — 25Gbps RoCE (4K randwrite QD16) ==="
      - action: print
        msg: "RF=1 (no repl): {{ iops_rf1 }} IOPS"
      - action: print
        msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"
      - action: print
        msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"

      - action: collect_results
        title: "Replication Tax — 25Gbps RoCE"

  - name: cleanup
    always: true
    actions:
      - action: iscsi_cleanup
        node: m01
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
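The `fio_parse` actions in these scenarios pull a single metric out of the saved fio JSON. A minimal sketch of that extraction, assuming fio's standard `--output-format=json` layout where randwrite results live under `jobs[0]["write"]` (the numbers here are made up for illustration):

```python
import json

# Hypothetical fragment of fio --output-format=json output.
fio_output = json.dumps({
    "jobs": [
        {"jobname": "rf1", "write": {"iops": 54650.0, "bw": 218600}}
    ]
})

def fio_parse(json_text, metric):
    # Extract one metric from the first job's write section,
    # mirroring what a fio_parse action with metric=iops would do.
    job = json.loads(json_text)["jobs"][0]
    return job["write"][metric]

print(fio_parse(fio_output, "iops"))  # → 54650.0
```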
@@ -0,0 +1,245 @@
name: stable-replication-tax
timeout: 10m

# Stable dimension: replication tax comparison.
# RF=1 (no replication) vs RF=2 sync_all vs RF=2 best_effort.
# All on the same hardware, same workload.
#
# Answers: what does replication cost in IOPS?

env:
  master_url: "http://192.168.1.184:9433"
  vol_size: "1073741824"

topology:
  nodes:
    m01:
      host: 192.168.1.181
      user: testdev
      key: "/opt/work/testdev_key"
    m02:
      host: 192.168.1.184
      user: testdev
      key: "/opt/work/testdev_key"

phases:
  - name: cluster-start
    actions:
      - action: exec
        node: m02
        cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-tax-master /tmp/sw-tax-vs1 && mkdir -p /tmp/sw-tax-master /tmp/sw-tax-vs1/blocks"
        root: "true"
        ignore_error: true
      - action: exec
        node: m01
        cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-tax-vs2 && mkdir -p /tmp/sw-tax-vs2/blocks"
        root: "true"
        ignore_error: true

      - action: start_weed_master
        node: m02
        port: "9433"
        dir: /tmp/sw-tax-master
        save_as: master_pid

      - action: sleep
        duration: 3s

      - action: start_weed_volume
        node: m02
        port: "18480"
        master: "localhost:9433"
        dir: /tmp/sw-tax-vs1
        extra_args: "-block.dir=/tmp/sw-tax-vs1/blocks -block.listen=:3295 -ip=192.168.1.184"
        save_as: vs1_pid

      - action: start_weed_volume
        node: m01
        port: "18480"
        master: "192.168.1.184:9433"
        dir: /tmp/sw-tax-vs2
        extra_args: "-block.dir=/tmp/sw-tax-vs2/blocks -block.listen=:3295 -ip=192.168.1.181"
        save_as: vs2_pid

      - action: sleep
        duration: 3s

      - action: wait_cluster_ready
        node: m02
        master_url: "{{ master_url }}"

      - action: wait_block_servers
        count: "2"

  # === RF=1: No replication ===
  - name: rf1-test
    actions:
      - action: create_block_volume
        name: "tax-rf1"
        size_bytes: "{{ vol_size }}"
        replica_factor: "1"
        durability_mode: "best_effort"

      - action: sleep
        duration: 2s

      - action: lookup_block_volume
        name: "tax-rf1"
        save_as: rf1

      - action: iscsi_login_direct
        node: m01
        host: "{{ rf1_iscsi_host }}"
        port: "{{ rf1_iscsi_port }}"
        iqn: "{{ rf1_iqn }}"
        save_as: dev_rf1

      - action: fio_json
        node: m01
        device: "{{ dev_rf1 }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: rf1
        save_as: fio_rf1

      - action: fio_parse
        json_var: fio_rf1
        metric: iops
        save_as: iops_rf1

      - action: print
        msg: "RF=1: {{ iops_rf1 }} IOPS"

      - action: iscsi_cleanup
        node: m01
        ignore_error: true

  # === RF=2 best_effort ===
  - name: rf2-best-effort
    actions:
      - action: create_block_volume
        name: "tax-rf2be"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "best_effort"

      - action: wait_volume_healthy
        name: "tax-rf2be"
        timeout: 60s

      - action: lookup_block_volume
        name: "tax-rf2be"
        save_as: rf2be

      - action: iscsi_login_direct
        node: m01
        host: "{{ rf2be_iscsi_host }}"
        port: "{{ rf2be_iscsi_port }}"
        iqn: "{{ rf2be_iqn }}"
        save_as: dev_rf2be

      - action: fio_json
        node: m01
        device: "{{ dev_rf2be }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: rf2-best-effort
        save_as: fio_rf2be

      - action: fio_parse
        json_var: fio_rf2be
        metric: iops
        save_as: iops_rf2be

      - action: print
        msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"

      - action: iscsi_cleanup
        node: m01
        ignore_error: true

  # === RF=2 sync_all ===
  - name: rf2-sync-all
    actions:
      - action: create_block_volume
        name: "tax-rf2sa"
        size_bytes: "{{ vol_size }}"
        replica_factor: "2"
        durability_mode: "sync_all"

      - action: wait_volume_healthy
        name: "tax-rf2sa"
        timeout: 60s

      - action: lookup_block_volume
        name: "tax-rf2sa"
        save_as: rf2sa

      - action: iscsi_login_direct
        node: m01
        host: "{{ rf2sa_iscsi_host }}"
        port: "{{ rf2sa_iscsi_port }}"
        iqn: "{{ rf2sa_iqn }}"
        save_as: dev_rf2sa

      - action: fio_json
        node: m01
        device: "{{ dev_rf2sa }}"
        rw: randwrite
        bs: 4k
        iodepth: "16"
        runtime: "30"
        time_based: "true"
        name: rf2-sync-all
        save_as: fio_rf2sa

      - action: fio_parse
        json_var: fio_rf2sa
        metric: iops
        save_as: iops_rf2sa

      - action: print
        msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"

      - action: iscsi_cleanup
        node: m01
        ignore_error: true

  - name: results
    actions:
      - action: print
        msg: "=== Replication Tax (4K randwrite QD16) ==="
      - action: print
        msg: "RF=1 (no repl): {{ iops_rf1 }} IOPS"
      - action: print
        msg: "RF=2 best_effort: {{ iops_rf2be }} IOPS"
      - action: print
        msg: "RF=2 sync_all: {{ iops_rf2sa }} IOPS"

      - action: collect_results
        title: "Replication Tax"

  - name: cleanup
    always: true
    actions:
      - action: iscsi_cleanup
        node: m01
        ignore_error: true
      - action: stop_weed
        node: m01
        pid: "{{ vs2_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ vs1_pid }}"
        ignore_error: true
      - action: stop_weed
        node: m02
        pid: "{{ master_pid }}"
        ignore_error: true
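For reference, the `fio_json` action parameters used throughout these scenarios correspond roughly to a direct fio invocation. This is a sketch of a hypothetical field-to-flag mapping, not the testrunner's actual implementation (which lives in `testrunner/actions/` and is not shown in this diff):

```python
def fio_cmd(params):
    # Hypothetical mapping from fio_json action fields to fio CLI flags.
    # --time_based and --output-format=json are assumed from the action's
    # time_based: "true" field and the fact that fio_parse consumes JSON.
    return " ".join([
        "fio",
        f"--name={params['name']}",
        f"--filename={params['device']}",
        f"--rw={params['rw']}",
        f"--bs={params['bs']}",
        f"--iodepth={params['iodepth']}",
        f"--runtime={params['runtime']}",
        "--time_based",
        "--output-format=json",
    ])

cmd = fio_cmd({"name": "rf1", "device": "/dev/sdb", "rw": "randwrite",
               "bs": "4k", "iodepth": "16", "runtime": "30"})
print(cmd)
```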