34 KiB
Phase 6 Progress
Status
- CP6-1 complete. 54 CSI tests (25 dev + 30 QA - 1 removed).
- CP6-2 complete. 172 CP6-2 tests (118 dev/review + 54 QA). 1 QA bug found and fixed.
- Phase 6 cumulative: 226 tests, all PASS.
Completed
-
CP6-1 Task 0: Extracted BlockVolAdapter to shared
blockvol/adapter.go, added DisconnectVolume to TargetServer, added Session.TargetIQN(). -
CP6-1 Task 1: VolumeManager (multi-volume BlockVol + shared TargetServer lifecycle). 10 tests.
-
CP6-1 Task 2: CSI Identity service (GetPluginInfo, GetPluginCapabilities, Probe). 3 tests.
-
CP6-1 Task 3: CSI Controller service (CreateVolume, DeleteVolume, ValidateVolumeCapabilities). 4 tests.
-
CP6-1 Task 4: CSI Node service (NodeStageVolume, NodeUnstageVolume, NodePublishVolume, NodeUnpublishVolume). 7 tests.
-
CP6-1 Task 5: gRPC server + binary entry point (
csi/cmd/block-csi/main.go). -
CP6-1 Task 6: K8s manifests (DaemonSet, StorageClass, RBAC, example PVC) + smoke-test.sh.
-
CP6-1 Review fixes: 5 findings + 2 open questions resolved, 3 new tests added.
- Finding 1: CreateVolume idempotency after restart (adopts existing .blk files on disk).
- Finding 2: NodePublishVolume validates empty StagingTargetPath.
- Finding 3: Resource leak cleanup on error paths (success flag + deferred CloseVolume).
- Finding 4: Synchronous listener creation (bind errors surface immediately).
- Finding 5: IQN collision avoidance (SHA256 hash suffix on truncation).
-
CP6-1 QA adversarial: 30 tests in qa_csi_test.go. 5 bugs found and fixed:
- BUG-QA-1 (Medium): DeleteVolume leaked .snap.* delta files. Fixed: glob+remove snapshot files.
- BUG-QA-2 (High): Start not retryable after failure (sync.Once). Fixed: state machine.
- BUG-QA-3 (High): Stop then Start broken (sync.Once already fired). Fixed: same state machine.
- BUG-QA-4 (Low): CreateVolume ignored LimitBytes. Fixed: validate and cap size.
- BUG-QA-5 (Medium): sanitizeFilename case divergence with SanitizeIQN. Fixed: lowercase both.
- Additional: goroutine captured m.target by reference (nil after Stop). Fixed: local capture.
-
CP6-2 complete. All 7 tasks done. 63 CSI tests + 48 server block tests = 111 CP6-2 tests, all PASS.
CP6-2: Control-Plane Integration
Completed Tasks
- Task 0: Proto Extension + Code Generation — block volume messages in master.proto/volume_server.proto, Go stubs regenerated, conversion helpers + 5 tests.
- Task 1: Master Block Volume Registry — in-memory registry with Pending→Active status tracking, full/delta heartbeat reconciliation, per-name inflight lock (TOCTOU prevention), placement (fewest volumes), block-capable server tracking. 11 tests.
- Task 2: Volume Server Block Volume gRPC — AllocateBlockVolume/DeleteBlockVolume gRPC handlers on VolumeServer, CreateBlockVol/DeleteBlockVol on BlockService, shared naming (blockvol/naming.go). 5 tests.
- Task 3: Master Block Volume RPC Handlers — CreateBlockVolume (idempotent, inflight lock, retry up to 3 servers), DeleteBlockVolume (idempotent), LookupBlockVolume. Mock VS call injection for testability. 9 tests.
- Task 4: Heartbeat Wiring — block volume fields in heartbeat stream, volume server sends initial full heartbeat + deltas, master processes via UpdateFullHeartbeat/UpdateDeltaHeartbeat.
- Task 5: CSI Controller Refactor — VolumeBackend interface (LocalVolumeBackend + MasterVolumeClient), controller uses backend instead of VolumeManager, returns volume_context with iscsiAddr+iqn, mode flag (controller/node/all). 5 backend tests.
- Task 6: CSI Node Refactor + K8s Manifests — Node reads volume_context for remote targets, staged volume tracking with IQN derivation fallback on restart, split K8s manifests (csi-driver.yaml, csi-controller.yaml Deployment, csi-node.yaml DaemonSet). 4 new node tests (11 total).
New Files (CP6-2)
| File | Description |
|---|---|
blockvol/naming.go |
Shared SanitizeIQN + SanitizeFilename |
blockvol/naming_test.go |
4 naming tests |
blockvol/block_heartbeat_proto.go |
Go wire type ↔ proto conversion |
blockvol/block_heartbeat_proto_test.go |
5 conversion tests |
server/master_block_registry.go |
Block volume registry + placement |
server/master_block_registry_test.go |
11 registry tests |
server/volume_grpc_block.go |
VS block volume gRPC handlers |
server/volume_grpc_block_test.go |
5 VS tests |
server/master_grpc_server_block.go |
Master block volume RPC handlers |
server/master_grpc_server_block_test.go |
9 master handler tests |
csi/volume_backend.go |
VolumeBackend interface + clients |
csi/volume_backend_test.go |
5 backend tests |
csi/deploy/csi-controller.yaml |
Controller Deployment manifest |
csi/deploy/csi-node.yaml |
Node DaemonSet manifest |
Modified Files (CP6-2)
| File | Changes |
|---|---|
pb/master.proto |
Block volume messages, Heartbeat fields 24-27, RPCs |
pb/volume_server.proto |
AllocateBlockVolume, VolumeServerDeleteBlockVolume |
server/master_server.go |
BlockVolumeRegistry + VS call fields |
server/master_grpc_server.go |
Block volume heartbeat processing |
server/volume_grpc_client_to_master.go |
Block volume in heartbeat stream |
server/volume_server_block.go |
CreateBlockVol/DeleteBlockVol on BlockService |
csi/controller.go |
VolumeBackend instead of VolumeManager |
csi/controller_test.go |
Updated for VolumeBackend |
csi/node.go |
Remote target support + staged volume tracking |
csi/node_test.go |
4 new remote target tests |
csi/server.go |
Mode flag, MasterAddr, VolumeBackend config |
csi/cmd/block-csi/main.go |
--master, --mode flags |
csi/deploy/csi-driver.yaml |
CSIDriver object only (split out workloads) |
csi/qa_csi_test.go |
Updated for VolumeBackend |
CP6-2 Review Fixes
All findings from both reviewers addressed. 4 new tests added (118 total CP6-2 tests).
| # | Finding | Severity | Fix |
|---|---|---|---|
| R1-F1 | DeleteBlockVol doesn't terminate active sessions | High | Use DisconnectVolume instead of RemoveVolume |
| R1-F2 | Block registry server list never pruned | Medium | UnmarkBlockCapable on VS disconnect in SendHeartbeat defer |
| R1-F3 | Block volume status never updates after create | Medium | Mark StatusActive immediately after successful VS allocate |
| R1-F4 | IQN generation on startup scan doesn't sanitize | Low | Apply blockvol.SanitizeIQN(name) in scan path |
| R1-F5/R2-F3 | CreateBlockVol idempotent path skips TargetServer | Medium | Re-add adapter to TargetServer on idempotent path |
| R2-F1 | UpdateFullHeartbeat doesn't update SizeBytes | Low | Copy info.VolumeSize to existing.SizeBytes |
| R2-F2 | inflightEntry.done channel is dead code | Low | Removed done channel, simplified to empty struct |
| R2-F4 | CreateBlockVolume idempotent check doesn't validate size | Medium | Return error if existing size < requested size |
| R2-F5 | Full + delta heartbeat can fire on same message | Low | Changed second if to else if + comment |
| R2-F6 | NodeUnstageVolume deletes staged entry before cleanup | Medium | Delete from staged map only after successful cleanup |
New tests: TestMaster_CreateIdempotentSizeMismatch, TestRegistry_UnmarkDeadServer, TestRegistry_FullHeartbeatUpdatesSizeBytes, TestNode_UnstageRetryKeepsStagedEntry.
CP6-2 QA Adversarial Tests
54 tests across 2 files. 1 bug found and fixed.
| File | Tests | Areas |
|---|---|---|
server/qa_block_cp62_test.go |
22 | Registry (8), Master RPCs (8), VS BlockService (6) |
csi/qa_cp62_test.go |
32 | Node remote (6), Controller backend (5), Backend (2), Naming (2), Lifecycle (4), Server/Driver (2), VolumeManager (4), Edge cases (7) |
BUG-QA-CP62-1 (Medium): NewCSIDriver accepts invalid mode strings.
NewCSIDriver(DriverConfig{Mode: "invalid"})returns nil error. Driver runs with only identity server — no controller, no node. K8s reports capabilities but all operations failUnimplemented.- Fix: Added
switchvalidation after mode defaulting. Returns"csi: invalid mode %q, must be controller/node/all". - Test:
TestQA_ModeInvalid.
Final CP6-2 test count: 118 dev/review + 54 QA = 172 CP6-2 tests, all PASS.
Cumulative Phase 6 test count: 54 CP6-1 + 172 CP6-2 = 226 tests.
CSI Testing Ladder
| Level | What | Tools | Status |
|---|---|---|---|
| 1. Unit tests | Mock iscsiadm/mount. Confirm idempotency, error handling, edge cases. | go test |
DONE (226 tests) |
| 2. gRPC conformance | csi-sanity tool validates all CSI RPCs against spec. No K8s needed. |
csi-sanity | DONE (33 pass, 58 skip) |
| 3. Integration smoke | Full iSCSI lifecycle with real filesystem (via csi-sanity "should work" tests). | csi-sanity + iscsiadm | DONE (489 SCSI cmds) |
| 4. Single-node K8s (k3s) | Deploy CSI DaemonSet on k3s. PVC → Pod → write data → delete/recreate → verify persistence. | k3s v1.34.4 | DONE |
| 5. Failure/chaos | Kill CSI controller pod; ensure no IO outage for existing volumes. Node restart with staged volumes. | chaos-mesh or manual | TODO |
| 6. K8s E2E suite | SIG-Storage tests validate provisioning, attach/detach, resize, snapshots. | e2e.test binary |
TODO |
Level 2: csi-sanity Conformance (M02)
Result: 33 Passed, 0 Failed, 58 Skipped, 1 Pending.
Run on M02 (192.168.1.184) with block-csi in local mode. Used helper scripts for staging/target path management.
Bugs found and fixed during csi-sanity:
| # | Bug | Severity | Fix |
|---|---|---|---|
| BUG-SANITY-1 | CreateVolume accepted empty VolumeCapabilities | Medium | Added len(req.VolumeCapabilities) == 0 check |
| BUG-SANITY-2 | ValidateVolumeCapabilities accepted empty VolumeCapabilities | Medium | Same check added |
| BUG-SANITY-3 | NodeStageVolume accepted nil VolumeCapability | Medium | Added nil check |
| BUG-SANITY-4 | NodePublishVolume used mount -t ext4 instead of bind mount |
High | Added BindMount method to MountUtil interface |
| BUG-SANITY-5 | NodeUnpublishVolume didn't remove target path | Medium | Added os.RemoveAll per CSI spec |
| BUG-SANITY-6 | NodeUnpublishVolume failed on unmounted path | Medium | Added IsMounted check before unmount |
All existing unit tests updated with VolumeCapabilities/VolumeCapability in test requests.
Level 3: Integration Smoke (M02)
Verified through csi-sanity's full lifecycle tests which exercised real iSCSI:
- 489 real SCSI commands processed (READ_10, WRITE_10, SYNC_CACHE, INQUIRY, etc.)
- Full cycle: CreateVolume → NodeStageVolume (iSCSI login + mkfs.ext4 + mount) → NodePublishVolume → NodeUnpublishVolume → NodeUnstageVolume (unmount + iSCSI logout) → DeleteVolume
- Clean state verified: no leftover iSCSI sessions, mounts, or volume files
Level 4: k3s PVC→Pod (M02)
Result: PASS — data persists across pod deletion/recreation.
k3s v1.34.4 single-node on M02. CSI deployed as DaemonSet with 3 containers:
- block-csi (privileged, nsenter wrappers for host iscsiadm/mount/umount/mkfs/blkid/mountpoint)
- csi-provisioner (v5.1.0, --node-deployment for single-node)
- csi-node-driver-registrar (v2.12.0)
Test sequence:
- Created PVC (100Mi, sw-block StorageClass) → Bound
- Created pod → wrote "hello sw-block" to /data/test.txt → md5:
7be761488cf480c966077c7aca4ea3ed - Deleted pod (PVC retained) → iSCSI session cleanly closed
- Recreated pod with same PVC → read "hello sw-block" → same md5 verified
- Appended "persistence works!" → confirmed read-write
Additional bug fixed during k3s testing:
| # | Bug | Severity | Fix |
|---|---|---|---|
| BUG-K3S-1 | IsLoggedIn didn't handle iscsiadm exit code 21 (nsenter suppresses output) | Medium | Added exitErr.ExitCode() == 21 check |
DaemonSet manifest: learn/projects/sw-block/test/csi-k3s-node.yaml
- CP6-3 complete. 67 CP6-3 tests. All PASS.
CP6-3: Failover + Rebuild in Kubernetes
Completed Tasks
- Task 0: Proto Extension + Wire Type Updates — Added replica_data_addr, replica_ctrl_addr to BlockVolumeInfoMessage/BlockVolumeAssignment; rebuild_addr to BlockVolumeAssignment; replica_server to Create/LookupBlockVolumeResponse; replica fields to AllocateBlockVolumeResponse. Updated wire types and converters. 8 tests.
- Task 1: Master Assignment Queue + Delivery — BlockAssignmentQueue with Enqueue/Peek/Confirm/ConfirmFromHeartbeat. Retain-until-confirmed pattern (F1): assignments resent on every heartbeat until VS confirms via matching (path, epoch, role). Stale epoch pruning during Peek. Wired into HeartbeatResponse delivery. 11 tests.
- Task 2: VS Assignment Receiver Wiring — VS extracts block_volume_assignments from HeartbeatResponse and calls BlockService.ProcessAssignments.
- Task 3: BlockService Replication Support — ProcessAssignments dispatches to HandleAssignment + setupPrimaryReplication/setupReplicaReceiver/startRebuild per role. ReplicationPorts deterministic hash (F3). Heartbeat reports replica addresses (F5). 9 tests.
- Task 4: Registry Replica Tracking + CreateVolume — Added SetReplica/ClearReplica/SwapPrimaryReplica to registry. CreateBlockVolume creates on 2 servers (primary + replica), enqueues assignments. Single-copy mode if only 1 server or replica fails (F4). LookupBlockVolume returns ReplicaServer. 10 tests.
- Task 5: Master Failover Detection — failoverBlockVolumes on VS disconnect. Lease-aware promotion (F2): promote only after LastLeaseGrant + LeaseTTL expires. Deferred promotion via time.AfterFunc for unexpired leases. promoteReplica swaps primary/replica, bumps epoch, enqueues new primary assignment. 11 tests.
- Task 6: ControllerPublishVolume/UnpublishVolume — ControllerPublishVolume calls backend.LookupVolume, returns publish_context{iscsiAddr, iqn}. ControllerUnpublishVolume is no-op. Added PUBLISH_UNPUBLISH_VOLUME capability. NodeStageVolume prefers publish_context over volume_context (reflects current primary after failover). 8 tests.
- Task 7: Rebuild on Recovery — recoverBlockVolumes on VS reconnect drains pendingRebuilds, sets reconnected server as replica, enqueues Rebuilding assignments. 10 tests (shared with Task 5 test file).
Design Review Findings Addressed
| # | Finding | Severity | Resolution |
|---|---|---|---|
| F1 | Assignment delivery can be dropped | Critical | Retain-until-confirmed: Peek+Confirm pattern, assignments resent every heartbeat |
| F2 | Failover without lease check → split-brain | Critical | Gate promotion on now > lastLeaseGrant + leaseTTL; deferred promotion for unexpired leases |
| F3 | Replication ports change on VS restart | Critical | Deterministic port = FNV hash of path, offset from base iSCSI port |
| F4 | Partial create (replica fails) | Medium | Single-copy mode with ReplicaServer="", skip replica assignments |
| F5 | UpdateFullHeartbeat ignores replica addresses | Medium | VS includes replica_data/ctrl in InfoMessage; registry updates on heartbeat |
Code Review 1 Findings Addressed
| # | Finding | Severity | Resolution |
|---|---|---|---|
| R1-1 | AllocateBlockVolume missing repl addrs | High | AllocateBlockVolume now returns ReplicaDataAddr/CtrlAddr/RebuildListenAddr from ReplicationPorts() |
| R1-2 | Primary never starts rebuild server | High | setupPrimaryReplication now calls vol.StartRebuildServer(rebuildAddr) |
| R1-3 | Assignment queue never confirms after startup | High | VS sends periodic full block heartbeat (5×sleepInterval tick) enabling master confirmation |
| R1-4 | Replica addresses not reported in heartbeat | Medium | BlockService.CollectBlockVolumeHeartbeat wraps store's collector, fills ReplicaDataAddr/CtrlAddr from replStates |
| R1-5 | Lease never refreshed after create | Medium | UpdateFullHeartbeat refreshes LastLeaseGrant on every heartbeat; periodic block heartbeats keep it current |
Code Review 2 Findings Addressed
| # | Finding | Severity | Resolution |
|---|---|---|---|
| R2-F1 | LastLeaseGrant set AFTER Register → stale-lease race | High | Moved to entry initializer BEFORE Register |
| R2-F2 | Deferred promotion timer has no cancellation | Medium | Timers stored in blockFailoverState.deferredTimers; cancelled in recoverBlockVolumes on reconnect |
| R2-F3 | SwapPrimaryReplica hardcodes uint32(1) | Medium | Changed to blockvol.RoleToWire(blockvol.RolePrimary) |
| R2-F4 | DeleteBlockVolume doesn't delete replica | Medium | Added best-effort replica delete (non-fatal if replica VS is down) |
| R2-F5 | promoteReplica reads epoch without lock | Medium | SwapPrimaryReplica now computes epoch+1 atomically inside lock, returns newEpoch |
| R2-F6 | Redundant string(server) casts | Low | Removed — servers already typed as string |
| R2-F7 | startRebuild goroutine has no feedback path | Low | Documented as future work (VS could report via heartbeat) |
New Files (CP6-3)
| File | Description |
|---|---|
server/master_block_assignment_queue.go |
Assignment queue with retain-until-confirmed |
server/master_block_assignment_queue_test.go |
11 queue tests |
server/master_block_failover.go |
Failover detection + rebuild on recovery |
server/master_block_failover_test.go |
21 failover + rebuild tests |
Modified Files (CP6-3)
| File | Changes |
|---|---|
pb/master.proto |
Replica/rebuild fields on assignment/info/response messages |
pb/volume_server.proto |
Replica/rebuild fields on AllocateBlockVolumeResponse |
pb/master_pb/master.pb.go |
New fields + getters |
pb/volume_server_pb/volume_server.pb.go |
New fields + getters |
storage/blockvol/block_heartbeat.go |
ReplicaDataAddr/CtrlAddr on InfoMessage, RebuildAddr on Assignment |
storage/blockvol/block_heartbeat_proto.go |
Updated converters + AssignmentsToProto |
server/master_server.go |
blockAssignmentQueue, blockFailover, blockAllocResult struct |
server/master_grpc_server.go |
Assignment delivery in heartbeat, failover on disconnect, recovery on reconnect |
server/master_grpc_server_block.go |
Replica creation, assignment enqueueing, tryCreateReplica; R2-F1 LastLeaseGrant fix; R2-F4 replica delete; R2-F6 cast cleanup |
server/master_block_registry.go |
Replica fields, lease fields, SetReplica/ClearReplica/SwapPrimaryReplica; R2-F3 RoleToWire; R2-F5 atomic epoch; R1-5 lease refresh |
server/volume_grpc_client_to_master.go |
Assignment processing from HeartbeatResponse; R1-3 periodic block heartbeat tick |
server/volume_grpc_block.go |
R1-1 replication ports in AllocateBlockVolumeResponse |
server/volume_server_block.go |
ProcessAssignments, replication setup, ReplicationPorts; R1-2 StartRebuildServer; R1-4 CollectBlockVolumeHeartbeat with repl addrs |
server/master_block_failover.go |
R2-F2 deferred timer cancellation; R2-F5 new SwapPrimaryReplica API; R2-F7 rebuild feedback comment |
storage/store_blockvol.go |
WithVolume (exported) |
csi/controller.go |
ControllerPublishVolume/UnpublishVolume, PUBLISH_UNPUBLISH capability |
csi/node.go |
Prefer publish_context over volume_context |
CP6-3 Test Count
| File | New Tests |
|---|---|
blockvol/block_heartbeat_proto_test.go |
7 |
server/master_block_assignment_queue_test.go |
11 |
server/volume_server_block_test.go |
9 |
server/master_block_registry_test.go |
5 |
server/master_grpc_server_block_test.go |
6 |
server/master_block_failover_test.go |
21 |
csi/controller_test.go |
6 |
csi/node_test.go |
2 |
| Total CP6-3 | 67 |
Cumulative Phase 6 test count: 54 CP6-1 + 172 CP6-2 + 67 CP6-3 = 293 tests.
CP6-3 QA Adversarial Tests
48 tests in server/qa_block_cp63_test.go. 1 bug found and fixed.
| Group | Tests | Areas |
|---|---|---|
| Assignment Queue | 8 | Wrong epoch confirm, partial heartbeat confirm, same-path different roles, concurrent ops |
| Registry | 7 | Double swap, swap no-replica, concurrent swap+lookup, SetReplica replace, heartbeat clobber |
| Failover | 7 | Deferred cancel on reconnect, double disconnect, mixed lease states, volume deleted during timer |
| Create+Delete | 5 | Lease non-zero after create, replica delete on vol delete, replica delete failure |
| Rebuild | 3 | Double reconnect, nil failover state, full cycle |
| Integration | 2 | Failover enqueues assignment, heartbeat confirms failover assignment |
| Edge Cases | 5 | Epoch monotonic, cancel timers no rebuilds, replica server dies, empty batch |
| Master-level | 5 | Delete VS unreachable, sanitized name, concurrent create/delete, all VS fail, slow allocate |
| VS-level | 6 | Concurrent create, concurrent create/delete, delete cleans snapshots, sanitization collision, idempotent re-add, nil block service |
BUG-QA-CP63-1 (Medium): SetReplica leaks old replica server in byServer index.
SetReplicadidn't remove old replica server frombyServerwhen replacing with a new one.- Fix: Added
removeFromServer(oldReplicaServer, name)before setting new replica (3 lines). - Test:
TestQA_Reg_SetReplicaTwice_ReplacesOld.
Final CP6-3 test count: 67 dev/review + 48 QA = 115 CP6-3 tests, all PASS.
CP6-3 Integration Tests
8 tests in server/integration_block_test.go. Full cross-component flows.
| # | Test | What it proves |
|---|---|---|
| 1 | FailoverCSIPublish | LookupBlockVolume returns new iSCSI addr after failover |
| 2 | RebuildOnRecovery | Rebuilding assignment enqueued + heartbeat confirms it |
| 3 | AssignmentDeliveryConfirmation | Queue retains until heartbeat confirms matching (path, epoch) |
| 4 | LeaseAwarePromotion | Promotion deferred until lease TTL expires |
| 5 | ReplicaFailureSingleCopy | Single-copy mode: no replica assignments, failover is no-op |
| 6 | TransientDisconnectNoSplitBrain | Deferred timer cancelled on reconnect, no split-brain |
| 7 | FullLifecycle | 11-phase lifecycle: create→publish→confirm→failover→re-publish→recover→rebuild→delete |
| 8 | DoubleFailover | Two successive failovers: epoch 1→2→3 |
| 9 | MultiVolumeFailoverRebuild | 3 volumes, kill 1 server, rebuild all affected |
Final CP6-3 test count: 67 dev/review + 48 QA + 8 mock integration + 3 real integration = 126 CP6-3 tests, all PASS.
Cumulative Phase 6 with QA: 54 CP6-1 + 172 CP6-2 + 126 CP6-3 = 352 tests.
CP6-3 Real Integration Tests (M02)
3 tests in blockvol/test/cp63_test.go, run on M02 (192.168.1.184) with real iSCSI.
Bug found: RoleNone → RoleRebuilding transition not allowed.
After VS restart, volume is RoleNone. Master sends Rebuilding assignment, but both
validTransitions (role.go) and HandleAssignment (promotion.go) rejected this path.
- Fix: Added
RoleRebuilding: truetovalidTransitions[RoleNone]in role.go. AddedRoleNone → RoleRebuildingcase in HandleAssignment with SetEpoch + SetRole. - Admin API: Added
action:"connect"to/rebuildendpoint (starts rebuild client).
| # | Test | Time | What it proves |
|---|---|---|---|
| 1 | FailoverCSIAddressSwitch | 3.2s | Write A → kill primary → promote replica → re-discover at new iSCSI address → verify A → write B → verify A+B. Simulates CSI ControllerPublishVolume address-switch. |
| 2 | RebuildDataConsistency | 5.3s | Write A (replicated) → kill replica → write B (missed) → restart replica as Rebuilding → rebuild server + client → wait role→Replica → kill primary → promote rebuilt → verify A+B. Full end-to-end rebuild with data verification. |
| 3 | FullLifecycleFailoverRebuild | 6.4s | Write A → kill primary → promote → write B → rebuild old primary → write C → kill new primary → promote old → verify A+B. 11-phase lifecycle: failover→recoverBlockVolumes→rebuild. |
All 7 existing HA tests: PASS (no regression). Total real integration: 10 tests on M02.
In Progress
- None.
Blockers
- None.
Next Steps
- CP6-4: Soak testing, lease renewal timers, monitoring dashboards.
Notes
- CSI spec dependency:
github.com/container-storage-interface/spec v1.10.0. - Architecture: CSI binary embeds TargetServer + BlockVol in-process (loopback iSCSI).
- Interface-based ISCSIUtil/MountUtil for unit testing without real iscsiadm/mount.
- k3s deployment requires: hostNetwork, hostPID, privileged, /dev mount, nsenter wrappers for host commands.
- Known pre-existing flaky:
TestQAPhase4ACP1/role_concurrent_transitions(unrelated to CSI).
CP6-1 Test Catalog
VolumeManager (csi/volume_manager_test.go) — 10 tests
| # | Test | What it proves |
|---|---|---|
| 1 | CreateOpenClose | Create, verify IQN, close, reopen lifecycle |
| 2 | DeleteRemovesFile | .blk file removed on delete |
| 3 | DuplicateCreate | Same size idempotent; different size returns ErrVolumeSizeMismatch |
| 4 | ListenAddr | Non-empty listen address after start |
| 5 | OpenNonExistent | Error on opening non-existent volume |
| 6 | CloseAlreadyClosed | Idempotent close of non-tracked volume |
| 7 | ConcurrentCreateDelete | 10 parallel create+delete, no races |
| 8 | SanitizeIQN | Special char replacement, truncation to 64 chars |
| 9 | CreateIdempotentAfterRestart | Existing .blk file adopted on restart |
| 10 | IQNCollision | Long names with same prefix get distinct IQNs via hash suffix |
Identity (csi/identity_test.go) — 3 tests
| # | Test | What it proves |
|---|---|---|
| 1 | GetPluginInfo | Returns correct driver name + version |
| 2 | GetPluginCapabilities | Returns CONTROLLER_SERVICE capability |
| 3 | Probe | Returns ready=true |
Controller (csi/controller_test.go) — 4 tests
| # | Test | What it proves |
|---|---|---|
| 1 | CreateVolume | Volume created and tracked |
| 2 | CreateIdempotent | Same name+size succeeds, different size returns AlreadyExists |
| 3 | DeleteVolume | Volume removed after delete |
| 4 | DeleteNotFound | Delete non-existent returns success (CSI spec) |
Node (csi/node_test.go) — 7 tests
| # | Test | What it proves |
|---|---|---|
| 1 | StageUnstage | Full stage flow (discovery+login+mount) and unstage (unmount+logout+close) |
| 2 | PublishUnpublish | Bind mount from staging to target path |
| 3 | StageIdempotent | Already-mounted staging path returns OK without side effects |
| 4 | StageLoginFailure | iSCSI login error propagated as Internal |
| 5 | StageMkfsFailure | mkfs error propagated as Internal |
| 6 | StageLoginFailureCleanup | Volume closed after login failure (no resource leak) |
| 7 | PublishMissingStagingPath | Empty StagingTargetPath returns InvalidArgument |
Adapter (blockvol/adapter_test.go) — 3 tests
| # | Test | What it proves |
|---|---|---|
| 1 | AdapterALUAProvider | ALUAState/TPGroupID/DeviceNAA correct values |
| 2 | RoleToALUA | All role→ALUA state mappings |
| 3 | UUIDToNAA | NAA-6 byte layout from UUID |
CP6-2 Test Catalog
Registry (server/master_block_registry_test.go) — 11 tests
| # | Test | What it proves |
|---|---|---|
| 1 | RegisterLookup | Register + Lookup returns entry |
| 2 | DuplicateRegister | Second register same name errors |
| 3 | Unregister | Unregister removes entry |
| 4 | ListByServer | Returns only entries for given server |
| 5 | FullHeartbeat | Marks active, removes stale, adds new |
| 6 | DeltaHeartbeat | Add/remove deltas applied correctly |
| 7 | PickServer | Fewest-volumes placement |
| 8 | Inflight | AcquireInflight blocks duplicate, ReleaseInflight unblocks |
| 9 | BlockCapable | MarkBlockCapable / UnmarkBlockCapable tracking |
| 10 | UnmarkDeadServer | R1-F2 regression test |
| 11 | FullHeartbeatUpdatesSizeBytes | R2-F1 regression test |
Master RPCs (server/master_grpc_server_block_test.go) — 9 tests
| # | Test | What it proves |
|---|---|---|
| 1 | CreateHappyPath | Create → register → lookup works |
| 2 | CreateIdempotent | Same name+size returns same entry |
| 3 | CreateIdempotentSizeMismatch | Same name, smaller size → error |
| 4 | CreateInflightBlock | Concurrent create same name → one fails |
| 5 | Delete | Delete → VS called → unregistered |
| 6 | DeleteNotFound | Delete non-existent → success |
| 7 | Lookup | Lookup returns entry |
| 8 | LookupNotFound | Lookup non-existent → NotFound |
| 9 | CreateRetryNextServer | First VS fails → retries on next |
VS Block gRPC (server/volume_grpc_block_test.go) — 5 tests
| # | Test | What it proves |
|---|---|---|
| 1 | Allocate | Create via gRPC returns path+iqn+addr |
| 2 | AllocateEmptyName | Empty name → error |
| 3 | AllocateZeroSize | Zero size → error |
| 4 | Delete | Delete via gRPC succeeds |
| 5 | DeleteNilService | Nil blockService → error |
Naming (blockvol/naming_test.go) — 4 tests
| # | Test | What it proves |
|---|---|---|
| 1 | SanitizeFilename | Lowercases, replaces invalid chars |
| 2 | SanitizeIQN | Lowercases, replaces, truncates with hash |
| 3 | IQNMaxLength | 64-char names pass through unchanged |
| 4 | IQNHashDeterministic | Same input → same hash suffix |
Proto conversion (blockvol/block_heartbeat_proto_test.go) — 5 tests
| # | Test | What it proves |
|---|---|---|
| 1 | RoundTrip | Go→proto→Go preserves all fields |
| 2 | NilSafe | Nil input → nil output |
| 3 | ShortRoundTrip | Short info round-trip |
| 4 | AssignmentRoundTrip | Assignment round-trip |
| 5 | SliceHelpers | Slice conversion helpers |
Backend (csi/volume_backend_test.go) — 5 tests
| # | Test | What it proves |
|---|---|---|
| 1 | LocalCreate | LocalVolumeBackend.CreateVolume creates + returns info |
| 2 | LocalDelete | LocalVolumeBackend.DeleteVolume removes volume |
| 3 | LocalLookup | LocalVolumeBackend.LookupVolume returns info |
| 4 | LocalLookupNotFound | Lookup non-existent returns not-found |
| 5 | LocalDeleteNotFound | Delete non-existent returns success |
Node remote (csi/node_test.go additions) — 4 tests
| # | Test | What it proves |
|---|---|---|
| 1 | StageRemoteTarget | volume_context drives iSCSI instead of local mgr |
| 2 | UnstageRemoteTarget | Staged map IQN used for logout |
| 3 | UnstageAfterRestart | IQN derived from iqnPrefix when staged map empty |
| 4 | UnstageRetryKeepsStagedEntry | R2-F6 regression: staged entry preserved on failure |
QA Server (server/qa_block_cp62_test.go) — 22 tests
| # | Test | What it proves |
|---|---|---|
| 1 | Reg_FullHeartbeatCrossTalk | Heartbeat from s2 doesn't remove s1 volumes |
| 2 | Reg_FullHeartbeatEmptyServer | Empty heartbeat marks server block-capable |
| 3 | Reg_ConcurrentHeartbeatAndRegister | 10 goroutines heartbeat+register, no races |
| 4 | Reg_DeltaHeartbeatUnknownPath | Delta for unknown path is no-op |
| 5 | Reg_PickServerTiebreaker | PickServer returns first server on tie |
| 6 | Reg_ReregisterDifferentServer | Re-register same name on different server fails |
| 7 | Reg_InflightIndependence | Inflight lock for vol-a doesn't block vol-b |
| 8 | Reg_BlockCapableServersAfterUnmark | Unmark removes from block-capable list |
| 9 | Master_DeleteVSUnreachable | Delete fails if VS delete fails (no orphan) |
| 10 | Master_CreateSanitizedName | Names with special chars go through |
| 11 | Master_ConcurrentCreateDelete | Concurrent create+delete on same name, no panic |
| 12 | Master_AllVSFailNoOrphan | All 3 servers fail → error, no registry entry |
| 13 | Master_SlowAllocateBlocksSecond | Inflight lock blocks concurrent same-name create |
| 14 | Master_CreateZeroSize | Zero size → InvalidArgument |
| 15 | Master_CreateEmptyName | Empty name → InvalidArgument |
| 16 | Master_EmptyNameValidation | Whitespace-only name → InvalidArgument |
| 17 | VS_ConcurrentCreate | 20 goroutines create same vol, no crash |
| 18 | VS_ConcurrentCreateDelete | 20 goroutines create+delete interleaved |
| 19 | VS_DeleteCleansSnapshots | Delete removes .snap.* files |
| 20 | VS_SanitizationCollision | Idempotent create after sanitization matches |
| 21 | VS_CreateIdempotentReaddTarget | Idempotent create re-adds adapter to TargetServer |
| 22 | VS_GrpcNilBlockService | Nil blockService returns error (not panic) |
QA CSI (csi/qa_cp62_test.go) — 32 tests
| # | Test | What it proves |
|---|---|---|
| 1 | Node_RemoteUnstageNoCloseVolume | Remote unstage doesn't call CloseVolume |
| 2 | Node_RemoteUnstageFailPreservesStaged | Failed unstage preserves staged entry |
| 3 | Node_ConcurrentStageUnstage | 20 concurrent stage+unstage, no races |
| 4 | Node_RemotePortalUsedCorrectly | Remote portal used for discovery (not local) |
| 5 | Node_PartialVolumeContext | Missing iqn falls back to local mgr |
| 6 | Node_UnstageNoMgrNoPrefix | No mgr + no prefix → empty IQN (graceful) |
| 7 | Ctrl_VolumeContextPresent | CreateVolume returns iscsiAddr+iqn in context |
| 8 | Ctrl_ValidateUsesBackend | ValidateVolumeCapabilities uses backend lookup |
| 9 | Ctrl_CreateLargerSizeRejected | Existing vol + larger size → AlreadyExists |
| 10 | Ctrl_ExactBlockSizeBoundary | Exact 4MB boundary succeeds |
| 11 | Ctrl_ConcurrentCreate | 10 concurrent creates, one succeeds |
| 12 | Backend_LookupAfterRestart | Volume found after VolumeManager restart |
| 13 | Backend_DeleteThenLookup | Lookup after delete → not found |
| 14 | Naming_CrossLayerConsistency | CSI and blockvol SanitizeIQN produce same result |
| 15 | Naming_LongNameHashCollision | Two 70-char names → distinct IQNs |
| 16 | RemoteLifecycleFull | Full remote stage→publish→unpublish→unstage→delete |
| 17 | ModeControllerNoMgr | Controller mode with masterAddr, no local mgr |
| 18 | ModeNodeOnly | Node mode creates mgr but no controller |
| 19 | ModeInvalid | Invalid mode → error (BUG-QA-CP62-1) |
| 20 | Srv_AllModeLocalBackend | All mode without master uses local backend |
| 21 | Srv_DoubleStop | Double Stop doesn't panic |
| 22 | VM_CreateAfterStop | Create after stop returns error |
| 23 | VM_OpenNonExistent | Open non-existent returns error |
| 24 | VM_ListenAddrAfterStop | ListenAddr after stop returns empty |
| 25 | VM_VolumeIQNSanitized | VolumeIQN applies sanitization |
| 26 | Edge_MinSize | Minimum 4MB volume succeeds |
| 27 | Edge_BelowMinSize | Below minimum → error |
| 28 | Edge_RequiredEqualsLimit | Required == limit succeeds |
| 29 | Edge_RoundingExceedsLimit | Rounding up exceeds limit → error |
| 30 | Edge_EmptyVolumeIDNode | Empty volumeID → InvalidArgument |
| 31 | Node_PublishWithoutStaging | Publish unstaged vol → still works (mock) |
| 32 | Node_DoubleUnstage | Double unstage → idempotent success |