You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

4.6 KiB

Phase 5 Progress

Status

  • CP5-1 through CP5-4 complete. Phase 5 DONE.

Completed

  • CP5-1: ALUA implicit support, REPORT TARGET PORT GROUPS, VPD 0x83 descriptors, write fencing on standby.
  • CP5-1: Multipath config + setup script, 4 multipath integration tests.
  • CP5-1: Reviewer fixes (RoleNone write regression, T_SUP flag, TPG ID validation, ASCII log).
  • CP5-1: 10 ALUA unit tests + 16 adversarial tests (all PASS).
  • CP5-2: CoW snapshots implemented with flusher-based CoW, delta files, and recovery.
  • CP5-2: Review fixes applied (PauseAndFlush safety, snapMu race fix, beginOp/endOp, lock order doc, error propagation).
  • CP5-2: 10 unit tests + 22 adversarial tests (all PASS).
  • CP5-3: CHAP auth, online resize, Prometheus metrics, admin endpoints.
  • CP5-3: Review fixes applied (empty secret validation, AuthMethod echo, docs).
  • CP5-3: 12 dev tests + 28 QA adversarial tests (all PASS).
  • CP5-4: Failure injection (7 tests) + distributed consistency (17 tests) + Postgres crash loop (50 iters).
  • CP5-4: 6 bugs found and fixed (lease expiry, scp auth, permissions, fdatasync, pg reinit, pgbench tables).
  • CP5-4: 26/26 tests ALL PASS on m01/M02 remote environment (1067.7s combined).
  • CP5-4: Added CleanFailoverNoDataLoss (500 PG rows survive failover via volume copy).

In Progress

  • None.

Blockers

  • None.

Next Steps

  • Phase 5 complete. Ready for Phase 6 (NVMe-oF) or other priorities.

Notes

  • SCSI test count: 53 (12 ALUA). Integration multipath tests require multipath-tools + sg3_utils.
  • Known flaky: rebuild_full_extent_midcopy_writes under full-suite CPU contention (pre-existing).
  • Known flaky: rebuild_catchup_concurrent_writes (WAL_RECYCLED timing, pre-existing).
  • Known limitation: WAL shipper barrier timeout (5s) causes degradation under heavy fdatasync workloads. PgCrashLoop shows ~50% data divergence per failover without full rebuild. Expected behavior — production would use master-driven rebuild after each failover.
  • Failover latency probe (10 iters): promote+first I/O ~30ms; total pause dominated by iSCSI login (avg 552ms, bimodal 130-180ms vs ~1170ms). Multipath should keep pause near 100-200ms; otherwise tune open-iscsi login timeout and avoid stale portals.

CP5-4 Test Catalog

Failure Injection (test/fault_test.go)

ID Test What it proves
F1 PowerLossDuringFio fdatasync'd data survives kill-9 + failover
F2 DiskFullENOSPC reads survive ENOSPC, writes recover after space freed
F3 WALCorruption WAL recovery discards corrupted tail, early data intact
F4 ReplicaDownDuringWrites primary keeps serving after replica crash mid-write
F5 SlowNetworkBarrierTimeout writes continue under 200ms netem delay (remote only)
F6 NetworkPartitionSelfFence primary self-fences on iptables partition (remote only)
F7 SnapshotDuringFailover snapshot + replication interaction, both patterns survive

Distributed Consistency (test/consistency_test.go)

ID Test What it proves
C1 EpochPersistedOnPromotion epoch survives kill-9 + restart (superblock persistence)
C2 EpochMonotonicThreePromotions 3 failovers, epoch 1→2→3, data from all phases intact
C3 StaleEpochWALRejected replica at epoch=2 rejects WAL entries from epoch=1
C4 LeaseExpiredWriteRejected writes fail after lease expiry
C5 LeaseRenewalUnderJitter lease survives 100ms netem jitter with 30s TTL (remote)
C6 PromotionDataIntegrityChecksum 10MB byte-for-byte match after failover
C7 PromotionPostgresRecovery postgres recovers from crash (single-node, no repl)
C8 DeadZoneNoWrites fencing gap verified between old/new primary
C9 RebuildWALCatchup WAL catch-up rebuild after brief replica outage
C10 RebuildFullExtent full extent rebuild after heavy writes
C11 RebuildDuringActiveWrites fio uninterrupted during rebuild
C12 GracefulDemoteNoDataLoss data intact after demote + re-promote
C13 RapidRoleFlip10x 10 rapid epoch bumps, no crash or panic
C14 LeaseTimerRealExpiry lease transitions true→false at ~5s mark
C15 DistGroupCommitEndToEnd replica WAL advances during fdatasync fio
C16 DistGroupCommitReplicaCrash primary continues in degraded mode
C17 DistGroupCommitBarrierVerify replica LSN >= primary after fdatasync

Postgres Crash Loop (test/pgcrash_test.go)

ID Test What it proves
PG1 CleanFailoverNoDataLoss 500 PG rows survive volume-copy failover, content verified
PG2 ReplicatedFailover50 49 kill→promote→recover→pgbench cycles, PG recovers