Phase 5 Progress

Status

CP5-1: ALUA implicit support, REPORT TARGET PORT GROUPS, VPD 0x83 descriptors, write fencing on standby.
CP5-1: Multipath config + setup script, 4 multipath integration tests.
CP5-1: Reviewer fixes (RoleNone write regression, T_SUP flag, TPG ID validation, ASCII log).
CP5-1: 10 ALUA unit tests + 16 adversarial tests (all PASS).
CP5-2: CoW snapshots implemented with flusher-based CoW, delta files, and recovery.
CP5-2: Review fixes applied (PauseAndFlush safety, snapMu race fix, beginOp/endOp, lock order doc, error propagation).
CP5-2: 10 unit tests + 22 adversarial tests (all PASS).
CP5-3: CHAP auth, online resize, Prometheus metrics, admin endpoints.
CP5-3: Review fixes applied (empty secret validation, AuthMethod echo, docs).
CP5-3: 12 dev tests + 28 QA adversarial tests (all PASS).
CP5-4: Failure injection (7 tests) + distributed consistency (17 tests) + Postgres crash loop (50 iters).
CP5-4: 6 bugs found and fixed (lease expiry, scp auth, permissions, fdatasync, pg reinit, pgbench tables).
CP5-4: 26/26 tests ALL PASS on m01/m02 remote environment (1067.7s combined).
CP5-4: Added CleanFailoverNoDataLoss (500 PG rows survive failover via volume copy).
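
Sketch of the CP5-1 standby write fencing, for orientation only. It assumes a role enum and a single check in the SCSI command path. Role, SenseError, isWrite and FenceCheck are hypothetical names, not the project's API; only RoleNone appears in the notes above. The sense triple is the standard NOT READY / TARGET PORT IN STANDBY STATE code. Only the plain WRITE opcodes are fenced here; a real target would likely cover more commands.

```go
package main

import "fmt"

// Role is a hypothetical replication role for a volume. RoleNone echoes the
// name used in the status notes; the other names are assumptions.
type Role int

const (
	RoleNone    Role = iota // no replication configured; must stay writable
	RolePrimary             // active path, serves reads and writes
	RoleStandby             // standby path; writes are fenced
)

// SenseError carries the sense triple sent back with CHECK CONDITION.
type SenseError struct {
	Key, ASC, ASCQ byte
}

func (e *SenseError) Error() string {
	return fmt.Sprintf("sense key=0x%02X asc=0x%02X ascq=0x%02X", e.Key, e.ASC, e.ASCQ)
}

// errStandby is NOT READY (0x02) with ASC/ASCQ 0x04/0x0B:
// LOGICAL UNIT NOT ACCESSIBLE, TARGET PORT IN STANDBY STATE.
var errStandby = &SenseError{Key: 0x02, ASC: 0x04, ASCQ: 0x0B}

// isWrite reports whether opcode is one of the plain WRITE commands.
func isWrite(opcode byte) bool {
	switch opcode {
	case 0x0A, 0x2A, 0xAA, 0x8A: // WRITE(6), WRITE(10), WRITE(12), WRITE(16)
		return true
	}
	return false
}

// FenceCheck returns nil if the command may proceed for the given role.
// Only RoleStandby is fenced, so a volume with RoleNone keeps accepting writes.
func FenceCheck(role Role, opcode byte) error {
	if role == RoleStandby && isWrite(opcode) {
		return errStandby
	}
	return nil
}
```

FenceCheck(RoleStandby, 0x2A) returns the standby sense error, while FenceCheck(RoleNone, 0x2A) returns nil. That second case is the one the RoleNone write regression fix is about.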

SCSI test count: 53 (12 ALUA). Integration multipath tests require multipath-tools + sg3_utils.
Known flaky: rebuild_full_extent_midcopy_writes under full-suite CPU contention (pre-existing).
Known flaky: rebuild_catchup_concurrent_writes (WAL_RECYCLED timing, pre-existing).
Known limitation: the WAL shipper barrier timeout (5s) causes degradation under heavy fdatasync workloads. Without a full rebuild, PgCrashLoop shows ~50% data divergence per failover. This is expected behavior; production would use a master-driven rebuild after each failover.
Failover latency probe (10 iters): promote + first I/O takes ~30ms; the total pause is dominated by iSCSI login (avg 552ms, bimodal: 130-180ms vs ~1170ms). With multipath the pause should stay near 100-200ms; without it, tune the open-iscsi login timeout and avoid stale portals.
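
Sketch of how a barrier timeout can turn into degraded mode, to illustrate the known limitation above. It assumes the primary waits for a replica ack after each local fdatasync and gives up after the timeout. Shipper, waitBarrier, the ack channel and both error values are hypothetical, not the project's actual types.

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical sentinels for the two ways a barrier can fail.
var (
	ErrBarrierTimeout = errors.New("replica barrier timed out")
	ErrReplicaGone    = errors.New("replica ack channel closed")
)

// waitBarrier blocks until the replica acknowledges an LSN >= lsn or the
// timeout elapses. acks is assumed to carry the replica's flushed LSNs.
func waitBarrier(acks <-chan uint64, lsn uint64, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case acked, ok := <-acks:
			if !ok {
				return ErrReplicaGone
			}
			if acked >= lsn {
				return nil
			}
		case <-timer.C:
			return ErrBarrierTimeout
		}
	}
}

// Shipper tracks whether the replica is still in sync.
// Not safe for concurrent use; a real shipper would need locking.
type Shipper struct {
	Timeout  time.Duration // 5s in the notes above
	Degraded bool
}

// Sync is meant to run after the local fdatasync has completed. It reports
// whether the replica confirmed lsn. On failure the shipper is marked
// degraded and later calls return immediately, so local writes keep flowing
// while the replica falls behind.
func (s *Shipper) Sync(acks <-chan uint64, lsn uint64) bool {
	if s.Degraded {
		return false
	}
	if err := waitBarrier(acks, lsn, s.Timeout); err != nil {
		s.Degraded = true
		return false
	}
	return true
}
```

Once Degraded is set, nothing in this sketch clears it. That matches the note above that a rebuild is needed before the replica can be trusted again.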

Failure injection tests (CP5-4)

ID	Test	What it proves
F1	PowerLossDuringFio	fdatasync'd data survives kill-9 + failover
F2	DiskFullENOSPC	reads survive ENOSPC, writes recover after space freed
F3	WALCorruption	WAL recovery discards corrupted tail, early data intact
F4	ReplicaDownDuringWrites	primary keeps serving after replica crash mid-write
F5	SlowNetworkBarrierTimeout	writes continue under 200ms netem delay (remote only)
F6	NetworkPartitionSelfFence	primary self-fences on iptables partition (remote only)
F7	SnapshotDuringFailover	snapshot + replication interaction, both patterns survive
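
Sketch of the idea behind F3 (WAL recovery discards a corrupted tail). It assumes a record layout of 4-byte little-endian length, 4-byte CRC-32 and then the payload. That layout, and the names AppendRecord and Recover, are illustrative assumptions and not the project's on-disk format.

```go
package main

import (
	"encoding/binary"
	"hash/crc32"
)

// headerSize is the assumed record header: 4-byte length + 4-byte CRC-32.
const headerSize = 8

// AppendRecord appends one length- and checksum-prefixed record to wal.
func AppendRecord(wal []byte, payload []byte) []byte {
	var hdr [headerSize]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	wal = append(wal, hdr[:]...)
	return append(wal, payload...)
}

// Recover scans wal from the start and stops at the first record that is
// truncated or fails its checksum. It returns the payloads that verified and
// the offset where the valid prefix ends; the caller would truncate the log
// to validEnd so the corrupted tail is discarded.
func Recover(wal []byte) (records [][]byte, validEnd int) {
	off := 0
	for len(wal)-off >= headerSize {
		n := binary.LittleEndian.Uint32(wal[off : off+4])
		sum := binary.LittleEndian.Uint32(wal[off+4 : off+8])
		start := off + headerSize
		if uint64(n) > uint64(len(wal)-start) {
			break // torn write: payload runs past the end of the log
		}
		payload := wal[start : start+int(n)]
		if crc32.ChecksumIEEE(payload) != sum {
			break // corrupted record: everything from here on is untrusted
		}
		records = append(records, payload)
		off = start + int(n)
	}
	return records, off
}
```

Stopping at the first bad record, rather than skipping it, keeps replay order intact: nothing after a corrupted entry is applied, and everything before it survives.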

Distributed consistency tests (CP5-4)

ID	Test	What it proves
C1	EpochPersistedOnPromotion	epoch survives kill-9 + restart (superblock persistence)
C2	EpochMonotonicThreePromotions	3 failovers, epoch 1→2→3, data from all phases intact
C3	StaleEpochWALRejected	replica at epoch=2 rejects WAL entries from epoch=1
C4	LeaseExpiredWriteRejected	writes fail after lease expiry
C5	LeaseRenewalUnderJitter	lease survives 100ms netem jitter with 30s TTL (remote)
C6	PromotionDataIntegrityChecksum	10MB byte-for-byte match after failover
C7	PromotionPostgresRecovery	postgres recovers from crash (single-node, no repl)
C8	DeadZoneNoWrites	fencing gap verified between old/new primary
C9	RebuildWALCatchup	WAL catch-up rebuild after brief replica outage
C10	RebuildFullExtent	full extent rebuild after heavy writes
C11	RebuildDuringActiveWrites	fio uninterrupted during rebuild
C12	GracefulDemoteNoDataLoss	data intact after demote + re-promote
C13	RapidRoleFlip10x	10 rapid epoch bumps, no crash or panic
C14	LeaseTimerRealExpiry	lease transitions true→false at ~5s mark
C15	DistGroupCommitEndToEnd	replica WAL advances during fdatasync fio
C16	DistGroupCommitReplicaCrash	primary continues in degraded mode
C17	DistGroupCommitBarrierVerify	replica LSN >= primary after fdatasync
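
Sketch of the two fencing checks exercised by C3 (stale-epoch WAL rejected) and C4/C14 (writes rejected after lease expiry). Replica, Lease and the error values are hypothetical names. The clock is passed in explicitly so the expiry boundary can be checked without sleeping. How a replica handles entries from a newer epoch is left out.

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical sentinels for the two rejection paths.
var (
	ErrStaleEpoch   = errors.New("WAL entry from stale epoch")
	ErrLeaseExpired = errors.New("write lease expired")
)

// Replica holds the highest epoch it has been promoted or told to follow.
type Replica struct {
	Epoch uint64
}

// Accept rejects WAL entries stamped with an epoch older than the replica's,
// which is how a demoted primary's late writes get fenced out.
func (r *Replica) Accept(entryEpoch uint64) error {
	if entryEpoch < r.Epoch {
		return ErrStaleEpoch
	}
	return nil
}

// Lease is a time-bounded right to accept writes.
// The zero value is an expired lease.
type Lease struct {
	Expires time.Time
}

// Renew extends the lease to now+ttl.
func (l *Lease) Renew(now time.Time, ttl time.Duration) {
	l.Expires = now.Add(ttl)
}

// Valid reports whether the lease still covers the instant now.
func (l *Lease) Valid(now time.Time) bool {
	return now.Before(l.Expires)
}

// CheckWrite is the guard a write path would call before touching the volume.
func (l *Lease) CheckWrite(now time.Time) error {
	if !l.Valid(now) {
		return ErrLeaseExpired
	}
	return nil
}
```

With a 5s TTL renewed at time t, CheckWrite succeeds at t+4s and returns ErrLeaseExpired at t+5s. In a live system the caller would pass time.Now().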

Postgres failover tests (CP5-4)

ID	Test	What it proves
PG1	CleanFailoverNoDataLoss	500 PG rows survive volume-copy failover, content verified
PG2	ReplicatedFailover50	49 kill→promote→recover→pgbench cycles, PG recovers