# Phase 5 Progress

## Status
- CP5-1 through CP5-4 complete. Phase 5 DONE.

## Completed
- CP5-1: ALUA implicit support, REPORT TARGET PORT GROUPS, VPD 0x83 descriptors, write fencing on standby.
- CP5-1: Multipath config + setup script, 4 multipath integration tests.
- CP5-1: Reviewer fixes (RoleNone write regression, T_SUP flag, TPG ID validation, ASCII log).
- CP5-1: 10 ALUA unit tests + 16 adversarial tests (all PASS).
- CP5-2: CoW snapshots implemented with flusher-based CoW, delta files, and recovery.
- CP5-2: Review fixes applied (PauseAndFlush safety, snapMu race fix, beginOp/endOp, lock order doc, error propagation).
- CP5-2: 10 unit tests + 22 adversarial tests (all PASS).
- CP5-3: CHAP auth, online resize, Prometheus metrics, admin endpoints.
- CP5-3: Review fixes applied (empty secret validation, AuthMethod echo, docs).
- CP5-3: 12 dev tests + 28 QA adversarial tests (all PASS).
- CP5-4: Failure injection (7 tests) + distributed consistency (17 tests) + Postgres crash loop (50 iters).
- CP5-4: 6 bugs found and fixed (lease expiry, scp auth, permissions, fdatasync, pg reinit, pgbench tables).
- CP5-4: 26/26 tests ALL PASS on m01/M02 remote environment (1067.7s combined).
- CP5-4: Added CleanFailoverNoDataLoss (500 PG rows survive failover via volume copy).

## In Progress
- None.

## Blockers
- None.

## Next Steps
- Phase 5 complete. Ready for Phase 6 (NVMe-oF) or other priorities.

## Notes
- SCSI test count: 53 (12 ALUA). Integration multipath tests require multipath-tools + sg3_utils.
- Known flaky: rebuild_full_extent_midcopy_writes under full-suite CPU contention (pre-existing).
- Known flaky: rebuild_catchup_concurrent_writes (WAL_RECYCLED timing, pre-existing).
- Known limitation: WAL shipper barrier timeout (5s) causes degradation under heavy fdatasync
  workloads. PgCrashLoop shows ~50% data divergence per failover without full rebuild. Expected
  behavior — production would use master-driven rebuild after each failover.
- Failover latency probe (10 iters): promote+first I/O ~30ms; total pause dominated by iSCSI
  login (avg 552ms, bimodal 130-180ms vs ~1170ms). Multipath should keep pause near 100-200ms;
  otherwise tune open-iscsi login timeout and avoid stale portals.

## CP5-4 Test Catalog

### Failure Injection (`test/fault_test.go`)
| ID | Test | What it proves |
|----|------|----------------|
| F1 | PowerLossDuringFio | fdatasync'd data survives kill-9 + failover |
| F2 | DiskFullENOSPC | reads survive ENOSPC, writes recover after space freed |
| F3 | WALCorruption | WAL recovery discards corrupted tail, early data intact |
| F4 | ReplicaDownDuringWrites | primary keeps serving after replica crash mid-write |
| F5 | SlowNetworkBarrierTimeout | writes continue under 200ms netem delay (remote only) |
| F6 | NetworkPartitionSelfFence | primary self-fences on iptables partition (remote only) |
| F7 | SnapshotDuringFailover | snapshot + replication interaction, both patterns survive |

### Distributed Consistency (`test/consistency_test.go`)
| ID | Test | What it proves |
|----|------|----------------|
| C1 | EpochPersistedOnPromotion | epoch survives kill-9 + restart (superblock persistence) |
| C2 | EpochMonotonicThreePromotions | 3 failovers, epoch 1→2→3, data from all phases intact |
| C3 | StaleEpochWALRejected | replica at epoch=2 rejects WAL entries from epoch=1 |
| C4 | LeaseExpiredWriteRejected | writes fail after lease expiry |
| C5 | LeaseRenewalUnderJitter | lease survives 100ms netem jitter with 30s TTL (remote) |
| C6 | PromotionDataIntegrityChecksum | 10MB byte-for-byte match after failover |
| C7 | PromotionPostgresRecovery | postgres recovers from crash (single-node, no repl) |
| C8 | DeadZoneNoWrites | fencing gap verified between old/new primary |
| C9 | RebuildWALCatchup | WAL catch-up rebuild after brief replica outage |
| C10 | RebuildFullExtent | full extent rebuild after heavy writes |
| C11 | RebuildDuringActiveWrites | fio uninterrupted during rebuild |
| C12 | GracefulDemoteNoDataLoss | data intact after demote + re-promote |
| C13 | RapidRoleFlip10x | 10 rapid epoch bumps, no crash or panic |
| C14 | LeaseTimerRealExpiry | lease transitions true→false at ~5s mark |
| C15 | DistGroupCommitEndToEnd | replica WAL advances during fdatasync fio |
| C16 | DistGroupCommitReplicaCrash | primary continues in degraded mode |
| C17 | DistGroupCommitBarrierVerify | replica LSN >= primary after fdatasync |

### Postgres Crash Loop (`test/pgcrash_test.go`)
| ID | Test | What it proves |
|----|------|----------------|
| PG1 | CleanFailoverNoDataLoss | 500 PG rows survive volume-copy failover, content verified |
| PG2 | ReplicatedFailover50 | 49 kill→promote→recover→pgbench cycles, PG recovers |