Phase 5 Progress
Status
- CP5-1 through CP5-4 complete. Phase 5 DONE.
Completed
- CP5-1: ALUA implicit support, REPORT TARGET PORT GROUPS, VPD 0x83 descriptors, write fencing on standby.
- CP5-1: Multipath config + setup script, 4 multipath integration tests.
- CP5-1: Reviewer fixes (RoleNone write regression, T_SUP flag, TPG ID validation, ASCII log).
- CP5-1: 10 ALUA unit tests + 16 adversarial tests (all PASS).
- CP5-2: CoW snapshots implemented with flusher-based CoW, delta files, and recovery.
- CP5-2: Review fixes applied (PauseAndFlush safety, snapMu race fix, beginOp/endOp, lock order doc, error propagation).
- CP5-2: 10 unit tests + 22 adversarial tests (all PASS).
- CP5-3: CHAP auth, online resize, Prometheus metrics, admin endpoints.
- CP5-3: Review fixes applied (empty secret validation, AuthMethod echo, docs).
- CP5-3: 12 dev tests + 28 QA adversarial tests (all PASS).
- CP5-4: Failure injection (7 tests) + distributed consistency (17 tests) + Postgres crash loop (50 iters).
- CP5-4: 6 bugs found and fixed (lease expiry, scp auth, permissions, fdatasync, pg reinit, pgbench tables).
- CP5-4: 26/26 tests ALL PASS on m01/M02 remote environment (1067.7s combined).
- CP5-4: Added CleanFailoverNoDataLoss (500 PG rows survive failover via volume copy).
In Progress
Blockers
Next Steps
- Phase 5 complete. Ready for Phase 6 (NVMe-oF) or other priorities.
Notes
- SCSI test count: 53 (12 ALUA). Integration multipath tests require multipath-tools + sg3_utils.
- Known flaky: rebuild_full_extent_midcopy_writes under full-suite CPU contention (pre-existing).
- Known flaky: rebuild_catchup_concurrent_writes (WAL_RECYCLED timing, pre-existing).
- Known limitation: WAL shipper barrier timeout (5s) causes degradation under heavy fdatasync
workloads. PgCrashLoop shows ~50% data divergence per failover without full rebuild. This is
expected behavior; production would use master-driven rebuild after each failover.
- Failover latency probe (10 iters): promote+first I/O ~30ms; total pause dominated by iSCSI
login (avg 552ms, bimodal 130-180ms vs ~1170ms). Multipath should keep pause near 100-200ms;
otherwise tune open-iscsi login timeout and avoid stale portals.
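The barrier-timeout limitation noted above can be illustrated with a minimal Go sketch. It assumes the shipper waits for a replica acknowledgement on a channel and flips to a degraded state when the wait times out. The names `shipper`, `waitBarrier`, `ErrBarrierTimeout`, and `demoTimeout` are hypothetical and are not taken from the project's code; the 50ms timeout stands in for the 5s value mentioned in the notes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrBarrierTimeout is a hypothetical error used only in this sketch.
var ErrBarrierTimeout = errors.New("barrier timeout: no replica ack")

// demoTimeout is shortened for the demo; the notes mention 5s.
const demoTimeout = 50 * time.Millisecond

// waitBarrier blocks until the replica acks or the timeout elapses.
func waitBarrier(ack <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-ack:
		return nil
	case <-timer.C:
		return ErrBarrierTimeout
	}
}

// shipper is a hypothetical stand-in for the WAL shipper state.
type shipper struct {
	degraded bool
}

// Sync waits for one barrier ack. On timeout the shipper is marked
// degraded and later calls return immediately, so local writes are
// no longer held up by the replica (which then needs a rebuild).
func (s *shipper) Sync(ack <-chan struct{}, timeout time.Duration) {
	if s.degraded {
		return
	}
	if err := waitBarrier(ack, timeout); err != nil {
		s.degraded = true
	}
}

func main() {
	s := &shipper{}

	acked := make(chan struct{}, 1)
	acked <- struct{}{} // replica answered in time
	s.Sync(acked, demoTimeout)
	fmt.Println("degraded after timely ack:", s.degraded)

	silent := make(chan struct{}) // replica never answers
	s.Sync(silent, demoTimeout)
	fmt.Println("degraded after missed ack:", s.degraded)
}
```

Under this model, a workload whose fdatasync rate outpaces the replica would hit the timeout and leave the replica behind, which is consistent with the divergence the notes describe until a rebuild runs.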
CP5-4 Test Catalog
Failure Injection (test/fault_test.go)
| ID | Test | What it proves |
|----|------|----------------|
| F1 | PowerLossDuringFio | fdatasync'd data survives kill-9 + failover |
| F2 | DiskFullENOSPC | reads survive ENOSPC, writes recover after space freed |
| F3 | WALCorruption | WAL recovery discards corrupted tail, early data intact |
| F4 | ReplicaDownDuringWrites | primary keeps serving after replica crash mid-write |
| F5 | SlowNetworkBarrierTimeout | writes continue under 200ms netem delay (remote only) |
| F6 | NetworkPartitionSelfFence | primary self-fences on iptables partition (remote only) |
| F7 | SnapshotDuringFailover | snapshot + replication interaction, both patterns survive |
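The tail-discard behavior that F3 checks can be sketched as follows. This is an illustration under assumptions, not the project's WAL format: the record layout (4-byte little-endian length, 4-byte CRC32 of the payload, then the payload) and the names `appendRecord` and `recoverLog` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// headerSize covers the assumed header: 4-byte length + 4-byte CRC32.
const headerSize = 8

// appendRecord appends one checksummed record to buf.
func appendRecord(buf, payload []byte) []byte {
	var hdr [headerSize]byte
	binary.LittleEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.LittleEndian.PutUint32(hdr[4:8], crc32.ChecksumIEEE(payload))
	buf = append(buf, hdr[:]...)
	return append(buf, payload...)
}

// recoverLog scans records from the start and stops at the first
// truncated or checksum-mismatched record. It returns the valid
// payloads and the length of the valid prefix; everything after
// that offset is treated as a corrupted tail and discarded.
func recoverLog(buf []byte) (records [][]byte, validLen int) {
	off := 0
	for off+headerSize <= len(buf) {
		n := binary.LittleEndian.Uint32(buf[off : off+4])
		sum := binary.LittleEndian.Uint32(buf[off+4 : off+8])
		start := off + headerSize
		if uint64(n) > uint64(len(buf)-start) {
			break // record claims more bytes than remain
		}
		payload := buf[start : start+int(n)]
		if crc32.ChecksumIEEE(payload) != sum {
			break // payload does not match its checksum
		}
		records = append(records, payload)
		off = start + int(n)
	}
	return records, off
}

func main() {
	buf := appendRecord(nil, []byte("alpha"))
	buf = appendRecord(buf, []byte("beta"))
	buf = appendRecord(buf, []byte("gamma"))

	buf[len(buf)-1] ^= 0xFF // simulate corruption in the last record

	records, validLen := recoverLog(buf)
	fmt.Println("recovered records:", len(records))
	fmt.Println("valid prefix bytes:", validLen, "of", len(buf))
}
```

Stopping at the first bad record, rather than skipping it, keeps the recovered log a strict prefix of what was written, so early data stays intact and ordering is preserved.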
Distributed Consistency (test/consistency_test.go)
| ID | Test | What it proves |
|----|------|----------------|
| C1 | EpochPersistedOnPromotion | epoch survives kill-9 + restart (superblock persistence) |
| C2 | EpochMonotonicThreePromotions | 3 failovers, epoch 1→2→3, data from all phases intact |
| C3 | StaleEpochWALRejected | replica at epoch=2 rejects WAL entries from epoch=1 |
| C4 | LeaseExpiredWriteRejected | writes fail after lease expiry |
| C5 | LeaseRenewalUnderJitter | lease survives 100ms netem jitter with 30s TTL (remote) |
| C6 | PromotionDataIntegrityChecksum | 10MB byte-for-byte match after failover |
| C7 | PromotionPostgresRecovery | postgres recovers from crash (single-node, no repl) |
| C8 | DeadZoneNoWrites | fencing gap verified between old/new primary |
| C9 | RebuildWALCatchup | WAL catch-up rebuild after brief replica outage |
| C10 | RebuildFullExtent | full extent rebuild after heavy writes |
| C11 | RebuildDuringActiveWrites | fio uninterrupted during rebuild |
| C12 | GracefulDemoteNoDataLoss | data intact after demote + re-promote |
| C13 | RapidRoleFlip10x | 10 rapid epoch bumps, no crash or panic |
| C14 | LeaseTimerRealExpiry | lease transitions true→false at ~5s mark |
| C15 | DistGroupCommitEndToEnd | replica WAL advances during fdatasync fio |
| C16 | DistGroupCommitReplicaCrash | primary continues in degraded mode |
| C17 | DistGroupCommitBarrierVerify | replica LSN >= primary after fdatasync |
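Two of the checks in this catalog, stale-epoch rejection (C3) and lease-gated writes (C4, C14), can be sketched in a few lines of Go. Everything here is an assumption made for illustration: the types `replica`, `lease`, and `primary`, the error values, the helper `at`, the 5s `demoTTL`, and the choice to let a replica adopt a newer epoch when it sees one. Time is injected rather than read from the clock so the behavior is deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical error values for this sketch.
var (
	ErrStaleEpoch   = errors.New("stale epoch")
	ErrLeaseExpired = errors.New("lease expired")
)

// demoTTL is an illustrative lease lifetime.
const demoTTL = 5 * time.Second

// at returns a fixed instant sec seconds after the Unix epoch.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

// replica tracks the highest epoch and LSN it has accepted.
type replica struct {
	epoch uint64
	lsn   uint64
}

// applyWAL rejects entries stamped with an epoch older than the
// replica's own; newer epochs are adopted (an assumption).
func (r *replica) applyWAL(entryEpoch, lsn uint64) error {
	if entryEpoch < r.epoch {
		return ErrStaleEpoch
	}
	r.epoch = entryEpoch
	r.lsn = lsn
	return nil
}

// lease is valid strictly before expiresAt.
type lease struct {
	expiresAt time.Time
}

func (l *lease) renew(now time.Time, ttl time.Duration) { l.expiresAt = now.Add(ttl) }
func (l *lease) valid(now time.Time) bool               { return now.Before(l.expiresAt) }

// primary refuses writes once its lease has lapsed.
type primary struct {
	lease lease
}

func (p *primary) write(now time.Time) error {
	if !p.lease.valid(now) {
		return ErrLeaseExpired
	}
	return nil
}

func main() {
	r := &replica{epoch: 2}
	fmt.Println("entry from epoch 1:", r.applyWAL(1, 10))
	fmt.Println("entry from epoch 2:", r.applyWAL(2, 11))

	p := &primary{}
	p.lease.renew(at(0), demoTTL)
	fmt.Println("write at +1s:", p.write(at(1)))
	fmt.Println("write at +6s:", p.write(at(6)))
}
```

The two checks cover opposite sides of a failover: the lease stops an old primary from accepting writes it can no longer safely own, and the epoch comparison stops a replica from applying anything that old primary may still send.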
Postgres Crash Loop (test/pgcrash_test.go)
| ID | Test | What it proves |
|----|------|----------------|
| PG1 | CleanFailoverNoDataLoss | 500 PG rows survive volume-copy failover, content verified |
| PG2 | ReplicatedFailover50 | 49 kill→promote→recover→pgbench cycles, PG recovers |