Browse Source
CRITICAL BUG: Offset consistency race condition during rebalancing PROBLEM: In handleOffsetCommit, offsets were committed in this order: 1. Commit to in-memory cache (always succeeds) 2. Commit to persistent storage (SMQ filer) - errors silently ignored This created a divergence: - Consumer crashes before persistent commit completes - New consumer starts and fetches offset from memory (has stale value) - Or fetches from persistent storage (has old value) - Result: Messages re-read (duplicates) or skipped (missing) ROOT CAUSE: Two separate, non-atomic commit operations with no ordering constraints. In-memory cache could have offset N while persistent storage has N-50. On rebalance, consumer gets wrong starting position. SOLUTION: Atomic offset commits 1. Commit to persistent storage FIRST 2. Only if persistent commit succeeds, update in-memory cache 3. If persistent commit fails, report error to client and don't update in-memory 4. This ensures in-memory and persistent states never diverge IMPACT: - Eliminates offset divergence during crashes/rebalances - Prevents message loss from incorrect resumption offsets - Reduces duplicates from offset confusion - Ensures consumed persisted messages have: * No message loss (all produced messages read) * No duplicates (each message read once) TEST CASE: Consuming persisted messages with consumer group rebalancing should now: - Recover all produced messages (0% missing) - Not re-read any messages (0% duplicates) - Handle restarts/rebalances correctlypull/7329/head
1 changed files with 18 additions and 15 deletions
Loading…
Reference in new issue