This fix addresses the root cause of the 28% message loss detected during
consumer group rebalancing with two consumers.
CHANGES:
1. **OffsetCommit**: Don't silently ignore SMQ persistence errors
- Previously, if offset persistence to SMQ failed, we continued anyway
- Now we return an error code so the client knows the offset wasn't persisted
- This prevents silent data loss during rebalancing
2. **OffsetFetch**: Add retry logic with exponential backoff
- During rebalancing, there is a brief race between an offset commit and its persistence
- Retry offset fetch up to 3 times with 5-10ms delays
- Ensures we get the latest committed offset even during rebalances
3. **Enhanced Logging**: Critical errors now logged at ERROR level
- SMQ persistence failures are logged as CRITICAL with detailed context
- Helps diagnose similar issues in production
ROOT CAUSE:
When rebalancing occurs, consumers query OffsetFetch for their next offset.
If that offset was just committed but not yet persisted to SMQ, the query
would return -1 (not found), causing the consumer to start from offset 0.
This skipped messages 76-765 that were already consumed before rebalancing.
IMPACT:
- Fixes message loss during normal rebalancing operations
- Ensures offset persistence is mandatory, not optional
- Addresses the 28% data loss detected in comprehensive load tests
TESTING:
- Single consumer test should show 0 missing (unchanged)
- Dual consumer test should show 0 missing (was 3,413 missing)
- Rebalancing no longer causes offset gaps
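
The "0 missing" check can be pictured as a simple gap count over consumed offsets. The sketch below is an assumption about how such a check might look, not the load test's actual code; `countMissing` and the toy data are hypothetical.

```go
package main

import "fmt"

// countMissing returns how many offsets in [0, produced) never appear in
// consumed. Duplicates and out-of-range offsets are ignored.
func countMissing(produced int64, consumed []int64) int64 {
	seen := make(map[int64]struct{}, len(consumed))
	for _, off := range consumed {
		if off >= 0 && off < produced {
			seen[off] = struct{}{}
		}
	}
	return produced - int64(len(seen))
}

func main() {
	// Toy data: 10 messages produced, offsets 3-5 never consumed,
	// offset 9 delivered twice.
	consumed := []int64{0, 1, 2, 6, 7, 8, 9, 9}
	fmt.Println(countMissing(10, consumed))
}
```

Counting distinct offsets rather than total deliveries keeps redelivered messages from masking a gap.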