seaweedfs

History

chrislu f18ff58476 7 days ago
fix: Critical offset persistence race condition causing message loss

This fix addresses the root cause of the 28% message loss detected during consumer group rebalancing with 2 consumers:

CHANGES:
1. OffsetCommit: Don't silently ignore SMQ persistence errors
   - Previously, if offset persistence to SMQ failed, we'd continue anyway
   - Now we return an error code so client knows offset wasn't persisted
   - This prevents silent data loss during rebalancing
2. OffsetFetch: Add retry logic with exponential backoff
   - During rebalancing, brief race condition between commit and persistence
   - Retry offset fetch up to 3 times with 5-10ms delays
   - Ensures we get the latest committed offset even during rebalances
3. Enhanced Logging: Critical errors now logged at ERROR level
   - SMQ persistence failures are logged as CRITICAL with detailed context
   - Helps diagnose similar issues in production

ROOT CAUSE:
When rebalancing occurs, consumers query OffsetFetch for their next offset. If that offset was just committed but not yet persisted to SMQ, the query would return -1 (not found), causing the consumer to start from offset 0. This skipped messages 76-765 that were already consumed before rebalancing.

IMPACT:
- Fixes message loss during normal rebalancing operations
- Ensures offset persistence is mandatory, not optional
- Addresses the 28% data loss detected in comprehensive load tests

TESTING:
- Single consumer test should show 0 missing (unchanged)
- Dual consumer test should show 0 missing (was 3,413 missing)
- Rebalancing no longer causes offset gaps
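
The two behavioural changes in the commit message (surfacing persistence failures from OffsetCommit, and retrying OffsetFetch with exponential backoff) can be illustrated with a minimal Go sketch. This is not the code from offset_management.go: the `offsetStore` interface, the `lazyStore` fake, the function names, and the error code value are all hypothetical, chosen only to show the pattern. The retry count (3) and the 5 ms and 10 ms delays follow the figures quoted in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errOffsetNotFound is a hypothetical sentinel meaning "no committed offset is visible yet".
var errOffsetNotFound = errors.New("offset not found")

// Hypothetical response codes; the gateway's real error codes are not shown in the source.
const (
	errCodeNone          int16 = 0
	errCodePersistFailed int16 = -1
)

// offsetStore is an assumed stand-in for the SMQ-backed offset storage.
type offsetStore interface {
	Fetch(group, topic string, partition int32) (int64, error)
	Save(group, topic string, partition int32, offset int64) error
}

// commitOffset reports a persistence failure to the caller instead of ignoring it.
func commitOffset(s offsetStore, group, topic string, partition int32, offset int64) int16 {
	if err := s.Save(group, topic, partition, offset); err != nil {
		fmt.Printf("CRITICAL: failed to persist offset %d for %s/%s/%d: %v\n",
			offset, group, topic, partition, err)
		return errCodePersistFailed
	}
	return errCodeNone
}

// fetchOffsetWithRetry tries up to 3 times, sleeping 5ms and then 10ms between attempts.
func fetchOffsetWithRetry(s offsetStore, group, topic string, partition int32) (int64, error) {
	const maxAttempts = 3
	backoff := 5 * time.Millisecond
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		offset, err := s.Fetch(group, topic, partition)
		if err == nil {
			return offset, nil
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2 // exponential backoff: 5ms, then 10ms
		}
	}
	return -1, lastErr
}

// lazyStore is a fake store whose first hiddenFetches reads miss,
// imitating the window between a commit and its persistence becoming visible.
type lazyStore struct {
	offsets       map[string]int64
	hiddenFetches int
	failSave      bool
}

func storeKey(group, topic string, partition int32) string {
	return fmt.Sprintf("%s/%s/%d", group, topic, partition)
}

func (l *lazyStore) Fetch(group, topic string, partition int32) (int64, error) {
	if l.hiddenFetches > 0 {
		l.hiddenFetches--
		return -1, errOffsetNotFound
	}
	offset, ok := l.offsets[storeKey(group, topic, partition)]
	if !ok {
		return -1, errOffsetNotFound
	}
	return offset, nil
}

func (l *lazyStore) Save(group, topic string, partition int32, offset int64) error {
	if l.failSave {
		return errors.New("simulated persistence failure")
	}
	l.offsets[storeKey(group, topic, partition)] = offset
	return nil
}

func main() {
	store := &lazyStore{offsets: map[string]int64{}, hiddenFetches: 2}
	code := commitOffset(store, "group-1", "topic-a", 0, 766)
	offset, err := fetchOffsetWithRetry(store, "group-1", "topic-a", 0)
	fmt.Println("commit code:", code, "fetched offset:", offset, "err:", err)
}
```

In this sketch a consumer that fetches right after a commit sees the committed offset once the store catches up, rather than treating the first miss as "not found". If every attempt misses, the error is returned to the caller so it can decide what to do, instead of silently falling back to a default offset.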
batch_crc_compat_test.go	Add Kafka Gateway (#7231)	1 week ago
consumer_coordination.go	fix Node ID Mismatch, and clean up log messages	1 week ago
consumer_group_metadata.go	fix Node ID Mismatch, and clean up log messages	1 week ago
describe_cluster.go	Add Kafka Gateway (#7231)	1 week ago
errors.go	purge unused	1 week ago
fetch.go	fix: Correct throttle time semantics in Fetch responses	7 days ago
fetch_multibatch.go	less logs, remove unused code	1 week ago
fetch_partition_reader.go	fix Node ID Mismatch, and clean up log messages	1 week ago
find_coordinator.go	clean up	1 week ago
flexible_versions.go	Add Kafka Gateway (#7231)	1 week ago
group_introspection.go	Add Kafka Gateway (#7231)	1 week ago
handler.go	cleanup: Remove all temporary debug logs	7 days ago
joingroup.go	fix Node ID Mismatch, and clean up log messages	1 week ago
metadata_blocking_test.go	feat: add context timeout propagation to produce path	1 week ago
metrics.go	Add Kafka Gateway (#7231)	1 week ago
offset_management.go	fix: Critical offset persistence race condition causing message loss	7 days ago
offset_storage_adapter.go	Add Kafka Gateway (#7231)	1 week ago
produce.go	purge logs	7 days ago
record_batch_parser.go	Add Kafka Gateway (#7231)	1 week ago
record_batch_parser_test.go	Add Kafka Gateway (#7231)	1 week ago
record_extraction_test.go	Add Kafka Gateway (#7231)	1 week ago
response_cache.go	Add Kafka Gateway (#7231)	1 week ago
response_format_test.go	Add Kafka Gateway (#7231)	1 week ago
response_validation_example_test.go	Add Kafka Gateway (#7231)	1 week ago