seaweedfs

chrislu 98ec5e03eb perf: add partition assignment cache in gateway to eliminate 13.5% CPU overhead

CRITICAL: Gateway calling LookupTopicBrokers on EVERY fetch to translate Kafka partition IDs to SeaweedFS partition ranges!

Problem (from CPU profile):
- getActualPartitionAssignment: 13.52% CPU (1.71s out of 12.65s)
- Called bc.client.LookupTopicBrokers on line 228 for EVERY fetch
- With 250 fetches/sec, this means 250 LookupTopicBrokers calls/sec!
- No caching at all - same overhead as broker had before optimization

Root Cause:
Gateway needs to translate Kafka partition IDs (0, 1, 2...) to SeaweedFS partition ranges (0-341, 342-682, etc.) for every fetch request. This translation requires calling LookupTopicBrokers to get partition assignments.

Without caching, every fetch request triggered:
1. gRPC call to broker (LookupTopicBrokers)
2. Broker reads from its cache (fast now after broker optimization)
3. gRPC response back to gateway
4. Gateway computes partition range mapping

The gRPC round-trip overhead was consuming 13.5% CPU even though broker cache was fast!
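The ID-to-range translation described under Root Cause can be sketched roughly as follows. This is an illustration only, not the gateway's code: it assumes the partition ring is cut into equal half-open slices with the last slice absorbing the remainder. The names `partitionRange` and `ringSize` are hypothetical, and the value 1024 is picked only because it gives boundaries close to the 0-341, 342-682 example; the real constant and boundary convention may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// ringSize is an ASSUMED total slot count, not taken from the SeaweedFS source.
const ringSize int32 = 1024

// partitionRange maps a Kafka partition ID to a half-open slot range
// [start, stop) on the ring. Hypothetical helper for illustration.
func partitionRange(kafkaPartition, totalPartitions int32) (start, stop int32, err error) {
	if totalPartitions <= 0 || totalPartitions > ringSize {
		return 0, 0, errors.New("invalid partition count")
	}
	if kafkaPartition < 0 || kafkaPartition >= totalPartitions {
		return 0, 0, errors.New("partition out of range")
	}
	size := ringSize / totalPartitions
	start = kafkaPartition * size
	stop = start + size
	if kafkaPartition == totalPartitions-1 {
		stop = ringSize // last slice absorbs the remainder
	}
	return start, stop, nil
}

func main() {
	for p := int32(0); p < 3; p++ {
		start, stop, _ := partitionRange(p, 3)
		fmt.Printf("kafka partition %d -> slots [%d, %d)\n", p, start, stop)
	}
}
```

The arithmetic itself is cheap; the cost identified in the profile comes from fetching the assignments over gRPC before this mapping can be matched against them.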
Solution:
Added partitionAssignmentCache to BrokerClient:

Changes to types.go:
- Added partitionAssignmentCacheEntry struct (assignments + expiresAt)
- Added cache fields to BrokerClient:
  * partitionAssignmentCache map[string]partitionAssignmentCacheEntry
  * partitionAssignmentCacheMu sync.RWMutex
  * partitionAssignmentCacheTTL time.Duration

Changes to broker_client.go:
- Initialize partitionAssignmentCache in NewBrokerClientWithFilerAccessor
- Set partitionAssignmentCacheTTL to 30 seconds (same as broker)

Changes to broker_client_publish.go:
- Added "time" import
- Modified getActualPartitionAssignment() to check cache first:
  * Cache HIT: Use cached assignments (fast ✅)
  * Cache MISS: Call LookupTopicBrokers, cache result for 30s
- Extracted findPartitionInAssignments() helper function
  * Contains range calculation and partition matching logic
  * Reused for both cached and fresh lookups

Cache Behavior:
- First fetch: Cache MISS -> LookupTopicBrokers (~2ms) -> cache for 30s
- Next 7500 fetches in 30s (250 fetches/sec x 30s): Cache HIT -> immediate return (~0.01ms)
- Cache automatically expires after 30s, re-validates on next fetch

Performance Impact:
With 250 fetches/sec and 5 topics:
- Before: 250 LookupTopicBrokers/sec = 500ms overhead per second (250 x ~2ms)
- After: 0.17 LookupTopicBrokers/sec (5 topics / 30s TTL)
- Reduction: 99.93% fewer gRPC calls

Expected CPU Reduction:
- Before: 12.65s total, 1.71s in getActualPartitionAssignment (13.5%)
- After: ~11s total (-13.5% = ~1.7s saved)
- Benefit: 13% lower CPU, more capacity for actual message processing

Cache Consistency:
- Same 30-second TTL as broker's topic config cache
- Partition assignments rarely change (only on topic reconfiguration)
- 30-second staleness is acceptable for partition mapping
- Gateway will eventually converge with broker's view

Testing:
- ✅ Compiles successfully
- Ready to deploy and measure CPU improvement

Priority: CRITICAL - Eliminates major performance bottleneck in gateway fetch path

3 months ago
broker_client.go	perf: add partition assignment cache in gateway to eliminate 13.5% CPU overhead	3 months ago
broker_client_fetch.go	refactor: change remaining glog.Infof debug messages to V(3)	3 months ago
broker_client_publish.go	perf: add partition assignment cache in gateway to eliminate 13.5% CPU overhead	3 months ago
broker_client_restart_test.go	Add Kafka Gateway (#7231)	3 months ago
broker_client_subscribe.go	refactor: reduce verbosity of debug log messages	3 months ago
broker_error_mapping.go	Add Kafka Gateway (#7231)	3 months ago
broker_error_mapping_test.go	Add Kafka Gateway (#7231)	3 months ago
fetch_performance_test.go	Add Kafka Gateway (#7231)	3 months ago
record_retrieval_test.go	Add Kafka Gateway (#7231)	3 months ago
seaweedmq_handler.go	refactor: reduce verbosity of debug log messages	3 months ago
seaweedmq_handler_test.go	feat: add context timeout propagation to produce path	3 months ago
seaweedmq_handler_topics.go	Add Kafka Gateway (#7231)	3 months ago
seaweedmq_handler_utils.go	Add Kafka Gateway (#7231)	3 months ago
test_helper.go	Add Kafka Gateway (#7231)	3 months ago
types.go	perf: add partition assignment cache in gateway to eliminate 13.5% CPU overhead	3 months ago