Browse Source
fix: Remove context timeout propagation from produce that breaks consumer init
fix: Remove context timeout propagation from produce that breaks consumer init
Commit e1a4bff79
applied Kafka client-side timeout to the entire produce
operation context, which breaks Schema Registry consumer initialization.
The bug:
- Schema Registry Produce request has 60000ms timeout
- This timeout was being applied to entire broker operation context
- Consumer initialization takes time (joins group, gets assignments, seeks, polls)
- If initialization isn't done before 60s, context times out
- Publish returns "context deadline exceeded" error
- Schema Registry times out
The fix:
- Remove context.WithTimeout() calls from produce handlers
- Revert to NOT applying client timeout to internal broker operations
- This allows consumer initialization to take as long as needed
- Kafka request will still timeout at protocol level naturally
NOTE: Consumer still not sending Fetch requests - there's likely a deeper
issue with consumer group coordination or partition assignment in the
gateway, separate from this timeout issue.
This removes the obvious timeout bug but may not completely fix SR init.
debug: Add instrumentation for Noop record timeout investigation
- Added critical debug logging to server.go connection acceptance
- Added handleProduce entry point logging
- Added 30+ debug statements to produce.go for Noop record tracing
- Created comprehensive investigation report
CRITICAL FINDING: Gateway accepts connections but requests hang in HandleConn()
request reading loop - no requests ever reach processRequestSync()
Files modified:
- weed/mq/kafka/gateway/server.go: Connection acceptance and HandleConn logging
- weed/mq/kafka/protocol/produce.go: Request entry logging and Noop tracing
See /tmp/INVESTIGATION_FINAL_REPORT.md for full analysis
Issue: Schema Registry Noop record write times out after 60 seconds
Root Cause: Kafka protocol request reading hangs in HandleConn loop
Status: Requires further debugging of request parsing logic in handler.go
debug: Add request reading loop instrumentation to handler.go
CRITICAL FINDING: Requests ARE being read and queued!
- Request header parsing works correctly
- Requests are successfully sent to data/control plane channels
- apiKey=3 (FindCoordinator) requests visible in logs
- Request queuing is NOT the bottleneck
Remaining issue: No Produce (apiKey=0) requests seen from Schema Registry
Hypothesis: Schema Registry stuck in metadata/coordinator discovery
Debug logs added to trace:
- Message size reading
- Message body reading
- API key/version/correlation ID parsing
- Request channel queuing
Next: Investigate why Produce requests not appearing
discovery: Add Fetch API logging - confirms consumer never initializes
SMOKING GUN CONFIRMED: Consumer NEVER sends Fetch requests!
Testing shows:
- Zero Fetch (apiKey=1) requests logged from Schema Registry
- Consumer never progresses past initialization
- This proves consumer group coordination is broken
Root Cause Confirmed:
The issue is NOT in Produce/Noop record handling.
The issue is NOT in message serialization.
The issue IS:
- Consumer cannot join group (JoinGroup/SyncGroup broken?)
- Consumer cannot assign partitions
- Consumer cannot begin fetching
This causes:
1. KafkaStoreReaderThread.doWork() hangs in consumer.poll()
2. Reader never signals initialization complete
3. Producer waiting for Noop ack times out
4. Schema Registry startup fails after 60 seconds
Next investigation:
- Add logging for JoinGroup (apiKey=11)
- Add logging for SyncGroup (apiKey=14)
- Add logging for Heartbeat (apiKey=12)
- Determine where in initialization the consumer gets stuck
Added Fetch API explicit logging that confirms it's never called.
pull/7329/head
6 changed files with 108 additions and 68 deletions
-
2test/kafka/kafka-client-loadtest/docker-compose.yml
-
6weed/mq/kafka/gateway/server.go
-
6weed/mq/kafka/integration/broker_client.go
-
50weed/mq/kafka/integration/broker_client_publish.go
-
16weed/mq/kafka/protocol/handler.go
-
96weed/mq/kafka/protocol/produce.go
Write
Preview
Loading…
Cancel
Save
Reference in new issue