ROOT CAUSE IDENTIFIED: The issue with objects like '//bar', '//testobjfoo',
'//testobjbar', and '/key' was due to inconsistent path normalization between
object upload and versioned metadata operations.
PROBLEM:
- toFilerUrl() calls removeDuplicateSlashes() normalizing '//bar' → '/bar'
- But versioned operations used raw object paths: '//bar.versions'
- This created a mismatch where version files were stored under '/bar.versions/'
but .versions directory metadata was stored under '//bar.versions'
- Filer lookups failed because paths didn't match
SOLUTION:
- Apply removeDuplicateSlashes() consistently in all versioned operations:
  - putVersionedObject: normalize before creating .versions directory
  - getLatestObjectVersion: normalize before looking up .versions directory
  - getSpecificObjectVersion: normalize for all version operations
  - deleteSpecificObjectVersion: normalize for version deletion
- Ensures all version-related paths use the same normalization as toFilerUrl()
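A minimal sketch of the normalization described above, for illustration only. The body of removeDuplicateSlashes here is a stand-in and may differ from the real helper; versionsDir and the "/buckets/b" prefix are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// removeDuplicateSlashes collapses every run of '/' into a single '/'.
// Illustrative stand-in for the helper named in this commit message.
func removeDuplicateSlashes(object string) string {
	var b strings.Builder
	b.Grow(len(object))
	prevSlash := false
	for _, r := range object {
		if r == '/' {
			if prevSlash {
				continue // skip the 2nd, 3rd, ... slash of a run
			}
			prevSlash = true
		} else {
			prevSlash = false
		}
		b.WriteRune(r)
	}
	return b.String()
}

// versionsDir is a hypothetical helper: it derives the .versions directory
// path from the normalized object key, so uploads and versioned metadata
// operations agree on one path.
func versionsDir(bucketDir, object string) string {
	return bucketDir + removeDuplicateSlashes(object) + ".versions"
}

func main() {
	for _, key := range []string{"//bar", "//testobjfoo", "//testobjbar", "/key"} {
		fmt.Println(key, "->", versionsDir("/buckets/b", key))
	}
}
```

Routing every versioned operation through one helper like versionsDir keeps the raw key from reaching the filer in any of them.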
This should resolve the persistent CI failures for objects with double slashes
in their paths, eliminating the 'filer: no entry is found in filer store' errors
that even 8 retries with exponential backoff couldn't resolve.
- Increase retry attempts from 5 to 8 for both updateLatestVersionInDirectory
and getLatestObjectVersion functions
- Increase base delay from 50ms to 100ms with exponential backoff up to 6.4s
- Add specific retry logic for 'no Extended metadata' race condition where
.versions directory exists but metadata is not yet written
- Add detailed timing logs to track retry delays and total wait times
- Addresses persistent CI failures where even 5 retries with 400ms max delay
were insufficient for filer store consistency in GitHub Actions environment
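The retry schedule above can be sketched as follows. This is an illustration, not the actual code: backoffDelays and retry are hypothetical helpers, and the sleep function is injected only so the example runs instantly. With 8 attempts and a 100ms base there are 7 waits, the last being 6.4s; with 5 attempts and a 50ms base the last wait is 400ms.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffDelays returns the waits between attempts: base, 2*base, 4*base, ...
// n attempts need n-1 waits.
func backoffDelays(attempts int, base time.Duration) []time.Duration {
	if attempts < 2 {
		return nil
	}
	delays := make([]time.Duration, 0, attempts-1)
	d := base
	for i := 0; i < attempts-1; i++ {
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// retry calls op up to attempts times, waiting between failed attempts.
// It returns nil on the first success, or the last error wrapped.
func retry(attempts int, base time.Duration, sleep func(time.Duration), op func() error) error {
	delays := backoffDelays(attempts, base)
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < len(delays) {
			sleep(delays[i])
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

func main() {
	fmt.Println(backoffDelays(8, 100*time.Millisecond))
	fmt.Println(backoffDelays(5, 50*time.Millisecond))

	calls := 0
	noSleep := func(time.Duration) {} // replace with time.Sleep in real use
	err := retry(8, 100*time.Millisecond, noSleep, func() error {
		calls++
		if calls < 3 {
			return errors.New("filer: no entry is found in filer store")
		}
		return nil
	})
	fmt.Println(calls, err)
}
```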
- Increase retry attempts from 3 to 5 for both updateLatestVersionInDirectory
and getLatestObjectVersion functions
- Implement exponential backoff: 50ms, 100ms, 200ms, 400ms delays
- Addresses persistent CI failures where .versions directories are not
immediately visible after creation in filer store
- Based on CI log analysis showing 50ms fixed delays were insufficient
- Maintains CI debug logging to track improved retry behavior
- Add retry logic to updateLatestVersionInDirectory to handle cases where
.versions directory creation succeeds but is not immediately visible
- Add retry logic to getLatestObjectVersion for the same consistency issue
- Use 3 retries with 50ms delays to handle filer store consistency timing
- Addresses CI failures where 'filer: no entry is found in filer store'
occurs after successful directory creation
- Maintains CI debug logging to track retry attempts and outcomes
- Add detailed logging for .versions metadata updates in putVersionedObject
- Add logging for latest version resolution in getLatestObjectVersion
- Add logging for HeadObject latest version requests
- All logs use glog.V(0) with CI-DEBUG prefix for easy filtering
- Will help diagnose timing issues between object creation and retrieval in CI
Debug logs will show:
- When .versions metadata updates start and complete
- When HeadObject tries to read latest version metadata
- Race conditions if HeadObject runs before metadata update completes
- Missing metadata if .versions directory exists but metadata keys are missing
- File access issues if version files exist but can't be accessed
- Different Owner: Always fails with 409 BucketAlreadyExists
  - Checks the s3-identity-id header against the stored bucket owner
  - Returns error immediately if owners don't match
- Same Owner + Conflicting Settings: Fails with 409 BucketAlreadyExists
  - Compares requested Object Lock settings with existing bucket configuration
  - Returns error if settings are incompatible (e.g., trying to enable Object Lock on a bucket that doesn't have it)
- Same Owner + Compatible Settings: Returns 200 OK (idempotent)
  - Returns success response without recreating the bucket when it already exists with the same owner and compatible settings
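The three cases above can be sketched as one decision function. The names bucketState and decideCreateExisting are hypothetical. The compatibility rule is also an assumption: it covers only the one conflict named above (enabling Object Lock on a bucket that does not have it), and the real check may compare more settings.

```go
package main

import (
	"fmt"
	"net/http"
)

// bucketState is a hypothetical view of what is stored for an existing bucket.
type bucketState struct {
	owner      string // identity recorded at creation time
	objectLock bool   // whether Object Lock is enabled
}

// decideCreateExisting returns the HTTP status and S3 error code ("" on
// success) for a CreateBucket request that targets an existing bucket.
func decideCreateExisting(existing bucketState, requester string, wantObjectLock bool) (int, string) {
	// Different owner: always a conflict.
	if existing.owner != requester {
		return http.StatusConflict, "BucketAlreadyExists"
	}
	// Same owner, conflicting settings (assumed rule: only this case conflicts).
	if wantObjectLock && !existing.objectLock {
		return http.StatusConflict, "BucketAlreadyExists"
	}
	// Same owner, compatible settings: idempotent success, nothing recreated.
	return http.StatusOK, ""
}

func main() {
	existing := bucketState{owner: "alice", objectLock: false}
	fmt.Println(decideCreateExisting(existing, "bob", false))
	fmt.Println(decideCreateExisting(existing, "alice", true))
	fmt.Println(decideCreateExisting(existing, "alice", false))
}
```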
Skip long-polling if any requested topic does not exist.
Only long-poll when MinBytes > 0, data isn’t available yet, and all topics exist.
Cap the long-polling wait to 1s in tests to prevent hanging on shutdown.
Busy fetch loop: Implemented basic long-polling in Fetch. If no data is available and both min_bytes > 0 and max_wait_ms > 0, Fetch waits up to max_wait_ms and populates throttle_time_ms accordingly. This stops kafka-go from re-polling empty partitions in a tight loop.
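A sketch of the long-polling conditions listed above. The function and parameter names (shouldLongPoll, longPollWait, testMode) are hypothetical and chosen for this example; only the conditions themselves come from the notes above.

```go
package main

import (
	"fmt"
	"time"
)

// shouldLongPoll reports whether Fetch should wait for data instead of
// returning an empty response immediately.
func shouldLongPoll(minBytes, maxWaitMs int32, dataAvailable, allTopicsExist bool) bool {
	return minBytes > 0 && maxWaitMs > 0 && !dataAvailable && allTopicsExist
}

// longPollWait converts max_wait_ms to a duration and, in tests, caps it at
// one second so shutdown is not blocked by a pending Fetch.
func longPollWait(maxWaitMs int32, testMode bool) time.Duration {
	wait := time.Duration(maxWaitMs) * time.Millisecond
	if testMode && wait > time.Second {
		wait = time.Second
	}
	return wait
}

func main() {
	fmt.Println(shouldLongPoll(1, 500, false, true))  // empty partition, topic exists
	fmt.Println(shouldLongPoll(1, 500, false, false)) // a requested topic is missing
	fmt.Println(longPollWait(5000, true))
}
```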
- Added centralized errors.go with complete Kafka error code definitions
- Implemented timeout detection and network error classification
- Enhanced connection handling with configurable timeouts and better error reporting
- Added comprehensive error handling test suite with 21 test cases
- Unified error code usage across all protocol handlers
- Improved request/response timeout handling with graceful fallbacks
- All protocol and E2E tests passing with robust error handling
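One way timeout detection and network error classification could look, as a sketch. The numeric values are the standard Kafka protocol error codes. The classifyError function, the constant names, and the choice of which Go errors map to which code are assumptions, not the contents of errors.go. fakeTimeout exists only to exercise the timeout path.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
)

// Kafka protocol error codes (a small subset).
const (
	errUnknownServerError int16 = -1
	errNone               int16 = 0
	errRequestTimedOut    int16 = 7
	errNetworkException   int16 = 13
)

// classifyError maps a Go error to a Kafka error code (assumed mapping).
func classifyError(err error) int16 {
	if err == nil {
		return errNone
	}
	// Timeouts first: a *net.OpError that wraps a deadline also reports Timeout().
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return errRequestTimedOut
	}
	// Closed or broken connections.
	var oe *net.OpError
	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) || errors.As(err, &oe) {
		return errNetworkException
	}
	return errUnknownServerError
}

// fakeTimeout is a stand-in error that satisfies net.Error.
type fakeTimeout struct{}

func (fakeTimeout) Error() string   { return "i/o timeout" }
func (fakeTimeout) Timeout() bool   { return true }
func (fakeTimeout) Temporary() bool { return true }

func main() {
	fmt.Println(classifyError(nil))
	fmt.Println(classifyError(fmt.Errorf("read request: %w", fakeTimeout{})))
	fmt.Println(classifyError(os.ErrDeadlineExceeded))
	fmt.Println(classifyError(io.EOF))
	fmt.Println(classifyError(errors.New("boom")))
}
```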
- Added flexible_versions.go with utilities for Kafka flexible versions (v3+)
- Implemented ParseRequestHeader for compact string parsing and tagged fields
- Added fallback mechanism in handler.go for backward compatibility
- Updated handleApiVersions to support flexible version responses
- Added comprehensive tests for flexible version utilities
- All protocol tests passing with robust error handling
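For context on compact strings and tagged fields: in Kafka flexible versions a compact string is an unsigned varint holding length+1 (0 means null) followed by the bytes, and a tagged-field section is a varint count followed by (tag, size, data) entries. The sketch below decodes both with the standard library. The names readCompactString and skipTaggedFields are hypothetical and are not the functions in flexible_versions.go.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var (
	errBadVarint = errors.New("malformed unsigned varint")
	errTruncated = errors.New("buffer too short")
)

// readCompactString decodes one compact string and returns the value,
// whether it was null, and the number of bytes consumed.
func readCompactString(buf []byte) (s string, isNull bool, n int, err error) {
	l, k := binary.Uvarint(buf)
	if k <= 0 {
		return "", false, 0, errBadVarint
	}
	if l == 0 { // encoded length 0 means null
		return "", true, k, nil
	}
	if l-1 > uint64(len(buf)-k) {
		return "", false, 0, errTruncated
	}
	size := int(l - 1)
	return string(buf[k : k+size]), false, k + size, nil
}

// skipTaggedFields walks a tagged-field section without interpreting it and
// returns the number of bytes consumed.
func skipTaggedFields(buf []byte) (int, error) {
	count, off := binary.Uvarint(buf)
	if off <= 0 {
		return 0, errBadVarint
	}
	for i := uint64(0); i < count; i++ {
		_, k := binary.Uvarint(buf[off:]) // tag id, ignored here
		if k <= 0 {
			return 0, errBadVarint
		}
		off += k
		size, k := binary.Uvarint(buf[off:])
		if k <= 0 {
			return 0, errBadVarint
		}
		off += k
		if size > uint64(len(buf)-off) {
			return 0, errTruncated
		}
		off += int(size)
	}
	return off, nil
}

func main() {
	s, isNull, n, err := readCompactString([]byte{0x04, 'f', 'o', 'o'})
	fmt.Println(s, isNull, n, err)
	fmt.Println(skipTaggedFields([]byte{0x00}))
}
```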