* Filer: Add retry mechanism for failed file deletions
Implement a retry queue with exponential backoff for handling transient
deletion failures, particularly when volumes are temporarily read-only.
Key features:
- Automatic retry for retryable errors (read-only volumes, network issues)
- Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
- Maximum 10 retry attempts per file before giving up
- Separate goroutine processing retry queue every minute
- Enhanced logging with retry/permanent error classification
This addresses the issue where file deletions fail while volumes are
temporarily read-only (tiered volumes, maintenance, etc.); previously,
these failed deletions were simply lost.
Co-Authored-By: Claude <noreply@anthropic.com>
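A minimal sketch of the schedule described above, assuming illustrative
names (the actual identifiers in weed/filer/filer_deletion.go may differ):

```go
package main

import (
	"fmt"
	"time"
)

const (
	InitialRetryDelay = 5 * time.Minute // first retry after 5 minutes
	MaxRetryDelay     = 6 * time.Hour   // backoff is capped at 6 hours
	MaxRetryAttempts  = 10              // give up after 10 attempts
)

// DeletionRetryItem tracks one failed file deletion awaiting retry.
type DeletionRetryItem struct {
	FileId      string
	RetryCount  int
	NextRetryAt time.Time
}

// backoffDelay doubles the delay on each attempt: 5m, 10m, 20m, ...,
// capped at MaxRetryDelay.
func backoffDelay(retryCount int) time.Duration {
	delay := InitialRetryDelay << uint(retryCount-1) // 5m * 2^(n-1)
	if delay > MaxRetryDelay {
		delay = MaxRetryDelay
	}
	return delay
}

func main() {
	for n := 1; n <= MaxRetryAttempts; n++ {
		fmt.Printf("attempt %2d: retry in %v\n", n, backoffDelay(n))
	}
}
```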
* Update weed/filer/filer_deletion.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Filer: Add retry mechanism for failed file deletions
Implement a retry queue with exponential backoff for handling transient
deletion failures, particularly when volumes are temporarily read-only.
Key features:
- Automatic retry for retryable errors (read-only volumes, network issues)
- Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
- Maximum 10 retry attempts per file before giving up
- Separate goroutine processing retry queue every minute
- Map-based retry queue for O(1) lookups and deletions
- Enhanced logging with retry/permanent error classification
- Consistent error detail limiting (max 10 total errors logged)
- Graceful shutdown support with quit channel for both processors
This addresses the issue where file deletions fail while volumes are
temporarily read-only (tiered volumes, maintenance, etc.); previously,
these failed deletions were simply lost.
Co-Authored-By: Claude <noreply@anthropic.com>
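A self-contained sketch of the processor's shape: a ticker-driven loop
that exits when the quit channel is closed, giving the graceful shutdown
described above. The names (quit, process) and the short intervals are
illustrative, not the real code:

```go
package main

import (
	"fmt"
	"time"
)

type retryProcessor struct {
	quit chan struct{} // closed to request shutdown
}

// loop wakes up on every tick and processes the retry queue, returning
// promptly once quit is closed (graceful shutdown).
func (p *retryProcessor) loop(interval time.Duration, process func()) {
	ticker := time.NewTicker(interval) // one minute in the real processor
	defer ticker.Stop()
	for {
		select {
		case <-p.quit:
			return
		case <-ticker.C:
			process()
		}
	}
}

func main() {
	p := &retryProcessor{quit: make(chan struct{})}
	done := make(chan struct{})
	go func() {
		defer close(done)
		p.loop(100*time.Millisecond, func() { fmt.Println("processing retry batch") })
	}()
	time.Sleep(350 * time.Millisecond)
	close(p.quit) // signal shutdown
	<-done
	fmt.Println("retry processor stopped")
}
```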
* Filer: Replace magic numbers with named constants in retry processor
Replace hardcoded values with package-level constants for better
maintainability:
- DeletionRetryPollInterval (1 minute): interval for checking retry queue
- DeletionRetryBatchSize (1000): max items to process per iteration
This improves code readability and makes configuration changes easier.
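As package-level constants, roughly (values taken from this commit; the
comments are illustrative):

```go
package filer

import "time"

const (
	// DeletionRetryPollInterval is how often the retry processor
	// checks the queue for items whose backoff has elapsed.
	DeletionRetryPollInterval = time.Minute
	// DeletionRetryBatchSize caps how many queued items are
	// processed per iteration.
	DeletionRetryBatchSize = 1000
)
```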
* Filer: Optimize retry queue with min-heap data structure
Replace map-based retry queue with a min-heap for better scalability
and deterministic ordering.
Performance improvements:
- GetReadyItems: O(N) → O(K log N), where K is the number of items retrieved
- AddOrUpdate: O(1) → O(log N) (acceptable trade-off)
- Early exit when checking ready items (heap top is earliest)
- No full iteration over all items while holding lock
Benefits:
- Deterministic processing order (earliest NextRetryAt first)
- Better scalability for large retry queues (thousands of items)
- Reduced lock contention duration
- Memory efficient (no separate slice reconstruction)
Implementation:
- Min-heap ordered by NextRetryAt using container/heap
- Dual index: heap for ordering + map for O(1) FileId lookups
- heap.Fix() used when updating existing items
- Comprehensive complexity documentation in comments
This addresses the performance bottleneck identified in GetReadyItems
where iterating over the entire map with a write lock could block
other goroutines in high-failure scenarios.
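A sketch of the dual-index design (min-heap for ordering, map for FileId
lookup) using container/heap. Locking is omitted for brevity, and the
names are illustrative:

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

type DeletionRetryItem struct {
	FileId      string
	RetryCount  int
	NextRetryAt time.Time
	index       int // position in the heap, maintained by Swap
}

// retryHeap is a min-heap ordered by NextRetryAt, so the earliest item
// is always at the top.
type retryHeap []*DeletionRetryItem

func (h retryHeap) Len() int           { return len(h) }
func (h retryHeap) Less(i, j int) bool { return h[i].NextRetryAt.Before(h[j].NextRetryAt) }
func (h retryHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *retryHeap) Push(x any) {
	item := x.(*DeletionRetryItem)
	item.index = len(*h)
	*h = append(*h, item)
}
func (h *retryHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	old[n-1] = nil // allow GC
	*h = old[:n-1]
	return item
}

// DeletionRetryQueue pairs the heap (ordering) with a map (O(1) FileId
// lookup). The real code guards both with a mutex.
type DeletionRetryQueue struct {
	heap retryHeap
	byId map[string]*DeletionRetryItem
}

// AddOrUpdate is O(log N): heap.Fix re-sifts an updated item in place.
func (q *DeletionRetryQueue) AddOrUpdate(fileId string, nextRetryAt time.Time) {
	if item, ok := q.byId[fileId]; ok {
		item.RetryCount++
		item.NextRetryAt = nextRetryAt
		heap.Fix(&q.heap, item.index)
		return
	}
	item := &DeletionRetryItem{FileId: fileId, RetryCount: 1, NextRetryAt: nextRetryAt}
	heap.Push(&q.heap, item)
	q.byId[fileId] = item
}

// GetReadyItems is O(K log N): it pops due items off the top and exits
// early, since the heap top always holds the earliest NextRetryAt.
func (q *DeletionRetryQueue) GetReadyItems(max int) (ready []*DeletionRetryItem) {
	now := time.Now()
	for len(q.heap) > 0 && len(ready) < max && !q.heap[0].NextRetryAt.After(now) {
		item := heap.Pop(&q.heap).(*DeletionRetryItem)
		delete(q.byId, item.FileId)
		ready = append(ready, item)
	}
	return ready
}

func main() {
	q := &DeletionRetryQueue{byId: make(map[string]*DeletionRetryItem)}
	q.AddOrUpdate("3,0144cbf2", time.Now().Add(-time.Second)) // already due
	q.AddOrUpdate("5,01b7a2c1", time.Now().Add(time.Hour))    // due later
	for _, item := range q.GetReadyItems(10) {
		fmt.Println("ready:", item.FileId)
	}
}
```

The dual index is what keeps both operations cheap: the heap alone would
make FileId lookups O(N), while the map alone forced the full-scan
GetReadyItems this commit replaces.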
* Filer: Modernize heap interface and improve error handling docs
1. Replace interface{} with any in heap methods
- Addresses modern Go style (Go 1.18+)
- Improves code readability
2. Enhance isRetryableError documentation
- Acknowledge string matching brittleness
- Add comprehensive TODO for future improvements:
* Use HTTP status codes (503, 429, etc.)
* Implement structured error types with errors.Is/As
* Extract gRPC status codes
* Add error wrapping for better context
- Document each error pattern with context
- Add defensive check for empty error strings
The current implementation stays pragmatic for the initial release while
documenting a clear path toward more robust error handling. String
matching is acceptable for now but should be replaced with structured
error checking when the deletion pipeline is refactored.
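A sketch of the string-matching approach and its defensive checks; the
pattern list here is a representative subset, not the full set in the code:

```go
package main

import (
	"fmt"
	"strings"
)

// Patterns treated as transient. As the commit notes, substring matching
// is brittle; the TODO is structured errors (errors.Is/As, status codes).
var retryablePatterns = []string{
	"is read only",        // read-only volume (tiering, maintenance)
	"timeout",             // covers i/o timeout and friends
	"deadline exceeded",   // context deadline
	"connection refused",  // volume server down or restarting
	"service unavailable", // 503-style backpressure
	"try again",           // generic transient hint
}

func isRetryableError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	if msg == "" { // defensive check for empty error strings
		return false
	}
	for _, pattern := range retryablePatterns {
		if strings.Contains(msg, pattern) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isRetryableError(fmt.Errorf("volume 42 is read only"))) // true
	fmt.Println(isRetryableError(fmt.Errorf("file not found")))        // false
}
```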
* Filer: Refactor deletion processors for better readability
Extract large callback functions into dedicated private methods to
improve code organization and maintainability.
Changes:
1. Extract processDeletionBatch method
- Handles deletion of a batch of file IDs
- Classifies errors (success, not found, retryable, permanent)
- Manages retry queue additions
- Consolidates logging logic
2. Extract processRetryBatch method
- Handles retry attempts for previously failed deletions
- Processes retry results and updates queue
- Symmetric to processDeletionBatch for consistency
Benefits:
- Main loop functions (loopProcessingDeletion, loopProcessingDeletionRetry)
are now concise and focused on orchestration
- Business logic is separated into testable methods
- Reduced nesting depth improves readability
- Easier to understand control flow at a glance
- Better separation of concerns
The refactored methods follow the single responsibility principle,
making the codebase more maintainable and easier to extend.
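Structurally, the result looks roughly like this (a sketch only; the
types, fields, and signatures are illustrative):

```go
package filer

// The loop stays thin orchestration; the extracted method owns the
// business logic and is testable on its own.
type Filer struct {
	deletionBatches chan []string
}

func (f *Filer) loopProcessingDeletion() {
	for fileIds := range f.deletionBatches { // orchestration only
		f.processDeletionBatch(fileIds)
	}
}

func (f *Filer) processDeletionBatch(fileIds []string) {
	// 1. issue the batched delete to the volume servers
	// 2. classify results: success, not found, retryable, permanent
	// 3. push retryable failures onto the retry queue
	// 4. log a bounded number of error details
	_ = fileIds
}
```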
* Update weed/filer/filer_deletion.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Filer: Fix critical retry count bug and add comprehensive error patterns
Critical bug fixes from PR review:
1. Fix RetryCount reset bug (CRITICAL)
- Problem: When items are re-queued via AddOrUpdate, RetryCount
resets to 1, breaking exponential backoff
- Solution: Add RequeueForRetry() method that preserves retry state
- Impact: Ensures proper exponential backoff progression
2. Add overflow protection in backoff calculation
- Check shift amount > 63 to prevent bit-shift overflow
- Additional safety: check if delay <= 0 or > MaxRetryDelay
- Protects against arithmetic overflow in extreme cases
3. Expand retryable error patterns
- Added: timeout, deadline exceeded, context canceled
- Added: lookup error/failed (volume discovery issues)
- Added: connection refused, broken pipe (network errors)
- Added: too many requests, service unavailable (backpressure)
- Added: temporarily unavailable, try again (transient errors)
- Added: i/o timeout (network timeouts)
Benefits:
- Retry backoff now progresses correctly across repeated re-queues
- More robust against edge cases and overflow
- Better coverage of transient failure scenarios
- Improved resilience in high-failure environments
Addresses feedback from CodeRabbit and Gemini Code Assist in PR #7402.
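A sketch of fixes 1 and 2 together: RequeueForRetry preserving
RetryCount, and the overflow guards on the shift. Names mirror the
earlier sketches and are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

const (
	InitialRetryDelay = 5 * time.Minute
	MaxRetryDelay     = 6 * time.Hour
)

// DeletionRetryItem mirrors the queue item from the earlier sketches.
type DeletionRetryItem struct {
	FileId      string
	RetryCount  int
	NextRetryAt time.Time
}

// nextDelay adds the two guards described above: shifting by 64+ bits
// always overflows int64, and a large shift can also wrap the duration
// to zero or negative.
func nextDelay(retryCount int) time.Duration {
	shift := uint(retryCount - 1)
	if shift > 63 { // guard the bit shift itself
		return MaxRetryDelay
	}
	delay := InitialRetryDelay << shift
	if delay <= 0 || delay > MaxRetryDelay { // guard wrapped results
		delay = MaxRetryDelay
	}
	return delay
}

// RequeueForRetry preserves the retry state instead of resetting
// RetryCount to 1 the way AddOrUpdate did, so backoff keeps progressing.
func (item *DeletionRetryItem) RequeueForRetry() {
	item.RetryCount++
	item.NextRetryAt = time.Now().Add(nextDelay(item.RetryCount))
}

func main() {
	item := &DeletionRetryItem{FileId: "3,0144cbf2", RetryCount: 4}
	item.RequeueForRetry() // fifth attempt: 5m * 2^4 = 80m backoff
	fmt.Printf("attempt %d, next retry in ~%v\n",
		item.RetryCount, time.Until(item.NextRetryAt).Round(time.Minute))
}
```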
* Filer: Add persistence docs and comprehensive unit tests
Documentation improvements:
1. Document in-memory queue limitation
- Acknowledge that retry queue is volatile (lost on restart)
- Document trade-offs and future persistence options
- Provide clear path for production hardening
- Note eventual consistency through main deletion queue
Unit test coverage:
1. TestDeletionRetryQueue_AddAndRetrieve
- Basic add/retrieve operations
- Verify items not ready before delay elapsed
2. TestDeletionRetryQueue_ExponentialBackoff
- Verify exponential backoff progression (5m→10m→20m→40m→80m)
- Validate delay calculations with timing tolerance
3. TestDeletionRetryQueue_OverflowProtection
- Test high retry counts (60+) that could cause overflow
- Verify capping at MaxRetryDelay
4. TestDeletionRetryQueue_MaxAttemptsReached
- Verify items discarded after MaxRetryAttempts
- Confirm proper queue cleanup
5. TestIsRetryableError
- Comprehensive error pattern coverage
- Test all retryable error types (timeout, connection, lookup, etc.)
- Verify non-retryable errors correctly identified
6. TestDeletionRetryQueue_HeapOrdering
- Verify min-heap property maintained
- Test items processed in NextRetryAt order
- Validate heap.Init() integration
All tests passing. Addresses PR feedback on testing requirements.
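For instance, the backoff and overflow tests could look roughly like this
(a sketch; the guarded nextDelay helper from the previous commit is
redeclared here so the snippet stands alone):

```go
package filer

import (
	"testing"
	"time"
)

const (
	initialRetryDelay = 5 * time.Minute
	maxRetryDelay     = 6 * time.Hour
)

// nextDelay mirrors the guarded backoff from the previous commit.
func nextDelay(retryCount int) time.Duration {
	shift := uint(retryCount - 1)
	if shift > 63 {
		return maxRetryDelay
	}
	delay := initialRetryDelay << shift
	if delay <= 0 || delay > maxRetryDelay {
		delay = maxRetryDelay
	}
	return delay
}

func TestExponentialBackoffProgression(t *testing.T) {
	for i, minutes := range []time.Duration{5, 10, 20, 40, 80} {
		want := minutes * time.Minute
		if got := nextDelay(i + 1); got != want {
			t.Errorf("attempt %d: got %v, want %v", i+1, got, want)
		}
	}
}

func TestOverflowProtectionCapsDelay(t *testing.T) {
	// a 60+ retry count would overflow the shift without the guards
	if got := nextDelay(61); got != maxRetryDelay {
		t.Errorf("got %v, want cap %v", got, maxRetryDelay)
	}
}
```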
* Filer: Add code quality improvements for deletion retry
Address PR feedback with minor optimizations:
- Add MaxLoggedErrorDetails constant (replaces magic number 10)
- Pre-allocate slices and maps in processRetryBatch for efficiency
- Improve log message formatting to use constant
These changes improve code maintainability and runtime performance
without altering functionality.
Co-Authored-By: Claude <noreply@anthropic.com>
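The pre-allocation pattern, sketched (the helper and its names are
hypothetical, shown only to illustrate sizing slices up front):

```go
package filer

const MaxLoggedErrorDetails = 10 // replaces the magic number 10

// partitionResults shows the pattern applied in processRetryBatch:
// allocate with a known capacity so appends never reallocate mid-batch.
func partitionResults(fileIds []string, isRetryable func(string) bool) (retry, done []string) {
	retry = make([]string, 0, len(fileIds)) // pre-allocated capacity
	done = make([]string, 0, len(fileIds))
	for _, id := range fileIds {
		if isRetryable(id) {
			retry = append(retry, id)
		} else {
			done = append(done, id)
		}
	}
	return retry, done
}
```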
* refactor retrying logic
* use constant
* assert
* address comment
* refactor
* address comments
* dedup
* process retried deletions
* address comment
* check in-flight items also; dedup code
* refactoring
* refactoring
* simplify
* reset heap
* more efficient
* add DeletionBatchSize as a constant; classification precedence: Permanent > Retryable > Success > Not Found
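One plausible reading of that precedence, sketched as an ordered enum
(the type and constant names are hypothetical):

```go
package filer

// deletionOutcome orders outcomes by severity so an aggregate keeps the
// worst one: Permanent > Retryable > Success > Not Found.
type deletionOutcome int

const (
	outcomeNotFound deletionOutcome = iota // already gone: lowest severity
	outcomeSuccess
	outcomeRetryable
	outcomePermanent // highest severity
)

// worse returns the more severe of two outcomes, e.g. when one file id
// maps to several chunk-level results.
func worse(a, b deletionOutcome) deletionOutcome {
	if a > b {
		return a
	}
	return b
}
```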
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: chrislu <chris.lu@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>