* Filer: Add retry mechanism for failed file deletions
Implement a retry queue with exponential backoff for handling transient
deletion failures, particularly when volumes are temporarily read-only.
Key features:
- Automatic retry for retryable errors (read-only volumes, network issues)
- Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
- Maximum 10 retry attempts per file before giving up
- Separate goroutine processing the retry queue every minute
- Enhanced logging with retry/permanent error classification
This addresses the issue where file deletions fail while volumes are
temporarily read-only (tiered volumes, maintenance, etc.); previously
those deletions were silently lost.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
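For reference, a minimal sketch of the backoff schedule this commit describes, in Go. The values (5-minute base, 6-hour cap, 10 attempts) come from the commit; `MaxRetryDelay` and `MaxRetryAttempts` are named later in this history, while `InitialRetryDelay` and `nextRetryDelay` are assumed names, not necessarily the actual identifiers:

```go
package filer

import "time"

// Schedule from the commit: 5min → 10min → 20min → ... capped at 6 hours,
// at most 10 attempts per file.
const (
	InitialRetryDelay = 5 * time.Minute // assumed name
	MaxRetryDelay     = 6 * time.Hour
	MaxRetryAttempts  = 10
)

// nextRetryDelay returns the delay before the given 1-based attempt:
// attempt 1 → 5m, attempt 2 → 10m, doubling until the cap.
func nextRetryDelay(attempt int) time.Duration {
	delay := InitialRetryDelay << (attempt - 1)
	if delay <= 0 || delay > MaxRetryDelay { // cap (and guard the shift)
		delay = MaxRetryDelay
	}
	return delay
}
```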
* Update weed/filer/filer_deletion.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Filer: Add retry mechanism for failed file deletions
Implement a retry queue with exponential backoff for handling transient
deletion failures, particularly when volumes are temporarily read-only.
Key features:
- Automatic retry for retryable errors (read-only volumes, network issues)
- Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
- Maximum 10 retry attempts per file before giving up
- Separate goroutine processing the retry queue every minute
- Map-based retry queue for O(1) lookups and deletions
- Enhanced logging with retry/permanent error classification
- Consistent error detail limiting (max 10 total errors logged)
- Graceful shutdown support with quit channel for both processors
This addresses the issue where file deletions fail while volumes are
temporarily read-only (tiered volumes, maintenance, etc.); previously
those deletions were silently lost.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
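The "graceful shutdown support with quit channel" bullet presumably corresponds to the standard select-on-ticker loop; a sketch under that assumption (the method name is taken from a later commit in this history, the receiver and body are illustrative):

```go
// loopProcessingDeletionRetry drains the retry queue once a minute and
// exits cleanly when the quit channel is closed.
func (f *Filer) loopProcessingDeletionRetry(quit chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-quit: // graceful shutdown
			return
		case <-ticker.C:
			// pop ready items from the retry queue and re-attempt deletion
		}
	}
}
```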
* Filer: Replace magic numbers with named constants in retry processor
Replace hardcoded values with package-level constants for better
maintainability:
- DeletionRetryPollInterval (1 minute): interval for checking retry queue
- DeletionRetryBatchSize (1000): max items to process per iteration
This improves code readability and makes configuration changes easier.
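Both names and values are given in the commit, so the declaration is presumably close to:

```go
const (
	// DeletionRetryPollInterval is how often the retry processor wakes up
	// to check the queue for items whose backoff has elapsed.
	DeletionRetryPollInterval = time.Minute
	// DeletionRetryBatchSize caps the items processed per iteration so a
	// large queue cannot monopolize the processor.
	DeletionRetryBatchSize = 1000
)
```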
* Filer: Optimize retry queue with min-heap data structure
Replace map-based retry queue with a min-heap for better scalability
and deterministic ordering.
Performance improvements:
- GetReadyItems: O(N) → O(K log N) where K is items retrieved
- AddOrUpdate: O(1) → O(log N) (acceptable trade-off)
- Early exit when checking ready items (heap top is earliest)
- No full iteration over all items while holding lock
Benefits:
- Deterministic processing order (earliest NextRetryAt first)
- Better scalability for large retry queues (thousands of items)
- Reduced lock contention duration
- Memory efficient (no separate slice reconstruction)
Implementation:
- Min-heap ordered by NextRetryAt using container/heap
- Dual index: heap for ordering + map for O(1) FileId lookups
- heap.Fix() used when updating existing items
- Comprehensive complexity documentation in comments
This addresses the performance bottleneck identified in GetReadyItems
where iterating over the entire map with a write lock could block
other goroutines in high-failure scenarios.
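A sketch of the dual-index structure this commit describes, using `container/heap`. `DeletionRetryQueue`, `FileId`, `RetryCount`, `NextRetryAt` and `GetReadyItems` are named in this history; the remaining identifiers and the exact locking are assumptions:

```go
package filer

import (
	"container/heap"
	"sync"
	"time"
)

type DeletionRetryItem struct {
	FileId      string
	RetryCount  int
	NextRetryAt time.Time
	index       int // position in the heap, maintained by Swap/Push/Pop
}

// retryHeap is a min-heap ordered by NextRetryAt (earliest on top).
type retryHeap []*DeletionRetryItem

func (h retryHeap) Len() int           { return len(h) }
func (h retryHeap) Less(i, j int) bool { return h[i].NextRetryAt.Before(h[j].NextRetryAt) }
func (h retryHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *retryHeap) Push(x any) {
	item := x.(*DeletionRetryItem)
	item.index = len(*h)
	*h = append(*h, item)
}
func (h *retryHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// DeletionRetryQueue pairs the heap (ordering) with a map (O(1) FileId lookup).
type DeletionRetryQueue struct {
	mu    sync.Mutex
	heap  retryHeap
	items map[string]*DeletionRetryItem
}

// GetReadyItems pops at most max items whose NextRetryAt has passed.
// The heap top is the earliest deadline, so the loop exits early without
// scanning the whole queue while holding the lock: O(K log N) for K items.
func (q *DeletionRetryQueue) GetReadyItems(max int) (ready []*DeletionRetryItem) {
	q.mu.Lock()
	defer q.mu.Unlock()
	now := time.Now()
	for len(q.heap) > 0 && len(ready) < max && !q.heap[0].NextRetryAt.After(now) {
		item := heap.Pop(&q.heap).(*DeletionRetryItem)
		delete(q.items, item.FileId)
		ready = append(ready, item)
	}
	return
}
```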
* Filer: Modernize heap interface and improve error handling docs
1. Replace interface{} with any in heap methods
- Addresses modern Go style (Go 1.18+)
- Improves code readability
2. Enhance isRetryableError documentation
- Acknowledge string matching brittleness
- Add comprehensive TODO for future improvements:
* Use HTTP status codes (503, 429, etc.)
* Implement structured error types with errors.Is/As
* Extract gRPC status codes
* Add error wrapping for better context
- Document each error pattern with context
- Add defensive check for empty error strings
Current implementation remains pragmatic for initial release while
documenting a clear path for future robustness improvements. String
matching is acceptable for now but should be replaced with structured
error checking when refactoring the deletion pipeline.
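A sketch of the string-matching classifier described above; the pattern list is drawn from this commit and the later "comprehensive error patterns" commit in this history, and may not match the tree exactly:

```go
package filer

import "strings"

// isRetryableError classifies an error message as transient (worth queueing
// for retry) or permanent. String matching is brittle, as the commit notes;
// the documented follow-up is structured errors via errors.Is/As plus
// HTTP/gRPC status codes.
func isRetryableError(errMsg string) bool {
	if errMsg == "" { // defensive check for empty error strings
		return false
	}
	retryablePatterns := []string{
		"read only",               // tiered volumes, maintenance windows
		"timeout",                 // network and i/o timeouts
		"deadline exceeded",       // context deadlines
		"connection refused",      // volume server restarting
		"broken pipe",             // dropped connections
		"too many requests",       // backpressure (429)
		"service unavailable",     // transient 503s
		"temporarily unavailable", // transient volume states
		"try again",               // generic transient hint
	}
	msg := strings.ToLower(errMsg)
	for _, pattern := range retryablePatterns {
		if strings.Contains(msg, pattern) {
			return true
		}
	}
	return false
}
```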
* Filer: Refactor deletion processors for better readability
Extract large callback functions into dedicated private methods to
improve code organization and maintainability.
Changes:
1. Extract processDeletionBatch method
- Handles deletion of a batch of file IDs
- Classifies errors (success, not found, retryable, permanent)
- Manages retry queue additions
- Consolidates logging logic
2. Extract processRetryBatch method
- Handles retry attempts for previously failed deletions
- Processes retry results and updates queue
- Symmetric to processDeletionBatch for consistency
Benefits:
- Main loop functions (loopProcessingDeletion, loopProcessingDeletionRetry)
are now concise and focused on orchestration
- Business logic is separated into testable methods
- Reduced nesting depth improves readability
- Easier to understand control flow at a glance
- Better separation of concerns
The refactored methods follow the single responsibility principle,
making the codebase more maintainable and easier to extend.
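The extracted method plausibly has the following shape. The four buckets come straight from the commit; the per-file result type, the `deletionRetryQueue` field, and the signatures are assumptions (`AddOrUpdate` is named elsewhere in this history, its signature is assumed; `glog` is the project's logging package):

```go
import (
	"strings"

	"github.com/seaweedfs/seaweedfs/weed/glog"
)

type deleteResult struct { // assumed shape
	fileId string
	err    error
}

// processDeletionBatch classifies each outcome as success, not found,
// retryable, or permanent, and queues only the retryables.
func (f *Filer) processDeletionBatch(results []deleteResult) {
	var notFound, retryable, permanent int
	for _, r := range results {
		switch {
		case r.err == nil:
			// success: nothing to do
		case strings.Contains(r.err.Error(), "not found"):
			notFound++ // already gone; treated as success
		case isRetryableError(r.err.Error()):
			retryable++
			f.deletionRetryQueue.AddOrUpdate(r.fileId)
		default:
			permanent++ // logged and dropped
		}
	}
	glog.V(1).Infof("deletion batch: %d ok, %d not found, %d queued for retry, %d permanent",
		len(results)-notFound-retryable-permanent, notFound, retryable, permanent)
}
```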
* Update weed/filer/filer_deletion.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Filer: Fix critical retry count bug and add comprehensive error patterns
Critical bug fixes from PR review:
1. Fix RetryCount reset bug (CRITICAL)
- Problem: When items are re-queued via AddOrUpdate, RetryCount
resets to 1, breaking exponential backoff
- Solution: Add RequeueForRetry() method that preserves retry state
- Impact: Ensures proper exponential backoff progression
2. Add overflow protection in backoff calculation
- Check shift amount > 63 to prevent bit-shift overflow
- Additional safety: check if delay <= 0 or > MaxRetryDelay
- Protects against arithmetic overflow in extreme cases
3. Expand retryable error patterns
- Added: timeout, deadline exceeded, context canceled
- Added: lookup error/failed (volume discovery issues)
- Added: connection refused, broken pipe (network errors)
- Added: too many requests, service unavailable (backpressure)
- Added: temporarily unavailable, try again (transient errors)
- Added: i/o timeout (network timeouts)
Benefits:
- Retry mechanism now works correctly across restarts
- More robust against edge cases and overflow
- Better coverage of transient failure scenarios
- Improved resilience in high-failure environments
Addresses feedback from CodeRabbit and Gemini Code Assist in PR #7402.
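The two fixes, sketched against the queue structure shown earlier; `RequeueForRetry`, the preserved `RetryCount`, the shift guard, and the delay checks all come from the commit text, while the exact placement of the max-attempts check is assumed:

```go
// RequeueForRetry puts a just-attempted item back on the queue while
// preserving RetryCount, so backoff keeps progressing instead of
// resetting to 1 (the bug AddOrUpdate had for re-queued items).
func (q *DeletionRetryQueue) RequeueForRetry(item *DeletionRetryItem) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if item.RetryCount >= MaxRetryAttempts {
		return // discard: exceeded MaxRetryAttempts (caller logs the permanent failure)
	}
	item.RetryCount++
	item.NextRetryAt = time.Now().Add(nextRetryDelay(item.RetryCount))
	q.items[item.FileId] = item
	heap.Push(&q.heap, item)
}

// nextRetryDelay, revised from the earlier sketch with the overflow
// protection this commit adds.
func nextRetryDelay(attempt int) time.Duration {
	shift := attempt - 1
	if shift > 63 { // bit-shift overflow protection
		return MaxRetryDelay
	}
	delay := InitialRetryDelay << shift
	if delay <= 0 || delay > MaxRetryDelay { // arithmetic overflow or cap
		delay = MaxRetryDelay
	}
	return delay
}
```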
* Filer: Add persistence docs and comprehensive unit tests
Documentation improvements:
1. Document in-memory queue limitation
- Acknowledge that retry queue is volatile (lost on restart)
- Document trade-offs and future persistence options
- Provide clear path for production hardening
- Note eventual consistency through main deletion queue
Unit test coverage:
1. TestDeletionRetryQueue_AddAndRetrieve
- Basic add/retrieve operations
- Verify items not ready before delay elapsed
2. TestDeletionRetryQueue_ExponentialBackoff
- Verify exponential backoff progression (5m→10m→20m→40m→80m)
- Validate delay calculations with timing tolerance
3. TestDeletionRetryQueue_OverflowProtection
- Test high retry counts (60+) that could cause overflow
- Verify capping at MaxRetryDelay
4. TestDeletionRetryQueue_MaxAttemptsReached
- Verify items discarded after MaxRetryAttempts
- Confirm proper queue cleanup
5. TestIsRetryableError
- Comprehensive error pattern coverage
- Test all retryable error types (timeout, connection, lookup, etc.)
- Verify non-retryable errors correctly identified
6. TestDeletionRetryQueue_HeapOrdering
- Verify min-heap property maintained
- Test items processed in NextRetryAt order
- Validate heap.Init() integration
All tests passing. Addresses PR feedback on testing requirements.
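As an illustration, the backoff test named above might assert the documented progression like this (written against the `nextRetryDelay` sketch; the actual test exercises the queue itself with a timing tolerance):

```go
import (
	"testing"
	"time"
)

func TestDeletionRetryQueue_ExponentialBackoff(t *testing.T) {
	want := []time.Duration{5 * time.Minute, 10 * time.Minute,
		20 * time.Minute, 40 * time.Minute, 80 * time.Minute}
	for i, expected := range want {
		if got := nextRetryDelay(i + 1); got != expected {
			t.Errorf("attempt %d: got %v, want %v", i+1, got, expected)
		}
	}
}
```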
* Filer: Add code quality improvements for deletion retry
Address PR feedback with minor optimizations:
- Add MaxLoggedErrorDetails constant (replaces magic number 10)
- Pre-allocate slices and maps in processRetryBatch for efficiency
- Improve log message formatting to use constant
These changes improve code maintainability and runtime performance
without altering functionality.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
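The pre-allocation is presumably the usual capacity-hint pattern; a sketch with illustrative names (the helper does not exist under this name in the tree):

```go
// Both containers are sized up front, so neither re-grows while the
// batch is assembled.
func buildRetryBatch(items []*DeletionRetryItem) ([]string, map[string]*DeletionRetryItem) {
	fileIds := make([]string, 0, len(items))
	byFileId := make(map[string]*DeletionRetryItem, len(items))
	for _, item := range items {
		fileIds = append(fileIds, item.FileId)
		byFileId[item.FileId] = item
	}
	return fileIds, byFileId
}
```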
* refactor retrying logic
* use constant
* assert
* address comment
* refactor
* address comments
* dedup
* process retried deletions
* address comment
* also check in-flight items; dedup code
* refactoring
* refactoring
* simplify
* reset heap
* more efficient
* add DeletionBatchSize as a constant; classification precedence: Permanent > Retryable > Success > Not Found
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: chrislu <chris.lu@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
* batch deletion operations to return individual error results
Modify batch deletion operations to return individual per-file error results instead of one aggregated error, enabling better tracking of which specific files failed to delete and helping reduce orphaned files.
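A sketch of the per-file result shape this implies; the type and function names are assumptions, not the actual API:

```go
type DeleteResult struct {
	FileId string
	Err    error
}

// deleteFileIds returns one result per file id instead of a single
// aggregated error, so callers can see exactly which deletions failed.
func deleteFileIds(fileIds []string, deleteOne func(fileId string) error) []DeleteResult {
	results := make([]DeleteResult, 0, len(fileIds))
	for _, fid := range fileIds {
		results = append(results, DeleteResult{FileId: fid, Err: deleteOne(fid)})
	}
	return results
}
```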
* Simplified logging logic
* Optimized nested loop
* handles the edge case where the RPC succeeds but connection cleanup fails
* simplify
* simplify
* ignore 'not found' errors here
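Ignoring "not found" here follows the usual idempotent-delete convention: the file is already gone, which is the outcome the caller wanted. A sketch reusing the per-file result type from above; the string check is illustrative, the real code may use a structured error:

```go
// ignoreNotFound drops "not found" results before error counting, so
// already-deleted files are not reported as failures.
func ignoreNotFound(results []DeleteResult) []DeleteResult {
	kept := results[:0]
	for _, r := range results {
		if r.Err != nil && strings.Contains(r.Err.Error(), "not found") {
			continue // already deleted: not a failure
		}
		kept = append(kept, r)
	}
	return kept
}
```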