Filer: Add retry mechanism for failed file deletions (#7402)

* Filer: Add retry mechanism for failed file deletions

  Implement a retry queue with exponential backoff for handling transient deletion failures, particularly when volumes are temporarily read-only.

  Key features:
  - Automatic retry for retryable errors (read-only volumes, network issues)
  - Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
  - Maximum 10 retry attempts per file before giving up
  - Separate goroutine processing retry queue every minute
  - Enhanced logging with retry/permanent error classification

  This addresses the issue where file deletions fail when volumes are temporarily read-only (tiered volumes, maintenance, etc.) and these deletions were previously lost.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>

* Update weed/filer/filer_deletion.go

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Filer: Add retry mechanism for failed file deletions

  Implement a retry queue with exponential backoff for handling transient deletion failures, particularly when volumes are temporarily read-only.

  Key features:
  - Automatic retry for retryable errors (read-only volumes, network issues)
  - Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
  - Maximum 10 retry attempts per file before giving up
  - Separate goroutine processing retry queue every minute
  - Map-based retry queue for O(1) lookups and deletions
  - Enhanced logging with retry/permanent error classification
  - Consistent error detail limiting (max 10 total errors logged)
  - Graceful shutdown support with quit channel for both processors

  This addresses the issue where file deletions fail when volumes are temporarily read-only (tiered volumes, maintenance, etc.) and these deletions were previously lost.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>

* Filer: Replace magic numbers with named constants in retry processor

  Replace hardcoded values with package-level constants for better maintainability:
  - DeletionRetryPollInterval (1 minute): interval for checking retry queue
  - DeletionRetryBatchSize (1000): max items to process per iteration

  This improves code readability and makes configuration changes easier.

* Filer: Optimize retry queue with min-heap data structure

  Replace map-based retry queue with a min-heap for better scalability and deterministic ordering.

  Performance improvements:
  - GetReadyItems: O(N) → O(K log N) where K is items retrieved
  - AddOrUpdate: O(1) → O(log N) (acceptable trade-off)
  - Early exit when checking ready items (heap top is earliest)
  - No full iteration over all items while holding lock

  Benefits:
  - Deterministic processing order (earliest NextRetryAt first)
  - Better scalability for large retry queues (thousands of items)
  - Reduced lock contention duration
  - Memory efficient (no separate slice reconstruction)

  Implementation:
  - Min-heap ordered by NextRetryAt using container/heap
  - Dual index: heap for ordering + map for O(1) FileId lookups
  - heap.Fix() used when updating existing items
  - Comprehensive complexity documentation in comments

  This addresses the performance bottleneck identified in GetReadyItems where iterating over the entire map with a write lock could block other goroutines in high-failure scenarios.

* Filer: Modernize heap interface and improve error handling docs

  1. Replace interface{} with any in heap methods
     - Addresses modern Go style (Go 1.18+)
     - Improves code readability

  2. Enhance isRetryableError documentation
     - Acknowledge string matching brittleness
     - Add comprehensive TODO for future improvements:
       * Use HTTP status codes (503, 429, etc.)
       * Implement structured error types with errors.Is/As
       * Extract gRPC status codes
       * Add error wrapping for better context
     - Document each error pattern with context
     - Add defensive check for empty error strings

  Current implementation remains pragmatic for initial release while documenting a clear path for future robustness improvements. String matching is acceptable for now but should be replaced with structured error checking when refactoring the deletion pipeline.

* Filer: Refactor deletion processors for better readability

  Extract large callback functions into dedicated private methods to improve code organization and maintainability.

  Changes:
  1. Extract processDeletionBatch method
     - Handles deletion of a batch of file IDs
     - Classifies errors (success, not found, retryable, permanent)
     - Manages retry queue additions
     - Consolidates logging logic
  2. Extract processRetryBatch method
     - Handles retry attempts for previously failed deletions
     - Processes retry results and updates queue
     - Symmetric to processDeletionBatch for consistency

  Benefits:
  - Main loop functions (loopProcessingDeletion, loopProcessingDeletionRetry) are now concise and focused on orchestration
  - Business logic is separated into testable methods
  - Reduced nesting depth improves readability
  - Easier to understand control flow at a glance
  - Better separation of concerns

  The refactored methods follow the single responsibility principle, making the codebase more maintainable and easier to extend.

* Update weed/filer/filer_deletion.go

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Filer: Fix critical retry count bug and add comprehensive error patterns

  Critical bug fixes from PR review:

  1. Fix RetryCount reset bug (CRITICAL)
     - Problem: When items are re-queued via AddOrUpdate, RetryCount resets to 1, breaking exponential backoff
     - Solution: Add RequeueForRetry() method that preserves retry state
     - Impact: Ensures proper exponential backoff progression

  2. Add overflow protection in backoff calculation
     - Check shift amount > 63 to prevent bit-shift overflow
     - Additional safety: check if delay <= 0 or > MaxRetryDelay
     - Protects against arithmetic overflow in extreme cases

  3. Expand retryable error patterns
     - Added: timeout, deadline exceeded, context canceled
     - Added: lookup error/failed (volume discovery issues)
     - Added: connection refused, broken pipe (network errors)
     - Added: too many requests, service unavailable (backpressure)
     - Added: temporarily unavailable, try again (transient errors)
     - Added: i/o timeout (network timeouts)

  Benefits:
  - Retry mechanism now works correctly across restarts
  - More robust against edge cases and overflow
  - Better coverage of transient failure scenarios
  - Improved resilience in high-failure environments

  Addresses feedback from CodeRabbit and Gemini Code Assist in PR #7402.

* Filer: Add persistence docs and comprehensive unit tests

  Documentation improvements:
  1. Document in-memory queue limitation
     - Acknowledge that retry queue is volatile (lost on restart)
     - Document trade-offs and future persistence options
     - Provide clear path for production hardening
     - Note eventual consistency through main deletion queue

  Unit test coverage:
  1. TestDeletionRetryQueue_AddAndRetrieve
     - Basic add/retrieve operations
     - Verify items not ready before delay elapsed
  2. TestDeletionRetryQueue_ExponentialBackoff
     - Verify exponential backoff progression (5m→10m→20m→40m→80m)
     - Validate delay calculations with timing tolerance
  3. TestDeletionRetryQueue_OverflowProtection
     - Test high retry counts (60+) that could cause overflow
     - Verify capping at MaxRetryDelay
  4. TestDeletionRetryQueue_MaxAttemptsReached
     - Verify items discarded after MaxRetryAttempts
     - Confirm proper queue cleanup
  5. TestIsRetryableError
     - Comprehensive error pattern coverage
     - Test all retryable error types (timeout, connection, lookup, etc.)
     - Verify non-retryable errors correctly identified
  6. TestDeletionRetryQueue_HeapOrdering
     - Verify min-heap property maintained
     - Test items processed in NextRetryAt order
     - Validate heap.Init() integration

  All tests passing. Addresses PR feedback on testing requirements.

* Filer: Add code quality improvements for deletion retry

  Address PR feedback with minor optimizations:
  - Add MaxLoggedErrorDetails constant (replaces magic number 10)
  - Pre-allocate slices and maps in processRetryBatch for efficiency
  - Improve log message formatting to use constant

  These changes improve code maintainability and runtime performance without altering functionality.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>

* refactoring retrying
* use constant
* assert
* address comment
* refactor
* address comments
* dedup
* process retried deletions
* address comment
* check in-flight items also; dedup code
* refactoring
* refactoring
* simplify
* reset heap
* more efficient
* add DeletionBatchSize as a constant; Permanent > Retryable > Success > Not Found

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: chrislu <chris.lu@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
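The test file in the diff below references calculateBackoff and the InitialRetryDelay, MaxRetryDelay, and MaxRetryAttempts constants without defining them; their definitions live elsewhere in this PR (presumably weed/filer/filer_deletion.go). A minimal sketch consistent with the backoff behavior the commit message and TestCalculateBackoff describe, with the values assumed from "5min → 10min → ... (max 6 hours), maximum 10 attempts", might look like:

package filer

import "time"

// Assumed values matching the commit message; the authoritative
// definitions are in the deletion code changed by this PR.
const (
	InitialRetryDelay = 5 * time.Minute // first retry after 5 minutes
	MaxRetryDelay     = 6 * time.Hour   // exponential backoff capped at 6 hours
	MaxRetryAttempts  = 10              // give up after 10 attempts
)

// calculateBackoff sketches the exponential backoff described above:
// delay = InitialRetryDelay * 2^(retryCount-1), capped at MaxRetryDelay,
// with a guard against bit-shift overflow for very high retry counts.
func calculateBackoff(retryCount int) time.Duration {
	if retryCount < 1 {
		retryCount = 1
	}
	shift := uint(retryCount - 1)
	if shift > 63 {
		return MaxRetryDelay // shifting further would overflow int64
	}
	delay := InitialRetryDelay * time.Duration(1<<shift)
	if delay <= 0 || delay > MaxRetryDelay {
		return MaxRetryDelay // overflowed or exceeded the cap
	}
	return delay
}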
				 3 changed files with 840 additions and 45 deletions
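The tests in the new file below also exercise the min-heap retry queue directly through its internal fields (heap, itemIndex, lock, heapIndex). A sketch of that structure, with names taken from the tests and the "dual index: heap for ordering + map for O(1) FileId lookups" description in the commit message (methods such as AddOrUpdate, GetReadyItems, RequeueForRetry, and Size are omitted and belong to the actual PR code), could look roughly like:

package filer

import (
	"sync"
	"time"
)

// DeletionRetryItem is the per-file retry state tracked by the queue.
// Field names follow the tests; treat this as a sketch of the shape,
// not the authoritative definition.
type DeletionRetryItem struct {
	FileId      string
	RetryCount  int
	NextRetryAt time.Time
	LastError   string
	heapIndex   int // position in the heap, maintained for heap.Fix
}

// retryHeap is a min-heap ordered by NextRetryAt, used with
// container/heap so the earliest-due item is always at index 0.
type retryHeap []*DeletionRetryItem

func (h retryHeap) Len() int           { return len(h) }
func (h retryHeap) Less(i, j int) bool { return h[i].NextRetryAt.Before(h[j].NextRetryAt) }
func (h retryHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].heapIndex, h[j].heapIndex = i, j
}
func (h *retryHeap) Push(x any) {
	item := x.(*DeletionRetryItem)
	item.heapIndex = len(*h)
	*h = append(*h, item)
}
func (h *retryHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// DeletionRetryQueue pairs the heap (ordering) with a FileId index
// (O(1) lookups), guarded by a single mutex.
type DeletionRetryQueue struct {
	lock      sync.Mutex
	heap      retryHeap
	itemIndex map[string]*DeletionRetryItem
}

func NewDeletionRetryQueue() *DeletionRetryQueue {
	return &DeletionRetryQueue{itemIndex: make(map[string]*DeletionRetryItem)}
}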
			
			
@@ -0,0 +1,308 @@
package filer

import (
	"container/heap"
	"testing"
	"time"
)

func TestDeletionRetryQueue_AddAndRetrieve(t *testing.T) {
	queue := NewDeletionRetryQueue()

	// Add items
	queue.AddOrUpdate("file1", "is read only")
	queue.AddOrUpdate("file2", "connection reset")

	if queue.Size() != 2 {
		t.Errorf("Expected queue size 2, got %d", queue.Size())
	}

	// Items not ready yet (initial delay is 5 minutes)
	readyItems := queue.GetReadyItems(10)
	if len(readyItems) != 0 {
		t.Errorf("Expected 0 ready items, got %d", len(readyItems))
	}

	// Size should remain unchanged
	if queue.Size() != 2 {
		t.Errorf("Expected queue size 2 after checking ready items, got %d", queue.Size())
	}
}

func TestDeletionRetryQueue_ExponentialBackoff(t *testing.T) {
	queue := NewDeletionRetryQueue()

	// Create an item
	item := &DeletionRetryItem{
		FileId:      "test-file",
		RetryCount:  0,
		NextRetryAt: time.Now(),
		LastError:   "test error",
	}

	// Requeue multiple times to test backoff
	delays := []time.Duration{}

	for i := 0; i < 5; i++ {
		beforeTime := time.Now()
		queue.RequeueForRetry(item, "error")

		// Calculate expected delay for this retry count
		expectedDelay := InitialRetryDelay * time.Duration(1<<uint(i))
		if expectedDelay > MaxRetryDelay {
			expectedDelay = MaxRetryDelay
		}

		// Verify NextRetryAt is approximately correct
		actualDelay := item.NextRetryAt.Sub(beforeTime)
		delays = append(delays, actualDelay)

		// Allow small timing variance
		timeDiff := actualDelay - expectedDelay
		if timeDiff < 0 {
			timeDiff = -timeDiff
		}
		if timeDiff > 100*time.Millisecond {
			t.Errorf("Retry %d: expected delay ~%v, got %v (diff: %v)", i+1, expectedDelay, actualDelay, timeDiff)
		}

		// Verify retry count incremented
		if item.RetryCount != i+1 {
			t.Errorf("Expected RetryCount %d, got %d", i+1, item.RetryCount)
		}

		// Reset the heap for the next isolated test iteration
		queue.lock.Lock()
		queue.heap = retryHeap{}
		queue.lock.Unlock()
	}

	t.Logf("Exponential backoff delays: %v", delays)
}

func TestDeletionRetryQueue_OverflowProtection(t *testing.T) {
	queue := NewDeletionRetryQueue()

	// Create an item with very high retry count
	item := &DeletionRetryItem{
		FileId:      "test-file",
		RetryCount:  60, // High count that would cause overflow without protection
		NextRetryAt: time.Now(),
		LastError:   "test error",
	}

	// Should not panic and should cap at MaxRetryDelay
	queue.RequeueForRetry(item, "error")

	delay := time.Until(item.NextRetryAt)
	if delay > MaxRetryDelay+time.Second {
		t.Errorf("Delay exceeded MaxRetryDelay: %v > %v", delay, MaxRetryDelay)
	}
}

func TestDeletionRetryQueue_MaxAttemptsReached(t *testing.T) {
	queue := NewDeletionRetryQueue()

	// Add item
	queue.AddOrUpdate("file1", "error")

	// Manually set retry count to max
	queue.lock.Lock()
	item, exists := queue.itemIndex["file1"]
	if !exists {
		queue.lock.Unlock()
		t.Fatal("Item not found in queue")
	}
	item.RetryCount = MaxRetryAttempts
	item.NextRetryAt = time.Now().Add(-1 * time.Second) // Ready now
	heap.Fix(&queue.heap, item.heapIndex)
	queue.lock.Unlock()

	// Try to get ready items - should be returned for the last retry (attempt #10)
	readyItems := queue.GetReadyItems(10)
	if len(readyItems) != 1 {
		t.Fatalf("Expected 1 item for last retry, got %d", len(readyItems))
	}

	// Requeue it, which will increment its retry count beyond the max
	queue.RequeueForRetry(readyItems[0], "final error")

	// Manually make it ready again
	queue.lock.Lock()
	item, exists = queue.itemIndex["file1"]
	if !exists {
		queue.lock.Unlock()
		t.Fatal("Item not found in queue after requeue")
	}
	item.NextRetryAt = time.Now().Add(-1 * time.Second)
	heap.Fix(&queue.heap, item.heapIndex)
	queue.lock.Unlock()

	// Now it should be discarded (retry count is 11, exceeds max of 10)
	readyItems = queue.GetReadyItems(10)
	if len(readyItems) != 0 {
		t.Errorf("Expected 0 items (max attempts exceeded), got %d", len(readyItems))
	}

	// Should be removed from queue
	if queue.Size() != 0 {
		t.Errorf("Expected queue size 0 after max attempts exceeded, got %d", queue.Size())
	}
}

func TestCalculateBackoff(t *testing.T) {
	testCases := []struct {
		retryCount    int
		expectedDelay time.Duration
		description   string
	}{
		{1, InitialRetryDelay, "first retry"},
		{2, InitialRetryDelay * 2, "second retry"},
		{3, InitialRetryDelay * 4, "third retry"},
		{4, InitialRetryDelay * 8, "fourth retry"},
		{5, InitialRetryDelay * 16, "fifth retry"},
		{10, MaxRetryDelay, "capped at max delay"},
		{65, MaxRetryDelay, "overflow protection (shift > 63)"},
		{100, MaxRetryDelay, "very high retry count"},
	}

	for _, tc := range testCases {
		result := calculateBackoff(tc.retryCount)
		if result != tc.expectedDelay {
			t.Errorf("%s (retry %d): expected %v, got %v",
				tc.description, tc.retryCount, tc.expectedDelay, result)
		}
	}
}

func TestIsRetryableError(t *testing.T) {
	testCases := []struct {
		error       string
		retryable   bool
		description string
	}{
		{"volume 123 is read only", true, "read-only volume"},
		{"connection reset by peer", true, "connection reset"},
		{"timeout exceeded", true, "timeout"},
		{"deadline exceeded", true, "deadline exceeded"},
		{"context canceled", true, "context canceled"},
		{"lookup error: volume not found", true, "lookup error"},
		{"connection refused", true, "connection refused"},
		{"too many requests", true, "rate limiting"},
		{"service unavailable", true, "service unavailable"},
		{"i/o timeout", true, "I/O timeout"},
		{"broken pipe", true, "broken pipe"},
		{"not found", false, "not found (not retryable)"},
		{"invalid file id", false, "invalid input (not retryable)"},
		{"", false, "empty error"},
	}

	for _, tc := range testCases {
		result := isRetryableError(tc.error)
		if result != tc.retryable {
			t.Errorf("%s: expected retryable=%v, got %v for error: %q",
				tc.description, tc.retryable, result, tc.error)
		}
	}
}

func TestDeletionRetryQueue_HeapOrdering(t *testing.T) {
	queue := NewDeletionRetryQueue()

	now := time.Now()

	// Add items with different retry times (out of order)
	items := []*DeletionRetryItem{
		{FileId: "file3", RetryCount: 1, NextRetryAt: now.Add(30 * time.Second), LastError: "error3"},
		{FileId: "file1", RetryCount: 1, NextRetryAt: now.Add(10 * time.Second), LastError: "error1"},
		{FileId: "file2", RetryCount: 1, NextRetryAt: now.Add(20 * time.Second), LastError: "error2"},
	}

	// Add items directly (simulating internal state)
	for _, item := range items {
		queue.lock.Lock()
		queue.itemIndex[item.FileId] = item
		queue.heap = append(queue.heap, item)
		queue.lock.Unlock()
	}

	// Use container/heap.Init to establish heap property
	queue.lock.Lock()
	heap.Init(&queue.heap)
	queue.lock.Unlock()

	// Verify heap maintains min-heap property (earliest time at top)
	queue.lock.Lock()
	if queue.heap[0].FileId != "file1" {
		t.Errorf("Expected file1 at heap top (earliest time), got %s", queue.heap[0].FileId)
	}
	queue.lock.Unlock()

	// Set all items to ready while preserving their relative order
	queue.lock.Lock()
	for _, item := range queue.itemIndex {
		// Shift all times back by 40 seconds to make them ready, but preserve order
		item.NextRetryAt = item.NextRetryAt.Add(-40 * time.Second)
	}
	heap.Init(&queue.heap) // Re-establish heap property after modification
	queue.lock.Unlock()

	// GetReadyItems should return items in NextRetryAt order
	readyItems := queue.GetReadyItems(10)
	expectedOrder := []string{"file1", "file2", "file3"}

	if len(readyItems) != 3 {
		t.Fatalf("Expected 3 ready items, got %d", len(readyItems))
	}

	for i, item := range readyItems {
		if item.FileId != expectedOrder[i] {
			t.Errorf("Item %d: expected %s, got %s", i, expectedOrder[i], item.FileId)
		}
	}
}

func TestDeletionRetryQueue_DuplicateFileIds(t *testing.T) {
	queue := NewDeletionRetryQueue()

	// Add same file ID twice with retryable error - simulates duplicate in batch
	queue.AddOrUpdate("file1", "timeout error")

	// Verify only one item exists in queue
	if queue.Size() != 1 {
		t.Fatalf("Expected queue size 1 after first add, got %d", queue.Size())
	}

	// Get initial retry count
	queue.lock.Lock()
	item1, exists := queue.itemIndex["file1"]
	if !exists {
		queue.lock.Unlock()
		t.Fatal("Item not found in queue after first add")
	}
	initialRetryCount := item1.RetryCount
	queue.lock.Unlock()

	// Add same file ID again - should NOT increment retry count (just update error)
	queue.AddOrUpdate("file1", "timeout error again")

	// Verify still only one item exists in queue (not duplicated)
	if queue.Size() != 1 {
		t.Errorf("Expected queue size 1 after duplicate add, got %d (duplicates detected)", queue.Size())
	}

	// Verify retry count did NOT increment (AddOrUpdate only updates error, not count)
	queue.lock.Lock()
	item2, exists := queue.itemIndex["file1"]
	queue.lock.Unlock()

	if !exists {
		t.Fatal("Item not found in queue after second add")
	}
	if item2.RetryCount != initialRetryCount {
		t.Errorf("Expected RetryCount to stay at %d after duplicate add (should not increment), got %d", initialRetryCount, item2.RetryCount)
	}
	if item2.LastError != "timeout error again" {
		t.Errorf("Expected LastError to be updated to 'timeout error again', got %q", item2.LastError)
	}
}
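The TestIsRetryableError table above drives the string-matching classifier the commit message describes (and acknowledges as brittle, with a TODO to move to structured errors and status codes). A sketch of such a classifier, using a hypothetical pattern list assembled only from the commit message and the test cases (the actual list in this PR may differ and documents each pattern), could be:

package filer

import "strings"

// retryablePatterns is a hypothetical substring list; the real set of
// patterns lives in the deletion code changed by this PR.
var retryablePatterns = []string{
	"is read only",
	"connection reset",
	"connection refused",
	"broken pipe",
	"timeout",
	"deadline exceeded",
	"context canceled",
	"lookup error",
	"lookup failed",
	"too many requests",
	"service unavailable",
	"temporarily unavailable",
	"try again",
}

// isRetryableError classifies a deletion error string by substring matching.
func isRetryableError(errStr string) bool {
	if errStr == "" {
		return false // defensive check for empty error strings
	}
	lower := strings.ToLower(errStr)
	for _, pattern := range retryablePatterns {
		if strings.Contains(lower, pattern) {
			return true
		}
	}
	return false
}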