* Filer: Add retry mechanism for failed file deletions
Implement a retry queue with exponential backoff for handling transient
deletion failures, particularly when volumes are temporarily read-only.
Key features:
- Automatic retry for retryable errors (read-only volumes, network issues)
- Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
- Maximum 10 retry attempts per file before giving up
- Separate goroutine processing the retry queue every minute
- Enhanced logging with retry/permanent error classification
This addresses the issue where file deletions fail while volumes are
temporarily read-only (tiered volumes, maintenance, etc.); previously
those deletions were silently lost.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
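For reference, a minimal sketch of the backoff schedule this commit describes, in Go. The values (5-minute base, 6-hour cap, 10 attempts) come from the commit; `MaxRetryDelay` and `MaxRetryAttempts` are named later in this history, while `InitialRetryDelay` and `nextRetryDelay` are assumed names, not necessarily the actual identifiers:

```go
package filer

import "time"

// Schedule from the commit: 5min → 10min → 20min → ... capped at 6 hours,
// at most 10 attempts per file.
const (
	InitialRetryDelay = 5 * time.Minute // assumed name
	MaxRetryDelay     = 6 * time.Hour
	MaxRetryAttempts  = 10
)

// nextRetryDelay returns the delay before the given 1-based attempt:
// attempt 1 → 5m, attempt 2 → 10m, doubling until the cap.
func nextRetryDelay(attempt int) time.Duration {
	delay := InitialRetryDelay << (attempt - 1)
	if delay <= 0 || delay > MaxRetryDelay { // cap (and guard the shift)
		delay = MaxRetryDelay
	}
	return delay
}
```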
* Update weed/filer/filer_deletion.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Filer: Add retry mechanism for failed file deletions
Implement a retry queue with exponential backoff for handling transient
deletion failures, particularly when volumes are temporarily read-only.
Key features:
- Automatic retry for retryable errors (read-only volumes, network issues)
- Exponential backoff: 5min → 10min → 20min → ... (max 6 hours)
- Maximum 10 retry attempts per file before giving up
- Separate goroutine processing the retry queue every minute
- Map-based retry queue for O(1) lookups and deletions
- Enhanced logging with retry/permanent error classification
- Consistent error detail limiting (max 10 total errors logged)
- Graceful shutdown support with quit channel for both processors
This addresses the issue where file deletions fail while volumes are
temporarily read-only (tiered volumes, maintenance, etc.); previously
those deletions were silently lost.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
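The "graceful shutdown support with quit channel" bullet presumably corresponds to the standard select-on-ticker loop; a sketch under that assumption (the method name is taken from a later commit in this history, the receiver and body are illustrative):

```go
// loopProcessingDeletionRetry drains the retry queue once a minute and
// exits cleanly when the quit channel is closed.
func (f *Filer) loopProcessingDeletionRetry(quit chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-quit: // graceful shutdown
			return
		case <-ticker.C:
			// pop ready items from the retry queue and re-attempt deletion
		}
	}
}
```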
* Filer: Replace magic numbers with named constants in retry processor
Replace hardcoded values with package-level constants for better
maintainability:
- DeletionRetryPollInterval (1 minute): interval for checking retry queue
- DeletionRetryBatchSize (1000): max items to process per iteration
This improves code readability and makes configuration changes easier.
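Both names and values are given in the commit, so the declaration is presumably close to:

```go
const (
	// DeletionRetryPollInterval is how often the retry processor wakes up
	// to check the queue for items whose backoff has elapsed.
	DeletionRetryPollInterval = time.Minute
	// DeletionRetryBatchSize caps the items processed per iteration so a
	// large queue cannot monopolize the processor.
	DeletionRetryBatchSize = 1000
)
```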
* Filer: Optimize retry queue with min-heap data structure
Replace map-based retry queue with a min-heap for better scalability
and deterministic ordering.
Performance improvements:
- GetReadyItems: O(N) → O(K log N) where K is items retrieved
- AddOrUpdate: O(1) → O(log N) (acceptable trade-off)
- Early exit when checking ready items (heap top is earliest)
- No full iteration over all items while holding lock
Benefits:
- Deterministic processing order (earliest NextRetryAt first)
- Better scalability for large retry queues (thousands of items)
- Reduced lock contention duration
- Memory efficient (no separate slice reconstruction)
Implementation:
- Min-heap ordered by NextRetryAt using container/heap
- Dual index: heap for ordering + map for O(1) FileId lookups
- heap.Fix() used when updating existing items
- Comprehensive complexity documentation in comments
This addresses the performance bottleneck identified in GetReadyItems
where iterating over the entire map with a write lock could block
other goroutines in high-failure scenarios.
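A sketch of the dual-index structure this commit describes, using `container/heap`. `DeletionRetryQueue`, `FileId`, `RetryCount`, `NextRetryAt` and `GetReadyItems` are named in this history; the remaining identifiers and the exact locking are assumptions:

```go
package filer

import (
	"container/heap"
	"sync"
	"time"
)

type DeletionRetryItem struct {
	FileId      string
	RetryCount  int
	NextRetryAt time.Time
	index       int // position in the heap, maintained by Swap/Push/Pop
}

// retryHeap is a min-heap ordered by NextRetryAt (earliest on top).
type retryHeap []*DeletionRetryItem

func (h retryHeap) Len() int           { return len(h) }
func (h retryHeap) Less(i, j int) bool { return h[i].NextRetryAt.Before(h[j].NextRetryAt) }
func (h retryHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *retryHeap) Push(x any) {
	item := x.(*DeletionRetryItem)
	item.index = len(*h)
	*h = append(*h, item)
}
func (h *retryHeap) Pop() any {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// DeletionRetryQueue pairs the heap (ordering) with a map (O(1) FileId lookup).
type DeletionRetryQueue struct {
	mu    sync.Mutex
	heap  retryHeap
	items map[string]*DeletionRetryItem
}

// GetReadyItems pops at most max items whose NextRetryAt has passed.
// The heap top is the earliest deadline, so the loop exits early without
// scanning the whole queue while holding the lock: O(K log N) for K items.
func (q *DeletionRetryQueue) GetReadyItems(max int) (ready []*DeletionRetryItem) {
	q.mu.Lock()
	defer q.mu.Unlock()
	now := time.Now()
	for len(q.heap) > 0 && len(ready) < max && !q.heap[0].NextRetryAt.After(now) {
		item := heap.Pop(&q.heap).(*DeletionRetryItem)
		delete(q.items, item.FileId)
		ready = append(ready, item)
	}
	return
}
```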
* Filer: Modernize heap interface and improve error handling docs
1. Replace interface{} with any in heap methods
- Addresses modern Go style (Go 1.18+)
- Improves code readability
2. Enhance isRetryableError documentation
- Acknowledge string matching brittleness
- Add comprehensive TODO for future improvements:
* Use HTTP status codes (503, 429, etc.)
* Implement structured error types with errors.Is/As
* Extract gRPC status codes
* Add error wrapping for better context
- Document each error pattern with context
- Add defensive check for empty error strings
Current implementation remains pragmatic for initial release while
documenting a clear path for future robustness improvements. String
matching is acceptable for now but should be replaced with structured
error checking when refactoring the deletion pipeline.
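A sketch of the string-matching classifier described above; the pattern list is drawn from this commit and the later "comprehensive error patterns" commit in this history, and may not match the tree exactly:

```go
package filer

import "strings"

// isRetryableError classifies an error message as transient (worth queueing
// for retry) or permanent. String matching is brittle, as the commit notes;
// the documented follow-up is structured errors via errors.Is/As plus
// HTTP/gRPC status codes.
func isRetryableError(errMsg string) bool {
	if errMsg == "" { // defensive check for empty error strings
		return false
	}
	retryablePatterns := []string{
		"read only",               // tiered volumes, maintenance windows
		"timeout",                 // network and i/o timeouts
		"deadline exceeded",       // context deadlines
		"connection refused",      // volume server restarting
		"broken pipe",             // dropped connections
		"too many requests",       // backpressure (429)
		"service unavailable",     // transient 503s
		"temporarily unavailable", // transient volume states
		"try again",               // generic transient hint
	}
	msg := strings.ToLower(errMsg)
	for _, pattern := range retryablePatterns {
		if strings.Contains(msg, pattern) {
			return true
		}
	}
	return false
}
```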
* Filer: Refactor deletion processors for better readability
Extract large callback functions into dedicated private methods to
improve code organization and maintainability.
Changes:
1. Extract processDeletionBatch method
- Handles deletion of a batch of file IDs
- Classifies errors (success, not found, retryable, permanent)
- Manages retry queue additions
- Consolidates logging logic
2. Extract processRetryBatch method
- Handles retry attempts for previously failed deletions
- Processes retry results and updates queue
- Symmetric to processDeletionBatch for consistency
Benefits:
- Main loop functions (loopProcessingDeletion, loopProcessingDeletionRetry)
are now concise and focused on orchestration
- Business logic is separated into testable methods
- Reduced nesting depth improves readability
- Easier to understand control flow at a glance
- Better separation of concerns
The refactored methods follow the single responsibility principle,
making the codebase more maintainable and easier to extend.
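The extracted method plausibly has the following shape. The four buckets come straight from the commit; the per-file result type, the `deletionRetryQueue` field, and the signatures are assumptions (`AddOrUpdate` is named elsewhere in this history, its signature is assumed; `glog` is the project's logging package):

```go
import (
	"strings"

	"github.com/seaweedfs/seaweedfs/weed/glog"
)

type deleteResult struct { // assumed shape
	fileId string
	err    error
}

// processDeletionBatch classifies each outcome as success, not found,
// retryable, or permanent, and queues only the retryables.
func (f *Filer) processDeletionBatch(results []deleteResult) {
	var notFound, retryable, permanent int
	for _, r := range results {
		switch {
		case r.err == nil:
			// success: nothing to do
		case strings.Contains(r.err.Error(), "not found"):
			notFound++ // already gone; treated as success
		case isRetryableError(r.err.Error()):
			retryable++
			f.deletionRetryQueue.AddOrUpdate(r.fileId)
		default:
			permanent++ // logged and dropped
		}
	}
	glog.V(1).Infof("deletion batch: %d ok, %d not found, %d queued for retry, %d permanent",
		len(results)-notFound-retryable-permanent, notFound, retryable, permanent)
}
```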
* Update weed/filer/filer_deletion.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Filer: Fix critical retry count bug and add comprehensive error patterns
Critical bug fixes from PR review:
1. Fix RetryCount reset bug (CRITICAL)
- Problem: When items are re-queued via AddOrUpdate, RetryCount
resets to 1, breaking exponential backoff
- Solution: Add RequeueForRetry() method that preserves retry state
- Impact: Ensures proper exponential backoff progression
2. Add overflow protection in backoff calculation
- Check shift amount > 63 to prevent bit-shift overflow
- Additional safety: check if delay <= 0 or > MaxRetryDelay
- Protects against arithmetic overflow in extreme cases
3. Expand retryable error patterns
- Added: timeout, deadline exceeded, context canceled
- Added: lookup error/failed (volume discovery issues)
- Added: connection refused, broken pipe (network errors)
- Added: too many requests, service unavailable (backpressure)
- Added: temporarily unavailable, try again (transient errors)
- Added: i/o timeout (network timeouts)
Benefits:
- Retry mechanism now works correctly across restarts
- More robust against edge cases and overflow
- Better coverage of transient failure scenarios
- Improved resilience in high-failure environments
Addresses feedback from CodeRabbit and Gemini Code Assist in PR #7402.
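The two fixes, sketched against the queue structure shown earlier; `RequeueForRetry`, the preserved `RetryCount`, the shift guard, and the delay checks all come from the commit text, while the exact placement of the max-attempts check is assumed:

```go
// RequeueForRetry puts a just-attempted item back on the queue while
// preserving RetryCount, so backoff keeps progressing instead of
// resetting to 1 (the bug AddOrUpdate had for re-queued items).
func (q *DeletionRetryQueue) RequeueForRetry(item *DeletionRetryItem) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if item.RetryCount >= MaxRetryAttempts {
		return // discard: exceeded MaxRetryAttempts (caller logs the permanent failure)
	}
	item.RetryCount++
	item.NextRetryAt = time.Now().Add(nextRetryDelay(item.RetryCount))
	q.items[item.FileId] = item
	heap.Push(&q.heap, item)
}

// nextRetryDelay, revised from the earlier sketch with the overflow
// protection this commit adds.
func nextRetryDelay(attempt int) time.Duration {
	shift := attempt - 1
	if shift > 63 { // bit-shift overflow protection
		return MaxRetryDelay
	}
	delay := InitialRetryDelay << shift
	if delay <= 0 || delay > MaxRetryDelay { // arithmetic overflow or cap
		delay = MaxRetryDelay
	}
	return delay
}
```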
* Filer: Add persistence docs and comprehensive unit tests
Documentation improvements:
1. Document in-memory queue limitation
- Acknowledge that retry queue is volatile (lost on restart)
- Document trade-offs and future persistence options
- Provide clear path for production hardening
- Note eventual consistency through main deletion queue
Unit test coverage:
1. TestDeletionRetryQueue_AddAndRetrieve
- Basic add/retrieve operations
- Verify items not ready before delay elapsed
2. TestDeletionRetryQueue_ExponentialBackoff
- Verify exponential backoff progression (5m→10m→20m→40m→80m)
- Validate delay calculations with timing tolerance
3. TestDeletionRetryQueue_OverflowProtection
- Test high retry counts (60+) that could cause overflow
- Verify capping at MaxRetryDelay
4. TestDeletionRetryQueue_MaxAttemptsReached
- Verify items discarded after MaxRetryAttempts
- Confirm proper queue cleanup
5. TestIsRetryableError
- Comprehensive error pattern coverage
- Test all retryable error types (timeout, connection, lookup, etc.)
- Verify non-retryable errors correctly identified
6. TestDeletionRetryQueue_HeapOrdering
- Verify min-heap property maintained
- Test items processed in NextRetryAt order
- Validate heap.Init() integration
All tests passing. Addresses PR feedback on testing requirements.
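As an illustration, the backoff test named above might assert the documented progression like this (written against the `nextRetryDelay` sketch; the actual test exercises the queue itself with a timing tolerance):

```go
import (
	"testing"
	"time"
)

func TestDeletionRetryQueue_ExponentialBackoff(t *testing.T) {
	want := []time.Duration{5 * time.Minute, 10 * time.Minute,
		20 * time.Minute, 40 * time.Minute, 80 * time.Minute}
	for i, expected := range want {
		if got := nextRetryDelay(i + 1); got != expected {
			t.Errorf("attempt %d: got %v, want %v", i+1, got, expected)
		}
	}
}
```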
* Filer: Add code quality improvements for deletion retry
Address PR feedback with minor optimizations:
- Add MaxLoggedErrorDetails constant (replaces magic number 10)
- Pre-allocate slices and maps in processRetryBatch for efficiency
- Improve log message formatting to use constant
These changes improve code maintainability and runtime performance
without altering functionality.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
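The pre-allocation is presumably the usual capacity-hint pattern; a sketch with illustrative names (the helper does not exist under this name in the tree):

```go
// Both containers are sized up front, so neither re-grows while the
// batch is assembled.
func buildRetryBatch(items []*DeletionRetryItem) ([]string, map[string]*DeletionRetryItem) {
	fileIds := make([]string, 0, len(items))
	byFileId := make(map[string]*DeletionRetryItem, len(items))
	for _, item := range items {
		fileIds = append(fileIds, item.FileId)
		byFileId[item.FileId] = item
	}
	return fileIds, byFileId
}
```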
* refactor retrying logic
* use constant
* assert
* address comment
* refactor
* address comments
* dedup
* process retried deletions
* address comment
* also check in-flight items; dedup code
* refactoring
* refactoring
* simplify
* reset heap
* more efficient
* add DeletionBatchSize as a constant; classification precedence: Permanent > Retryable > Success > Not Found
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: chrislu <chris.lu@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
* batch deletion operations to return individual error results
Modify batch deletion operations to return individual per-file error results instead of one aggregated error, enabling better tracking of which specific files failed to delete and helping reduce orphaned files.
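A sketch of the per-file result shape this implies; the type and function names are assumptions, not the actual API:

```go
type DeleteResult struct {
	FileId string
	Err    error
}

// deleteFileIds returns one result per file id instead of a single
// aggregated error, so callers can see exactly which deletions failed.
func deleteFileIds(fileIds []string, deleteOne func(fileId string) error) []DeleteResult {
	results := make([]DeleteResult, 0, len(fileIds))
	for _, fid := range fileIds {
		results = append(results, DeleteResult{FileId: fid, Err: deleteOne(fid)})
	}
	return results
}
```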
* Simplified logging logic
* Optimized nested loop
* handles the edge case where the RPC succeeds but connection cleanup fails
* simplify
* simplify
* ignore 'not found' errors here
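Ignoring "not found" here follows the usual idempotent-delete convention: the file is already gone, which is the outcome the caller wanted. A sketch reusing the per-file result type from above; the string check is illustrative, the real code may use a structured error:

```go
// ignoreNotFound drops "not found" results before error counting, so
// already-deleted files are not reported as failures.
func ignoreNotFound(results []DeleteResult) []DeleteResult {
	kept := results[:0]
	for _, r := range results {
		if r.Err != nil && strings.Contains(r.Err.Error(), "not found") {
			continue // already deleted: not a failure
		}
		kept = append(kept, r)
	}
	return kept
}
```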