seaweedfs


Anton b4c7d42a06	1 day ago

fix(admin): release mutex before disk I/O in maintenance queue; remove per-request LoadAllTaskStates (#8433)

* fix(admin): release mutex before disk I/O in maintenance queue

  saveTaskState performs synchronous BoltDB writes. Calling it while holding mq.mutex.Lock() in AddTask, GetNextTask, and CompleteTask blocks all readers (GetTasks via RLock) for the full disk write duration on every task state change.

  During a maintenance scan AddTasksFromResults calls AddTask for every volume (potentially hundreds of times), meaning the write lock is held almost continuously. The HTTP handler for /maintenance calls GetTasks which blocks on RLock, exceeding the 30s timeout and returning 408 to the browser.

  Fix: update in-memory state (mq.tasks, mq.pendingTasks) under the lock as before, then unlock before calling saveTaskState. In-memory state is the authoritative source; persistence is crash-recovery only and does not require lock protection during the write.

* fix(admin): add mutex to ConfigPersistence to synchronize tasks/ filesystem ops

  saveTaskState is now called outside mq.mutex, meaning SaveTaskState, LoadAllTaskStates, DeleteTaskState, and CleanupCompletedTasks can be invoked concurrently from multiple goroutines. ConfigPersistence had no internal synchronization, creating races on the tasks/ directory:

  - concurrent os.WriteFile + os.ReadFile on the same .pb file could yield a partial read and unmarshal error
  - LoadAllTaskStates (ReadDir + per-file ReadFile) could see a directory entry for a file being written or deleted concurrently
  - CleanupCompletedTasks (LoadAllTaskStates + DeleteTaskState) could race with SaveTaskState on the same file

  Fix: add tasksMu sync.Mutex to ConfigPersistence, acquired at the top of SaveTaskState, LoadTaskState, LoadAllTaskStates, DeleteTaskState, and CleanupCompletedTasks. Extract private Locked helpers so that CleanupCompletedTasks (which holds tasksMu) can call them internally without deadlocking.

---------

Co-authored-by: Anton Ustyugov <anton@devops>
config_schema.go	Admin: misc improvements on admin server and workers. EC now works. (#7055)	7 months ago
config_verification.go	Admin: misc improvements on admin server and workers. EC now works. (#7055)	7 months ago
maintenance_config_proto.go	Admin: misc improvements on admin server and workers. EC now works. (#7055)	7 months ago
maintenance_integration.go	admin: fix capacity leak in maintenance system by preserving Task IDs (#8214)	3 weeks ago
maintenance_manager.go	admin: fix capacity leak in maintenance system by preserving Task IDs (#8214)	3 weeks ago
maintenance_manager_test.go	Admin UI add maintenance menu (#6944)	8 months ago
maintenance_queue.go	fix(admin): release mutex before disk I/O in maintenance queue; remove per-request LoadAllTaskStates (#8433)	1 day ago
maintenance_queue_test.go	admin: fix capacity leak in maintenance system by preserving Task IDs (#8214)	3 weeks ago
maintenance_scanner.go	Fix issue #7880: Tasks use Volume IDs instead of ip:port (#7881)	2 months ago
maintenance_types.go	admin: fix capacity leak in maintenance system by preserving Task IDs (#8214)	3 weeks ago
maintenance_worker.go	admin: Refactor task destination planning (#7063)	7 months ago
pending_operations.go	Admin: misc improvements on admin server and workers. EC now works. (#7055)	7 months ago
pending_operations_test.go	Fix issue #7880: Tasks use Volume IDs instead of ip:port (#7881)	2 months ago