seaweedfs

Commit Graph

Author	SHA1	Message	Date
Chris Lu	7f3f61ea28	fix: resolve Kafka gateway response deadlocks causing Sarama client hangs (#8762 ) * fix: resolve Kafka gateway response deadlocks causing Sarama client hangs Fix three bugs in the Kafka protocol handler that caused sequential clients (notably Sarama) to hang during E2E tests: 1. Race condition in correlation queue ordering: the correlation ID was added to the response ordering queue AFTER sending the request to the processing channel. A fast processor (e.g. ApiVersions) could finish and send its response before the ID was in the queue, causing the response writer to miss it — permanently deadlocking the connection. Now the ID is added BEFORE the channel send, with error response injection on send failure. 2. Silent error response drops: when processRequestSync returned an error, the response writer logged it but never sent anything back to the client. The client would block forever waiting for bytes that never arrived. Now sends a Kafka UNKNOWN_SERVER_ERROR response. 3. Produce V0/V1 missing timeout_ms parsing: the handler skipped the 4-byte timeout field, reading it as topicsCount instead. This caused incorrect parsing of the entire produce request for V0/V1 clients. * fix: API-versioned error responses, unsupported-version queue fix, V0V1 header alignment 1. errors.go — BuildAPIErrorResponse: emits a minimal-but-valid error body whose layout matches the schema the client expects for each API key and version (throttle_time position, array fields, etc.). The old 2-byte generic body corrupted the protocol stream for APIs whose response begins with throttle_time_ms or an array. 2. handler.go — unsupported-version path: the correlationID was never added to correlationQueue before sending to responseChan, so the response writer could never match it and the client hung. Now appends the ID under correlationQueueMu before the send. 3. produce.go — handleProduceV0V1: requestBody is already post-header (HandleConn strips client_id). The handler was erroneously parsing acks bytes as a client_id length, misaligning all subsequent field reads. Removed the client_id parsing; offset now starts at 0 with acks(2) + timeout_ms(4) + topicsCount(4), matching handleProduceV2Plus. * fix: free pooled message buffer per-iteration instead of deferring The read loop allocated messageBuf via mem.Allocate and deferred mem.Free. Since the defer only runs when HandleConn returns, pool buffers accumulated for the entire connection lifetime — one per request. Worse, the deferred frees ran in LIFO order before wg.Wait(), so processing goroutines could read from already-freed pool buffers. Now: read into a pooled buffer, immediately copy to Go-managed memory, and return the pool buffer. messageBuf is a regular slice safe for async goroutine access with no defer accumulation. * fix: cancel context before wg.Wait and on worker response-send timeout Two related issues: 1. Cleanup defer ordering deadlock: defers run LIFO — the cleanup defer (close channels, wg.Wait) ran before the cancel() defer. The response writer is in the WaitGroup and exits only on ctx.Done() or responseChan close, but both signals came after wg.Wait(). Deadlock on every normal connection close (EOF, read error, queue-full). Fix: call cancel() at the start of the cleanup defer, before wg.Wait(). 2. Worker 5s response-send timeout: when the timeout fired, the response was silently dropped but the correlationID remained in the ordered queue. The response writer could never advance past it, stalling all subsequent responses permanently. Fix: call cancel() to tear down the connection — if we cannot deliver a response in 5s the connection is irrecoverable. * chore: remove empty no-op ListOffsets conditional The `if apiKey == 2 {}` block had no body — leftover debug code. ListOffsets routing is handled by isDataPlaneAPI (returns false, sending it to the control channel). No behavior change.	3 days ago
Chris Lu	6c35a3724a	weed/mount: simplify metadata flush retry returns (#8763 )	3 days ago
Chris Lu	cca1555cc7	mount: implement create for rsync temp files (#8749 ) * mount: implement create for rsync temp files * mount: move access implementation out of unsupported * mount: tighten access checks * mount: log access group lookup failures * mount: reset dirty pages on truncate * mount: tighten create and root access handling * mount: handle existing creates before quota checks * mount: restrict access fallback when group lookup fails When lookupSupplementaryGroupIDs returns an error, the previous code fell through to checking only the "other" permission bits, which could overgrant access. Require both group and other permission classes to satisfy the mask so access is never broader than intended. * mount: guard against nil entry in Create existing-file path maybeLoadEntry can return OK with a nil entry or nil Attributes in edge cases. Check before dereferencing to prevent a panic. * mount: reopen existing file on create race without O_EXCL When createRegularFile returns EEXIST because another process won the race, and O_EXCL is not set, reload the winner's entry and open it instead of propagating the error to the caller. * mount: check parent directory permission in createRegularFile Verify the caller has write+search (W_OK\|X_OK) permission on the parent directory before creating a file. This applies to both Create and Mknod. Update test fixture mount mode to 0o777 so the existing tests pass with the new check. * mount: enforce file permission bits in AcquireHandle Map the open flags (O_RDONLY/O_WRONLY/O_RDWR) to an access mask and call hasAccess before handing out a file handle. This makes AcquireHandle the single source of truth for mode-based access control across Open, Create-existing, and Create-new paths. --------- Co-authored-by: Copilot <copilot@github.com>	3 days ago
Chris Lu	805625d06e	Add FUSE integration tests for POSIX file locking (#8752 ) * Add FUSE integration tests for POSIX file locking Test flock() and fcntl() advisory locks over the FUSE mount: - Exclusive and shared flock with conflict detection - flock upgrade (shared to exclusive) and release on close - fcntl F_SETLK write lock conflicts and shared read locks - fcntl F_GETLK conflict reporting on overlapping byte ranges - Non-overlapping byte-range locks held independently - F_SETLKW blocking until conflicting lock is released - Lock release on file descriptor close - Concurrent lock contention with multiple workers * Fix review feedback in POSIX lock integration tests - Assert specific EAGAIN error on fcntl lock conflicts instead of generic Error - Use O_APPEND in concurrent contention test so workers append rather than overwrite - Verify exact line count (numWorkers * writesPerWorker) after concurrent test - Check unlock error in F_SETLKW blocking test goroutine * Refactor fcntl tests to use subprocesses for inter-process semantics POSIX fcntl locks use the process's files_struct as lock owner, so all fds in the same process share the same owner and never conflict. This caused the fcntl tests to silently pass without exercising lock conflicts. Changes: - Add TestFcntlLockHelper subprocess entry point with hold/try/getlk actions - Add lockHolder with channel-based coordination (no scanner race) - Rewrite all fcntl tests to run contenders in separate subprocesses - Fix F_UNLCK int16 cast in GetLk assertion for type-safe comparison - Fix concurrent test: use non-blocking flock with retry to avoid exhausting go-fuse server reader goroutines (blocking FUSE SETLKW can starve unlock request processing, causing deadlock) flock tests remain same-process since flock uses per-struct-file owners. * Fix misleading comment and error handling in lock test subprocess - Fix comment: tryLockInSubprocess tests a subprocess, not the test process - Distinguish EAGAIN/EACCES from unexpected errors in subprocess try mode so real failures aren't silently masked as lock conflicts * Fix CI race in FcntlReleaseOnClose and increase flock retry budget - FcntlReleaseOnClose: retry lock acquisition after subprocess exits since the FUSE server may not process Release immediately - ConcurrentLockContention: increase retry limit from 500 to 3000 (5s → 30s budget) to handle CI load * separating flock and fcntl in the in-memory lock table and cleaning them up through the right release path: PID for POSIX locks, lock owner for flock * ReleasePosixOwner * weed/mount: flush before releasing posix close owner * weed/mount: keep woken lock waiters from losing inode state * test/fuse: make blocking fcntl helper state explicit * test/fuse: assert flock contention never overlaps * test/fuse: stabilize concurrent lock contention check * test/fuse: make concurrent contention writes deterministic * weed/mount: retry synchronous metadata flushes	3 days ago
Lars Lehtonen	9cc26d09e8	chore:(weed/worker/tasks/erasure_coding): Prune Unused and Untested Functions (#8761 ) * chore(weed/worker/tasks/erasure_coding): prune unused findVolumeReplicas() * chore(weed/worker/tasks/erasure_coding): prune unused isDiskSuitableForEC() * chore(weed/worker/tasks/erasure_coding): prune unused selectBestECDestinations() * chore(weed/worker/tasks/erasure_coding): prune unused candidatesToDiskInfos()	3 days ago
Chris Lu	3d872e86f8	Implement POSIX file locking for FUSE mount (#8750 ) * Add POSIX byte-range lock table for FUSE mount Implement PosixLockTable with per-inode range lock tracking supporting: - Shared (F_RDLCK) and exclusive (F_WRLCK) byte-range locks - Conflict detection across different lock owners - Lock coalescing for adjacent/overlapping same-owner same-type locks - Lock splitting on partial-range unlock - Blocking waiter support for SetLkw with cancellation - Owner-based cleanup for Release * Wire POSIX lock handlers into FUSE mount Implement GetLk, SetLk, SetLkw on WFS delegating to PosixLockTable. Add posixLocks field to WFS and initialize in constructor. Clean up locks on Release via ReleaseOwner using ReleaseIn.LockOwner. Remove ENOSYS stubs from weedfs_unsupported.go. * Enable POSIX and flock lock capabilities in FUSE mount Set EnableLocks: true in mount options to advertise CAP_POSIX_LOCKS and CAP_FLOCK_LOCKS during FUSE INIT. * Avoid thundering herd in lock waiter wake-up Replace broadcast-all wakeWaiters with selective wakeEligibleWaiters that checks each waiter's requested lock against remaining held locks. Only waiters whose request no longer conflicts are woken; others stay queued. Store the requested lockRange in each lockWaiter to enable this. * Fix uint64 overflow in adjacency check for lock coalescing Guard h.End+1 and lk.End+1 with < ^uint64(0) checks so that End == math.MaxUint64 (EOF) does not wrap to 0 and falsely merge non-adjacent locks. * Add test for non-adjacent ranges with gap not being coalesced	4 days ago
Chris Lu	e5f72077ee	fix: resolve CORS cache race condition causing stale 404 responses (#8748 ) The metadata subscription handler (updateBucketConfigCacheFromEntry) was making a separate RPC call via loadCORSFromBucketContent to load CORS configuration. This created a race window where a slow CreateBucket subscription event could re-cache stale data after PutBucketCors had already cleared the cache, causing subsequent GetBucketCors to return 404 NoSuchCORSConfiguration. Parse CORS directly from the subscription entry's Content field instead of making a separate RPC. Also fix getBucketConfig to parse CORS from the already-fetched entry, eliminating a redundant RPC call. Fix TestCORSCaching to use require.NoError to prevent nil pointer dereference panics when GetBucketCors fails.	4 days ago
Chris Lu	c31e6b4684	Use filer-side copy for mounted whole-file copy_file_range (#8747 ) * Optimize mounted whole-file copy_file_range * Address mounted copy review feedback * Harden mounted copy fast path --------- Co-authored-by: Copilot <copilot@github.com>	4 days ago
Chris Lu	6bf654c25c	fix: keep metadata subscriptions progressing (#8730 ) (#8746 ) * fix: keep metadata subscriptions progressing (#8730) * test: cancel slow metadata writers with parent context * filer: ignore missing persisted log chunks	4 days ago
Chris Lu	d5ee35c8df	Fix S3 delete for non-empty directory markers (#8740 ) * Fix S3 delete for non-empty directory markers * Address review feedback on directory marker deletes * Stabilize FUSE concurrent directory operations	4 days ago
Mmx233	ecadeddcbe	fix: extend ignore404Error to match 404 Not Found string from S3 sink… (#8741 ) * fix: extend ignore404Error to match 404 Not Found string from S3 sink errors * test: add unit tests for isIgnorable404 error matching * improve: pre-compute ignorable 404 string and simplify isIgnorable404 * test: replace init() with TestMain for global HTTP client setup	4 days ago
Chris Lu	9434d3733d	mount: async flush on close() when writebackCache is enabled (#8727 ) * mount: async flush on close() when writebackCache is enabled When -writebackCache is enabled, defer data upload and metadata flush from Flush() (triggered by close()) to a background goroutine in Release(). This allows processes like rsync that write many small files to proceed to the next file immediately instead of blocking on two network round-trips (volume upload + filer metadata) per file. Fixes #8718 * mount: add retry with backoff for async metadata flush The metadata flush in completeAsyncFlush now retries up to 3 times with exponential backoff (1s, 2s, 4s) on transient gRPC errors. Since the chunk data is already safely on volume servers at this point, only the filer metadata reference needs persisting — retrying is both safe and effective. Data flush (FlushData) is not retried externally because UploadWithRetry already handles transient HTTP/gRPC errors internally; if it still fails, the chunk memory has been freed. * test: add integration tests for writebackCache async flush Add comprehensive FUSE integration tests for the writebackCache async flush feature (issue #8718): - Basic operations: write/read, sequential files, large files, empty files, overwrites - Fsync correctness: fsync forces synchronous flush even in writeback mode, immediate read-after-fsync - Concurrent small files: multi-worker parallel writes (rsync-like workload), multi-directory, rapid create/close - Data integrity: append after close, partial writes, file size correctness, binary data preservation - Performance comparison: writeback vs synchronous flush throughput - Stress test: 16 workers x 100 files with content verification - Mixed concurrent operations: reads, writes, creates running together Also fix pre-existing test infrastructure issues: - Rename framework.go to framework_test.go (fixes Go package conflict) - Fix undefined totalSize variable in concurrent_operations_test.go * ci: update fuse-integration workflow to run full test suite The workflow previously only ran placeholder tests (simple_test.go, working_demo_test.go) in a temp directory due to a Go module conflict. Now that framework.go is renamed to framework_test.go, the full test suite compiles and runs correctly from test/fuse_integration/. Changes: - Run go test directly in test/fuse_integration/ (no temp dir copy) - Install weed binary to /usr/local/bin for test framework discovery - Configure /etc/fuse.conf with user_allow_other for FUSE mounts - Install fuse3 for modern FUSE support - Stream test output to log file for artifact upload * mount: fix three P1 races in async flush P1-1: Reopen overwrites data still flushing in background ReleaseByHandle removes the old handle from fhMap before the deferred flush finishes. A reopen of the same inode during that window would build from stale filer metadata, overwriting the async flush. Fix: Track in-flight async flushes per inode via pendingAsyncFlush map. AcquireHandle now calls waitForPendingAsyncFlush(inode) to block until any pending flush completes before reading filer metadata. P1-2: Deferred flush races rename and unlink after close completeAsyncFlush captured the path once at entry, but rename or unlink after close() could cause metadata to be written under the wrong name or recreate a deleted file. Fix: Re-resolve path from inode via GetPath right before metadata flush. GetPath returns the current path (reflecting renames) or ENOENT (if unlinked), in which case we skip the metadata flush. P1-3: SIGINT/SIGTERM bypasses the async-flush drain grace.OnInterrupt runs hooks then calls os.Exit(0), so WaitForAsyncFlush after server.Serve() never executes on signal. Fix: Add WaitForAsyncFlush (with 10s timeout) to the WFS interrupt handler, before cache cleanup. The timeout prevents hanging on Ctrl-C when the filer is unreachable. * mount: fix P1 races — draining handle stays in fhMap P1-1: Reopen TOCTOU The gap between ReleaseByHandle removing from fhMap and submitAsyncFlush registering in pendingAsyncFlush allowed a concurrent AcquireHandle to slip through with stale metadata. Fix: Hold pendingAsyncFlushMu across both the counter decrement (ReleaseByHandle) and the pending registration. The handle is registered as pending before the lock is released, so waitForPendingAsyncFlush always sees it. P1-2: Rename/unlink can't find draining handle ReleaseByHandle deleted from fhMap immediately. Rename's FindFileHandle(inode) at line 251 could not find the handle to update entry.Name. Unlink could not coordinate either. Fix: When asyncFlushPending is true, ReleaseByHandle/ReleaseByInode leave the handle in fhMap (counter=0 but maps intact). The handle stays visible to FindFileHandle so rename can update entry.Name. completeAsyncFlush re-resolves the path from the inode (GetPath) right before metadata flush for correctness after rename/unlink. After drain, RemoveFileHandle cleans up the maps. Double-return prevention: ReleaseByHandle/ReleaseByInode return nil if counter is already <= 0, so Forget after Release doesn't start a second drain goroutine. P1-3: SIGINT deletes swap files under running goroutines After the 10s timeout, os.RemoveAll deleted the write cache dir (containing swap files) while FlushData goroutines were still reading from them. Fix: Increase timeout to 30s. If timeout expires, skip write cache dir removal so in-flight goroutines can finish reading swap files. The OS (or next mount) cleans them up. Read cache is always removed. * mount: never skip metadata flush when Forget drops inode mapping Forget removes the inode→path mapping when the kernel's lookup count reaches zero, but this does NOT mean the file was unlinked — it only means the kernel evicted its cache entry. completeAsyncFlush was treating GetPath failure as "file unlinked" and skipping the metadata flush, which orphaned the just-uploaded chunks for live files. Fix: Save dir and name at doFlush defer time. In completeAsyncFlush, try GetPath first to pick up renames; if the mapping is gone, fall back to the saved dir/name. Always attempt the metadata flush — the filer is the authority on whether the file exists, not the local inode cache. * mount: distinguish Forget from Unlink in async flush path fallback The saved-path fallback (from the previous fix) always flushed metadata when GetPath failed, which recreated files that were explicitly unlinked after close(). The same stale fallback could recreate the pre-rename path if Forget dropped the inode mapping after a rename. Root cause: GetPath failure has two meanings: 1. Forget — kernel evicted the cache entry (file still exists) 2. Unlink — file was explicitly deleted (should not recreate) Fix (three coordinated changes): Unlink (weedfs_file_mkrm.go): Before RemovePath, look up the inode and find any draining handle via FindFileHandle. Set fh.isDeleted = true so the async flush knows the file was explicitly removed. Rename (weedfs_rename.go): When renaming a file with a draining handle, update asyncFlushDir/asyncFlushName to the post-rename location. This keeps the saved-path fallback current so Forget after rename doesn't flush to the old (pre-rename) path. completeAsyncFlush (weedfs_async_flush.go): Check fh.isDeleted first — if true, skip metadata flush (file was unlinked, chunks become orphans for volume.fsck). Otherwise, try GetPath for the current path (renames); fall back to saved path if Forget dropped the mapping (file is live, just evicted from kernel cache). * test/ci: address PR review nitpicks concurrent_operations_test.go: - Restore precise totalSize assertion instead of info.Size() > 0 writeback_cache_test.go: - Check rand.Read errors in all 3 locations (lines 310, 512, 757) - Check os.MkdirAll error in stress test (line 752) - Remove dead verifyErrors variable (line 332) - Replace both time.Sleep(5s) with polling via waitForFileContent to avoid flaky tests under CI load (lines 638, 700) fuse-integration.yml: - Add set -o pipefail so go test failures propagate through tee * ci: fix fuse3/fuse package conflict on ubuntu-22.04 runner fuse3 is pre-installed on ubuntu-22.04 runners and conflicts with the legacy fuse package. Only install libfuse3-dev for the headers. * mount/page_writer: remove debug println statements Remove leftover debug println("read new data1/2") from ReadDataAt in MemChunk and SwapFileChunk. * test: fix findWeedBinary matching source directory instead of binary findWeedBinary() matched ../../weed (the source directory) via os.Stat before checking PATH, then tried to exec a directory which fails with "permission denied" on the CI runner. Fix: Check PATH first (reliable in CI where the binary is installed to /usr/local/bin). For relative paths, verify the candidate is a regular file (!info.IsDir()). Add ../../weed/weed as a candidate for in-tree builds. * test: fix framework — dynamic ports, output capture, data dirs The integration test framework was failing in CI because: 1. All tests used hardcoded ports (19333/18080/18888), so sequential tests could conflict when prior processes hadn't fully released their ports yet. 2. Data subdirectories (data/master, data/volume) were not created before starting processes. 3. Master was started with -peers=none which is not a valid address. 4. Process stdout/stderr was not captured, making failures opaque ("service not ready within timeout" with no diagnostics). 5. The unmount fallback used 'umount' instead of 'fusermount -u'. 6. The mount used -cacheSizeMB (nonexistent) instead of -cacheCapacityMB and was missing -allowOthers=false for unprivileged CI runners. Fixes: - Dynamic port allocation via freePort() (net.Listen ":0") - Explicit gRPC ports via -port.grpc to avoid default port conflicts - Create data/master and data/volume directories in Setup() - Remove invalid -peers=none and -raftBootstrap flags - Capture process output to logDir/.log via startProcess() helper - dumpLog() prints tail of log file on service startup failure - Use fusermount3/fusermount -u for unmount - Fix mount flag names (-cacheCapacityMB, -allowOthers=false) test: remove explicit -port.grpc flags from test framework SeaweedFS convention: gRPC port = HTTP port + 10000. Volume and filer discover the master gRPC port by this convention. Setting explicit -port.grpc on master/volume/filer broke inter-service communication because the volume server computed master gRPC as HTTP+10000 but the actual gRPC was on a different port. Remove all -port.grpc flags and let the default convention work. Dynamic HTTP ports already ensure uniqueness; the derived gRPC ports (HTTP+10000) will also be unique. --------- Co-authored-by: Copilot <copilot@github.com>	5 days ago
Chris Lu	d6a872c4b9	Preserve explicit directory markers with octet-stream MIME (#8726 ) * Preserve octet-stream MIME on explicit directory markers * Run empty directory marker regression in CI * Run S3 Spark workflow for filer changes	6 days ago
Anton	7f0cf72574	admin/plugin: delete job_detail files when jobs are pruned from memory (#8722 ) * admin/plugin: delete job_detail files when jobs are pruned from memory pruneTrackedJobsLocked evicts the oldest terminal jobs from the in-memory tracker when the total exceeds maxTrackedJobsTotal (1000). However the dedicated per-job detail files in jobs/job_details/ were never removed, causing them to accumulate indefinitely on disk. Add ConfigStore.DeleteJobDetail and call it from pruneTrackedJobsLocked so that the file is cleaned up together with the in-memory entry. Deletion errors are logged at verbosity level 2 and do not abort the prune. * admin/plugin: add test for DeleteJobDetail --------- Co-authored-by: Anton Ustyugov <anton@devops> Co-authored-by: Chris Lu <chris.lu@gmail.com>	6 days ago
Anton	90277ceed5	admin/plugin: migrate inline job details asynchronously to avoid slow startup (#8721 ) loadPersistedMonitorState performed a backward-compatibility migration that wrote every job with inline rich detail fields to a dedicated per-job detail file synchronously during startup. On deployments with many historical jobs (e.g. 1000+) stored on distributed block storage (e.g. Longhorn), each individual file write requires an fsync round-trip, making startup disproportionately slow and causing readiness/liveness probe failures. The in-memory state is populated correctly before the goroutine is started because stripTrackedJobDetailFields is still called in-place; only the disk writes are deferred. A completion log message at V(1) is emitted once the background migration finishes. Co-authored-by: Anton Ustyugov <anton@devops>	6 days ago
Anton	ae170f1fbb	admin: fix manual job run to use scheduler dispatch with capacity management and retry (#8720 ) RunPluginJobTypeAPI previously executed proposals with a naive sequential loop calling ExecutePluginJob per proposal. This had two bugs: 1. Double-lock: RunPluginJobTypeAPI held pluginLock while calling ExecutePluginJob, which tried to re-acquire the same lock for every job in the loop. 2. No capacity management: proposals were fired directly at workers without reserveScheduledExecutor, so every job beyond the worker concurrency limit received an immediate at_capacity error with no retry or backoff. Fix: add Plugin.DispatchProposals which reuses dispatchScheduledProposals - the same code path the scheduler loop uses - with executor reservation, configurable concurrency, and per-job retry with backoff. RunPluginJobTypeAPI now calls DispatchPluginProposals (a thin AdminServer wrapper) after holding pluginLock once. Co-authored-by: Anton Ustyugov <anton@devops>	6 days ago
Anton	8e7b15a995	wdclient/exclusive_locks: replace println with glog in ExclusiveLocker (#8723 ) RequestLock used a bare println to report transient lock acquisition failures ('lock: already locked by ...'), which writes directly to stdout instead of going through the structured logging pipeline. This causes log noise at the wrong level and cannot be filtered with -v or redirected like glog output. Changes: - println("lock:", ...) -> glog.V(2).Infof for per-retry acquisition errors (transient, high-frequency during startup when another instance still holds) - Add glog.V(1).Infof when the lock is successfully acquired - Add glog.V(2).Infof for successful renewals (replaces commented-out println) - Errorf -> Warningf for renewal failures (the goroutine exits cleanly, it is not a fatal error; the caller will re-acquire via RequestLock) Co-authored-by: Anton Ustyugov <anton@devops>	6 days ago
Chris Lu	7fbdb9b7b7	feat(shell): add volume.tier.compact command to reclaim cloud storage space (#8715 ) * feat(shell): add volume.tier.compact command to reclaim cloud storage space Adds a new shell command that automates compaction of cloud tier volumes. When files are deleted from remote-tiered volumes, space is not reclaimed on the cloud storage. This command orchestrates: download from remote, compact locally, and re-upload to reclaim deleted space. Closes #8563 * fix: log cleanup errors in compactVolumeOnServer instead of discarding them Helps operators diagnose leftover temp files (.cpd/.cpx) if cleanup fails after a compaction or commit failure. * fix: return aggregate error from loop and use regex for collection filter - Track and return error count when one or more volumes fail to compact, so callers see partial failures instead of always getting nil. - Use compileCollectionPattern for -collection in -volumeId mode too, so regex patterns work consistently with the flag description. Empty pattern (no -collection given) matches all collections.	7 days ago
JARDEL ALVES	1413822424	glog: add JSON structured logging mode (#8708 ) * glog: add JSON structured logging mode Add opt-in JSON output format for glog, enabling integration with log aggregation systems like ELK, Loki, and Datadog. - Add --log_json flag to enable JSON output at startup - Add SetJSONMode()/IsJSONMode() for runtime toggle - Add JSON branches in println, printDepth, printf, printWithFileLine - Use manual JSON construction (no encoding/json) for performance - Add jsonEscapeString() for safe string escaping - Include 8 unit tests and 1 benchmark Enabled via --log_json flag. Default behavior unchanged. * glog: prevent lazy flag init from overriding SetJSONMode If --log_json=true and SetJSONMode(false) was called at runtime, a subsequent IsJSONMode() call would re-enable JSON mode via the sync.Once lazy initialization. Mark jsonFlagOnce as done inside SetJSONMode so the runtime API always takes precedence. * glog: fix RuneError check to not misclassify valid U+FFFD The condition r == utf8.RuneError matches both invalid UTF-8 sequences (size=1) and a valid U+FFFD replacement character (size=3). Without checking size == 1, a valid U+FFFD input would be incorrectly escaped and only advance by 1 byte, corrupting the output. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	1 week ago
JARDEL ALVES	5f2244d25d	glog: add gzip compression for rotated log files (#8709 ) * glog: add gzip compression for rotated log files Add opt-in gzip compression that automatically compresses log files after rotation, reducing disk usage in long-running deployments. - Add --log_compress flag to enable compression at startup - Add SetCompressRotated()/IsCompressRotated() for runtime toggle - Compress rotated files in background goroutine (non-blocking) - Use gzip.BestSpeed for minimal CPU overhead - Fix .gz file cleanup: TrimSuffix approach correctly counts compressed files toward MaxFileCount limit - Include 6 unit tests covering normal, empty, large, and edge cases Enabled via --log_compress flag. Default behavior unchanged. * glog: fix compressFile to check gz/dst close errors and use atomic rename Write to a temp file (.gz.tmp) and rename atomically to prevent exposing partial archives. Check gz.Close() and dst.Close() errors to avoid deleting the original log when flush fails (e.g. ENOSPC). Use defer for robust resource cleanup. * glog: deduplicate .log/.log.gz pairs in rotation cleanup During concurrent compression, both foo.log and foo.log.gz can exist simultaneously. Count them as one entry against MaxFileCount to prevent premature eviction of rotated logs. * glog: use portable temp path in TestCompressFile_NonExistent Replace hardcoded /nonexistent/path with t.TempDir() for portability. --------- Co-authored-by: Copilot <copilot@github.com>	1 week ago
Chris Lu	51ec0d2122	fix(remote_gateway): prevent double-versioning when syncing to versioned central bucket (#8710 ) * fix(remote_gateway): prevent double-versioning when syncing to versioned central bucket When a file is uploaded to a versioned bucket on edge, SeaweedFS stores it internally as {object}.versions/v_{versionId}. The remote_gateway was syncing this internal path directly to the central S3 endpoint. When central's bucket also has versioning enabled, this caused central to apply its own versioning on top, producing corrupt paths like: object.versions/v_{edgeId}.versions/v_{centralId} Fix: rewrite internal .versions/v_{id} paths to the original S3 object key before uploading to the remote. Skip version file delete/update events that are internal bookkeeping. Fixes https://github.com/seaweedfs/seaweedfs/discussions/8481#discussioncomment-16209342 * fix(remote_gateway): propagate delete markers to remote as deletions Delete markers are zero-content version entries (ExtDeleteMarkerKey=true) created by S3 DELETE on a versioned bucket. Previously they were silently dropped by the HasData() filter, so deletions on edge never reached central. Now: detect delete markers before the HasData check, rewrite the .versions path to the original S3 key, and issue client.DeleteFile() on the remote. * fix(remote_gateway): tighten isVersionedPath to avoid false positives Address PR review feedback: - Add isDir parameter to isVersionedPath so it only matches the exact internal shapes: directories whose name ends with .versions (isDir=true), and files with the v_ prefix inside a .versions parent (isDir=false). Previously the function was too broad and could match user-created paths like "my.versions/data.txt". - Update all 4 call sites to pass the entry's IsDirectory field. - Rename TestVersionedDirectoryNotFilteredByHasData to TestVersionsDirectoryFilteredByHasData so the name reflects the actual assertion (directories ARE filtered by HasData). - Expand TestIsVersionedPath with isDir cases and false-positive checks. * fix(remote_gateway): persist sync marker after delete-marker propagation The delete-marker branch was calling client.DeleteFile() and returning without updating the local entry, making event replay re-issue the remote delete. Now call updateLocalEntry after a successful DeleteFile to stamp the delete-marker entry with a RemoteEntry, matching the pattern used by the normal create path. * refactor(remote_gateway): extract syncDeleteMarker and fix root path edge case - Extract syncDeleteMarker() shared helper used by both bucketed and mounted-dir event processors, replacing the duplicated delete + persist local marker logic. - Fix rewriteVersionedSourcePath for root-level objects: when lastSlash is 0 (e.g. "/file.xml.versions"), return "/" as the parent dir instead of an empty string. - The strings.Contains(dir, ".versions/") condition flagged in review was already removed in a prior commit that tightened isVersionedPath. * fix(remote_gateway): skip updateLocalEntry for versioned path rewrites After rewriting a .versions/v_{id} path to the logical S3 key and uploading, the code was calling updateLocalEntry on the original v_* entry, stamping it with a RemoteEntry for the logical key. This is semantically wrong: the logical object has no filer entry in versioned buckets, and the internal v_* entry should not carry a RemoteEntry for a different path. Skip updateLocalEntry when the path was rewritten from a versioned source. Replay safety is preserved because S3 PutObject is idempotent. * fix(remote_gateway): scope versioning checks to /buckets/ namespace isVersionedPath and rewriteVersionedSourcePath could wrongly match paths in non-bucket mounts (e.g. /mnt/remote/file.xml.versions). Add the same /buckets/ prefix guard used by isMultipartUploadDir so the .versions / v_ logic only applies within the bucket namespace.	1 week ago
Chris Lu	6ccda3e809	fix(s3): allow deleting the anonymous user from admin webui (#8706 ) Remove the block that prevented deleting the "anonymous" identity and stop auto-creating it when absent. If no anonymous identity exists (or it is disabled), LookupAnonymous returns not-found and both auth paths return ErrAccessDenied for anonymous requests. To enable anonymous access, explicitly create the "anonymous" user. To revoke it, delete the user like any other identity. Closes #8694	1 week ago
Chris Lu	08b79a30f6	Fix lock table shared wait condition (#8707 ) * Fix lock table shared wait condition (#8696) * Refactor lock table waiter check * Add exclusive lock wait helper test	1 week ago
Chris Lu	80f3079d2a	fix(s3): include directory markers in ListObjects without delimiter (#8704 ) * fix(s3): include directory markers in ListObjects without delimiter (#8698) Directory key objects (zero-byte objects with keys ending in "/") created via PutObject were omitted from ListObjects/ListObjectsV2 results when no delimiter was specified. AWS S3 includes these as regular keys in Contents. The issue was in doListFilerEntries: when recursing into directories in non-delimiter mode, directory key objects were only emitted when prefixEndsOnDelimiter was true. Added an else branch to emit them in the general recursive case as well. * remove issue reference from inline comment * test: add child-under-marker and paginated listing coverage Extend test 6 to place a child object under the directory marker and paginate with MaxKeys=1 so the emit-then-recurse truncation path is exercised. * fix(test): skip directory markers in Spark temporary artifacts check The listing check now correctly shows directory markers (keys ending in "/") after the ListObjects fix. These 0-byte metadata objects are not data artifacts — filter them from the listing check since the HeadObject-based check already verifies their cleanup with a timeout.	1 week ago
Chris Lu	15f4a97029	fix: improve raft leader election reliability and failover speed (#8692 ) * fix: clear raft vote state file on non-resume startup The seaweedfs/raft library v1.1.7 added a persistent `state` file for currentTerm and votedFor. When RaftResumeState=false (the default), the log, conf, and snapshot directories are cleared but this state file was not. On repeated restarts, different masters accumulate divergent terms, causing AppendEntries rejections and preventing leader election. Fixes #8690 * fix: recover TopologyId from snapshot before clearing raft state When RaftResumeState=false clears log/conf/snapshot, the TopologyId (used for license validation) was lost. Now extract it from the latest snapshot before cleanup and restore it on the topology. Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared recoverTopologyIdFromState helper in raft_common.go. * fix: stagger multi-master bootstrap delay by peer index Previously all masters used a fixed 1500ms delay before the bootstrap check. Now the delay is proportional to the peer's sorted index with randomization (matching the hashicorp raft path), giving the designated bootstrap node (peer 0) a head start while later peers wait for gRPC servers to be ready. Also adds diagnostic logging showing why DoJoinCommand was or wasn't called, making leader election issues easier to diagnose from logs. * fix: skip unreachable masters during leader reconnection When a master leader goes down, non-leader masters still redirect clients to the stale leader address. The masterClient would follow these redirects, fail, and retry — wasting round-trips each cycle. Now tryAllMasters tracks which masters failed within a cycle and skips redirects pointing to them, reducing log spam and connection overhead during leader failover. * fix: take snapshot after TopologyId generation for recovery After generating a new TopologyId on the leader, immediately take a raft snapshot so the ID can be recovered from the snapshot on future restarts with RaftResumeState=false. Without this, short-lived clusters would lose the TopologyId on restart since no automatic snapshot had been taken yet. * test: add multi-master raft failover integration tests Integration test framework and 5 test scenarios for 3-node master clusters: - TestLeaderConsistencyAcrossNodes: all nodes agree on leader and TopologyId - TestLeaderDownAndRecoverQuickly: leader stops, new leader elected, old leader rejoins as follower - TestLeaderDownSlowRecover: leader gone for extended period, cluster continues with 2/3 quorum - TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered when both restart - TestAllMastersDownAndRestart: full cluster restart, leader elected, all nodes agree on TopologyId * fix: address PR review comments - peerIndex: return -1 (not 0) when self not found, add warning log - recoverTopologyIdFromSnapshot: defer dir.Close() - tests: check GetTopologyId errors instead of discarding them * fix: address review comments on failover tests - Assert no leader after quorum loss (was only logging) - Verify follower cs.Leader matches expected leader via ServerAddress.ToHttpAddress() comparison - Check GetTopologyId error in TestTwoMastersDownAndRestart	1 week ago
Chris Lu	c197206897	fix(s3): return ETag header for directory marker PutObject requests (#8688 ) * fix(s3): return ETag header for directory marker PutObject requests The PutObject handler has a special path for keys ending with "/" (directory markers) that creates the entry via mkdir. This path never computed or set the ETag response header, unlike the regular PutObject path. AWS S3 always returns an ETag header, even for empty-body puts. Compute the MD5 of the content (empty or otherwise), store it in the entry attributes and extended attributes, and set the ETag response header. Fixes #8682 * fix: handle io.ReadAll error and chunked encoding for directory markers Address review feedback: - Handle error from io.ReadAll instead of silently discarding it - Change condition from ContentLength > 0 to ContentLength != 0 to correctly handle chunked transfer encoding (ContentLength == -1) * fix hanging tests	1 week ago
Jayshan Raghunandan	1f1eac4f08	feat: improve aio support for admin/volume ingress and fix UI links (#8679 ) * feat: improve allInOne mode support for admin/volume ingress and fix master UI links - Add allInOne support to admin ingress template, matching the pattern used by filer and s3 ingress templates (or-based enablement with ternary service name selection) - Add allInOne support to volume ingress template, which previously required volume.enabled even when the volume server runs within the allInOne pod - Expose admin ports in allInOne deployment and service when allInOne.admin.enabled is set - Add allInOne.admin config section to values.yaml (enabled by default, ports inherit from admin.) - Fix legacy master UI templates (master.html, masterNewRaft.html) to prefer PublicUrl over internal Url when linking to volume server UI. The new admin UI already handles this correctly. fix: revert admin allInOne changes and fix PublicUrl in admin dashboard The admin binary (`weed admin`) is a separate process that cannot run inside `weed server` (allInOne mode). Revert the admin-related allInOne helm chart changes that caused 503 errors on admin ingress. Fix bug in cluster_topology.go where VolumeServer.PublicURL was set to node.Id (internal pod address) instead of the actual public URL. Add public_url field to DataNodeInfo proto message so the topology gRPC response carries the public URL set via -volume.publicUrl flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use HTTP /dir/status to populate PublicUrl in admin dashboard The gRPC DataNodeInfo proto does not include PublicUrl, so the admin dashboard showed internal pod IPs instead of the configured public URL. Fetch PublicUrl from the master's /dir/status HTTP endpoint and apply it in both GetClusterTopology and GetClusterVolumeServers code paths. Also reverts the unnecessary proto field additions from the previous commit and cleans up a stray blank line in all-in-one-service.yml. * fix: apply PublicUrl link fix to masterNewRaft.html Match the same conditional logic already applied to master.html — prefer PublicUrl when set and different from Url. * fix: add HTTP timeout and status check to fetchPublicUrlMap Use a 5s-timeout client instead of http.DefaultClient to prevent blocking indefinitely when the master is unresponsive. Also check the HTTP status code before attempting to parse the response body. * fix: fall back to node address when PublicUrl is empty Prevents blank links in the admin dashboard when PublicUrl is not configured, such as in standalone or mixed-version clusters. * fix: log io.ReadAll error in fetchPublicUrlMap --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	1 week ago
JARDEL ALVES	bd3a6b1b33	glog: add --log_rotate_hours flag for time-based log rotation (#8685 ) * glog: add --log_rotate_hours flag for time-based log rotation SeaweedFS previously only rotated log files when they reached MaxSize (1.8 GB). Long-running deployments with low log volume could accumulate log files indefinitely with no way to force rotation on a schedule. This change adds the --log_rotate_hours flag. When set to a non-zero value, the current log file is rotated once it has been open for the specified number of hours, regardless of its size. Implementation details: - New flag --log_rotate_hours (int, default 0 = disabled) in glog_file.go - Added createdAt time.Time field to syncBuffer to track file open time - rotateFile() sets createdAt to the time the new file is opened - Write() checks elapsed time and triggers rotation when the threshold is exceeded, consistent with the existing size-based check This resolves the long-standing request for time-based rotation and helps prevent unbounded log accumulation in /tmp on production systems. Related: #3455, #5763, #8336 * glog: default log_rotate_hours to 168 (7 days) Enable time-based rotation by default so log files don't accumulate indefinitely in long-running deployments. Set to 0 to disable. * glog: simplify rotation logic by combining size and time conditions Merge the two separate rotation checks into a single block to eliminate duplicated rotateFile error handling. * glog: use timeNow() in syncBuffer.Write and add time-based rotation test Use the existing testable timeNow variable instead of time.Now() in syncBuffer.Write so that time-based rotation can be tested with a mocked clock. Add TestTimeBasedRollover that verifies: - no rotation occurs before the interval elapses - rotation triggers after the configured hours --------- Co-authored-by: Copilot <copilot@github.com>	1 week ago
Weihao Jiang	01987bcafd	Make `weed-fuse` compatible with systemd-based mount (#6814 ) * Make weed-fuse compatible with systemd-mount series * fix: add missing type annotation on skipAutofs param in FreeBSD build The parameter was declared without a type, causing a compile error on FreeBSD. * fix: guard hasAutofs nil dereference and make FsName conditional on autofs mode - Check option.hasAutofs for nil before dereferencing to prevent panic when RunMount is called without the flag initialized. - Only set FsName to "fuse" when autofs mode is active; otherwise preserve the descriptive server:path name for mount/df output. - Fix typo: recogize -> recognize. * fix: consistent error handling for autofs option and log ignored _netdev - Replace panic with fmt.Fprintf+return false for autofs parse errors, matching the pattern used by other fuse option parsers. - Log when _netdev option is silently stripped to aid debugging. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	1 week ago
JARDEL ALVES	98f301e30b	glog: add --log_max_size_mb and --log_max_files runtime flags (#8684 )	1 week ago
Chris Lu	81369b8a83	improve: large file sync throughput for remote.cache and filer.sync (#8676 ) * improve large file sync throughput for remote.cache and filer.sync Three main throughput improvements: 1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file instead of always starting at 5MB. A 500MB file now uses ~16MB chunks (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk overhead (volume assign, gRPC call, needle write) by 3x. 2. Configurable concurrency at every layer: - remote.cache chunk concurrency: -chunkConcurrency flag (default 8) - remote.cache S3 download concurrency: -downloadConcurrency flag (default raised from 1 to 5 per chunk) - filer.sync chunk concurrency: -chunkConcurrency flag (default 32) 3. S3 multipart download concurrency raised from 1 to 5: the S3 manager downloader was using Concurrency=1, serializing all part downloads within each chunk. This alone can 5x per-chunk download speed. The concurrency values flow through the gRPC request chain: shell command → CacheRemoteObjectToLocalClusterRequest → FetchAndWriteNeedleRequest → S3 downloader Zero values in the request mean "use server defaults", maintaining full backward compatibility with existing callers. Ref #8481 * fix: use full maxMB for chunk size cap and remove loop guard Address review feedback: - Use full maxMB instead of maxMB/2 for maxChunkSize to avoid unnecessarily limiting chunk size for very large files. - Remove chunkSize < maxChunkSize guard from the safety loop so it can always grow past maxChunkSize when needed to stay under 1000 chunks (e.g., extremely large files with small maxMB). * address review feedback: help text, validation, naming, docs - Fix help text for -chunkConcurrency and -downloadConcurrency flags to say "0 = server default" instead of advertising specific numeric defaults that could drift from the server implementation. - Validate chunkConcurrency and downloadConcurrency are within int32 range before narrowing, returning a user-facing error if out of range. - Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions. - Add doc comment to SetChunkConcurrency noting it must be called during initialization before replication goroutines start. - Replace doubling loop in chunk size safety check with direct ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap. * address Copilot review: clamp concurrency, fix chunk count, clarify proto docs - Use ceiling division for chunk count check to avoid overcounting when file size is an exact multiple of chunk size. - Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024 at filer, max 64 at volume server) to prevent excessive goroutines. - Always use ReadFileWithConcurrency when the client supports it, falling back to the implementation's default when value is 0. - Clarify proto comments that download_concurrency only applies when the remote storage client supports it (currently S3). - Include specific server defaults in help text (e.g., "0 = server default 8") so users see the actual values in -h output. * fix data race on executionErr and use %w for error wrapping - Protect concurrent writes to executionErr in remote.cache worker goroutines with a sync.Mutex to eliminate the data race. - Use %w instead of %v in volume_grpc_remote.go error formatting to preserve the error chain for errors.Is/errors.As callers.	1 week ago
Chris Lu	f4073107cb	fix: clean up orphaned needles on remote.cache partial download failure (#8675 ) When remote.cache downloads a file in parallel chunks and a gRPC connection drops mid-transfer, chunks already written to volume servers were not cleaned up. Since the filer metadata was never updated, these needles became orphaned — invisible to volume.vacuum and never referenced by the filer. On subsequent cache cycles the file was still treated as uncached, creating more orphans each attempt. Call DeleteUncommittedChunks on the download-error path, matching the cleanup already present for the metadata-update-failure path. Fixes #8481	1 week ago
Chris Lu	55e988a7ee	iceberg: add sort-aware compaction rewrite (#8666 ) * iceberg: add sort-aware compaction rewrite * iceberg: share filtered row iteration in compaction * iceberg: rely on table sort order for sort rewrites * iceberg: harden sort compaction planning * iceberg: include rewrite strategy in planning config hash compactionPlanningConfigHash now incorporates RewriteStrategy and SortMaxInputBytes so cached planning results are invalidated when sort strategy settings change. Also use the bytesPerMB constant in compactionNoEligibleMessage.	2 weeks ago
Chris Lu	e5c0889473	iceberg: add delete file rewrite maintenance (#8664 ) * iceberg: add delete file rewrite maintenance * iceberg: preserve untouched delete files during rewrites * iceberg: share detection threshold defaults * iceberg: add partition-scoped maintenance filters (#8665) * iceberg: add partition-scoped maintenance filters * iceberg: tighten where-filter partition matching	2 weeks ago
Chris Lu	a3717cd4b5	fix(admin): show anonymous user in Object Store Users UI (#8671 ) The anonymous identity was explicitly filtered out of the user listing, making it invisible in the admin console. Users could not view or edit its permissions. Attempting to recreate it failed with "already exists". Remove the anonymous skip in GetObjectStoreUsers so it appears like any other identity. Add a guard in DeleteObjectStoreUser to prevent deletion of the anonymous system identity, which would break unauthenticated S3 access. Fixes #8466 Co-authored-by: Copilot <copilot@github.com>	2 weeks ago
Chris Lu	6e45fc0055	iceberg: cache detection planning results (#8667 ) * iceberg: cache detection planning results * iceberg: tighten planning index cache handling * iceberg: remove dead planning-index metadata fallback * iceberg: preserve partial planning index caches * iceberg: scope planning index caching per op * iceberg: rename copy vars to avoid shadowing builtin	2 weeks ago
Chris Lu	f71cef2dc8	iceberg: add resource-group proposal controls (#8668 ) * iceberg: add resource-group proposal controls * iceberg: tighten resource group config validation	2 weeks ago
Chris Lu	7c83460b10	adjust template path	2 weeks ago
Chris Lu	e8914ac879	feat(admin): add -urlPrefix flag for subdirectory deployment (#8670 ) Allow the admin server to run behind a reverse proxy under a subdirectory by adding a -urlPrefix flag (e.g. -urlPrefix=/seaweedfs). Closes #8646	2 weeks ago
Chris Lu	9984ce7dcb	fix(s3): omit NotResource:null from bucket policy JSON response (#8658 ) * fix(s3): omit NotResource:null from bucket policy JSON response (#8657) Change NotResource from value type to pointer (StringOrStringSlice) so that omitempty properly omits it when unset, matching the existing Principal field pattern. This prevents IaC tools (Terraform, Ansible) from detecting false configuration drift. Add bucket policy round-trip idempotency integration tests. simplify JSON comparison in bucket policy idempotency test Use require.JSONEq directly on the raw JSON strings instead of round-tripping through unmarshal/marshal, since JSONEq already handles normalization internally. * fix bucket policy test cases that locked out the admin user The Deny+NotResource test cases used Action:"s3:" which denied the admin's own GetBucketPolicy call. Scope deny to s3:GetObject only, and add an Allow+NotResource variant instead. fix(s3): also make Resource a pointer to fix empty string in JSON Apply the same omitempty pointer fix to the Resource field, which was emitting "Resource":"" when only NotResource was set. Add NewStringOrStringSlicePtr helper, make Strings() nil-safe, and handle StringOrStringSlice in normalizeToStringSliceWithError. improve bucket policy integration tests per review feedback - Replace time.Sleep with waitForClusterReady using ListBuckets - Use structural hasKey check instead of brittle substring NotContains - Assert specific NoSuchBucketPolicy error code after delete - Handle single-statement policies in hasKey helper	2 weeks ago
Chris Lu	acea36a181	filer: add conditional update preconditions (#8647 ) * filer: add conditional update preconditions * iceberg: tighten metadata CAS preconditions	2 weeks ago
Chris Lu	6b2b442450	iceberg: detect maintenance work per operation (#8639 ) * iceberg: detect maintenance work per operation * iceberg: ignore delete manifests during detection * iceberg: clean up detection maintenance planning * iceberg: tighten detection manifest heuristics * Potential fix for code scanning alert no. 330: Incorrect conversion between integer types Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * iceberg: tolerate per-operation detection errors * iceberg: fix fake metadata location versioning * iceberg: check snapshot expiry before manifest loads * iceberg: make expire-snapshots switch case explicit --------- Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>	2 weeks ago
Chris Lu	a00eddb525	Iceberg table maintenance Phase 3: multi-spec compaction, delete handling, and metrics (#8643 ) * Add multi-partition-spec compaction and delete-aware compaction (Phase 3) Multi-partition-spec compaction: - Add SpecID to compactionBin struct and group by spec+partition key - Remove the len(specIDs) > 1 skip that blocked spec-evolved tables - Write per-spec manifests in compaction commit using specByID map - Use per-bin PartitionSpec when calling NewDataFileBuilder Delete-aware compaction: - Add ApplyDeletes config (default: true) with readBoolConfig helper - Implement position delete collection (file_path + pos Parquet columns) - Implement equality delete collection (field ID to column mapping) - Update mergeParquetFiles to filter rows via position deletes (binary search) and equality deletes (hash set lookup) - Smart delete manifest carry-forward: drop when all data files compacted - Fix EXISTING/DELETED entries to include sequence numbers Tests for multi-spec bins, delete collection, merge filtering, and end-to-end compaction with position/equality/mixed deletes. * Add structured metrics and per-bin progress to iceberg maintenance - Change return type of all four operations from (string, error) to (string, map[string]int64, error) with structured metric counts (files_merged, snapshots_expired, orphans_removed, duration_ms, etc.) - Add onProgress callback to compactDataFiles for per-bin progress - In Execute, pass progress callback that sends JobProgressUpdate with per-bin stage messages - Accumulate per-operation metrics with dot-prefixed keys (e.g. compact.files_merged) into OutputValues on completion - Update testing_api.go wrappers and integration test call sites - Add tests: TestCompactDataFilesMetrics, TestExpireSnapshotsMetrics, TestExecuteCompletionOutputValues * Address review feedback: group equality deletes by field IDs, use metric constants - Group equality deletes by distinct equality_ids sets so different delete files with different equality columns are handled correctly - Use length-prefixed type-aware encoding in buildEqualityKey to avoid ambiguity between types and collisions from null bytes - Extract metric key strings into package-level constants * Fix buildEqualityKey to use length-prefixed type-aware encoding The previous implementation used plain String() concatenation with null byte separators, which caused type ambiguity (int 123 vs string "123") and separator collisions when values contain null bytes. Now each value is serialized as "kind:length:value" for unambiguous composite keys. This fix was missed in the prior cherry-pick due to a merge conflict. * Address nitpick review comments - Document patchManifestContentToDeletes workaround: explain that iceberg-go WriteManifest cannot create delete manifests, and note the fail-fast validation on pattern match - Document makeTestEntries: note that specID field is ignored and callers should use makeTestEntriesWithSpec for multi-spec testing * fmt * Fix path normalization, manifest threshold, and artifact filename collisions - Normalize file paths in position delete collection and lookup so that absolute S3 URLs and relative paths match correctly - Fix rewriteManifests threshold check to count only data manifests (was including delete manifests in the count and metric) - Add random suffix to artifact filenames in compactDataFiles and rewriteManifests to prevent collisions between concurrent runs - Sort compaction bins by SpecID then PartitionKey for deterministic ordering across specs * Fix pos delete read, deduplicate column resolution, minor cleanups - Remove broken Column() guard in position delete reading that silently defaulted pos to 0; unconditionally extract Int64() instead - Deduplicate column resolution in readEqualityDeleteFile by calling resolveEqualityColIndices instead of inlining the same logic - Add warning log in readBoolConfig for unrecognized string values - Fix CompactDataFiles call site in integration test to capture 3 return values * Advance progress on all bins, deterministic manifest order, assert metrics - Call onProgress for every bin iteration including skipped/failed bins so progress reporting never appears stalled - Sort spec IDs before iterating specEntriesMap to produce deterministic manifest list ordering across runs - Assert expected metric keys in CompactDataFiles integration test --------- Co-authored-by: Copilot <copilot@github.com>	2 weeks ago
Chris Lu	e24630251c	iceberg: handle filer-backed compaction inputs (#8638 ) * iceberg: handle filer-backed compaction inputs * iceberg: preserve upsert creation times * iceberg: align compaction test schema * iceberg: tighten compact output assertion * iceberg: document compact output match * iceberg: clear stale chunks in upsert helper * iceberg: strengthen compaction integration coverage	2 weeks ago
Chris Lu	0afc675a55	iceberg: validate filer failover targets (#8637 ) * iceberg: validate filer failover targets * iceberg: tighten filer liveness checks * iceberg: relax filer test readiness deadline	2 weeks ago
Chris Lu	b868980260	fix(remote): don't send empty StorageClass in S3 uploads (#8645 ) When S3StorageClass is empty (the default), aws.String("") was passed as the StorageClass in PutObject requests. While AWS S3 treats this as "use default," S3-compatible providers (e.g. SharkTech) reject it with InvalidStorageClass. Only set StorageClass when a non-empty value is configured, letting the provider use its default. Fixes #8644	2 weeks ago
Chris Lu	5f19e3259f	iceberg: keep split bins within target size (#8640 )	2 weeks ago
Chris Lu	d9d6707401	Change iceberg compaction target file size config from bytes to MB (#8636 ) Change iceberg target_file_size config from bytes to MB Rename the config field from target_file_size_bytes to target_file_size_mb with a default of 256 (MB). The value is converted to bytes internally. This makes the config more user-friendly — entering 256 is clearer than 268435456. Co-authored-by: Copilot <copilot@github.com>	2 weeks ago
Chris Lu	8cde3d4486	Add data file compaction to iceberg maintenance (Phase 2) (#8503 ) * Add iceberg_maintenance plugin worker handler (Phase 1) Implement automated Iceberg table maintenance as a new plugin worker job type. The handler scans S3 table buckets for tables needing maintenance and executes operations in the correct Iceberg order: expire snapshots, remove orphan files, and rewrite manifests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add data file compaction to iceberg maintenance handler (Phase 2) Implement bin-packing compaction for small Parquet data files: - Enumerate data files from manifests, group by partition - Merge small files using parquet-go (read rows, write merged output) - Create new manifest with ADDED/DELETED/EXISTING entries - Commit new snapshot with compaction metadata Add 'compact' operation to maintenance order (runs before expire_snapshots), configurable via target_file_size_bytes and min_input_files thresholds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix memory exhaustion in mergeParquetFiles by processing files sequentially Previously all source Parquet files were loaded into memory simultaneously, risking OOM when a compaction bin contained many small files. Now each file is loaded, its rows are streamed into the output writer, and its data is released before the next file is loaded — keeping peak memory proportional to one input file plus the output buffer. * Validate bucket/namespace/table names against path traversal Reject names containing '..', '/', or '\' in Execute to prevent directory traversal via crafted job parameters. * Add filer address failover in iceberg maintenance handler Try each filer address from cluster context in order instead of only using the first one. This improves resilience when the primary filer is temporarily unreachable. * Add separate MinManifestsToRewrite config for manifest rewrite threshold The rewrite_manifests operation was reusing MinInputFiles (meant for compaction bin file counts) as its manifest count threshold. Add a dedicated MinManifestsToRewrite field with its own config UI section and default value (5) so the two thresholds can be tuned independently. * Fix risky mtime fallback in orphan removal that could delete new files When entry.Attributes is nil, mtime defaulted to Unix epoch (1970), which would always be older than the safety threshold, causing the file to be treated as eligible for deletion. Skip entries with nil Attributes instead, matching the safer logic in operations.go. * Fix undefined function references in iceberg_maintenance_handler.go Use the exported function names (ShouldSkipDetectionByInterval, BuildDetectorActivity, BuildExecutorActivity) matching their definitions in vacuum_handler.go. * Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage The IcebergMaintenanceHandler and its compaction code in the parent pluginworker package duplicated the logic already present in the iceberg/ subpackage (which self-registers via init()). The old code lacked stale-plan guards, proper path normalization, CAS-based xattr updates, and error-returning parseOperations. Since the registry pattern (default "all") makes the old handler unreachable, remove it entirely. All functionality is provided by iceberg.Handler with the reviewed improvements. * Fix MinManifestsToRewrite clamping to match UI minimum of 2 The clamp reset values below 2 to the default of 5, contradicting the UI's advertised MinValue of 2. Clamp to 2 instead. * Sort entries by size descending in splitOversizedBin for better packing Entries were processed in insertion order which is non-deterministic from map iteration. Sorting largest-first before the splitting loop improves bin packing efficiency by filling bins more evenly. * Add context cancellation check to drainReader loop The row-streaming loop in drainReader did not check ctx between iterations, making long compaction merges uncancellable. Check ctx.Done() at the top of each iteration. * Fix splitOversizedBin to always respect targetSize limit The minFiles check in the split condition allowed bins to grow past targetSize when they had fewer than minFiles entries, defeating the OOM protection. Now bins always split at targetSize, and a trailing runt with fewer than minFiles entries is merged into the previous bin. * Add integration tests for iceberg table maintenance plugin worker Tests start a real weed mini cluster, create S3 buckets and Iceberg table metadata via filer gRPC, then exercise the iceberg.Handler operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against the live filer. A full maintenance cycle test runs all operations in sequence and verifies metadata consistency. Also adds exported method wrappers (testing_api.go) so the integration test package can call the unexported handler methods. * Fix splitOversizedBin dropping files and add source path to drainReader errors The runt-merge step could leave leading bins with fewer than minFiles entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop the first 80-byte file). Replace the filter-based approach with an iterative merge that folds any sub-minFiles bin into its smallest neighbor, preserving all eligible files. Also add the source file path to drainReader error messages so callers can identify which Parquet file caused a read/write failure. * Harden integration test error handling - s3put: fail immediately on HTTP 4xx/5xx instead of logging and continuing - lookupEntry: distinguish NotFound (return nil) from unexpected RPC errors (fail the test) - writeOrphan and orphan creation in FullMaintenanceCycle: check CreateEntryResponse.Error in addition to the RPC error * go fmt --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2 weeks ago
Chris Lu	47799a5b4f	fix tests	2 weeks ago

1 2 3 4 5 ...

8504 Commits (7f3f61ea283f1d0c665ac806abfe883ae35f14d4)