* mount: defer file creation gRPC to flush time for faster small file writes
When creating a file via FUSE Create(), skip the synchronous gRPC
CreateEntry call to the filer. Instead, allocate the inode and build
the entry locally, deferring the filer create to the Flush/Release path
where flushMetadataToFiler already sends a CreateEntry with chunk data.
This eliminates one synchronous gRPC round-trip per file during creation.
For workloads with many small files (e.g. 30K files), this reduces the
per-file overhead from ~2 gRPC calls to ~1.
Mknod retains synchronous filer creation since it has no file handle
and thus no flush path.
* mount: use bounded worker pool for async flush operations
Replace unbounded goroutine spawning in writebackCache async flush
with a fixed-size worker pool backed by a channel. When many files
are closed rapidly (e.g., cp -r of 30K files), the previous approach
spawned one goroutine per file, leading to resource contention on
gRPC/HTTP connections and high goroutine overhead.
The worker pool size matches ConcurrentWriters (default 128), which
provides good parallelism while bounding resource usage. Work items
are queued into a buffered channel and processed by persistent worker
goroutines.
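A minimal sketch of the bounded-pool pattern described above, assuming a hypothetical `flushPool` type; the names, queue size, and worker count below are illustrative and are not taken from the mount code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// flushPool is a hypothetical fixed-size worker pool backed by a buffered
// channel. It is not the type used in the mount package.
type flushPool struct {
	work chan func()
	wg   sync.WaitGroup
}

// newFlushPool starts `workers` persistent goroutines that drain the queue.
func newFlushPool(workers, queueSize int) *flushPool {
	p := &flushPool{work: make(chan func(), queueSize)}
	p.wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer p.wg.Done()
			for job := range p.work {
				job()
			}
		}()
	}
	return p
}

// Submit queues one work item; it blocks only when the queue is full.
func (p *flushPool) Submit(job func()) { p.work <- job }

// Close stops accepting work and waits for queued items to finish.
func (p *flushPool) Close() {
	close(p.work)
	p.wg.Wait()
}

// runFlushes pushes n no-op "flushes" through the pool and counts how many ran.
func runFlushes(workers, n int) int64 {
	var done int64
	p := newFlushPool(workers, 16)
	for i := 0; i < n; i++ {
		p.Submit(func() { atomic.AddInt64(&done, 1) })
	}
	p.Close()
	return done
}

func main() {
	fmt.Println(runFlushes(4, 1000))
}
```

However many items are submitted, the number of goroutines stays at the worker count, which is the property the commit relies on.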
* mount: fix deferred create cache visibility and async flush race
Three fixes for the deferred create and async flush changes:
1. Insert a local placeholder entry into the metadata cache during
deferred file creation so that maybeLoadEntry() can find the file
for duplicate-create checks, stat, and readdir. Uses InsertEntry
directly (not applyLocalMetadataEvent) to avoid triggering the
directory hot-threshold eviction that would wipe the entry.
2. Fix race in ReleaseHandle where asyncFlushWg.Add(1) and the
channel send happened after pendingAsyncFlushMu was unlocked.
A concurrent WaitForAsyncFlush could observe a zero counter,
close the channel, and cause a send-on-closed panic. Move Add(1)
before the unlock; keep the send after unlock to avoid deadlock
with workers that acquire the same mutex during cleanup.
3. Update TestCreateCreatesAndOpensFile to flush the file handle
before verifying the CreateEntry gRPC call, since file creation
is now deferred to flush time.
* S3: reject part uploads after AbortMultipartUpload
PutObjectPartHandler did not verify that the multipart upload session
still exists before accepting parts. After AbortMultipartUpload deleted
the upload directory, the ErrNotFound from getEntry was silently ignored
(treated as "may be non-SSE upload"), allowing parts to be stored as
orphaned files.
Now return ErrNoSuchUpload when the upload directory is not found,
matching AWS S3 behavior.
Fixes #8766
* S3: check upload existence unconditionally in PutObjectPartHandler
Move the getEntry call out of the SSE-type conditional so the upload
existence check runs for all part uploads, including SSE-C. Previously
the SSE-C path skipped the check entirely, allowing parts to be uploaded
after abort when SSE-C headers were present.
Also flattens the nested SSE branching by one level now that getEntry
is called once upfront.
* S3: address PR review feedback for PutObjectPartHandler
- Log at error level when getEntry fails with an unexpected error,
since we return ErrInternalError to the client
- Distinguish base IV decode errors from length validation failures
with separate, clearer error messages
---------
Co-authored-by: Copilot <copilot@github.com>
* filer: add FilerError enum and error_code field to CreateEntryResponse
Add a machine-readable error code alongside the existing string error
field. This follows the precedent set by PublishMessageResponse in the
MQ broker proto. The string field is kept for human readability and
backward compatibility.
Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE,
EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS.
* filer: add sentinel errors and error code mapping in filer_pb
Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in
the filer_pb package so both the filer and consumers can reference them
without circular imports.
Add FilerErrorToSentinel() to map proto error codes to sentinels, and
update CreateEntryWithResponse() to check error_code first, falling back
to the string-based path for backward compatibility with old servers.
* filer: return wrapped sentinel errors and set proto error codes
Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and
ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using
%w). This preserves errors.Is() traversal on the server side.
In the gRPC CreateEntry handler, map sentinel errors to the
corresponding FilerError proto codes using errors.Is(), setting both
resp.Error (string, for backward compat) and resp.ErrorCode (enum).
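The sentinel/error-code round trip from the last few commits can be sketched as follows. This is an illustration under assumptions: the enum type, its numeric values, and the helper names are placeholders rather than the generated proto code or the real `filer_pb` helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// filerErrorCode stands in for the generated proto enum; values are placeholders.
type filerErrorCode int

const (
	codeOK filerErrorCode = iota
	codeEntryNameTooLong
	codeEntryAlreadyExists
)

// Sentinels are matched by identity through errors.Is, not by message text.
var (
	errEntryNameTooLong   = errors.New("entry name too long")
	errEntryAlreadyExists = errors.New("entry already exists")
)

// wrapForPath mimics the server side: add context with %w so that
// errors.Is can still find the sentinel.
func wrapForPath(path string, sentinel error) error {
	return fmt.Errorf("%s: %w", path, sentinel)
}

// sentinelToCode is the server-to-wire direction. It returns codeOK when no
// known sentinel matches, meaning "no structured code available".
func sentinelToCode(err error) filerErrorCode {
	switch {
	case errors.Is(err, errEntryNameTooLong):
		return codeEntryNameTooLong
	case errors.Is(err, errEntryAlreadyExists):
		return codeEntryAlreadyExists
	}
	return codeOK
}

// codeToSentinel is the wire-to-client direction. Unknown codes yield nil so
// the caller can decide how to report them.
func codeToSentinel(code filerErrorCode) error {
	switch code {
	case codeEntryNameTooLong:
		return errEntryNameTooLong
	case codeEntryAlreadyExists:
		return errEntryAlreadyExists
	}
	return nil
}

func main() {
	serverErr := wrapForPath("/buckets/b/key", errEntryAlreadyExists)
	code := sentinelToCode(serverErr)
	clientErr := codeToSentinel(code)
	fmt.Println(code, errors.Is(clientErr, errEntryAlreadyExists))
}
```

Because the client rebuilds the same sentinel value from the code, callers on either side of the gRPC boundary can use the same `errors.Is` check.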
* S3: use errors.Is() with filer sentinels instead of string matching
Replace fragile string-based error matching in filerErrorToS3Error and
other S3 API consumers with errors.Is() checks against filer_pb sentinel
errors. This works because the updated CreateEntryWithResponse helper
reconstructs sentinel errors from the proto FilerError code.
Update iceberg stage_create and metadata_files to check resp.ErrorCode
instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is()
for the already-exists check.
String matching is retained only for non-filer errors (gRPC transport
errors, checksum validation) that don't go through CreateEntryResponse.
* filer: remove backward-compat string fallbacks for error codes
Clients and servers are always deployed together, so there is no need
for backward-compatibility fallback paths that parse resp.Error strings
when resp.ErrorCode is unset. Simplify all consumers to rely solely on
the structured error code.
* iceberg: ensure unknown non-OK error codes are not silently ignored
When FilerErrorToSentinel returns nil for an unrecognized error code,
return an error including the code and message rather than falling
through to return nil.
* filer: fix redundant error message and restore error wrapping in helper
Use request path instead of resp.Error in the sentinel error format
string to avoid duplicating the sentinel message (e.g. "entry already
exists: entry already exists"). Restore %w wrapping with errors.New()
in the fallback paths so callers can use errors.Is()/errors.As().
* filer: promote file to directory on path conflict instead of erroring
S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object)
to coexist because S3 has a flat key space. When ensureParentDirectoryEntry
finds a parent path that is a file instead of a directory, promote it to
a directory by setting ModeDir while preserving the original content and
chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry
type-change guard.
This fixes the S3 compatibility test failures where creating overlapping
keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.
AWS S3 policy conditions reference request headers with the s3: namespace
prefix (e.g., s3:x-amz-server-side-encryption). The extraction code was
storing these headers without the prefix, so bucket policy conditions
using the standard AWS key names would never match.
* S3: add KeyTooLongError error code
Add ErrKeyTooLongError (HTTP 400, code "KeyTooLongError") to match the
standard AWS S3 error for object keys that exceed length limits.
* S3: fix silent PutObject failure when entry name exceeds max_file_name_length
putToFiler called client.CreateEntry() directly and discarded the gRPC
response. The filer embeds application errors like "entry name too long"
in resp.Error (not as gRPC transport errors), so the error was silently
swallowed and clients received HTTP 200 with an ETag for objects that
were never stored.
Switch to the filer_pb.CreateEntry() helper which properly checks
resp.Error, and map "entry name too long" to KeyTooLongError (HTTP 400).
To avoid fragile string parsing across the gRPC boundary, define shared
error message constants in weed/util/constants and use them in both the
filer (producing errors) and S3 API (matching errors). Switch
filerErrorToS3Error to use strings.Contains/HasSuffix with these
constants so matches work regardless of any wrapper prefix. Apply
filerErrorToS3Error to the mkdir path for directory markers.
Fixes #8759
* S3: enforce 1024-byte maximum object key length
AWS S3 limits object keys to 1024 bytes. Add early validation on write
paths (PutObject, CopyObject, CreateMultipartUpload) to reject keys
exceeding the limit with the standard KeyTooLongError (HTTP 400).
The key length check runs before bucket auto-creation to prevent
overlong keys from triggering unnecessary side effects.
Also use filerErrorToS3Error for CopyObject's mkFile error paths so
name-too-long errors from the filer return KeyTooLongError instead of
InternalError.
Ref #8758
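A small sketch of the early key-length check, assuming hypothetical names (`validateKeyLength`, `errKeyTooLong`); the real handlers return an S3 error code rather than a Go error. The point it illustrates is that the limit is measured in bytes, so multi-byte UTF-8 keys hit it with fewer characters.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// maxObjectKeyBytes is the AWS S3 object key limit, measured in bytes.
const maxObjectKeyBytes = 1024

// errKeyTooLong is a placeholder for the KeyTooLongError S3 error code.
var errKeyTooLong = errors.New("KeyTooLongError")

// validateKeyLength rejects keys longer than the limit. len() on a Go string
// counts bytes, not runes, which matches the byte-based limit.
func validateKeyLength(key string) error {
	if len(key) > maxObjectKeyBytes {
		return errKeyTooLong
	}
	return nil
}

// keyOf builds a test key by repeating unit n times.
func keyOf(unit string, n int) string { return strings.Repeat(unit, n) }

func main() {
	fmt.Println(validateKeyLength(keyOf("a", 1024)))
	fmt.Println(validateKeyLength(keyOf("a", 1025)))
	// 513 two-byte runes are 1026 bytes, so this key is also rejected.
	fmt.Println(validateKeyLength(keyOf("é", 513)))
}
```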
* S3: add handler-level tests for key length validation and error mapping
Add tests for filerErrorToS3Error mapping "entry name too long" to
KeyTooLongError, including a regression test for the CreateEntry-prefixed
"existing ... is a directory" form. Add handler-level integration tests
that exercise PutObjectHandler, CopyObjectHandler, and
NewMultipartUploadHandler via httptest, verifying HTTP 400 and
KeyTooLongError XML response for overlong keys and acceptance of keys at
the 1024-byte limit.
* fix: resolve Kafka gateway response deadlocks causing Sarama client hangs
Fix three bugs in the Kafka protocol handler that caused sequential
clients (notably Sarama) to hang during E2E tests:
1. Race condition in correlation queue ordering: the correlation ID was
added to the response ordering queue AFTER sending the request to
the processing channel. A fast processor (e.g. ApiVersions) could
finish and send its response before the ID was in the queue, causing
the response writer to miss it — permanently deadlocking the
connection. Now the ID is added BEFORE the channel send, with error
response injection on send failure.
2. Silent error response drops: when processRequestSync returned an
error, the response writer logged it but never sent anything back to
the client. The client would block forever waiting for bytes that
never arrived. Now sends a Kafka UNKNOWN_SERVER_ERROR response.
3. Produce V0/V1 missing timeout_ms parsing: the handler skipped the
4-byte timeout field, reading it as topicsCount instead. This caused
incorrect parsing of the entire produce request for V0/V1 clients.
* fix: API-versioned error responses, unsupported-version queue fix, V0/V1 header alignment
1. errors.go — BuildAPIErrorResponse: emits a minimal-but-valid error
body whose layout matches the schema the client expects for each API
key and version (throttle_time position, array fields, etc.). The
old 2-byte generic body corrupted the protocol stream for APIs whose
response begins with throttle_time_ms or an array.
2. handler.go — unsupported-version path: the correlationID was never
added to correlationQueue before sending to responseChan, so the
response writer could never match it and the client hung. Now
appends the ID under correlationQueueMu before the send.
3. produce.go — handleProduceV0V1: requestBody is already post-header
(HandleConn strips client_id). The handler was erroneously parsing
acks bytes as a client_id length, misaligning all subsequent field
reads. Removed the client_id parsing; offset now starts at 0 with
acks(2) + timeout_ms(4) + topicsCount(4), matching handleProduceV2Plus.
* fix: free pooled message buffer per-iteration instead of deferring
The read loop allocated messageBuf via mem.Allocate and deferred
mem.Free. Since the defer only runs when HandleConn returns, pool
buffers accumulated for the entire connection lifetime — one per
request. Worse, the deferred frees ran in LIFO order before
wg.Wait(), so processing goroutines could read from already-freed
pool buffers.
Now: read into a pooled buffer, immediately copy to Go-managed
memory, and return the pool buffer. messageBuf is a regular slice
safe for async goroutine access with no defer accumulation.
* fix: cancel context before wg.Wait and on worker response-send timeout
Two related issues:
1. Cleanup defer ordering deadlock: defers run LIFO — the cleanup defer
(close channels, wg.Wait) ran before the cancel() defer. The
response writer is in the WaitGroup and exits only on ctx.Done() or
responseChan close, but both signals came after wg.Wait(). Deadlock
on every normal connection close (EOF, read error, queue-full).
Fix: call cancel() at the start of the cleanup defer, before
wg.Wait().
2. Worker 5s response-send timeout: when the timeout fired, the
response was silently dropped but the correlationID remained in the
ordered queue. The response writer could never advance past it,
stalling all subsequent responses permanently.
Fix: call cancel() to tear down the connection — if we cannot
deliver a response in 5s the connection is irrecoverable.
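The defer-ordering fix in item 1 can be illustrated with a stripped-down sketch. `handleConn` here is a hypothetical stand-in with a single goroutine playing the response writer; it is not the real handler.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// handleConn models the cleanup path: a writer goroutine exits only when the
// context is cancelled, and cleanup waits for it.
func handleConn() (drained bool) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup

	wg.Add(1)
	go func() { // stands in for the response writer
		defer wg.Done()
		<-ctx.Done()
	}()

	// Single cleanup defer: cancel first, then wait. If cancel() were its own
	// defer registered earlier, LIFO order would run wg.Wait() before it and
	// the function would never return.
	defer func() {
		cancel()
		wg.Wait()
		drained = true
	}()

	return false // overwritten by the deferred cleanup once the writer exits
}

func main() {
	fmt.Println(handleConn())
}
```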
* chore: remove empty no-op ListOffsets conditional
The `if apiKey == 2 {}` block had no body — leftover debug code.
ListOffsets routing is handled by isDataPlaneAPI (returns false,
sending it to the control channel). No behavior change.
* mount: implement create for rsync temp files
* mount: move access implementation out of unsupported
* mount: tighten access checks
* mount: log access group lookup failures
* mount: reset dirty pages on truncate
* mount: tighten create and root access handling
* mount: handle existing creates before quota checks
* mount: restrict access fallback when group lookup fails
When lookupSupplementaryGroupIDs returns an error, the previous code
fell through to checking only the "other" permission bits, which could
overgrant access. Require both group and other permission classes to
satisfy the mask so access is never broader than intended.
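A sketch of the stricter fallback, assuming standard Unix mode bits and a hypothetical helper name; the real check lives in the mount access code and also handles owner and root cases that are omitted here.

```go
package main

import "fmt"

// fallbackAllowed is a hypothetical helper for the case where supplementary
// group lookup failed: access is granted only if BOTH the group bits and the
// other bits of mode satisfy every bit in mask (4 = read, 2 = write, 1 = exec).
func fallbackAllowed(mode uint32, mask uint32) bool {
	group := (mode >> 3) & 7
	other := mode & 7
	return group&mask == mask && other&mask == mask
}

// otherOnlyAllowed models the previous behavior, which looked at the
// "other" bits alone.
func otherOnlyAllowed(mode uint32, mask uint32) bool {
	return mode&7&mask == mask
}

func main() {
	// Mode 0604: group has no access, other can read.
	fmt.Println(otherOnlyAllowed(0o604, 4), fallbackAllowed(0o604, 4))
}
```

With mode 0604 the old check would grant read access to a caller who might actually belong to the file's group, which the mode explicitly denies; the stricter check refuses.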
* mount: guard against nil entry in Create existing-file path
maybeLoadEntry can return OK with a nil entry or nil Attributes in
edge cases. Check before dereferencing to prevent a panic.
* mount: reopen existing file on create race without O_EXCL
When createRegularFile returns EEXIST because another process won the
race, and O_EXCL is not set, reload the winner's entry and open it
instead of propagating the error to the caller.
* mount: check parent directory permission in createRegularFile
Verify the caller has write+search (W_OK|X_OK) permission on the
parent directory before creating a file. This applies to both
Create and Mknod. Update test fixture mount mode to 0o777 so the
existing tests pass with the new check.
* mount: enforce file permission bits in AcquireHandle
Map the open flags (O_RDONLY/O_WRONLY/O_RDWR) to an access mask and
call hasAccess before handing out a file handle. This makes
AcquireHandle the single source of truth for mode-based access
control across Open, Create-existing, and Create-new paths.
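The flag-to-mask mapping might look like the sketch below. The constants are the Linux values of the open access modes, written out so the example has no platform dependencies; the function name is hypothetical and the subsequent `hasAccess` call is not shown.

```go
package main

import "fmt"

// Linux values of the open(2) access-mode flags and access(2) mask bits.
const (
	oRDONLY  = 0x0
	oWRONLY  = 0x1
	oRDWR    = 0x2
	oACCMODE = 0x3

	rOK = 4
	wOK = 2
)

// accessMaskForOpen is a hypothetical helper that converts open flags into
// the permission mask to check. The unused access-mode value 3 is treated
// conservatively as read+write in this sketch.
func accessMaskForOpen(flags uint32) uint32 {
	switch flags & oACCMODE {
	case oRDONLY:
		return rOK
	case oWRONLY:
		return wOK
	default: // oRDWR and the reserved value 3
		return rOK | wOK
	}
}

func main() {
	const oAPPEND = 0x400 // Linux O_APPEND, to show that other flags are ignored
	fmt.Println(accessMaskForOpen(oRDONLY), accessMaskForOpen(oWRONLY|oAPPEND), accessMaskForOpen(oRDWR))
}
```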
---------
Co-authored-by: Copilot <copilot@github.com>
* Add FUSE integration tests for POSIX file locking
Test flock() and fcntl() advisory locks over the FUSE mount:
- Exclusive and shared flock with conflict detection
- flock upgrade (shared to exclusive) and release on close
- fcntl F_SETLK write lock conflicts and shared read locks
- fcntl F_GETLK conflict reporting on overlapping byte ranges
- Non-overlapping byte-range locks held independently
- F_SETLKW blocking until conflicting lock is released
- Lock release on file descriptor close
- Concurrent lock contention with multiple workers
* Fix review feedback in POSIX lock integration tests
- Assert specific EAGAIN error on fcntl lock conflicts instead of generic Error
- Use O_APPEND in concurrent contention test so workers append rather than overwrite
- Verify exact line count (numWorkers * writesPerWorker) after concurrent test
- Check unlock error in F_SETLKW blocking test goroutine
* Refactor fcntl tests to use subprocesses for inter-process semantics
POSIX fcntl locks use the process's files_struct as lock owner, so all
fds in the same process share the same owner and never conflict. This
caused the fcntl tests to silently pass without exercising lock conflicts.
Changes:
- Add TestFcntlLockHelper subprocess entry point with hold/try/getlk actions
- Add lockHolder with channel-based coordination (no scanner race)
- Rewrite all fcntl tests to run contenders in separate subprocesses
- Fix F_UNLCK int16 cast in GetLk assertion for type-safe comparison
- Fix concurrent test: use non-blocking flock with retry to avoid
exhausting go-fuse server reader goroutines (blocking FUSE SETLKW
can starve unlock request processing, causing deadlock)
flock tests remain same-process since flock uses per-struct-file owners.
* Fix misleading comment and error handling in lock test subprocess
- Fix comment: tryLockInSubprocess tests a subprocess, not the test process
- Distinguish EAGAIN/EACCES from unexpected errors in subprocess try mode
so real failures aren't silently masked as lock conflicts
* Fix CI race in FcntlReleaseOnClose and increase flock retry budget
- FcntlReleaseOnClose: retry lock acquisition after subprocess exits
since the FUSE server may not process Release immediately
- ConcurrentLockContention: increase retry limit from 500 to 3000
(5s → 30s budget) to handle CI load
* Separate flock and fcntl locks in the in-memory lock table and clean them up through the right release path: PID for POSIX locks, lock owner for flock
* ReleasePosixOwner
* weed/mount: flush before releasing posix close owner
* weed/mount: keep woken lock waiters from losing inode state
* test/fuse: make blocking fcntl helper state explicit
* test/fuse: assert flock contention never overlaps
* test/fuse: stabilize concurrent lock contention check
* test/fuse: make concurrent contention writes deterministic
* weed/mount: retry synchronous metadata flushes
* Add POSIX byte-range lock table for FUSE mount
Implement PosixLockTable with per-inode range lock tracking supporting:
- Shared (F_RDLCK) and exclusive (F_WRLCK) byte-range locks
- Conflict detection across different lock owners
- Lock coalescing for adjacent/overlapping same-owner same-type locks
- Lock splitting on partial-range unlock
- Blocking waiter support for SetLkw with cancellation
- Owner-based cleanup for Release
* Wire POSIX lock handlers into FUSE mount
Implement GetLk, SetLk, SetLkw on WFS delegating to PosixLockTable.
Add posixLocks field to WFS and initialize in constructor.
Clean up locks on Release via ReleaseOwner using ReleaseIn.LockOwner.
Remove ENOSYS stubs from weedfs_unsupported.go.
* Enable POSIX and flock lock capabilities in FUSE mount
Set EnableLocks: true in mount options to advertise
CAP_POSIX_LOCKS and CAP_FLOCK_LOCKS during FUSE INIT.
* Avoid thundering herd in lock waiter wake-up
Replace broadcast-all wakeWaiters with selective wakeEligibleWaiters
that checks each waiter's requested lock against remaining held locks.
Only waiters whose request no longer conflicts are woken; others stay
queued. Store the requested lockRange in each lockWaiter to enable this.
* Fix uint64 overflow in adjacency check for lock coalescing
Guard h.End+1 and lk.End+1 with < ^uint64(0) checks so that
End == math.MaxUint64 (EOF) does not wrap to 0 and falsely merge
non-adjacent locks.
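The overflow guard can be shown with a self-contained sketch. The `lockRange` type and function name are assumptions for illustration; ranges are taken to be inclusive, with `End == math.MaxUint64` meaning "to EOF".

```go
package main

import (
	"fmt"
	"math"
)

// lockRange is a hypothetical inclusive byte range; End == math.MaxUint64
// means the lock extends to EOF.
type lockRange struct {
	Start, End uint64
}

// touches reports whether a and b overlap or are directly adjacent, so that
// same-owner, same-type locks can be coalesced. End+1 is only evaluated when
// End is below the maximum, because MaxUint64+1 wraps to 0.
func touches(a, b lockRange) bool {
	if a.Start <= b.End && b.Start <= a.End {
		return true // overlapping
	}
	if a.End < math.MaxUint64 && a.End+1 == b.Start {
		return true // b starts right after a
	}
	if b.End < math.MaxUint64 && b.End+1 == a.Start {
		return true // a starts right after b
	}
	return false
}

func main() {
	toEOF := lockRange{Start: 10, End: math.MaxUint64}
	head := lockRange{Start: 0, End: 5}
	// Without the guard, toEOF.End+1 wraps to 0 == head.Start and the two
	// ranges would be merged even though bytes 6..9 separate them.
	fmt.Println(touches(toEOF, head))
}
```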
* Add test for non-adjacent ranges with gap not being coalesced
The metadata subscription handler (updateBucketConfigCacheFromEntry) was
making a separate RPC call via loadCORSFromBucketContent to load CORS
configuration. This created a race window where a slow CreateBucket
subscription event could re-cache stale data after PutBucketCors had
already cleared the cache, causing subsequent GetBucketCors to return
404 NoSuchCORSConfiguration.
Parse CORS directly from the subscription entry's Content field instead
of making a separate RPC. Also fix getBucketConfig to parse CORS from
the already-fetched entry, eliminating a redundant RPC call.
Fix TestCORSCaching to use require.NoError to prevent nil pointer
dereference panics when GetBucketCors fails.
* fix: extend ignore404Error to match 404 Not Found string from S3 sink errors
* test: add unit tests for isIgnorable404 error matching
* improve: pre-compute ignorable 404 string and simplify isIgnorable404
* test: replace init() with TestMain for global HTTP client setup
* mount: async flush on close() when writebackCache is enabled
When -writebackCache is enabled, defer data upload and metadata flush
from Flush() (triggered by close()) to a background goroutine in
Release(). This allows processes like rsync that write many small files
to proceed to the next file immediately instead of blocking on two
network round-trips (volume upload + filer metadata) per file.
Fixes #8718
* mount: add retry with backoff for async metadata flush
The metadata flush in completeAsyncFlush now retries up to 3 times
with exponential backoff (1s, 2s, 4s) on transient gRPC errors.
Since the chunk data is already safely on volume servers at this point,
only the filer metadata reference needs persisting — retrying is both
safe and effective.
Data flush (FlushData) is not retried externally because
UploadWithRetry already handles transient HTTP/gRPC errors internally;
if it still fails, the chunk memory has already been freed, so there
is nothing left to resend.
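A sketch of the retry schedule for the metadata flush, under one reading of "up to 3 times": one initial attempt followed by at most three retries, with delays of 1s, 2s, and 4s. The helper names are hypothetical, and the sleep function is injected so the example runs instantly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff runs op once, then retries up to `retries` times while it
// keeps failing, sleeping base, 2*base, 4*base, ... before each retry.
func retryWithBackoff(retries int, base time.Duration, sleep func(time.Duration), op func() error) error {
	err := op()
	for i := 0; i < retries && err != nil; i++ {
		sleep(base << uint(i))
		err = op()
	}
	return err
}

// simulate runs an operation that fails `failures` times before succeeding
// and reports the delays requested, the number of calls, and the final error.
func simulate(failures int) (delays []time.Duration, calls int, err error) {
	op := func() error {
		calls++
		if calls <= failures {
			return errors.New("transient error")
		}
		return nil
	}
	record := func(d time.Duration) { delays = append(delays, d) }
	err = retryWithBackoff(3, time.Second, record, op)
	return delays, calls, err
}

func main() {
	delays, calls, err := simulate(2)
	fmt.Println(delays, calls, err)
}
```

In real use the sleep argument would be `time.Sleep`; recording the delays instead makes the schedule easy to inspect.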
* test: add integration tests for writebackCache async flush
Add comprehensive FUSE integration tests for the writebackCache
async flush feature (issue #8718):
- Basic operations: write/read, sequential files, large files, empty
files, overwrites
- Fsync correctness: fsync forces synchronous flush even in writeback
mode, immediate read-after-fsync
- Concurrent small files: multi-worker parallel writes (rsync-like
workload), multi-directory, rapid create/close
- Data integrity: append after close, partial writes, file size
correctness, binary data preservation
- Performance comparison: writeback vs synchronous flush throughput
- Stress test: 16 workers x 100 files with content verification
- Mixed concurrent operations: reads, writes, creates running together
Also fix pre-existing test infrastructure issues:
- Rename framework.go to framework_test.go (fixes Go package conflict)
- Fix undefined totalSize variable in concurrent_operations_test.go
* ci: update fuse-integration workflow to run full test suite
The workflow previously only ran placeholder tests (simple_test.go,
working_demo_test.go) in a temp directory due to a Go module conflict.
Now that framework.go is renamed to framework_test.go, the full test
suite compiles and runs correctly from test/fuse_integration/.
Changes:
- Run go test directly in test/fuse_integration/ (no temp dir copy)
- Install weed binary to /usr/local/bin for test framework discovery
- Configure /etc/fuse.conf with user_allow_other for FUSE mounts
- Install fuse3 for modern FUSE support
- Stream test output to log file for artifact upload
* mount: fix three P1 races in async flush
P1-1: Reopen overwrites data still flushing in background
ReleaseByHandle removes the old handle from fhMap before the deferred
flush finishes. A reopen of the same inode during that window would
build from stale filer metadata, overwriting the async flush.
Fix: Track in-flight async flushes per inode via pendingAsyncFlush map.
AcquireHandle now calls waitForPendingAsyncFlush(inode) to block until
any pending flush completes before reading filer metadata.
P1-2: Deferred flush races rename and unlink after close
completeAsyncFlush captured the path once at entry, but rename or
unlink after close() could cause metadata to be written under the
wrong name or recreate a deleted file.
Fix: Re-resolve path from inode via GetPath right before metadata
flush. GetPath returns the current path (reflecting renames) or
ENOENT (if unlinked), in which case we skip the metadata flush.
P1-3: SIGINT/SIGTERM bypasses the async-flush drain
grace.OnInterrupt runs hooks then calls os.Exit(0), so
WaitForAsyncFlush after server.Serve() never executes on signal.
Fix: Add WaitForAsyncFlush (with 10s timeout) to the WFS interrupt
handler, before cache cleanup. The timeout prevents hanging on Ctrl-C
when the filer is unreachable.
* mount: fix P1 races — draining handle stays in fhMap
P1-1: Reopen TOCTOU
The gap between ReleaseByHandle removing from fhMap and
submitAsyncFlush registering in pendingAsyncFlush allowed a
concurrent AcquireHandle to slip through with stale metadata.
Fix: Hold pendingAsyncFlushMu across both the counter decrement
(ReleaseByHandle) and the pending registration. The handle is
registered as pending before the lock is released, so
waitForPendingAsyncFlush always sees it.
P1-2: Rename/unlink can't find draining handle
ReleaseByHandle deleted from fhMap immediately. Rename's
FindFileHandle(inode) at line 251 could not find the handle to
update entry.Name. Unlink could not coordinate either.
Fix: When asyncFlushPending is true, ReleaseByHandle/ReleaseByInode
leave the handle in fhMap (counter=0 but maps intact). The handle
stays visible to FindFileHandle so rename can update entry.Name.
completeAsyncFlush re-resolves the path from the inode (GetPath)
right before metadata flush for correctness after rename/unlink.
After drain, RemoveFileHandle cleans up the maps.
Double-return prevention: ReleaseByHandle/ReleaseByInode return nil
if counter is already <= 0, so Forget after Release doesn't start a
second drain goroutine.
P1-3: SIGINT deletes swap files under running goroutines
After the 10s timeout, os.RemoveAll deleted the write cache dir
(containing swap files) while FlushData goroutines were still
reading from them.
Fix: Increase timeout to 30s. If timeout expires, skip write cache
dir removal so in-flight goroutines can finish reading swap files.
The OS (or next mount) cleans them up. Read cache is always removed.
* mount: never skip metadata flush when Forget drops inode mapping
Forget removes the inode→path mapping when the kernel's lookup count
reaches zero, but this does NOT mean the file was unlinked — it only
means the kernel evicted its cache entry. completeAsyncFlush was
treating GetPath failure as "file unlinked" and skipping the metadata
flush, which orphaned the just-uploaded chunks for live files.
Fix: Save dir and name at doFlush defer time. In completeAsyncFlush,
try GetPath first to pick up renames; if the mapping is gone, fall
back to the saved dir/name. Always attempt the metadata flush — the
filer is the authority on whether the file exists, not the local
inode cache.
* mount: distinguish Forget from Unlink in async flush path fallback
The saved-path fallback (from the previous fix) always flushed
metadata when GetPath failed, which recreated files that were
explicitly unlinked after close(). The same stale fallback could
recreate the pre-rename path if Forget dropped the inode mapping
after a rename.
Root cause: GetPath failure has two meanings:
1. Forget — kernel evicted the cache entry (file still exists)
2. Unlink — file was explicitly deleted (should not recreate)
Fix (three coordinated changes):
Unlink (weedfs_file_mkrm.go): Before RemovePath, look up the inode
and find any draining handle via FindFileHandle. Set fh.isDeleted =
true so the async flush knows the file was explicitly removed.
Rename (weedfs_rename.go): When renaming a file with a draining
handle, update asyncFlushDir/asyncFlushName to the post-rename
location. This keeps the saved-path fallback current so Forget
after rename doesn't flush to the old (pre-rename) path.
completeAsyncFlush (weedfs_async_flush.go): Check fh.isDeleted
first — if true, skip metadata flush (file was unlinked, chunks
become orphans for volume.fsck). Otherwise, try GetPath for the
current path (renames); fall back to saved path if Forget dropped
the mapping (file is live, just evicted from kernel cache).
* test/ci: address PR review nitpicks
concurrent_operations_test.go:
- Restore precise totalSize assertion instead of info.Size() > 0
writeback_cache_test.go:
- Check rand.Read errors in all 3 locations (lines 310, 512, 757)
- Check os.MkdirAll error in stress test (line 752)
- Remove dead verifyErrors variable (line 332)
- Replace both time.Sleep(5s) with polling via waitForFileContent
to avoid flaky tests under CI load (lines 638, 700)
fuse-integration.yml:
- Add set -o pipefail so go test failures propagate through tee
* ci: fix fuse3/fuse package conflict on ubuntu-22.04 runner
fuse3 is pre-installed on ubuntu-22.04 runners and conflicts with
the legacy fuse package. Only install libfuse3-dev for the headers.
* mount/page_writer: remove debug println statements
Remove leftover debug println("read new data1/2") from
ReadDataAt in MemChunk and SwapFileChunk.
* test: fix findWeedBinary matching source directory instead of binary
findWeedBinary() matched ../../weed (the source directory) via
os.Stat before checking PATH, then tried to exec a directory
which fails with "permission denied" on the CI runner.
Fix: Check PATH first (reliable in CI where the binary is installed
to /usr/local/bin). For relative paths, verify the candidate is a
regular file (!info.IsDir()). Add ../../weed/weed as a candidate
for in-tree builds.
* test: fix framework — dynamic ports, output capture, data dirs
The integration test framework was failing in CI because:
1. All tests used hardcoded ports (19333/18080/18888), so sequential
tests could conflict when prior processes hadn't fully released
their ports yet.
2. Data subdirectories (data/master, data/volume) were not created
before starting processes.
3. Master was started with -peers=none which is not a valid address.
4. Process stdout/stderr was not captured, making failures opaque
("service not ready within timeout" with no diagnostics).
5. The unmount fallback used 'umount' instead of 'fusermount -u'.
6. The mount used -cacheSizeMB (nonexistent) instead of
-cacheCapacityMB and was missing -allowOthers=false for
unprivileged CI runners.
Fixes:
- Dynamic port allocation via freePort() (net.Listen ":0")
- Explicit gRPC ports via -port.grpc to avoid default port conflicts
- Create data/master and data/volume directories in Setup()
- Remove invalid -peers=none and -raftBootstrap flags
- Capture process output to logDir/*.log via startProcess() helper
- dumpLog() prints tail of log file on service startup failure
- Use fusermount3/fusermount -u for unmount
- Fix mount flag names (-cacheCapacityMB, -allowOthers=false)
* test: remove explicit -port.grpc flags from test framework
SeaweedFS convention: gRPC port = HTTP port + 10000. Volume and
filer discover the master gRPC port by this convention. Setting
explicit -port.grpc on master/volume/filer broke inter-service
communication because the volume server computed master gRPC as
HTTP+10000 but the actual gRPC was on a different port.
Remove all -port.grpc flags and let the default convention work.
Dynamic HTTP ports already ensure uniqueness; the derived gRPC
ports (HTTP+10000) will also be unique.
---------
Co-authored-by: Copilot <copilot@github.com>
* admin/plugin: delete job_detail files when jobs are pruned from memory
pruneTrackedJobsLocked evicts the oldest terminal jobs from the in-memory
tracker when the total exceeds maxTrackedJobsTotal (1000). However the
dedicated per-job detail files in jobs/job_details/ were never removed,
causing them to accumulate indefinitely on disk.
Add ConfigStore.DeleteJobDetail and call it from pruneTrackedJobsLocked so
that the file is cleaned up together with the in-memory entry. Deletion
errors are logged at verbosity level 2 and do not abort the prune.
* admin/plugin: add test for DeleteJobDetail
---------
Co-authored-by: Anton Ustyugov <anton@devops>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
loadPersistedMonitorState performed a backward-compatibility migration that
wrote every job with inline rich detail fields to a dedicated per-job detail
file synchronously during startup. On deployments with many historical jobs
(e.g. 1000+) stored on distributed block storage (e.g. Longhorn), each
individual file write requires an fsync round-trip, making startup
disproportionately slow and causing readiness/liveness probe failures.
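Defer the per-job detail file writes to a background goroutine so that
startup no longer blocks on them.
The in-memory state is populated correctly before the goroutine is started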
because stripTrackedJobDetailFields is still called in-place; only the disk
writes are deferred. A completion log message at V(1) is emitted once the
background migration finishes.
Co-authored-by: Anton Ustyugov <anton@devops>
RunPluginJobTypeAPI previously executed proposals with a naive sequential loop
calling ExecutePluginJob per proposal. This had two bugs:
1. Double-lock: RunPluginJobTypeAPI held pluginLock while calling ExecutePluginJob,
which tried to re-acquire the same lock for every job in the loop.
2. No capacity management: proposals were fired directly at workers without
reserveScheduledExecutor, so every job beyond the worker concurrency limit
received an immediate at_capacity error with no retry or backoff.
Fix: add Plugin.DispatchProposals which reuses dispatchScheduledProposals - the
same code path the scheduler loop uses - with executor reservation, configurable
concurrency, and per-job retry with backoff. RunPluginJobTypeAPI now calls
DispatchPluginProposals (a thin AdminServer wrapper) after acquiring pluginLock once.
Co-authored-by: Anton Ustyugov <anton@devops>
RequestLock used a bare println to report transient lock acquisition failures
('lock: already locked by ...'), which writes directly to stdout instead of
going through the structured logging pipeline. This causes log noise at the
wrong level and cannot be filtered with -v or redirected like glog output.
Changes:
- println("lock:", ...) -> glog.V(2).Infof for per-retry acquisition errors
(transient, high-frequency during startup when another instance still holds
the lock)
- Add glog.V(1).Infof when the lock is successfully acquired
- Add glog.V(2).Infof for successful renewals (replaces commented-out println)
- Errorf -> Warningf for renewal failures (the goroutine exits cleanly, it is
not a fatal error; the caller will re-acquire via RequestLock)
Co-authored-by: Anton Ustyugov <anton@devops>
* feat(shell): add volume.tier.compact command to reclaim cloud storage space
Adds a new shell command that automates compaction of cloud tier volumes.
When files are deleted from remote-tiered volumes, space is not reclaimed
on the cloud storage. This command orchestrates: download from remote,
compact locally, and re-upload to reclaim deleted space.
Closes #8563
* fix: log cleanup errors in compactVolumeOnServer instead of discarding them
Helps operators diagnose leftover temp files (.cpd/.cpx) if cleanup
fails after a compaction or commit failure.
* fix: return aggregate error from loop and use regex for collection filter
- Track and return error count when one or more volumes fail to compact,
so callers see partial failures instead of always getting nil.
- Use compileCollectionPattern for -collection in -volumeId mode too, so
regex patterns work consistently with the flag description. Empty
pattern (no -collection given) matches all collections.
* fix(telemetry): use correct TopologyId field in integration test
The proto field was renamed from cluster_id to topology_id but the
integration test was not updated, causing a compilation error.
* ci: add telemetry integration test workflow
Runs the telemetry integration test (server startup, protobuf
marshaling, client send, metrics/stats/instances API checks) on
changes to telemetry/ or weed/telemetry/.
* fix(telemetry): improve error message specificity in integration test
* fix(ci): pre-build telemetry server binary for integration test
go run compiles the server on the fly, which exceeds the 15s startup
timeout in CI. Build the binary first so the test starts instantly.
* fix(telemetry): fix ClusterId references in server and CI build path
- Replace ClusterId with TopologyId in server storage and API handler
(same rename as the integration test fix)
- Fix CI build: telemetry server has its own go.mod, so build from
within its directory
* ci(telemetry): add least-privilege permissions to workflow
Scope the workflow token to read-only repository contents, matching
the convention used in go.yml.
* fix(telemetry): set TopologyId in client integration test
The client only populates TopologyId when SetTopologyId has been
called. The test was missing this call, causing the server to reject
the request with 400 (missing required field).
* fix(telemetry): delete clusterInfo metric on instance cleanup
The cleanup loop removed all per-instance metrics except clusterInfo,
leaking that label set after eviction.
Key improvements:
- Fix concurrent map write panic in partition type cache
- Fix data races in yieldDataFiles and key map getter
- Fix response body leaks in REST catalog
- Fix index out of range in buildManifestEvaluator
- Table Metadata V3 support
- Schema evolution API
- Partitioned write throughput optimizations
- Gzipped metadata read/write support
* glog: add JSON structured logging mode
Add opt-in JSON output format for glog, enabling integration with
log aggregation systems like ELK, Loki, and Datadog.
- Add --log_json flag to enable JSON output at startup
- Add SetJSONMode()/IsJSONMode() for runtime toggle
- Add JSON branches in println, printDepth, printf, printWithFileLine
- Use manual JSON construction (no encoding/json) for performance
- Add jsonEscapeString() for safe string escaping
- Include 8 unit tests and 1 benchmark
Enabled via --log_json flag. Default behavior unchanged.
* glog: prevent lazy flag init from overriding SetJSONMode
If --log_json=true and SetJSONMode(false) was called at runtime,
a subsequent IsJSONMode() call would re-enable JSON mode via the
sync.Once lazy initialization. Mark jsonFlagOnce as done inside
SetJSONMode so the runtime API always takes precedence.
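One way to picture the fix, as a sketch rather than the actual glog code: `flagValue`, `jsonMode`, and `flagOnce` are assumed names (the commit mentions jsonFlagOnce, but its surrounding code is not shown here). Calling `Do` with an empty function consumes the `sync.Once`, so the lazy flag read can never run afterwards.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

var (
	flagValue = true // stands in for a parsed --log_json=true flag
	jsonMode  atomic.Bool
	flagOnce  sync.Once
)

// IsJSONMode lazily applies the flag value on first use, then reports the
// current mode.
func IsJSONMode() bool {
	flagOnce.Do(func() { jsonMode.Store(flagValue) })
	return jsonMode.Load()
}

// SetJSONMode marks the Once as done before storing, so a later IsJSONMode
// call cannot overwrite the runtime choice with the flag value.
func SetJSONMode(on bool) {
	flagOnce.Do(func() {})
	jsonMode.Store(on)
}

func main() {
	SetJSONMode(false)
	// Without the Once being consumed above, this call would re-enable JSON.
	fmt.Println(IsJSONMode())
}
```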
* glog: fix RuneError check to not misclassify valid U+FFFD
The condition r == utf8.RuneError matches both invalid UTF-8
sequences (size=1) and a valid U+FFFD replacement character
(size=3). Without checking size == 1, a valid U+FFFD input
would be incorrectly escaped and only advance by 1 byte,
corrupting the output.
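The distinction can be demonstrated with the standard library alone. `countInvalidBytes` is a hypothetical helper written for this illustration; it is not the glog escaping routine, only the same decode-and-check pattern.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countInvalidBytes counts bytes that are not part of a valid UTF-8
// sequence. DecodeRuneInString reports an invalid byte as
// (utf8.RuneError, 1); a valid U+FFFD decodes to the same rune with size 3.
func countInvalidBytes(s string) int {
	n := 0
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			n++
		}
		i += size // advance by the full rune width
	}
	return n
}

func main() {
	fmt.Println(countInvalidBytes("a\uFFFDb")) // valid replacement character: 0
	fmt.Println(countInvalidBytes("a\xffb"))   // one invalid byte: 1
}
```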
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* glog: add gzip compression for rotated log files
Add opt-in gzip compression that automatically compresses log files
after rotation, reducing disk usage in long-running deployments.
- Add --log_compress flag to enable compression at startup
- Add SetCompressRotated()/IsCompressRotated() for runtime toggle
- Compress rotated files in background goroutine (non-blocking)
- Use gzip.BestSpeed for minimal CPU overhead
- Fix .gz file cleanup: TrimSuffix approach correctly counts
compressed files toward MaxFileCount limit
- Include 6 unit tests covering normal, empty, large, and edge cases
Enabled via --log_compress flag. Default behavior unchanged.
* glog: fix compressFile to check gz/dst close errors and use atomic rename
Write to a temp file (.gz.tmp) and rename atomically to prevent
exposing partial archives. Check gz.Close() and dst.Close() errors
to avoid deleting the original log when flush fails (e.g. ENOSPC).
Use defer for robust resource cleanup.
* glog: deduplicate .log/.log.gz pairs in rotation cleanup
During concurrent compression, both foo.log and foo.log.gz can exist
simultaneously. Count them as one entry against MaxFileCount to prevent
premature eviction of rotated logs.
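The counting rule can be sketched with strings.TrimSuffix, as mentioned in the earlier cleanup fix. `logicalLogNames` is a hypothetical helper; the real cleanup code presumably works on directory entries and timestamps rather than a plain slice of names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// logicalLogNames collapses "foo.log" and "foo.log.gz" into one entry so
// that a pair seen mid-compression counts once against the file limit.
func logicalLogNames(files []string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, f := range files {
		base := strings.TrimSuffix(f, ".gz")
		if _, ok := seen[base]; ok {
			continue
		}
		seen[base] = struct{}{}
		out = append(out, base)
	}
	sort.Strings(out)
	return out
}

func main() {
	files := []string{"a.log", "a.log.gz", "b.log.gz", "c.log"}
	names := logicalLogNames(files)
	fmt.Println(len(names), names)
}
```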
* glog: use portable temp path in TestCompressFile_NonExistent
Replace hardcoded /nonexistent/path with t.TempDir() for portability.
---------
Co-authored-by: Copilot <copilot@github.com>
* fix(remote_gateway): prevent double-versioning when syncing to versioned central bucket
When a file is uploaded to a versioned bucket on edge, SeaweedFS stores
it internally as {object}.versions/v_{versionId}. The remote_gateway was
syncing this internal path directly to the central S3 endpoint. When
central's bucket also has versioning enabled, this caused central to
apply its own versioning on top, producing corrupt paths like:
object.versions/v_{edgeId}.versions/v_{centralId}
Fix: rewrite internal .versions/v_{id} paths to the original S3 object
key before uploading to the remote. Skip version file delete/update
events that are internal bookkeeping.
Fixes https://github.com/seaweedfs/seaweedfs/discussions/8481#discussioncomment-16209342
* fix(remote_gateway): propagate delete markers to remote as deletions
Delete markers are zero-content version entries (ExtDeleteMarkerKey=true)
created by S3 DELETE on a versioned bucket. Previously they were silently
dropped by the HasData() filter, so deletions on edge never reached
central.
Now: detect delete markers before the HasData check, rewrite the
.versions path to the original S3 key, and issue client.DeleteFile()
on the remote.
* fix(remote_gateway): tighten isVersionedPath to avoid false positives
Address PR review feedback:
- Add isDir parameter to isVersionedPath so it only matches the exact
internal shapes: directories whose name ends with .versions (isDir=true),
and files with the v_ prefix inside a .versions parent (isDir=false).
Previously the function was too broad and could match user-created paths
like "my.versions/data.txt".
- Update all 4 call sites to pass the entry's IsDirectory field.
- Rename TestVersionedDirectoryNotFilteredByHasData to
TestVersionsDirectoryFilteredByHasData so the name reflects the
actual assertion (directories ARE filtered by HasData).
- Expand TestIsVersionedPath with isDir cases and false-positive checks.
* fix(remote_gateway): persist sync marker after delete-marker propagation
The delete-marker branch was calling client.DeleteFile() and returning
without updating the local entry, making event replay re-issue the
remote delete. Now call updateLocalEntry after a successful DeleteFile
to stamp the delete-marker entry with a RemoteEntry, matching the
pattern used by the normal create path.
* refactor(remote_gateway): extract syncDeleteMarker and fix root path edge case
- Extract syncDeleteMarker() shared helper used by both bucketed and
mounted-dir event processors, replacing the duplicated delete + persist
local marker logic.
- Fix rewriteVersionedSourcePath for root-level objects: when lastSlash
is 0 (e.g. "/file.xml.versions"), return "/" as the parent dir instead
of an empty string.
- The strings.Contains(dir, ".versions/") condition flagged in review was
already removed in a prior commit that tightened isVersionedPath.
* fix(remote_gateway): skip updateLocalEntry for versioned path rewrites
After rewriting a .versions/v_{id} path to the logical S3 key and
uploading, the code was calling updateLocalEntry on the original v_*
entry, stamping it with a RemoteEntry for the logical key. This is
semantically wrong: the logical object has no filer entry in versioned
buckets, and the internal v_* entry should not carry a RemoteEntry for
a different path.
Skip updateLocalEntry when the path was rewritten from a versioned
source. Replay safety is preserved because S3 PutObject is idempotent.
* fix(remote_gateway): scope versioning checks to /buckets/ namespace
isVersionedPath and rewriteVersionedSourcePath could wrongly match
paths in non-bucket mounts (e.g. /mnt/remote/file.xml.versions).
Add the same /buckets/ prefix guard used by isMultipartUploadDir so
the .versions / v_ logic only applies within the bucket namespace.
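Taken together, the path rules from the remote_gateway fixes above can be approximated as below. `logicalPath` is a hypothetical single-function simplification; the actual isVersionedPath and rewriteVersionedSourcePath take different arguments (for example the entry's IsDirectory field) that are not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

const versionsSuffix = ".versions"

// logicalPath maps an internal version file path of the shape
// "/buckets/<bucket>/<key>.versions/v_<id>" to "/buckets/<bucket>/<key>".
// It returns false for anything outside /buckets/ or not matching that shape.
func logicalPath(fullPath string) (string, bool) {
	if !strings.HasPrefix(fullPath, "/buckets/") {
		return "", false // non-bucket mounts are never rewritten
	}
	slash := strings.LastIndex(fullPath, "/")
	parent, name := fullPath[:slash], fullPath[slash+1:]
	if !strings.HasPrefix(name, "v_") || !strings.HasSuffix(parent, versionsSuffix) {
		return "", false
	}
	logical := strings.TrimSuffix(parent, versionsSuffix)
	if strings.HasSuffix(logical, "/") {
		return "", false // a bare ".versions" directory has no object key
	}
	return logical, true
}

func main() {
	for _, p := range []string{
		"/buckets/b/dir/file.xml.versions/v_123",
		"/buckets/b/my.versions/data.txt",
		"/mnt/remote/file.xml.versions/v_1",
	} {
		key, ok := logicalPath(p)
		fmt.Printf("%q -> %q %v\n", p, key, ok)
	}
}
```

A name-based check like this cannot tell a user-created "x.versions/v_y" object from an internal one, which is one reason the real code also looks at entry metadata.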
Remove the block that prevented deleting the "anonymous" identity
and stop auto-creating it when absent. If no anonymous identity
exists (or it is disabled), LookupAnonymous returns not-found and
both auth paths return ErrAccessDenied for anonymous requests.
To enable anonymous access, explicitly create the "anonymous" user.
To revoke it, delete the user like any other identity.
Closes #8694
* fix(s3): include directory markers in ListObjects without delimiter (#8698)
Directory key objects (zero-byte objects with keys ending in "/") created
via PutObject were omitted from ListObjects/ListObjectsV2 results when no
delimiter was specified. AWS S3 includes these as regular keys in Contents.
The issue was in doListFilerEntries: when recursing into directories in
non-delimiter mode, directory key objects were only emitted when
prefixEndsOnDelimiter was true. Added an else branch to emit them in the
general recursive case as well.
* remove issue reference from inline comment
* test: add child-under-marker and paginated listing coverage
Extend test 6 to place a child object under the directory marker
and paginate with MaxKeys=1 so the emit-then-recurse truncation
path is exercised.
* fix(test): skip directory markers in Spark temporary artifacts check
The listing check now correctly shows directory markers (keys ending
in "/") after the ListObjects fix. These 0-byte metadata objects are
not data artifacts; filter them from the listing check since the
HeadObject-based check already verifies their cleanup with a timeout.
* fix(helm): namespace app-specific values under global.seaweedfs
Move all app-specific values from the global namespace to
global.seaweedfs.* to avoid polluting the shared .Values.global
namespace when the chart is used as a subchart.
Standard Helm conventions (global.imageRegistry, global.imagePullSecrets)
remain at the global level as they are designed to be shared across
subcharts.
Fixes seaweedfs/seaweedfs#8699
BREAKING CHANGE: global values have been restructured. Users must update
their values files to use the new paths:
- global.registry → global.imageRegistry
- global.repository → global.seaweedfs.image.repository
- global.imageName → global.seaweedfs.image.name
- global.<key> → global.seaweedfs.<key> (for all other app-specific values)
* fix(ci): update helm CI tests to use new global.seaweedfs.* value paths
Update all --set flags in helm_ci.yml to use the new namespaced
global.seaweedfs.* paths matching the values.yaml restructuring.
* fix(ci): install Claude Code via npm to avoid install.sh 403
The claude-code-action's built-in installer uses
`curl https://claude.ai/install.sh | bash` which can fail with 403.
Due to the pipe, bash exits 0 on empty input, masking the curl failure
and leaving the `claude` binary missing.
Work around this by installing Claude Code via npm before invoking the
action, and passing the executable path via path_to_claude_code_executable.
* revert: remove claude-code-review.yml changes from this PR
The claude-code-action OIDC token exchange validates that the workflow
file matches the version on the default branch. Modifying it in a PR
causes the review job to fail with "Workflow validation failed".
The Claude Code install fix will need to be applied directly to master
or in a separate PR.
* fix: update stale references to old global.* value paths
- admin-statefulset.yaml: fix fail message to reference
global.seaweedfs.masterServer
- values.yaml: fix comment to reference image.name instead of imageName
- helm_ci.yml: fix diagnostic message to reference
global.seaweedfs.enableSecurity
* feat(helm): add backward-compat shim for old global.* value paths
Add _compat.tpl with a seaweedfs.compat helper that detects old-style
global.* keys (e.g. global.enableSecurity, global.registry) and merges
them into the new global.seaweedfs.* namespace.
Since the old keys no longer have defaults in values.yaml, their
presence means the user explicitly provided them. The helper uses
in-place mutation via `set` so all templates see the merged values.
This ensures existing deployments using old value paths continue to
work without changes after upgrading.
* fix: update stale comment references in values.yaml
Update comments referencing global.enableSecurity and global.masterServer
to the new global.seaweedfs.* paths.
---------
Co-authored-by: Copilot <copilot@github.com>
The claude-code-action's built-in installer uses
`curl https://claude.ai/install.sh | bash` which can fail with 403.
Due to the pipe, bash exits 0 on empty input, masking the curl failure
and leaving the `claude` binary missing.
Work around this by installing Claude Code via npm before invoking the
action, and passing the executable path via path_to_claude_code_executable.
Co-authored-by: Copilot <copilot@github.com>