seaweedfs

Commit Graph

Author	SHA1	Message	Date
Chris Lu	b0e0c5aaab	s3: enable auth when IAM integration is configured (#7726 ) When only IAM integration is configured (via -s3.iam.config) without traditional S3 identities, the isAuthEnabled flag was not being set, causing the Auth middleware to bypass all authentication checks. This fix ensures that when SetIAMIntegration is called with a non-nil integration, isAuthEnabled is set to true, properly enforcing authentication for all requests. Added negative authentication tests: - TestS3AuthenticationDenied: tests rejection of unauthenticated, invalid, and expired JWT requests - TestS3IAMOnlyModeRejectsAnonymous: tests that IAM-only mode properly rejects anonymous requests Fixes #7724	4 weeks ago
Chris Lu	e8b7347031	Reduce memory allocations in hot paths (#7725 ) * filer: reduce allocations in MatchStorageRule Optimize MatchStorageRule to avoid allocations in common cases: - Return singleton emptyPathConf when no rules match (zero allocations) - Return existing rule directly when only one rule matches (zero allocations) - Only allocate and merge when multiple rules match (rare case) Based on heap profile analysis showing 111MB allocated from 1.64M calls to this function during 180 seconds of operation. * filer: add fast path for getActualStore when no path-specific stores Add hasPathSpecificStore flag to FilerStoreWrapper to skip the MatchPrefix() call and []byte(path) conversion when no path-specific stores are configured (the common case). Based on heap profile analysis showing 1.39M calls to this function during 180 seconds of operation, each requiring a string-to-byte slice conversion for the MatchPrefix call. * filer/foundationdb: use sync.Pool for tuple allocation in genKey Use sync.Pool to reuse tuple.Tuple slices in genKey(), reducing allocation overhead for every FoundationDB operation. Based on heap profile analysis showing 102MB allocated from 1.79M calls to genKey() during 180 seconds of operation. The Pack() call still allocates internally, but this reduces the tuple slice allocation overhead by ~50%. * filer: use sync.Pool for protobuf Entry and FuseAttributes Add pooling for filer_pb.Entry and filer_pb.FuseAttributes in EncodeAttributesAndChunks and DecodeAttributesAndChunks to reduce allocations during filer store operations. Changes: - Add pbEntryPool with pre-allocated FuseAttributes - Add EntryAttributeToExistingPb for in-place attribute conversion - Update ToExistingProtoEntry to reuse existing Attributes when available Based on heap profile showing: - EncodeAttributesAndChunks: 69.5MB cumulative - DecodeAttributesAndChunks: 46.5MB cumulative - EntryAttributeToPb: 47.5MB flat allocations * log_buffer: use sync.Pool for LogEntry in readTs Add logEntryPool to reuse filer_pb.LogEntry objects in readTs(), which is called frequently during binary search in ReadFromBuffer. This function only needs the TsNs field from the unmarshaled entry, so pooling the LogEntry avoids repeated allocations. Based on heap profile showing readTs with 188MB cumulative allocations from timestamp lookups during log buffer reads. * pb: reduce gRPC metadata allocations in interceptor Optimize requestIDUnaryInterceptor and WithGrpcClient to reduce metadata allocations on every gRPC request: - Use AppendToOutgoingContext instead of NewOutgoingContext + New() This avoids creating a new map[string]string for single key-value pairs - Check FromIncomingContext return value before using metadata Based on heap profile showing metadata operations contributing 0.45GB (10.5%) of allocations, with requestIDUnaryInterceptor being the main source at 0.44GB cumulative. Expected reduction: ~0.2GB from avoiding map allocations per request. * filer/log_buffer: address code review feedback - Use proto.Reset() instead of manual field clearing in resetLogEntry for more idiomatic and comprehensive state clearing - Add resetPbEntry() call before pool return in error path for consistency with success path in DecodeAttributesAndChunks * log_buffer: reduce PreviousBufferCount from 32 to 4 Reduce the number of retained previous buffers from 32 to 4. Each buffer is 8MB, so this reduces the maximum retained memory from 256MB to 32MB for previous buffers. Most subscribers catch up quickly, so 4 buffers (32MB) should be sufficient while significantly reducing memory footprint. * filer/foundationdb: use defer for tuple pool cleanup in genKey Refactor genKey to use defer for returning the pooled tuple. This ensures the pooled object is always returned even if store.seaweedfsDir.Pack panics, making the code more robust. Also simplifies the code by removing the temporary variable. * filer: early-stop MatchStorageRule prescan after 2 matches Stop the prescan callback after finding 2 matches since we only need to know if there are 0, 1, or multiple matches. This avoids unnecessarily scanning the rest of the trie when many rules exist. * fix: address critical code review issues filer_conf.go: - Remove mutable singleton emptyPathConf that could corrupt shared state - Return fresh copy for no-match case and cloned copy for single-match case - Add clonePathConf helper to create shallow copies safely grpc_client_server.go: - Remove incorrect AppendToOutgoingContext call in server interceptor (that API is for outbound client calls, not server-side handlers) - Rely on request_id.Set and SetTrailer for request ID propagation * fix: treat FilerConf_PathConf as immutable Fix callers that were incorrectly mutating the returned PathConf: - filer_server_handlers_write.go: Use local variable for MaxFileNameLength instead of mutating the shared rule - command_s3_bucket_quota_check.go: Create new PathConf explicitly when modifying config instead of mutating the returned one This allows MatchStorageRule to safely return the singleton or direct references without copying, restoring the memory optimization. Callers must NOT mutate the returned FilerConf_PathConf. filer: add ClonePathConf helper for creating mutable copies Add reusable ClonePathConf function that creates a mutable copy of a PathConf. This is useful when callers need to modify config before calling SetLocationConf. Update command_s3_bucket_quota_check.go to use the new helper. Also fix redundant return statement in DeleteLocationConf. * fmt * filer: fix protobuf pool reset to clear internal fields Address code review feedback: 1. resetPbEntry/resetFuseAttributes: Use struct assignment (e = T{}) instead of field-by-field reset to clear protobuf internal fields (unknownFields, sizeCache) that would otherwise accumulate across pool reuses, causing data corruption or memory bloat. 2. EntryAttributeToExistingPb: Add nil guard for attr parameter to prevent panic if caller passes nil. log_buffer: reset logEntry before pool return in error path For consistency with success path, reset the logEntry before putting it back in the pool in the error path. This prevents the pooled object from holding references to partially unmarshaled data. * filer: optimize MatchStorageRule and document ClonePathConf 1. Avoid double []byte(path) conversion in multi-match case by converting once and reusing pathBytes. 2. Add IMPORTANT comment to ClonePathConf documenting that it must be kept in sync with filer_pb.FilerConf_PathConf fields when the protobuf evolves. * filer/log_buffer: fix data race and use defer for pool cleanup 1. entry_codec.go EncodeAttributesAndChunks: Fix critical data race - proto.Marshal may return a slice sharing memory with the message. Copy the data before returning message to pool to prevent corruption. 2. entry_codec.go DecodeAttributesAndChunks: Use defer for cleaner pool management, ensuring message is always returned to pool. 3. log_buffer.go readTs: Use defer for pool cleanup, removing duplicated resetLogEntry/Put calls in success and error paths. * filer: fix ClonePathConf field order and add comprehensive test 1. Fix field order in ClonePathConf to match protobuf struct definition (WormGracePeriodSeconds before WormRetentionTimeSeconds). 2. Add TestClonePathConf that constructs a fully-populated PathConf, calls ClonePathConf, and asserts equality of all exported fields. This will catch future schema drift when new fields are added. 3. Add TestClonePathConfNil to verify nil handling. * filer: use reflection in ClonePathConf test to detect schema drift Replace hardcoded field comparisons with reflection-based comparison. This automatically catches: 1. New fields added to the protobuf but not copied in ClonePathConf 2. Missing non-zero test values for any exported field The test iterates over all exported fields using reflect and compares src vs clone values, failing if any field differs. * filer: update EntryAttributeToExistingPb comment to reflect nil handling The function safely handles nil attr by returning early, but the comment incorrectly stated 'attr must not be nil'. Update comment to accurately describe the defensive behavior. * Fix review feedback: restore request ID propagation and remove redundant resets 1. grpc_client_server.go: Restore AppendToOutgoingContext for request ID so handlers making downstream gRPC calls will automatically propagate the request ID to downstream services. 2. entry_codec.go: Remove redundant resetPbEntry calls after Get. The defer block ensures reset before Put, so next Get receives clean object. 3. log_buffer.go: Remove redundant resetLogEntry call after Get for same reason - defer already handles reset before Put.	4 weeks ago
Chris Lu	84b8a8e010	filer.sync: fix checkpoint not being saved properly (#7719 ) * filer.sync: fix race condition on first checkpoint save Initialize lastWriteTime to time.Now() instead of zero time to prevent the first checkpoint save from being triggered immediately when the first event arrives. This gives async jobs time to complete and update the watermark before the checkpoint is saved. Previously, the zero time caused lastWriteTime.Add(3s).Before(now) to be true on the first event, triggering an immediate checkpoint save attempt. But since jobs are processed asynchronously, the watermark was still 0 (initial value), causing the save to be skipped due to the 'if offsetTsNs == 0 { return nil }' check. Fixes #7717 * filer.sync: save checkpoint on graceful shutdown Add graceful shutdown handling to save the final checkpoint when filer.sync is terminated. Previously, any sync progress within the last 3-second checkpoint interval would be lost on shutdown. Changes: - Add syncState struct to track current processor and offset save info - Add atomic pointers syncStateA2B and syncStateB2A for both directions - Register grace.OnInterrupt hook to save checkpoints on shutdown - Modify doSubscribeFilerMetaChanges to update sync state atomically This ensures that when filer.sync is restarted, it resumes from the correct position instead of potentially replaying old events. Fixes #7717	4 weeks ago
Chris Lu	de3ecaf0de	s3: fix presigned POST upload missing slash between bucket and key (#7714 ) * s3: fix presigned POST upload missing slash between bucket and key When uploading a file using presigned POST (e.g., boto3.generate_presigned_post), the file was saved with the bucket name and object key concatenated without a slash (e.g., 'my-bucketfilename' instead of 'my-bucket/filename'). The issue was that PostPolicyBucketHandler retrieved the object key from form values without ensuring it had a leading slash, unlike GetBucketAndObject() which normalizes the key. Fixes #7713 * s3: add tests for presigned POST key normalization Add comprehensive tests for PostPolicyBucketHandler to ensure: - Object keys without leading slashes are properly normalized - ${filename} substitution works correctly with normalization - Path construction correctly separates bucket and key - Form value extraction works properly These tests would have caught the bug fixed in the previous commit where keys like 'test_image.png' were concatenated with bucket without a separator, resulting in 'my-buckettest_image.png'. * s3: create normalizeObjectKey function for robust key normalization Address review feedback by creating a reusable normalizeObjectKey function that both adds a leading slash and removes duplicate slashes, aligning with how other handlers process paths (e.g., toFilerPath uses removeDuplicateSlashes). The function handles edge cases like: - Keys without leading slashes (the original bug) - Keys with duplicate slashes (e.g., 'a//b' -> '/a/b') - Keys with leading duplicate slashes (e.g., '///a' -> '/a') Updated tests to use the new function and added TestNormalizeObjectKey for comprehensive coverage of the new function. * s3: move NormalizeObjectKey to s3_constants for shared use Move the NormalizeObjectKey function to the s3_constants package so it can be reused by: - GetBucketAndObject() - now normalizes all object keys from URL paths - GetPrefix() - now normalizes prefix query parameters - PostPolicyBucketHandler - normalizes keys from form values This ensures consistent object key normalization across all S3 API handlers, handling both missing leading slashes and duplicate slashes. Benefits: - Single source of truth for key normalization - GetBucketAndObject now removes duplicate slashes (previously only added leading slash) - All handlers benefit from the improved normalization automatically	4 weeks ago
Chris Lu	df4f2f7020	ec: add -diskType flag to EC commands for SSD support (#7607 ) * ec: add diskType parameter to core EC functions Add diskType parameter to: - ecBalancer struct - collectEcVolumeServersByDc() - collectEcNodesForDC() - collectEcNodes() - EcBalance() This allows EC operations to target specific disk types (hdd, ssd, etc.) instead of being hardcoded to HardDriveType only. For backward compatibility, all callers currently pass types.HardDriveType as the default value. Subsequent commits will add -diskType flags to the individual EC commands. * ec: update helper functions to use configurable diskType Update the following functions to accept/use diskType parameter: - findEcVolumeShards() - addEcVolumeShards() - deleteEcVolumeShards() - moveMountedShardToEcNode() - countShardsByRack() - pickNEcShardsToMoveFrom() All ecBalancer methods now use ecb.diskType instead of hardcoded types.HardDriveType. Non-ecBalancer callers (like volumeServer.evacuate and ec.rebuild) use types.HardDriveType as the default. Update all test files to pass diskType where needed. * ec: add -diskType flag to ec.balance and ec.encode commands Add -diskType flag to specify the target disk type for EC operations: - ec.balance -diskType=ssd - ec.encode -diskType=ssd The disk type can be 'hdd', 'ssd', or empty for default (hdd). This allows placing EC shards on SSD or other disk types instead of only HDD. Example usage: ec.balance -collection=mybucket -diskType=ssd -apply ec.encode -collection=mybucket -diskType=ssd -force * test: add integration tests for EC disk type support Add integration tests to verify the -diskType flag works correctly: - TestECDiskTypeSupport: Tests EC encode and balance with SSD disk type - TestECDiskTypeMixedCluster: Tests EC operations on a mixed HDD/SSD cluster The tests verify: - Volume servers can be configured with specific disk types - ec.encode accepts -diskType flag and encodes to the correct disk type - ec.balance accepts -diskType flag and balances on the correct disk type - Mixed disk type clusters work correctly with separate collections * ec: add -sourceDiskType to ec.encode and -diskType to ec.decode ec.encode: - Add -sourceDiskType flag to filter source volumes by disk type - This enables tier migration scenarios (e.g., SSD volumes → HDD EC shards) - -diskType specifies target disk type for EC shards ec.decode: - Add -diskType flag to specify source disk type where EC shards are stored - Update collectEcShardIds() and collectEcNodeShardBits() to accept diskType Examples: # Encode SSD volumes to HDD EC shards (tier migration) ec.encode -collection=mybucket -sourceDiskType=ssd -diskType=hdd # Decode EC shards from SSD ec.decode -collection=mybucket -diskType=ssd Integration tests updated to cover new flags. * ec: fix variable shadowing and add -diskType to ec.rebuild and volumeServer.evacuate Address code review comments: 1. Fix variable shadowing in collectEcVolumeServersByDc(): - Rename loop variable 'diskType' to 'diskTypeKey' and 'diskTypeStr' to avoid shadowing the function parameter 2. Fix hardcoded HardDriveType in ecBalancer methods: - balanceEcRack(): use ecb.diskType instead of types.HardDriveType - collectVolumeIdToEcNodes(): use ecb.diskType 3. Add -diskType flag to ec.rebuild command: - Add diskType field to ecRebuilder struct - Pass diskType to collectEcNodes() and addEcVolumeShards() 4. Add -diskType flag to volumeServer.evacuate command: - Add diskType field to commandVolumeServerEvacuate struct - Pass diskType to collectEcVolumeServersByDc() and moveMountedShardToEcNode() * test: add diskType field to ecBalancer in TestPickEcNodeToBalanceShardsInto Address nitpick comment: ensure test ecBalancer struct has diskType field set for consistency with other tests. * ec: filter disk selection by disk type in pickBestDiskOnNode When evacuating or rebalancing EC shards, pickBestDiskOnNode now filters disks by the target disk type. This ensures: 1. EC shards from SSD disks are moved to SSD disks on destination nodes 2. EC shards from HDD disks are moved to HDD disks on destination nodes 3. No cross-disk-type shard movement occurs This maintains the storage tier isolation when moving EC shards between nodes during evacuation or rebalancing operations. * ec: allow disk type fallback during evacuation Update pickBestDiskOnNode to accept a strictDiskType parameter: - strictDiskType=true (balancing): Only use disks of matching type. This maintains storage tier isolation during normal rebalancing. - strictDiskType=false (evacuation): Prefer same disk type, but fall back to other disk types if no matching disk is available. This ensures evacuation can complete even when same-type capacity is insufficient. Priority order for evacuation: 1. Same disk type with lowest shard count (preferred) 2. Different disk type with lowest shard count (fallback) * test: use defer for lock/unlock to prevent lock leaks Use defer to ensure locks are always released, even on early returns or test failures. This prevents lock leaks that could cause subsequent tests to hang or fail. Changes: - Return early if lock acquisition fails - Immediately defer unlock after successful lock - Remove redundant explicit unlock calls at end of tests - Fix unused variable warning (err -> encodeErr/locErr) * ec: dynamically discover disk types from topology for evacuation Disk types are free-form tags (e.g., 'ssd', 'nvme', 'archive') that come from the topology, not a hardcoded set. Only 'hdd' (or empty) is the default disk type. Use collectVolumeDiskTypes() to discover all disk types present in the cluster topology instead of hardcoding [HardDriveType, SsdType]. * test: add evacuation fallback and cross-rack EC placement tests Add two new integration tests: 1. TestEvacuationFallbackBehavior: - Tests that when same disk type has no capacity, shards fall back to other disk types during evacuation - Creates cluster with 1 SSD + 2 HDD servers (limited SSD capacity) - Verifies pickBestDiskOnNode behavior with strictDiskType=false 2. TestCrossRackECPlacement: - Tests EC shard distribution across different racks - Creates cluster with 4 servers in 4 different racks - Verifies shards are spread across multiple racks - Tests that ec.balance respects rack placement Helper functions added: - startLimitedSsdCluster: 1 SSD + 2 HDD servers - startMultiRackCluster: 4 servers in 4 racks - countShardsPerRack: counts EC shards per rack from disk * test: fix collection mismatch in TestCrossRackECPlacement The EC commands were using collection 'rack_test' but uploaded test data uses collection 'test' (default). This caused ec.encode/ec.balance to not find the uploaded volume. Fix: Change EC commands to use '-collection test' to match the uploaded data. Addresses review comment from PR #7607. * test: close log files in MultiDiskCluster.Stop() to prevent FD leaks Track log files in MultiDiskCluster.logFiles and close them in Stop() to prevent file descriptor accumulation in long-running or many-test scenarios. Addresses review comment about logging resources cleanup. * test: improve EC integration tests with proper assertions - Add assertNoFlagError helper to detect flag parsing regressions - Update diskType subtests to fail on flag errors (ec.encode, ec.balance, ec.decode) - Update verify_disktype_flag_parsing to check help output contains diskType - Remove verify_fallback_disk_selection (was documentation-only, not executable) - Add assertion to verify_cross_rack_distribution for minimum 2 racks - Consolidate uploadTestDataWithDiskType to accept collection parameter - Remove duplicate uploadTestDataWithDiskTypeMixed function * test: extract captureCommandOutput helper and fix error handling - Add captureCommandOutput helper to reduce code duplication in diskType tests - Create commandRunner interface to match shell command Do method - Update ec_encode_with_ssd_disktype, ec_balance_with_ssd_disktype, ec_encode_with_source_disktype, ec_decode_with_disktype to use helper - Fix filepath.Glob error handling in countShardsPerRack instead of ignoring it * test: add flag validation to ec_balance_targets_correct_disk_type Add assertNoFlagError calls after ec.balance commands to ensure -diskType flag is properly recognized for both SSD and HDD disk types. * test: add proper assertions for EC command results - ec_encode_with_ssd_disktype: check for expected volume-related errors - ec_balance_with_ssd_disktype: require success with require.NoError - ec_encode_with_source_disktype: check for expected no-volume errors - ec_decode_with_disktype: check for expected no-ec-volume errors - upload_to_ssd_and_hdd: use require.NoError for setup validation Tests now properly fail on unexpected errors rather than just logging. * test: fix missing unlock in ec_encode_with_disk_awareness Add defer unlock pattern to ensure lock is always released, matching the pattern used in other subtests. * test: improve helper robustness - Make assertNoFlagError case-insensitive for pattern matching - Use defer in captureCommandOutput to restore stdout/stderr and close pipe ends to avoid FD leaks even if cmd.Do panics	4 weeks ago
chrislu	1ee03b6411	fmt	4 weeks ago
Konstantin Lebedev	dfac1315ca	fix: filer do not support IP whitelist right now #7094 (#7095 ) * fix: filer do not support IP whitelist right now #7094 * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	4 weeks ago
Chris Lu	2d06ddab41	Remove default concurrent upload/download limits for best performance (#7712 ) Change all concurrentUploadLimitMB and concurrentDownloadLimitMB defaults from fixed values (64, 128, 256 MB) to 0 (unlimited). This removes artificial throttling that can limit throughput on high-performance systems, especially on all-flash setups with many cores. Files changed: - volume.go: concurrentUploadLimitMB 256->0, concurrentDownloadLimitMB 256->0 - server.go: filer/volume/s3 concurrent limits 64/128->0 - s3.go: concurrentUploadLimitMB 128->0 - filer.go: concurrentUploadLimitMB 128->0, s3.concurrentUploadLimitMB 128->0 Users can still set explicit limits if needed for resource management.	4 weeks ago
Chris Lu	924d410dc8	fix: weed shell can't connect to master when no volume servers (#7710 ) fix: weed shell can't connect to master when no volume servers (#7701) When there are no volume servers registered, the master's KeepConnected handler would not send any initial message to clients. This caused the shell's masterClient to block indefinitely on stream.Recv(), preventing it from setting currentMaster and completing the connection handshake. The fix ensures the master always sends at least one message with leader information to newly connected clients, even when ToVolumeLocations() returns an empty slice.	4 weeks ago
chrislu	2188d1ccc5	fix object name	4 weeks ago
Chris Lu	4c36cd04d6	mount: add periodic metadata sync to protect chunks from orphan cleanup (#7700 ) mount: add periodic metadata flush to protect chunks from orphan cleanup When a file is opened via FUSE mount and written for a long time without being closed, chunks are uploaded to volume servers but the file metadata (containing chunk references) is only saved to the filer on file close. If volume.fsck runs during this window, it may identify these chunks as orphans (not referenced in filer metadata) and purge them, causing data loss. This commit adds a background task that periodically flushes file metadata for open files to the filer, ensuring chunk references are visible to volume.fsck even before files are closed. New option: -metadataFlushSeconds (default: 120) Interval in seconds for flushing dirty file metadata to filer. Set to 0 to disable. Fixes: https://github.com/seaweedfs/seaweedfs/issues/7649	4 weeks ago
jfburdet	27a28faf49	Fix s3 versioning listing bugs (#7705 ) * fix: add pagination to list-object-versions for buckets with >1000 objects The findVersionsRecursively() function used a fixed limit of 1000 entries without pagination. This caused objects beyond the first 1000 entries (sorted alphabetically) to never appear in list-object-versions responses. Changes: - Add pagination loop using filer.PaginationSize (1024) - Use isLast flag from s3a.list() to detect end of pagination - Track startFrom marker for each page 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: prevent infinite loop in ListObjects when processing .versions directories The doListFilerEntries() function processes .versions directories in a secondary loop after the main entry loop, but failed to update nextMarker. This caused infinite pagination loops when results were truncated, as the same .versions directories would be reprocessed on each page. Bug introduced by: `c196d03951` ("fix listing object versions (#7006)") 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	4 weeks ago
Chris Lu	c153420022	filer: add write batching for FoundationDB store to improve throughput (#7708 ) This addresses issue #7699 where FoundationDB filer store had low throughput (~400-500 obj/s) due to each write operation creating a separate transaction. Changes: - Add writeBatcher that collects multiple writes into batched transactions - New config options: batch_size (default: 100), batch_interval (default: 5ms) - Batching provides ~5.7x throughput improvement (from ~456 to ~2600 obj/s) Benchmark results with different batch sizes: - batch_size=1: ~456 obj/s (baseline, no batching) - batch_size=10: ~2621 obj/s (5.7x improvement) - batch_size=16: ~2514 obj/s (5.5x improvement) - batch_size=100: ~2617 obj/s (5.7x improvement) - batch_size=1000: ~2593 obj/s (5.7x improvement) The batch_interval timer (5ms) ensures writes are flushed promptly even when batch is not full, providing good latency characteristics. Addressed review feedback: - Changed wait=false to wait=true in UpdateEntry/DeleteEntry to properly propagate errors to callers - Fixed timer reset race condition by stopping and draining before reset Fixes #7699	4 weeks ago
Chris Lu	ae7333d28e	fix: cache successful volume lookups instead of failed ones (#7698 ) The condition was inverted - it was caching lookups with errors instead of successful lookups. This caused every replicated write to make a gRPC call to master for volume location lookup, resulting in ~1 second latency for writeToReplicas. The bug particularly affected TTL volumes because: - More unique volumes are created (separate pools per TTL) - Volumes expire and get recreated frequently - Each new volume requires a fresh lookup (cache miss) - Higher volume churn = more cache misses = more master lookups With this fix, successful lookups are cached for 10 minutes, reducing replication latency from ~1s to ~10ms for cached volumes.	1 month ago
Chris Lu	0cd9f34177	mount: improve EnsureVisited performance with dedup, parallelism, and batching (#7697 ) * mount: add singleflight to deduplicate concurrent EnsureVisited calls When multiple goroutines access the same uncached directory simultaneously, they would all make redundant network requests to the filer. This change uses singleflight.Group to ensure only one goroutine fetches the directory entries while others wait for the result. This fixes a race condition where concurrent lookups or readdir operations on the same uncached directory would: 1. Make duplicate network requests to the filer 2. Insert duplicate entries into LevelDB cache 3. Waste CPU and network bandwidth * mount: fetch parent directories in parallel during EnsureVisited Previously, when accessing a deep path like /a/b/c/d, the parent directories were fetched serially from target to root. This change: 1. Collects all uncached directories from target to root first 2. Fetches them all in parallel using errgroup 3. Relies on singleflight (from previous commit) for deduplication This reduces latency when accessing deep uncached paths, especially in high-latency network environments where parallel requests can significantly improve performance. * mount: add batch inserts for LevelDB meta cache When populating the meta cache from filer, entries were inserted one-by-one into LevelDB. This change: 1. Adds BatchInsertEntries method to LevelDBStore that uses LevelDB's native batch write API 2. Updates MetaCache to keep a direct reference to the LevelDB store for batch operations 3. Modifies doEnsureVisited to collect entries and insert them in batches of 100 entries Batch writes are more efficient because: - Reduces number of individual write operations - Reduces disk syncs - Improves throughput for large directories * mount: fix potential nil dereference in MarkChildrenCached Add missing check for inode existence in inode2path map before accessing the InodeEntry. This prevents a potential nil pointer dereference if the inode exists in path2inode but not in inode2path (which could happen due to race conditions or bugs). This follows the same pattern used in IsChildrenCached which properly checks for existence before accessing the entry. * mount: fix batch flush when last entry is hidden The previous batch insert implementation relied on the isLast flag to flush remaining entries. However, if the last entry is a hidden system entry (like 'topics' or 'etc' in root), the callback returns early and the remaining entries in the batch are never flushed. Fix by: 1. Only flush when batch reaches threshold inside the callback 2. Flush any remaining entries after ReadDirAllEntries completes 3. Use error wrapping instead of logging+returning to avoid duplicate logs 4. Create new slice after flush to allow GC of flushed entries 5. Add documentation for batchInsertSize constant This ensures all entries are properly inserted regardless of whether the last entry is hidden, and prevents memory retention issues. * mount: add context support for cancellation in EnsureVisited Thread context.Context through the batch insert call chain to enable proper cancellation and timeout support: 1. Use errgroup.WithContext() so if one fetch fails, others are cancelled 2. Add context parameter to BatchInsertEntries for consistency with InsertEntry 3. Pass context to ReadDirAllEntries for cancellation during network calls 4. Check context cancellation before starting work in doEnsureVisited 5. Use %w for error wrapping to preserve error types for inspection This prevents unnecessary work when one directory fetch fails and makes the batch operations consistent with the existing context-aware APIs.	1 month ago
Chris Lu	1e1473ef4a	mount: improve NFS directory listing (#7696 ) mount: remove unused isEarlyTerminated variable The variable was redundant because when processEachEntryFn returns false, we immediately return fuse.OK, so the check was always false.	1 month ago
Bruce Zou	4bc2b6d62b	fix nfs list with prefix batch scan (#7694 ) * fix nfs list with prefix batch scan * remove else branch	1 month ago
Chris Lu	d970c15d71	fix: prevent filer.backup stall in single-filer setups (#7695 ) * fix: prevent filer.backup stall in single-filer setups (#4977) When MetaAggregator.MetaLogBuffer is empty (which happens in single-filer setups with no peers), ReadFromBuffer was returning nil error, causing LoopProcessLogData to enter an infinite wait loop on ListenersCond. This fix returns ResumeFromDiskError instead, allowing SubscribeMetadata to loop back and read from persisted logs on disk. This ensures filer.backup continues processing events even when the in-memory aggregator buffer is empty. Fixes #4977 * test: add integration tests for metadata subscription Add integration tests for metadata subscription functionality: - TestMetadataSubscribeBasic: Tests basic subscription and event receiving - TestMetadataSubscribeSingleFilerNoStall: Regression test for #4977, verifies subscription doesn't stall under high load in single-filer setups - TestMetadataSubscribeResumeFromDisk: Tests resuming subscription from disk Related to #4977 * ci: add GitHub Actions workflow for metadata subscribe tests Add CI workflow that runs on: - Push/PR to master affecting filer, log_buffer, or metadata subscribe code - Runs the integration tests for metadata subscription - Uploads logs on failure for debugging Related to #4977 * fix: use multipart form-data for file uploads in integration tests The filer expects multipart/form-data for file uploads, not raw POST body. This fixes the 'Content-Type isn't multipart/form-data' error. * test: use -peers=none for faster master startup * test: add -peers=none to remaining master startup in ec tests * fix: use filer HTTP port 8888, WithFilerClient adds 10000 for gRPC WithFilerClient calls ToGrpcAddress() which adds 10000 to the port. Passing 18888 resulted in connecting to 28888. Use 8888 instead. * test: add concurrent writes and million updates tests - TestMetadataSubscribeConcurrentWrites: 50 goroutines writing 20 files each - TestMetadataSubscribeMillionUpdates: 1 million metadata entries via gRPC (metadata only, no actual file content for speed) * fix: address PR review comments - Handle os.MkdirAll errors explicitly instead of ignoring - Handle log file creation errors with proper error messages - Replace silent event dropping with 100ms timeout and warning log * Update metadata_subscribe_integration_test.go	1 month ago
Chris Lu	1b13324fb7	fix: skip log files with deleted volumes in filer backup (#7692 ) fix: skip log files with deleted volumes in filer backup (#3720) When filer.backup or filer.meta.backup resumes after being stopped, it may encounter persisted log files stored on volumes that have since been deleted (via volume.deleteEmpty -force). Previously, this caused the backup to get stuck in an infinite retry loop with 'volume X not found' errors. This fix catches 'volume not found' errors when reading log files and skips the problematic file instead of failing. The backup will now: - Log a warning about the missing volume - Skip the problematic log file - Continue with the next log file, allowing progress The VolumeNotFoundPattern regex was already defined but never used - this change puts it to use. Fixes #3720	1 month ago
Chris Lu	80c7de8d76	Helm Charts: add admin and worker to helm charts (#7688 ) * add admin and worker to helm charts * workers are stateless, admin is stateful * removed the duplicate admin-deployment.yaml * address comments * address comments * purge * Update README.md * Update k8s/charts/seaweedfs/templates/admin/admin-ingress.yaml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * address comments * address comments * supports Kubernetes versions from v1.14 to v1.30+, ensuring broad compatibility * add probe for workers * address comments * add a todo * chore: trigger CI * use port name for probes in admin statefulset * fix: remove trailing blank line in values.yaml * address code review feedback - Quote admin credentials in shell command to handle special characters - Remove unimplemented capabilities (remote, replication) from worker defaults - Add security note about admin password character restrictions --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	1 month ago
Chris Lu	0ede657a34	fix: return error on size mismatch in ReadNeedleMeta for consistency (#7687 ) * fix: return error on size mismatch in ReadNeedleMeta for consistency When ReadNeedleMeta encounters a size mismatch at offset >= MaxPossibleVolumeSize, it previously just continued without returning an error, potentially using wrong data. This fix makes ReadNeedleMeta consistent with ReadBytes (needle_read.go), which properly returns an error in both cases: - ErrorSizeMismatch when offset < MaxPossibleVolumeSize (to trigger retry at offset+32GB) - A descriptive error when offset >= MaxPossibleVolumeSize (after retry failed) Fixes #7673 * refactor: use more accurate error message for size mismatch	1 month ago
Chris Lu	5c27522507	fix: prevent empty .vif files from ec.decode causing parse errors (#7686 ) * fix: prevent empty .vif files from ec.decode causing parse errors When ec.decode copies .vif files from EC shard nodes, if a source node doesn't have the .vif file, an empty .vif file was created on the target node. This caused volume.configure.replication to fail with 'proto: syntax error' when trying to parse the empty file. This fix: 1. In writeToFile: Remove empty files when no data was written (source file was not found) to avoid leaving corrupted empty files 2. In MaybeLoadVolumeInfo: Handle empty .vif files gracefully by treating them as non-existent, allowing the system to create a proper one Fixes #7666 * refactor: remove redundant dst.Close() and add error logging Address review feedback: - Remove redundant dst.Close() call since defer already handles it - Add error logging for os.Remove() failure	1 month ago
Chen Pu	40eee23be9	mount: fix weed inode nlookup do not equel kernel inode nlookup (#7682 ) * mount: fix weed inode nlookup do not equel kernel inode nlookup * mount: add underflow protection for nlookup decrement in Forget * mount: use consistent == 0 check for uint64 nlookup * Update weed/mount/inode_to_path.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * mount: snapshot data before unlock in Forget to avoid using deleted InodeEntry --------- Co-authored-by: chrislu <chris.lu@gmail.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1 month ago
Chris Lu	b4e2cca204	s3api: remove redundant auth verification in getRequestDataReader (#7685 ) * s3api: remove redundant auth verification in getRequestDataReader The handlers PutObjectHandler and PutObjectPartHandler are already wrapped with s3a.iam.Auth() middleware which performs signature verification via authRequest() before the handler is invoked. The signature verification for authTypeSignedV2, authTypePresignedV2, authTypePresigned, and authTypeSigned in getRequestDataReader was therefore redundant. The newChunkedReader() call for streaming auth types is kept as it's needed to parse the chunked transfer encoding and extract the actual data. Fixes #7683 * simplify switch to if statement for single condition	1 month ago
Chris Lu	d6d893c8c3	s3: add s3:ExistingObjectTag condition support for bucket policies (#7677 ) * s3: add s3:ExistingObjectTag condition support in policy engine Add support for s3:ExistingObjectTag/<tag-key> condition keys in bucket policies, allowing access control based on object tags. Changes: - Add ObjectEntry field to PolicyEvaluationArgs (entry.Extended metadata) - Update EvaluateConditions to handle s3:ExistingObjectTag/<key> format - Extract tag value from entry metadata using X-Amz-Tagging-<key> prefix This enables policies like: { "Condition": { "StringEquals": { "s3:ExistingObjectTag/status": ["public"] } } } Fixes: https://github.com/seaweedfs/seaweedfs/issues/7447 * s3: update EvaluatePolicy to accept object entry for tag conditions Update BucketPolicyEngine.EvaluatePolicy to accept objectEntry parameter (entry.Extended metadata) for evaluating tag-based policy conditions. Changes: - Add objectEntry parameter to EvaluatePolicy method - Update callers in auth_credentials.go and s3api_bucket_handlers.go - Pass nil for objectEntry in auth layer (entry fetched later in handlers) For tag-based conditions to work, handlers should call EvaluatePolicy with the object's entry.Extended after fetching the entry from filer. * s3: add tests for s3:ExistingObjectTag policy conditions Add comprehensive tests for object tag-based policy conditions: - TestExistingObjectTagCondition: Basic tag matching scenarios - Matching/non-matching tag values - Missing tags, no tags, empty tags - Multiple tags with one matching - TestExistingObjectTagConditionMultipleTags: Multiple tag conditions - Both tags match - Only one tag matches - TestExistingObjectTagDenyPolicy: Deny policies with tag conditions - Default allow without tag - Deny when specific tag present * s3: document s3:ExistingObjectTag support and feature status Update policy engine documentation: - Add s3:ExistingObjectTag/<tag-key> to supported condition keys - Add 'Object Tag-Based Access Control' section with examples - Add 'Feature Status' section with implemented and planned features Planned features for future implementation: - s3:RequestObjectTag/<key> - s3:RequestObjectTagKeys - s3:x-amz-server-side-encryption - Cross-account access * Implement tag-based policy re-check in handlers - Add checkPolicyWithEntry helper to S3ApiServer for handlers to re-check policy after fetching object entry (for s3:ExistingObjectTag conditions) - Add HasPolicyForBucket method to policy engine for efficient check - Integrate policy re-check in GetObjectHandler after entry is fetched - Integrate policy re-check in HeadObjectHandler after entry is fetched - Update auth_credentials.go comments to explain two-phase evaluation - Update documentation with supported operations for tag-based conditions This implements 'Approach 1' where handlers re-check the policy with the object entry after fetching it, allowing tag-based conditions to be properly evaluated. * Add integration tests for s3:ExistingObjectTag conditions - Add TestCheckPolicyWithEntry: tests checkPolicyWithEntry helper with various tag scenarios (matching tags, non-matching tags, empty entry, nil entry) - Add TestCheckPolicyWithEntryNoPolicyForBucket: tests early return when no policy - Add TestCheckPolicyWithEntryNilPolicyEngine: tests nil engine handling - Add TestCheckPolicyWithEntryDenyPolicy: tests deny policies with tag conditions - Add TestHasPolicyForBucket: tests HasPolicyForBucket method These tests cover the Phase 2 policy evaluation with object entry metadata, ensuring tag-based conditions are properly evaluated. * Address code review nitpicks - Remove unused extractObjectTags placeholder function (engine.go) - Add clarifying comment about s3:ExistingObjectTag/<key> evaluation - Consolidate duplicate tag-based examples in README - Factor out tagsToEntry helper to package level in tests * Address code review feedback - Fix unsafe type assertions in GetObjectHandler and HeadObjectHandler when getting identity from context (properly handle type assertion failure) - Extract getConditionContextValue helper to eliminate duplicated logic between EvaluateConditions and EvaluateConditionsLegacy - Ensure consistent handling of missing condition keys (always return empty slice) * Fix GetObjectHandler to match HeadObjectHandler pattern Add safety check for nil objectEntryForSSE before tag-based policy evaluation, ensuring tag-based conditions are always evaluated rather than silently skipped if entry is unexpectedly nil. Addresses review comment from Copilot. * Fix HeadObject action name in docs for consistency Change 'HeadObject' to 's3:HeadObject' to match other action names. * Extract recheckPolicyWithObjectEntry helper to reduce duplication Move the repeated identity extraction and policy re-check logic from GetObjectHandler and HeadObjectHandler into a shared helper method. * Add validation for empty tag key in s3:ExistingObjectTag condition Prevent potential issues with malformed policies containing s3:ExistingObjectTag/ (empty tag key after slash).	1 month ago
Chris Lu	d5f21fd8ba	fix: add missing backslash for volume extraArgs in helm chart (#7676 ) Fixes #7467 The -mserver argument line in volume-statefulset.yaml was missing a trailing backslash, which prevented extraArgs from being passed to the weed volume process. Also: - Extracted master server list generation logic into shared helper templates in _helpers.tpl for better maintainability - Updated all occurrences of deprecated -mserver flag to -master across docker-compose files, test files, and documentation	1 month ago
Chris Lu	cea12ba3c4	fix: prevent makeslice panic in ReadNeedleMeta with corrupted needle (#7675 ) * fix: prevent makeslice panic in ReadNeedleMeta with corrupted needle When a needle's DataSize in the .dat file is corrupted to a very large value, the calculation of metaSize can become negative, causing a panic with 'makeslice: len out of range' when creating the metadata slice. This fix adds validation to check if metaSize is negative before creating the slice, returning a descriptive error instead of panicking. Fixes #7475 * Update weed/storage/needle/needle_read_page.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1 month ago
Chris Lu	9196696278	mount: add mutex to DirectoryHandle to fix race condition (#7674 ) * mount: add mutex to DirectoryHandle to fix race condition When using Ganesha NFS on top of FUSE mount, ls operations would hang forever on directories with hundreds of files. This was caused by a race condition in DirectoryHandle where multiple concurrent readdir operations could modify shared state (entryStream, entryStreamOffset, isFinished) without synchronization. The fix adds a mutex to DirectoryHandle and holds it for the entire duration of doReadDirectory. This serializes concurrent readdir calls on the same handle, which is the correct behavior for a directory handle and fixes the race condition. Key changes: - Added sync.Mutex to DirectoryHandle struct - Lock the mutex at the start of doReadDirectory - This ensures thread-safe access to entryStream and other state The lock is per-handle (not global), so different directories can still be listed concurrently. Only concurrent operations on the same directory handle are serialized. Fixes: https://github.com/seaweedfs/seaweedfs/issues/7672 * mount: add mutex to DirectoryHandle to fix race condition When using Ganesha NFS on top of FUSE mount, ls operations would hang forever on directories with hundreds of files. This was caused by a race condition in DirectoryHandle where multiple concurrent readdir operations could modify shared state (entryStream, entryStreamOffset, isFinished) without synchronization. The fix adds a mutex to DirectoryHandle and holds it for the entire duration of doReadDirectory. This serializes concurrent readdir calls on the same handle, which is the correct behavior for a directory handle and fixes the race condition. Key changes: - Added sync.Mutex to DirectoryHandle struct - Lock the mutex at the start of doReadDirectory - Optimized reset() to reuse slice capacity and allow GC of old entries The lock is per-handle (not global), so different directories can still be listed concurrently. Only concurrent operations on the same directory handle are serialized. Fixes: https://github.com/seaweedfs/seaweedfs/issues/7672	1 month ago
Chris Lu	ff4855dcbe	sts: limit session duration to incoming token's exp claim (#7670 ) * sts: limit session duration to incoming token's exp claim This fixes the issue where AssumeRoleWithWebIdentity would issue sessions that outlive the source identity token's expiration. For use cases like GitLab CI Jobs where the ID Token has an exp claim limited to the CI job's timeout, the STS session should not exceed that expiration. Changes: - Add TokenExpiration field to ExternalIdentity struct - Extract exp/iat/nbf claims in OIDC provider's ValidateToken - Pass token expiration from Authenticate to ExternalIdentity - Modify calculateSessionDuration to cap at source token's exp - Add comprehensive tests for the new behavior Fixes: https://github.com/seaweedfs/seaweedfs/discussions/7653 * refactor: reduce duplication in time claim extraction Use a loop over claim names instead of repeating the same extraction logic three times for exp, iat, and nbf claims. * address review: add defense-in-depth for expired tokens - Handle already-expired tokens defensively with 1 minute minimum duration - Enforce MaxSessionLength from config as additional cap - Fix potential nil dereference in test mock - Add test case for expired token scenario * remove issue reference from test * fix: remove early return to ensure MaxSessionLength is always checked	1 month ago
Chris Lu	772459f93c	fix: restore volume mount when VolumeConfigure fails (#7669 ) * fix: restore volume mount when VolumeConfigure fails When volume.configure.replication command fails (e.g., due to corrupted .vif file), the volume was left unmounted and the master was already notified that the volume was deleted, causing the volume to disappear. This fix attempts to re-mount the volume when ConfigureVolume fails, restoring the volume state and preventing data loss. Fixes #7666 * include mount restore error in response message	1 month ago
Chris Lu	086ab3e28c	Fix webhook duplicate deliveries and POST to GET conversion (#7668 ) * Fix webhook duplicate deliveries and POST to GET conversion Fixes #7667 This commit addresses two critical issues with the webhook notification system: 1. Duplicate webhook deliveries based on worker count 2. POST requests being converted to GET when following redirects Issue 1: Multiple webhook deliveries ------------------------------------ Problem: The webhook queue was creating multiple handlers (one per worker) that all subscribed to the same topic. With Watermill's gochannel, each handler creates a separate subscription, and all subscriptions receive their own copy of every message, resulting in duplicate webhook calls equal to the worker count. Solution: Use a single handler instead of multiple handlers to ensure each webhook event is sent only once, regardless of worker configuration. Issue 2: POST to GET conversion with intelligent redirect handling ------------------------------------------------------------------ Problem: When webhook endpoints returned redirects (301/302/303), Go's default HTTP client would automatically follow them and convert POST requests to GET requests per HTTP specification. Solution: Implement intelligent redirect handling that: - Prevents automatic redirects to preserve POST method - Manually follows redirects by recreating POST requests - Caches the final redirect destination for performance - Invalidates cache and retries on failures (network or HTTP errors) - Provides automatic recovery from cached endpoint failures Benefits: - Webhooks are now sent exactly once per event - POST method is always preserved through redirects - Reduced latency through redirect destination caching - Automatic failover when cached destinations become unavailable - Thread-safe concurrent webhook delivery Testing: - Added TestQueueNoDuplicateWebhooks to verify single delivery - Added TestHttpClientFollowsRedirectAsPost for redirect handling - Added TestHttpClientUsesCachedRedirect for caching behavior - Added cache invalidation tests for error scenarios - All 18 webhook tests pass successfully * Address code review comments - Add maxWebhookRetryDepth constant to avoid magic number - Extract cache invalidation logic into invalidateCache() helper method - Fix redirect handling to properly follow redirects even on retry attempts - Remove misleading comment about nWorkers controlling handler parallelism - Fix test assertions to match actual execution flow - Remove trailing whitespace in test file All tests passing. * Refactor: use setFinalURL() instead of invalidateCache() Replace invalidateCache() with more explicit setFinalURL() function. This is cleaner as it makes the intent clear - we're setting the URL (either to a value or to empty string to clear it), rather than having a separate function just for clearing. No functional changes, all tests passing. * Add concurrent webhook delivery using nWorkers configuration Webhooks were previously sent sequentially (one-by-one), which could be a performance bottleneck for high-throughput scenarios. Now nWorkers configuration is properly used to control concurrent webhook delivery. Implementation: - Added semaphore channel (buffered to nWorkers capacity) - handleWebhook acquires semaphore slot before sending (blocks if at capacity) - Releases slot after webhook completes - Allows up to nWorkers concurrent webhook HTTP requests Benefits: - Improved throughput for slow webhook endpoints - nWorkers config now has actual purpose (was validated but unused) - Default 5 workers provides good balance - Configurable from 1-100 workers based on needs Example performance improvement: - Before: 500ms webhook latency = ~2 webhooks/sec max - After (5 workers): 500ms latency = ~10 webhooks/sec - After (10 workers): 500ms latency = ~20 webhooks/sec All tests passing. * Replace deprecated AddNoPublisherHandler with AddConsumerHandler AddNoPublisherHandler is deprecated in Watermill. Use AddConsumerHandler instead, which is the current recommended API for handlers that only consume messages without publishing. No functional changes, all tests passing. * Drain response bodies to enable HTTP connection reuse Added drainBody() calls in all code paths to ensure response bodies are consumed before returning. This is critical for HTTP keep-alive connection reuse. Without draining: - Connections are closed after each request - New TCP handshake + TLS handshake for every webhook - Higher latency and resource usage With draining: - Connections are reused via HTTP keep-alive - Significant performance improvement for repeated webhooks - Lower latency (no handshake overhead) - Reduced resource usage Implementation: - Added drainBody() helper that reads up to 1MB (prevents memory issues) - Drain on success path (line 161) - Drain on error responses before retry (lines 119, 152) - Drain on redirect responses before following (line 118) - Already had drainResponse() for network errors (line 99) All tests passing. * Use existing CloseResponse utility instead of custom drainBody Replaced custom drainBody() function with the existing util_http.CloseResponse() utility which is already used throughout the codebase. This provides: - Consistent behavior with rest of the codebase - Better logging (logs bytes drained via CountingReader) - Full body drainage (not limited to 1MB) - Cleaner code (no duplication) CloseResponse properly drains and closes the response body to enable HTTP keep-alive connection reuse. All tests passing. * Fix: Don't overwrite original error when draining response Before: err was being overwritten by drainResponse() result After: Use drainErr to avoid losing the original client.Do() error This was a subtle bug where if drainResponse() succeeded (returned nil), we would lose the original network error and potentially return a confusing error message. All tests passing. * Optimize HTTP client: reuse client and remove redundant timeout 1. Reuse single http.Client instance instead of creating new one per request - Reduces allocation overhead - More efficient for high-volume webhooks 2. Remove redundant timeout configuration - Before: timeout set on both context AND http.Client - After: timeout only on context (cleaner, context fires first anyway) Performance benefits: - Reduced GC pressure (fewer client allocations) - Better connection pooling (single transport instance) - Cleaner code (no redundancy) All tests passing.	1 month ago
Lisandro Pin	ca1ad9c4c2	Nit: have `ec.encode` exit immediately if no volumes are processed. (#7654 ) * Nit: have `ec.encode` exit immediately if no volumes are processed. * Update weed/shell/command_ec_encode.go Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	1 month ago
chrislu	1856eaca03	Update notification.toml	1 month ago
chrislu	95f6a2bc13	Added a complete webhook configuration example	1 month ago
Trim21	0a0eb21b86	fix random volume ids in master.html (#7655 )	1 month ago
chrislu	805950b401	4.02	1 month ago
Chris Lu	cadb2eeb05	fix: ARM v7 alignment issue for 64-bit atomic operations (#7652 ) Fixes #7643 Reordered filerHealth struct fields to ensure int64 field comes first, guaranteeing 8-byte alignment required for atomic operations on 32-bit ARM architectures (ARMv7, as used in OpenWRT).	1 month ago
Chris Lu	982aae6d53	SFTP: support reloading user store on HUP signal (#7651 ) Fixes #7650 This change enables the SFTP server to reload the user store configuration (sftp_userstore.json) when a HUP signal is sent to the process, without requiring a service restart. Changes: - Add Reload() method to FileStore to re-read users from disk - Add Reload() method to SFTPService to handle reload requests - Register reload hook with grace.OnReload() in sftp command This allows administrators to add users or change access policies dynamically by editing the user store file and sending a HUP signal (e.g., 'systemctl reload seaweedfs' or 'kill -HUP <pid>').	1 month ago
Chris Lu	f5c0bcafa3	s3: fix ListBuckets not showing buckets created by authenticated users (#7648 ) * s3: fix ListBuckets not showing buckets created by authenticated users Fixes #7647 ## Problem Users with proper Admin permissions could create buckets but couldn't list them. The issue occurred because ListBucketsHandler was not wrapped with the Auth middleware, so the authenticated identity was never set in the request context. ## Root Cause - PutBucketHandler uses iam.Auth() middleware which sets identity in context - ListBucketsHandler did NOT use iam.Auth() middleware - Without the middleware, GetIdentityNameFromContext() returned empty string - Bucket ownership checks failed because no identity was present ## Changes 1. Wrap ListBucketsHandler with iam.Auth() middleware (s3api_server.go) 2. Update ListBucketsHandler to get identity from context (s3api_bucket_handlers.go) 3. Add lookupByIdentityName() helper method (auth_credentials.go) 4. Add comprehensive test TestListBucketsIssue7647 (s3api_bucket_handlers_test.go) ## Testing - All existing tests pass (1348 tests in s3api package) - New test TestListBucketsIssue7647 validates the fix - Verified admin users can see their created buckets - Verified admin users can see all buckets - Verified backward compatibility maintained * s3: fix ListBuckets for JWT/Keycloak authentication The previous fix broke JWT/Keycloak authentication because JWT identities are created on-the-fly and not stored in the iam.identities list. The lookupByIdentityName() would return nil for JWT users. Solution: Store the full Identity object in the request context, not just the name. This allows ListBucketsHandler to retrieve the complete identity for all authentication types (SigV2, SigV4, JWT, Anonymous). Changes: - Add SetIdentityInContext/GetIdentityFromContext in s3_constants/header.go - Update Auth middleware to store full identity in context - Update ListBucketsHandler to retrieve identity from context first, with fallback to lookup for backward compatibility * s3: optimize lookupByIdentityName to O(1) using map Address code review feedback: Use a map for O(1) lookups instead of O(N) linear scan through identities list. Changes: - Add nameToIdentity map to IdentityAccessManagement struct - Populate map in loadS3ApiConfiguration (consistent with accessKeyIdent pattern) - Update lookupByIdentityName to use map lookup instead of loop This improves performance when many identities are configured and aligns with the existing pattern used for accessKeyIdent lookups. * s3: address code review feedback on nameToIdentity and logging Address two code review points: 1. Wire nameToIdentity into env-var fallback path - The AWS env-var fallback in NewIdentityAccessManagementWithStore now populates nameToIdentity map along with accessKeyIdent - Keeps all identity lookup maps in sync - Avoids potential issues if handlers rely on lookupByIdentityName 2. Improve access key lookup logging - Reduce log verbosity: V(1) -> V(2) for failed lookups - Truncate access keys in logs (show first 4 chars + **) - Include key length for debugging - Prevents credential exposure in production logs - Reduces log noise from misconfigured clients fmt * s3: refactor truncation logic and improve error handling Address additional code review feedback: 1. DRY principle: Extract key truncation logic into local function - Define truncate() helper at function start - Reuse throughout lookupByAccessKey - Eliminates code duplication 2. Enhanced security: Mask very short access keys - Keys <= 4 chars now show as '***' instead of full key - Prevents any credential exposure even for short keys - Consistent masking across all log statements 3. Improved robustness: Add warning log for type assertion failure - Log unexpected type when identity context object is wrong type - Helps debug potential middleware or context issues - Better production diagnostics 4. Documentation: Add comment about future optimization opportunity - Note potential for lightweight identity view in context - Suggests credential-free view for better data minimization - Documents design decision for future maintainers	1 month ago
Chris Lu	a9b3be416b	fix: initialize missing S3 options in filer to prevent nil pointer dereference (#7646 ) * fix: initialize missing S3 options in filer to prevent nil pointer dereference Fixes #7644 When starting the S3 gateway from the filer, several S3Options fields were not being initialized, which could cause nil pointer dereferences during startup. This commit adds initialization for: - iamConfig: for advanced IAM configuration - metricsHttpPort: for Prometheus metrics endpoint - metricsHttpIp: for binding the metrics endpoint Also ensures metricsHttpIp defaults to bindIp when not explicitly set, matching the behavior of the standalone S3 server. This prevents the panic that was occurring in the s3.go:226 area when these pointer fields were accessed but never initialized. * fix: copy value instead of pointer for metricsHttpIp default Address review comment to avoid pointer aliasing. Copy the value instead of the pointer to prevent unexpected side effects if the bindIp value is modified later.	1 month ago
Chris Lu	5d53edb93b	Optimize database connection pool settings for MySQL and PostgreSQL (#7645 ) - Reduce connection_max_idle from 100 to 10 (PostgreSQL) and from 2 to 10 (MySQL) - Reduce connection_max_open from 100 to 50 for all database stores - Set connection_max_lifetime_seconds to 300 (5 minutes) to force connection recycling This prevents 'cannot assign requested address' errors under high load by: 1. Limiting the number of concurrent connections to reduce ephemeral port usage 2. Forcing connection recycling to prevent stale connections and port exhaustion 3. Reducing idle connections to minimize resource consumption Fixes #6887	1 month ago
chrislu	5167bbd2a9	Remove deprecated allowEmptyFolder CLI option The allowEmptyFolder option is no longer functional because: 1. The code that used it was already commented out 2. Empty folder cleanup is now handled asynchronously by EmptyFolderCleaner The CLI flags are kept for backward compatibility but marked as deprecated and ignored. This removes: - S3ApiServerOption.AllowEmptyFolder field - The actual usage in s3api_object_handlers_list.go - Helm chart values and template references - References in test Makefiles and docker-compose files	1 month ago
Chris Lu	55f0fbf364	s3: optimize DELETE by skipping lock check for buckets without Object Lock (#7642 ) This optimization avoids an expensive filer gRPC call for every DELETE operation on buckets that don't have Object Lock enabled. Before this change, enforceObjectLockProtections() would always call getObjectEntry() to fetch object metadata to check for retention/legal hold, even for buckets that never had Object Lock configured. Changes: 1. Add early return in enforceObjectLockProtections() if bucket has no Object Lock config or bucket doesn't exist 2. Add isObjectLockEnabled() helper function to check if a bucket has Object Lock configured 3. Fix validateObjectLockHeaders() to check ObjectLockConfig instead of just versioningEnabled - this ensures object-lock headers are properly rejected on buckets without Object Lock enabled, which aligns with AWS S3 semantics 4. Make bucket creation with Object Lock atomic - set Object Lock config in the same CreateEntry call as bucket creation, preventing race conditions where bucket exists without Object Lock enabled 5. Properly handle Object Lock setup failures during bucket creation - if StoreObjectLockConfigurationInExtended fails, roll back the bucket creation and return an error instead of leaving a bucket without the requested Object Lock configuration This significantly improves DELETE latency for non-Object-Lock buckets, which is the common case (lockCheck time reduced from 1-10ms to ~1µs).	1 month ago
Chris Lu	9c266fac29	fix: CompleteMultipartUpload fails for uploads with more than 1000 parts (#7641 ) When completing a multipart upload, the code was listing parts with limit=0, which relies on the server's DirListingLimit default. In 'weed server' mode, this defaults to 1000, causing uploads with more than 1000 parts to fail with InvalidPart error. For a 38GB file with 8MB parts (AWS CLI default), this results in ~4564 parts, far exceeding the 1000 limit. Fix: Use explicit limit of MaxS3MultipartParts+1 (10001) to ensure all parts are listed regardless of server configuration. Fixes #7638	1 month ago
Chris Lu	28ac536280	fix: normalize Windows backslash paths in weed admin file uploads (#7636 ) fix: normalize Windows backslash paths in file uploads When uploading files from a Windows client to a Linux server, file paths containing backslashes were not being properly interpreted as directory separators. This caused files intended for subdirectories to be created in the root directory with backslashes in their filenames. Changes: - Add util.CleanWindowsPath and util.CleanWindowsPathBase helper functions in weed/util/fullpath.go for reusable path normalization - Use path.Join/path.Clean/path.Base instead of filepath equivalents for URL path semantics (filepath is OS-specific) - Apply normalization in weed admin handlers and filer upload parsing Fixes #7628	1 month ago
Chris Lu	89b6deaefa	fix: Use mime.FormatMediaType for RFC 6266 compliant Content-Disposition (#7635 ) Updated Content-Disposition header generation to use mime.FormatMediaType from the standard library, which properly handles non-ASCII characters and special characters per RFC 6266. Changes: - weed/server/common.go: Updated adjustHeaderContentDisposition to use mime.FormatMediaType instead of manual escaping with fileNameEscaper - weed/operation/upload_content.go: Updated multipart form Content-Disposition to use mime.FormatMediaType - weed/server/volume_server_handlers_read.go: Removed unused fileNameEscaper This ensures correct filename display for international users across filer downloads and file uploads. Fixes #7634	1 month ago
Chris Lu	f1384108e8	fix: Admin UI file browser uses https.client TLS config for filer communication (#7633 ) * fix: Admin UI file browser uses https.client TLS config for filer communication When filer is configured with HTTPS (https.filer section in security.toml), the Admin UI file browser was still using plain HTTP for file uploads, downloads, and viewing. This caused TLS handshake errors: 'http: TLS handshake error: client sent an HTTP request to an HTTPS server' This fix: - Updates FileBrowserHandlers to use the HTTPClient from weed/util/http/client which properly loads TLS configuration from https.client section - The HTTPClient automatically uses HTTPS when https.client.enabled=true - All file operations (upload, download, view) now respect TLS configuration - Falls back to plain HTTP if TLS client creation fails Fixes #7631 * fix: Address code review comments - Fix fallback client Transport wiring (properly assign transport to http.Client) - Use per-operation timeouts instead of unified 60s timeout: - uploadFileToFiler: 60s (for large file uploads) - ViewFile: 30s (original timeout) - isLikelyTextFile: 10s (original timeout) * fix: Proxy file downloads through Admin UI for mTLS support The DownloadFile function previously used browser redirect, which would fail when filer requires mutual TLS (client certificates) since the browser doesn't have these certificates. Now the Admin UI server proxies the download, using its TLS-aware HTTP client with the configured client certificates, then streams the response to the browser. * fix: Ensure HTTP response body is closed on non-200 responses In ViewFile, the response body was only closed on 200 OK paths, which could leak connections on non-200 responses. Now the body is always closed via defer immediately after checking err == nil, before checking the status code. * refactor: Extract fetchFileContent helper to reduce nesting in ViewFile Extracted the deeply nested file fetch logic (7+ levels) into a separate fetchFileContent helper method. This improves readability while maintaining the same TLS-aware behavior and error handling. * refactor: Use idiomatic Go error handling in fetchFileContent Changed fetchFileContent to return (string, error) instead of (content string, reason string) for idiomatic Go error handling. This enables error wrapping and standard 'if err != nil' checks. Also improved error messages to be more descriptive for debugging, including the HTTP status code and response body on non-200 responses. * refactor: Extract newClientWithTimeout helper to reduce code duplication - Added newClientWithTimeout() helper method that creates a temporary http.Client with the specified timeout, reusing the TLS transport - Updated uploadFileToFiler, fetchFileContent, DownloadFile, and isLikelyTextFile to use the new helper - Improved error message in DownloadFile to include response body for better debuggability (consistent with fetchFileContent) * fix: Address CodeRabbit review comments - Fix connection leak in isLikelyTextFile: ensure resp.Body.Close() is called even when status code is not 200 - Use http.NewRequestWithContext in DownloadFile so the filer request is cancelled when the client disconnects, improving resource cleanup * fix: Escape Content-Disposition filename per RFC 2616 Filenames containing quotes, backslashes, or special characters could break the Content-Disposition header or cause client-side parsing issues. Now properly escapes these characters before including in the header. * fix: Handle io.ReadAll errors when reading error response bodies In fetchFileContent and DownloadFile, the error from io.ReadAll was ignored when reading the filer's error response body. Now properly handles these errors to provide complete error messages. * fix: Fail fast when TLS client creation fails If TLS is enabled (https.client.enabled=true) but misconfigured, fail immediately with glog.Fatalf rather than silently falling back to plain HTTP. This prevents confusing runtime errors when the filer only accepts HTTPS connections. * fix: Use mime.FormatMediaType for RFC 6266 compliant Content-Disposition Replace manual escaping with mime.FormatMediaType which properly handles non-ASCII characters and special characters per RFC 6266, ensuring correct filename display for international users.	1 month ago
msementsov	c0dad091f1	Separate vacuum speed from replication speed (#7632 )	1 month ago
Chris Lu	4cc6a2a4e5	fix: Admin UI user creation fails before filer discovery (#7624 ) (#7625 ) * fix: Admin UI user creation fails before filer discovery (#7624) The credential manager's filer address function was not configured quickly enough after admin server startup, causing 'filer address function not configured' errors when users tried to create users immediately. Changes: - Use exponential backoff (200ms -> 5s) instead of fixed 5s polling for faster filer discovery on startup - Improve error messages to be more user-friendly and actionable Fixes #7624 * Add more debug logging to help diagnose filer discovery issues * fix: Use dynamic filer address function to eliminate race condition Instead of using a goroutine to wait for filer discovery before setting the filer address function, we now set a dynamic function immediately that returns the current filer address whenever it's called. This eliminates the race condition where users could create users before the goroutine completed, and provides clearer error messages when no filer is available. The dynamic function is HA-aware - it automatically returns whatever filer is currently available, adapting to filer failovers.	1 month ago
Chris Lu	5c1de633cb	mount: improve read throughput with parallel chunk fetching (#7627 ) * filer: remove lock contention during chunk download This addresses issue #7504 where a single weed mount FUSE instance does not fully utilize node network bandwidth when reading large files. The SingleChunkCacher was holding a mutex during the entire HTTP download, causing readers to block until the download completed. This serialized chunk reads even when multiple goroutines were downloading in parallel. Changes: - Add sync.Cond to SingleChunkCacher for efficient waiting - Move HTTP download outside the critical section in startCaching() - Use condition variable in readChunkAt() to wait for download completion - Add isComplete flag to track download state Now multiple chunk downloads can proceed truly in parallel, and readers wait efficiently using the condition variable instead of blocking on a mutex held during I/O operations. Ref: #7504 * filer: parallel chunk fetching within doReadAt This addresses issue #7504 by enabling parallel chunk downloads within a single read operation. Previously, doReadAt() processed chunks sequentially in a loop, meaning each chunk had to be fully downloaded before the next one started. This left significant network bandwidth unused when chunks resided on different volume servers. Changes: - Collect all chunk read tasks upfront - Use errgroup to fetch multiple chunks in parallel - Each chunk reads directly into its correct buffer position - Limit concurrency to prefetchCount (min 4) to avoid overwhelming the system - Handle gaps and zero-filling before parallel fetch - Trigger prefetch after parallel reads complete For a read spanning N chunks on different volume servers, this can now utilize up to N times the bandwidth of a single connection. Ref: #7504 * http: direct buffer read to reduce memory copies This addresses issue #7504 by reducing memory copy overhead during chunk downloads. Previously, RetriedFetchChunkData used ReadUrlAsStream which: 1. Allocated a 64KB intermediate buffer 2. Read data in 64KB chunks 3. Called a callback to copy each chunk to the destination For a 16MB chunk, this meant 256 copy operations plus the callback overhead. Profiling showed significant time spent in memmove. Changes: - Add readUrlDirectToBuffer() that reads directly into the destination - Add retriedFetchChunkDataDirect() for unencrypted, non-gzipped chunks - Automatically use direct read path when possible (cipher=nil, gzip=false) - Use http.NewRequestWithContext for proper cancellation For unencrypted chunks (the common case), this eliminates the intermediate buffer entirely, reading HTTP response bytes directly into the final destination buffer. Ref: #7504 * address review comments - Use channel (done) instead of sync.Cond for download completion signaling This integrates better with context cancellation patterns - Remove redundant groupErr check in reader_at.go (errors are already captured in task.err) - Remove buggy URL encoding logic from retriedFetchChunkDataDirect (The existing url.PathEscape on full URL is a pre-existing bug that should be fixed separately) * address review comments (round 2) - Return io.ErrUnexpectedEOF when HTTP response is truncated This prevents silent data corruption from incomplete reads - Simplify errgroup error handling by using g.Wait() error directly Remove redundant task.err field and manual error aggregation loop - Define minReadConcurrency constant instead of magic number 4 Improves code readability and maintainability Note: Context propagation to startCaching() is intentionally NOT changed. The downloaded chunk is a shared resource that may be used by multiple readers. Using context.Background() ensures the download completes even if one reader cancels, preventing data loss for other waiting readers. * http: inject request ID for observability in direct read path Add request_id.InjectToRequest() call to readUrlDirectToBuffer() for consistency with ReadUrlAsStream path. This ensures full-chunk reads carry the same tracing/correlation headers for server logs and metrics. * filer: consistent timestamp handling in sequential read path Use max(ts, task.chunk.ModifiedTsNs) in sequential path to match parallel path behavior. Also update ts before error check so that on failure, the returned timestamp reflects the max of all chunks processed so far. * filer: document why context.Background() is used in startCaching Add comment explaining the intentional design decision: the downloaded chunk is a shared resource that may be used by multiple concurrent readers. Using context.Background() ensures the download completes even if one reader cancels, preventing errors for other waiting readers. * filer: propagate context for reader cancellation Address review comment: pass context through ReadChunkAt call chain so that a reader can cancel its wait for a download. The key distinction is: - Download uses context.Background() - shared resource, always completes - Reader wait uses request context - can be cancelled individually If a reader cancels, it stops waiting and returns ctx.Err(), but the download continues to completion for other readers waiting on the same chunk. This properly handles the shared resource semantics while still allowing individual reader cancellation. * filer: use defer for close(done) to guarantee signal on panic Move close(s.done) to a defer statement at the start of startCaching() to ensure the completion signal is always sent, even if an unexpected panic occurs. This prevents readers from blocking indefinitely. * filer: remove unnecessary code - Remove close(s.cacheStartedCh) in destroy() - the channel is only used for one-time synchronization, closing it provides no benefit - Remove task := task loop variable capture - Go 1.22+ fixed loop variable semantics, this capture is no longer necessary (go.mod specifies Go 1.24.0) * filer: restore fallback to chunkCache when cacher returns no data Fix critical issue where ReadChunkAt would return 0,nil immediately if SingleChunkCacher couldn't provide data for the requested offset, without trying the chunkCache fallback. Now if cacher.readChunkAt returns n=0 and err=nil, we fall through to try chunkCache. * filer: add comprehensive tests for ReaderCache Tests cover: - Context cancellation while waiting for download - Fallback to chunkCache when cacher returns n=0, err=nil - Multiple concurrent readers waiting for same chunk - Partial reads at different offsets - Downloader cleanup when exceeding cache limit - Done channel signaling (no hangs on completion) * filer: prioritize done channel over context cancellation If data is already available (done channel closed), return it even if the reader's context is also cancelled. This avoids unnecessary errors when the download has already completed. * filer: add lookup error test and document test limitations Add TestSingleChunkCacherLookupError to test error handling when lookup fails. Document that full HTTP integration tests for SingleChunkCacher require global HTTP client initialization which is complex in unit tests. The download path is tested via FUSE integration tests. * filer: add tests that exercise SingleChunkCacher concurrency logic Add tests that use blocking lookupFileIdFn to exercise the actual SingleChunkCacher wait/cancellation logic: - TestSingleChunkCacherContextCancellationDuringLookup: tests reader cancellation while lookup is blocked - TestSingleChunkCacherMultipleReadersWaitForDownload: tests multiple readers waiting on the same download - TestSingleChunkCacherOneReaderCancelsOthersContinue: tests that when one reader cancels, other readers continue waiting These tests properly exercise the done channel wait/cancel logic without requiring HTTP calls - the blocking lookup simulates a slow download.	1 month ago

1 2 3 4 5 ...

7803 Commits (b0e0c5aaabd393ca633c9c1e5d24d15d47e05bec)