* s3: fix presigned POST upload missing slash between bucket and key
When uploading a file using presigned POST (e.g., boto3.generate_presigned_post),
the file was saved with the bucket name and object key concatenated without a
slash (e.g., 'my-bucketfilename' instead of 'my-bucket/filename').
The issue was that PostPolicyBucketHandler retrieved the object key from form
values without ensuring it had a leading slash, unlike GetBucketAndObject()
which normalizes the key.
Fixes #7713
* s3: add tests for presigned POST key normalization
Add comprehensive tests for PostPolicyBucketHandler to ensure:
- Object keys without leading slashes are properly normalized
- ${filename} substitution works correctly with normalization
- Path construction correctly separates bucket and key
- Form value extraction works properly
These tests would have caught the bug fixed in the previous commit,
where keys like 'test_image.png' were concatenated with the bucket name
without a separator, resulting in 'my-buckettest_image.png'.
* s3: create normalizeObjectKey function for robust key normalization
Address review feedback by creating a reusable normalizeObjectKey function
that both adds a leading slash and removes duplicate slashes, aligning with
how other handlers process paths (e.g., toFilerPath uses removeDuplicateSlashes).
The function handles edge cases like:
- Keys without leading slashes (the original bug)
- Keys with duplicate slashes (e.g., 'a//b' -> '/a/b')
- Keys with leading duplicate slashes (e.g., '///a' -> '/a')
Updated tests to use the new function and added TestNormalizeObjectKey
for comprehensive coverage of the new function.
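A minimal sketch of the normalization behavior described above (illustrative only, not the exact SeaweedFS helper):
```go
package sketch

import "strings"

// normalizeObjectKey adds a leading slash and collapses duplicate slashes,
// e.g. "test_image.png" -> "/test_image.png", "a//b" -> "/a/b", "///a" -> "/a".
func normalizeObjectKey(key string) string {
	if !strings.HasPrefix(key, "/") {
		key = "/" + key
	}
	for strings.Contains(key, "//") {
		key = strings.ReplaceAll(key, "//", "/")
	}
	return key
}
```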
* s3: move NormalizeObjectKey to s3_constants for shared use
Move the NormalizeObjectKey function to the s3_constants package so it can
be reused by:
- GetBucketAndObject() - now normalizes all object keys from URL paths
- GetPrefix() - now normalizes prefix query parameters
- PostPolicyBucketHandler - normalizes keys from form values
This ensures consistent object key normalization across all S3 API handlers,
handling both missing leading slashes and duplicate slashes.
Benefits:
- Single source of truth for key normalization
- GetBucketAndObject now removes duplicate slashes (previously only added leading slash)
- All handlers benefit from the improved normalization automatically
* ec: add diskType parameter to core EC functions
Add diskType parameter to:
- ecBalancer struct
- collectEcVolumeServersByDc()
- collectEcNodesForDC()
- collectEcNodes()
- EcBalance()
This allows EC operations to target specific disk types (hdd, ssd, etc.)
instead of being hardcoded to HardDriveType only.
For backward compatibility, all callers currently pass types.HardDriveType
as the default value. Subsequent commits will add -diskType flags to
the individual EC commands.
* ec: update helper functions to use configurable diskType
Update the following functions to accept/use diskType parameter:
- findEcVolumeShards()
- addEcVolumeShards()
- deleteEcVolumeShards()
- moveMountedShardToEcNode()
- countShardsByRack()
- pickNEcShardsToMoveFrom()
All ecBalancer methods now use ecb.diskType instead of hardcoded
types.HardDriveType. Non-ecBalancer callers (like volumeServer.evacuate
and ec.rebuild) use types.HardDriveType as the default.
Update all test files to pass diskType where needed.
* ec: add -diskType flag to ec.balance and ec.encode commands
Add -diskType flag to specify the target disk type for EC operations:
- ec.balance -diskType=ssd
- ec.encode -diskType=ssd
The disk type can be 'hdd', 'ssd', or empty for default (hdd).
This allows placing EC shards on SSD or other disk types instead of
only HDD.
Example usage:
ec.balance -collection=mybucket -diskType=ssd -apply
ec.encode -collection=mybucket -diskType=ssd -force
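A hedged sketch of how such a flag might be declared with the standard flag package (struct and function names here are illustrative, not the actual shell command code):
```go
package sketch

import "flag"

// ecBalanceFlags sketches how a -diskType option could sit next to the
// existing ones; an empty value means the default disk type (hdd).
type ecBalanceFlags struct {
	collection *string
	diskType   *string
	apply      *bool
}

func newEcBalanceFlagSet() (*flag.FlagSet, ecBalanceFlags) {
	fs := flag.NewFlagSet("ec.balance", flag.ContinueOnError)
	opts := ecBalanceFlags{
		collection: fs.String("collection", "", "the collection name"),
		diskType:   fs.String("diskType", "", "target disk type: hdd, ssd, or empty for the default (hdd)"),
		apply:      fs.Bool("apply", false, "apply the balancing plan"),
	}
	return fs, opts
}
```
Parsing '-collection=mybucket -diskType=ssd -apply' against such a flag set would mirror the example usage above.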
* test: add integration tests for EC disk type support
Add integration tests to verify the -diskType flag works correctly:
- TestECDiskTypeSupport: Tests EC encode and balance with SSD disk type
- TestECDiskTypeMixedCluster: Tests EC operations on a mixed HDD/SSD cluster
The tests verify:
- Volume servers can be configured with specific disk types
- ec.encode accepts -diskType flag and encodes to the correct disk type
- ec.balance accepts -diskType flag and balances on the correct disk type
- Mixed disk type clusters work correctly with separate collections
* ec: add -sourceDiskType to ec.encode and -diskType to ec.decode
ec.encode:
- Add -sourceDiskType flag to filter source volumes by disk type
- This enables tier migration scenarios (e.g., SSD volumes → HDD EC shards)
- -diskType specifies target disk type for EC shards
ec.decode:
- Add -diskType flag to specify source disk type where EC shards are stored
- Update collectEcShardIds() and collectEcNodeShardBits() to accept diskType
Examples:
# Encode SSD volumes to HDD EC shards (tier migration)
ec.encode -collection=mybucket -sourceDiskType=ssd -diskType=hdd
# Decode EC shards from SSD
ec.decode -collection=mybucket -diskType=ssd
Integration tests updated to cover new flags.
* ec: fix variable shadowing and add -diskType to ec.rebuild and volumeServer.evacuate
Address code review comments:
1. Fix variable shadowing in collectEcVolumeServersByDc():
- Rename loop variable 'diskType' to 'diskTypeKey' and 'diskTypeStr'
to avoid shadowing the function parameter
2. Fix hardcoded HardDriveType in ecBalancer methods:
- balanceEcRack(): use ecb.diskType instead of types.HardDriveType
- collectVolumeIdToEcNodes(): use ecb.diskType
3. Add -diskType flag to ec.rebuild command:
- Add diskType field to ecRebuilder struct
- Pass diskType to collectEcNodes() and addEcVolumeShards()
4. Add -diskType flag to volumeServer.evacuate command:
- Add diskType field to commandVolumeServerEvacuate struct
- Pass diskType to collectEcVolumeServersByDc() and moveMountedShardToEcNode()
* test: add diskType field to ecBalancer in TestPickEcNodeToBalanceShardsInto
Address nitpick comment: ensure test ecBalancer struct has diskType
field set for consistency with other tests.
* ec: filter disk selection by disk type in pickBestDiskOnNode
When evacuating or rebalancing EC shards, pickBestDiskOnNode now
filters disks by the target disk type. This ensures:
1. EC shards from SSD disks are moved to SSD disks on destination nodes
2. EC shards from HDD disks are moved to HDD disks on destination nodes
3. No cross-disk-type shard movement occurs
This maintains the storage tier isolation when moving EC shards
between nodes during evacuation or rebalancing operations.
* ec: allow disk type fallback during evacuation
Update pickBestDiskOnNode to accept a strictDiskType parameter:
- strictDiskType=true (balancing): Only use disks of matching type.
This maintains storage tier isolation during normal rebalancing.
- strictDiskType=false (evacuation): Prefer same disk type, but
fall back to other disk types if no matching disk is available.
This ensures evacuation can complete even when same-type capacity
is insufficient.
Priority order for evacuation:
1. Same disk type with lowest shard count (preferred)
2. Different disk type with lowest shard count (fallback)
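Illustrative sketch of the strict-vs-fallback selection logic (types and names are assumptions, not the actual pickBestDiskOnNode code):
```go
package sketch

// diskCandidate is a stand-in for a disk on a destination node.
type diskCandidate struct {
	DiskType   string
	ShardCount int
}

// pickBestDisk prefers the matching disk type with the lowest shard count;
// with strict=false it falls back to any disk type, with strict=true it
// returns nil when no matching disk exists.
func pickBestDisk(disks []diskCandidate, wantType string, strict bool) *diskCandidate {
	var bestSame, bestAny *diskCandidate
	for i := range disks {
		d := &disks[i]
		if bestAny == nil || d.ShardCount < bestAny.ShardCount {
			bestAny = d
		}
		if d.DiskType == wantType && (bestSame == nil || d.ShardCount < bestSame.ShardCount) {
			bestSame = d
		}
	}
	if bestSame != nil {
		return bestSame
	}
	if strict {
		return nil
	}
	return bestAny
}
```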
* test: use defer for lock/unlock to prevent lock leaks
Use defer to ensure locks are always released, even on early returns
or test failures. This prevents lock leaks that could cause subsequent
tests to hang or fail.
Changes:
- Return early if lock acquisition fails
- Immediately defer unlock after successful lock
- Remove redundant explicit unlock calls at end of tests
- Fix unused variable warning (err -> encodeErr/locErr)
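The pattern in question, as a small sketch (the lock/unlock helpers are hypothetical):
```go
package sketch

import "testing"

// withClusterLock acquires the shell lock, returns early on failure, and
// defers the unlock so it runs even if the test body fails or returns early.
func withClusterLock(t *testing.T, lock func() error, unlock func(), body func()) {
	if err := lock(); err != nil {
		t.Fatalf("acquire lock: %v", err)
		return
	}
	defer unlock()
	body()
}
```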
* ec: dynamically discover disk types from topology for evacuation
Disk types are free-form tags (e.g., 'ssd', 'nvme', 'archive') that come
from the topology, not a hardcoded set. Only 'hdd' (or empty) is the
default disk type.
Use collectVolumeDiskTypes() to discover all disk types present in the
cluster topology instead of hardcoding [HardDriveType, SsdType].
* test: add evacuation fallback and cross-rack EC placement tests
Add two new integration tests:
1. TestEvacuationFallbackBehavior:
- Tests that when same disk type has no capacity, shards fall back
to other disk types during evacuation
- Creates cluster with 1 SSD + 2 HDD servers (limited SSD capacity)
- Verifies pickBestDiskOnNode behavior with strictDiskType=false
2. TestCrossRackECPlacement:
- Tests EC shard distribution across different racks
- Creates cluster with 4 servers in 4 different racks
- Verifies shards are spread across multiple racks
- Tests that ec.balance respects rack placement
Helper functions added:
- startLimitedSsdCluster: 1 SSD + 2 HDD servers
- startMultiRackCluster: 4 servers in 4 racks
- countShardsPerRack: counts EC shards per rack from disk
* test: fix collection mismatch in TestCrossRackECPlacement
The EC commands were using collection 'rack_test', but the uploaded test
data uses the default collection 'test'. This caused ec.encode/ec.balance
to not find the uploaded volume.
Fix: Change EC commands to use '-collection test' to match the uploaded data.
Addresses review comment from PR #7607.
* test: close log files in MultiDiskCluster.Stop() to prevent FD leaks
Track log files in MultiDiskCluster.logFiles and close them in Stop()
to prevent file descriptor accumulation in long-running or many-test
scenarios.
Addresses review comment about logging resources cleanup.
* test: improve EC integration tests with proper assertions
- Add assertNoFlagError helper to detect flag parsing regressions
- Update diskType subtests to fail on flag errors (ec.encode, ec.balance, ec.decode)
- Update verify_disktype_flag_parsing to check help output contains diskType
- Remove verify_fallback_disk_selection (was documentation-only, not executable)
- Add assertion to verify_cross_rack_distribution for minimum 2 racks
- Consolidate uploadTestDataWithDiskType to accept collection parameter
- Remove duplicate uploadTestDataWithDiskTypeMixed function
* test: extract captureCommandOutput helper and fix error handling
- Add captureCommandOutput helper to reduce code duplication in diskType tests
- Create commandRunner interface to match shell command Do method
- Update ec_encode_with_ssd_disktype, ec_balance_with_ssd_disktype,
ec_encode_with_source_disktype, ec_decode_with_disktype to use helper
- Fix filepath.Glob error handling in countShardsPerRack instead of ignoring it
* test: add flag validation to ec_balance_targets_correct_disk_type
Add assertNoFlagError calls after ec.balance commands to ensure
-diskType flag is properly recognized for both SSD and HDD disk types.
* test: add proper assertions for EC command results
- ec_encode_with_ssd_disktype: check for expected volume-related errors
- ec_balance_with_ssd_disktype: require success with require.NoError
- ec_encode_with_source_disktype: check for expected no-volume errors
- ec_decode_with_disktype: check for expected no-ec-volume errors
- upload_to_ssd_and_hdd: use require.NoError for setup validation
Tests now properly fail on unexpected errors rather than just logging.
* test: fix missing unlock in ec_encode_with_disk_awareness
Add defer unlock pattern to ensure lock is always released, matching
the pattern used in other subtests.
* test: improve helper robustness
- Make assertNoFlagError case-insensitive for pattern matching
- Use defer in captureCommandOutput to restore stdout/stderr and close
pipe ends to avoid FD leaks even if cmd.Do panics
* fix: filer does not support IP whitelist right now #7094
* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Change all concurrentUploadLimitMB and concurrentDownloadLimitMB defaults
from fixed values (64, 128, 256 MB) to 0 (unlimited).
This removes artificial throttling that can limit throughput on high-performance
systems, especially on all-flash setups with many cores.
Files changed:
- volume.go: concurrentUploadLimitMB 256->0, concurrentDownloadLimitMB 256->0
- server.go: filer/volume/s3 concurrent limits 64/128->0
- s3.go: concurrentUploadLimitMB 128->0
- filer.go: concurrentUploadLimitMB 128->0, s3.concurrentUploadLimitMB 128->0
Users can still set explicit limits if needed for resource management.
fix: weed shell can't connect to master when no volume servers (#7701)
When there are no volume servers registered, the master's KeepConnected
handler would not send any initial message to clients. This caused the
shell's masterClient to block indefinitely on stream.Recv(), preventing
it from setting currentMaster and completing the connection handshake.
The fix ensures the master always sends at least one message with leader
information to newly connected clients, even when ToVolumeLocations()
returns an empty slice.
Alpine's busybox wget does not support --ca-cert, --certificate, and
--private-key options required for HTTPS healthchecks with client
certificate authentication.
Adding curl to Docker images enables proper HTTPS healthchecks.
Fixes #7707
mount: add periodic metadata flush to protect chunks from orphan cleanup
When a file is opened via FUSE mount and written for a long time without
being closed, chunks are uploaded to volume servers but the file metadata
(containing chunk references) is only saved to the filer on file close.
If volume.fsck runs during this window, it may identify these chunks as
orphans (not referenced in filer metadata) and purge them, causing data loss.
This commit adds a background task that periodically flushes file metadata
for open files to the filer, ensuring chunk references are visible to
volume.fsck even before files are closed.
New option:
-metadataFlushSeconds (default: 120)
Interval in seconds for flushing dirty file metadata to filer.
Set to 0 to disable.
Fixes: https://github.com/seaweedfs/seaweedfs/issues/7649
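A hedged sketch of the periodic flush loop (the flush callback and option wiring are assumptions; an interval of 0 disables it, matching -metadataFlushSeconds=0):
```go
package sketch

import (
	"context"
	"log"
	"time"
)

// startMetadataFlusher periodically invokes flush until ctx is cancelled.
func startMetadataFlusher(ctx context.Context, interval time.Duration, flush func() error) {
	if interval <= 0 {
		return // flushing disabled
	}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := flush(); err != nil {
					log.Printf("periodic metadata flush: %v", err)
				}
			}
		}
	}()
}
```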
* fix: add pagination to list-object-versions for buckets with >1000 objects
The findVersionsRecursively() function used a fixed limit of 1000 entries
without pagination. This caused objects beyond the first 1000 entries
(sorted alphabetically) to never appear in list-object-versions responses.
Changes:
- Add pagination loop using filer.PaginationSize (1024)
- Use isLast flag from s3a.list() to detect end of pagination
- Track startFrom marker for each page
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
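The pagination loop, sketched under assumed list-function and entry types (not the actual s3a.list signature):
```go
package sketch

// entry is a stand-in for a directory listing entry.
type entry struct{ Name string }

// listPage is a stand-in for a paginated list call that reports whether the
// returned page is the last one.
type listPage func(dir, startFrom string, limit int) (entries []entry, isLast bool, err error)

const paginationSize = 1024

// forEachVersionEntry walks all entries of dir page by page, advancing the
// startFrom marker so entries beyond the first page are not skipped.
func forEachVersionEntry(list listPage, dir string, visit func(entry)) error {
	startFrom := ""
	for {
		entries, isLast, err := list(dir, startFrom, paginationSize)
		if err != nil {
			return err
		}
		for _, e := range entries {
			visit(e)
		}
		if isLast || len(entries) == 0 {
			return nil
		}
		startFrom = entries[len(entries)-1].Name
	}
}
```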
* fix: prevent infinite loop in ListObjects when processing .versions directories
The doListFilerEntries() function processes .versions directories in a
secondary loop after the main entry loop, but failed to update nextMarker.
This caused infinite pagination loops when results were truncated, as the
same .versions directories would be reprocessed on each page.
Bug introduced by: c196d03951
("fix listing object versions (#7006)")
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
This addresses issue #7699, where the FoundationDB filer store had low throughput
(~400-500 obj/s) because each write operation created a separate transaction.
Changes:
- Add writeBatcher that collects multiple writes into batched transactions
- New config options: batch_size (default: 100), batch_interval (default: 5ms)
- Batching provides ~5.7x throughput improvement (from ~456 to ~2600 obj/s)
Benchmark results with different batch sizes:
- batch_size=1: ~456 obj/s (baseline, no batching)
- batch_size=10: ~2621 obj/s (5.7x improvement)
- batch_size=16: ~2514 obj/s (5.5x improvement)
- batch_size=100: ~2617 obj/s (5.7x improvement)
- batch_size=1000: ~2593 obj/s (5.7x improvement)
The batch_interval timer (5ms) ensures writes are flushed promptly even
when batch is not full, providing good latency characteristics.
Addressed review feedback:
- Changed wait=false to wait=true in UpdateEntry/DeleteEntry to properly
propagate errors to callers
- Fixed timer reset race condition by stopping and draining before reset
Fixes #7699
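A simplified sketch of the write-batching idea (batch_size plus a flush timer); the channel-based structure and names are assumptions, not the FoundationDB store code:
```go
package sketch

import "time"

type writeOp struct {
	key, value []byte
	done       chan error // receives the commit result so callers see errors (wait=true)
}

// runBatcher groups incoming writes into batches of up to batchSize and
// flushes a partial batch after interval so latency stays bounded.
func runBatcher(ops <-chan writeOp, batchSize int, interval time.Duration, commit func([]writeOp) error) {
	var batch []writeOp
	timer := time.NewTimer(interval)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		err := commit(batch)
		for _, op := range batch {
			op.done <- err
		}
		batch = nil
	}
	for {
		select {
		case op, ok := <-ops:
			if !ok {
				flush()
				return
			}
			batch = append(batch, op)
			if len(batch) >= batchSize {
				flush()
			}
		case <-timer.C:
			flush()
			timer.Reset(interval) // safe: the timer has fired and its channel is drained
		}
	}
}
```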
The condition was inverted - it was caching lookups with errors
instead of successful lookups. This caused every replicated write
to make a gRPC call to master for volume location lookup, resulting
in ~1 second latency for writeToReplicas.
The bug particularly affected TTL volumes because:
- More unique volumes are created (separate pools per TTL)
- Volumes expire and get recreated frequently
- Each new volume requires a fresh lookup (cache miss)
- Higher volume churn = more cache misses = more master lookups
With this fix, successful lookups are cached for 10 minutes,
reducing replication latency from ~1s to ~10ms for cached volumes.
* mount: add singleflight to deduplicate concurrent EnsureVisited calls
When multiple goroutines access the same uncached directory simultaneously,
they would all make redundant network requests to the filer. This change
uses singleflight.Group to ensure only one goroutine fetches the directory
entries while others wait for the result.
This fixes a race condition where concurrent lookups or readdir operations
on the same uncached directory would:
1. Make duplicate network requests to the filer
2. Insert duplicate entries into LevelDB cache
3. Waste CPU and network bandwidth
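Sketch of the deduplication with golang.org/x/sync/singleflight (the fetch function is hypothetical):
```go
package sketch

import "golang.org/x/sync/singleflight"

var dirGroup singleflight.Group

// ensureVisited lets only one goroutine fetch a given directory; concurrent
// callers for the same path share that single result.
func ensureVisited(dirPath string, fetchAndCache func(string) error) error {
	_, err, _ := dirGroup.Do(dirPath, func() (interface{}, error) {
		return nil, fetchAndCache(dirPath)
	})
	return err
}
```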
* mount: fetch parent directories in parallel during EnsureVisited
Previously, when accessing a deep path like /a/b/c/d, the parent directories
were fetched serially from target to root. This change:
1. Collects all uncached directories from target to root first
2. Fetches them all in parallel using errgroup
3. Relies on singleflight (from previous commit) for deduplication
This reduces latency when accessing deep uncached paths, especially in
high-latency network environments where parallel requests can significantly
improve performance.
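Sketch of the parallel fetch with golang.org/x/sync/errgroup, assuming a context-aware ensureVisited in the shape of the previous sketch:
```go
package sketch

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// prefetchParents fetches all uncached parent directories concurrently;
// if one fetch fails, the shared context cancels the others.
func prefetchParents(ctx context.Context, uncachedDirs []string, ensureVisited func(context.Context, string) error) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, dir := range uncachedDirs {
		dir := dir
		g.Go(func() error {
			return ensureVisited(ctx, dir)
		})
	}
	if err := g.Wait(); err != nil {
		return fmt.Errorf("prefetch parent directories: %w", err)
	}
	return nil
}
```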
* mount: add batch inserts for LevelDB meta cache
When populating the meta cache from filer, entries were inserted one-by-one
into LevelDB. This change:
1. Adds BatchInsertEntries method to LevelDBStore that uses LevelDB's
native batch write API
2. Updates MetaCache to keep a direct reference to the LevelDB store
for batch operations
3. Modifies doEnsureVisited to collect entries and insert them in
batches of 100 entries
Batch writes are more efficient because:
- Reduces number of individual write operations
- Reduces disk syncs
- Improves throughput for large directories
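Sketch of batched LevelDB writes with github.com/syndtr/goleveldb (the key/value layout here is illustrative):
```go
package sketch

import "github.com/syndtr/goleveldb/leveldb"

const batchInsertSize = 100

// batchInsert writes key/value pairs using LevelDB's native batch API,
// flushing every batchInsertSize entries and once more at the end.
func batchInsert(db *leveldb.DB, keys, values [][]byte) error {
	batch := new(leveldb.Batch)
	for i := range keys {
		batch.Put(keys[i], values[i])
		if batch.Len() >= batchInsertSize {
			if err := db.Write(batch, nil); err != nil {
				return err
			}
			batch = new(leveldb.Batch) // fresh batch so flushed entries can be GC'd
		}
	}
	if batch.Len() > 0 {
		return db.Write(batch, nil)
	}
	return nil
}
```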
* mount: fix potential nil dereference in MarkChildrenCached
Add missing check for inode existence in inode2path map before accessing
the InodeEntry. This prevents a potential nil pointer dereference if the
inode exists in path2inode but not in inode2path (which could happen due
to race conditions or bugs).
This follows the same pattern used in IsChildrenCached which properly
checks for existence before accessing the entry.
* mount: fix batch flush when last entry is hidden
The previous batch insert implementation relied on the isLast flag to flush
remaining entries. However, if the last entry is a hidden system entry
(like 'topics' or 'etc' in root), the callback returns early and the
remaining entries in the batch are never flushed.
Fix by:
1. Only flush when batch reaches threshold inside the callback
2. Flush any remaining entries after ReadDirAllEntries completes
3. Use error wrapping instead of logging+returning to avoid duplicate logs
4. Create new slice after flush to allow GC of flushed entries
5. Add documentation for batchInsertSize constant
This ensures all entries are properly inserted regardless of whether
the last entry is hidden, and prevents memory retention issues.
* mount: add context support for cancellation in EnsureVisited
Thread context.Context through the batch insert call chain to enable
proper cancellation and timeout support:
1. Use errgroup.WithContext() so if one fetch fails, others are cancelled
2. Add context parameter to BatchInsertEntries for consistency with InsertEntry
3. Pass context to ReadDirAllEntries for cancellation during network calls
4. Check context cancellation before starting work in doEnsureVisited
5. Use %w for error wrapping to preserve error types for inspection
This prevents unnecessary work when one directory fetch fails and makes
the batch operations consistent with the existing context-aware APIs.
mount: remove unused isEarlyTerminated variable
The variable was redundant because when processEachEntryFn returns false,
we immediately return fuse.OK, so the check was always false.
* fix: prevent filer.backup stall in single-filer setups (#4977)
When MetaAggregator.MetaLogBuffer is empty (which happens in single-filer
setups with no peers), ReadFromBuffer was returning nil error, causing
LoopProcessLogData to enter an infinite wait loop on ListenersCond.
This fix returns ResumeFromDiskError instead, allowing SubscribeMetadata
to loop back and read from persisted logs on disk. This ensures filer.backup
continues processing events even when the in-memory aggregator buffer is empty.
Fixes #4977
* test: add integration tests for metadata subscription
Add integration tests for metadata subscription functionality:
- TestMetadataSubscribeBasic: Tests basic subscription and event receiving
- TestMetadataSubscribeSingleFilerNoStall: Regression test for #4977,
verifies subscription doesn't stall under high load in single-filer setups
- TestMetadataSubscribeResumeFromDisk: Tests resuming subscription from disk
Related to #4977
* ci: add GitHub Actions workflow for metadata subscribe tests
Add CI workflow that runs on:
- Push/PR to master affecting filer, log_buffer, or metadata subscribe code
- Runs the integration tests for metadata subscription
- Uploads logs on failure for debugging
Related to #4977
* fix: use multipart form-data for file uploads in integration tests
The filer expects multipart/form-data for file uploads, not raw POST body.
This fixes the 'Content-Type isn't multipart/form-data' error.
* test: use -peers=none for faster master startup
* test: add -peers=none to remaining master startup in ec tests
* fix: use filer HTTP port 8888, WithFilerClient adds 10000 for gRPC
WithFilerClient calls ToGrpcAddress() which adds 10000 to the port.
Passing 18888 resulted in connecting to 28888. Use 8888 instead.
* test: add concurrent writes and million updates tests
- TestMetadataSubscribeConcurrentWrites: 50 goroutines writing 20 files each
- TestMetadataSubscribeMillionUpdates: 1 million metadata entries via gRPC
(metadata only, no actual file content for speed)
* fix: address PR review comments
- Handle os.MkdirAll errors explicitly instead of ignoring
- Handle log file creation errors with proper error messages
- Replace silent event dropping with 100ms timeout and warning log
* Update metadata_subscribe_integration_test.go
fix: skip log files with deleted volumes in filer backup (#3720)
When filer.backup or filer.meta.backup resumes after being stopped, it may
encounter persisted log files stored on volumes that have since been deleted
(via volume.deleteEmpty -force). Previously, this caused the backup to get
stuck in an infinite retry loop with 'volume X not found' errors.
This fix catches 'volume not found' errors when reading log files and skips
the problematic file instead of failing. The backup will now:
- Log a warning about the missing volume
- Skip the problematic log file
- Continue with the next log file, allowing progress
The VolumeNotFoundPattern regex was already defined but never used - this
change puts it to use.
Fixes #3720
* fix: return error on size mismatch in ReadNeedleMeta for consistency
When ReadNeedleMeta encounters a size mismatch at offset >= MaxPossibleVolumeSize,
it previously just continued without returning an error, potentially using wrong data.
This fix makes ReadNeedleMeta consistent with ReadBytes (needle_read.go), which
properly returns an error in both cases:
- ErrorSizeMismatch when offset < MaxPossibleVolumeSize (to trigger retry at offset+32GB)
- A descriptive error when offset >= MaxPossibleVolumeSize (after retry failed)
Fixes #7673
* refactor: use more accurate error message for size mismatch
* fix: prevent empty .vif files from ec.decode causing parse errors
When ec.decode copies .vif files from EC shard nodes, if a source node
doesn't have the .vif file, an empty .vif file was created on the target
node. This caused volume.configure.replication to fail with 'proto: syntax
error' when trying to parse the empty file.
This fix:
1. In writeToFile: Remove empty files when no data was written (source
file was not found) to avoid leaving corrupted empty files
2. In MaybeLoadVolumeInfo: Handle empty .vif files gracefully by treating
them as non-existent, allowing the system to create a proper one
Fixes #7666
* refactor: remove redundant dst.Close() and add error logging
Address review feedback:
- Remove redundant dst.Close() call since defer already handles it
- Add error logging for os.Remove() failure
* mount: fix weed inode nlookup not matching kernel inode nlookup
* mount: add underflow protection for nlookup decrement in Forget
* mount: use consistent == 0 check for uint64 nlookup
* Update weed/mount/inode_to_path.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* mount: snapshot data before unlock in Forget to avoid using deleted InodeEntry
---------
Co-authored-by: chrislu <chris.lu@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* s3api: remove redundant auth verification in getRequestDataReader
The handlers PutObjectHandler and PutObjectPartHandler are already wrapped
with s3a.iam.Auth() middleware which performs signature verification via
authRequest() before the handler is invoked.
The signature verification for authTypeSignedV2, authTypePresignedV2,
authTypePresigned, and authTypeSigned in getRequestDataReader was therefore
redundant.
The newChunkedReader() call for streaming auth types is kept as it's needed
to parse the chunked transfer encoding and extract the actual data.
Fixes #7683
* simplify switch to if statement for single condition
* s3: add s3:ExistingObjectTag condition support in policy engine
Add support for s3:ExistingObjectTag/<tag-key> condition keys in bucket
policies, allowing access control based on object tags.
Changes:
- Add ObjectEntry field to PolicyEvaluationArgs (entry.Extended metadata)
- Update EvaluateConditions to handle s3:ExistingObjectTag/<key> format
- Extract tag value from entry metadata using X-Amz-Tagging-<key> prefix
This enables policies like:
{
"Condition": {
"StringEquals": {
"s3:ExistingObjectTag/status": ["public"]
}
}
}
Fixes: https://github.com/seaweedfs/seaweedfs/issues/7447
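Sketch of resolving an s3:ExistingObjectTag/<tag-key> condition value from the entry's extended metadata (the X-Amz-Tagging- prefix follows the description above; exact constant names are assumptions):
```go
package sketch

import "strings"

// existingObjectTagValue extracts the tag value referenced by a condition key
// like "s3:ExistingObjectTag/status" from the entry's Extended metadata.
func existingObjectTagValue(extended map[string][]byte, conditionKey string) (string, bool) {
	const condPrefix = "s3:ExistingObjectTag/"
	tagKey := strings.TrimPrefix(conditionKey, condPrefix)
	if tagKey == conditionKey || tagKey == "" {
		return "", false // not a tag condition, or malformed empty tag key
	}
	v, ok := extended["X-Amz-Tagging-"+tagKey]
	if !ok {
		return "", false
	}
	return string(v), true
}
```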
* s3: update EvaluatePolicy to accept object entry for tag conditions
Update BucketPolicyEngine.EvaluatePolicy to accept objectEntry parameter
(entry.Extended metadata) for evaluating tag-based policy conditions.
Changes:
- Add objectEntry parameter to EvaluatePolicy method
- Update callers in auth_credentials.go and s3api_bucket_handlers.go
- Pass nil for objectEntry in auth layer (entry fetched later in handlers)
For tag-based conditions to work, handlers should call EvaluatePolicy
with the object's entry.Extended after fetching the entry from filer.
* s3: add tests for s3:ExistingObjectTag policy conditions
Add comprehensive tests for object tag-based policy conditions:
- TestExistingObjectTagCondition: Basic tag matching scenarios
- Matching/non-matching tag values
- Missing tags, no tags, empty tags
- Multiple tags with one matching
- TestExistingObjectTagConditionMultipleTags: Multiple tag conditions
- Both tags match
- Only one tag matches
- TestExistingObjectTagDenyPolicy: Deny policies with tag conditions
- Default allow without tag
- Deny when specific tag present
* s3: document s3:ExistingObjectTag support and feature status
Update policy engine documentation:
- Add s3:ExistingObjectTag/<tag-key> to supported condition keys
- Add 'Object Tag-Based Access Control' section with examples
- Add 'Feature Status' section with implemented and planned features
Planned features for future implementation:
- s3:RequestObjectTag/<key>
- s3:RequestObjectTagKeys
- s3:x-amz-server-side-encryption
- Cross-account access
* Implement tag-based policy re-check in handlers
- Add checkPolicyWithEntry helper to S3ApiServer for handlers to re-check
policy after fetching object entry (for s3:ExistingObjectTag conditions)
- Add HasPolicyForBucket method to policy engine for efficient check
- Integrate policy re-check in GetObjectHandler after entry is fetched
- Integrate policy re-check in HeadObjectHandler after entry is fetched
- Update auth_credentials.go comments to explain two-phase evaluation
- Update documentation with supported operations for tag-based conditions
This implements 'Approach 1' where handlers re-check the policy with
the object entry after fetching it, allowing tag-based conditions to
be properly evaluated.
* Add integration tests for s3:ExistingObjectTag conditions
- Add TestCheckPolicyWithEntry: tests checkPolicyWithEntry helper with various
tag scenarios (matching tags, non-matching tags, empty entry, nil entry)
- Add TestCheckPolicyWithEntryNoPolicyForBucket: tests early return when no policy
- Add TestCheckPolicyWithEntryNilPolicyEngine: tests nil engine handling
- Add TestCheckPolicyWithEntryDenyPolicy: tests deny policies with tag conditions
- Add TestHasPolicyForBucket: tests HasPolicyForBucket method
These tests cover the Phase 2 policy evaluation with object entry metadata,
ensuring tag-based conditions are properly evaluated.
* Address code review nitpicks
- Remove unused extractObjectTags placeholder function (engine.go)
- Add clarifying comment about s3:ExistingObjectTag/<key> evaluation
- Consolidate duplicate tag-based examples in README
- Factor out tagsToEntry helper to package level in tests
* Address code review feedback
- Fix unsafe type assertions in GetObjectHandler and HeadObjectHandler
when getting identity from context (properly handle type assertion failure)
- Extract getConditionContextValue helper to eliminate duplicated logic
between EvaluateConditions and EvaluateConditionsLegacy
- Ensure consistent handling of missing condition keys (always return
empty slice)
* Fix GetObjectHandler to match HeadObjectHandler pattern
Add safety check for nil objectEntryForSSE before tag-based policy
evaluation, ensuring tag-based conditions are always evaluated rather
than silently skipped if entry is unexpectedly nil.
Addresses review comment from Copilot.
* Fix HeadObject action name in docs for consistency
Change 'HeadObject' to 's3:HeadObject' to match other action names.
* Extract recheckPolicyWithObjectEntry helper to reduce duplication
Move the repeated identity extraction and policy re-check logic from
GetObjectHandler and HeadObjectHandler into a shared helper method.
* Add validation for empty tag key in s3:ExistingObjectTag condition
Prevent potential issues with malformed policies containing
s3:ExistingObjectTag/ (empty tag key after slash).
Fixes #7467
The -mserver argument line in volume-statefulset.yaml was missing a
trailing backslash, which prevented extraArgs from being passed to
the weed volume process.
Also:
- Extracted master server list generation logic into shared helper
templates in _helpers.tpl for better maintainability
- Updated all occurrences of deprecated -mserver flag to -master
across docker-compose files, test files, and documentation
* fix: prevent makeslice panic in ReadNeedleMeta with corrupted needle
When a needle's DataSize in the .dat file is corrupted to a very large
value, the calculation of metaSize can become negative, causing a panic
with 'makeslice: len out of range' when creating the metadata slice.
This fix adds validation to check if metaSize is negative before
creating the slice, returning a descriptive error instead of panicking.
Fixes #7475
* Update weed/storage/needle/needle_read_page.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* mount: add mutex to DirectoryHandle to fix race condition
When using Ganesha NFS on top of FUSE mount, ls operations would hang
forever on directories with hundreds of files. This was caused by a
race condition in DirectoryHandle where multiple concurrent readdir
operations could modify shared state (entryStream, entryStreamOffset,
isFinished) without synchronization.
The fix adds a mutex to DirectoryHandle and holds it for the entire
duration of doReadDirectory. This serializes concurrent readdir calls
on the same handle, which is the correct behavior for a directory
handle and fixes the race condition.
Key changes:
- Added sync.Mutex to DirectoryHandle struct
- Lock the mutex at the start of doReadDirectory
- Optimized reset() to reuse slice capacity and allow GC of old entries
The lock is per-handle (not global), so different directories can
still be listed concurrently. Only concurrent operations on the
same directory handle are serialized.
Fixes: https://github.com/seaweedfs/seaweedfs/issues/7672
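Minimal sketch of the per-handle serialization (field names and the read callback are illustrative):
```go
package sketch

import "sync"

// directoryHandle sketches the shared per-handle state guarded by the mutex.
type directoryHandle struct {
	mu                sync.Mutex
	entryStream       []string
	entryStreamOffset uint64
	isFinished        bool
}

// doReadDirectory holds the lock for the whole call, so concurrent readdir
// calls on the same handle are serialized while other handles stay independent.
func (dh *directoryHandle) doReadDirectory(readMore func() (entries []string, done bool)) {
	dh.mu.Lock()
	defer dh.mu.Unlock()
	if dh.isFinished {
		return
	}
	entries, done := readMore()
	dh.entryStream = append(dh.entryStream, entries...)
	dh.entryStreamOffset += uint64(len(entries))
	dh.isFinished = done
}
```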
* sts: limit session duration to incoming token's exp claim
This fixes the issue where AssumeRoleWithWebIdentity would issue sessions
that outlive the source identity token's expiration.
For use cases like GitLab CI Jobs where the ID Token has an exp claim
limited to the CI job's timeout, the STS session should not exceed that
expiration.
Changes:
- Add TokenExpiration field to ExternalIdentity struct
- Extract exp/iat/nbf claims in OIDC provider's ValidateToken
- Pass token expiration from Authenticate to ExternalIdentity
- Modify calculateSessionDuration to cap at source token's exp
- Add comprehensive tests for the new behavior
Fixes: https://github.com/seaweedfs/seaweedfs/discussions/7653
* refactor: reduce duplication in time claim extraction
Use a loop over claim names instead of repeating the same
extraction logic three times for exp, iat, and nbf claims.
* address review: add defense-in-depth for expired tokens
- Handle already-expired tokens defensively with 1 minute minimum duration
- Enforce MaxSessionLength from config as additional cap
- Fix potential nil dereference in test mock
- Add test case for expired token scenario
* remove issue reference from test
* fix: remove early return to ensure MaxSessionLength is always checked
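The resulting duration logic, sketched with assumed names (requested duration, configured MaxSessionLength, and the token's exp claim):
```go
package sketch

import "time"

// calcSessionDuration caps the requested duration at the configured maximum
// and at the source token's remaining lifetime, with a defensive one-minute
// floor for tokens that are already expired.
func calcSessionDuration(requested, maxSessionLength time.Duration, tokenExp, now time.Time) time.Duration {
	d := requested
	if maxSessionLength > 0 && d > maxSessionLength {
		d = maxSessionLength
	}
	if !tokenExp.IsZero() {
		remaining := tokenExp.Sub(now)
		if remaining < time.Minute {
			remaining = time.Minute
		}
		if d > remaining {
			d = remaining
		}
	}
	return d
}
```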
* fix: restore volume mount when VolumeConfigure fails
When volume.configure.replication command fails (e.g., due to corrupted
.vif file), the volume was left unmounted and the master was already
notified that the volume was deleted, causing the volume to disappear.
This fix attempts to re-mount the volume when ConfigureVolume fails,
restoring the volume state and preventing data loss.
Fixes #7666
* include mount restore error in response message
* Fix webhook duplicate deliveries and POST to GET conversion
Fixes #7667
This commit addresses two critical issues with the webhook notification system:
1. Duplicate webhook deliveries based on worker count
2. POST requests being converted to GET when following redirects
Issue 1: Multiple webhook deliveries
------------------------------------
Problem: The webhook queue was creating multiple handlers (one per worker)
that all subscribed to the same topic. With Watermill's gochannel, each
handler creates a separate subscription, and all subscriptions receive
their own copy of every message, resulting in duplicate webhook calls
equal to the worker count.
Solution: Use a single handler instead of multiple handlers to ensure
each webhook event is sent only once, regardless of worker configuration.
Issue 2: POST to GET conversion with intelligent redirect handling
------------------------------------------------------------------
Problem: When webhook endpoints returned redirects (301/302/303), Go's
default HTTP client would automatically follow them and convert POST
requests to GET requests per HTTP specification.
Solution: Implement intelligent redirect handling that:
- Prevents automatic redirects to preserve POST method
- Manually follows redirects by recreating POST requests
- Caches the final redirect destination for performance
- Invalidates cache and retries on failures (network or HTTP errors)
- Provides automatic recovery from cached endpoint failures
Benefits:
- Webhooks are now sent exactly once per event
- POST method is always preserved through redirects
- Reduced latency through redirect destination caching
- Automatic failover when cached destinations become unavailable
- Thread-safe concurrent webhook delivery
Testing:
- Added TestQueueNoDuplicateWebhooks to verify single delivery
- Added TestHttpClientFollowsRedirectAsPost for redirect handling
- Added TestHttpClientUsesCachedRedirect for caching behavior
- Added cache invalidation tests for error scenarios
- All 18 webhook tests pass successfully
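A hedged sketch of the redirect handling: automatic redirects are disabled so the method is preserved, and 301/302/303 responses are followed manually with a fresh POST (helper names and the retry depth are illustrative):
```go
package sketch

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

const maxWebhookRedirectDepth = 3

// newWebhookClient returns a client that does not follow redirects itself.
func newWebhookClient() *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
}

// postPreservingMethod re-issues the POST for each redirect instead of
// letting the client downgrade it to GET.
func postPreservingMethod(client *http.Client, url string, body []byte) (*http.Response, error) {
	for depth := 0; depth <= maxWebhookRedirectDepth; depth++ {
		resp, err := client.Post(url, "application/json", bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		switch resp.StatusCode {
		case http.StatusMovedPermanently, http.StatusFound, http.StatusSeeOther:
			location := resp.Header.Get("Location")
			io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
			resp.Body.Close()
			if location == "" {
				return nil, fmt.Errorf("redirect from %s without Location header", url)
			}
			url = location // the final destination could also be cached here
		default:
			return resp, nil
		}
	}
	return nil, fmt.Errorf("too many redirects")
}
```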
* Address code review comments
- Add maxWebhookRetryDepth constant to avoid magic number
- Extract cache invalidation logic into invalidateCache() helper method
- Fix redirect handling to properly follow redirects even on retry attempts
- Remove misleading comment about nWorkers controlling handler parallelism
- Fix test assertions to match actual execution flow
- Remove trailing whitespace in test file
All tests passing.
* Refactor: use setFinalURL() instead of invalidateCache()
Replace invalidateCache() with more explicit setFinalURL() function.
This is cleaner as it makes the intent clear - we're setting the URL
(either to a value or to empty string to clear it), rather than having
a separate function just for clearing.
No functional changes, all tests passing.
* Add concurrent webhook delivery using nWorkers configuration
Webhooks were previously sent sequentially (one-by-one), which could be
a performance bottleneck for high-throughput scenarios. Now nWorkers
configuration is properly used to control concurrent webhook delivery.
Implementation:
- Added semaphore channel (buffered to nWorkers capacity)
- handleWebhook acquires semaphore slot before sending (blocks if at capacity)
- Releases slot after webhook completes
- Allows up to nWorkers concurrent webhook HTTP requests
Benefits:
- Improved throughput for slow webhook endpoints
- nWorkers config now has actual purpose (was validated but unused)
- Default 5 workers provides good balance
- Configurable from 1-100 workers based on needs
Example performance improvement:
- Before: 500ms webhook latency = ~2 webhooks/sec max
- After (5 workers): 500ms latency = ~10 webhooks/sec
- After (10 workers): 500ms latency = ~20 webhooks/sec
All tests passing.
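Sketch of the worker bound with a buffered channel used as a semaphore (the send function is hypothetical):
```go
package sketch

// newBoundedSender returns a send wrapper that allows at most nWorkers
// webhook deliveries in flight; additional calls block until a slot frees up.
func newBoundedSender(nWorkers int, send func(payload []byte) error) func(payload []byte) error {
	sem := make(chan struct{}, nWorkers)
	return func(payload []byte) error {
		sem <- struct{}{}        // acquire a slot (blocks at capacity)
		defer func() { <-sem }() // release the slot when the delivery finishes
		return send(payload)
	}
}
```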
* Replace deprecated AddNoPublisherHandler with AddConsumerHandler
AddNoPublisherHandler is deprecated in Watermill.
Use AddConsumerHandler instead, which is the current recommended API
for handlers that only consume messages without publishing.
No functional changes, all tests passing.
* Drain response bodies to enable HTTP connection reuse
Added drainBody() calls in all code paths to ensure response bodies
are consumed before returning. This is critical for HTTP keep-alive
connection reuse.
Without draining:
- Connections are closed after each request
- New TCP handshake + TLS handshake for every webhook
- Higher latency and resource usage
With draining:
- Connections are reused via HTTP keep-alive
- Significant performance improvement for repeated webhooks
- Lower latency (no handshake overhead)
- Reduced resource usage
Implementation:
- Added drainBody() helper that reads up to 1MB (prevents memory issues)
- Drain on success path (line 161)
- Drain on error responses before retry (lines 119, 152)
- Drain on redirect responses before following (line 118)
- Already had drainResponse() for network errors (line 99)
All tests passing.
* Use existing CloseResponse utility instead of custom drainBody
Replaced custom drainBody() function with the existing util_http.CloseResponse()
utility which is already used throughout the codebase. This provides:
- Consistent behavior with rest of the codebase
- Better logging (logs bytes drained via CountingReader)
- Full body drainage (not limited to 1MB)
- Cleaner code (no duplication)
CloseResponse properly drains and closes the response body to enable
HTTP keep-alive connection reuse.
All tests passing.
* Fix: Don't overwrite original error when draining response
Before: err was being overwritten by drainResponse() result
After: Use drainErr to avoid losing the original client.Do() error
This was a subtle bug where if drainResponse() succeeded (returned nil),
we would lose the original network error and potentially return a
confusing error message.
All tests passing.
* Optimize HTTP client: reuse client and remove redundant timeout
1. Reuse single http.Client instance instead of creating new one per request
- Reduces allocation overhead
- More efficient for high-volume webhooks
2. Remove redundant timeout configuration
- Before: timeout set on both context AND http.Client
- After: timeout only on context (cleaner, context fires first anyway)
Performance benefits:
- Reduced GC pressure (fewer client allocations)
- Better connection pooling (single transport instance)
- Cleaner code (no redundancy)
All tests passing.
* Nit: have `ec.encode` exit immediately if no volumes are processed.
* Update weed/shell/command_ec_encode.go
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---------
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Fixes #7643
Reordered filerHealth struct fields to ensure the int64 field comes first,
guaranteeing 8-byte alignment required for atomic operations on 32-bit
ARM architectures (ARMv7, as used in OpenWRT).
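The layout rule, sketched with assumed field names: on 32-bit platforms, 64-bit fields used with sync/atomic must be 8-byte aligned, and Go guarantees that alignment for the first word of an allocated struct.
```go
package sketch

import "sync/atomic"

// filerHealthSketch keeps the atomically accessed int64 first so it is
// 8-byte aligned even on 32-bit platforms such as ARMv7.
type filerHealthSketch struct {
	lastHealthyUnixNano int64 // must come first for atomic.LoadInt64/StoreInt64
	failureCount        int32
	markedUnhealthy     bool
}

func (h *filerHealthSketch) touch(now int64) {
	atomic.StoreInt64(&h.lastHealthyUnixNano, now)
}
```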
Fixes #7650
This change enables the SFTP server to reload the user store configuration
(sftp_userstore.json) when a HUP signal is sent to the process, without
requiring a service restart.
Changes:
- Add Reload() method to FileStore to re-read users from disk
- Add Reload() method to SFTPService to handle reload requests
- Register reload hook with grace.OnReload() in sftp command
This allows administrators to add users or change access policies
dynamically by editing the user store file and sending a HUP signal
(e.g., 'systemctl reload seaweedfs' or 'kill -HUP <pid>').
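A hedged sketch of the reload path using os/signal (the actual wiring goes through grace.OnReload; the store interface here is an assumption):
```go
package sketch

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

type reloadable interface {
	Reload() error
}

// watchHUP re-reads the user store whenever the process receives SIGHUP,
// so user changes apply without restarting the SFTP service.
func watchHUP(store reloadable) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGHUP)
	go func() {
		for range sigCh {
			if err := store.Reload(); err != nil {
				log.Printf("reload sftp user store: %v", err)
			}
		}
	}()
}
```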
* s3: fix ListBuckets not showing buckets created by authenticated users
Fixes #7647
## Problem
Users with proper Admin permissions could create buckets but couldn't
list them. The issue occurred because ListBucketsHandler was not wrapped
with the Auth middleware, so the authenticated identity was never set in
the request context.
## Root Cause
- PutBucketHandler uses iam.Auth() middleware which sets identity in context
- ListBucketsHandler did NOT use iam.Auth() middleware
- Without the middleware, GetIdentityNameFromContext() returned empty string
- Bucket ownership checks failed because no identity was present
## Changes
1. Wrap ListBucketsHandler with iam.Auth() middleware (s3api_server.go)
2. Update ListBucketsHandler to get identity from context (s3api_bucket_handlers.go)
3. Add lookupByIdentityName() helper method (auth_credentials.go)
4. Add comprehensive test TestListBucketsIssue7647 (s3api_bucket_handlers_test.go)
## Testing
- All existing tests pass (1348 tests in s3api package)
- New test TestListBucketsIssue7647 validates the fix
- Verified admin users can see their created buckets
- Verified admin users can see all buckets
- Verified backward compatibility maintained
* s3: fix ListBuckets for JWT/Keycloak authentication
The previous fix broke JWT/Keycloak authentication because JWT identities
are created on-the-fly and not stored in the iam.identities list.
lookupByIdentityName() would therefore return nil for JWT users.
Solution: Store the full Identity object in the request context, not just
the name. This allows ListBucketsHandler to retrieve the complete identity
for all authentication types (SigV2, SigV4, JWT, Anonymous).
Changes:
- Add SetIdentityInContext/GetIdentityFromContext in s3_constants/header.go
- Update Auth middleware to store full identity in context
- Update ListBucketsHandler to retrieve identity from context first,
with fallback to lookup for backward compatibility
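Sketch of carrying the full identity through the request context (type names are illustrative, not the exact s3_constants helpers):
```go
package sketch

import (
	"context"
	"net/http"
)

// identity is a stand-in for the authenticated S3 identity.
type identity struct{ Name string }

type identityCtxKey struct{}

// setIdentityInContext stores the authenticated identity on the request so
// later handlers can retrieve it regardless of auth type (SigV2/SigV4/JWT).
func setIdentityInContext(r *http.Request, id *identity) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), identityCtxKey{}, id))
}

// getIdentityFromContext returns the stored identity, if any.
func getIdentityFromContext(r *http.Request) (*identity, bool) {
	id, ok := r.Context().Value(identityCtxKey{}).(*identity)
	return id, ok
}
```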
* s3: optimize lookupByIdentityName to O(1) using map
Address code review feedback: Use a map for O(1) lookups instead of
O(N) linear scan through identities list.
Changes:
- Add nameToIdentity map to IdentityAccessManagement struct
- Populate map in loadS3ApiConfiguration (consistent with accessKeyIdent pattern)
- Update lookupByIdentityName to use map lookup instead of loop
This improves performance when many identities are configured and
aligns with the existing pattern used for accessKeyIdent lookups.
* s3: address code review feedback on nameToIdentity and logging
Address two code review points:
1. Wire nameToIdentity into env-var fallback path
- The AWS env-var fallback in NewIdentityAccessManagementWithStore now
populates nameToIdentity map along with accessKeyIdent
- Keeps all identity lookup maps in sync
- Avoids potential issues if handlers rely on lookupByIdentityName
2. Improve access key lookup logging
- Reduce log verbosity: V(1) -> V(2) for failed lookups
- Truncate access keys in logs (show first 4 chars + ***)
- Include key length for debugging
- Prevents credential exposure in production logs
- Reduces log noise from misconfigured clients
* fmt
* s3: refactor truncation logic and improve error handling
Address additional code review feedback:
1. DRY principle: Extract key truncation logic into local function
- Define truncate() helper at function start
- Reuse throughout lookupByAccessKey
- Eliminates code duplication
2. Enhanced security: Mask very short access keys
- Keys <= 4 chars now show as '***' instead of full key
- Prevents any credential exposure even for short keys
- Consistent masking across all log statements
3. Improved robustness: Add warning log for type assertion failure
- Log unexpected type when identity context object is wrong type
- Helps debug potential middleware or context issues
- Better production diagnostics
4. Documentation: Add comment about future optimization opportunity
- Note potential for lightweight identity view in context
- Suggests credential-free view for better data minimization
- Documents design decision for future maintainers
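The masking behavior, as a small sketch:
```go
package sketch

// maskAccessKey shows only the first four characters of an access key in
// logs and fully masks keys that are four characters or shorter.
func maskAccessKey(key string) string {
	if len(key) <= 4 {
		return "***"
	}
	return key[:4] + "***"
}
```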
* fix: initialize missing S3 options in filer to prevent nil pointer dereference
Fixes #7644
When starting the S3 gateway from the filer, several S3Options fields
were not being initialized, which could cause nil pointer dereferences
during startup.
This commit adds initialization for:
- iamConfig: for advanced IAM configuration
- metricsHttpPort: for Prometheus metrics endpoint
- metricsHttpIp: for binding the metrics endpoint
Also ensures metricsHttpIp defaults to bindIp when not explicitly set,
matching the behavior of the standalone S3 server.
This prevents the panic that was occurring in the s3.go:226 area when
these pointer fields were accessed but never initialized.
* fix: copy value instead of pointer for metricsHttpIp default
Address review comment to avoid pointer aliasing. Copy the value
instead of the pointer to prevent unexpected side effects if the
bindIp value is modified later.