* Refactor Admin UI to use unified IAM storage and add MultipleFileStore
* Address PR feedback: fix renames, error handling, and sync logic in FilerMultipleStore
* Address refined PR feedback: safe rename order, rollback logic, and structural sync refinement
* Optimize LoadConfiguration: use streaming callback for memory efficiency
* Refactor UpdateUser: log rollback failures during rename
* Implement PolicyManager for FilerMultipleStore
* include the filer_multiple backend configuration
* Implement cross-S3 synchronization and proper shutdown for all IAM backends
* Extract Admin UI refactoring to a separate PR
* Add AWS IAM integration tests and refactor admin authorization
- Added AWS IAM management integration tests (User, AccessKey, Policy)
- Updated test framework to support IAM client creation with JWT/OIDC
- Refactored s3api authorization to be policy-driven for IAM actions
- Removed hardcoded role name checks for admin privileges
- Added new tests to GitHub Actions basic test matrix
* test(s3/iam): add UpdateUser and UpdateAccessKey tests and fix nil pointer dereference
* feat(s3api): add DeletePolicy and update tests with cleanup logic
* test(s3/iam): use t.Cleanup for managed policy deletion in CreatePolicy test
* fix: IAM authentication with AWS Signature V4 and environment credentials
Three key fixes were needed for authenticated IAM requests to work:
1. Fix request body consumption before signature verification
- iamMatcher was calling r.ParseForm(), which consumed the POST body
- This broke AWS Signature V4 verification on subsequent reads
- Now only the query string is checked in the matcher, preserving the body for verification (a minimal sketch follows this commit's notes)
- File: weed/s3api/s3api_server.go
2. Preserve environment variable credentials across config reloads
- After IAM mutations, config reload overwrote env var credentials
- Extract env var loading into loadEnvironmentVariableCredentials()
- Call after every config reload to persist credentials
- File: weed/s3api/auth_credentials.go
3. Add authenticated IAM tests and test infrastructure
- New TestIAMAuthenticated suite with AWS SDK + Signature V4
- Dynamic port allocation for independent test execution
- Flag reset to prevent state leakage between tests
- CI workflow to run S3 and IAM tests separately
- Files: test/s3/example/*, .github/workflows/s3-example-integration-tests.yml
All tests pass:
- TestIAMCreateUser (unauthenticated)
- TestIAMAuthenticated (with AWS Signature V4)
- S3 integration tests
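A minimal sketch of fix 1's idea, assuming a hypothetical helper name (the real matcher lives in weed/s3api/s3api_server.go and is not reproduced here): r.URL.Query() parses only the URL, so a form-encoded POST body stays untouched for AWS Signature V4 verification, whereas r.ParseForm() would drain it.

    // Illustrative sketch only; the helper name and matched parameter are assumptions.
    package sketch

    import "net/http"

    // isIAMQueryRequest routes by inspecting the query string alone.
    // r.URL.Query() never reads r.Body, so Signature V4 verification can still
    // hash the untouched payload later; calling r.ParseForm() here would
    // consume a form-encoded POST body and break the signature check.
    func isIAMQueryRequest(r *http.Request) bool {
        return r.URL.Query().Get("Action") != ""
    }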
* fmt
* chore: rename test/s3/example to test/s3/normal
* simplify: CI runs all integration tests in single job
* Update s3-example-integration-tests.yml
* ci: run each test group separately to avoid raft registry conflicts
* Fix imbalance detection disk type grouping and volume grow errors
This PR addresses two issues:
1. Imbalance Detection: Previously, balance detection did not check disk types, leading to false positives when comparing heterogeneous nodes (e.g. SSD vs HDD). The logic now groups volumes by DiskType before calculating imbalance (see the sketch below).
2. Volume Grow Errors: Fixed a variable scope issue in master_grpc_server_volume.go and added a pre-check for available space to prevent 'only 0 volumes left' error logs when a disk type is full or abandoned.
Included unit tests for the detection logic.
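A rough sketch of the grouping step, with simplified stand-in types (not the actual detector code): volumes are bucketed by disk type first, and imbalance is computed per bucket so SSD and HDD nodes are never compared against each other.

    // Sketch only: VolumeInfo is a trimmed stand-in for the real topology types.
    package sketch

    type VolumeInfo struct {
        DiskType string // e.g. "hdd", "ssd"
        Server   string
    }

    // groupByDiskType buckets volumes by disk type so imbalance is measured
    // within one disk type at a time, avoiding false positives across
    // heterogeneous nodes.
    func groupByDiskType(volumes []VolumeInfo) map[string][]VolumeInfo {
        groups := make(map[string][]VolumeInfo)
        for _, v := range volumes {
            groups[v.DiskType] = append(groups[v.DiskType], v)
        }
        return groups
    }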
* Refactor balance detection loop into detectForDiskType
* Fix potential panic in volume grow logic by checking replica placement parse error
* storage: fix EC shard recovery with improved diagnostics and logging
- Fix buffer size mismatch in ReconstructData call
- Add detailed logging of available and missing shards
- Improve error messages when recovery is impossible
- Add unit tests for EC recovery shard counting logic
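A sketch of the shard-counting guard described above; the 10+4 constants match SeaweedFS's erasure coding layout, but the function itself is illustrative rather than the actual recovery code.

    // Sketch only; real recovery also validates shard sizes before ReconstructData.
    package sketch

    import "fmt"

    const (
        dataShardsCount  = 10
        totalShardsCount = 14
    )

    // checkRecoverable reports whether reconstruction is possible and which
    // shards are missing, so the error message states exactly what was found.
    func checkRecoverable(shards [][]byte) error {
        available := 0
        var missing []int
        for i, s := range shards {
            if len(s) > 0 {
                available++
            } else {
                missing = append(missing, i)
            }
        }
        if available < dataShardsCount {
            return fmt.Errorf("cannot recover: only %d of %d shards available, missing %v",
                available, totalShardsCount, missing)
        }
        return nil
    }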
* test: refine EC recovery unit tests
- Remove redundant tests that only validate setup
- Use standard strings.Contains instead of custom recursive helper
* adjust tests and minor improvements
* fix: S3 listing NextMarker missing intermediate directory component
When listing with nested prefixes like "character/member/", the NextMarker
was incorrectly constructed as "character/res024/" instead of
"character/member/res024/", causing continuation requests to fail.
Root cause: The code at line 331 was constructing NextMarker as:
    nextMarker = requestDir + "/" + nextMarker
This worked when nextMarker already contained the full relative path,
but failed when it was just the entry name from the innermost recursion.
Fix: Include the prefix component when constructing NextMarker:
    if prefix != "" {
        nextMarker = requestDir + "/" + prefix + "/" + nextMarker
    }
This ensures the full path is always constructed correctly for both:
- CommonPrefix entries (directories)
- Regular entries (files)
Also includes fix for cursor.prefixEndsOnDelimiter state leak that was
causing sibling directories to be incorrectly listed.
* test: add regression tests for NextMarker construction
Add comprehensive unit tests to verify NextMarker is correctly constructed
with nested prefixes. Tests cover:
- Regular entries with nested prefix (character/member/res024)
- CommonPrefix entries (directories)
- Edge cases (no requestDir, no prefix, deeply nested)
These tests ensure the fix prevents regression of the bug where
NextMarker was missing intermediate directory components.
Specifically:
- Use bytes.NewReader for binary data instead of strings.NewReader
- Increase binary test data from 8 bytes to 1KB to avoid edge cases
- Add 50ms delay between subtests to prevent overwhelming the server
* Fix: Populate Claims from STS session RequestContext for policy variable substitution
When using STS temporary credentials (from AssumeRoleWithWebIdentity) with
AWS Signature V4 authentication, JWT claims like preferred_username were
not available for bucket policy variable substitution (e.g., ${jwt:preferred_username}).
Root Cause:
- STS session tokens store user claims in the req_ctx field (added in PR #8079)
- validateSTSSessionToken() created Identity but didn't populate Claims field
- authorizeWithIAM() created IAMIdentity but didn't copy Claims
- Policy engine couldn't resolve ${jwt:*} variables without claims
Changes:
1. auth_signature_v4.go: Extract claims from sessionInfo.RequestContext
and populate Identity.Claims in validateSTSSessionToken()
2. auth_credentials.go: Copy Claims when creating IAMIdentity in
authorizeWithIAM()
3. auth_sts_identity_test.go: Add TestSTSIdentityClaimsPopulation to
verify claims are properly populated from RequestContext
This enables bucket policies with JWT claim variables to work correctly
with STS temporary credentials obtained via AssumeRoleWithWebIdentity.
Fixes #8037
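A minimal sketch of the claims propagation, using simplified stand-in types rather than the actual s3api structures: the session's RequestContext map is copied into the identity's Claims so the policy engine can resolve ${jwt:*} variables.

    // Illustrative only; field names and claim value types are simplified assumptions.
    package sketch

    type Identity struct {
        Name   string
        Claims map[string]string
    }

    type SessionInfo struct {
        Subject        string
        RequestContext map[string]string // claims stored by AssumeRoleWithWebIdentity
    }

    // identityFromSession copies the session's request context into Claims so
    // that ${jwt:preferred_username} and other ${jwt:*} policy variables can
    // be substituted for STS temporary credentials.
    func identityFromSession(s *SessionInfo) *Identity {
        id := &Identity{Name: s.Subject, Claims: make(map[string]string, len(s.RequestContext))}
        for k, v := range s.RequestContext {
            id.Claims[k] = v
        }
        return id
    }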
* Refactor: Idiomatic map population for STS claims
* Fix S3 conditional writes with versioning (Issue #8073)
Refactors conditional header checks to properly resolve the latest object version when versioning is enabled. This prevents incorrect validation against non-versioned root objects.
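A sketch of the resolution order under assumed helper names (not the exact s3api functions): when versioning is enabled, conditional headers are validated against the latest version's entry instead of the non-versioned root object.

    // Sketch only; the real handlers cover more cases (retries, error mapping).
    package sketch

    type Entry struct {
        ETag  string
        Mtime int64
    }

    type objectStore interface {
        isVersioningEnabled(bucket string) (bool, error)
        getLatestObjectVersion(bucket, object string) (*Entry, error)
        getEntry(bucket, object string) (*Entry, error)
    }

    // resolveObjectEntry picks the entry that If-Match / If-None-Match /
    // If-(Un)Modified-Since checks run against: the latest version when
    // versioning is enabled, otherwise the plain non-versioned entry.
    func resolveObjectEntry(s objectStore, bucket, object string) (*Entry, error) {
        versioned, err := s.isVersioningEnabled(bucket)
        if err != nil {
            return nil, err // internal error; the caller maps this to HTTP 500
        }
        if versioned {
            return s.getLatestObjectVersion(bucket, object)
        }
        return s.getEntry(bucket, object)
    }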
* Add integration test for S3 conditional writes with versioning (Issue #8073)
* Refactor: Propagate internal errors in conditional header checks
- Make resolveObjectEntry return errors from isVersioningConfigured
- Update checkConditionalHeaders checks to return 500 on internal resolve errors
* Refactor: Stricter error handling and test assertions
- Propagate internal errors in checkConditionalHeaders*WithGetter functions
- Enforce strict 412 PreconditionFailed check in integration test
* Perf: Add early return for conditional headers + safety improvements
- Add fast path to skip resolveObjectEntry when no conditional headers present
- Avoids expensive getLatestObjectVersion retries in common case
- Add nil checks before dereferencing pointers in integration test
- Fix grammar in test comments
- Remove duplicate comment in resolveObjectEntry
* Refactor: Use errors.Is for robust ErrNotFound checking
- Update checkConditionalHeaders* to use errors.Is(err, filer_pb.ErrNotFound)
- Update resolveObjectEntry to use errors.Is for wrapped error compatibility
- Remove duplicate comment lines in s3api handlers
* Perf: Optimize resolveObjectEntry for conditional checks
- Refactor getLatestObjectVersion to doGetLatestObjectVersion supporting variable retries
- Use 1-retry path in resolveObjectEntry to avoid exponential backoff latency
* Test: Enhance integration test with content verification
- Verify actual object content equals expected content after successful conditional write
- Add missing io and errors imports to test file
* Refactor: Final refinements based on feedback
- Optimize header validation by passing parsed headers to avoid redundant parsing
- Simplify integration test assertions using require.Error and assert.True
- Fix build errors in s3api handler and test imports
* Test: Use smithy.APIError for robust error code checking
- Replace string-based error checking with structured API error
- Add smithy-go import for AWS SDK v2 error handling
* Test: Use types.PreconditionFailed and handle io.ReadAll error
- Replace smithy.APIError with more specific types.PreconditionFailed
- Add proper error handling for io.ReadAll in content verification
* Refactor: Use combined error checking and add nil guards
- Use smithy.APIError with ErrorCode() for robust error checking
- Add nil guards for entry.Attributes before accessing Mtime
- Prevents potential panics when Attributes is uninitialized
* fix: Refactor CORS middleware to consistently apply the `Vary: Origin` header when a configuration exists and streamline request processing logic.
* fix: Add Vary: Origin header to CORS OPTIONS responses and refactor request handling for clarity and correctness.
* fix: update CORS middleware tests to correctly parse and check for 'Origin' in Vary header.
* refactor: extract `hasVaryOrigin` helper function to simplify Vary header checks in tests.
* test: Remove `Vary: Origin` header from CORS test expectations.
* refactor: consolidate CORS request handling into a new `processCORS` method using a `next` callback.
Fix S3 CORS for non-existent buckets
Enable fallback to global CORS configuration when a bucket is not found (s3err.ErrNoSuchBucket). This ensures consistent CORS behavior and prevents information disclosure.
Fixes #8065
Problem:
- CORS headers were only applied after checking bucket existence
- Non-existent buckets returned responses without CORS headers
- This caused CORS preflight failures and information disclosure vulnerability
- Unauthenticated users could infer bucket existence from CORS header presence
Solution:
- Moved CORS evaluation before the bucket existence check in middleware (sketched after this commit's notes)
- CORS headers now applied consistently regardless of bucket existence
- Preflight requests succeed for non-existent buckets (matching AWS S3)
- Actual requests still return NoSuchBucket error but with CORS headers
Changes:
- Modified Handler() and HandleOptionsRequest() in middleware.go
- Added comprehensive test suite for non-existent bucket scenarios
- All 39 tests passing (31 existing + 8 new)
Security Impact:
- Prevents information disclosure about bucket existence
- Bucket existence cannot be inferred from CORS header presence/absence
AWS S3 Compatibility:
- Improved compatibility with AWS S3 CORS behavior
- Preflight requests now succeed for non-existent buckets
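A condensed sketch of the ordering described in Solution above; the handler shape, config type, and bucket extraction are simplified assumptions, not the real weed/s3api middleware.

    // Sketch only; real CORS evaluation also matches methods and headers per rule.
    package sketch

    import (
        "net/http"
        "strings"
    )

    type corsConfig struct {
        allowedOrigin string
    }

    // processCORS applies CORS headers before any bucket-existence check and
    // then hands off to next, so NoSuchBucket responses still carry CORS
    // headers and preflights for missing buckets behave like AWS S3.
    func processCORS(global *corsConfig, lookupBucket func(string) *corsConfig,
        next http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            bucket := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/"), "/", 2)[0]
            cfg := lookupBucket(bucket) // nil when the bucket does not exist
            if cfg == nil {
                cfg = global // fall back so existence cannot be inferred from headers
            }
            if cfg != nil {
                w.Header().Add("Vary", "Origin")
                if origin := r.Header.Get("Origin"); origin != "" && origin == cfg.allowedOrigin {
                    w.Header().Set("Access-Control-Allow-Origin", origin)
                }
            }
            if r.Method == http.MethodOptions {
                w.WriteHeader(http.StatusOK) // preflight succeeds even for missing buckets
                return
            }
            next(w, r) // actual requests may still return NoSuchBucket, with CORS headers set
        }
    }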
* Fix nil pointer panic in maintenance worker when receiving empty task assignment
When a worker requests a task and none are available, the admin server
sends an empty TaskAssignment message. The worker was attempting to log
the task details without checking if the TaskId was empty, causing a
nil pointer dereference when accessing taskAssign.Params.VolumeId.
This fix adds a check for empty TaskId before processing the assignment,
preventing worker crashes and improving stability in production environments.
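A tiny sketch of the guard, with the assignment message reduced to two illustrative fields:

    // Sketch only; the real TaskAssignment is a protobuf message with more fields.
    package sketch

    type TaskParams struct {
        VolumeId uint32
    }

    type TaskAssignment struct {
        TaskId string
        Params *TaskParams
    }

    // shouldProcess ignores the empty assignment sent when no task is
    // available, instead of later dereferencing a nil Params.
    func shouldProcess(t *TaskAssignment) bool {
        if t == nil || t.TaskId == "" {
            return false // no task assigned; t.Params may be nil here
        }
        return true
    }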
* Add EC integration test for admin-worker maintenance system
Adds comprehensive integration test that verifies the end-to-end flow
of erasure coding maintenance tasks:
- Admin server detects volumes needing EC encoding
- Workers register and receive task assignments
- EC encoding is executed and verified in master topology
- File read-back validation confirms data integrity
The test uses unique absolute working directories for each worker to
prevent ID conflicts and ensure stable worker registration. Includes
proper cleanup and process management for reliable test execution.
* Improve maintenance system stability and task deduplication
- Add cross-type task deduplication to prevent concurrent maintenance
operations on the same volume (EC, balance, vacuum)
- Implement HasAnyTask check in ActiveTopology for better coordination
- Increase RequestTask timeout from 5s to 30s to prevent unnecessary
worker reconnections
- Add TaskTypeNone sentinel for generic task checks
- Update all task detectors to use HasAnyTask for conflict prevention
- Improve config persistence and schema handling
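A simplified sketch of the HasAnyTask idea (the real ActiveTopology tracks far more state; this stand-in is keyed by volume id only): detectors skip a volume if any maintenance task already targets it, regardless of task type.

    // Sketch only; task bookkeeping is reduced to a volume-id keyed map.
    package sketch

    import "sync"

    type TaskType string

    type ActiveTopology struct {
        mu    sync.RWMutex
        tasks map[uint32]TaskType // volume id -> in-flight task type
    }

    // HasAnyTask reports whether any maintenance task (EC, balance, vacuum)
    // is already in flight for the volume, so detectors of other task types
    // avoid scheduling concurrent work on the same volume.
    func (t *ActiveTopology) HasAnyTask(volumeId uint32) bool {
        t.mu.RLock()
        defer t.mu.RUnlock()
        _, found := t.tasks[volumeId]
        return found
    }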
* Add GitHub Actions workflow for EC integration tests
Adds CI workflow that runs EC integration tests on push and pull requests
to master branch. The workflow:
- Triggers on changes to admin, worker, or test files
- Builds the weed binary
- Runs the EC integration test suite
- Uploads test logs as artifacts on failure for debugging
This ensures the maintenance system remains stable and worker-admin
integration is validated in CI.
* go version 1.24
* address comments
* Update maintenance_integration.go
* support seconds
* ec prioritize over balancing in tests
Fix: Propagate OIDC claims to IAM identity for dynamic policy variables
Fixes #8037. Ensures additional OIDC claims (like preferred_username) are preserved in ExternalIdentity attributes and propagated to IAM tokens, enabling substitution in dynamic policies.
* Prevent split-brain: Persistent ClusterID and Join Validation
- Persist ClusterId in Raft store to survive restarts.
- Validate ClusterId on Raft command application (piggybacked on MaxVolumeId).
- Prevent masters with conflicting ClusterIds from joining/operating together.
- Update Telemetry to report the persistent ClusterId.
* Refine ClusterID validation based on feedback
- Improved error message in cluster_commands.go.
- Added ClusterId mismatch check in RaftServer.Recovery.
* Handle Raft errors and support Hashicorp Raft for ClusterId
- Check for errors when persisting ClusterId in legacy Raft.
- Implement ClusterId generation and persistence for Hashicorp Raft leader changes.
- Ensure consistent error logging.
* Refactor ClusterId validation
- Centralize ClusterId mismatch check in Topology.SetClusterId.
- Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId.
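A minimal sketch of the centralized check, with Topology trimmed to the one field that matters here: the first id seen is accepted, and any later, different id is rejected so masters from another cluster cannot join or operate together.

    // Sketch only; the real Topology carries much more state.
    package sketch

    import "fmt"

    type Topology struct {
        clusterId string
    }

    // SetClusterId accepts the first id and rejects conflicting ones,
    // centralizing the split-brain guard used by Apply and Recovery.
    func (t *Topology) SetClusterId(id string) error {
        if id == "" || t.clusterId == id {
            return nil
        }
        if t.clusterId == "" {
            t.clusterId = id
            return nil
        }
        return fmt.Errorf("cluster id mismatch: have %s, got %s", t.clusterId, id)
    }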
* Fix goroutine leak and add timeout
- Handle channel closure in Hashicorp Raft leader listener.
- Add timeout to Raft Apply call to prevent blocking.
* Fix deadlock in legacy Raft listener
- Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock).
* Rename ClusterId to SystemId
- Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry).
- Regenerated telemetry.pb.go with new field.
* Rename SystemId to TopologyId
- The rename to SystemId was an intermediate step.
- The final name is TopologyId for the persistent cluster identifier.
- Updated protobuf, topology, raft server, master server, and telemetry.
* Optimize Hashicorp Raft listener
- Integrated TopologyId generation into existing monitorLeaderLoop.
- Removed extra goroutine in master_server.go.
* Fix optimistic TopologyId update
- Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go.
- State is now solely updated via the Raft state machine Apply/Restore methods after consensus.
* Add explicit log for recovered TopologyId
- Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup.
* Add Raft barrier to prevent TopologyId race condition
- Implement ensureTopologyId helper method
- Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId
- Ensures persisted TopologyId is recovered before generating new one
- Prevents race where generation happens during log replay
* Serialize TopologyId generation with mutex
- Add topologyIdGenLock mutex to MasterServer struct
- Wrap ensureTopologyId method with lock to prevent concurrent generation
- Fixes race where event listener and manual leadership check both generate IDs
- Second caller waits for first to complete and sees the generated ID
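A sketch of the serialized generation path under assumed names (the real MasterServer commits through Raft and only updates local state in Apply/Restore): the mutex guarantees a single generator, and the second caller observes the committed id.

    // Sketch only; id format and the commit callback are illustrative.
    package sketch

    import (
        "crypto/rand"
        "encoding/hex"
        "sync"
    )

    type master struct {
        topologyIdGenLock sync.Mutex
        topologyId        string
    }

    // ensureTopologyId generates a TopologyId only if none was recovered from
    // the Raft log or snapshot; commit must go through the Raft state machine
    // so local state is updated only after consensus.
    func (m *master) ensureTopologyId(commit func(id string) error) error {
        m.topologyIdGenLock.Lock()
        defer m.topologyIdGenLock.Unlock()
        if m.topologyId != "" {
            return nil // already recovered; nothing to generate
        }
        buf := make([]byte, 16)
        if _, err := rand.Read(buf); err != nil {
            return err
        }
        return commit(hex.EncodeToString(buf))
    }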
* Add TopologyId recovery logging to Apply method
- Change log level from V(1) to V(0) for visibility
- Log 'Recovered TopologyId' when applying from Raft log
- Ensures recovery is visible whether from snapshot or log replay
- Matches Recovery() method logging for consistency
* Fix Raft barrier timing issue
- Add 100ms delay after barrier command to ensure log application completes
- Add debug logging to track barrier execution and TopologyId state
- Return early if barrier command fails
- Prevents TopologyId generation before old logs are fully applied
* ensure leader
* address comments
* address comments
* redundant
* clean up
* double check
* refactoring
* comment