Replay a delete event for the old entry name during same-directory
renames so handlers like onBucketMetadataChange can clean up stale
state for the old name.
Find the source identity before checking for collisions, matching
the standalone handler's logic. Previously a non-existent user
renamed to an existing name would get EntityAlreadyExists instead
of NoSuchEntity.
Since credentialManager.CreateGroup may normalize the name (e.g.,
trim whitespace), use group.Name instead of the raw input for
the returned GroupData to ensure consistency.
Trim leading/trailing whitespace from group.Name before validation
in CreateGroup and UpdateGroup to prevent whitespace-only filenames.
Also merge groups by name during multi-file load to prevent duplicates.
Add require.Equal checks for 200 status after UpdateGroup calls
so the test fails immediately on API errors rather than relying
on the subsequent Eventually timeout.
Groups are always dynamic (from filer), never static (from s3.config).
Seeding from iam.groups caused stale deleted groups to persist.
Now only uses config.Groups from the dynamic filer config.
Pass nil newEntry to bucket, IAM, and circuit-breaker handlers for
the source directory during cross-directory moves, so all watchers
can clear caches for the moved-away resource.
Reorder UpdateUser to find the source identity first and return
NoSuchEntityException if not found, before checking if the rename
is a no-op. Previously a non-existent user renamed to itself
would incorrectly return success.
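A sketch of the corrected check order shared by this and the earlier rename fix; findIdentityIndex, identityExists, and the error values are illustrative stand-ins, not the actual handler code:

```go
// updateUserName sketches the fix: resolve the source identity first,
// so a missing user yields NoSuchEntity even when the target name
// already exists or the rename is a no-op.
func updateUserName(s3cfg *Config, oldName, newName string) error {
	idx := findIdentityIndex(s3cfg, oldName)
	if idx < 0 {
		return ErrNoSuchEntity // source must exist, even for no-op renames
	}
	if oldName != newName && identityExists(s3cfg, newName) {
		return ErrEntityAlreadyExists // collision checked only after source lookup
	}
	s3cfg.Identities[idx].Name = newName
	return nil
}
```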
Match the identity loader's merge behavior: find existing group
by name and replace, only append when no match exists. Prevents
duplicates when legacy and multi-file configs overlap.
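The merge rule in miniature (Group is a stand-in type):

```go
// mergeGroup replaces an existing group with the same name, or
// appends when no match exists, preventing duplicates when legacy
// and multi-file configs overlap.
func mergeGroup(groups []*Group, incoming *Group) []*Group {
	for i, g := range groups {
		if g.Name == incoming.Name {
			groups[i] = incoming // replace, don't duplicate
			return groups
		}
	}
	return append(groups, incoming)
}
```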
When a file is moved out of an IAM directory (e.g., /etc/iam/groups),
the dir variable was overwritten with NewParentPath, causing the
source directory change to be missed. Now also notifies handlers
about the source directory for cross-directory moves.
The basic lane's -run "TestIAM" regex also matched TestIAMGroup*
tests, causing them to run in both the basic and group lanes.
Replace with explicit test function names.
* fix: volume balance detection now returns multiple tasks per run (#8551)
Previously, detectForDiskType() returned at most 1 balance task per disk
type, making the MaxJobsPerDetection setting ineffective. The detection
loop now iterates within each disk type, planning multiple moves until
the imbalance drops below threshold or maxResults is reached. Effective
volume counts are adjusted after each planned move so the algorithm
correctly re-evaluates which server is overloaded.
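A compact sketch of the revised loop shape; all names here are illustrative stand-ins for the real detection code:

```go
package balance

// Task, threshold, and the helpers below are stand-ins.
type Task struct{ Source, Dest string }

const threshold = 1.1 // assumed imbalance-ratio cutoff

// detect plans moves until the imbalance drops below threshold or
// maxResults is reached, adjusting effective counts after each move
// so the next iteration re-evaluates which server is overloaded.
func detect(counts map[string]int, maxResults int) []Task {
	var tasks []Task
	for len(tasks) < maxResults {
		src, dst, ratio := pickMaxMin(counts)
		if ratio < threshold {
			break // balanced enough: stop planning
		}
		tasks = append(tasks, Task{Source: src, Dest: dst})
		counts[src]-- // the planned move is now "in flight"
		counts[dst]++
	}
	return tasks
}

func pickMaxMin(counts map[string]int) (maxS, minS string, ratio float64) {
	for s := range counts {
		if maxS == "" || counts[s] > counts[maxS] {
			maxS = s
		}
		if minS == "" || counts[s] < counts[minS] {
			minS = s
		}
	}
	switch {
	case counts[minS] > 0:
		ratio = float64(counts[maxS]) / float64(counts[minS])
	case counts[maxS] > 0:
		ratio = threshold + 1 // volumes vs. an empty server is imbalanced
	}
	return
}
```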
* fix: factor pending tasks into destination scoring and use UnixNano for task IDs
- Use UnixNano instead of Unix for task IDs to avoid collisions when
multiple tasks are created within the same second
- Adjust calculateBalanceScore to include LoadCount (pending + assigned
tasks) in the utilization estimate, so the destination picker avoids
stacking multiple planned moves onto the same target disk
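Both changes sketched, with illustrative names:

```go
package balance

import (
	"fmt"
	"time"
)

// newTaskID uses nanosecond resolution so two tasks created within
// the same second still get distinct IDs.
func newTaskID() string {
	return fmt.Sprintf("balance-%d", time.Now().UnixNano())
}

// balanceScore counts pending + assigned tasks (loadCount) as if
// those volumes had already landed, so the destination picker avoids
// stacking several planned moves onto the same disk.
func balanceScore(volumeCount, loadCount, capacity int) float64 {
	effective := float64(volumeCount + loadCount)
	return 1.0 - effective/float64(capacity) // higher score = better target
}
```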
* test: add comprehensive balance detection tests for complex scenarios
Cover multi-server convergence, max-server shifting, destination
spreading, pre-existing pending task skipping, no-duplicate-volume
invariant, and parameterized convergence verification across different
cluster shapes and thresholds.
* fix: address PR review findings in balance detection
- hasMore flag: compute from len(results) >= maxResults so the scheduler
knows more pages may exist, matching the vacuum/EC handler pattern
- Exhausted server fallthrough: when no eligible volumes remain on the
current maxServer (all have pending tasks) or destination planning
fails, mark the server as exhausted and continue to the next
overloaded server instead of stopping the entire detection loop
- Return canonical destination server ID directly from createBalanceTask
instead of resolving via findServerIDByAddress, eliminating the
fragile address→ID lookup for adjustment tracking
- Fix bestScore sentinel: use math.Inf(-1) instead of -1.0 so disks
with negative scores (high pending load, same rack/DC) are still
selected as the best available destination (sketched after this list)
- Add TestDetection_ExhaustedServerFallsThrough covering the scenario
where the top server's volumes are all blocked by pre-existing tasks
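The sentinel fix referenced above, sketched (Disk and score are stand-ins):

```go
// With bestScore initialized to -1.0, a candidate scoring e.g. -5.3
// (heavy pending load, same rack/DC) never beats the sentinel and no
// destination gets picked. math.Inf(-1) is below every real score.
bestScore := math.Inf(-1)
var best *Disk
for _, d := range candidates {
	if s := score(d); s > bestScore {
		bestScore, best = s, d
	}
}
```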
* test: fix computeEffectiveCounts and add len guard in no-duplicate test
- computeEffectiveCounts now takes a servers slice to seed counts for all
known servers (including empty ones) and uses an address→ID map from
the topology spec instead of scanning metrics, so destination servers
with zero initial volumes are tracked correctly
- TestDetection_NoDuplicateVolumesAcrossIterations now asserts len > 1
before checking duplicates, so the test actually fails if Detection
regresses to returning a single task
* fix: remove redundant HasAnyTask check in createBalanceTask
The HasAnyTask check in createBalanceTask duplicated the same check
already performed in detectForDiskType's volume selection loop.
Since detection runs single-threaded (MaxDetectionConcurrency: 1),
no race can occur between the two points.
* fix: consistent hasMore pattern and remove double-counted LoadCount in scoring
- Adopt vacuum_handler's hasMore pattern: over-fetch by 1, check
len > maxResults, and truncate, giving consistent truncation
semantics (see the sketch after this list)
- Remove direct LoadCount penalty in calculateBalanceScore since
LoadCount is already factored into effectiveVolumeCount for
utilization scoring; bump utilization weight from 40 to 50 to
compensate for the removed 10-point load penalty
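The pattern in miniature (illustrative; note that a later commit in this series drops the over-fetch for stateful detection):

```go
// Probe one result past the cap to learn whether more work exists,
// then trim back to the cap before returning.
results := detectForDiskType(counts, maxResults+1)
hasMore := len(results) > maxResults
if hasMore {
	results = results[:maxResults]
}
```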
* fix: handle zero maxResults as no-cap, emit trace after trim, seed empty servers
- When MaxResults is 0 (omitted), treat as no explicit cap instead of
defaulting to 1; only apply the +1 over-fetch probe when caller
supplies a positive limit
- Move decision trace emission after hasMore/trim so the trace
accurately reflects the returned proposals
- Seed serverVolumeCounts from ActiveTopology so servers that have a
matching disk type but zero volumes are included in the imbalance
calculation and MinServerCount check
* fix: nil-guard clusterInfo, uncap legacy DetectionFunc, deterministic disk type order
- Add early nil guard for clusterInfo in Detection to prevent panics
in downstream helpers (detectForDiskType, createBalanceTask)
- Change register.go DetectionFunc wrapper from maxResults=1 to 0
(no cap) so the legacy code path returns all detected tasks
- Sort disk type keys before iteration so results are deterministic
when maxResults spans multiple disk types (HDD/SSD)
* fix: don't over-fetch in stateful detection to avoid orphaned pending tasks
Detection registers planned moves in ActiveTopology via AddPendingTask,
so requesting maxResults+1 would create an extra pending task that gets
discarded during trim. Use len(results) >= maxResults as the hasMore
signal instead, which is correct since Detection already caps internally.
* fix: return explicit truncated flag from Detection instead of approximating
Detection now returns (results, truncated, error) where truncated is true
only when the loop stopped because it hit maxResults, not when it ran out
of work naturally. This eliminates false hasMore signals when detection
happens to produce exactly maxResults results by resolving the imbalance.
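Sketched (isBalanced and planNextMove are stand-ins):

```go
// truncated distinguishes "stopped at the cap with work remaining"
// from "cluster balanced at exactly maxResults results".
var truncated bool
for {
	if isBalanced(counts) {
		break // ran out of work naturally: truncated stays false
	}
	if maxResults > 0 && len(results) >= maxResults {
		truncated = true // the cap, not the balance, stopped us
		break
	}
	results = append(results, planNextMove(counts))
}
```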
* cleanup: simplify detection logic and remove redundancies
- Remove redundant clusterInfo nil check in detectForDiskType since
Detection already guards against nil clusterInfo
- Remove adjustments loop for destination servers not in
serverVolumeCounts — topology seeding ensures all servers with
matching disk type are already present
- Merge two-loop min/max calculation into a single loop: min across
all servers, max only among non-exhausted servers
- Replace magic number 100 with len(metrics) for minC initialization
in convergence test
* fix: accurate truncation flag, deterministic server order, indexed volume lookup
- Track balanced flag to distinguish "hit maxResults cap" from "cluster
balanced at exactly maxResults" — truncated is only true when there's
genuinely more work to do
- Sort servers for deterministic iteration and tie-breaking when
multiple servers have equal volume counts
- Pre-index volumes by server with per-server cursors to avoid
O(maxResults * volumes) rescanning on each iteration (see the
sketch after this list)
- Add truncation flag assertions to RespectsMaxResults test: true when
capped, false when detection finishes naturally
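A sketch of the index-plus-cursor structure referenced above (Volume and hasPendingTask are stand-ins):

```go
// Index volumes once, then advance a per-server cursor instead of
// rescanning the full volume list on every planning iteration.
byServer := make(map[string][]*Volume)
for _, v := range volumes {
	byServer[v.Server] = append(byServer[v.Server], v)
}
cursors := make(map[string]int)
nextCandidate := func(server string) *Volume {
	for cursors[server] < len(byServer[server]) {
		v := byServer[server][cursors[server]]
		cursors[server]++
		if !hasPendingTask(v.ID) {
			return v // first unblocked volume on this server
		}
	}
	return nil // server exhausted
}
```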
* fix: seed trace server counts from ActiveTopology to match detection logic
The decision trace was building serverVolumeCounts only from metrics,
missing zero-volume servers seeded from ActiveTopology by Detection.
This could cause the trace to report wrong server counts, incorrect
imbalance ratios, or spurious "too few servers" messages. Pass
activeTopology into the trace function and seed server counts the
same way Detection does.
* fix: don't exhaust server on per-volume planning failure, sort volumes by ID
- When createBalanceTask returns nil, continue to the next volume on
the same server instead of marking the entire server as exhausted.
The failure may be volume-specific (not found in topology, pending
task registration failed) and other volumes on the server may still
be viable candidates.
- Sort each server's volume slice by VolumeID after pre-indexing so
volume selection is fully deterministic regardless of input order.
* fix: use require instead of assert to prevent nil dereference panic in CORS test
The test used assert.NoError (non-fatal) for GetBucketCors, then
immediately accessed getResp.CORSRules. When the API returns an error,
getResp is nil causing a panic. Switch to require.NoError/NotNil/Len
so the test stops before dereferencing a nil response.
* fix: deterministic disk tie-breaking and stronger pre-existing task test
- Sort available disks by NodeID then DiskID before scoring so
destination selection is deterministic when two disks score equally
- Add task count bounds assertion to SkipsPreExistingPendingTasks test:
with 15 of 20 volumes already having pending tasks, at most 5 new
tasks should be created and at least 1 (imbalance still exists)
* fix: seed adjustments from existing pending/assigned tasks to prevent over-scheduling
Detection now calls ActiveTopology.GetTaskServerAdjustments() to
initialize the adjustments map with source/destination deltas from
existing pending and assigned balance tasks. This ensures
effectiveCounts reflects in-flight moves, preventing the algorithm
from planning additional moves in the same direction when prior
moves already address the imbalance.
Added GetTaskServerAdjustments(taskType) to ActiveTopology which
iterates pending and assigned tasks, decrementing source servers
and incrementing destination servers for the given task type.
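Roughly, with illustrative internals (pendingTasks/assignedTasks are stand-ins):

```go
// GetTaskServerAdjustments returns per-server volume-count deltas for
// in-flight tasks of the given type: each move counts -1 on its
// source server and +1 on its destination.
func (at *ActiveTopology) GetTaskServerAdjustments(taskType string) map[string]int {
	adjustments := make(map[string]int)
	for _, task := range append(at.pendingTasks(), at.assignedTasks()...) {
		if task.Type != taskType {
			continue
		}
		adjustments[task.SourceServer]--
		adjustments[task.DestServer]++
	}
	return adjustments
}
```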
Previously the merge started with empty group maps, dropping any
static-file groups. Now seeds from existing iam.groups before
overlaying dynamic config, and builds the reverse index after
merging to avoid stale entries from overridden groups.
The user S3 client may lack permissions by cleanup time since the
user is removed from the group in an earlier subtest. Use the admin
S3 client to ensure bucket and object cleanup always succeeds.
When renaming a user via UpdateUser, also update ParentUser references
in service accounts to prevent them from becoming orphaned after the
next configuration reload.
Group changes propagate to S3 servers via filer subscription
(watching /etc/iam/groups/) rather than gRPC RPCs, since there
are no group-specific RPCs in the S3 cache protocol.
Return errors instead of logging and continuing when group files
cannot be read or unmarshaled. This prevents silently applying a
partial IAM config with missing group memberships or policies.
Move the nil check on identity before accessing identity.Name to
prevent panic. Also refine hasAttachedPolicies to only consider groups
that are enabled and have actual policies attached, so membership in
a no-policy group doesn't incorrectly trigger IAM authorization.
Replace scattered defers with a single ordered t.Cleanup in each test
to ensure resources are torn down in reverse-creation order:
remove membership, detach policies, delete access keys, delete users,
delete groups, delete policies. Move bucket cleanup to parent test
scope and delete objects before bucket.
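A sketch with hypothetical helper names:

```go
// One t.Cleanup per test, tearing down in reverse-creation order so
// no step fails because a dependency was already removed.
t.Cleanup(func() {
	removeUserFromGroup(adminClient, userName, groupName)
	detachGroupPolicies(adminClient, groupName)
	deleteAccessKeys(adminClient, userName)
	deleteUser(adminClient, userName)
	deleteGroup(adminClient, groupName)
	deletePolicies(adminClient)
})
```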
If PutPolicies fails after moving inline policies to the new username,
restore both the identity name and the inline policies map to their
original state to avoid a partial-write window.
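Roughly (clonePolicyMap, moveInlinePolicies, and the store call are hypothetical):

```go
// Move inline policies to the new name, restoring both the identity
// name and the policy map if persistence fails.
oldPolicies := clonePolicyMap(inlinePolicies)
identity.Name = newName
moveInlinePolicies(inlinePolicies, oldName, newName)
if err := store.PutPolicies(ctx, inlinePolicies); err != nil {
	identity.Name = oldName      // roll back the rename
	inlinePolicies = oldPolicies // roll back the policy move
	return err
}
```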
Check managed policies from GetPolicies() instead of s3cfg.Policies
so dynamically created policies are found. Also add duplicate name
check to UpdateGroup rename.
Return ServiceFailure for credential manager errors instead of masking
them as NoSuchEntity. Also switch ListGroupsForUser to use s3cfg.Groups
instead of in-memory reverse index to avoid stale data. Add duplicate
name check to UpdateGroup rename.
The embedded IAM endpoint rejects anonymous requests. Replace
callIAMAPI with callIAMAPIAuthenticated that uses JWT bearer token
authentication via the test framework.
Add UpdateGroup action to enable/disable groups and rename groups
via the IAM API. This is a SeaweedFS extension (not in AWS SDK) used
by tests to toggle group disabled status.
Policies created via CreatePolicy through credentialManager are stored
in the credential store, not in s3cfg.Policies (which only has static
config policies). Change AttachGroupPolicy to use credentialManager.GetPolicy()
for policy existence validation.
* master: return 503/Unavailable during topology warmup after leader change
After a master restart or leader change, the topology is empty until
volume servers reconnect and send heartbeats. During this warmup window
(3 heartbeat intervals = 15 seconds), volume lookups that fail now
return 503 Service Unavailable (HTTP) or gRPC Unavailable instead of
404 Not Found, signaling clients to retry with other masters.
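A sketch of the gate; WarmupPulseMultiplier is named in a later commit in this series, and the internals here are illustrative:

```go
const WarmupPulseMultiplier = 3 // warmup = 3 heartbeat intervals

// IsWarmingUp reports whether the topology may still be missing
// volume-server heartbeats after a restart or leader change.
func (t *Topology) IsWarmingUp() bool {
	window := WarmupPulseMultiplier * t.pulseDuration // 3 * 5s = 15s
	return time.Since(t.lastLeaderChangeTime()) < window
}
```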
* master: skip warmup 503 on fresh start and single-master setups
- Check MaxVolumeId > 0 to distinguish restart from fresh start
(MaxVolumeId is Raft-persisted, so 0 means no prior data)
- Check peer count > 1 so single-master deployments aren't affected
(no point suggesting "retry with other masters" if there are none)
* master: address review feedback and block assigns during warmup
- Protect LastLeaderChangeTime with dedicated mutex (fix data race)
- Extract warmup multiplier as WarmupPulseMultiplier constant
- Derive Retry-After header from pulse config instead of hardcoding
- Only trigger warmup 503 for "not found" errors, not parse errors
- Return nil response (not partial) on gRPC Unavailable
- Add doc comments to IsWarmingUp, getter/setter, WarmupDuration
- Block volume assign requests (HTTP and gRPC) during warmup,
since the topology is incomplete and assignments would be unreliable
- Skip warmup behavior for single-master setups (no peers to retry)
* master: apply warmup to all setups, skip only on fresh start
Single-master restarts still have an empty topology until heartbeats
arrive, so warmup protection should apply there too. The only case
to skip is a fresh cluster start (MaxVolumeId == 0), which already
has no volumes to look up.
- Remove GetMasterCount() > 1 guard from all warmup checks
- Remove now-unused GetMasterCount helper
- Update error messages to "topology is still loading" (not
"retry with other masters" which doesn't apply to single-master)
* master: add client-side retry on Unavailable for lookup and assign
The server-side 503/Unavailable during warmup needs client cooperation.
Previously, LookupVolumeIds and Assign would immediately propagate the
error without retry.
Now both paths retry with exponential backoff (1s -> 1.5s -> ... up to
6s) when receiving Unavailable, respecting context cancellation. This
covers the warmup window where the master's topology is still loading
after a restart or leader change.
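Sketched, with an assumed LookupVolume call (the concrete RPC wiring differs, and a later commit swaps time.After for a stoppable timer):

```go
// Retry on Unavailable with 1.5x backoff capped at 6s, honoring ctx.
waitTime := time.Second
for {
	resp, err := client.LookupVolume(ctx, req)
	if status.Code(err) != codes.Unavailable {
		return resp, err // success, or an error retrying won't fix
	}
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-time.After(waitTime):
	}
	if waitTime = waitTime * 3 / 2; waitTime > 6*time.Second {
		waitTime = 6 * time.Second // cap the backoff
	}
}
```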
* master: seed warmup timestamp in legacy raft path at setup
The legacy raft path only set lastLeaderChangeTime inside the event
listener callback, which could fire after IsLeader() was already
observed as true in SetRaftServer. Seed the timestamp at setup time
(matching the hashicorp path) so IsWarmingUp() is active immediately.
* master: fix assign retry loop to cover full warmup window
The retry loop used waitTime <= maxWaitTime as a stop condition,
causing it to give up after ~13s while warmup lasts 15s. Now cap
each individual sleep at maxWaitTime but keep retrying until the
context is cancelled.
* master: preserve gRPC status in lookup retry and fix retry window
Return the raw gRPC error instead of wrapping with fmt.Errorf so
status.FromError() can extract the status code. Use proper gRPC
status check (codes.Unavailable) instead of string matching. Also
cap individual sleep at maxWaitTime while retrying until ctx is done.
* master: use gRPC status code instead of string matching in assign retry
Use status.FromError/codes.Unavailable instead of brittle
strings.Contains for detecting retriable gRPC errors in the
assign retry loop.
* master: use remaining warmup duration for Retry-After header
Set Retry-After to the remaining warmup time instead of the full
warmup duration, so clients don't wait longer than necessary.
* master: reset ret.Replicas before populating from assign response
Clear Replicas slice before appending to prevent duplicate entries
when the assign response is retried or when alternative requests
are attempted.
* master: add unit tests for warmup retry behavior
Test that Assign() and LookupVolumeIds() retry on codes.Unavailable
and stop promptly when the context is cancelled.
* master: record leader change time before initialization work
Move SetLastLeaderChangeTime() to fire immediately when the leader
change event is received, before DoBarrier(), EnsureTopologyId(),
and updatePeers(), so the warmup clock starts at the true moment
of leadership transition.
* master: use topology warmup duration in volume growth wait loop
Replace hardcoded constants.VolumePulsePeriod * 2 with
topo.IsWarmingUp() and topo.WarmupDuration() so the growth wait
stays in sync with the configured warmup window. Remove unused
constants import.
* master: resolve master before creating RPC timeout context
Move GetMaster() call before context.WithTimeout() so master
resolution blocking doesn't consume the gRPC call timeout.
* master: use NotFound flag instead of string matching for volume lookup
Add a NotFound field to LookupResult and set it in findVolumeLocation
when a volume is genuinely missing. Update HTTP and gRPC warmup
checks to use this flag instead of strings.Contains on the error
message.
* master: bound assign retry loop to 30s for deadline-free contexts
Without a context deadline, the Unavailable retry loop could spin
forever. Add a maxRetryDuration of 30s so the loop gives up even
when no context deadline is set.
* master: strengthen assign retry cancellation test
Verify the retry loop actually retried (callCount > 1) and that
the returned error is context.DeadlineExceeded, not just any error.
* master: extract shared retry-with-backoff utility
Add util.RetryWithBackoff for context-aware, bounded retry with
exponential backoff. Refactor both Assign() and LookupVolumeIds()
to use it instead of duplicating the retry/sleep/backoff logic.
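A self-contained sketch of what such a helper can look like, folding in refinements from later commits in this series (hard deadline bound, ctx.Err() on cancellation, stoppable timer); the real util.RetryWithBackoff signature may differ:

```go
package util

import (
	"context"
	"time"
)

// RetryWithBackoff retries operation with exponential backoff until it
// succeeds, returns a non-retriable error, ctx is done, or maxDuration
// elapses. Cancellation always surfaces ctx.Err().
func RetryWithBackoff(ctx context.Context, maxDuration time.Duration,
	retriable func(error) bool, operation func() error) error {

	deadline := time.Now().Add(maxDuration)
	waitTime := time.Second
	var lastErr error
	for {
		if ctx.Err() != nil {
			return ctx.Err() // never start an attempt after cancellation
		}
		if !time.Now().Before(deadline) {
			return lastErr // hard bound: no attempt past maxDuration
		}
		if lastErr = operation(); lastErr == nil || !retriable(lastErr) {
			return lastErr
		}
		sleep := waitTime
		if remaining := time.Until(deadline); sleep > remaining {
			sleep = remaining // cap the last sleep to the budget
		}
		timer := time.NewTimer(sleep)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
		}
		if waitTime = waitTime * 3 / 2; waitTime > 6*time.Second {
			waitTime = 6 * time.Second // cap backoff growth
		}
	}
}
```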
* master: cap waitTime in RetryWithBackoff to prevent unbounded growth
Cap the backoff waitTime at maxWaitTime so it doesn't grow
indefinitely in long-running retry scenarios.
* master: only return Unavailable during warmup when all lookups failed
For batched LookupVolume requests, return partial results when some
volumes are found. Only return codes.Unavailable when no volumes
were successfully resolved, so clients benefit from partial results
instead of retrying unnecessarily.
* master: set retriable error message in 503 response body
When returning 503 during warmup, replace the "not found" error
in the JSON body with "service warming up, please retry" so
clients don't treat it as a permanent error.
* master: guard empty master address in LookupVolumeIds
If GetMaster() returns empty (no master found or ctx cancelled),
return an appropriate error instead of dialing an empty address.
Returns ctx.Err() if context is done, otherwise codes.Unavailable
to trigger retry.
* master: add comprehensive tests for RetryWithBackoff
Test success after retries, non-retryable error handling, context
cancellation, and maxDuration cap with context.Background().
* master: enforce hard maxDuration bound in RetryWithBackoff
Use a deadline instead of elapsed-time check so the last sleep is
capped to remaining time. This prevents the total retry duration
from overshooting maxDuration by up to one full backoff interval.
* master: respect fresh-start bypass in RemainingWarmupDuration
Check IsWarmingUp() first (which returns false when MaxVolumeId==0)
so RemainingWarmupDuration returns 0 on fresh clusters.
* master: round up Retry-After seconds to avoid underestimating
Use math.Ceil so fractional remaining seconds (e.g. 1.9s) round
up to the next integer (2) instead of flooring down (1).
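In code (handler wiring assumed):

```go
// Round up so 1.9s remaining yields Retry-After: 2, not 1.
seconds := int(math.Ceil(topo.RemainingWarmupDuration().Seconds()))
w.Header().Set("Retry-After", strconv.Itoa(seconds))
```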
* master: tighten batch lookup warmup to all-NotFound only
Only return codes.Unavailable when every requested volume ID was
a transient not-found. Mixed cases with non-NotFound errors now
return the response with per-volume error details preserved.
* master: reduce retry log noise and fix timer leak
Lower per-attempt retry log from V(0) to V(1) to reduce noise
during warmup. Replace time.After with time.NewTimer to avoid
lingering timers when context is cancelled.
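The timer fix in isolation:

```go
// time.After's timer cannot be stopped and lingers until it fires;
// an explicit timer is released as soon as the context wins the select.
timer := time.NewTimer(waitTime)
select {
case <-ctx.Done():
	timer.Stop()
	return ctx.Err()
case <-timer.C:
}
```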
* master: add per-attempt timeout for assign RPC
Use a 10s per-attempt timeout so a single slow RPC can't consume
the entire 30s retry budget when ctx has no deadline.
* master: share single 30s retry deadline across assign request entries
The Assign() function iterates over primary and fallback requests,
previously giving each its own 30s RetryWithBackoff budget. With a
primary + fallback, the total could reach 60s. Compute one deadline
up front and pass the remaining budget to each RetryWithBackoff call
so the entire Assign() call stays within a single 30s cap.
* master: strengthen context-cancel test with DeadlineExceeded and retry assertions
Assert errors.Is(err, context.DeadlineExceeded) to verify the error
is specifically from the context deadline, and check callCount > 1
to prove retries actually occurred before cancellation. Mirrors the
pattern used in TestAssignStopsOnContextCancel.
* master: bound GetMaster with per-attempt timeout in LookupVolumeIds
GetMaster() calls WaitUntilConnected() which can block indefinitely
if no master is available. Previously it used the outer ctx, so a
slow master resolution could consume the entire RetryWithBackoff
budget in a single attempt. Move the per-attempt timeoutCtx creation
before the GetMaster call so both master resolution and the gRPC
LookupVolume RPC share one grpcTimeout-bounded attempt.
* master: use deadline-aware context for assign retry budget
The shared 30s deadline only limited RetryWithBackoff's internal
wall-clock tracking, but per-attempt contexts were still derived
from the original ctx and could run for up to 10s even when the
budget was nearly exhausted. Create a deadlineCtx from the computed
deadline and derive both RetryWithBackoff and per-attempt timeouts
from it so all operations honor the shared 30s cap.
* master: skip warmup gate for empty lookup requests
When VolumeOrFileIds is empty, the notFoundCount == len(req.VolumeOrFileIds)
check is trivially true (0 == 0), so empty lookup batches during warmup
returned codes.Unavailable and were retried endlessly. Add a
len(req.VolumeOrFileIds) > 0 guard so empty requests pass through.
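A sketch of the guard, using the field names from this commit:

```go
// Gate only when there was at least one lookup and all of them were
// transient not-founds; empty batches pass straight through.
allNotFound := len(req.VolumeOrFileIds) > 0 &&
	notFoundCount == len(req.VolumeOrFileIds)
if allNotFound && topo.IsWarmingUp() {
	return nil, status.Error(codes.Unavailable, "topology is still loading, please retry")
}
```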
* master: validate request fields before warmup gate in Assign
Move Replication and Ttl parsing before the IsWarmingUp() check so
invalid inputs get a proper validation error instead of being masked
by codes.Unavailable during warmup. Pure syntactic validation does
not depend on topology state and should run first.
* master: check deadline and context before starting retry attempt
RetryWithBackoff only checked the deadline and context after an
attempt completed or during the sleep select. If the deadline
expired or context was canceled during sleep, the next iteration
would still call operation() before detecting it. Add pre-operation
checks so no new attempt starts after the budget is exhausted.
* master: always return ctx.Err() on context cancellation in RetryWithBackoff
When ctx.Err() is non-nil, the pre-operation check was returning
lastErr instead of ctx.Err(). This broke callers checking
errors.Is(err, context.DeadlineExceeded) and contradicted the
documented contract. Always return ctx.Err() so the cancellation
reason is properly surfaced.
* master: handle warmup errors in StreamAssign without killing the stream
StreamAssign was returning codes.Unavailable errors from Assign
directly, which terminates the gRPC stream and breaks pooled
connections. Instead, return transient errors as in-band error
responses so the stream survives warmup periods.
Also reset assignClient in doAssign on Send/Recv failures so a
broken stream doesn't leave the proxy permanently dead.
* master: wait for warmup before slot search in findAndGrow
findEmptySlotsForOneVolume was called before the warmup wait loop,
selecting slots from an incomplete topology. Move the warmup wait
before slot search so volume placement uses the fully warmed-up
topology with all servers registered.
* master: add Retry-After header to /dir/assign warmup response
The /dir/lookup handler already sets Retry-After during warmup but
/dir/assign did not, leaving HTTP clients without guidance on when
to retry. Add the same header using RemainingWarmupDuration().
* master: only seed warmup timestamp on leader at startup
SetLastLeaderChangeTime was called unconditionally for both leader
and follower nodes. Followers don't need warmup state, and the
leader change event listener handles real elections. Move the seed
into the IsLeader() block so only the startup leader gets warmup
initialized.
* master: preserve codes.Unavailable for StreamAssign warmup errors in doAssign
StreamAssign returns transient warmup errors as in-band
AssignResponse.Error messages. doAssign was converting these to plain
fmt.Errorf, losing the codes.Unavailable classification needed for
the caller's retry logic. Detect warmup error messages and wrap them
as status.Error(codes.Unavailable) so RetryWithBackoff can retry.