AllocateBlockVolumeResponse used bs.ListenAddr() to derive replica
addresses. When the VS binds to ":port" (no explicit IP), host
resolved to empty string, producing ":dataPort" as the replica
address. This ":port" propagated through master assignments to both
primary and replica sides.
Now canonicalizes empty/wildcard host using PreferredOutboundIP()
before constructing replication addresses. Also exported
PreferredOutboundIP for use by the server package.
This is the source fix — all downstream paths (heartbeat, API
response, assignment) inherit the canonical address.
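The canonicalization step can be sketched roughly as follows. This is a minimal illustration, not the actual change: canonicalizeAddr is a hypothetical helper, and fallbackIP stands in for whatever PreferredOutboundIP() returns.

```go
package main

import "net"

// canonicalizeAddr replaces an empty or wildcard host in addr with fallbackIP.
// Hypothetical helper: fallbackIP stands in for the value that
// PreferredOutboundIP() would supply in the real code.
func canonicalizeAddr(addr, fallbackIP string) (string, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	// ":port", "0.0.0.0:port" and "[::]:port" carry no routable host.
	if host == "" || host == "0.0.0.0" || host == "::" {
		host = fallbackIP
	}
	return net.JoinHostPort(host, port), nil
}
```

With this shape, an address that already has an explicit host passes through unchanged, so the helper is safe to apply on every path.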
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
setupReplicaReceiver now reads back canonical addresses from
the ReplicaReceiver (which applies CP13-2 canonicalization)
instead of storing raw assignment addresses in replStates.
This fixes the API-level leak where replica_data_addr showed
":port" instead of "ip:port" in /block/volumes responses,
even though the engine-level CP13-2 fix was working.
New BlockVol.ReplicaReceiverAddr() returns canonical addresses
from the running receiver. Falls back to assignment addresses
if receiver didn't report.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same-epoch reconciliation now trusts reported roles first:
- one claims primary, other replica → trust roles
- both claim primary → WALHeadLSN heuristic tiebreak
- both claim replica → keep existing, log ambiguity
Replaced addServerAsReplica with upsertServerAsReplica: checks
for existing replica entry by server name before appending.
Prevents duplicate ReplicaInfo rows during restart/replay windows.
2 new tests: role-trusted same-epoch, duplicate replica prevention.
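A rough sketch of the upsert-by-server-name idea. The struct fields below are simplified stand-ins, and upsertReplica is an illustrative name, not the real upsertServerAsReplica signature.

```go
package main

// ReplicaInfo and BlockVolumeEntry are reduced stand-ins for the real
// registry types; only the fields needed for the illustration are kept.
type ReplicaInfo struct {
	Server   string
	DataAddr string
}

type BlockVolumeEntry struct {
	Replicas []ReplicaInfo
}

// upsertReplica overwrites the existing row for r.Server if there is one,
// and appends only when the server is not yet listed.
func upsertReplica(e *BlockVolumeEntry, r ReplicaInfo) {
	for i := range e.Replicas {
		if e.Replicas[i].Server == r.Server {
			e.Replicas[i] = r
			return
		}
	}
	e.Replicas = append(e.Replicas, r)
}
```

Repeated heartbeats from the same server during a restart window then refresh one row instead of adding new ones.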
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a second server reports the same volume during master restart,
UpdateFullHeartbeat now uses epoch-based tie-breaking instead of
first-heartbeat-wins:
1. Higher epoch wins as primary — old entry demoted to replica
2. Same epoch — higher WALHeadLSN wins (heuristic, warning logged)
3. Lower epoch — added as replica
Applied in both code paths: the auto-register branch (no entry
exists yet for this name) and the unlinked-server branch (entry
exists but this server is not in it).
This is a deterministic reconstruction improvement, not ground
truth. The long-term fix is persisting authoritative volume state.
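The three rules above could be expressed along these lines. The report type and pickPrimary are hypothetical names for illustration; keeping the existing entry when both epoch and WALHeadLSN are equal is an assumption, since that case is not spelled out here.

```go
package main

// report is a hypothetical, reduced view of what one server's heartbeat
// says about a volume.
type report struct {
	Server     string
	Epoch      uint64
	WALHeadLSN uint64
}

// pickPrimary returns the report that should own the primary role and
// whether the decision relied on the WALHeadLSN heuristic (in which case
// the caller would log a warning).
func pickPrimary(existing, incoming report) (winner report, heuristic bool) {
	switch {
	case incoming.Epoch > existing.Epoch:
		return incoming, false // rule 1: higher epoch wins
	case incoming.Epoch < existing.Epoch:
		return existing, false // rule 3: lower epoch joins as replica
	case incoming.WALHeadLSN > existing.WALHeadLSN:
		return incoming, true // rule 2: same epoch, higher LSN wins
	default:
		return existing, true // assumption: full tie keeps the existing entry
	}
}
```

The loser of the comparison is the one recorded as a replica, regardless of which heartbeat arrived first.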
5 new tests covering all reconciliation scenarios.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lookup() and ListAll() now return value copies (not pointers to
internal registry state). Callers can no longer mutate registry
entries without holding a lock.
Added clone() on BlockVolumeEntry with deep-copied Replicas slice.
Added UpdateEntry(name, func(*BlockVolumeEntry)) for locked mutation.
ListByServer() also returns copies.
Migrated 1 production mutation (ReplicaPlacement + Preset in create
handler) and ~20 test mutations to use UpdateEntry.
5 new copy-correctness tests: Lookup returns copy, Replicas slice
isolated, ListAll returns copies, UpdateEntry mutates, UpdateEntry
not-found error.
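A minimal sketch of the copy-on-read pattern described above. The registry type and the entry fields are simplified assumptions; only clone(), Lookup() and UpdateEntry() mirror names from this change, and their real signatures may differ.

```go
package main

import (
	"errors"
	"sync"
)

type ReplicaInfo struct{ Server string }

// BlockVolumeEntry is reduced to a few fields for illustration.
type BlockVolumeEntry struct {
	Name         string
	VolumeServer string
	Replicas     []ReplicaInfo
}

// clone copies the struct and gives the copy its own Replicas backing array.
func (e *BlockVolumeEntry) clone() BlockVolumeEntry {
	c := *e
	c.Replicas = append([]ReplicaInfo(nil), e.Replicas...)
	return c
}

var errNotFound = errors.New("block volume not found")

type registry struct {
	mu      sync.RWMutex
	entries map[string]*BlockVolumeEntry
}

// Lookup returns a value copy, so callers cannot reach registry state.
func (r *registry) Lookup(name string) (BlockVolumeEntry, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	e, ok := r.entries[name]
	if !ok {
		return BlockVolumeEntry{}, false
	}
	return e.clone(), true
}

// UpdateEntry runs fn on the internal entry while holding the write lock.
func (r *registry) UpdateEntry(name string, fn func(*BlockVolumeEntry)) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[name]
	if !ok {
		return errNotFound
	}
	fn(e)
	return nil
}
```

Mutations that used to write through the returned pointer move into the closure passed to UpdateEntry.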
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New POST /block/volume/plan endpoint returns full placement preview:
resolved policy, ordered candidate list, selected primary/replicas,
and per-server rejection reasons with stable string constants.
Core design: evaluateBlockPlacement() is a pure function with no
registry/topology dependency. gatherPlacementCandidates() is the
single topology bridge point. Plan and create share the same planner —
parity contract is same ordered candidate list for same cluster state.
Create path refactored: uses evaluateBlockPlacement() instead of
PickServer(), iterates all candidates (no 3-retry cap), recomputes
replica order after primary fallback. rf_not_satisfiable severity
is durability-mode-aware (warning for best_effort, error for strict).
15 unit tests + 20 QA adversarial tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Preset system: ResolvePolicy resolves named presets (database, general,
throughput) with per-field overrides into concrete volume parameters.
Create path now uses resolved policy instead of ad-hoc validation.
New /block/volume/resolve diagnostic endpoint for dry-run resolution.
Review fix 1 (MED): HasNVMeCapableServer now derives NVMe capability
from server-level heartbeat attribute (block_nvme_addr proto field)
instead of scanning volume entries. Fixes false "no NVMe" warning on
fresh clusters with NVMe-capable servers but no volumes yet.
Review fix 2 (LOW): /block/volume/resolve no longer proxied to leader —
read-only diagnostic endpoint can be served by any master.
Engine fix: ReadLBA retry loop closes stale dirty-map race when WAL
entry is recycled between lookup and read.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Six-task checkpoint hardening the promotion and failover paths:
T1: 4-gate candidate evaluation (heartbeat freshness, WAL lag, role,
server liveness) with structured rejection reasons.
T2: Orphaned-primary re-evaluation on replica reconnect (B-06/B-08).
T3: Deferred timer safety — epoch validation prevents stale timers
from firing on recreated/changed volumes (B-07).
T4: Rebuild addr cleanup on promotion (B-11), NVMe publication
refresh on heartbeat, and preflight endpoint wiring.
T5: Manual promote API — POST /block/volume/{name}/promote with
force flag, target server selection, and structured rejection
response. Shared applyPromotionLocked/finalizePromotion helpers
eliminate duplication between auto and manual paths.
T6: Read-only preflight endpoint (GET /block/volume/{name}/preflight)
and blockapi client wrappers (Preflight, Promote).
BUG-T5-1: PromotionsTotal counter moved to finalizePromotion (shared
by both auto and manual paths) to prevent metrics divergence.
24 files changed, ~6500 lines added. 42 new QA adversarial tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
B-09: ExpandBlockVolume re-reads the registry entry after acquiring
the expand inflight lock. Previously it used the entry from the
initial Lookup, which could be stale if failover changed VolumeServer
or Replicas between Lookup and PREPARE.
B-10: UpdateFullHeartbeat stale-cleanup now skips entries with
ExpandInProgress=true. Previously a primary VS restart during
coordinated expand would delete the entry (path not in heartbeat),
orphaning the volume and stranding the expand coordinator.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two-phase prepare/commit/cancel protocol ensures all replicas expand
atomically. Standalone volumes use direct-commit (unchanged behavior).
Engine: PrepareExpand/CommitExpand/CancelExpand with on-disk
PreparedSize+ExpandEpoch in superblock, crash recovery clears stale
prepare state on open, v.mu serializes concurrent expand operations.
Proto: 3 new RPCs (PrepareExpand/CommitExpand/CancelExpandBlockVolume).
Coordinator: expandClean flag pattern — ReleaseExpandInflight only on
clean success or full cancel. Partial replica commit failure calls
MarkExpandFailed (keeps ExpandInProgress=true, suppresses heartbeat
size updates). ClearExpandFailed for manual reconciliation.
Registry: AcquireExpandInflight records PendingExpandSize+ExpandEpoch.
ExpandFailed state blocks new expands until cleared.
Tests: 15 engine + 4 VS + 10 coordinator + heartbeat suppression
regression + updated QA CP82/durability tests with prepare/commit mocks.
Also includes CP11A-1 remaining: QA storage profile tests, QA
io_backend config tests, testrunner perf-baseline scenarios and
coordinated-expand actions.
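The prepare/commit/cancel flow can be illustrated with an in-memory model. This is a sketch under assumptions: the volume type, method signatures, error values and expandAll are hypothetical, and superblock persistence, crash recovery and the MarkExpandFailed path are not modeled.

```go
package main

import (
	"errors"
	"sync"
)

var (
	errExpandPending = errors.New("expand already prepared")
	errNoPrepare     = errors.New("no matching prepared expand")
	errNotLarger     = errors.New("new size must exceed current size")
)

// volume keeps the prepared size and epoch next to the committed size.
type volume struct {
	mu           sync.Mutex // serializes concurrent expand operations
	size         uint64
	preparedSize uint64
	expandEpoch  uint64
}

func (v *volume) PrepareExpand(newSize, epoch uint64) error {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.preparedSize != 0 {
		return errExpandPending
	}
	if newSize <= v.size {
		return errNotLarger
	}
	v.preparedSize, v.expandEpoch = newSize, epoch
	return nil
}

func (v *volume) CommitExpand(epoch uint64) error {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.preparedSize == 0 || v.expandEpoch != epoch {
		return errNoPrepare
	}
	v.size = v.preparedSize
	v.preparedSize, v.expandEpoch = 0, 0
	return nil
}

func (v *volume) CancelExpand(epoch uint64) error {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.preparedSize == 0 || v.expandEpoch != epoch {
		return errNoPrepare
	}
	v.preparedSize, v.expandEpoch = 0, 0
	return nil
}

// expandAll prepares every volume first; any prepare failure cancels the
// ones already prepared, so no volume changes size unless all can.
func expandAll(vols []*volume, newSize, epoch uint64) error {
	for i, v := range vols {
		if err := v.PrepareExpand(newSize, epoch); err != nil {
			for _, p := range vols[:i] {
				_ = p.CancelExpand(epoch)
			}
			return err
		}
	}
	for _, v := range vols {
		if err := v.CommitExpand(epoch); err != nil {
			return err // partial commit: caller must reconcile manually
		}
	}
	return nil
}
```

A commit failure after some replicas have committed is the case the coordinator handles with MarkExpandFailed; the sketch only surfaces the error.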
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ReplicaInfo now carries NvmeAddr/NQN. Fields are populated during
replica allocation (tryCreateOneReplica), updated from replica
heartbeats, and copied in PromoteBestReplica. This ensures master
lookup returns correct NVMe endpoints immediately after failover,
without waiting for the first post-promotion heartbeat.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add nvme_addr and nqn fields to proto messages (AllocateBlockVolume,
CreateBlockVolume, LookupBlockVolume, BlockVolumeInfoMessage), wire
through volume server → master registry → CSI driver. Volume servers
report NVMe address in heartbeats when NVMe target is running. CSI
MasterVolumeClient now populates NvmeAddr/NQN from master responses,
enabling NVMe/TCP via the master-backend path.
Proto files regenerated with protoc 29.5.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ProcessBlockVolumeAssignments to BlockVolumeStore and wire
AssignmentSource/AssignmentCallback into the heartbeat collector's
Run() loop. Assignments are fetched and applied each tick after
status collection.
Bug fixes:
- BUG-CP4B3-1: TOCTOU between GetBlockVolume and HandleAssignment.
Added withVolume() helper that holds RLock across lookup+operation,
preventing RemoveBlockVolume from closing the volume mid-assignment.
- BUG-CP4B3-2: Data race on callback fields read by Run() goroutine.
Made StatusCallback/AssignmentSource/AssignmentCallback private,
added cbMu mutex and SetXxx() setter methods. Lock held only for
load/store, not during callback execution.
7 dev tests + 13 QA adversarial tests = 20 new tests.
972 total unit tests, all passing.
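Both fixes follow common locking patterns, sketched below. The store, blockVol and collector types are simplified stand-ins, and the method shapes are assumptions rather than the real signatures.

```go
package main

import (
	"errors"
	"sync"
)

var errVolumeNotFound = errors.New("block volume not found")

type blockVol struct{ closed bool }

type store struct {
	mu      sync.RWMutex
	volumes map[string]*blockVol
}

// withVolume holds the read lock across lookup and operation, so remove
// (which needs the write lock) cannot close the volume in between.
func (s *store) withVolume(path string, fn func(*blockVol) error) error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.volumes[path]
	if !ok {
		return errVolumeNotFound
	}
	return fn(v)
}

func (s *store) remove(path string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.volumes[path]; ok {
		v.closed = true
		delete(s.volumes, path)
	}
}

// collector keeps its callback private; cbMu guards only the load and
// store of the field, never the callback execution itself.
type collector struct {
	cbMu           sync.Mutex
	statusCallback func()
}

func (c *collector) SetStatusCallback(fn func()) {
	c.cbMu.Lock()
	c.statusCallback = fn
	c.cbMu.Unlock()
}

func (c *collector) invoke() {
	c.cbMu.Lock()
	fn := c.statusCallback
	c.cbMu.Unlock()
	if fn != nil {
		fn() // runs outside the lock
	}
}
```

Releasing cbMu before calling fn keeps a slow or re-entrant callback from blocking the setters.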
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BlockVolumeHeartbeatCollector periodically collects block volume status
via callback (standalone, no gRPC wiring yet). Store() accessor on
BlockService. Three bugs found by QA and fixed: Stop-before-Run deadlock
(BUG-CP4B2-1), zero interval panic (BUG-CP4B2-2), callback panic crashes
goroutine (BUG-CP4B2-3). 12 new tests (3 dev + 9 QA adversarial).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* iam: add XML responses for managed user policy APIs
* s3api: implement attach/detach/list attached user policies
* s3api: add embedded IAM tests for managed user policies
* iam: update CredentialStore interface and Manager for managed policies
Updated the `CredentialStore` interface to include `AttachUserPolicy`,
`DetachUserPolicy`, and `ListAttachedUserPolicies` methods.
The `CredentialManager` was updated to delegate these calls to the store.
Added common error variables for policy management.
* iam: implement managed policy methods in MemoryStore
Implemented `AttachUserPolicy`, `DetachUserPolicy`, and
`ListAttachedUserPolicies` in the MemoryStore.
Also ensured deep copying of identities includes PolicyNames.
* iam: implement managed policy methods in PostgresStore
Modified Postgres schema to include `policy_names` JSONB column in `users`.
Implemented `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`.
Updated user CRUD operations to handle policy names persistence.
* iam: implement managed policy methods in remaining stores
Implemented user policy management in:
- `FilerEtcStore` (partial implementation)
- `IamGrpcStore` (delegated via GetUser/UpdateUser)
- `PropagatingCredentialStore` (to broadcast updates)
Ensures cluster-wide consistency for policy attachments.
* s3api: refactor EmbeddedIamApi to use managed policy APIs
- Refactored `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`
to use `e.credentialManager` directly.
- Fixed a critical error suppression bug in `ExecuteAction` that always
returned success even on failure.
- Implemented robust error matching using string comparison fallbacks.
- Improved consistency by reloading configuration after policy changes.
* s3api: update and refine IAM integration tests
- Updated tests to use a real `MemoryStore`-backed `CredentialManager`.
- Refined test configuration synchronization using `sync.Once` and
manual deep-copying to prevent state corruption.
- Improved `extractEmbeddedIamErrorCodeAndMessage` to handle more XML
formats robustly.
- Adjusted test expectations to match current AWS IAM behavior.
* fix compilation
* visibility
* ensure 10 policies
* reload
* add integration tests
* Guard raft command registration
* Allow IAM actions in policy tests
* Validate gRPC policy attachments
* Revert Validate gRPC policy attachments
* Tighten gRPC policy attach/detach
* Improve IAM managed policy handling
* Improve managed policy filters
* filer: add default log purging to master maintenance scripts
* filer: fix default maintenance scripts to include full set of tasks
* filer: refactor maintenance scripts to avoid duplication
* Fix master leader election startup issue
Fixes #error-log-leader-not-selected-yet
* Fix master leader election startup issue
This change improves server address comparison using the 'Equals' method and handles recursion in topology leader lookup, resolving the 'leader not selected yet' error during master startup.
* Merge user improvements: use MaybeLeader for non-blocking checks
* not useful test
* Address code review: optimize Equals, fix deadlock in IsLeader, safe access in Leader
* fix multipart etag
* address comments
* clean up
* clean up
* optimization
* address comments
* unquoted etag
* dedup
* upgrade
* clean
* etag
* return quoted tag
* quoted etag
* debug
* s3api: unify ETag retrieval and quoting across handlers
Refactor newListEntry to take *S3ApiServer and use getObjectETag,
and update setResponseHeaders to use the same logic. This ensures
consistent ETags are returned for both listing and direct access.
* s3api: implement ListObjects deduplication for versioned buckets
Handle duplicate entries between the main path and the .versions
directory by prioritizing the latest version when bucket versioning
is enabled.
* s3api: cleanup stale main file entries during versioned uploads
Add explicit deletion of pre-existing "main" files when creating new
versions in versioned buckets. This prevents stale entries from
appearing in bucket listings and ensures consistency.
* s3api: fix cleanup code placement in versioned uploads
Correct the placement of rm calls in completeMultipartUpload and
putVersionedObject to ensure stale main files are properly deleted
during versioned uploads.
* s3api: improve getObjectETag fallback for empty ExtETagKey
Ensure that when ExtETagKey exists but contains an empty value,
the function falls through to MD5/chunk-based calculation instead
of returning an empty string.
* s3api: fix test files for new newListEntry signature
Update test files to use the new newListEntry signature where the
first parameter is *S3ApiServer. Created mockS3ApiServer to properly
test owner display name lookup functionality.
* s3api: use filer.ETag for consistent Md5 handling in getEtagFromEntry
Change getEtagFromEntry fallback to use filer.ETag(entry) instead of
filer.ETagChunks to ensure legacy entries with Attributes.Md5 are
handled consistently with the rest of the codebase.
* s3api: optimize list logic and fix conditional header logging
- Hoist bucket versioning check out of per-entry callback to avoid
repeated getVersioningState calls
- Extract appendOrDedup helper function to eliminate duplicate
dedup/append logic across multiple code paths
- Change If-Match mismatch logging from glog.Errorf to glog.V(3).Infof
and remove DEBUG prefix for consistency
* s3api: fix test mock to properly initialize IAM accounts
Fixed nil pointer dereference in TestNewListEntryOwnerDisplayName by
directly initializing the IdentityAccessManagement.accounts map in the
test setup. This ensures newListEntry can properly look up account
display names without panicking.
* cleanup
* s3api: remove premature main file cleanup in versioned uploads
Removed incorrect cleanup logic that was deleting main files during
versioned uploads. This was causing test failures because it deleted
objects that should have been preserved as null versions when
versioning was first enabled. The deduplication logic in listing is
sufficient to handle duplicate entries without deleting files during
upload.
* s3api: add empty-value guard to getEtagFromEntry
Added the same empty-value guard used in getObjectETag to prevent
returning quoted empty strings. When ExtETagKey exists but is empty,
the function now falls through to filer.ETag calculation instead of
returning "".
* s3api: fix listing of directory key objects with matching prefix
Revert prefix handling logic to use strings.TrimPrefix instead of
checking HasPrefix with empty string result. This ensures that when a
directory key object exactly matches the prefix (e.g. prefix="dir/",
object="dir/"), it is correctly handled as a regular entry instead of
being skipped or incorrectly processed as a common prefix. Also fixed
missing variable definition.
* s3api: refactor list inline dedup to use appendOrDedup helper
Refactored the inline deduplication logic in listFilerEntries to use the
shared appendOrDedup helper function. This ensures consistent behavior
and reduces code duplication.
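The empty-value guard described for getObjectETag and getEtagFromEntry can be sketched as below. resolveETag, quoteETag and etagKey are hypothetical names; the computed argument stands in for the filer.ETag(entry) fallback, and the quoting rule is an assumption based on the quoted-ETag items above.

```go
package main

// etagKey is a placeholder for the real ExtETagKey constant; the value
// used here is an assumption.
const etagKey = "etag"

// resolveETag prefers the stored extended attribute, but treats a present
// and empty value the same as a missing one and falls through to the
// computed ETag instead of returning a quoted empty string.
func resolveETag(extended map[string][]byte, computed string) string {
	if v, ok := extended[etagKey]; ok && len(v) > 0 {
		return quoteETag(string(v))
	}
	return quoteETag(computed)
}

// quoteETag adds surrounding double quotes unless they are already there.
func quoteETag(etag string) string {
	if len(etag) >= 2 && etag[0] == '"' && etag[len(etag)-1] == '"' {
		return etag
	}
	return `"` + etag + `"`
}
```

Routing listing and direct access through one function of this shape is what keeps the two paths returning the same ETag.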
* test: fix port allocation race in s3tables integration test
Updated startMiniCluster to find all required ports simultaneously using
findAvailablePorts instead of sequentially. This prevents race conditions
where the OS reallocates a port that was just released, causing multiple
services (e.g. Filer and Volume) to be assigned the same port and fail
to start.
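Finding all ports at once usually means holding every listener open until the full set is known. A sketch of that idea follows; pickDistinctPorts is a hypothetical name and may not match the actual findAvailablePorts signature.

```go
package main

import "net"

// pickDistinctPorts asks the OS for n free TCP ports and keeps every
// listener open until all n are chosen, so the OS cannot hand the same
// port out twice within one call.
func pickDistinctPorts(n int) ([]int, error) {
	listeners := make([]net.Listener, 0, n)
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()
	ports := make([]int, 0, n)
	for i := 0; i < n; i++ {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			return nil, err
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}
```

A window still exists between releasing the listeners and the services binding, but ports within one batch can no longer collide with each other.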
* Add a version token on `GetState()`/`SetState()` RPCs for volume server states.
* Make state version a property of `VolumeServerState` instead of an in-memory counter.
Also extend state atomicity to reads, instead of just writes.
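A version token of this kind is typically used for compare-and-set updates. The sketch below assumes that semantics: the stateStore type, the Maintenance field and the mismatch error are illustrative and not taken from the actual RPC definitions.

```go
package main

import (
	"errors"
	"sync"
)

var errVersionMismatch = errors.New("state version mismatch")

// VolumeServerState carries its own version; the Maintenance field is an
// illustrative example of a state flag.
type VolumeServerState struct {
	Version     uint64
	Maintenance bool
}

type stateStore struct {
	mu    sync.RWMutex
	state VolumeServerState
}

// GetState returns a consistent snapshot, including the version token.
func (s *stateStore) GetState() VolumeServerState {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.state
}

// SetState applies next only if the caller's version matches the stored
// one (assumed compare-and-set semantics), then bumps the version.
func (s *stateStore) SetState(next VolumeServerState) (VolumeServerState, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if next.Version != s.state.Version {
		return s.state, errVersionMismatch
	}
	next.Version++
	s.state = next
	return s.state, nil
}
```

A caller reads the state, modifies the copy, and writes it back; a concurrent writer in between makes the second write fail instead of silently overwriting.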
Implement index (fast) scrubbing for regular/EC volumes via `ScrubVolume()`/`ScrubEcVolume()`.
Also rearranges existing index test files for reuse across unit tests for different modules.
* fix float stepping
* do not auto refresh
* only log on non-200 status
* fix maintenance task sorting and cleanup redundant handler logic
* Refactor log retrieval to persist to disk and fix slowness
- Move log retrieval to disk-based persistence in GetMaintenanceTaskDetail
- Implement background log fetching on task completion in worker_grpc_server.go
- Implement async background refresh for in-progress tasks
- Completely remove blocking gRPC calls from the UI path to fix 10s timeouts
- Cleanup debug logs and performance profiling code
* Ensure consistent deterministic sorting in config_persistence cleanup
* Replace magic numbers with constants and remove debug logs
- Added descriptive constants for truncation limits and timeouts in admin_server.go and worker_grpc_server.go
- Replaced magic numbers with these constants throughout the codebase
- Verified removal of stdout debug printing
- Ensured consistent truncation logic during log persistence
* Address code review feedback on history truncation and logging logic
- Fix AssignmentHistory double-serialization by copying task in GetMaintenanceTaskDetail
- Fix handleTaskCompletion logging logic (mutually exclusive success/failure logs)
- Remove unused Timeout field from LogRequestContext and sync select timeouts with constants
- Ensure AssignmentHistory is only provided in the top-level field for better JSON structure
* Implement goroutine leak protection and request deduplication
- Add request deduplication in RequestTaskLogs to prevent multiple concurrent fetches for the same task
- Implement safe cleanup in timeout handlers to avoid race conditions in pendingLogRequests map
- Add a 10s cooldown for background log refreshes in GetMaintenanceTaskDetail to prevent spamming
- Ensure all persistent log-fetching goroutines are bounded and efficiently managed
* Fix potential nil pointer panics in maintenance handlers
- Add nil checks for adminServer in ShowTaskDetail, ShowMaintenanceWorkers, and UpdateTaskConfig
- Update getMaintenanceQueueData to return a descriptive error instead of nil when adminServer is uninitialized
- Ensure internal helper methods consistently check for adminServer initialization before use
* Strictly enforce disk-only log reading
- Remove background log fetching from GetMaintenanceTaskDetail to prevent timeouts and network calls during page view
- Remove unused lastLogFetch tracking fields to clean up dead code
- Ensure logs are only updated upon task completion via handleTaskCompletion
* Refactor GetWorkerLogs to read from disk
- Update /api/maintenance/workers/:id/logs endpoint to use configPersistence.LoadTaskExecutionLogs
- Remove synchronous gRPC call RequestTaskLogs to prevent timeouts and bad gateway errors
- Ensure consistent log retrieval behavior across the application (disk-only)
* Fix timestamp parsing in log viewer
- Update task_detail.templ JS to handle both ISO 8601 strings and Unix timestamps
- Fix "Invalid time value" error when displaying logs fetched from disk
- Regenerate templates
* master: fallback to HDD if SSD volumes are full in Assign
* worker: improve EC detection logging and fix skip counters
* worker: add Sync method to TaskLogger interface
* worker: implement Sync and ensure logs are flushed before task completion
* admin: improve task log retrieval with retries and better timeouts
* admin: robust timestamp parsing in task detail view
* Fix: Initialize filer CredentialManager with filer address
* The fix involves checking for directory existence before creation.
* adjust error message
* Fix: Implement FilerAddressSetter in PropagatingCredentialStore
* Refactor: Reorder credential manager initialization in filer server
* refactor
* Implement RPC skeleton for regular/EC volumes scrubbing.
See https://github.com/seaweedfs/seaweedfs/issues/8018 for details.
* Minor proto improvements for `ScrubVolume()`, `ScrubEcVolume()`:
- Add fields for scrubbing details in `ScrubVolumeResponse` and `ScrubEcVolumeResponse`,
instead of reporting these through RPC errors.
- Return a list of broken shards when scrubbing EC volumes, via `EcShardInfo`.
* Bootstrap persistent state for volume servers.
This PR implements logic to load/save persistent state information for storages
associated with volume servers, and to report state changes back to masters
via heartbeat messages.
More work ensues!
See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.
* Add volume server RPCs to read and update state flags.
* Block RPC operations writing to volume servers when maintenance mode is on.