New POST /block/volume/plan endpoint returns full placement preview:
resolved policy, ordered candidate list, selected primary/replicas,
and per-server rejection reasons with stable string constants.
Core design: evaluateBlockPlacement() is a pure function with no
registry/topology dependency. gatherPlacementCandidates() is the
single topology bridge point. Plan and create share the same planner;
the parity contract is that both yield the same ordered candidate list
for the same cluster state.
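
A minimal sketch of that split, in Go, assuming hypothetical type and
field names (ServerState, PlacementInput, and the rejection constants
stand in for the real identifiers):

    package placement

    // Hypothetical snapshot types; the real planner input carries more fields.
    type ServerState struct {
        Addr      string
        Alive     bool
        FreeBytes int64
    }

    type PlacementInput struct {
        Servers   []ServerState // snapshot produced by gatherPlacementCandidates()
        NeedBytes int64
        RF        int
    }

    type Candidate struct {
        Addr      string
        Rejection string // stable string constant; empty means eligible
    }

    type PlacementPlan struct {
        Candidates []Candidate // ordered: identical for plan and create
        Primary    string
        Replicas   []string
    }

    const (
        rejectNotAlive     = "server_not_alive"
        rejectInsufficient = "insufficient_capacity"
    )

    // evaluateBlockPlacement is pure: it reads only the snapshot it is
    // given, never the live registry or topology, which is what makes the
    // plan/create parity contract checkable.
    func evaluateBlockPlacement(in PlacementInput) PlacementPlan {
        var plan PlacementPlan
        for _, s := range in.Servers {
            c := Candidate{Addr: s.Addr}
            switch {
            case !s.Alive:
                c.Rejection = rejectNotAlive
            case s.FreeBytes < in.NeedBytes:
                c.Rejection = rejectInsufficient
            }
            plan.Candidates = append(plan.Candidates, c)
            if c.Rejection != "" {
                continue
            }
            if plan.Primary == "" {
                plan.Primary = s.Addr
            } else if len(plan.Replicas) < in.RF-1 {
                plan.Replicas = append(plan.Replicas, s.Addr)
            }
        }
        return plan
    }
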
Create path refactored: uses evaluateBlockPlacement() instead of
PickServer(), iterates all candidates (no 3-retry cap), recomputes
replica order after primary fallback. rf_not_satisfiable severity
is durability-mode-aware (warning for best_effort, error for strict).
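
A sketch of the refactored create loop, reusing the hypothetical types
above; tryCreatePrimary is an illustrative stand-in for the real RPC:

    package placement

    import "errors"

    // createWithFallback walks every eligible candidate instead of stopping
    // after three attempts, and recomputes the replica set after each
    // primary fallback so a failed primary never reappears as a replica.
    func createWithFallback(
        in PlacementInput,
        tryCreatePrimary func(addr string) error,
    ) (primary string, replicas []string, err error) {
        plan := evaluateBlockPlacement(in)
        for _, c := range plan.Candidates {
            if c.Rejection != "" {
                continue
            }
            if tryCreatePrimary(c.Addr) != nil {
                continue // fall through to the next ordered candidate
            }
            for _, r := range plan.Candidates {
                if r.Rejection == "" && r.Addr != c.Addr && len(replicas) < in.RF-1 {
                    replicas = append(replicas, r.Addr)
                }
            }
            return c.Addr, replicas, nil
        }
        // Severity of this condition is durability-mode-aware upstream:
        // a warning under best_effort, an error under strict.
        return "", nil, errors.New("rf_not_satisfiable")
    }
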
15 unit tests + 20 QA adversarial tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Preset system: ResolvePolicy resolves named presets (database, general,
throughput) with per-field overrides into concrete volume parameters.
Create path now uses resolved policy instead of ad-hoc validation.
New /block/volume/resolve diagnostic endpoint for dry-run resolution.
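
A sketch of the resolution step; the preset names come from this
change, while the field set, preset contents, and override keys are
illustrative:

    package policy

    import (
        "fmt"
        "strconv"
    )

    type VolumePolicy struct {
        ReplicaCount   int
        DurabilityMode string // "strict" or "best_effort"
    }

    // Hypothetical preset table standing in for the real definitions.
    var presets = map[string]VolumePolicy{
        "database":   {ReplicaCount: 3, DurabilityMode: "strict"},
        "general":    {ReplicaCount: 2, DurabilityMode: "best_effort"},
        "throughput": {ReplicaCount: 2, DurabilityMode: "best_effort"},
    }

    // ResolvePolicy merges per-field overrides onto a named preset,
    // yielding the concrete parameters consumed by the create path and
    // echoed by the /block/volume/resolve dry-run endpoint.
    func ResolvePolicy(name string, overrides map[string]string) (VolumePolicy, error) {
        p, ok := presets[name]
        if !ok {
            return VolumePolicy{}, fmt.Errorf("unknown preset %q", name)
        }
        if v, ok := overrides["replica_count"]; ok {
            n, err := strconv.Atoi(v)
            if err != nil {
                return VolumePolicy{}, fmt.Errorf("replica_count: %w", err)
            }
            p.ReplicaCount = n
        }
        if v, ok := overrides["durability_mode"]; ok {
            p.DurabilityMode = v
        }
        return p, nil
    }
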
Review fix 1 (MED): HasNVMeCapableServer now derives NVMe capability
from the server-level heartbeat attribute (block_nvme_addr proto field)
instead of scanning volume entries. Fixes false "no NVMe" warning on
fresh clusters with NVMe-capable servers but no volumes yet.
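
Roughly the shape of the fix; the heartbeat struct here is a stand-in
for the generated proto message:

    package topology

    // Stand-in for the heartbeat message carrying block_nvme_addr.
    type ServerHeartbeat struct {
        BlockNvmeAddr string
    }

    // HasNVMeCapableServer consults the server-level heartbeat attribute,
    // so a fresh cluster with NVMe-capable servers but zero volumes is
    // detected correctly (the old volume-entry scan returned a false
    // negative there).
    func HasNVMeCapableServer(servers []ServerHeartbeat) bool {
        for _, s := range servers {
            if s.BlockNvmeAddr != "" {
                return true
            }
        }
        return false
    }
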
Review fix 2 (LOW): /block/volume/resolve is no longer proxied to the
leader; as a read-only diagnostic endpoint it can be served by any master.
Engine fix: ReadLBA retry loop closes stale dirty-map race when WAL
entry is recycled between lookup and read.
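
A sketch of the retry shape; the lookup/read helpers are hypothetical
stand-ins for the engine internals:

    package engine

    // Hypothetical hooks into the dirty map and WAL.
    var (
        dirtyMapLookup     func(lba uint64) (entry uint64, gen uint64, ok bool)
        dirtyMapGeneration func(lba uint64) uint64
        readWAL            func(entry uint64, buf []byte) error
        readFromBase       func(lba uint64, buf []byte) error
    )

    // ReadLBA re-validates the dirty-map generation after reading: if the
    // WAL entry was recycled between lookup and read, the generation
    // changes and the loop retries instead of returning stale bytes.
    func ReadLBA(lba uint64, buf []byte) error {
        for {
            entry, gen, ok := dirtyMapLookup(lba)
            if !ok {
                return readFromBase(lba, buf)
            }
            if err := readWAL(entry, buf); err != nil {
                return err
            }
            if dirtyMapGeneration(lba) == gen {
                return nil // entry survived the read intact
            }
            // Recycled mid-read: retry with a fresh lookup.
        }
    }
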
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Six-task checkpoint hardening the promotion and failover paths:
T1: 4-gate candidate evaluation (heartbeat freshness, WAL lag, role,
server liveness) with structured rejection reasons; see the sketch
after this list.
T2: Orphaned-primary re-evaluation on replica reconnect (B-06/B-08).
T3: Deferred timer safety via epoch validation, which prevents stale
timers from firing on recreated/changed volumes (B-07).
T4: Rebuild addr cleanup on promotion (B-11), NVMe publication
refresh on heartbeat, and preflight endpoint wiring.
T5: Manual promote API (POST /block/volume/{name}/promote) with a
force flag, target server selection, and a structured rejection
response. Shared applyPromotionLocked/finalizePromotion helpers
eliminate duplication between auto and manual paths.
T6: Read-only preflight endpoint (GET /block/volume/{name}/preflight)
and blockapi client wrappers (Preflight, Promote).
BUG-T5-1: PromotionsTotal counter moved to finalizePromotion (shared
by both auto and manual paths) to prevent metrics divergence.
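
A sketch of the T1 gates; thresholds, field names, and reason strings
are illustrative, though the four gates match the list above:

    package promote

    import "time"

    // Hypothetical replica snapshot for candidate evaluation.
    type ReplicaState struct {
        Addr          string
        LastHeartbeat time.Time
        WALLagBytes   int64
        Role          string
        ServerAlive   bool
    }

    // evaluateCandidate applies the four gates in order and returns a
    // structured rejection reason for the first gate that fails.
    func evaluateCandidate(
        r ReplicaState, now time.Time,
        maxStale time.Duration, maxLag int64,
    ) (ok bool, reason string) {
        switch {
        case now.Sub(r.LastHeartbeat) > maxStale:
            return false, "heartbeat_stale"
        case r.WALLagBytes > maxLag:
            return false, "wal_lag_exceeded"
        case r.Role != "replica":
            return false, "wrong_role"
        case !r.ServerAlive:
            return false, "server_not_live"
        }
        return true, ""
    }
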
24 files changed, ~6500 lines added. 42 new QA adversarial tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two-phase prepare/commit/cancel protocol ensures all replicas expand
atomically. Standalone volumes use direct-commit (unchanged behavior).
Engine: PrepareExpand/CommitExpand/CancelExpand with on-disk
PreparedSize+ExpandEpoch in superblock, crash recovery clears stale
prepare state on open, v.mu serializes concurrent expand operations.
Proto: 3 new RPCs (PrepareExpand/CommitExpand/CancelExpandBlockVolume).
Coordinator: expandClean flag pattern, where ReleaseExpandInflight
runs only on
clean success or full cancel. Partial replica commit failure calls
MarkExpandFailed (keeps ExpandInProgress=true, suppresses heartbeat
size updates). ClearExpandFailed for manual reconciliation.
Registry: AcquireExpandInflight records PendingExpandSize+ExpandEpoch.
ExpandFailed state blocks new expands until cleared.
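
A sketch of the coordinator's control flow; the replica interface is
illustrative, while the RPC names and failure semantics follow the
description above:

    package expand

    type replica interface {
        PrepareExpand(size int64, epoch uint64) error
        CommitExpand(epoch uint64) error
        CancelExpand(epoch uint64) error
    }

    type result int

    const (
        expandClean  result = iota // safe: ReleaseExpandInflight
        expandFailed               // partial commit: MarkExpandFailed, keep inflight
    )

    func coordinateExpand(replicas []replica, size int64, epoch uint64) (result, error) {
        // Phase 1: prepare everywhere; any failure cancels the prepared
        // set, which is a full cancel and therefore still a clean release.
        for i, r := range replicas {
            if err := r.PrepareExpand(size, epoch); err != nil {
                for _, p := range replicas[:i] {
                    _ = p.CancelExpand(epoch)
                }
                return expandClean, err
            }
        }
        // Phase 2: commit everywhere; a failure after the first commit
        // leaves replicas diverged, so the volume is marked ExpandFailed
        // (ExpandInProgress stays true, heartbeat size updates suppressed)
        // until ClearExpandFailed is invoked manually.
        for i, r := range replicas {
            if err := r.CommitExpand(epoch); err != nil {
                if i == 0 {
                    // Nothing committed yet: cancel everywhere, still clean.
                    for _, p := range replicas {
                        _ = p.CancelExpand(epoch)
                    }
                    return expandClean, err
                }
                return expandFailed, err
            }
        }
        return expandClean, nil
    }
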
Tests: 15 engine + 4 VS + 10 coordinator + heartbeat suppression
regression + updated QA CP82/durability tests with prepare/commit mocks.
Also includes CP11A-1 remaining: QA storage profile tests, QA
io_backend config tests, testrunner perf-baseline scenarios and
coordinated-expand actions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add nvme_addr and nqn fields to proto messages (AllocateBlockVolume,
CreateBlockVolume, LookupBlockVolume, BlockVolumeInfoMessage), wire
through volume server → master registry → CSI driver. Volume servers
report NVMe address in heartbeats when NVMe target is running. CSI
MasterVolumeClient now populates NvmeAddr/NQN from master responses,
enabling NVMe/TCP via the master-backend path.
Proto files regenerated with protoc 29.5.
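
The CSI-side wiring, roughly; the response and location structs below
are stand-ins for the regenerated proto message and the driver's own
types:

    package csi

    // Stand-in for the generated LookupBlockVolume response message.
    type LookupBlockVolumeResponse struct {
        Addr     string
        NvmeAddr string // set when the volume server's NVMe target is running
        Nqn      string
    }

    type VolumeLocation struct {
        Addr     string
        NvmeAddr string
        NQN      string
    }

    // toLocation copies the NVMe fields reported via volume-server
    // heartbeats into the location handed to the NVMe/TCP attach path.
    func toLocation(resp *LookupBlockVolumeResponse) VolumeLocation {
        return VolumeLocation{
            Addr:     resp.Addr,
            NvmeAddr: resp.NvmeAddr,
            NQN:      resp.Nqn,
        }
    }
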
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* filer: add default log purging to master maintenance scripts
* filer: fix default maintenance scripts to include full set of tasks
* filer: refactor maintenance scripts to avoid duplication
* Prevent split-brain: Persistent ClusterId and Join Validation
- Persist ClusterId in Raft store to survive restarts.
- Validate ClusterId on Raft command application (piggybacked on MaxVolumeId).
- Prevent masters with conflicting ClusterIds from joining/operating together.
- Update Telemetry to report the persistent ClusterId.
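
A sketch of the piggybacked check; struct, field, and method names are
illustrative:

    package topology

    import "fmt"

    type Topology struct {
        clusterId   string
        maxVolumeId uint32
    }

    // Stand-in for the Raft command that piggybacks the sender's ClusterId.
    type MaxVolumeIdCommand struct {
        MaxVolumeId uint32
        ClusterId   string
    }

    func (t *Topology) applyMaxVolumeId(cmd MaxVolumeIdCommand) error {
        switch {
        case t.clusterId == "":
            t.clusterId = cmd.ClusterId // first sight: adopt, persisted via Raft
        case cmd.ClusterId != "" && cmd.ClusterId != t.clusterId:
            return fmt.Errorf("ClusterId mismatch (have %s, got %s): refusing to operate together",
                t.clusterId, cmd.ClusterId)
        }
        if cmd.MaxVolumeId > t.maxVolumeId {
            t.maxVolumeId = cmd.MaxVolumeId
        }
        return nil
    }
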
* Refine ClusterId validation based on feedback
- Improved error message in cluster_commands.go.
- Added ClusterId mismatch check in RaftServer.Recovery.
* Handle Raft errors and support Hashicorp Raft for ClusterId
- Check for errors when persisting ClusterId in legacy Raft.
- Implement ClusterId generation and persistence for Hashicorp Raft leader changes.
- Ensure consistent error logging.
* Refactor ClusterId validation
- Centralize ClusterId mismatch check in Topology.SetClusterId.
- Simplify MaxVolumeIdCommand.Apply and RaftServer.Recovery to rely on SetClusterId.
* Fix goroutine leak and add timeout
- Handle channel closure in Hashicorp Raft leader listener.
- Add timeout to Raft Apply call to prevent blocking.
* Fix deadlock in legacy Raft listener
- Wrap ClusterId generation/persistence in a goroutine to avoid blocking the Raft event loop (deadlock).
* Rename ClusterId to SystemId
- Renamed ClusterId to SystemId across the codebase (protobuf, topology, server, telemetry).
- Regenerated telemetry.pb.go with new field.
* Rename SystemId to TopologyId
- The rename to SystemId was an intermediate step.
- Final name is TopologyId for the persistent cluster identifier.
- Updated protobuf, topology, raft server, master server, and telemetry.
* Optimize Hashicorp Raft listener
- Integrated TopologyId generation into existing monitorLeaderLoop.
- Removed extra goroutine in master_server.go.
* Fix optimistic TopologyId update
- Removed premature local state update of TopologyId in master_server.go and raft_hashicorp.go.
- State is now solely updated via the Raft state machine Apply/Restore methods after consensus.
* Add explicit log for recovered TopologyId
- Added glog.V(0) info log in RaftServer.Recovery to print the recovered TopologyId on startup.
* Add Raft barrier to prevent TopologyId race condition
- Implement ensureTopologyId helper method
- Send no-op MaxVolumeIdCommand to sync Raft log before checking TopologyId
- Ensures persisted TopologyId is recovered before generating new one
- Prevents race where generation happens during log replay
* Serialize TopologyId generation with mutex
- Add topologyIdGenLock mutex to MasterServer struct
- Wrap ensureTopologyId method with lock to prevent concurrent generation
- Fixes race where event listener and manual leadership check both generate IDs
- Second caller waits for first to complete and sees the generated ID
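
A sketch combining the barrier and the mutex from the two commits
above; the stubs stand in for the real Raft plumbing:

    package master

    import "sync"

    type MasterServer struct {
        topologyIdGenLock sync.Mutex
        topologyId        string // mirror of state-machine state, set via Apply/Restore
    }

    func (ms *MasterServer) applyBarrier() error          { return nil }      // stub: no-op MaxVolumeIdCommand
    func (ms *MasterServer) applyTopologyId(string) error { return nil }      // stub: goes through Raft Apply
    func newRandomTopologyId() string                     { return "topo-x" } // stub

    // ensureTopologyId serializes generation and forces log replay to
    // finish (via the no-op barrier) before deciding whether an ID must be
    // created, closing the race where generation happened during replay.
    func (ms *MasterServer) ensureTopologyId() error {
        ms.topologyIdGenLock.Lock()
        defer ms.topologyIdGenLock.Unlock()

        if err := ms.applyBarrier(); err != nil {
            return err // bail: old logs may still carry a persisted ID
        }
        if ms.topologyId != "" {
            return nil // recovered from snapshot or log replay
        }
        return ms.applyTopologyId(newRandomTopologyId())
    }
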
* Add TopologyId recovery logging to Apply method
- Change log level from V(1) to V(0) for visibility
- Log 'Recovered TopologyId' when applying from Raft log
- Ensures recovery is visible whether from snapshot or log replay
- Matches Recovery() method logging for consistency
* Fix Raft barrier timing issue
- Add 100ms delay after barrier command to ensure log application completes
- Add debug logging to track barrier execution and TopologyId state
- Return early if barrier command fails
- Prevents TopologyId generation before old logs are fully applied
* ensure leader
* address comments
* address comments
* redundant
* clean up
* double check
* refactoring
* comment
* Added global http client
* Added Do func for global http client
* Changed the code to use the global http client
* Fixed http client in volume uploader
* Fixed pkg name
* Fixed http util funcs
* Fixed http client for bench_filer_upload
* Fixed http client for stress_filer_upload
* Fixed http client for filer_server_handlers_proxy
* Fixed http client for command_fs_merge_volumes
* Fixed http client for command_fs_merge_volumes and command_volume_fsck
* Fixed http client for s3api_server
* Added init global client for main funcs
* Rename global_client to client
* Changed:
  - fixed NewHttpClient
  - added CheckIsHttpsClientEnabled func
  - updated security.toml in scaffold
* Reduce the visibility of some functions in the util/http/client pkg
* Added the loadSecurityConfig function
* Use util.LoadSecurityConfiguration() in NewHttpClient func
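
A sketch of the intended flow; CheckIsHttpsClientEnabled and the CA
cert path are simplified stand-ins for what the util/http/client pkg
reads out of security.toml:

    package client

    import (
        "crypto/tls"
        "crypto/x509"
        "net/http"
        "os"
    )

    // Stubs standing in for values parsed from security.toml.
    func CheckIsHttpsClientEnabled() bool { return false }
    func clientCACertFile() string        { return "/etc/seaweedfs/ca.crt" }

    // NewHttpClient builds the shared global client, enabling TLS only
    // when the security configuration asks for an https client.
    func NewHttpClient() (*http.Client, error) {
        tr := &http.Transport{}
        if CheckIsHttpsClientEnabled() {
            pem, err := os.ReadFile(clientCACertFile())
            if err != nil {
                return nil, err
            }
            pool := x509.NewCertPool()
            pool.AppendCertsFromPEM(pem)
            tr.TLSClientConfig = &tls.Config{RootCAs: pool}
        }
        return &http.Client{Transport: tr}, nil
    }
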
* Added context for the MasterClient's methods to avoid endless loops
* Re-added the WithClient function. Added a WithClientCustomGetMaster function
* Hid unused ctx arguments
* Using a common context for the KeepConnectedToMaster and WaitUntilConnected functions
* Changed the context termination check in the tryConnectToMaster function
* Added a child context to the tryConnectToMaster function
* Added a common context for KeepConnectedToMaster and WaitUntilConnected functions in benchmark
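
The shared-context pattern in miniature; the MasterClient stub below
only marks the signatures:

    package example

    import "context"

    type MasterClient struct{}

    func (mc *MasterClient) KeepConnectedToMaster(ctx context.Context) { <-ctx.Done() } // stub
    func (mc *MasterClient) WaitUntilConnected(ctx context.Context)    {}               // stub

    // One context governs both the reconnect loop and the wait: cancelling
    // it stops tryConnectToMaster retries instead of leaving an endless loop.
    func run(mc *MasterClient) {
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()

        go mc.KeepConnectedToMaster(ctx)
        mc.WaitUntilConnected(ctx)
    }
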
* Remove old Raft servers if they don't answer pings for too long
(see the sketch after this list):
  - add ping durations as options
  - rename ping fields
  - fix some todos
  - get masters through masterclient
  - the leader removes the server via Raft
  - use Raft servers to ping them
  - CheckMastersAlive for Hashicorp Raft only
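
A sketch of that sweep; the peer bookkeeping and the removal hook are
illustrative:

    package master

    import "time"

    type peerPinger struct {
        lastSeen map[string]time.Time
        peers    func() []string         // from the Hashicorp Raft configuration
        ping     func(addr string) error // blocking ping with waitForReady
        isLeader func() bool
        remove   func(addr string)       // leader issues raft.RemoveServer
    }

    // checkMastersAlive pings each Raft peer and, as leader, removes
    // servers that have not answered for longer than deadAfter.
    func (p *peerPinger) checkMastersAlive(deadAfter time.Duration) {
        for _, addr := range p.peers() {
            if p.ping(addr) == nil {
                p.lastSeen[addr] = time.Now()
                continue
            }
            if p.isLeader() && time.Since(p.lastSeen[addr]) > deadAfter {
                p.remove(addr)
            }
        }
    }
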
* prepare blocking ping
* pass waitForReady as param
* pass waitForReady through all functions
* waitForReady works
* refactor
* remove unneeded params
* rollback unneeded changes
* fix