* fix: check freeEcSlot before evacuating EC shards to prevent data loss
Related to #7619
The moveAwayOneEcVolume function was missing the freeEcSlot check that
exists in other EC shard placement functions. This could cause EC shards
to be moved to volume servers that have no capacity, resulting in:
1. 0-byte shard files when disk is full
2. Data loss when source shards are deleted after 'successful' copy
Changes:
- Add freeEcSlot check before attempting to move EC shards
- Sort destinations by both shard count and free slots
- Refresh topology during evacuation to get updated slot counts
- Log when nodes are skipped due to no free slots
- Update freeEcSlot count after successful moves
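A minimal sketch of the capacity check and destination ordering described above, using simplified stand-in types (EcNode here is illustrative, not the actual shell struct):

```go
package sketch

import (
	"log"
	"sort"
)

// EcNode is a simplified stand-in for a candidate volume server.
type EcNode struct {
	id         string
	shardCount int
	freeEcSlot int
}

// pickDestination sorts candidates by shard count (ascending) and free EC
// slots (descending), then returns the first node that still has capacity.
// It returns nil when every candidate is full, so the caller can fail the
// move instead of copying a shard onto a server with no room for it.
func pickDestination(candidates []*EcNode) *EcNode {
	sort.Slice(candidates, func(i, j int) bool {
		if candidates[i].shardCount != candidates[j].shardCount {
			return candidates[i].shardCount < candidates[j].shardCount
		}
		return candidates[i].freeEcSlot > candidates[j].freeEcSlot
	})
	for _, node := range candidates {
		if node.freeEcSlot > 0 {
			return node
		}
		log.Printf("skipping %s: no free EC slots", node.id)
	}
	return nil
}
```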
* fix: clarify comment wording per CodeRabbit review
The comment said 'after each move', but the check actually runs before
moveAwayOneEcVolume is called. Updated to 'before moving each EC volume'
for accuracy.
* fix: collect topology once and track capacity changes locally
Remove the topology refresh within the loop as it gives a false sense
of correctness - the refreshed topology could still be stale (minutes old).
Instead, we:
1. Collect topology once at the start
2. Track capacity changes ourselves via freeEcSlot decrement after each move
This is more accurate because we know exactly what moves we've made,
rather than relying on potentially stale topology refreshes.
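In sketch form, with illustrative types and a moveShard stand-in for the real RPC call, the bookkeeping looks like this:

```go
package sketch

import "fmt"

// destination is a minimal stand-in for a candidate volume server.
type destination struct {
	id         string
	freeEcSlot int
}

// evacuateVolumes illustrates the bookkeeping described above: the caller
// collects topology (the destinations slice) once, and after every
// successful shard move the free-slot count is decremented locally, so
// later placements reflect the moves already made instead of a possibly
// stale topology refresh.
func evacuateVolumes(shardsPerVolume map[uint32]int, destinations []*destination,
	moveShard func(volumeId uint32, dst *destination) error) error {

	pick := func() *destination {
		var best *destination
		for _, d := range destinations {
			if d.freeEcSlot > 0 && (best == nil || d.freeEcSlot > best.freeEcSlot) {
				best = d
			}
		}
		return best
	}

	for vid, shardCount := range shardsPerVolume {
		for i := 0; i < shardCount; i++ {
			dst := pick()
			if dst == nil {
				return fmt.Errorf("no destination with free EC slots for volume %d", vid)
			}
			if err := moveShard(vid, dst); err != nil {
				return err
			}
			dst.freeEcSlot-- // track the capacity change ourselves
		}
	}
	return nil
}
```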
* fix: ensure partial EC volume moves are reported as failures
Set hasMoved=false when a shard fails to move, even if previous shards
succeeded. This prevents the caller from incorrectly assuming the entire
volume was evacuated, which could lead to data loss if the source server
is decommissioned based on this incorrect status.
* fix: also reset hasMoved on moveMountedShardToEcNode error
Same issue as the previous fix: if moveMountedShardToEcNode fails
after some shards succeeded, hasMoved would incorrectly be true.
Ensure partial moves are always reported as failures.
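Both fixes boil down to the same rule, sketched here with a hypothetical moveShard callback: hasMoved may only stay true if every shard of the volume made it across.

```go
package sketch

// moveAllShards mirrors the hasMoved handling described above: any failure,
// even after earlier shards succeeded, resets hasMoved to false so the
// caller never treats a partially evacuated volume as fully moved and
// decommissions the source server on that basis.
func moveAllShards(shardIds []uint32, moveShard func(shardId uint32) error) (hasMoved bool, err error) {
	for _, shardId := range shardIds {
		if err = moveShard(shardId); err != nil {
			hasMoved = false // partial moves are reported as failures
			return
		}
		hasMoved = true
	}
	return
}
```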
* Add placement package for EC shard placement logic
- Consolidate EC shard placement algorithm for reuse across shell and worker tasks
- Support multi-pass selection: racks, then servers, then disks
- Include proper spread verification and scoring functions
- Comprehensive test coverage for various cluster topologies
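A greedy sketch of the multi-pass idea, with an illustrative DiskCandidate type rather than the real placement API:

```go
package sketch

import "fmt"

// DiskCandidate is a simplified placement candidate; the real placement
// package carries more fields (data center, shard counts, and so on).
type DiskCandidate struct {
	Rack     string
	Server   string
	DiskId   uint32
	FreeSlot int
}

// selectDestinations spreads shards across racks first, then across servers
// within a rack, then across disks within a server, using free slots as the
// final tie-breaker.
func selectDestinations(candidates []DiskCandidate, shardCount int) []DiskCandidate {
	usedRacks := map[string]int{}
	usedServers := map[string]int{}
	usedDisks := map[string]int{}
	diskKey := func(c DiskCandidate) string { return fmt.Sprintf("%s/%d", c.Server, c.DiskId) }

	// better reports whether candidate a is preferable to candidate b.
	better := func(a, b DiskCandidate) bool {
		if usedRacks[a.Rack] != usedRacks[b.Rack] {
			return usedRacks[a.Rack] < usedRacks[b.Rack]
		}
		if usedServers[a.Server] != usedServers[b.Server] {
			return usedServers[a.Server] < usedServers[b.Server]
		}
		if usedDisks[diskKey(a)] != usedDisks[diskKey(b)] {
			return usedDisks[diskKey(a)] < usedDisks[diskKey(b)]
		}
		return a.FreeSlot > b.FreeSlot
	}

	var picked []DiskCandidate
	for len(picked) < shardCount {
		best := -1
		for i, c := range candidates {
			if c.FreeSlot <= 0 {
				continue
			}
			if best < 0 || better(c, candidates[best]) {
				best = i
			}
		}
		if best < 0 {
			break // not enough free slots to place every shard
		}
		c := candidates[best]
		candidates[best].FreeSlot--
		usedRacks[c.Rack]++
		usedServers[c.Server]++
		usedDisks[diskKey(c)]++
		picked = append(picked, c)
	}
	return picked
}
```

The real package exposes this through SelectDestinations with richer scoring and spread verification; the sketch only shows the rack-then-server-then-disk ordering.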
* Make ec.balance disk-aware for multi-disk servers
- Add EcDisk struct to track individual disks on volume servers
- Update EcNode to maintain per-disk shard distribution
- Parse disk_id from EC shard information during topology collection
- Implement pickBestDiskOnNode() for selecting best disk per shard
- Add diskDistributionScore() for tie-breaking node selection
- Update all move operations to specify target disk in RPC calls
- Improves shard balance within multi-disk servers, not just across servers
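One plausible shape of the per-disk selection and tie-breaking score (the real EcDisk struct and scoring may differ):

```go
package sketch

// EcDisk tracks one disk on a volume server (illustrative fields).
type EcDisk struct {
	DiskId     uint32
	ShardCount int
	FreeSlots  int
}

// pickBestDiskOnNode chooses the disk with the fewest EC shards that still
// has a free slot, so shards spread evenly across a server's disks.
func pickBestDiskOnNode(disks []EcDisk) (EcDisk, bool) {
	var best EcDisk
	found := false
	for _, d := range disks {
		if d.FreeSlots <= 0 {
			continue
		}
		if !found || d.ShardCount < best.ShardCount {
			best, found = d, true
		}
	}
	return best, found
}

// diskDistributionScore is one possible tie-breaker between otherwise equal
// servers: a smaller spread between the most and least loaded disks means
// the server can absorb another shard with less internal imbalance.
func diskDistributionScore(disks []EcDisk) int {
	if len(disks) == 0 {
		return 0
	}
	minCount, maxCount := disks[0].ShardCount, disks[0].ShardCount
	for _, d := range disks[1:] {
		if d.ShardCount < minCount {
			minCount = d.ShardCount
		}
		if d.ShardCount > maxCount {
			maxCount = d.ShardCount
		}
	}
	return maxCount - minCount
}
```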
* Use placement package in EC detection for consistent disk-level placement
- Replace custom EC disk selection logic with shared placement package
- Convert topology DiskInfo to placement.DiskCandidate format
- Use SelectDestinations() for multi-rack/server/disk spreading
- Convert placement results back to topology DiskInfo for task creation
- Ensures EC detection uses same placement logic as shell commands
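The conversion layer is a thin adapter; reusing the illustrative DiskCandidate from the placement sketch above, it amounts to something like:

```go
// diskRecord is an illustrative stand-in for the topology's per-disk info
// (not the real master_pb.DiskInfo).
type diskRecord struct {
	Rack, Server string
	DiskId       uint32
	FreeEcSlots  int
}

// toCandidates maps topology disk records into placement candidates so
// detection and shell commands run through the same selection logic; the
// chosen candidates are mapped back to topology disks by (Server, DiskId).
func toCandidates(records []diskRecord) []DiskCandidate {
	out := make([]DiskCandidate, 0, len(records))
	for _, r := range records {
		out = append(out, DiskCandidate{
			Rack:     r.Rack,
			Server:   r.Server,
			DiskId:   r.DiskId,
			FreeSlot: r.FreeEcSlots,
		})
	}
	return out
}
```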
* Make volume server evacuation disk-aware
- Use pickBestDiskOnNode() when selecting evacuation target disk
- Specify target disk in evacuation RPC requests
- Maintains balanced disk distribution during server evacuations
* Rename PlacementConfig to PlacementRequest for clarity
PlacementRequest better reflects that this is a request for placement
rather than a configuration object. This improves API semantics.
* Rename DefaultConfig to DefaultPlacementRequest
Aligns with the PlacementRequest type naming for consistency
* Address review comments from Gemini and CodeRabbit
Fix HIGH issues:
- Fix empty disk discovery: Now discovers all disks from VolumeInfos,
not just from EC shards. This ensures disks without EC shards are
still considered for placement.
- Fix EC shard count calculation in detection.go: Now correctly filters
by DiskId and sums actual shard counts using ShardBits.ShardIdCount()
instead of just counting EcShardInfo entries.
Fix MEDIUM issues:
- Add disk ID to evacuation log messages for consistency with other logging
- Remove unused serverToDisks variable in placement.go
- Fix comment that incorrectly said 'ascending' when sorting is 'descending'
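In sketch form, the corrected per-disk shard count (field names are illustrative, not the exact protobuf message; the real code uses ShardBits.ShardIdCount()):

```go
package sketch

import "math/bits"

// ecShardInfo is an illustrative stand-in for the EC shard information
// message; ShardBits is a bitmap with one bit per shard present.
type ecShardInfo struct {
	DiskId    uint32
	ShardBits uint32
}

// countShardsOnDisk filters shard records by disk and sums the number of
// set bits, i.e. the actual shard count, rather than counting records.
func countShardsOnDisk(infos []ecShardInfo, diskId uint32) int {
	total := 0
	for _, info := range infos {
		if info.DiskId != diskId {
			continue
		}
		total += bits.OnesCount32(info.ShardBits)
	}
	return total
}
```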
* add ec tests
* Update ec-integration-tests.yml
* Update ec_integration_test.go
* Fix EC integration tests CI: build weed binary and update actions
- Add 'Build weed binary' step before running tests
- Update actions/setup-go from v4 to v6 (Node20 compatibility)
- Update actions/checkout from v2 to v4 (Node20 compatibility)
- Move working-directory to test step only
* Add disk-aware EC rebalancing integration tests
- Add TestDiskAwareECRebalancing test with multi-disk cluster setup
- Test EC encode with disk awareness (shows disk ID in output)
- Test EC balance with disk-level shard distribution
- Add helper functions for disk-level verification:
- startMultiDiskCluster: 3 servers x 4 disks each
- countShardsPerDisk: track shards per disk per server
- calculateDiskShardVariance: measure distribution balance
- Verify no single disk is overloaded with shards
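A sketch of the balance metric those helpers compute, with an illustrative result shape (server, then disk id, then shard count):

```go
package sketch

// shardCounts indexes shard counts by server address, then by disk id,
// roughly the shape countShardsPerDisk would return.
type shardCounts map[string]map[uint32]int

// diskShardVariance computes the population variance of per-disk shard
// counts across the whole cluster; a low value means no single disk is
// overloaded relative to the others.
func diskShardVariance(counts shardCounts) float64 {
	var all []float64
	for _, disks := range counts {
		for _, n := range disks {
			all = append(all, float64(n))
		}
	}
	if len(all) == 0 {
		return 0
	}
	var mean float64
	for _, v := range all {
		mean += v
	}
	mean /= float64(len(all))
	var variance float64
	for _, v := range all {
		variance += (v - mean) * (v - mean)
	}
	return variance / float64(len(all))
}
```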
* Unify the parameter to disable dry-run on weed shell commands to --apply (instead of --force).
* lint
* refactor
* Correct execution order
* Handle deprecated --force flag
* Fix help messages
* Refactoring: use flag.FlagSet.Visit()
* Consistent with other commands
* Check for both flags
* Fix TOML files
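A sketch of how the deprecated flag can be honored next to the new one, using flag.FlagSet.Visit(), which only visits flags the user explicitly set (the surrounding command wiring here is illustrative):

```go
package sketch

import (
	"flag"
	"fmt"
)

// resolveApply returns whether the command should skip dry-run mode. It
// honors the new -apply flag, but also checks whether the deprecated
// -force flag was explicitly provided, via FlagSet.Visit.
func resolveApply(fs *flag.FlagSet, apply, force *bool) bool {
	forceSet := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "force" {
			forceSet = true
		}
	})
	if forceSet {
		fmt.Println("-force is deprecated, please use -apply instead")
	}
	return *apply || (forceSet && *force)
}
```

A command registers both -apply and the deprecated -force on its FlagSet, parses, and then calls a helper like this before deciding whether to stay in dry-run mode.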
---------
Co-authored-by: chrislu <chris.lu@gmail.com>