* Add multi-partition-spec compaction and delete-aware compaction (Phase 3)
Multi-partition-spec compaction:
- Add SpecID to compactionBin struct and group by spec+partition key
- Remove the len(specIDs) > 1 skip that blocked spec-evolved tables
- Write per-spec manifests in compaction commit using specByID map
- Use per-bin PartitionSpec when calling NewDataFileBuilder
Delete-aware compaction:
- Add ApplyDeletes config (default: true) with readBoolConfig helper
- Implement position delete collection (file_path + pos Parquet columns)
- Implement equality delete collection (field ID to column mapping)
- Update mergeParquetFiles to filter rows via position deletes (binary
search) and equality deletes (hash set lookup)
- Smart delete manifest carry-forward: drop when all data files compacted
- Fix EXISTING/DELETED entries to include sequence numbers
Tests for multi-spec bins, delete collection, merge filtering, and
end-to-end compaction with position/equality/mixed deletes.
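The position-delete lookup described above can be sketched as follows (illustrative only; the real helper operates on positions collected from the file_path/pos Parquet columns, but the binary-search membership test is the same idea):

```go
package main

import (
	"fmt"
	"sort"
)

// isPositionDeleted reports whether pos appears in the sorted slice of
// deleted row positions for one data file. Position deletes are sorted
// by position, so binary search suffices.
func isPositionDeleted(deleted []int64, pos int64) bool {
	i := sort.Search(len(deleted), func(i int) bool { return deleted[i] >= pos })
	return i < len(deleted) && deleted[i] == pos
}

func main() {
	deleted := []int64{2, 5, 9}
	for pos := int64(0); pos < 10; pos++ {
		if !isPositionDeleted(deleted, pos) {
			fmt.Println("keep row", pos)
		}
	}
}
```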
* Add structured metrics and per-bin progress to iceberg maintenance
- Change return type of all four operations from (string, error) to
(string, map[string]int64, error) with structured metric counts
(files_merged, snapshots_expired, orphans_removed, duration_ms, etc.)
- Add onProgress callback to compactDataFiles for per-bin progress
- In Execute, pass progress callback that sends JobProgressUpdate with
per-bin stage messages
- Accumulate per-operation metrics with dot-prefixed keys
(e.g. compact.files_merged) into OutputValues on completion
- Update testing_api.go wrappers and integration test call sites
- Add tests: TestCompactDataFilesMetrics, TestExpireSnapshotsMetrics,
TestExecuteCompletionOutputValues
* Address review feedback: group equality deletes by field IDs, use metric constants
- Group equality deletes by distinct equality_ids sets so different
delete files with different equality columns are handled correctly
- Use length-prefixed type-aware encoding in buildEqualityKey to avoid
ambiguity between types and collisions from null bytes
- Extract metric key strings into package-level constants
* Fix buildEqualityKey to use length-prefixed type-aware encoding
The previous implementation used plain String() concatenation with null
byte separators, which caused type ambiguity (int 123 vs string "123")
and separator collisions when values contain null bytes. Now each value
is serialized as "kind:length:value" for unambiguous composite keys.
This fix was missed in the prior cherry-pick due to a merge conflict.
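The encoding can be sketched as below (a simplified illustration; the real buildEqualityKey covers more Iceberg value types):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildEqualityKey serializes each value as "kind:length:value" so that
// differently typed values (int64 123 vs string "123") and values
// containing null bytes can never produce colliding composite keys.
func buildEqualityKey(values []any) string {
	var b strings.Builder
	for _, v := range values {
		var kind, s string
		switch x := v.(type) {
		case nil:
			kind, s = "n", ""
		case int64:
			kind, s = "i", strconv.FormatInt(x, 10)
		case string:
			kind, s = "s", x
		default:
			kind, s = "o", fmt.Sprint(x)
		}
		b.WriteString(kind)
		b.WriteByte(':')
		b.WriteString(strconv.Itoa(len(s)))
		b.WriteByte(':')
		b.WriteString(s)
	}
	return b.String()
}

func main() {
	fmt.Println(buildEqualityKey([]any{int64(123)})) // i:3:123
	fmt.Println(buildEqualityKey([]any{"123"}))      // s:3:123
}
```

The length prefix is what makes concatenation unambiguous: no suffix of one value can be mistaken for the start of the next.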
* Address nitpick review comments
- Document patchManifestContentToDeletes workaround: explain that
iceberg-go WriteManifest cannot create delete manifests, and note
the fail-fast validation on pattern match
- Document makeTestEntries: note that specID field is ignored and
callers should use makeTestEntriesWithSpec for multi-spec testing
* fmt
* Fix path normalization, manifest threshold, and artifact filename collisions
- Normalize file paths in position delete collection and lookup so that
absolute S3 URLs and relative paths match correctly
- Fix rewriteManifests threshold check to count only data manifests
(was including delete manifests in the count and metric)
- Add random suffix to artifact filenames in compactDataFiles and
rewriteManifests to prevent collisions between concurrent runs
- Sort compaction bins by SpecID then PartitionKey for deterministic
ordering across specs
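The normalization in the first bullet can be sketched roughly like this (hypothetical helper; assumes s3:// and s3a:// schemes and reduces both forms to the bucket-relative key):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePath reduces absolute object-store URLs and relative paths
// to a common comparable form so position-delete file_path values match
// data file paths regardless of how they were written.
func normalizePath(p string) string {
	for _, scheme := range []string{"s3a://", "s3://"} {
		if strings.HasPrefix(p, scheme) {
			rest := p[len(scheme):]
			if i := strings.IndexByte(rest, '/'); i >= 0 {
				return rest[i+1:] // drop "scheme://bucket/"
			}
			return rest
		}
	}
	return strings.TrimPrefix(p, "/")
}

func main() {
	fmt.Println(normalizePath("s3://warehouse/db/t/data/f1.parquet"))
	fmt.Println(normalizePath("db/t/data/f1.parquet"))
}
```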
* Fix pos delete read, deduplicate column resolution, minor cleanups
- Remove broken Column() guard in position delete reading that silently
defaulted pos to 0; unconditionally extract Int64() instead
- Deduplicate column resolution in readEqualityDeleteFile by calling
resolveEqualityColIndices instead of inlining the same logic
- Add warning log in readBoolConfig for unrecognized string values
- Fix CompactDataFiles call site in integration test to capture 3 return
values
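The readBoolConfig behavior with the new warning can be sketched as (illustrative; the real helper reads Iceberg table properties):

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// readBoolConfig looks up a boolean property, falling back to a default
// and warning on unrecognized strings rather than silently misreading them.
func readBoolConfig(props map[string]string, key string, def bool) bool {
	v, ok := props[key]
	if !ok {
		return def
	}
	switch strings.ToLower(strings.TrimSpace(v)) {
	case "true", "1", "yes":
		return true
	case "false", "0", "no":
		return false
	default:
		log.Printf("unrecognized boolean value %q for %s, using default %v", v, key, def)
		return def
	}
}

func main() {
	props := map[string]string{"apply_deletes": "TRUE", "bad": "maybe"}
	fmt.Println(readBoolConfig(props, "apply_deletes", false)) // true
	fmt.Println(readBoolConfig(props, "bad", true))            // true (default, with warning)
	fmt.Println(readBoolConfig(props, "missing", false))       // false
}
```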
* Advance progress on all bins, deterministic manifest order, assert metrics
- Call onProgress for every bin iteration including skipped/failed bins
so progress reporting never appears stalled
- Sort spec IDs before iterating specEntriesMap to produce deterministic
manifest list ordering across runs
- Assert expected metric keys in CompactDataFiles integration test
---------
Co-authored-by: Copilot <copilot@github.com>
When S3StorageClass is empty (the default), aws.String("") was passed
as the StorageClass in PutObject requests. While AWS S3 treats this as
"use default," S3-compatible providers (e.g. SharkTech) reject it with
InvalidStorageClass. Only set StorageClass when a non-empty value is
configured, letting the provider use its default.
Fixes #8644
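The conditional-set pattern looks roughly like this (putObjectInput is a stand-in struct mirroring only the relevant SDK field, not the real AWS type):

```go
package main

import "fmt"

// putObjectInput mirrors the relevant field of the AWS SDK's
// PutObjectInput for illustration only.
type putObjectInput struct {
	StorageClass *string
}

// applyStorageClass sets StorageClass only when a non-empty value is
// configured; leaving the field nil lets the provider apply its own
// default, which some S3-compatible providers require.
func applyStorageClass(in *putObjectInput, configured string) {
	if configured != "" {
		in.StorageClass = &configured
	}
}

func main() {
	var a, b putObjectInput
	applyStorageClass(&a, "")
	applyStorageClass(&b, "STANDARD_IA")
	fmt.Println(a.StorageClass == nil) // true: provider default used
	fmt.Println(*b.StorageClass)       // STANDARD_IA
}
```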
Change iceberg target_file_size config from bytes to MB
Rename the config field from target_file_size_bytes to
target_file_size_mb with a default of 256 (MB). The value is
converted to bytes internally. This makes the config more
user-friendly — entering 256 is clearer than 268435456.
Co-authored-by: Copilot <copilot@github.com>
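The internal conversion is a one-liner; a sketch with an assumed helper name (targetFileSizeBytes is illustrative):

```go
package main

import "fmt"

const defaultTargetFileSizeMB = 256

// targetFileSizeBytes converts the user-facing MB setting to the byte
// count used internally, falling back to the 256 MB default.
func targetFileSizeBytes(mb int64) int64 {
	if mb <= 0 {
		mb = defaultTargetFileSizeMB
	}
	return mb * 1024 * 1024
}

func main() {
	fmt.Println(targetFileSizeBytes(256)) // 268435456
}
```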
* Add iceberg_maintenance plugin worker handler (Phase 1)
Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add data file compaction to iceberg maintenance handler (Phase 2)
Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata
Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix memory exhaustion in mergeParquetFiles by processing files sequentially
Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.
* Validate bucket/namespace/table names against path traversal
Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.
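The check can be sketched as (validateName is an illustrative name):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateName rejects names that could escape the intended directory:
// no "..", "/", or "\" are allowed in bucket/namespace/table names.
func validateName(name string) error {
	if name == "" || strings.Contains(name, "..") ||
		strings.ContainsAny(name, `/\`) {
		return errors.New("invalid name: " + name)
	}
	return nil
}

func main() {
	for _, n := range []string{"mytable", "../etc", `a\b`, "ns/tbl"} {
		fmt.Println(n, "->", validateName(n))
	}
}
```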
* Add filer address failover in iceberg maintenance handler
Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.
* Add separate MinManifestsToRewrite config for manifest rewrite threshold
The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.
* Fix risky mtime fallback in orphan removal that could delete new files
When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.
* Fix undefined function references in iceberg_maintenance_handler.go
Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.
* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage
The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.
Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.
* Fix MinManifestsToRewrite clamping to match UI minimum of 2
The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.
* Sort entries by size descending in splitOversizedBin for better packing
Entries were processed in insertion order, which is non-deterministic
from map iteration. Sorting largest-first before the splitting loop
improves bin packing efficiency by filling bins more evenly.
* Add context cancellation check to drainReader loop
The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.
* Fix splitOversizedBin to always respect targetSize limit
The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.
* Add integration tests for iceberg table maintenance plugin worker
Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.
Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.
* Fix splitOversizedBin dropping files and add source path to drainReader errors
The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.
Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.
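The final splitting behavior can be sketched as below (simplified: real entries carry file metadata, not bare sizes, and the merge heuristics may differ in detail; note that folding an undersized bin into a neighbor can exceed targetSize, trading size discipline for not dropping files):

```go
package main

import (
	"fmt"
	"sort"
)

func binSize(b []int64) int64 {
	var n int64
	for _, s := range b {
		n += s
	}
	return n
}

// splitBySize packs sizes into bins that never exceed targetSize,
// largest-first, then folds any bin with fewer than minFiles entries
// into its smaller neighbor so no eligible file is dropped.
func splitBySize(sizes []int64, targetSize int64, minFiles int) [][]int64 {
	sorted := append([]int64(nil), sizes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })

	var bins [][]int64
	var cur []int64
	var curSize int64
	for _, s := range sorted {
		if len(cur) > 0 && curSize+s > targetSize {
			bins = append(bins, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, s)
		curSize += s
	}
	if len(cur) > 0 {
		bins = append(bins, cur)
	}

	// Fold undersized bins into the smaller adjacent bin instead of
	// filtering them out.
	for i := 0; len(bins) > 1 && i < len(bins); {
		if len(bins[i]) >= minFiles {
			i++
			continue
		}
		j := i - 1
		if i+1 < len(bins) && (j < 0 || binSize(bins[i+1]) < binSize(bins[j])) {
			j = i + 1
		}
		bins[j] = append(bins[j], bins[i]...)
		bins = append(bins[:i], bins[i+1:]...)
		i = 0 // re-scan after merging
	}
	return bins
}

func main() {
	fmt.Println(splitBySize([]int64{80, 80, 10, 10}, 100, 2)) // all four files preserved
}
```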
* Harden integration test error handling
- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
CreateEntryResponse.Error in addition to the RPC error
* go fmt
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The mini command previously hardcoded a list of specific job types
(vacuum, volume_balance, erasure_coding, admin_script). Use the "all"
category instead so that newly registered handlers are automatically
picked up without requiring changes to the mini command.
* feat(ec_balance): add TaskTypeECBalance constant and protobuf definitions
Add the ec_balance task type constant to both topology and worker type
systems. Define EcBalanceTaskParams, EcShardMoveSpec, and
EcBalanceTaskConfig protobuf messages for EC shard balance operations.
* feat(ec_balance): add configuration for EC shard balance task
Config includes imbalance threshold, min server count, collection
filter, disk type, and preferred tags for tag-aware placement.
* feat(ec_balance): add multi-phase EC shard balance detection algorithm
Implements four detection phases adapted from the ec.balance shell
command:
1. Duplicate shard detection and removal proposals
2. Cross-rack shard distribution balancing
3. Within-rack node-level shard balancing
4. Global shard count equalization across nodes
Detection is side-effect-free: it builds an EC topology view from
ActiveTopology and generates move proposals without executing them.
* feat(ec_balance): add EC shard move task execution
Implements the shard move sequence using the same VolumeEcShardsCopy,
VolumeEcShardsMount, VolumeEcShardsUnmount, and VolumeEcShardsDelete
RPCs as the shell ec.balance command. Supports both regular shard
moves and dedup-phase deletions (unmount+delete without copy).
* feat(ec_balance): add task registration and scheduling
Register EC balance task definition with auto-config update support.
Scheduling respects max concurrent limits and worker capabilities.
* feat(ec_balance): add plugin handler for EC shard balance
Implements the full plugin handler with detection, execution, admin
and worker config forms, proposal building, and decision trace
reporting. Supports collection/DC/disk type filtering, preferred tag
placement, and configurable detection intervals. Auto-registered via
init() with the handler registry.
* test(ec_balance): add tests for detection algorithm and plugin handler
Detection tests cover: duplicate shard detection, cross-rack imbalance,
within-rack imbalance, global rebalancing, topology building, collection
filtering, and edge cases. Handler tests cover: config derivation with
clamping, proposal building, protobuf encode/decode round-trip, fallback
parameter decoding, capability, and config policy round-trip.
* fix(ec_balance): address PR review feedback and fix CI test failure
- Update TestWorkerDefaultJobTypes to expect 6 handlers (was 5)
- Extract threshold constants (ecBalanceMinImbalanceThreshold, etc.)
to eliminate magic numbers in Descriptor and config derivation
- Remove duplicate ShardIdsToUint32 helper (use erasure_coding package)
- Add bounds checks for int64→int/uint32 conversions to fix CodeQL
integer conversion warnings
* fix(ec_balance): address code review findings
storage_impact.go:
- Add TaskTypeECBalance case returning shard-level reservation
(ShardSlots: -1/+1) instead of falling through to default which
incorrectly reserves a full volume slot on target.
detection.go:
- Use dc:rack composite key to avoid cross-DC rack name collisions.
Only create rack entries after confirming node has matching disks.
- Add exceedsImbalanceThreshold check to cross-rack, within-rack,
and global phases so trivial skews below the configured threshold
are ignored. Dedup phase always runs since duplicates are errors.
- Reserve destination capacity after each planned move (decrement
destNode.freeSlots, update rackShardCount/nodeShardCount) to
prevent overbooking the same destination.
- Skip nodes with freeSlots <= 0 when selecting minNode in global
balance to avoid proposing moves to full nodes.
- Include loop index and source/target node IDs in TaskID to
guarantee uniqueness across moves with the same volumeID/shardID.
ec_balance_handler.go:
- Fail fast with error when shard_id is absent in fallback parameter
decoding instead of silently defaulting to shard 0.
ec_balance_task.go:
- Delegate GetProgress() to BaseTask.GetProgress() so progress
updates from ReportProgressWithStage are visible to callers.
- Add fail-fast guard rejecting multiple sources/targets until
batch execution is implemented.
Findings verified but not changed (matches existing codebase pattern
in vacuum/balance/erasure_coding handlers):
- register.go globalTaskDef.Config race: same unsynchronized pattern
in all 4 task packages.
- CreateTask using generated ID: same fmt.Sprintf pattern in all 4
task packages.
* fix(ec_balance): harden parameter decoding, progress tracking, and validation
ec_balance_handler.go (decodeECBalanceTaskParams):
- Validate execution-critical fields (Sources[0].Node, ShardIds,
Targets[0].Node, ShardIds) after protobuf deserialization.
- Require source_disk_id and target_disk_id in legacy fallback path
so Targets[0].DiskId is populated for VolumeEcShardsCopyRequest.
- All error messages reference decodeECBalanceTaskParams and the
specific missing field (TaskParams, shard_id, Targets[0].DiskId,
EcBalanceTaskParams) for debuggability.
ec_balance_task.go:
- Track progress in ECBalanceTask.progress field, updated via
reportProgress() helper called before ReportProgressWithStage(),
so GetProgress() returns real stage progress instead of stale 0.
- Validate: require exactly 1 source and 1 target (mirrors Execute
guard), require ShardIds on both, with error messages referencing
ECBalanceTask.Validate and the specific field.
* fix(ec_balance): fix dedup execution path, stale topology, collection filter, timeout, and dedupeKey
detection.go:
- Dedup moves now set target=source so isDedupPhase() triggers the
unmount+delete-only execution path instead of attempting a copy.
- Apply moves to in-memory topology between phases via
applyMovesToTopology() so subsequent phases see updated shard
placement and don't conflict with already-planned moves.
- detectGlobalImbalance now accepts allowedVids and filters both
shard counting and shard selection to respect CollectionFilter.
ec_balance_task.go:
- Apply EcBalanceTaskParams.TimeoutSeconds to the context via
context.WithTimeout so all RPC operations respect the configured
timeout instead of hanging indefinitely.
ec_balance_handler.go:
- Include source node ID in dedupeKey so dedup deletions from
different source nodes for the same shard aren't collapsed.
- Clamp minServerCountRaw and minIntervalRaw lower bounds on int64
before narrowing to int, preventing undefined overflow on 32-bit.
* fix(ec_balance): log warning before cancelling on progress send failure
Log the error, job ID, job type, progress percentage, and stage
before calling execCancel() in the progress callback so failed
progress sends are diagnosable instead of silently cancelling.
* fix(ec): gather shards from all disk locations before rebuild (#8631)
Fix "too few shards given" error during ec.rebuild on multi-disk volume
servers. The root cause has two parts:
1. VolumeEcShardsRebuild only looked at a single disk location for shard
files. On multi-disk servers, the existing local shards could be on one
disk while copied shards were placed on another, causing the rebuild to
see fewer shards than actually available.
2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 &&
len(vs.store.Locations) > 0) that was always true, making the
FindFreeLocation fallback dead code. This meant copies always went to
Locations[0] regardless of where existing shards were.
Changes:
- VolumeEcShardsRebuild now finds the location with the most shards,
then gathers shard files from other locations via hard links (or
symlinks for cross-device) before rebuilding. Gathered files are
cleaned up after rebuild.
- VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0
(explicitly set). Otherwise, it prefers the location that already has
the EC volume, falling back to HDD then any free location.
- generateMissingEcFiles now logs shard counts and provides a clear
error message when not enough shards are found, instead of passing
through to the opaque reedsolomon "too few shards given" error.
* fix(ec): update test to match skip behavior for unrepairable volumes
The test expected an error for volumes with insufficient shards, but
commit 5acb4578a changed unrepairable volumes to be skipped with a log
message instead of returning an error. Update the test to verify the
skip behavior and log output.
* fix(ec): address PR review comments
- Add comment clarifying DiskId=0 means "not specified" (protobuf default),
callers must use DiskId >= 1 to target a specific disk.
- Log warnings on cleanup failures for gathered shard links.
* fix(ec): read shard files from other disks directly instead of linking
Replace the hard link / symlink gathering approach with passing
additional search directories into RebuildEcFiles. The rebuild
function now opens shard files directly from whichever disk they
live on, avoiding filesystem link operations and cleanup.
RebuildEcFiles and RebuildEcFilesWithContext gain a variadic
additionalDirs parameter (backward compatible with existing callers).
* fix(ec): clarify DiskId selection semantics in VolumeEcShardsCopy comment
* fix(ec): avoid empty files on failed rebuild; don't skip ecx-only locations
- generateMissingEcFiles: two-pass approach — first discover present/missing
shards and check reconstructability, only then create output files. This
avoids leaving behind empty truncated shard files when there are too few
shards to rebuild.
- VolumeEcShardsRebuild: compute hasEcx before skipping zero-shard locations.
A location with an .ecx file but no shard files (all shards on other disks)
is now a valid rebuild candidate instead of being silently skipped.
* fix(ec): select ecx-only location as rebuildLocation when none chosen yet
When rebuildLocation is nil and a location has hasEcx=true but
existingShardCount=0 (all shards on other disks), the condition
0 > 0 was false so it was never promoted to rebuildLocation.
Add rebuildLocation == nil to the predicate so the first location
with an .ecx file is always selected as a candidate.
* Fix ec.rebuild failing on unrepairable volumes instead of skipping them
When an EC volume has fewer shards than DataShardsCount, ec.rebuild would
return an error and abort the entire operation. Now it logs a warning and
continues rebuilding the remaining volumes.
Fixes #8630
* Remove duplicate volume ID in unrepairable log message
---------
Co-authored-by: Copilot <copilot@github.com>
* feat(vacuum): add volume state, location, and enhanced collection filters
Align the vacuum handler's admin config with the balance handler by adding:
- volume_state filter (ALL/ACTIVE/FULL) to scope vacuum to writable or
read-only volumes
- data_center_filter, rack_filter, node_filter to scope vacuum to
specific infrastructure locations
- Enhanced collection_filter description matching the balance handler's
ALL_COLLECTIONS/EACH_COLLECTION/regex modes
The new filters reuse filterMetricsByVolumeState() and
filterMetricsByLocation() already defined in the same package.
* use wildcard matchers for DC/rack/node filters
Replace exact-match and CSV set lookups with wildcard matching
from util/wildcard package. Patterns like "dc*", "rack-1?", or
"node-a*" are now supported in all location filter fields for
both balance and vacuum handlers.
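The matching semantics can be sketched with the standard library's path.Match, which supports the same "*" and "?" metacharacters (the handlers themselves use the util/wildcard package):

```go
package main

import (
	"fmt"
	"path"
)

// matchesFilter reports whether name matches any of the wildcard
// patterns; an empty filter matches everything.
func matchesFilter(patterns []string, name string) bool {
	if len(patterns) == 0 {
		return true
	}
	for _, p := range patterns {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchesFilter([]string{"dc*"}, "dc1"))         // true
	fmt.Println(matchesFilter([]string{"rack-1?"}, "rack-12")) // true
	fmt.Println(matchesFilter([]string{"node-a*"}, "node-b1")) // false
}
```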
* add nil guard in filterMetricsByLocation
* feat(plugin): make page tabs and sub-tabs addressable by URLs
Update the plugin page so that clicking tabs and sub-tabs pushes
browser history via history.pushState(), enabling bookmarkable URLs,
browser back/forward navigation, and shareable links.
URL mapping:
- /plugin → Overview tab
- /plugin/configuration → Configuration sub-tab
- /plugin/detection → Job Detection sub-tab
- /plugin/queue → Job Queue sub-tab
- /plugin/execution → Job Execution sub-tab
Job-type-specific URLs use the ?job= query parameter (e.g.,
/plugin/configuration?job=vacuum) so that a specific job type tab
is pre-selected on page load.
Changes:
- Add initialJob parameter to Plugin() template and handler
- Extract ?job= query param in renderPluginPage handler
- Add buildPluginURL/updateURL helpers in JavaScript
- Push history state on top-tab, sub-tab, and job-type clicks
- Listen for popstate to restore tab state on back/forward
- Replace initial history entry on page load via replaceState
* make popstate handler async with proper error handling
Await loadDescriptorAndConfig so data loading completes before
rendering dependent views. Log errors instead of silently
swallowing them.
* feat: auto-disable master vacuum when plugin vacuum worker is active
When a vacuum-capable plugin worker connects to the admin server, the
admin server calls DisableVacuum on the master to prevent the automatic
scheduled vacuum from conflicting with the plugin worker's vacuum. When
the worker disconnects, EnableVacuum is called to restore the default
behavior. A safety net in the topology refresh loop re-enables vacuum
if the admin server disconnects without cleanup.
* rename isAdminServerConnected to isAdminServerConnectedFunc
* add 5s timeout to DisableVacuum/EnableVacuum gRPC calls
Prevents the monitor goroutine from blocking indefinitely if the
master is unresponsive.
* track plugin ownership of vacuum disable to avoid overriding operator
- Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum
is called while admin server is connected (i.e., by plugin monitor)
- Safety net only re-enables vacuum when it was disabled by plugin,
not when an operator intentionally disabled it via shell command
- EnableVacuum clears the plugin flag
* extract syncVacuumState for testability, add fake toggler tests
Extract the single sync step into syncVacuumState() with a
vacuumToggler interface. Add TestSyncVacuumState with a fake
toggler that verifies disable/enable calls on state transitions.
* use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin
Both fields are written by gRPC handlers and read by the vacuum
goroutine, causing a data race. Use atomic.Bool with Store/Load
for thread-safe access.
* use explicit by_plugin field instead of connection heuristic
Add by_plugin bool to DisableVacuumRequest proto so the caller
declares intent explicitly. The admin server monitor sets it to
true; shell commands leave it false. This prevents an operator's
intentional disable from being auto-reversed by the safety net.
* use setter for admin server callback instead of function parameter
Move isAdminServerConnected from StartRefreshWritableVolumes
parameter to Topology.SetAdminServerConnectedFunc() setter.
Keeps the function signature stable and decouples the topology
layer from the admin server concept.
* suppress repeated log messages on persistent sync failures
Add retrying parameter to syncVacuumState so the initial
state transition is logged at V(0) but subsequent retries
of the same transition are silent until the call succeeds.
* clear plugin ownership flag on manual DisableVacuum
Prevents stale plugin flag from causing incorrect auto-enable
when an operator manually disables vacuum after a plugin had
previously disabled it.
* add by_plugin to EnableVacuumRequest for symmetric ownership tracking
Plugin-driven EnableVacuum now only re-enables if the plugin was
the one that disabled it. If an operator manually disabled vacuum
after the plugin, the plugin's EnableVacuum is a no-op. This
prevents the plugin monitor from overriding operator intent on
worker disconnect.
* use cancellable context for monitorVacuumWorker goroutine
Replace context.Background() with a cancellable context stored
as bgCancel on AdminServer. Shutdown() calls bgCancel() so
monitorVacuumWorker exits cleanly via ctx.Done().
* track operator and plugin vacuum disables independently
Replace single isDisableVacuum flag with two independent flags:
vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller
only flips its own flag. The effective disabled state is the OR
of both. This prevents a plugin connect/disconnect cycle from
overriding an operator's manual disable, and vice versa.
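The two-flag design with atomic access can be sketched as (simplified from the Topology fields described above):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// vacuumState holds independent operator and plugin disable flags.
// gRPC handlers Store while the vacuum goroutine Loads, so atomic.Bool
// avoids a data race without a mutex; the effective state is the OR.
type vacuumState struct {
	disabledByOperator atomic.Bool
	disabledByPlugin   atomic.Bool
}

func (s *vacuumState) disabled() bool {
	return s.disabledByOperator.Load() || s.disabledByPlugin.Load()
}

func main() {
	var s vacuumState
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); s.disabledByPlugin.Store(true) }() // "gRPC handler"
	go func() { defer wg.Done(); _ = s.disabled() }()               // "vacuum loop"
	wg.Wait()
	fmt.Println(s.disabled()) // true
}
```

Because each caller flips only its own flag, a plugin connect/disconnect cycle cannot clear an operator's manual disable, and vice versa.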
* fix safety net to clear plugin flag, not operator flag
The safety net should call EnableVacuumByPlugin() to clear only
the plugin disable flag when the admin server disconnects. The
previous call to EnableVacuum() incorrectly cleared the operator
flag instead.
* feat(balance): add replica placement validation for volume moves
When the volume balance detection proposes moving a volume, validate
that the move does not violate the volume's replication policy (e.g.,
ReplicaPlacement=010 requires replicas on different racks). If the
preferred destination violates the policy, fall back to score-based
planning; if that also violates, skip the volume entirely.
- Add ReplicaLocation type and VolumeReplicaMap to ClusterInfo
- Build replica map from all volumes before collection filtering
- Port placement validation logic from command_volume_fix_replication.go
- Thread replica map through collectVolumeMetrics call chain
- Add IsGoodMove check in createBalanceTask before destination use
* address PR review: extract validation closure, add defensive checks
- Extract validateMove closure to eliminate duplicated ReplicaLocation
construction and IsGoodMove calls
- Add defensive check for empty replica map entries (len(replicas) == 0)
- Add bounds check for int-to-byte cast on ExpectedReplicas (0-255)
* address nitpick: rp test helper accepts *testing.T and fails on error
Prevents silent failures from typos in replica placement codes.
* address review: add composite replica placement tests (011, 110)
Test multi-constraint placement policies where both rack and DC
rules must be satisfied simultaneously.
* address review: use struct keys instead of string concatenation
Replace string-concatenated map keys with typed rackKey/nodeKey
structs to eliminate allocations and avoid ambiguity if IDs
contain spaces.
* address review: simplify bounds check, log fallback error, guard source
- Remove unreachable ExpectedReplicas < 0 branch (outer condition
already guarantees > 0), fold bounds check into single condition
- Log error from planBalanceDestination in replica validation fallback
- Return false from IsGoodMove when sourceNodeID not found in
existing replicas (inconsistent cluster state)
* address review: use slices.Contains instead of hand-rolled helpers
Replace isAmongDC and isAmongRack with slices.Contains from the
standard library, reducing boilerplate.
* feat(plugin): add DC/rack/node filtering for volume balance detection
Add scoping filters so balance detection can be limited to specific data
centers, racks, or nodes. Filters are applied both at the metrics level
(in the handler) and at the topology seeding level (in detection) to
ensure only the targeted infrastructure participates in balancing.
* address PR review: use set lookups, deduplicate test helpers, add target checks
* address review: assert non-empty tasks in filter tests
Prevent vacuous test passes by requiring len(tasks) > 0
before checking source/target exclusions.
* address review: enforce filter scope in fallback, clarify DC filter
- Thread allowedServers into createBalanceTask so the fallback
planner cannot produce out-of-scope targets when DC/rack/node
filters are active
- Update data_center_filter description to clarify single-DC usage
* address review: centralize parseCSVSet, fix filter scope leak, iterate all targets
- Extract ParseCSVSet to shared weed/worker/tasks/util package,
remove duplicates from detection.go and volume_balance_handler.go
- Fix metric accumulation re-introducing filtered-out servers by
only counting metrics for servers that passed DC/rack/node filters
- Trim DataCenterFilter before matching to handle trailing spaces
- Iterate all task.TypedParams.Targets in filter tests, not just [0]
* remove useless descriptor string test
* feat(plugin): enhanced collection filtering for volume balance
Replace wildcard matching with three collection filter modes:
- ALL_COLLECTIONS (default): treat all volumes as one pool
- EACH_COLLECTION: run detection separately per collection
- Regex pattern: filter volumes by matching collection names
The EACH_COLLECTION mode extracts distinct collections from metrics
and calls Detection() per collection, sharing the maxResults budget
and clusterInfo (with ActiveTopology) across all calls.
* address PR review: fix wildcard→regexp replacement, optimize EACH_COLLECTION
* address nitpick: fail fast on config errors (invalid regex)
Add configError type so invalid collection_filter regex returns
immediately instead of retrying across all masters with the same
bad config. Transient errors still retry.
* address review: constants, unbounded maxResults, wildcard compat
- Define collectionFilterAll/collectionFilterEach constants to
eliminate magic strings across handler and metrics code
- Fix EACH_COLLECTION budget loop to treat maxResults <= 0 as
unbounded, matching Detection's existing semantics
- Treat "*" as ALL_COLLECTIONS for backward compat with wildcard
* address review: nil guard in EACH_COLLECTION grouping loop
* remove useless descriptor string test
* feat(balance): add volume state filter (ALL/ACTIVE/FULL)
Add a volume_state admin config field to the plugin worker volume balance
handler, matching the shell's -volumeBy flag. This allows filtering volumes
by state before balance detection:
- ALL (default): consider all volumes
- ACTIVE: only writable volumes below the size limit (FullnessRatio < 1.01)
- FULL: only read-only volumes above the size limit (FullnessRatio >= 1.01)
The 1.01 threshold mirrors the shell's thresholdVolumeSize constant.
* address PR review: use enum/select widget, switch-based filter, nil safety
- Change volume_state field from string/text to enum/select with
dropdown options (ALL, ACTIVE, FULL)
- Refactor filterMetricsByVolumeState to use switch with predicate
function for clearer extensibility
- Add nil-check guard to prevent panic on nil metric elements
- Add TestFilterMetricsByVolumeState_NilElement regression test
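The switch-with-predicate shape described above can be sketched as follows (field names are illustrative stand-ins for the real metric type; the 1.01 threshold mirrors the shell's behavior):

```go
package main

import "fmt"

// Illustrative metric type, not SeaweedFS's actual one.
type volumeMetric struct {
	FullnessRatio float64
	ReadOnly      bool
}

const fullnessThreshold = 1.01

// filterByVolumeState keeps metrics matching the configured state.
// Unknown states fall back to ALL.
func filterByVolumeState(state string, metrics []*volumeMetric) []*volumeMetric {
	var keep func(*volumeMetric) bool
	switch state {
	case "ACTIVE":
		keep = func(m *volumeMetric) bool { return !m.ReadOnly && m.FullnessRatio < fullnessThreshold }
	case "FULL":
		keep = func(m *volumeMetric) bool { return m.ReadOnly && m.FullnessRatio >= fullnessThreshold }
	default: // "ALL"
		keep = func(m *volumeMetric) bool { return true }
	}
	var out []*volumeMetric
	for _, m := range metrics {
		if m == nil { // nil guard: skip rather than panic
			continue
		}
		if keep(m) {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	ms := []*volumeMetric{{0.5, false}, {1.2, true}, nil}
	fmt.Println(len(filterByVolumeState("ACTIVE", ms))) // 1
	fmt.Println(len(filterByVolumeState("FULL", ms)))   // 1
	fmt.Println(len(filterByVolumeState("ALL", ms)))    // 2
}
```

Adding a new state is then a one-case change: supply a new predicate without touching the loop.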
* feat(filer): add lazy directory listing for remote mounts
Directory listings on remote mounts previously only queried the local
filer store. With lazy mounts the listing was empty; with eager mounts
it went stale over time.
Add on-demand directory listing that fetches from remote and caches
results with a 5-minute TTL:
- Add `ListDirectory` to `RemoteStorageClient` interface (delimiter-based,
single-level listing, separate from recursive `Traverse`)
- Implement in S3, GCS, and Azure backends using each platform's
hierarchical listing API
- Add `maybeLazyListFromRemote` to filer: before each directory listing,
check if the directory is under a remote mount with an expired cache,
fetch from remote, persist entries to the local store, then let existing
listing logic run on the populated store
- Use singleflight to deduplicate concurrent requests for the same directory
- Skip local-only entries (no RemoteEntry) to avoid overwriting unsynced uploads
- Errors are logged and swallowed (availability over consistency)
* refactor: extract xattr key to constant xattrRemoteListingSyncedAt
* feat: make listing cache TTL configurable per mount via listing_cache_ttl_seconds
Add listing_cache_ttl_seconds field to RemoteStorageLocation protobuf.
When 0 (default), lazy directory listing is disabled for that mount.
When >0, enables on-demand directory listing with the specified TTL.
Expose as -listingCacheTTL flag on remote.mount command.
* refactor: address review feedback for lazy directory listing
- Add context.Context to ListDirectory interface and all implementations
- Capture startTime before remote call for accurate TTL tracking
- Simplify S3 ListDirectory using ListObjectsV2PagesWithContext
- Make maybeLazyListFromRemote return void (errors always swallowed)
- Remove redundant trailing-slash path manipulation in caller
- Update tests to match new signatures
* fix(filer): merge remote metadata into existing entries with Remote != nil rather than replacing them
* fix(gcs): wrap ListDirectory iterator error with context
The raw iterator error was returned without bucket/path context,
making it harder to debug. Wrap it consistently with the S3 pattern.
* fix(s3): guard against nil pointer dereference in Traverse and ListDirectory
Some S3-compatible backends may return nil for LastModified, Size, or
ETag fields. Check for nil before dereferencing to prevent panics.
* fix(filer): remove blanket 2-minute timeout from lazy listing context
Individual SDK operations (S3, GCS, Azure) already have per-request
timeouts and retry policies. The blanket timeout could cut off large
directory listings mid-operation even though individual pages were
succeeding.
* fix(filer): preserve trace context in lazy listing with WithoutCancel
Use context.WithoutCancel(ctx) instead of context.Background() so
trace/span values from the incoming request are retained for
distributed tracing, while still decoupling cancellation.
* fix(filer): use Store.FindEntry for internal lookups, add Uid/Gid to files, fix updateDirectoryListingSyncedAt
- Use f.Store.FindEntry instead of f.FindEntry for staleness check and
child lookups to avoid unnecessary lazy-fetch overhead
- Set OS_UID/OS_GID on new file entries for consistency with directories
- In updateDirectoryListingSyncedAt, use Store.UpdateEntry for existing
directories instead of CreateEntry to avoid deleteChunksIfNotNew and
NotifyUpdateEvent side effects
* fix(filer): distinguish not-found from store errors in lazy listing
Previously, any error from Store.FindEntry was treated as "not found,"
which could cause entry recreation/overwrite on transient DB failures.
Now check for filer_pb.ErrNotFound explicitly and skip entries or
bail out on real store errors.
* refactor(filer): use errors.Is for ErrNotFound comparisons
* fix(helm): trim whitespace before s3 TLS args to prevent command breakage (#8613)
When global.enableSecurity is enabled, the `{{ include }}` call for
s3 TLS args lacked the leading dash (`{{-`), producing an extra blank
line in the rendered shell command. This broke shell continuation and
caused the filer (and s3/all-in-one) to crash because arguments after
the blank line were silently dropped.
* ci(helm): assert no blank lines in security+S3 command blocks
Renders the chart with global.enableSecurity=true and S3 enabled for
normal mode (filer + s3 deployments) and all-in-one mode, then parses
every /bin/sh -ec command block and fails if any contains blank lines.
This catches the whitespace regression from #8613 where a missing {{-
dash on the seaweedfs.s3.tlsArgs include produced a blank line that
broke shell continuation.
* ci(helm): enable S3 in all-in-one security render test
The s3.tlsArgs include is gated by allInOne.s3.enabled, so without
this flag the all-in-one command block wasn't actually exercising the
TLS args path.
* feat(remote): add -noSync flag to skip upfront metadata pull on mount
Made-with: Cursor
* refactor(remote): split mount setup from metadata sync
Extract ensureMountDirectory for create/validate; call pullMetadata
directly when sync is needed. Caller controls sync step for -noSync.
Made-with: Cursor
* fix(remote): validate mount root when -noSync so bad bucket/creds fail fast
When -noSync is used, perform a cheap remote check (ListBuckets and
verify bucket exists) instead of skipping all remote I/O. Invalid
buckets or credentials now fail at mount time.
Made-with: Cursor
* test(remote): add TestRemoteMountNoSync for -noSync mount and persisted mapping
Made-with: Cursor
* test(remote): assert no upfront metadata after -noSync mount
After remote.mount -noSync, run fs.ls on the mount dir and assert empty
listing so the test fails if pullMetadata was invoked eagerly.
Made-with: Cursor
* fix(remote): propagate non-ErrNotFound lookup errors in ensureMountDirectory
Return lookupErr immediately for any LookupDirectoryEntry failure that
is not filer_pb.ErrNotFound, so only the not-found case creates the
entry and other lookup failures are reported to the caller.
Made-with: Cursor
* fix(remote): use errors.Is for ErrNotFound in ensureMountDirectory
Replace fragile strings.Contains(lookupErr.Error(), ...) with
errors.Is(lookupErr, filer_pb.ErrNotFound) before calling CreateEntry.
Made-with: Cursor
* fix(remote): use LookupEntry so ErrNotFound is recognised after gRPC
Raw gRPC LookupDirectoryEntry returns a status error, not the sentinel,
so errors.Is(lookupErr, filer_pb.ErrNotFound) was always false. Use
filer_pb.LookupEntry which normalises not-found to ErrNotFound so the
mount directory is created when missing.
Made-with: Cursor
* test(remote): ignore weed shell banner in TestRemoteMountNoSync fs.ls count
Exclude master/filer and prompt lines from entry count so the assertion
checks only actual fs.ls output for empty -noSync mount.
Made-with: Cursor
* fix(remote.mount): use 0755 for mount dir, document bucket-less early return
Made-with: Cursor
* feat(remote.mount): replace -noSync with -metadataStrategy=lazy|eager
- Add -metadataStrategy flag (eager default, lazy skips upfront metadata pull)
- Accept lazy/eager case-insensitively; reject invalid values with clear error
- Rename TestRemoteMountNoSync to TestRemoteMountMetadataStrategyLazy
- Add TestRemoteMountMetadataStrategyEager and TestRemoteMountMetadataStrategyInvalid
Made-with: Cursor
* fix(remote.mount): validate strategy and remote before creating mount directory
Move strategy validation and validateMountRoot (lazy path) before
ensureMountDirectory so that invalid strategies or bad bucket/credentials
fail without leaving orphaned directory entries in the filer.
* refactor(remote.mount): remove unused remote param from ensureMountDirectory
The remote *RemoteStorageLocation parameter was left over from the old
syncMetadata signature. Only remoteConf.Name is used inside the function.
* doc(remote.mount): add TODO for HeadBucket-style validation
validateMountRoot currently lists all buckets to verify one exists.
Note the need for a targeted BucketExists method in the interface.
* refactor(remote.mount): use MetadataStrategy type and constants
Replace raw string comparisons with a MetadataStrategy type and
MetadataStrategyEager/MetadataStrategyLazy constants for clarity
and compile-time safety.
* refactor(remote.mount): rename MetadataStrategy to MetadataCacheStrategy
More precisely describes the purpose: controlling how metadata is
cached from the remote, not metadata handling in general.
* fix(remote.mount): remove validateMountRoot from lazy path
Lazy mount's purpose is to skip remote I/O. Validating via ListBuckets
contradicts that, especially on accounts with many buckets. Invalid
buckets or credentials will surface on first lazy access instead.
* fix(test): handle shell exit 0 in TestRemoteMountMetadataStrategyInvalid
The weed shell process exits with code 0 even when individual commands
fail — errors appear in stdout. Check output instead of requiring a
non-nil error.
* test(remote.mount): remove metadataStrategy shell integration tests
These tests only verify string output from a shell process that always
exits 0 — they cannot meaningfully validate eager vs lazy behavior
without a real remote backend.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* filer: propagate lazy metadata deletes to remote mounts
Delete operations now call the remote backend for mounted remote-only entries before removing filer metadata, keeping remote state aligned and preserving retry semantics on remote failures.
Made-with: Cursor
* filer: harden remote delete metadata recovery
Persist remote-delete metadata pendings so local entry removal can be retried after failures, and return explicit errors when remote client resolution fails to prevent silent local-only deletes.
Made-with: Cursor
* filer: streamline remote delete client lookup and logging
Avoid a redundant mount trie traversal by resolving the remote client directly from the matched mount location, and add parity logging for successful remote directory deletions.
Made-with: Cursor
* filer: harden pending remote metadata deletion flow
Retry pending-marker writes before local delete, fail closed when marking cannot be persisted, and start remote pending reconciliation only after the filer store is initialised to avoid nil store access.
Made-with: Cursor
* filer: avoid lazy fetch in pending metadata reconciliation
Use a local-only entry lookup during pending remote metadata reconciliation so cache misses do not trigger remote lazy fetches.
Made-with: Cursor
* filer: serialise concurrent index read-modify-write in pending metadata deletion
Add remoteMetadataDeletionIndexMu to Filer and acquire it for the full
read→mutate→commit sequence in markRemoteMetadataDeletionPending and
clearRemoteMetadataDeletionPending, preventing concurrent goroutines
from overwriting each other's index updates.
Made-with: Cursor
* filer: start remote deletion reconciliation loop in NewFiler
Move the background goroutine for pending remote metadata deletion
reconciliation from SetStore (where it was gated by sync.Once) to
NewFiler alongside the existing loopProcessingDeletion goroutine.
The sync.Once approach was problematic: it buried a goroutine launch
as a side effect of a setter, was unrecoverable if the goroutine
panicked, could race with store initialisation, and coupled its
lifecycle to unrelated shutdown machinery. The existing nil-store
guard in reconcilePendingRemoteMetadataDeletions handles the window
before SetStore is called.
* filer: skip remote delete for replicated deletes from other filers
When isFromOtherCluster is true the delete was already propagated to
the remote backend by the originating filer. Repeating the remote
delete on every replica doubles API calls, and a transient remote
failure on the replica would block local metadata cleanup — leaving
filers inconsistent.
* filer: skip pending marking for directory remote deletes
Directory remote deletes are idempotent and do not need the
pending/reconcile machinery that was designed for file deletes where
the local metadata delete might fail after the remote object is
already removed.
* filer: propagate remote deletes for children in recursive folder deletion
doBatchDeleteFolderMetaAndData iterated child files but only called
NotifyUpdateEvent and collected chunks — it never called
maybeDeleteFromRemote for individual children. This left orphaned
objects in the remote backend when a directory containing remote-only
files was recursively deleted.
Also fix isFromOtherCluster being hardcoded to false in the recursive
call to doBatchDeleteFolderMetaAndData for subdirectories.
* filer: simplify pending remote deletion tracking to single index key
Replace the double-bookkeeping scheme (individual KV entry per path +
newline-delimited index key) with a single index key that stores paths
directly. This removes the per-path KV writes/deletes, the base64
encoding round-trip, and the transaction overhead that was only needed
to keep the two representations in sync.
* filer: address review feedback on remote deletion flow
- Distinguish missing remote config from client initialization failure
in maybeDeleteFromRemote error messages.
- Use a detached context (30s timeout) for pending-mark and
pending-clear KV writes so they survive request cancellation after
the remote object has already been deleted.
- Emit NotifyUpdateEvent in reconcilePendingRemoteMetadataDeletions
after a successful retry deletion so downstream watchers and replicas
learn about the eventual metadata removal.
* filer: remove background reconciliation for pending remote deletions
The pending-mark/reconciliation machinery (KV index, mutex, background
loop, detached contexts) handled the narrow case where the remote
object was deleted but the subsequent local metadata delete failed.
The client already receives the error and can retry — on retry the
remote not-found is treated as success and the local delete proceeds
normally. The added complexity (and new edge cases around
NotifyUpdateEvent, multi-filer consistency during reconciliation, and
context lifetime) is not justified for a transient store failure the
caller already handles.
Remove: loopProcessingRemoteMetadataDeletionPending,
reconcilePendingRemoteMetadataDeletions, markRemoteMetadataDeletionPending,
clearRemoteMetadataDeletionPending, listPendingRemoteMetadataDeletionPaths,
encodePendingRemoteMetadataDeletionIndex, FindEntryLocal, and all
associated constants, fields, and test infrastructure.
* filer: fix test stubs and add early exit on child remote delete error
- Refactor stubFilerStore to release lock before invoking callbacks and
propagate callback errors, preventing potential deadlocks in tests
- Implement ListDirectoryPrefixedEntries with proper prefix filtering
instead of delegating to the unfiltered ListDirectoryEntries
- Add continue after setting err on child remote delete failure in
doBatchDeleteFolderMetaAndData to skip further processing of the
failed entry
* filer: propagate child remote delete error instead of silently continuing
Replace `continue` with early `break` when maybeDeleteFromRemote fails
for a child entry during recursive folder deletion. The previous
`continue` skipped the error check at the end of the loop body, so a
subsequent successful entry would overwrite err and the remote delete
error was silently lost. Now the loop breaks, the existing error check
returns the error, and NotifyUpdateEvent / chunk collection are
correctly skipped for the failed entry.
* filer: delete remote file when entry has Remote pointer, not only when remote-only
Replace IsInRemoteOnly() guard with entry.Remote == nil check in
maybeDeleteFromRemote. IsInRemoteOnly() requires zero local chunks and
RemoteSize > 0, which incorrectly skips remote deletion for cached
files (local chunks exist) and zero-byte remote objects (RemoteSize 0).
The correct condition is whether the entry has a remote backing object
at all.
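The two conditions can be contrasted with a minimal illustrative entry type (field names mirror the description, not the real protobuf types):

```go
package main

import "fmt"

type remoteEntry struct{ RemoteSize int64 }

type entry struct {
	Chunks []int // local chunk IDs
	Remote *remoteEntry
}

// isInRemoteOnly mirrors the old, too-strict guard: zero local chunks
// AND a positive remote size.
func (e *entry) isInRemoteOnly() bool {
	return len(e.Chunks) == 0 && e.Remote != nil && e.Remote.RemoteSize > 0
}

// shouldDeleteFromRemote is the corrected condition: any remote backing
// object must be deleted, even for cached files or zero-byte objects.
func (e *entry) shouldDeleteFromRemote() bool {
	return e.Remote != nil
}

func main() {
	cached := &entry{Chunks: []int{1}, Remote: &remoteEntry{RemoteSize: 10}}
	zero := &entry{Remote: &remoteEntry{RemoteSize: 0}}
	fmt.Println(cached.isInRemoteOnly(), cached.shouldDeleteFromRemote()) // false true
	fmt.Println(zero.isInRemoteOnly(), zero.shouldDeleteFromRemote())     // false true
}
```

Both the cached file and the zero-byte object fail the old guard yet clearly have remote objects that must be removed.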
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* fix(helm): use componentName for all service names to fix truncation mismatch (#8610)
PR #8143 updated statefulsets and deployments to use the componentName
helper (which truncates the fullname before appending the suffix), but
left service definitions using the old `printf + trunc 63` pattern.
When release names are long enough, these two strategies produce
different names, causing DNS resolution failures (e.g., S3 cannot
find the filer-client service and falls back to localhost:8888).
Unify all service name definitions and cluster address helpers to use
the componentName helper consistently.
* refactor(helm): simplify cluster address helpers with ternary
* test(helm): add regression test for service name truncation with long release names
Renders the chart with a >63-char fullname in both normal and all-in-one
modes, then asserts that Service metadata.name values match the hostnames
produced by cluster.masterAddress, cluster.filerAddress, and the S3
deployment's -filer= argument. Prevents future truncation/DNS mismatch
regressions like #8610.
* fix(helm-ci): limit S3_FILER_HOST extraction to first match
* fix(filer): limit concurrent proxy reads per volume server
Add a per-volume-server semaphore (default 16) to proxyToVolumeServer
to prevent replication bursts from overwhelming individual volume
servers with hundreds of concurrent connections, which causes them
to drop connections with "unexpected EOF".
Excess requests queue up and respect the client's context, returning
503 if the client disconnects while waiting.
Also log io.CopyBuffer errors that were previously silently discarded.
* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* fix(filer): use non-blocking release for proxy semaphore
Prevents a goroutine from blocking forever if releaseProxySemaphore
is ever called without a matching acquire.
* test(filer): clean up proxySemaphores entries in all proxy tests
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* fix(replication): resume partial chunk reads on EOF instead of re-downloading
When replicating chunks and the source connection drops mid-transfer,
accumulate the bytes already received and retry with a Range header
to fetch only the remaining bytes. This avoids re-downloading
potentially large chunks from scratch on each retry, reducing load
on busy source servers and speeding up recovery.
* test(replication): add tests for downloadWithRange including gzip partial reads
Tests cover:
- No offset (no Range header sent)
- With offset (Range header verified)
- Content-Disposition filename extraction
- Partial read + resume: server drops connection mid-transfer, client
resumes with Range from the offset of received bytes
- Gzip partial read + resume: first response is gzip-encoded (Go auto-
decompresses), connection drops, resume request gets decompressed data
(Go doesn't add Accept-Encoding when Range is set, so the server
decompresses), combined bytes match original
* fix(replication): address PR review comments
- Consolidate downloadWithRange into DownloadFile with optional offset
parameter (variadic), eliminating code duplication (DRY)
- Validate HTTP response status: require 206 + correct Content-Range
when offset > 0, reject when server ignores Range header
- Use if/else for fullData assignment for clarity
- Add test for rejected Range (server returns 200 instead of 206)
* refactor(replication): remove unused ReplicationSource interface
The interface was never referenced and its signature didn't match
the actual FilerSource.ReadPart method.
---------
Co-authored-by: Copilot <copilot@github.com>