Key improvements:
- Fix concurrent map write panic in partition type cache
- Fix data races in yieldDataFiles and key map getter
- Fix response body leaks in REST catalog
- Fix index out of range in buildManifestEvaluator
- Table Metadata V3 support
- Schema evolution API
- Partitioned write throughput optimizations
- Gzipped metadata read/write support
* glog: add JSON structured logging mode
Add opt-in JSON output format for glog, enabling integration with
log aggregation systems like ELK, Loki, and Datadog.
- Add --log_json flag to enable JSON output at startup
- Add SetJSONMode()/IsJSONMode() for runtime toggle
- Add JSON branches in println, printDepth, printf, printWithFileLine
- Use manual JSON construction (no encoding/json) for performance
- Add jsonEscapeString() for safe string escaping
- Include 8 unit tests and 1 benchmark
Enabled via --log_json flag. Default behavior unchanged.
* glog: prevent lazy flag init from overriding SetJSONMode
If --log_json=true and SetJSONMode(false) was called at runtime,
a subsequent IsJSONMode() call would re-enable JSON mode via the
sync.Once lazy initialization. Mark jsonFlagOnce as done inside
SetJSONMode so the runtime API always takes precedence.
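The precedence fix above can be sketched as follows. This is a minimal illustration of the pattern, not the actual glog code; the variable names and the pretend flag value are assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu           sync.Mutex
	jsonFlagOnce sync.Once
	jsonMode     bool
)

// SetJSONMode burns the sync.Once so a later IsJSONMode call cannot
// re-apply the flag value over the runtime setting.
func SetJSONMode(enabled bool) {
	mu.Lock()
	defer mu.Unlock()
	jsonFlagOnce.Do(func() {}) // mark lazy flag init as already done
	jsonMode = enabled
}

func IsJSONMode() bool {
	jsonFlagOnce.Do(func() {
		// lazy init from the --log_json flag would happen here;
		// pretend --log_json=true was passed
		jsonMode = true
	})
	mu.Lock()
	defer mu.Unlock()
	return jsonMode
}

func main() {
	SetJSONMode(false) // runtime override before first IsJSONMode
	fmt.Println(IsJSONMode())
}
```

Without the `jsonFlagOnce.Do(func() {})` in SetJSONMode, the first IsJSONMode call would run the lazy init and flip jsonMode back to true.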
* glog: fix RuneError check to not misclassify valid U+FFFD
The condition r == utf8.RuneError matches both invalid UTF-8
sequences (size=1) and a valid U+FFFD replacement character
(size=3). Without checking size == 1, a valid U+FFFD input
would be incorrectly escaped and only advance by 1 byte,
corrupting the output.
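The distinction rests on `utf8.DecodeRuneInString` returning `(RuneError, 1)` for invalid bytes but `(RuneError, 3)` for a correctly encoded U+FFFD. A minimal sketch of the corrected check (escape format assumed for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// escapeInvalid replaces only genuinely invalid UTF-8 bytes.
// A valid encoded U+FFFD decodes as (RuneError, 3) and must pass through.
func escapeInvalid(s string) string {
	var out []byte
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			// invalid byte: escape it and advance by exactly one byte
			out = append(out, `\x`...)
			out = append(out, fmt.Sprintf("%02x", s[i])...)
		} else {
			out = append(out, s[i:i+size]...)
		}
		i += size
	}
	return string(out)
}

func main() {
	fmt.Println(escapeInvalid("a\xffb"))   // invalid byte is escaped
	fmt.Println(escapeInvalid("a\uFFFDb")) // valid U+FFFD is preserved
}
```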
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* glog: add gzip compression for rotated log files
Add opt-in gzip compression that automatically compresses log files
after rotation, reducing disk usage in long-running deployments.
- Add --log_compress flag to enable compression at startup
- Add SetCompressRotated()/IsCompressRotated() for runtime toggle
- Compress rotated files in background goroutine (non-blocking)
- Use gzip.BestSpeed for minimal CPU overhead
- Fix .gz file cleanup: TrimSuffix approach correctly counts
compressed files toward MaxFileCount limit
- Include 6 unit tests covering normal, empty, large, and edge cases
Enabled via --log_compress flag. Default behavior unchanged.
* glog: fix compressFile to check gz/dst close errors and use atomic rename
Write to a temp file (.gz.tmp) and rename atomically to prevent
exposing partial archives. Check gz.Close() and dst.Close() errors
to avoid deleting the original log when flush fails (e.g. ENOSPC).
Use defer for robust resource cleanup.
* glog: deduplicate .log/.log.gz pairs in rotation cleanup
During concurrent compression, both foo.log and foo.log.gz can exist
simultaneously. Count them as one entry against MaxFileCount to prevent
premature eviction of rotated logs.
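The accounting can be sketched as a TrimSuffix-based dedup (helper name assumed):

```go
package main

import (
	"fmt"
	"strings"
)

// dedupeRotated counts "foo.log" and "foo.log.gz" as one logical entry
// when enforcing MaxFileCount, since both can exist briefly while the
// background goroutine is still compressing.
func dedupeRotated(names []string) int {
	seen := map[string]bool{}
	for _, n := range names {
		seen[strings.TrimSuffix(n, ".gz")] = true
	}
	return len(seen)
}

func main() {
	fmt.Println(dedupeRotated([]string{"a.log", "a.log.gz", "b.log"}))
}
```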
* glog: use portable temp path in TestCompressFile_NonExistent
Replace hardcoded /nonexistent/path with t.TempDir() for portability.
---------
Co-authored-by: Copilot <copilot@github.com>
* fix(remote_gateway): prevent double-versioning when syncing to versioned central bucket
When a file is uploaded to a versioned bucket on edge, SeaweedFS stores
it internally as {object}.versions/v_{versionId}. The remote_gateway was
syncing this internal path directly to the central S3 endpoint. When
central's bucket also has versioning enabled, this caused central to
apply its own versioning on top, producing corrupt paths like:
object.versions/v_{edgeId}.versions/v_{centralId}
Fix: rewrite internal .versions/v_{id} paths to the original S3 object
key before uploading to the remote. Skip version file delete/update
events that are internal bookkeeping.
Fixes https://github.com/seaweedfs/seaweedfs/discussions/8481#discussioncomment-16209342
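The path rewrite can be sketched like this; the function name matches the one mentioned in later commits, but the body here is an illustration built from the description, not the actual code:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// rewriteVersionedSourcePath maps an internal versioned-file path like
// "/buckets/b/dir/file.xml.versions/v_abc123" back to the logical S3
// object path "/buckets/b/dir/file.xml". Non-versioned paths pass through.
func rewriteVersionedSourcePath(p string) (string, bool) {
	dir, name := path.Dir(p), path.Base(p)
	if !strings.HasSuffix(dir, ".versions") || !strings.HasPrefix(name, "v_") {
		return p, false
	}
	return strings.TrimSuffix(dir, ".versions"), true
}

func main() {
	fmt.Println(rewriteVersionedSourcePath("/buckets/b/file.xml.versions/v_abc"))
	fmt.Println(rewriteVersionedSourcePath("/buckets/b/plain.txt"))
}
```

Syncing the rewritten logical key means the central endpoint applies its own versioning exactly once, instead of stacking `v_{centralId}` on top of `v_{edgeId}`.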
* fix(remote_gateway): propagate delete markers to remote as deletions
Delete markers are zero-content version entries (ExtDeleteMarkerKey=true)
created by S3 DELETE on a versioned bucket. Previously they were silently
dropped by the HasData() filter, so deletions on edge never reached
central.
Now: detect delete markers before the HasData check, rewrite the
.versions path to the original S3 key, and issue client.DeleteFile()
on the remote.
* fix(remote_gateway): tighten isVersionedPath to avoid false positives
Address PR review feedback:
- Add isDir parameter to isVersionedPath so it only matches the exact
internal shapes: directories whose name ends with .versions (isDir=true),
and files with the v_ prefix inside a .versions parent (isDir=false).
Previously the function was too broad and could match user-created paths
like "my.versions/data.txt".
- Update all 4 call sites to pass the entry's IsDirectory field.
- Rename TestVersionedDirectoryNotFilteredByHasData to
TestVersionsDirectoryFilteredByHasData so the name reflects the
actual assertion (directories ARE filtered by HasData).
- Expand TestIsVersionedPath with isDir cases and false-positive checks.
* fix(remote_gateway): persist sync marker after delete-marker propagation
The delete-marker branch was calling client.DeleteFile() and returning
without updating the local entry, making event replay re-issue the
remote delete. Now call updateLocalEntry after a successful DeleteFile
to stamp the delete-marker entry with a RemoteEntry, matching the
pattern used by the normal create path.
* refactor(remote_gateway): extract syncDeleteMarker and fix root path edge case
- Extract syncDeleteMarker() shared helper used by both bucketed and
mounted-dir event processors, replacing the duplicated delete + persist
local marker logic.
- Fix rewriteVersionedSourcePath for root-level objects: when lastSlash
is 0 (e.g. "/file.xml.versions"), return "/" as the parent dir instead
of an empty string.
- The strings.Contains(dir, ".versions/") condition flagged in review was
already removed in a prior commit that tightened isVersionedPath.
* fix(remote_gateway): skip updateLocalEntry for versioned path rewrites
After rewriting a .versions/v_{id} path to the logical S3 key and
uploading, the code was calling updateLocalEntry on the original v_*
entry, stamping it with a RemoteEntry for the logical key. This is
semantically wrong: the logical object has no filer entry in versioned
buckets, and the internal v_* entry should not carry a RemoteEntry for
a different path.
Skip updateLocalEntry when the path was rewritten from a versioned
source. Replay safety is preserved because S3 PutObject is idempotent.
* fix(remote_gateway): scope versioning checks to /buckets/ namespace
isVersionedPath and rewriteVersionedSourcePath could wrongly match
paths in non-bucket mounts (e.g. /mnt/remote/file.xml.versions).
Add the same /buckets/ prefix guard used by isMultipartUploadDir so
the .versions / v_ logic only applies within the bucket namespace.
Remove the block that prevented deleting the "anonymous" identity
and stop auto-creating it when absent. If no anonymous identity
exists (or it is disabled), LookupAnonymous returns not-found and
both auth paths return ErrAccessDenied for anonymous requests.
To enable anonymous access, explicitly create the "anonymous" user.
To revoke it, delete the user like any other identity.
Closes #8694
* fix(s3): include directory markers in ListObjects without delimiter (#8698)
Directory key objects (zero-byte objects with keys ending in "/") created
via PutObject were omitted from ListObjects/ListObjectsV2 results when no
delimiter was specified. AWS S3 includes these as regular keys in Contents.
The issue was in doListFilerEntries: when recursing into directories in
non-delimiter mode, directory key objects were only emitted when
prefixEndsOnDelimiter was true. Added an else branch to emit them in the
general recursive case as well.
* remove issue reference from inline comment
* test: add child-under-marker and paginated listing coverage
Extend test 6 to place a child object under the directory marker
and paginate with MaxKeys=1 so the emit-then-recurse truncation
path is exercised.
* fix(test): skip directory markers in Spark temporary artifacts check
The listing check now correctly shows directory markers (keys ending
in "/") after the ListObjects fix. These 0-byte metadata objects are
not data artifacts — filter them from the listing check since the
HeadObject-based check already verifies their cleanup with a timeout.
* fix(helm): namespace app-specific values under global.seaweedfs
Move all app-specific values from the global namespace to
global.seaweedfs.* to avoid polluting the shared .Values.global
namespace when the chart is used as a subchart.
Standard Helm conventions (global.imageRegistry, global.imagePullSecrets)
remain at the global level as they are designed to be shared across
subcharts.
Fixes seaweedfs/seaweedfs#8699
BREAKING CHANGE: global values have been restructured. Users must update
their values files to use the new paths:
- global.registry → global.imageRegistry
- global.repository → global.seaweedfs.image.repository
- global.imageName → global.seaweedfs.image.name
- global.<key> → global.seaweedfs.<key> (for all other app-specific values)
* fix(ci): update helm CI tests to use new global.seaweedfs.* value paths
Update all --set flags in helm_ci.yml to use the new namespaced
global.seaweedfs.* paths matching the values.yaml restructuring.
* fix(ci): install Claude Code via npm to avoid install.sh 403
The claude-code-action's built-in installer uses
`curl https://claude.ai/install.sh | bash` which can fail with 403.
Due to the pipe, bash exits 0 on empty input, masking the curl failure
and leaving the `claude` binary missing.
Work around this by installing Claude Code via npm before invoking the
action, and passing the executable path via path_to_claude_code_executable.
* revert: remove claude-code-review.yml changes from this PR
The claude-code-action OIDC token exchange validates that the workflow
file matches the version on the default branch. Modifying it in a PR
causes the review job to fail with "Workflow validation failed".
The Claude Code install fix will need to be applied directly to master
or in a separate PR.
* fix: update stale references to old global.* value paths
- admin-statefulset.yaml: fix fail message to reference
global.seaweedfs.masterServer
- values.yaml: fix comment to reference image.name instead of imageName
- helm_ci.yml: fix diagnostic message to reference
global.seaweedfs.enableSecurity
* feat(helm): add backward-compat shim for old global.* value paths
Add _compat.tpl with a seaweedfs.compat helper that detects old-style
global.* keys (e.g. global.enableSecurity, global.registry) and merges
them into the new global.seaweedfs.* namespace.
Since the old keys no longer have defaults in values.yaml, their
presence means the user explicitly provided them. The helper uses
in-place mutation via `set` so all templates see the merged values.
This ensures existing deployments using old value paths continue to
work without changes after upgrading.
* fix: update stale comment references in values.yaml
Update comments referencing global.enableSecurity and global.masterServer
to the new global.seaweedfs.* paths.
---------
Co-authored-by: Copilot <copilot@github.com>
* fix: clear raft vote state file on non-resume startup
The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.
Fixes #8690
* fix: recover TopologyId from snapshot before clearing raft state
When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.
Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.
* fix: stagger multi-master bootstrap delay by peer index
Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.
Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.
* fix: skip unreachable masters during leader reconnection
When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.
Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.
* fix: take snapshot after TopologyId generation for recovery
After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.
* test: add multi-master raft failover integration tests
Integration test framework and 5 test scenarios for 3-node master
clusters:
- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
all nodes agree on TopologyId
* fix: address PR review comments
- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them
* fix: address review comments on failover tests
- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
* fix(s3): return ETag header for directory marker PutObject requests
The PutObject handler has a special path for keys ending with "/" (directory
markers) that creates the entry via mkdir. This path never computed or set
the ETag response header, unlike the regular PutObject path. AWS S3 always
returns an ETag header, even for empty-body puts.
Compute the MD5 of the content (empty or otherwise), store it in the entry
attributes and extended attributes, and set the ETag response header.
Fixes #8682
* fix: handle io.ReadAll error and chunked encoding for directory markers
Address review feedback:
- Handle error from io.ReadAll instead of silently discarding it
- Change condition from ContentLength > 0 to ContentLength != 0 to
correctly handle chunked transfer encoding (ContentLength == -1)
* fix hanging tests
* feat: improve allInOne mode support for admin/volume ingress and fix master UI links
- Add allInOne support to admin ingress template, matching the pattern
used by filer and s3 ingress templates (or-based enablement with
ternary service name selection)
- Add allInOne support to volume ingress template, which previously
required volume.enabled even when the volume server runs within the
allInOne pod
- Expose admin ports in allInOne deployment and service when
allInOne.admin.enabled is set
- Add allInOne.admin config section to values.yaml (enabled by default,
ports inherit from admin.*)
- Fix legacy master UI templates (master.html, masterNewRaft.html) to
prefer PublicUrl over internal Url when linking to volume server UI.
The new admin UI already handles this correctly.
* fix: revert admin allInOne changes and fix PublicUrl in admin dashboard
The admin binary (`weed admin`) is a separate process that cannot run
inside `weed server` (allInOne mode). Revert the admin-related allInOne
helm chart changes that caused 503 errors on admin ingress.
Fix bug in cluster_topology.go where VolumeServer.PublicURL was set to
node.Id (internal pod address) instead of the actual public URL. Add
public_url field to DataNodeInfo proto message so the topology gRPC
response carries the public URL set via -volume.publicUrl flag.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: use HTTP /dir/status to populate PublicUrl in admin dashboard
The gRPC DataNodeInfo proto does not include PublicUrl, so the admin dashboard showed internal pod IPs instead of the configured public URL.
Fetch PublicUrl from the master's /dir/status HTTP endpoint and apply it
in both GetClusterTopology and GetClusterVolumeServers code paths.
Also reverts the unnecessary proto field additions from the previous
commit and cleans up a stray blank line in all-in-one-service.yml.
* fix: apply PublicUrl link fix to masterNewRaft.html
Match the same conditional logic already applied to master.html —
prefer PublicUrl when set and different from Url.
* fix: add HTTP timeout and status check to fetchPublicUrlMap
Use a 5s-timeout client instead of http.DefaultClient to prevent
blocking indefinitely when the master is unresponsive. Also check
the HTTP status code before attempting to parse the response body.
* fix: fall back to node address when PublicUrl is empty
Prevents blank links in the admin dashboard when PublicUrl is not
configured, such as in standalone or mixed-version clusters.
* fix: log io.ReadAll error in fetchPublicUrlMap
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* glog: add --log_rotate_hours flag for time-based log rotation
SeaweedFS previously only rotated log files when they reached MaxSize
(1.8 GB). Long-running deployments with low log volume could accumulate
log files indefinitely with no way to force rotation on a schedule.
This change adds the --log_rotate_hours flag. When set to a non-zero
value, the current log file is rotated once it has been open for the
specified number of hours, regardless of its size.
Implementation details:
- New flag --log_rotate_hours (int, default 0 = disabled) in glog_file.go
- Added createdAt time.Time field to syncBuffer to track file open time
- rotateFile() sets createdAt to the time the new file is opened
- Write() checks elapsed time and triggers rotation when the threshold
is exceeded, consistent with the existing size-based check
This resolves the long-standing request for time-based rotation and
helps prevent unbounded log accumulation in /tmp on production systems.
Related: #3455, #5763, #8336
* glog: default log_rotate_hours to 168 (7 days)
Enable time-based rotation by default so log files don't accumulate
indefinitely in long-running deployments. Set to 0 to disable.
* glog: simplify rotation logic by combining size and time conditions
Merge the two separate rotation checks into a single block to
eliminate duplicated rotateFile error handling.
* glog: use timeNow() in syncBuffer.Write and add time-based rotation test
Use the existing testable timeNow variable instead of time.Now() in
syncBuffer.Write so that time-based rotation can be tested with a
mocked clock.
Add TestTimeBasedRollover that verifies:
- no rotation occurs before the interval elapses
- rotation triggers after the configured hours
---------
Co-authored-by: Copilot <copilot@github.com>
* fix(chart): missing resources on volume statefulset initContainer
* chore(chart): use own resources for idx-vol-move initContainer
* chore(chart): improve comment for idxMoveResources value
* fix(chart): all-in-one deployment maxVolumes value
* chore(chart): improve readability
* fix(chart): maxVolume nil value check
* fix(chart): guard against nil/empty volume.dataDirs before calling first
Without this check, `first` errors when volume.dataDirs is nil or empty,
causing a template render failure for users who omit the setting entirely.
---------
Co-authored-by: Copilot <copilot@github.com>
* Make weed-fuse compatible with systemd-mount series
* fix: add missing type annotation on skipAutofs param in FreeBSD build
The parameter was declared without a type, causing a compile error on FreeBSD.
* fix: guard hasAutofs nil dereference and make FsName conditional on autofs mode
- Check option.hasAutofs for nil before dereferencing to prevent panic
when RunMount is called without the flag initialized.
- Only set FsName to "fuse" when autofs mode is active; otherwise
preserve the descriptive server:path name for mount/df output.
- Fix typo: recogize -> recognize.
* fix: consistent error handling for autofs option and log ignored _netdev
- Replace panic with fmt.Fprintf+return false for autofs parse errors,
matching the pattern used by other fuse option parsers.
- Log when _netdev option is silently stripped to aid debugging.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* improve large file sync throughput for remote.cache and filer.sync
Three main throughput improvements:
1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file
instead of always starting at 5MB. A 500MB file now uses ~16MB chunks
(32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk
overhead (volume assign, gRPC call, needle write) by 3x.
2. Configurable concurrency at every layer:
- remote.cache chunk concurrency: -chunkConcurrency flag (default 8)
- remote.cache S3 download concurrency: -downloadConcurrency flag
(default raised from 1 to 5 per chunk)
- filer.sync chunk concurrency: -chunkConcurrency flag (default 32)
3. S3 multipart download concurrency raised from 1 to 5: the S3 manager
downloader was using Concurrency=1, serializing all part downloads
within each chunk. This alone can 5x per-chunk download speed.
The concurrency values flow through the gRPC request chain:
shell command → CacheRemoteObjectToLocalClusterRequest →
FetchAndWriteNeedleRequest → S3 downloader
Zero values in the request mean "use server defaults", maintaining
full backward compatibility with existing callers.
Ref #8481
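The adaptive sizing in point 1 can be sketched as follows; the 5MB floor, 32-chunk target, and 1000-chunk cap come from the description, while the exact clamping order is an assumption:

```go
package main

import "fmt"

const mb = int64(1024 * 1024)

// adaptiveChunkSize targets ~32 chunks per file, clamped between the
// 5MB floor and the configured max chunk size, then grown if needed to
// stay under 1000 chunks.
func adaptiveChunkSize(fileSize, maxChunkSize int64) int64 {
	chunkSize := fileSize / 32
	if chunkSize < 5*mb {
		chunkSize = 5 * mb
	}
	if chunkSize > maxChunkSize {
		chunkSize = maxChunkSize
	}
	// safety: direct ceil(fileSize/1000) instead of a doubling loop
	if minSize := (fileSize + 999) / 1000; chunkSize < minSize {
		chunkSize = minSize
	}
	return chunkSize
}

func main() {
	size := int64(500) * mb
	cs := adaptiveChunkSize(size, 64*mb)
	chunks := (size + cs - 1) / cs // ceiling division for the chunk count
	fmt.Println(chunks)            // ~32 chunks instead of 100 with fixed 5MB
}
```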
* fix: use full maxMB for chunk size cap and remove loop guard
Address review feedback:
- Use full maxMB instead of maxMB/2 for maxChunkSize to avoid
unnecessarily limiting chunk size for very large files.
- Remove chunkSize < maxChunkSize guard from the safety loop so it
can always grow past maxChunkSize when needed to stay under 1000
chunks (e.g., extremely large files with small maxMB).
* address review feedback: help text, validation, naming, docs
- Fix help text for -chunkConcurrency and -downloadConcurrency flags
to say "0 = server default" instead of advertising specific numeric
defaults that could drift from the server implementation.
- Validate chunkConcurrency and downloadConcurrency are within int32
range before narrowing, returning a user-facing error if out of range.
- Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions.
- Add doc comment to SetChunkConcurrency noting it must be called
during initialization before replication goroutines start.
- Replace doubling loop in chunk size safety check with direct
ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap.
* address Copilot review: clamp concurrency, fix chunk count, clarify proto docs
- Use ceiling division for chunk count check to avoid overcounting
when file size is an exact multiple of chunk size.
- Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024
at filer, max 64 at volume server) to prevent excessive goroutines.
- Always use ReadFileWithConcurrency when the client supports it,
falling back to the implementation's default when value is 0.
- Clarify proto comments that download_concurrency only applies when
the remote storage client supports it (currently S3).
- Include specific server defaults in help text (e.g., "0 = server
default 8") so users see the actual values in -h output.
* fix data race on executionErr and use %w for error wrapping
- Protect concurrent writes to executionErr in remote.cache worker
goroutines with a sync.Mutex to eliminate the data race.
- Use %w instead of %v in volume_grpc_remote.go error formatting
to preserve the error chain for errors.Is/errors.As callers.
When remote.cache downloads a file in parallel chunks and a gRPC
connection drops mid-transfer, chunks already written to volume servers
were not cleaned up. Since the filer metadata was never updated, these
needles became orphaned — invisible to volume.vacuum and never
referenced by the filer. On subsequent cache cycles the file was still
treated as uncached, creating more orphans each attempt.
Call DeleteUncommittedChunks on the download-error path, matching the
cleanup already present for the metadata-update-failure path.
Fixes #8481
* iceberg: add sort-aware compaction rewrite
* iceberg: share filtered row iteration in compaction
* iceberg: rely on table sort order for sort rewrites
* iceberg: harden sort compaction planning
* iceberg: include rewrite strategy in planning config hash
compactionPlanningConfigHash now incorporates RewriteStrategy and
SortMaxInputBytes so cached planning results are invalidated when
sort strategy settings change. Also use the bytesPerMB constant in
compactionNoEligibleMessage.
The anonymous identity was explicitly filtered out of the user listing,
making it invisible in the admin console. Users could not view or edit
its permissions. Attempting to recreate it failed with "already exists".
Remove the anonymous skip in GetObjectStoreUsers so it appears like any
other identity. Add a guard in DeleteObjectStoreUser to prevent deletion
of the anonymous system identity, which would break unauthenticated S3
access.
Fixes #8466
Co-authored-by: Copilot <copilot@github.com>
* fix(s3): omit NotResource:null from bucket policy JSON response (#8657)
Change NotResource from value type to pointer (*StringOrStringSlice) so
that omitempty properly omits it when unset, matching the existing
Principal field pattern. This prevents IaC tools (Terraform, Ansible)
from detecting false configuration drift.
Add bucket policy round-trip idempotency integration tests.
* simplify JSON comparison in bucket policy idempotency test
Use require.JSONEq directly on the raw JSON strings instead of
round-tripping through unmarshal/marshal, since JSONEq already
handles normalization internally.
* fix bucket policy test cases that locked out the admin user
The Deny+NotResource test cases used Action:"s3:*" which denied the
admin's own GetBucketPolicy call. Scope deny to s3:GetObject only,
and add an Allow+NotResource variant instead.
* fix(s3): also make Resource a pointer to fix empty string in JSON
Apply the same omitempty pointer fix to the Resource field, which was
emitting "Resource":"" when only NotResource was set. Add
NewStringOrStringSlicePtr helper, make Strings() nil-safe, and handle
*StringOrStringSlice in normalizeToStringSliceWithError.
* improve bucket policy integration tests per review feedback
- Replace time.Sleep with waitForClusterReady using ListBuckets
- Use structural hasKey check instead of brittle substring NotContains
- Assert specific NoSuchBucketPolicy error code after delete
- Handle single-statement policies in hasKey helper
* Add multi-partition-spec compaction and delete-aware compaction (Phase 3)
Multi-partition-spec compaction:
- Add SpecID to compactionBin struct and group by spec+partition key
- Remove the len(specIDs) > 1 skip that blocked spec-evolved tables
- Write per-spec manifests in compaction commit using specByID map
- Use per-bin PartitionSpec when calling NewDataFileBuilder
Delete-aware compaction:
- Add ApplyDeletes config (default: true) with readBoolConfig helper
- Implement position delete collection (file_path + pos Parquet columns)
- Implement equality delete collection (field ID to column mapping)
- Update mergeParquetFiles to filter rows via position deletes (binary
search) and equality deletes (hash set lookup)
- Smart delete manifest carry-forward: drop when all data files compacted
- Fix EXISTING/DELETED entries to include sequence numbers
Tests for multi-spec bins, delete collection, merge filtering, and
end-to-end compaction with position/equality/mixed deletes.
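The position-delete filtering step can be sketched as a binary search over per-file sorted positions. Function and index names here are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// isPositionDeleted reports whether row `pos` of `file` appears in the
// position-delete index, using binary search over the sorted positions
// collected from position delete files (file_path + pos columns).
func isPositionDeleted(deletes map[string][]int64, file string, pos int64) bool {
	ps := deletes[file] // positions must be kept sorted per file
	i := sort.Search(len(ps), func(i int) bool { return ps[i] >= pos })
	return i < len(ps) && ps[i] == pos
}

func main() {
	deletes := map[string][]int64{"s3://t/data/f1.parquet": {2, 7, 9}}
	fmt.Println(isPositionDeleted(deletes, "s3://t/data/f1.parquet", 7))
	fmt.Println(isPositionDeleted(deletes, "s3://t/data/f1.parquet", 3))
}
```

Equality deletes take the complementary path: a hash-set lookup keyed on the encoded equality columns rather than a positional search.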
* Add structured metrics and per-bin progress to iceberg maintenance
- Change return type of all four operations from (string, error) to
(string, map[string]int64, error) with structured metric counts
(files_merged, snapshots_expired, orphans_removed, duration_ms, etc.)
- Add onProgress callback to compactDataFiles for per-bin progress
- In Execute, pass progress callback that sends JobProgressUpdate with
per-bin stage messages
- Accumulate per-operation metrics with dot-prefixed keys
(e.g. compact.files_merged) into OutputValues on completion
- Update testing_api.go wrappers and integration test call sites
- Add tests: TestCompactDataFilesMetrics, TestExpireSnapshotsMetrics,
TestExecuteCompletionOutputValues
* Address review feedback: group equality deletes by field IDs, use metric constants
- Group equality deletes by distinct equality_ids sets so different
delete files with different equality columns are handled correctly
- Use length-prefixed type-aware encoding in buildEqualityKey to avoid
ambiguity between types and collisions from null bytes
- Extract metric key strings into package-level constants
* Fix buildEqualityKey to use length-prefixed type-aware encoding
The previous implementation used plain String() concatenation with null
byte separators, which caused type ambiguity (int 123 vs string "123")
and separator collisions when values contain null bytes. Now each value
is serialized as "kind:length:value" for unambiguous composite keys.
This fix was missed in the prior cherry-pick due to a merge conflict.
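The encoding can be sketched as below; the supported value kinds are narrowed for illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildEqualityKey serializes each value as "kind:length:value" so that
// int 123 and string "123" produce distinct keys, and null bytes inside
// values cannot collide with any separator.
func buildEqualityKey(values []interface{}) string {
	var b strings.Builder
	for _, v := range values {
		var kind, s string
		switch t := v.(type) {
		case int64:
			kind, s = "i", strconv.FormatInt(t, 10)
		case string:
			kind, s = "s", t
		default:
			kind, s = "x", fmt.Sprint(t)
		}
		fmt.Fprintf(&b, "%s:%d:%s", kind, len(s), s)
	}
	return b.String()
}

func main() {
	fmt.Println(buildEqualityKey([]interface{}{int64(123)}))
	fmt.Println(buildEqualityKey([]interface{}{"123"}) ==
		buildEqualityKey([]interface{}{int64(123)}))
}
```

The length prefix is what prevents separator collisions: a value containing `:` or a null byte can never be confused with a field boundary, because the parser knows exactly how many bytes belong to the value.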
* Address nitpick review comments
- Document patchManifestContentToDeletes workaround: explain that
iceberg-go WriteManifest cannot create delete manifests, and note
the fail-fast validation on pattern match
- Document makeTestEntries: note that specID field is ignored and
callers should use makeTestEntriesWithSpec for multi-spec testing
* fmt
* Fix path normalization, manifest threshold, and artifact filename collisions
- Normalize file paths in position delete collection and lookup so that
absolute S3 URLs and relative paths match correctly
- Fix rewriteManifests threshold check to count only data manifests
(was including delete manifests in the count and metric)
- Add random suffix to artifact filenames in compactDataFiles and
rewriteManifests to prevent collisions between concurrent runs
- Sort compaction bins by SpecID then PartitionKey for deterministic
ordering across specs
* Fix pos delete read, deduplicate column resolution, minor cleanups
- Remove broken Column() guard in position delete reading that silently
defaulted pos to 0; unconditionally extract Int64() instead
- Deduplicate column resolution in readEqualityDeleteFile by calling
resolveEqualityColIndices instead of inlining the same logic
- Add warning log in readBoolConfig for unrecognized string values
- Fix CompactDataFiles call site in integration test to capture 3 return
values
* Advance progress on all bins, deterministic manifest order, assert metrics
- Call onProgress for every bin iteration including skipped/failed bins
so progress reporting never appears stalled
- Sort spec IDs before iterating specEntriesMap to produce deterministic
manifest list ordering across runs
- Assert expected metric keys in CompactDataFiles integration test
---------
Co-authored-by: Copilot <copilot@github.com>
When S3StorageClass is empty (the default), aws.String("") was passed
as the StorageClass in PutObject requests. While AWS S3 treats this as
"use default," S3-compatible providers (e.g. SharkTech) reject it with
InvalidStorageClass. Only set StorageClass when a non-empty value is
configured, letting the provider use its default.
Fixes #8644
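The conditional-set logic reduces to leaving the field nil when no class is configured. To avoid misstating the AWS SDK types, this sketch uses a stand-in struct with the same shape:

```go
package main

import "fmt"

// putObjectInput mimics the relevant shape of an S3 SDK PutObject request.
type putObjectInput struct {
	StorageClass *string // nil means "provider default"
}

// newPutInput only sets StorageClass when a non-empty value is configured,
// so strict S3-compatible endpoints never see an empty string.
func newPutInput(configured string) putObjectInput {
	in := putObjectInput{}
	if configured != "" {
		in.StorageClass = &configured
	}
	return in
}

func main() {
	fmt.Println(newPutInput("").StorageClass == nil)      // default: field omitted
	fmt.Println(*newPutInput("STANDARD_IA").StorageClass) // explicit value kept
}
```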