* Allow user to define admin access and secret key via values
* Add comments to values.yaml
* Add support for read for consistency
* Simplify templating
* Add checksum to s3 config
* Update comments
* Revert "Add checksum to s3 config"
This reverts commit d21a7038a8.
* Enforce IAM for s3tables bucket creation
* Prefer IAM path when policies exist
* Ensure IAM enforcement honors default allow
* address comments
* Reused the precomputed principal when setting tableBucketMetadata.OwnerAccountID, avoiding the redundant getAccountID call.
* get identity
* fix
* dedup
* fix
* comments
* fix tests
* update iam config
* go fmt
* fix ports
* fix flags
* mini clean shutdown
* Revert "update iam config"
This reverts commit ca48fdbb0a.
Revert "mini clean shutdown"
This reverts commit 9e17f6baff.
Revert "fix flags"
This reverts commit e9e7b29d2f.
Revert "go fmt"
This reverts commit bd3241960b.
* test/s3tables: share single weed mini per test package via TestMain
Previously each top-level test function in the catalog and s3tables
package started and stopped its own weed mini instance. This caused
failures when a prior instance wasn't cleanly stopped before the next
one started (port conflicts, leaked global state).
Changes:
- catalog/iceberg_catalog_test.go: introduce TestMain that starts one
shared TestEnvironment (external weed binary) before all tests and
tears it down after. All individual test functions now use sharedEnv.
Added randomSuffix() for unique resource names across tests.
- catalog/pyiceberg_test.go: updated to use sharedEnv instead of
per-test environments.
- catalog/pyiceberg_test_helpers.go -> pyiceberg_test_helpers_test.go:
renamed to a _test.go file so it can access TestEnvironment which is
defined in a test file.
- table-buckets/setup.go: add package-level sharedCluster variable.
- table-buckets/s3tables_integration_test.go: introduce TestMain that
starts one shared TestCluster before all tests. TestS3TablesIntegration
now uses sharedCluster. Extract startMiniClusterInDir (no *testing.T)
for TestMain use. TestS3TablesCreateBucketIAMPolicy keeps its own
cluster (different IAM config). Remove miniClusterMutex (no longer
needed). Fix Stop() to not panic when t is nil.
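As a rough illustration of the shared-environment pattern above, a TestMain along these lines starts one environment for the whole package; the TestEnvironment stub, startTestEnvironment, and randomSuffix here are simplified stand-ins, not the actual test code.
```
package catalog_test

import (
	"fmt"
	"math/rand"
	"os"
	"testing"
)

// TestEnvironment stands in for the real wrapper around the external weed binary.
type TestEnvironment struct{ /* process handle, ports, temp dirs, ... */ }

func startTestEnvironment() (*TestEnvironment, error) { return &TestEnvironment{}, nil } // placeholder
func (e *TestEnvironment) Stop()                      {}                                // placeholder

// sharedEnv is started once in TestMain and reused by every test in the package.
var sharedEnv *TestEnvironment

func TestMain(m *testing.M) {
	env, err := startTestEnvironment()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to start shared test environment:", err)
		os.Exit(1)
	}
	sharedEnv = env
	code := m.Run()
	env.Stop() // tear down once, after every test in the package has run
	os.Exit(code)
}

// randomSuffix keeps bucket and table names unique across tests sharing one environment.
func randomSuffix() string {
	return fmt.Sprintf("%08x", rand.Uint32())
}
```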
* delete
* parse
* default allow should work with anonymous
* fix port
* iceberg route
The failures came from Iceberg REST using the default `warehouse` bucket when no prefix is provided. The tests create random buckets, so /v1/namespaces was looking in `warehouse` and failing. The tests were updated to use the prefixed Iceberg routes (/v1/{bucket}/...) via a small helper.
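For illustration only, the prefixed route shape the tests switched to looks roughly like this; the helper name is hypothetical.
```
package catalog_test

import "fmt"

// icebergPath builds a bucket-prefixed Iceberg REST path,
// e.g. icebergPath("mybucket", "namespaces") -> "/v1/mybucket/namespaces".
func icebergPath(bucket, suffix string) string {
	return fmt.Sprintf("/v1/%s/%s", bucket, suffix)
}
```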
* test(s3tables): fix port conflicts and IAM ARN matching in integration tests
- Pass -master.dir explicitly to prevent filer store directory collision
between shared cluster and per-test clusters running in the same process
- Pass -volume.port.public and -volume.publicUrl to prevent the global
publicPort flag (mutated from 0 → concrete port by first cluster) from
being reused by a second cluster, causing 'address already in use'
- Remove the flag-reset loop in Stop() that reset global flag values while
other goroutines were reading them (race → panic)
- Fix IAM policy Resource ARN in TestS3TablesCreateBucketIAMPolicy to use
wildcards (arn:aws:s3tables:*:*:bucket/<name>) because the handler
generates ARNs with its own DefaultRegion (us-east-1) and principal name
('admin'), not the test constants testRegion/testAccountID
* docker: fix entrypoint chown guard; helm: add openshift-values.yaml
Fix a regression in entrypoint.sh where the DATA_UID/DATA_GID
ownership comparison was dropped, causing chown -R /data to run
unconditionally on every container start even when ownership was
already correct. Restore the guard so the recursive chown is
skipped when the seaweed user already owns /data — making startup
faster on subsequent runs and a no-op on OpenShift/PVC deployments
where fsGroup has already set correct ownership.
Add k8s/charts/seaweedfs/openshift-values.yaml: an example Helm
overrides file for deploying SeaweedFS on OpenShift (or any cluster
enforcing the Kubernetes restricted Pod Security Standard). Replaces
hostPath volumes with PVCs, sets runAsUser/fsGroup to 1000
(the seaweed user baked into the image), drops all capabilities,
disables privilege escalation, and enables RuntimeDefault seccomp —
satisfying OpenShift's default restricted SCC without needing a
custom SCC or root access.
Fixes #8381
* admin: add plugin runtime UI page and route wiring
* pb: add plugin gRPC contract and generated bindings
* admin/plugin: implement worker registry, runtime, monitoring, and config store
* admin/dash: wire plugin runtime and expose plugin workflow APIs
* command: add flags to enable plugin runtime
* admin: rename remaining plugin v2 wording to plugin
* admin/plugin: add detectable job type registry helper
* admin/plugin: add scheduled detection and dispatch orchestration
* admin/plugin: prefetch job type descriptors when workers connect
* admin/plugin: add known job type discovery API and UI
* admin/plugin: refresh design doc to match current implementation
* admin/plugin: enforce per-worker scheduler concurrency limits
* admin/plugin: use descriptor runtime defaults for scheduler policy
* admin/ui: auto-load first known plugin job type on page open
* admin/plugin: bootstrap persisted config from descriptor defaults
* admin/plugin: dedupe scheduled proposals by dedupe key
* admin/ui: add job type and state filters for plugin monitoring
* admin/ui: add per-job-type plugin activity summary
* admin/plugin: split descriptor read API from schema refresh
* admin/ui: keep plugin summary metrics global while tables are filtered
* admin/plugin: retry executor reservation before timing out
* admin/plugin: expose scheduler states for monitoring
* admin/ui: show per-job-type scheduler states in plugin monitor
* pb/plugin: rename protobuf package to plugin
* admin/plugin: rename pluginRuntime wiring to plugin
* admin/plugin: remove runtime naming from plugin APIs and UI
* admin/plugin: rename runtime files to plugin naming
* admin/plugin: persist jobs and activities for monitor recovery
* admin/plugin: lease one detector worker per job type
* admin/ui: show worker load from plugin heartbeats
* admin/plugin: skip stale workers for detector and executor picks
* plugin/worker: add plugin worker command and stream runtime scaffold
* plugin/worker: implement vacuum detect and execute handlers
* admin/plugin: document external vacuum plugin worker starter
* command: update plugin.worker help to reflect implemented flow
* command/admin: drop legacy Plugin V2 label
* plugin/worker: validate vacuum job type and respect min interval
* plugin/worker: test no-op detect when min interval not elapsed
* command/admin: document plugin.worker external process
* plugin/worker: advertise configured concurrency in hello
* command/plugin.worker: add jobType handler selection
* command/plugin.worker: test handler selection by job type
* command/plugin.worker: persist worker id in workingDir
* admin/plugin: document plugin.worker jobType and workingDir flags
* plugin/worker: support cancel request for in-flight work
* plugin/worker: test cancel request acknowledgements
* command/plugin.worker: document workingDir and jobType behavior
* plugin/worker: emit executor activity events for monitor
* plugin/worker: test executor activity builder
* admin/plugin: send last successful run in detection request
* admin/plugin: send cancel request when detect or execute context ends
* admin/plugin: document worker cancel request responsibility
* admin/handlers: expose plugin scheduler states API in no-auth mode
* admin/handlers: test plugin scheduler states route registration
* admin/plugin: keep worker id on worker-generated activity records
* admin/plugin: test worker id propagation in monitor activities
* admin/dash: always initialize plugin service
* command/admin: remove plugin enable flags and default to enabled
* admin/dash: drop pluginEnabled constructor parameter
* admin/plugin UI: stop checking plugin enabled state
* admin/plugin: remove docs for plugin enable flags
* admin/dash: remove unused plugin enabled check method
* admin/dash: fallback to in-memory plugin init when dataDir fails
* admin/plugin API: expose worker gRPC port in status
* command/plugin.worker: resolve admin gRPC port via plugin status
* split plugin UI into overview/configuration/monitoring pages
* Update layout_templ.go
* add volume_balance plugin worker handler
* wire plugin.worker CLI for volume_balance job type
* add erasure_coding plugin worker handler
* wire plugin.worker CLI for erasure_coding job type
* support multi-job handlers in plugin worker runtime
* allow plugin.worker jobType as comma-separated list
* admin/plugin UI: rename to Workers and simplify config view
* plugin worker: queue detection requests instead of capacity reject
* Update plugin_worker.go
* plugin volume_balance: remove force_move/timeout from worker config UI
* plugin erasure_coding: enforce local working dir and cleanup
* admin/plugin UI: rename admin settings to job scheduling
* admin/plugin UI: persist and robustly render detection results
* admin/plugin: record and return detection trace metadata
* admin/plugin UI: show detection process and decision trace
* plugin: surface detector decision trace as activities
* mini: start a plugin worker by default
* admin/plugin UI: split monitoring into detection and execution tabs
* plugin worker: emit detection decision trace for EC and balance
* admin workers UI: split monitoring into detection and execution pages
* plugin scheduler: skip proposals for active assigned/running jobs
* admin workers UI: add job queue tab
* plugin worker: add dummy stress detector and executor job type
* admin workers UI: reorder tabs to detection queue execution
* admin workers UI: regenerate plugin template
* plugin defaults: include dummy stress and add stress tests
* plugin dummy stress: rotate detection selections across runs
* plugin scheduler: remove cross-run proposal dedupe
* plugin queue: track pending scheduled jobs
* plugin scheduler: wait for executor capacity before dispatch
* plugin scheduler: skip detection when waiting backlog is high
* plugin: add disk-backed job detail API and persistence
* admin ui: show plugin job detail modal from job id links
* plugin: generate unique job ids instead of reusing proposal ids
* plugin worker: emit heartbeats on work state changes
* plugin registry: round-robin tied executor and detector picks
* add temporary EC overnight stress runner
* plugin job details: persist and render EC execution plans
* ec volume details: color data and parity shard badges
* shard labels: keep parity ids numeric and color-only distinction
* admin: remove legacy maintenance UI routes and templates
* admin: remove dead maintenance endpoint helpers
* Update layout_templ.go
* remove dummy_stress worker and command support
* refactor plugin UI to job-type top tabs and sub-tabs
* migrate weed worker command to plugin runtime
* remove plugin.worker command and keep worker runtime with metrics
* update helm worker args for jobType and execution flags
* set plugin scheduling defaults to global 16 and per-worker 4
* stress: fix RPC context reuse and remove redundant variables in ec_stress_runner
* admin/plugin: fix lifecycle races, safe channel operations, and terminal state constants
* admin/dash: randomize job IDs and fix priority zero-value overwrite in plugin API
* admin/handlers: implement buffered rendering to prevent response corruption
* admin/plugin: implement debounced persistence flusher and optimize BuildJobDetail memory lookups
* admin/plugin: fix priority overwrite and implement bounded wait in scheduler reserve
* admin/plugin: implement atomic file writes and fix run record side effects
* admin/plugin: use P prefix for parity shard labels in execution plans
* admin/plugin: enable parallel execution for cancellation tests
* admin: refactor time.Time fields to pointers for better JSON omitempty support
* admin/plugin: implement pointer-safe time assignments and comparisons in plugin core
* admin/plugin: fix time assignment and sorting logic in plugin monitor after pointer refactor
* admin/plugin: update scheduler activity tracking to use time pointers
* admin/plugin: fix time-based run history trimming after pointer refactor
* admin/dash: fix JobSpec struct literal in plugin API after pointer refactor
* admin/view: add D/P prefixes to EC shard badges for UI consistency
* admin/plugin: use lifecycle-aware context for schema prefetching
* Update ec_volume_details_templ.go
* admin/stress: fix proposal sorting and log volume cleanup errors
* stress: refine ec stress runner with math/rand and collection name
- Added Collection field to VolumeEcShardsDeleteRequest for correct filename construction.
- Replaced crypto/rand with seeded math/rand PRNG for bulk payloads.
- Added documentation for EcMinAge zero-value behavior.
- Added logging for ignored errors in volume/shard deletion.
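A minimal sketch of the seeded-PRNG payload generation mentioned above, assuming a helper along these lines; the name and seed handling are illustrative.
```
package stress

import "math/rand"

// bulkPayload fills a buffer with pseudo-random bytes. A seeded math/rand PRNG is much
// cheaper than crypto/rand for bulk test data, and a fixed seed keeps runs reproducible.
func bulkPayload(size int, seed int64) []byte {
	r := rand.New(rand.NewSource(seed))
	buf := make([]byte, size)
	r.Read(buf) // (*rand.Rand).Read always fills the slice and never returns an error
	return buf
}
```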
* admin: return internal server error for plugin store failures
Changed error status code from 400 Bad Request to 500 Internal Server Error for failures in GetPluginJobDetail to correctly reflect server-side errors.
* admin: implement safe channel sends and graceful shutdown sync
- Added sync.WaitGroup to Plugin struct to manage background goroutines.
- Implemented safeSendCh helper using recover() to prevent panics on closed channels.
- Ensured Shutdown() waits for all background operations to complete.
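A hedged sketch of the pattern described above; the safeSendCh helper, Plugin fields, and constructor are simplified stand-ins for the actual code.
```
package plugin

import "sync"

// safeSendCh sends v on ch but recovers from the panic raised if ch was already
// closed by a concurrent shutdown, reporting whether the send went through.
func safeSendCh[T any](ch chan<- T, v T) (sent bool) {
	defer func() {
		if recover() != nil {
			sent = false // channel was closed underneath us
		}
	}()
	ch <- v
	return true
}

// Plugin sketches the shutdown synchronization: background goroutines register on wg,
// and Shutdown waits for all of them to finish before returning.
type Plugin struct {
	wg   sync.WaitGroup
	done chan struct{}
}

func NewPlugin() *Plugin { return &Plugin{done: make(chan struct{})} }

func (p *Plugin) startBackground(fn func(stop <-chan struct{})) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		fn(p.done)
	}()
}

func (p *Plugin) Shutdown() {
	close(p.done) // signal background goroutines to stop
	p.wg.Wait()   // and wait until they have all exited
}
```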
* admin: robustify plugin monitor with nil-safe time and record init
- Standardized nil-safe assignment for *time.Time pointers (CreatedAt, UpdatedAt, CompletedAt).
- Ensured persistJobDetailSnapshot initializes new records correctly if they don't exist on disk.
- Fixed debounced persistence to trigger immediate write on job completion.
* admin: improve scheduler shutdown behavior and logic guards
- Replaced brittle error string matching with explicit r.shutdownCh selection for shutdown detection.
- Removed redundant nil guard in buildScheduledJobSpec.
- Standardized WaitGroup usage for schedulerLoop.
* admin: implement deep copy for job parameters and atomic write fixes
- Implemented deepCopyGenericValue and used it in cloneTrackedJob to prevent shared state.
- Ensured atomicWriteFile creates parent directories before writing.
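A hedged sketch of the atomic write described above (temp file in the same directory, then rename), including the parent-directory fix; defaultDirPerm and the function name follow the commit wording, but the body is illustrative.
```
package plugin

import (
	"os"
	"path/filepath"
)

const defaultDirPerm = 0o755

// atomicWriteFile writes data to path so readers never observe a partial file:
// the bytes go to a temporary file in the same directory, which is then renamed
// over the target. Parent directories are created first so fresh data dirs work.
func atomicWriteFile(path string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, defaultDirPerm); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(dir, filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // effectively a no-op after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmpName, perm); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}
```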
* admin: remove unreachable branch in shard classification
Removed an unreachable 'totalShards <= 0' check in classifyShardID as dataShards and parityShards are already guarded.
* admin: secure UI links and use canonical shard constants
- Added rel="noopener noreferrer" to external links for security.
- Replaced magic number 14 with erasure_coding.TotalShardsCount.
- Used renderEcShardBadge for missing shard list consistency.
* admin: stabilize plugin tests and fix regressions
- Composed a robust plugin_monitor_test.go to handle asynchronous persistence.
- Updated all time.Time literals to use timeToPtr helper.
- Added explicit Shutdown() calls in tests to synchronize with debounced writes.
- Fixed syntax errors and orphaned struct literals in tests.
* Potential fix for code scanning alert no. 278: Slice memory allocation with excessive size value
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 283: Uncontrolled data used in path expression
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* admin: finalize refinements for error handling, scheduler, and race fixes
- Standardized HTTP 500 status codes for store failures in plugin_api.go.
- Tracked scheduled detection goroutines with sync.WaitGroup for safe shutdown.
- Fixed race condition in safeSendDetectionComplete by extracting channel under lock.
- Implemented deep copy for JobActivity details.
- Used defaultDirPerm constant in atomicWriteFile.
* test(ec): migrate admin dockertest to plugin APIs
* admin/plugin_api: fix RunPluginJobTypeAPI to return 500 for server-side detection/filter errors
* admin/plugin_api: fix ExecutePluginJobAPI to return 500 for job execution failures
* admin/plugin_api: limit parseProtoJSONBody request body to 1MB to prevent unbounded memory usage
* admin/plugin: consolidate regex to package-level validJobTypePattern; add char validation to sanitizeJobID
* admin/plugin: fix racy Shutdown channel close with sync.Once
* admin/plugin: track sendLoop and recv goroutines in WorkerStream with r.wg
* admin/plugin: document writeProtoFiles atomicity — .pb is source of truth, .json is human-readable only
* admin/plugin: extract activityLess helper to deduplicate nil-safe OccurredAt sort comparators
* test/ec: check http.NewRequest errors to prevent nil req panics
* test/ec: replace deprecated ioutil/math/rand, fix stale step comment 5.1→3.1
* plugin(ec): raise default detection and scheduling throughput limits
* topology: include empty disks in volume list and EC capacity fallback
* topology: remove hard 10-task cap for detection planning
* Update ec_volume_details_templ.go
* adjust default
* fix tests
---------
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* helm: add Iceberg REST catalog support to S3 service
* helm: add Iceberg REST catalog support to S3 service
* add ingress for iceberg catalog endpoint
* helm: conditionally render ingressClassName in s3-iceberg-ingress.yaml
* helm: refactor s3-iceberg-ingress.yaml to use named template for paths
* helm: remove unused $serviceName variable in s3-iceberg-ingress.yaml
---------
Co-authored-by: yalin.sahin <yalin.sahin@tradition.ch>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* helm: add Iceberg REST catalog support to S3 service
* helm: add Iceberg REST catalog support to S3 service
---------
Co-authored-by: yalin.sahin <yalin.sahin@tradition.ch>
* Update Helm hook annotations for post-install and upgrade
It makes sense to allow this job to run after installation as well as on upgrades: weed shell is assumed to be idempotent, and someone who adds a new bucket after the initial installation should be able to trigger the job again.
* Add check for existing buckets before creation
* Enhances S3 bucket existence check
Improves the reliability of checking for existing S3 buckets in the post-install hook.
The previous `grep -w` command could lead to imprecise matches. This update extracts only the bucket name and performs an exact, whole-line match to ensure accurate detection of existing buckets. This prevents potential issues with redundant creation attempts or false negatives.
* Currently Bucket Creation is ignored if filer.s3.enabled is disabled
This commit enables bucket creation in both scenarios, i.e. when either filer.s3.enabled or s3.enabled is used.
---------
Co-authored-by: Emanuele <emanuele.leopardi@tset.com>
* refactor(helm): add componentName helper for truncation
* fix(helm): unify ingress backend naming with truncation
* fix(helm): unify statefulset/deployment naming with truncation
* fix(helm): add missing labels to services for servicemonitor discovery
* chore(helm): secure secrets and add upgrade notes
* fix(helm): truncate context instead of suffix in componentName
* revert(docs): remove upgrade notes per feedback
* fix(helm): use componentName for COSI serviceAccountName
* helm: update master -ip to use component name for correct truncation
* helm: refactor masterServers helper to use truncated component names
* helm: update volume -ip to use component name and cleanup redundant printf
* helm: refine helpers with robustness check and updated docs
There is a mismatch in the conditionals for the definition and mounting of the `config-users` volume in the filer's template.
Volume definition:
```
{{- if and .Values.filer.s3.enabled .Values.filer.s3.enableAuth }}
```
Mount:
```
{{- if .Values.filer.s3.enableAuth }}
```
This leads to an invalid specification when s3 is disabled but enableAuth is set to true, because the template tries to mount an undefined volume. I've fixed it here by adding the extra check to the mount conditional.
* Fix: Add -admin.grpc flag to worker for explicit gRPC port configuration
* Fix(helm): Add adminGrpcServer to worker configuration
* Refactor: Support host:port.grpcPort address format, revert -admin.grpc flag
* Helm: Conditionally append grpcPort to worker admin address
* weed/admin: fix "send on closed channel" panic in worker gRPC server
Make unregisterWorker connection-aware to prevent closing channels
belonging to newer connections.
* weed/worker: improve gRPC client stability and logging
- Fix goroutine leak in reconnection logic
- Refactor reconnection loop to exit on success and prevent busy-waiting
- Add session identification and enhanced logging to client handlers
- Use constant for internal reset action and remove unused variables
* weed/worker: fix worker state initialization and add lifecycle logs
- Revert workerState to use running boolean correctly
- Prevent handleStart failing by checking running state instead of startTime
- Add more detailed logs for worker startup events
* feat: Add probes to worker service
* feat: Add probes to worker service
* Merge branch 'master' into pr/7896
* refactor
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
The worker deployment was incorrectly passing the admin gRPC port (33646)
to the -admin flag. However, the SeaweedFS worker command automatically
calculates the gRPC port by adding 10000 to the HTTP port provided.
This caused workers to attempt connection to port 43646 (33646 + 10000)
instead of the correct gRPC port 33646 (23646 + 10000).
Changes:
- Update worker-deployment.yaml to use admin.port instead of admin.grpcPort
- Workers now correctly connect to admin HTTP port, allowing the binary
to calculate the gRPC port automatically
Fixes workers failing with:
"dial tcp <admin-ip>:43646: connect: no route to host"
Related:
- Worker code: weed/pb/grpc_client_server.go:272 (grpcPort = port + 10000)
- Worker docs: weed/command/worker.go:36 (admin HTTP port + 10000)
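The port arithmetic in a nutshell, as an illustrative helper rather than the real code path referenced above:
```
package worker

// adminGrpcPort mirrors the convention the worker relies on: the admin gRPC port is the
// admin HTTP port plus 10000. Passing the HTTP port 23646 yields the correct 33646;
// passing the gRPC port 33646 instead yields the unreachable 43646 seen in the error.
func adminGrpcPort(adminHTTPPort int) int {
	return adminHTTPPort + 10000
}
```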
* feat(worker): add metrics HTTP server and debug profiling support
- Add -metricsPort flag to enable Prometheus metrics endpoint
- Add -metricsIp flag to configure metrics server bind address
- Implement /metrics endpoint for Prometheus-compatible metrics
- Implement /health endpoint for Kubernetes readiness/liveness probes
- Add -debug flag to enable pprof debugging server
- Add -debug.port flag to configure debug server port
- Fix stats package import naming conflict by using alias
- Update usage examples to show new flags
Fixes #7843
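A hedged sketch of what such a metrics/health server looks like conceptually (a later commit below swaps the hand-rolled version for the shared stats.StartMetricsServer); the function name and wiring are illustrative.
```
package worker

import (
	"fmt"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// startMetricsServer exposes Prometheus metrics and a trivial health probe on ip:port.
// /metrics serves the default Prometheus registry; /health answers 200 for k8s probes.
func startMetricsServer(ip string, port int) error {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/health", func(rw http.ResponseWriter, _ *http.Request) {
		rw.WriteHeader(http.StatusOK)
		fmt.Fprintln(rw, "ok")
	})
	return http.ListenAndServe(fmt.Sprintf("%s:%d", ip, port), mux)
}
```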
* feat(helm): add worker metrics and health check support
- Update worker readiness probe to use httpGet on /health endpoint
- Update worker liveness probe to use httpGet on /health endpoint
- Add metricsPort flag to worker command in deployment template
- Support both httpGet and tcpSocket probe types for backward compatibility
- Update values.yaml with health check configuration
This enables Kubernetes pod lifecycle management for worker components through
proper health checks on the new metrics HTTP endpoint.
* feat(mini): align all services to share single debug and metrics servers
- Disable S3's separate debug server in mini mode (port 6060 now shared by all)
- Add metrics server startup to embedded worker for health monitoring
- All services now share the single metrics port (9327) and single debug port (6060)
- Consistent pattern with master, filer, volume, webdav services
* fix(worker): fix variable shadowing in health check handler
- Rename http.ResponseWriter parameter from 'w' to 'rw' to avoid shadowing
the outer 'w *worker.Worker' parameter
- Prevents potential bugs if future code tries to use worker state in handler
- Improves code clarity and follows Go best practices
* fix(worker): remove unused worker parameter in metrics server
- Change 'w *worker.Worker' parameter to '_' as it's not used
- Clarifies intent that parameter is intentionally unused
- Follows Go best practices and improves code clarity
* fix(helm): fix trailing backslash syntax errors in worker command
- Fix conditional backslash placement to prevent shell syntax errors
- Only add backslash when metricsPort OR extraArgs are present
- Prevents worker pod startup failures due to malformed command arguments
- Ensures proper shell command parsing regardless of configuration state
* refactor(worker): use standard stats.StartMetricsServer for consistency
- Replace custom metrics server implementation with stats.StartMetricsServer
to match pattern used in master, volume, s3, filer_sync components
- Simplifies code and improves maintainability
- Uses glog.Fatal for errors (consistent with other SeaweedFS components)
- Remove unused net/http and prometheus/promhttp imports
- Automatically provides /metrics and /health endpoints via standard implementation
* Fix Worker and Admin CA in helm chart
* Fix Worker and Admin CA in helm chart - add security.toml modification
* Fix Worker and Admin CA in helm chart - fix security.toml modification error
* Fix Worker and Admin CA in helm chart - fix errors in volume mounts
* Fix Worker and Admin CA in helm chart - address review comments
- Remove worker-cert from admin pod (principle of least privilege)
- Remove admin-cert from worker pod (principle of least privilege)
- Remove overly broad namespace wildcards from admin-cert dnsNames
- Remove overly broad namespace wildcards from worker-cert dnsNames
---------
Co-authored-by: chrislu <chris.lu@gmail.com>
* feat: add S3 bucket size and object count metrics
Adds periodic collection of bucket size metrics:
- SeaweedFS_s3_bucket_size_bytes: logical size (deduplicated across replicas)
- SeaweedFS_s3_bucket_physical_size_bytes: physical size (including replicas)
- SeaweedFS_s3_bucket_object_count: object count (deduplicated)
Collection runs every 1 minute via background goroutine that queries
filer Statistics RPC for each bucket's collection.
Also adds Grafana dashboard panels for:
- S3 Bucket Size (logical vs physical)
- S3 Bucket Object Count
* address PR comments: fix bucket size metrics collection
1. Fix collectCollectionInfoFromMaster to use master VolumeList API
- Now properly queries master for topology info
- Uses WithMasterClient to get volume list from master
- Correctly calculates logical vs physical size based on replication
2. Return error when filerClient is nil to trigger fallback
- Changed from 'return nil, nil' to 'return nil, error'
- Ensures fallback to filer stats is properly triggered
3. Implement pagination in listBucketNames
- Added listBucketPageSize constant (1000)
- Uses StartFromFileName for pagination
- Continues fetching until fewer entries than limit returned
4. Handle NewReplicaPlacementFromByte error and prevent division by zero
- Check error return from NewReplicaPlacementFromByte
- Default to 1 copy if error occurs
- Add explicit check for copyCount == 0
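A hedged sketch of the pagination loop from point 3; the listPage callback is a simplified stand-in for the filer ListEntries RPC and its StartFromFileName handling.
```
package s3metrics

const listBucketPageSize = 1000

// listPage stands in for one ListEntries call against the buckets directory:
// it returns up to `limit` entry names that sort after `startFrom`.
type listPage func(startFrom string, limit int) ([]string, error)

// listBucketNames pages through the bucket directory until a short page signals the end.
func listBucketNames(list listPage) ([]string, error) {
	var buckets []string
	lastFileName := ""
	for {
		entries, err := list(lastFileName, listBucketPageSize)
		if err != nil {
			return nil, err
		}
		buckets = append(buckets, entries...)
		if len(entries) > 0 {
			lastFileName = entries[len(entries)-1] // resume after the last entry seen
		}
		if len(entries) < listBucketPageSize {
			return buckets, nil // fewer entries than the limit: no more pages
		}
	}
}
```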
* simplify bucket size metrics: remove filer fallback, align with quota enforcement
- Remove fallback to filer Statistics RPC
- Use only master topology for collection info (same as s3.bucket.quota.enforce)
- Updated comments to clarify this runs the same collection logic as quota enforcement
- Simplified code by removing collectBucketSizeFromFilerStats
* use s3a.option.Masters directly instead of querying filer
* address PR comments: fix dashboard overlaps and improve metrics collection
Grafana dashboard fixes:
- Fix overlapping panels 55 and 59 in grafana_seaweedfs.json (moved 59 to y=30)
- Fix grid collision in k8s dashboard (moved panel 72 to y=48)
- Aggregate bucket metrics with max() by (bucket) for multi-instance S3 gateways
Go code improvements:
- Add graceful shutdown support via context cancellation
- Use ticker instead of time.Sleep for better shutdown responsiveness
- Distinguish EOF from actual errors in stream handling
* improve bucket size metrics: multi-master failover and proper error handling
- Initial delay now respects context cancellation using select with time.After
- Use WithOneOfGrpcMasterClients for multi-master failover instead of hardcoding Masters[0]
- Properly propagate stream errors instead of just logging them (EOF vs real errors)
* improve bucket size metrics: distributed lock and volume ID deduplication
- Add distributed lock (LiveLock) so only one S3 instance collects metrics at a time
- Add IsLocked() method to LiveLock for checking lock status
- Fix deduplication: use volume ID tracking instead of dividing by copyCount
- Previous approach gave wrong results if replicas were missing
- Now tracks seen volume IDs and counts each volume only once
- Physical size still includes all replicas for accurate disk usage reporting
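A hedged sketch of the deduplication logic described above; volumeCopy is a simplified view of the topology data, not the real protobuf type.
```
package s3metrics

// volumeCopy is a simplified view of one replica of a volume as reported in the
// master topology: the same volume ID appears once per replica.
type volumeCopy struct {
	ID   uint32
	Size uint64
}

// sizesByDedup counts each volume ID once for the logical size while still summing
// every replica for the physical (on-disk) size. Unlike dividing by the configured
// copy count, this stays correct even when some replicas are missing.
func sizesByDedup(replicas []volumeCopy) (logical, physical uint64) {
	seen := make(map[uint32]bool)
	for _, v := range replicas {
		physical += v.Size
		if !seen[v.ID] {
			seen[v.ID] = true
			logical += v.Size
		}
	}
	return logical, physical
}
```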
* rename lock to s3.leader
* simplify: remove StartBucketSizeMetricsCollection wrapper function
* fix data race: use atomic operations for LiveLock.isLocked field
- Change isLocked from bool to int32
- Use atomic.LoadInt32/StoreInt32 for all reads/writes
- Sync shared isLocked field in StartLongLivedLock goroutine
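A reduced sketch of the race fix described above, showing only the atomic flag handling; the real LiveLock carries more state.
```
package locks

import "sync/atomic"

// LiveLock is reduced here to the field in question: isLocked is an int32 flag that is
// only touched through the atomic package, so the lock-renewal goroutine and IsLocked()
// callers never race on it.
type LiveLock struct {
	isLocked int32
}

func (l *LiveLock) setLocked(v bool) {
	var flag int32
	if v {
		flag = 1
	}
	atomic.StoreInt32(&l.isLocked, flag)
}

// IsLocked reports whether the lock is currently believed to be held.
func (l *LiveLock) IsLocked() bool {
	return atomic.LoadInt32(&l.isLocked) == 1
}
```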
* add nil check for topology info to prevent panic
* fix bucket metrics: use Ticker for consistent intervals, fix pagination logic
- Use time.Ticker instead of time.After for consistent interval execution
- Fix pagination: count all entries (not just directories) for proper termination
- Update lastFileName for all entries to prevent pagination issues
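A hedged sketch of the ticker-driven loop with context cancellation described in this and the earlier shutdown commits; names are illustrative.
```
package s3metrics

import (
	"context"
	"time"
)

// runBucketMetricsLoop collects bucket metrics once per interval until ctx is cancelled.
// time.Ticker keeps the cadence steady, and the select on ctx.Done() lets shutdown
// interrupt the wait between runs.
func runBucketMetricsLoop(ctx context.Context, interval time.Duration, collect func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := collect(ctx); err != nil {
				continue // one failed collection run should not stop the loop
			}
		}
	}
}
```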
* address PR comments: remove redundant atomic store, propagate context
- Remove redundant atomic.StoreInt32 in StartLongLivedLock (AttemptToLock already sets it)
- Propagate context through metrics collection for proper cancellation on shutdown
- collectAndUpdateBucketSizeMetrics now accepts ctx
- collectCollectionInfoFromMaster uses ctx for VolumeList RPC
- listBucketNames uses ctx for ListEntries RPC
* fix: add S3 bucket traffic sent metric tracking
The BucketTrafficSent() function was defined but never called, causing
the S3 Bucket Traffic Sent Grafana dashboard panel to not display data.
Added BucketTrafficSent() calls in the streaming functions:
- streamFromVolumeServers: for inline and chunked content
- streamFromVolumeServersWithSSE: for encrypted range and full object requests
The traffic received metric already worked because BucketTrafficReceived()
was properly called in putToFiler() for both regular and multipart uploads.
* feat: add S3 API Calls per Bucket panel to Grafana dashboards
Added a new panel showing API calls per bucket using the existing
SeaweedFS_s3_request_total metric aggregated by bucket.
Updated all Grafana dashboard files:
- other/metrics/grafana_seaweedfs.json
- other/metrics/grafana_seaweedfs_k8s.json
- other/metrics/grafana_seaweedfs_heartbeat.json
- k8s/charts/seaweedfs/dashboards/seaweedfs-grafana-dashboard.json
* address PR comments: use actual bytes written for traffic metrics
- Use actual bytes written from w.Write instead of expected size for inline content
- Add countingWriter wrapper to track actual bytes for chunked content streaming
- Update streamDecryptedRangeFromChunks to return actual bytes written for SSE
- Remove redundant nil check that caused linter warning
- Fix duplicate panel id 86 in grafana_seaweedfs.json (changed to 90)
- Fix overlapping panel positions in grafana_seaweedfs_k8s.json (rebalanced x positions)
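A hedged sketch of the countingWriter idea from the first two bullets above; the wrapper and the streaming helper around it are illustrative.
```
package s3api

import "io"

// countingWriter wraps an io.Writer and records how many bytes were actually written,
// so egress accounting can use the real number rather than the expected content length.
type countingWriter struct {
	w       io.Writer
	written int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
	n, err := c.w.Write(p)
	c.written += int64(n) // count even on a partial write that returns an error
	return n, err
}

// Example: stream chunks through the counter, then record c.written as traffic sent.
func streamWithAccounting(dst io.Writer, src io.Reader, record func(bytes int64)) error {
	c := &countingWriter{w: dst}
	_, err := io.Copy(c, src)
	record(c.written) // record even when err != nil, matching the partial-write fix
	return err
}
```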
* fix grafana k8s dashboard: rebalance S3 panels to avoid overlap
- Panel 86 (S3 API Calls per Bucket): w:6, x:0, y:15
- Panel 67 (S3 Request Duration 95th): w:6, x:6, y:15
- Panel 68 (S3 Request Duration 80th): w:6, x:12, y:15
- Panel 65 (S3 Request Duration 99th): w:6, x:18, y:15
All four S3 panels now fit in a single row (y:15) with width 6 each.
Filer row header at y:22 and subsequent panels remain correctly positioned.
* add input validation and clarify comments in adjustRangeForPart
- Add validation that partStartOffset <= partEndOffset at function start
- Add clarifying comments for suffix-range handling where clientEnd
temporarily holds the suffix length before being reassigned
* align pluginVersion for panel 86 to 10.3.1 in k8s dashboard
* track partial writes for accurate egress traffic accounting
- Change condition from 'err == nil' to 'written > 0' for inline content
- Move BucketTrafficSent before error check for chunked content streaming
- Track traffic even on partial SSE range writes
- Track traffic even on partial full SSE object copies
This ensures egress traffic is counted even when writes fail partway through,
providing more accurate bandwidth metrics.
Fixes #7467
The -mserver argument line in volume-statefulset.yaml was missing a
trailing backslash, which prevented extraArgs from being passed to
the weed volume process.
Also:
- Extracted master server list generation logic into shared helper
templates in _helpers.tpl for better maintainability
- Updated all occurrences of deprecated -mserver flag to -master
across docker-compose files, test files, and documentation
The allowEmptyFolder option is no longer functional because:
1. The code that used it was already commented out
2. Empty folder cleanup is now handled asynchronously by EmptyFolderCleaner
The CLI flags are kept for backward compatibility but marked as deprecated
and ignored. This removes:
- S3ApiServerOption.AllowEmptyFolder field
- The actual usage in s3api_object_handlers_list.go
- Helm chart values and template references
- References in test Makefiles and docker-compose files
* helm: enhance all-in-one deployment configuration
Fixes#7110
This PR addresses multiple issues with the all-in-one Helm chart configuration:
## New Features
### Configurable Replicas
- Added `allInOne.replicas` (was hardcoded to 1)
### S3 Gateway Configuration
- Added full S3 config under `allInOne.s3`:
- port, httpsPort, domainName, allowEmptyFolder
- enableAuth, existingConfigSecret, auditLogConfig
- createBuckets for declarative bucket creation
### SFTP Server Configuration
- Added full SFTP config under `allInOne.sftp`:
- port, sshPrivateKey, hostKeysFolder, authMethods
- maxAuthTries, bannerMessage, loginGraceTime
- clientAliveInterval, clientAliveCountMax, enableAuth
### Command Line Arguments
- Added `allInOne.extraArgs` for custom CLI arguments
### Update Strategy
- Added `allInOne.updateStrategy.type` (Recreate/RollingUpdate)
### Secret Environment Variables
- Added `allInOne.secretExtraEnvironmentVars` for injecting secrets
### Ingress Support
- Added `allInOne.ingress` with S3, filer, and master sub-configs
### Storage Options
- Enhanced `allInOne.data` with existingClaim support
- Added PVC template for persistentVolumeClaim type
## CI Enhancements
- Added comprehensive tests for all-in-one configurations
- Tests cover replicas, S3, SFTP, extraArgs, strategies, PVC, ingress
* helm: add real cluster deployment tests to CI
- Deploy all-in-one cluster with S3 enabled on kind cluster
- Test Master API (/cluster/status endpoint)
- Test Filer API (file upload/download)
- Test S3 API (/status endpoint)
- Test S3 operations with AWS CLI:
- Create/delete buckets
- Upload/download/delete objects
- Verify file content integrity
* helm: simplify CI and remove all-in-one ingress
Address review comments:
- Remove detailed all-in-one template rendering tests from CI
- Remove real cluster deployment tests from CI
- Remove all-in-one ingress template and values configuration
Keep the core improvements:
- allInOne.replicas configuration
- allInOne.s3.* full configuration
- allInOne.sftp.* full configuration
- allInOne.extraArgs support
- allInOne.updateStrategy configuration
- allInOne.secretExtraEnvironmentVars support
* helm: address review comments
- Fix post-install-bucket-hook.yaml: add filer.s3.enableAuth and
filer.s3.existingConfigSecret to or statements for consistency
- Fix all-in-one-deployment.yaml: use default function for s3.domainName
- Fix all-in-one-deployment.yaml: use hasKey function for s3.allowEmptyFolder
* helm: clarify updateStrategy multi-replica behavior
Expand comment to warn users that RollingUpdate with multiple replicas
requires shared storage (ReadWriteMany) to avoid data loss.
* helm: address gemini-code-assist review comments
- Make PVC accessModes configurable to support ReadWriteMany for
multi-replica deployments (defaults to ReadWriteOnce)
- Use configured readiness probe paths in post-install bucket hook
instead of hardcoded paths, respecting custom configurations
* helm: simplify allowEmptyFolder logic using coalesce
Use coalesce function for cleaner template code as suggested in review.
* helm: fix extraArgs trailing backslash issue
Remove trailing backslash after the last extraArgs argument to avoid
shell syntax error. Use counter to only add backslash between arguments.
* helm: fix fallback logic for allInOne s3/sftp configuration
Changes:
- Set allInOne.s3.* and allInOne.sftp.* override parameters to null by default
This allows proper inheritance from global s3.* and sftp.* settings
- Fix allowEmptyFolder logic to use explicit nil checking instead of coalesce
The coalesce/default functions treat 'false' as empty, causing incorrect
fallback behavior when users want to explicitly set false values
Addresses review feedback about default value conflicts with fallback logic.
* helm: fix exec in bucket creation loop causing premature termination
Remove 'exec' from the range loops that create and configure S3 buckets.
The exec command replaces the current shell process, causing the script
to terminate after the first bucket, preventing creation/configuration
of subsequent buckets.
* helm: quote extraArgs to handle arguments with spaces
Use the quote function to ensure each item in extraArgs is treated as
a single, complete argument even if it contains spaces.
* helm: make s3/filer ingress work for both normal and all-in-one modes
Modified s3-ingress.yaml and filer-ingress.yaml to dynamically select
the service name based on deployment mode:
- Normal mode: points to seaweedfs-s3 / seaweedfs-filer services
- All-in-one mode: points to seaweedfs-all-in-one service
This eliminates the need for separate all-in-one ingress templates.
Users can now use the standard s3.ingress and filer.ingress settings
for both deployment modes.
* helm: fix allInOne.data.size and storageClass to use null defaults
Change size and storageClass from empty strings to null so the template
defaults (10Gi for size, cluster default for storageClass) will apply
correctly. Empty strings prevent the Helm | default function from working.
* helm: fix S3 ingress to include standalone S3 gateway case
Add s3.enabled check to the $s3Enabled logic so the ingress works for:
1. Standalone S3 gateway (s3.enabled)
2. S3 on Filer (filer.s3.enabled) when not in all-in-one mode
3. S3 in all-in-one mode (allInOne.s3.enabled)