* feat: add S3 bucket size and object count metrics
Adds periodic collection of bucket size metrics:
- SeaweedFS_s3_bucket_size_bytes: logical size (deduplicated across replicas)
- SeaweedFS_s3_bucket_physical_size_bytes: physical size (including replicas)
- SeaweedFS_s3_bucket_object_count: object count (deduplicated)
Collection runs every minute via a background goroutine that queries the
filer Statistics RPC for each bucket's collection.
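A minimal sketch of this loop, with hypothetical helpers (`listBucketNames`, `collectBucketSize`) standing in for the real filer RPC calls; only the gauge name mirrors the metrics listed above:

```go
package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Expands to SeaweedFS_s3_bucket_size_bytes, matching the metric above.
var bucketSizeBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "SeaweedFS",
	Subsystem: "s3",
	Name:      "bucket_size_bytes",
	Help:      "Logical size of each bucket in bytes.",
}, []string{"bucket"})

func init() { prometheus.MustRegister(bucketSizeBytes) }

// Stand-ins for the real lookups (filer ListEntries / Statistics RPCs).
func listBucketNames() ([]string, error)            { return []string{"demo"}, nil }
func collectBucketSize(name string) (uint64, error) { return 0, nil }

// The background goroutine: refresh the gauges once per minute.
func collectBucketSizeMetricsLoop() {
	for {
		buckets, err := listBucketNames()
		if err != nil {
			log.Printf("list buckets: %v", err)
		}
		for _, name := range buckets {
			if size, err := collectBucketSize(name); err == nil {
				bucketSizeBytes.WithLabelValues(name).Set(float64(size))
			}
		}
		time.Sleep(time.Minute)
	}
}

func main() {
	go collectBucketSizeMetricsLoop()
	select {} // block forever; real code runs this inside the S3 server
}
```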
Also adds Grafana dashboard panels for:
- S3 Bucket Size (logical vs physical)
- S3 Bucket Object Count
* address PR comments: fix bucket size metrics collection
1. Fix collectCollectionInfoFromMaster to use master VolumeList API
- Now properly queries master for topology info
- Uses WithMasterClient to get volume list from master
- Correctly calculates logical vs physical size based on replication
2. Return error when filerClient is nil to trigger fallback
- Changed from 'return nil, nil' to returning an error
- Ensures fallback to filer stats is properly triggered
3. Implement pagination in listBucketNames
- Added listBucketPageSize constant (1000)
- Uses StartFromFileName for pagination
- Continues fetching until fewer entries than the limit are returned
4. Handle NewReplicaPlacementFromByte error and prevent division by zero
- Check error return from NewReplicaPlacementFromByte
- Default to 1 copy if error occurs
- Add explicit check for copyCount == 0
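A sketch of the item-4 guard; `NewReplicaPlacementFromByte` and `GetCopyCount` are the SeaweedFS identifiers the commit names, while the surrounding function shape is assumed:

```go
package metrics

import (
	"log"

	"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
)

// logicalSize derives a deduplicated size from a physical total by
// dividing by the replica count, defaulting to 1 copy on a parse error
// so the division can never hit zero.
func logicalSize(physical uint64, replicaPlacementByte byte) uint64 {
	copyCount := 1
	if rp, err := super_block.NewReplicaPlacementFromByte(replicaPlacementByte); err != nil {
		log.Printf("parse replica placement %d: %v", replicaPlacementByte, err)
	} else if c := rp.GetCopyCount(); c > 0 {
		copyCount = c
	}
	return physical / uint64(copyCount)
}
```

(A later commit in this PR replaces the division with volume-ID deduplication.)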
* simplify bucket size metrics: remove filer fallback, align with quota enforcement
- Remove fallback to filer Statistics RPC
- Use only master topology for collection info (same as s3.bucket.quota.enforce)
- Updated comments to clarify this runs the same collection logic as quota enforcement
- Simplified code by removing collectBucketSizeFromFilerStats
* use s3a.option.Masters directly instead of querying filer
* address PR comments: fix dashboard overlaps and improve metrics collection
Grafana dashboard fixes:
- Fix overlapping panels 55 and 59 in grafana_seaweedfs.json (moved 59 to y=30)
- Fix grid collision in k8s dashboard (moved panel 72 to y=48)
- Aggregate bucket metrics with max() by (bucket) for multi-instance S3 gateways
Go code improvements:
- Add graceful shutdown support via context cancellation
- Use ticker instead of time.Sleep for better shutdown responsiveness
- Distinguish EOF from actual errors in stream handling
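The loop shape after these changes, as a sketch; the `collect` and `recv` callbacks stand in for the real metric collection and the generated gRPC stream:

```go
package metrics

import (
	"context"
	"errors"
	"io"
	"log"
	"time"
)

// A ticker keeps the interval steady, and ctx.Done() lets shutdown
// interrupt the wait immediately instead of sleeping through it.
func runCollection(ctx context.Context, collect func(context.Context) error) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // graceful shutdown
		case <-ticker.C:
			if err := collect(ctx); err != nil {
				log.Printf("collect bucket metrics: %v", err)
			}
		}
	}
}

// drainStream shows the EOF distinction: io.EOF is the normal end of a
// server stream, anything else is a real error worth propagating.
func drainStream(recv func() (any, error), handle func(any)) error {
	for {
		msg, err := recv()
		if errors.Is(err, io.EOF) {
			return nil // normal end of stream
		}
		if err != nil {
			return err // actual error: propagate, don't just log
		}
		handle(msg)
	}
}
```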
* improve bucket size metrics: multi-master failover and proper error handling
- Initial delay now respects context cancellation using select with time.After
- Use WithOneOfGrpcMasterClients for multi-master failover instead of hardcoding Masters[0]
- Properly propagate stream errors instead of just logging them (EOF vs real errors)
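Two sketches of these points; `withOneOfMasters` is an illustrative reduction of what `pb.WithOneOfGrpcMasterClients` does, not its real signature:

```go
package metrics

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Initial delay that honors cancellation: waiting via select means a
// shutdown during the warm-up period returns immediately.
func waitInitialDelay(ctx context.Context, delay time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(delay):
		return nil
	}
}

// The failover idea: try each master in turn until one call succeeds,
// instead of pinning everything to Masters[0].
func withOneOfMasters(masters []string, call func(addr string) error) error {
	if len(masters) == 0 {
		return errors.New("no masters configured")
	}
	var lastErr error
	for _, addr := range masters {
		if lastErr = call(addr); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("all masters failed: %w", lastErr)
}
```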
* improve bucket size metrics: distributed lock and volume ID deduplication
- Add distributed lock (LiveLock) so only one S3 instance collects metrics at a time
- Add IsLocked() method to LiveLock for checking lock status
- Fix deduplication: use volume ID tracking instead of dividing by copyCount
- Previous approach gave wrong results if replicas were missing
- Now tracks seen volume IDs and counts each volume only once
- Physical size still includes all replicas for accurate disk usage reporting
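A sketch of the dedup logic under an assumed `VolumeStat` shape (the real data comes from the master's VolumeList response):

```go
package metrics

// VolumeStat is a hypothetical stand-in for one volume replica as
// reported by the master topology.
type VolumeStat struct {
	ID    uint32
	Size  uint64
	Files uint64
}

// aggregate counts each volume ID once for logical size and object
// count, while physical size sums every replica. Unlike dividing by
// copyCount, this stays correct when some replicas are missing.
func aggregate(volumes []VolumeStat) (logical, physical, objects uint64) {
	seen := make(map[uint32]bool)
	for _, v := range volumes {
		physical += v.Size // every replica contributes to disk usage
		if seen[v.ID] {
			continue // replica of a volume already counted
		}
		seen[v.ID] = true
		logical += v.Size
		objects += v.Files
	}
	return logical, physical, objects
}
```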
* rename lock to s3.leader
* simplify: remove StartBucketSizeMetricsCollection wrapper function
* fix data race: use atomic operations for LiveLock.isLocked field
- Change isLocked from bool to int32
- Use atomic.LoadInt32/StoreInt32 for all reads/writes
- Sync shared isLocked field in StartLongLivedLock goroutine
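The shape of the fix, sketched outside the real package; the struct here is illustrative, only the field type and accessor follow the commit:

```go
package cluster

import "sync/atomic"

// isLocked is an int32 touched only through atomic loads and stores, so
// the lock-holder goroutine and IsLocked callers never race on a plain
// bool.
type LiveLock struct {
	isLocked int32 // 1 = held, 0 = not held
}

func (l *LiveLock) setLocked(held bool) {
	var v int32
	if held {
		v = 1
	}
	atomic.StoreInt32(&l.isLocked, v)
}

// IsLocked is the accessor added in an earlier commit, now race-free.
func (l *LiveLock) IsLocked() bool {
	return atomic.LoadInt32(&l.isLocked) == 1
}
```

(On Go 1.19+ an `atomic.Bool` would express the same thing; the commit uses int32 with Load/Store.)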
* add nil check for topology info to prevent panic
* fix bucket metrics: use Ticker for consistent intervals, fix pagination logic
- Use time.Ticker instead of time.After for consistent interval execution
- Fix pagination: count all entries (not just directories) for proper termination
- Update lastFileName for all entries to prevent pagination issues
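A sketch of the corrected pagination, with `listPage` standing in for the filer ListEntries RPC:

```go
package metrics

import "context"

const listBucketPageSize = 1000

type entry struct {
	Name  string
	IsDir bool
}

// listBucketNames pages through the buckets directory. Every returned
// entry (not just directories) counts toward the page total and advances
// lastFileName, so the loop terminates exactly when a short page comes
// back.
func listBucketNames(ctx context.Context,
	listPage func(ctx context.Context, startFrom string, limit int) ([]entry, error),
) ([]string, error) {
	var buckets []string
	lastFileName := ""
	for {
		entries, err := listPage(ctx, lastFileName, listBucketPageSize)
		if err != nil {
			return nil, err
		}
		for _, e := range entries {
			lastFileName = e.Name // advance for every entry, not just directories
			if e.IsDir {
				buckets = append(buckets, e.Name)
			}
		}
		if len(entries) < listBucketPageSize {
			return buckets, nil // short page: no more entries
		}
	}
}
```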
* address PR comments: remove redundant atomic store, propagate context
- Remove redundant atomic.StoreInt32 in StartLongLivedLock (AttemptToLock already sets it)
- Propagate context through metrics collection for proper cancellation on shutdown
- collectAndUpdateBucketSizeMetrics now accepts ctx
- collectCollectionInfoFromMaster uses ctx for VolumeList RPC
- listBucketNames uses ctx for ListEntries RPC
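Reduced to its shape, the ctx threading looks like this; the function names follow the commit, the bodies are stand-ins:

```go
package metrics

import "context"

func collectAndUpdateBucketSizeMetrics(ctx context.Context) error {
	buckets, err := listBucketNames(ctx) // ListEntries RPC sees ctx
	if err != nil {
		return err
	}
	return collectCollectionInfoFromMaster(ctx, buckets) // VolumeList RPC sees ctx
}

// Stand-ins for the real RPC-backed helpers.
func listBucketNames(ctx context.Context) ([]string, error)                 { return nil, nil }
func collectCollectionInfoFromMaster(ctx context.Context, _ []string) error { return nil }
```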
* metrics: add Prometheus metrics for concurrent upload tracking
Add Prometheus metrics to monitor concurrent upload activity for both
filer and S3 servers. This provides visibility into the upload limiting
feature added in the previous PR.
New Metrics:
- SeaweedFS_filer_in_flight_upload_bytes: Current bytes being uploaded to filer
- SeaweedFS_filer_in_flight_upload_count: Current number of uploads to filer
- SeaweedFS_s3_in_flight_upload_bytes: Current bytes being uploaded to S3
- SeaweedFS_s3_in_flight_upload_count: Current number of uploads to S3
The metrics are updated atomically whenever uploads start or complete,
providing real-time visibility into upload concurrency levels.
This helps operators:
- Monitor upload concurrency in real-time
- Set appropriate limits based on actual usage patterns
- Detect potential bottlenecks or capacity issues
- Track the effectiveness of upload limiting configuration
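A sketch of the update pattern, assuming a wrapper like `trackUpload` at the call sites; the real code updates the counters directly in the filer and S3 upload handlers:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Gauges expand to SeaweedFS_s3_in_flight_upload_bytes / _count,
// matching the metric names above.
var (
	inFlightUploadBytes = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "SeaweedFS", Subsystem: "s3",
		Name: "in_flight_upload_bytes",
		Help: "Current bytes being uploaded.",
	})
	inFlightUploadCount = prometheus.NewGauge(prometheus.GaugeOpts{
		Namespace: "SeaweedFS", Subsystem: "s3",
		Name: "in_flight_upload_count",
		Help: "Current number of uploads.",
	})
)

func init() { prometheus.MustRegister(inFlightUploadBytes, inFlightUploadCount) }

// trackUpload wraps an upload of n bytes; the deferred Sub/Dec runs on
// every return path, so the gauges cannot leak on error.
func trackUpload(n int64, upload func() error) error {
	inFlightUploadBytes.Add(float64(n))
	inFlightUploadCount.Inc()
	defer func() {
		inFlightUploadBytes.Sub(float64(n))
		inFlightUploadCount.Dec()
	}()
	return upload()
}
```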
* grafana: add dashboard panels for concurrent upload metrics
Add 4 new panels to the Grafana dashboard to visualize the concurrent
upload metrics added in this PR:
Filer Section:
- Filer Concurrent Uploads: Shows current number of concurrent uploads
- Filer Concurrent Upload Bytes: Shows current bytes being uploaded
S3 Gateway Section:
- S3 Concurrent Uploads: Shows current number of concurrent uploads
- S3 Concurrent Upload Bytes: Shows current bytes being uploaded
These panels help operators monitor upload concurrency in real-time and
tune the upload limiting configuration based on actual usage patterns.
* more efficient
fix deadlock when broadcasting to clients
When the master transfers leadership, the old master disconnects from all
filers and volume servers. In a large cluster, the broadcast messages can
outnumber the channel capacity of 100; if KeepConnect is no longer
listening on the channel during the disconnect, the send blocks forever.
The broadcast deadlocks and the whole cluster stops serving!
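The failure mode, reduced to a sketch; the drop-on-full policy shown here is one way a broadcaster can avoid blocking, not necessarily the exact fix this commit ships:

```go
package cluster

// Each connected client gets a buffered update channel; 100 matches the
// capacity described above.
const broadcastBuffer = 100

type Update struct{ Node string }

func newClientChan() chan *Update { return make(chan *Update, broadcastBuffer) }

// broadcast must never block: if a client's KeepConnect loop has already
// stopped receiving during a disconnect, a plain `ch <- msg` wedges once
// the buffer fills, stalling the whole broadcast loop. A select with
// default keeps the broadcaster moving.
func broadcast(clients []chan *Update, msg *Update) {
	for _, ch := range clients {
		select {
		case ch <- msg:
			// delivered (or buffered)
		default:
			// buffer full and nobody listening: skip this client instead
			// of blocking every other client behind it.
		}
	}
}
```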