seaweedfs

Chris Lu d95df76bca feat: separate scheduler lanes for iceberg, lifecycle, and volume management (#8787)

* feat: introduce scheduler lanes for independent per-workload scheduling

Split the single plugin scheduler loop into independent per-lane goroutines so that volume management, iceberg compaction, and lifecycle operations never block each other.

Each lane has its own:
- Goroutine (laneSchedulerLoop)
- Wake channel for immediate scheduling
- Admin lock scope (e.g. "plugin scheduler:default")
- Configurable idle sleep duration
- Loop state tracking

Three lanes are defined:
- default: vacuum, volume_balance, ec_balance, erasure_coding, admin_script
- iceberg: iceberg_maintenance
- lifecycle: s3_lifecycle (new, handler coming in a later commit)

Job types are mapped to lanes via a hardcoded map with LaneDefault as the fallback. The SchedulerJobTypeState and SchedulerStatus types now include a Lane field for API consumers.

* feat: per-lane execution reservation pools for resource isolation

Each scheduler lane now maintains its own execution reservation map so that a busy volume lane cannot consume execution slots needed by iceberg or lifecycle lanes. The per-lane pool is used by default when dispatching jobs through the lane scheduler; the global pool remains as a fallback for the public DispatchProposals API.

* feat: add per-lane scheduler status API and lane worker UI pages

- GET /api/plugin/lanes returns all lanes with status and job types
- GET /api/plugin/workers?lane=X filters workers by lane
- GET /api/plugin/scheduler-states?lane=X filters job types by lane
- GET /api/plugin/scheduler-status?lane=X returns lane-scoped status
- GET /plugin/lanes/{lane}/workers renders per-lane worker page
- SchedulerJobTypeState now includes a "lane" field

The lane worker pages show scheduler status, job type configuration, and connected workers scoped to a single lane, with links back to the main plugin overview.

* feat: add s3_lifecycle worker handler for object store lifecycle management

Implements a full plugin worker handler for S3 lifecycle management, assigned to the new "lifecycle" scheduler lane.

Detection phase:
- Reads filer.conf to find buckets with TTL lifecycle rules
- Creates one job proposal per bucket with active lifecycle rules
- Supports bucket_filter wildcard pattern from admin config

Execution phase:
- Walks the bucket directory tree breadth-first
- Identifies expired objects by checking TtlSec + Crtime < now
- Deletes expired objects in configurable batches
- Reports progress with scanned/expired/error counts
- Supports dry_run mode for safe testing

Configurable via admin UI:
- batch_size: entries per filer listing page (default 1000)
- max_deletes_per_bucket: safety cap per run (default 10000)
- dry_run: detect without deleting
- delete_marker_cleanup: clean expired delete markers
- abort_mpu_days: abort stale multipart uploads

The handler integrates with the existing PutBucketLifecycle flow which sets TtlSec on entries via filer.conf path rules.

* feat: add per-lane submenu items under Workers sidebar menu

Replace the single "Workers" sidebar link with a collapsible submenu containing three lane entries:
- Default (volume management + admin scripts) -> /plugin
- Iceberg (table compaction) -> /plugin/lanes/iceberg/workers
- Lifecycle (S3 object expiration) -> /plugin/lanes/lifecycle/workers

The submenu auto-expands when on any /plugin page and highlights the active lane. Icons match each lane's job type descriptor (server, snowflake, hourglass).

* feat: scope plugin pages to their scheduler lane

The plugin overview, configuration, detection, queue, and execution pages now filter workers, job types, scheduler states, and scheduler status to only show data for their lane.

- Plugin() templ function accepts a lane parameter (default: "default")
- JavaScript appends ?lane= to /api/plugin/workers, /job-types, /scheduler-states, and /scheduler-status API calls
- GET /api/plugin/job-types now supports ?lane= filtering
- When ?job= is provided (e.g. ?job=iceberg_maintenance), the lane is auto-derived from the job type so the page scopes correctly

This ensures /plugin shows only default-lane workers and /plugin/configuration?job=iceberg_maintenance scopes to the iceberg lane.

* fix: remove "Lane" from lane worker page titles and capitalize properly

"lifecycle Lane Workers" -> "Lifecycle Workers"
"iceberg Lane Workers" -> "Iceberg Workers"

* refactor: promote lane items to top-level sidebar menu entries

Move Default, Iceberg, and Lifecycle from a collapsible submenu to direct top-level items under the WORKERS heading. Removes the intermediate "Workers" parent link and collapse toggle.

* admin: unify plugin lane routes and handlers

* admin: filter plugin jobs and activities by lane

* admin: reuse plugin UI for worker lane pages

* fix: use ServerAddress.ToGrpcAddress() for filer connections in lifecycle handler

ClusterContext addresses use ServerAddress format (host:port.grpcPort). Convert to the actual gRPC address via ToGrpcAddress() before dialing, and add a Ping verification after connecting.

Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"

* fix: resolve ServerAddress gRPC port in iceberg and lifecycle filer connections

ClusterContext addresses use ServerAddress format (host:httpPort.grpcPort). Both the iceberg and lifecycle handlers now detect the compound format and extract the gRPC port via ToGrpcAddress() before dialing. Plain host:port addresses (e.g. from tests) are passed through unchanged.

Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"

* align url

* Potential fix for code scanning alert no. 335: Incorrect conversion between integer types

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* fix: address PR review findings across scheduler lanes and lifecycle handler

- Fix variable shadowing: rename loop var `w` to `worker` in GetPluginWorkersAPI to avoid shadowing the http.ResponseWriter param
- Fix stale GetSchedulerStatus: aggregate loop states across all lanes instead of reading never-updated legacy schedulerLoopState
- Scope InProcessJobs to lane in GetLaneSchedulerStatus
- Fix AbortMPUDays=0 treated as unset: change <= 0 to < 0 so 0 disables
- Propagate listing errors in lifecycle bucket walk instead of swallowing
- Implement DeleteMarkerCleanup: scan for S3 delete marker entries and remove them
- Implement AbortMPUDays: scan .uploads directory and remove stale multipart uploads older than the configured threshold
- Fix success determination: mark job failed when result.errors > 0 even if no fatal error occurred
- Add regression test for jobTypeLaneMap to catch drift from handler registrations

* fix: guard against nil result in lifecycle completion and trim filer addresses

- Guard result dereference in completion summary: use local vars defaulting to 0 when result is nil to prevent panic
- Append trimmed filer addresses instead of originals so whitespace is not passed to the gRPC dialer

* fix: propagate ctx cancellation from deleteExpiredObjects and add config logging

- deleteExpiredObjects now returns a third error value when the context is canceled mid-batch; the caller stops processing further batches and returns the cancellation error to the job completion handler
- readBoolConfig and readInt64Config now log unexpected ConfigValue types at V(1) for debugging, consistent with readStringConfig

* fix: propagate errors in lifecycle cleanup helpers and use correct delete marker key

- cleanupDeleteMarkers: return error on ctx cancellation and SeaweedList failures instead of silently continuing
- abortIncompleteMPUs: log SeaweedList errors instead of discarding
- isDeleteMarker: use ExtDeleteMarkerKey ("Seaweed-X-Amz-Delete-Marker") instead of ExtLatestVersionIsDeleteMarker which is for the parent entry
- batchSize cap: use math.MaxInt instead of math.MaxInt32

* fix: propagate ctx cancellation from abortIncompleteMPUs and log unrecognized bool strings

- abortIncompleteMPUs now returns (aborted, errors, ctxErr) matching cleanupDeleteMarkers; caller stops on cancellation or listing failure
- readBoolConfig logs unrecognized string values before falling back

* fix: shared per-bucket budget across lifecycle phases and allow cleanup without expired objects

- Thread a shared remaining counter through TTL deletion, delete marker cleanup, and MPU abort so the total operations per bucket never exceed MaxDeletesPerBucket
- Remove early return when no TTL-expired objects found so delete marker cleanup and MPU abort still run
- Add NOTE on cleanupDeleteMarkers about version-safety limitation

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

19 hours ago
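
The commit message above describes two small rules that may be easier to follow as code: job types map to lanes through a hardcoded map with LaneDefault as the fallback, and an object counts as expired when TtlSec + Crtime < now. The following is only a minimal sketch of those two rules, not the code from the commit. The names `Lane`, `LaneIceberg`, `LaneLifecycle`, `jobTypeLanes`, `laneFor`, and `isExpired` are hypothetical, and treating a non-positive TTL as "never expires" is an assumption.

```go
package main

import "fmt"

// Lane names a scheduler lane. Only LaneDefault is named in the commit
// message; the type and the other two constants are hypothetical.
type Lane string

const (
	LaneDefault   Lane = "default"
	LaneIceberg   Lane = "iceberg"
	LaneLifecycle Lane = "lifecycle"
)

// jobTypeLanes is a hypothetical stand-in for the hardcoded map.
// The assignments follow the lane list in the commit message.
var jobTypeLanes = map[string]Lane{
	"vacuum":              LaneDefault,
	"volume_balance":      LaneDefault,
	"ec_balance":          LaneDefault,
	"erasure_coding":      LaneDefault,
	"admin_script":        LaneDefault,
	"iceberg_maintenance": LaneIceberg,
	"s3_lifecycle":        LaneLifecycle,
}

// laneFor returns the lane for a job type and falls back to
// LaneDefault for job types that are not in the map.
func laneFor(jobType string) Lane {
	if lane, ok := jobTypeLanes[jobType]; ok {
		return lane
	}
	return LaneDefault
}

// isExpired applies the check from the commit message:
// TtlSec + Crtime < now, with all values in Unix seconds.
// Assumption: a non-positive TTL means the entry never expires.
func isExpired(crtime, ttlSec, now int64) bool {
	if ttlSec <= 0 {
		return false
	}
	return crtime+ttlSec < now
}

func main() {
	fmt.Println(laneFor("iceberg_maintenance")) // iceberg
	fmt.Println(laneFor("some_future_job"))     // default (fallback)
	fmt.Println(isExpired(100, 50, 151))        // true: 150 < 151
	fmt.Println(isExpired(100, 50, 150))        // false: 150 is not < 150
}
```

The fallback keeps unknown job types schedulable on the default lane, which is presumably why the commit also adds a regression test for jobTypeLaneMap: a newly registered handler that is missing from the map would otherwise land in the default lane without any warning.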
admin	feat: separate scheduler lanes for iceberg, lifecycle, and volume management (#8787)	19 hours ago
cluster	admin: auto migrating master maintenance scripts to admin_script plugin config (#8509)	3 weeks ago
command	Add insecure_skip_verify option for HTTPS client in security.toml (#8781)	1 day ago
credential	fix(s3): omit NotResource:null from bucket policy JSON response (#8658)	2 weeks ago
filer	fix(weed/filer/store_test): fix dropped errors (#8782)	1 day ago
filer_client	fix: resolve gRPC DNS resolution issues in Kubernetes #8384 (#8387)	1 month ago
glog	glog: add JSON structured logging mode (#8708)	1 week ago
iam	Add data file compaction to iceberg maintenance (Phase 2) (#8503)	2 weeks ago
iamapi	fix(s3): omit NotResource:null from bucket policy JSON response (#8658)	2 weeks ago
images	chore: execute goimports to format the code (#7983)	3 months ago
kms	S3 API: Add integration with KMS providers (#7152)	7 months ago
mount	mount: stream all filer mutations over single ordered gRPC stream (#8770)	2 days ago
mq	fix: resolve Kafka gateway response deadlocks causing Sarama client hangs (#8762)	3 days ago
notification	go fix	1 month ago
operation	Add Prometheus metric to count upload errors (#8788)	22 hours ago
pb	Rust volume server implementation with CI (#8539)	22 hours ago
plugin/worker	feat: separate scheduler lanes for iceberg, lifecycle, and volume management (#8787)	19 hours ago
query	mount: make metadata cache rebuilds snapshot-consistent (#8531)	3 weeks ago
remote_storage	improve: large file sync throughput for remote.cache and filer.sync (#8676)	1 week ago
replication	improve: large file sync throughput for remote.cache and filer.sync (#8676)	1 week ago
s3api	fix: serialize SSE-KMS metadata when bucket default encryption applies KMS (#8780)	1 day ago
security	fix: port in SNI address when using domainName instead of IP for master (#8500)	3 weeks ago
sequence	chore: execute goimports to format the code (#7983)	3 months ago
server	Fix TUS chunked upload and resume failures (#8783) (#8786)	1 day ago
sftpd	Fix SFTP file upload failures with JWT filer tokens (#8448)	4 weeks ago
shell	Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. (#8360)	1 day ago
static	Fix Broken Links (#5287)	2 years ago
stats	Add Prometheus metric to count upload errors (#8788)	22 hours ago
storage	Rust volume server implementation with CI (#8539)	22 hours ago
telemetry	Prevent split-brain: Persistent ClusterID and Join Validation (#8022)	2 months ago
topology	Add data file compaction to iceberg maintenance (Phase 2) (#8503)	2 weeks ago
util	Add insecure_skip_verify option for HTTPS client in security.toml (#8781)	1 day ago
wdclient	wdclient/exclusive_locks: replace println with glog in ExclusiveLocker (#8723)	6 days ago
worker	chore:(weed/worker/tasks/erasure_coding): Prune Unused and Untested Functions (#8761)	3 days ago
Makefile	Move SQL engine and PostgreSQL server to their own binaries (#8417)	1 month ago
weed.go	Fix the issue where fuse command on a node cannot specify multiple configuration directory paths (#7874)	3 months ago