Tree: d95df76bca

Branches:
add-ec-vacuum
add-filer-iam-grpc
add-iam-grpc-management
add_fasthttp_client
add_remote_storage
adding-message-queue-integration-tests
adjust-fsck-cutoff-default
admin/csrf-s3tables
allow-no-role-arn
also-delete-parent-directory-if-empty
avoid_releasing_temp_file_on_write
changing-to-zap
coderabbitai/autofix/fafd849
codex-rust-volume-server-bootstrap
codex/8712-directory-marker-content-type
codex/admin-oidc-auth-ui
codex/cache-iam-policy-engines
codex/ec-repair-worker
codex/erasure-coding-shard-distribution
codex/list-object-versions-newest-first
codex/s3tables-maint-lifecycle-parity
codex/s3tables-maint-planner-multispec
codex/s3tables-maintenance-designs
collect-public-metrics
copilot/fix-helm-chart-installation
copilot/fix-s3-object-tagging-issue
copilot/make-renew-interval-configurable
copilot/make-renew-interval-configurable-again
copilot/sub-pr-7677
create-table-snapshot-api-design
data_query_pushdown
dependabot/maven/other/java/client/com.google.protobuf-protobuf-java-3.25.5
dependabot/maven/other/java/examples/org.apache.hadoop-hadoop-common-3.4.0
detect-and-plan-ec-tasks
do-not-retry-if-error-is-NotFound
ec-disk-type-support
enhance-erasure-coding
expand-the-s3-PutObject-permission-to-the-multipart-permissions
fasthttp
feat/mount-showSystemEntries
feature-8113-storage-class-disk-routing
feature/mini-port-detection
feature/modernize-s3-tests
feature/s3-multi-cert-support
feature/s3tables-improvements-and-spark-tests
feature/sra-uds-handler
feature/sw-block
filer1_maintenance_branch
fix-8303-s3-lifecycle-ttl-assign
fix-GetObjectLockConfigurationHandler
fix-bucket-name-case-7910
fix-helm-fromtoml-compatibility
fix-mount-http-parallelism
fix-mount-read-throughput-7504
fix-pr-7909
fix-s3-configure-consistency
fix-s3-object-tagging-issue-7589
fix-sts-session-token-7941
fix-versioning-listing-only
fix/8712-directory-markers-content-type
fix/iceberg-stage-create-semantics
fix/lock-table-shared-lock-precedence
fix/mount-cache-consistency
fix/object-lock-delete-enforcement
fix/plugin-ui-remove-scheduler-settings
fix/s3-conditional-headers-toctou-race
fix/s3-delete-directory-marker-non-empty
fix/sts-body-preservation
fix/subscribe-metadata-slow-consumer-blocked
fix/windows-test-file-cleanup
ftp
gh-pages
has-weed-sql-command
iam-multi-file-migration
iam-permissions-and-api
improve-fuse-mount
improve-fuse-mount2
lifecycle/pr1-evaluator
logrus
master
message_send
mount2
mq-subscribe
mq2
nfs-cookie-prefix-list-fixes
optimize-delete-lookups
original_weed_mount
plugin-system-phase1
plugin-ui-enhancements-restored
pr-7412
pr/7984
pr/8140
pr/8680
raft-dual-write
random_access_file
refactor-needle-read-operations
refactor-volume-write
remote_overlay
remove-claude-ci
remove-implicit-directory-handling
responsible-andesaurus
revert-5134-patch-1
revert-5819-patch-1
revert-6434-bugfix-missing-s3-audit
s3-remote-cache-singleflight
s3-select
s3tables-by-claude
scheduler-sequential-iteration
sub
tcp_read
test-reverting-lock-table
test_udp
testing
testing-sdx-generation
tikv
track-mount-e2e
upgrade-versions-to-4.00
volume_buffered_writes
worker-execute-ec-tasks
Tags:
0.72
0.72.release
0.73
0.74
0.75
0.76
0.77
0.90
0.91
0.92
0.93
0.94
0.95
0.96
0.97
0.98
0.99
1.00
1.01
1.02
1.03
1.04
1.05
1.06
1.07
1.08
1.09
1.10
1.11
1.12
1.14
1.15
1.16
1.17
1.18
1.19
1.20
1.21
1.22
1.23
1.24
1.25
1.26
1.27
1.28
1.29
1.30
1.31
1.32
1.33
1.34
1.35
1.36
1.37
1.38
1.40
1.41
1.42
1.43
1.44
1.45
1.46
1.47
1.48
1.49
1.50
1.51
1.52
1.53
1.54
1.55
1.56
1.57
1.58
1.59
1.60
1.61
1.61RC
1.62
1.63
1.64
1.65
1.66
1.67
1.68
1.69
1.70
1.71
1.72
1.73
1.74
1.75
1.76
1.77
1.78
1.79
1.80
1.81
1.82
1.83
1.84
1.85
1.86
1.87
1.88
1.90
1.91
1.92
1.93
1.94
1.95
1.96
1.97
1.98
1.99
1;70
2.00
2.01
2.02
2.03
2.04
2.05
2.06
2.07
2.08
2.09
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
2.19
2.20
2.21
2.22
2.23
2.24
2.25
2.26
2.27
2.28
2.29
2.30
2.31
2.32
2.33
2.34
2.35
2.36
2.37
2.38
2.39
2.40
2.41
2.42
2.43
2.47
2.48
2.49
2.50
2.51
2.52
2.53
2.54
2.55
2.56
2.57
2.58
2.59
2.60
2.61
2.62
2.63
2.64
2.65
2.66
2.67
2.68
2.69
2.70
2.71
2.72
2.73
2.74
2.75
2.76
2.77
2.78
2.79
2.80
2.81
2.82
2.83
2.84
2.85
2.86
2.87
2.88
2.89
2.90
2.91
2.92
2.93
2.94
2.95
2.96
2.97
2.98
2.99
3.00
3.01
3.02
3.03
3.04
3.05
3.06
3.07
3.08
3.09
3.10
3.11
3.12
3.13
3.14
3.15
3.16
3.18
3.19
3.20
3.21
3.22
3.23
3.24
3.25
3.26
3.27
3.28
3.29
3.30
3.31
3.32
3.33
3.34
3.35
3.36
3.37
3.38
3.39
3.40
3.41
3.42
3.43
3.44
3.45
3.46
3.47
3.48
3.50
3.51
3.52
3.53
3.54
3.55
3.56
3.57
3.58
3.59
3.60
3.61
3.62
3.63
3.64
3.65
3.66
3.67
3.68
3.69
3.71
3.72
3.73
3.74
3.75
3.76
3.77
3.78
3.79
3.80
3.81
3.82
3.83
3.84
3.85
3.86
3.87
3.88
3.89
3.90
3.91
3.92
3.93
3.94
3.95
3.96
3.97
3.98
3.99
4.00
4.01
4.02
4.03
4.04
4.05
4.06
4.07
4.08
4.09
4.12
4.13
4.15
4.16
4.17
dev
helm-3.65.1
v0.69
v0.70beta
v3.33
10 Commits (d95df76bca58cfaf2ede402feac8779de8588153)

commit d95df76bca (3 days ago)
feat: separate scheduler lanes for iceberg, lifecycle, and volume management (#8787)
* feat: introduce scheduler lanes for independent per-workload scheduling
Split the single plugin scheduler loop into independent per-lane
goroutines so that volume management, iceberg compaction, and lifecycle
operations never block each other.
Each lane has its own:
- Goroutine (laneSchedulerLoop)
- Wake channel for immediate scheduling
- Admin lock scope (e.g. "plugin scheduler:default")
- Configurable idle sleep duration
- Loop state tracking
Three lanes are defined:
- default: vacuum, volume_balance, ec_balance, erasure_coding, admin_script
- iceberg: iceberg_maintenance
- lifecycle: s3_lifecycle (new, handler coming in a later commit)
Job types are mapped to lanes via a hardcoded map with LaneDefault as
the fallback. The SchedulerJobTypeState and SchedulerStatus types now
include a Lane field for API consumers.
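The hardcoded job-type-to-lane mapping described above is simple enough to sketch. The following Go illustration is based only on this commit message; the identifiers are assumptions, not the actual SeaweedFS source.

```go
package plugin

// Lane identifies an independent scheduler loop.
type Lane string

const (
	LaneDefault   Lane = "default"
	LaneIceberg   Lane = "iceberg"
	LaneLifecycle Lane = "lifecycle"
)

// jobTypeLaneMap routes each job type to its scheduler lane,
// per the three lanes listed in the commit message.
var jobTypeLaneMap = map[string]Lane{
	"vacuum":              LaneDefault,
	"volume_balance":      LaneDefault,
	"ec_balance":          LaneDefault,
	"erasure_coding":      LaneDefault,
	"admin_script":        LaneDefault,
	"iceberg_maintenance": LaneIceberg,
	"s3_lifecycle":        LaneLifecycle,
}

// laneForJobType falls back to LaneDefault for unknown job types.
func laneForJobType(jobType string) Lane {
	if lane, ok := jobTypeLaneMap[jobType]; ok {
		return lane
	}
	return LaneDefault
}
```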
* feat: per-lane execution reservation pools for resource isolation
Each scheduler lane now maintains its own execution reservation map
so that a busy volume lane cannot consume execution slots needed by
iceberg or lifecycle lanes. The per-lane pool is used by default when
dispatching jobs through the lane scheduler; the global pool remains
as a fallback for the public DispatchProposals API.
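A rough sketch of per-lane reservation pools with a global fallback, reusing the Lane type from the previous sketch; the struct fields and locking scheme are assumptions, since only the behavior is described above.

```go
package plugin

import "sync"

type laneState struct {
	reservations map[string]int // jobType -> slots currently reserved in this lane
}

type scheduler struct {
	mu     sync.Mutex
	lanes  map[Lane]*laneState
	global map[string]int // fallback pool used by the public DispatchProposals API
}

// reserve takes one execution slot from the lane's own pool, so a busy
// volume lane cannot starve the iceberg or lifecycle lanes.
func (s *scheduler) reserve(lane Lane, jobType string, limit int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	pool := s.global
	if ls, ok := s.lanes[lane]; ok {
		pool = ls.reservations
	}
	if pool[jobType] >= limit {
		return false // this lane's slots for the job type are exhausted
	}
	pool[jobType]++
	return true
}
```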
* feat: add per-lane scheduler status API and lane worker UI pages
- GET /api/plugin/lanes returns all lanes with status and job types
- GET /api/plugin/workers?lane=X filters workers by lane
- GET /api/plugin/scheduler-states?lane=X filters job types by lane
- GET /api/plugin/scheduler-status?lane=X returns lane-scoped status
- GET /plugin/lanes/{lane}/workers renders per-lane worker page
- SchedulerJobTypeState now includes a "lane" field
The lane worker pages show scheduler status, job type configuration,
and connected workers scoped to a single lane, with links back to
the main plugin overview.
* feat: add s3_lifecycle worker handler for object store lifecycle management
Implements a full plugin worker handler for S3 lifecycle management,
assigned to the new "lifecycle" scheduler lane.
Detection phase:
- Reads filer.conf to find buckets with TTL lifecycle rules
- Creates one job proposal per bucket with active lifecycle rules
- Supports bucket_filter wildcard pattern from admin config
Execution phase:
- Walks the bucket directory tree breadth-first
- Identifies expired objects by checking TtlSec + Crtime < now
- Deletes expired objects in configurable batches
- Reports progress with scanned/expired/error counts
- Supports dry_run mode for safe testing
Configurable via admin UI:
- batch_size: entries per filer listing page (default 1000)
- max_deletes_per_bucket: safety cap per run (default 10000)
- dry_run: detect without deleting
- delete_marker_cleanup: clean expired delete markers
- abort_mpu_days: abort stale multipart uploads
The handler integrates with the existing PutBucketLifecycle flow which
sets TtlSec on entries via filer.conf path rules.
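The expiry test quoted above (TtlSec + Crtime < now) is concrete enough to illustrate. A minimal sketch, with a stand-in struct for the filer entry attributes:

```go
package lifecycle

// entryAttr is a stand-in for the filer entry attributes the real handler
// reads; only the two fields involved in the expiry rule are shown.
type entryAttr struct {
	Crtime int64 // creation time, unix seconds
	TtlSec int32 // TTL from the bucket's lifecycle rule; 0 means no TTL
}

// isExpired applies the rule described above: TtlSec + Crtime < now.
func isExpired(attr *entryAttr, nowUnix int64) bool {
	if attr == nil || attr.TtlSec <= 0 {
		return false // no TTL rule applies to this entry
	}
	return attr.Crtime+int64(attr.TtlSec) < nowUnix
}
```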
* feat: add per-lane submenu items under Workers sidebar menu
Replace the single "Workers" sidebar link with a collapsible submenu
containing three lane entries:
- Default (volume management + admin scripts) -> /plugin
- Iceberg (table compaction) -> /plugin/lanes/iceberg/workers
- Lifecycle (S3 object expiration) -> /plugin/lanes/lifecycle/workers
The submenu auto-expands when on any /plugin page and highlights the
active lane. Icons match each lane's job type descriptor (server,
snowflake, hourglass).
* feat: scope plugin pages to their scheduler lane
The plugin overview, configuration, detection, queue, and execution
pages now filter workers, job types, scheduler states, and scheduler
status to only show data for their lane.
- Plugin() templ function accepts a lane parameter (default: "default")
- JavaScript appends ?lane= to /api/plugin/workers, /job-types,
/scheduler-states, and /scheduler-status API calls
- GET /api/plugin/job-types now supports ?lane= filtering
- When ?job= is provided (e.g. ?job=iceberg_maintenance), the lane is
auto-derived from the job type so the page scopes correctly
This ensures /plugin shows only default-lane workers and
/plugin/configuration?job=iceberg_maintenance scopes to the iceberg lane.
* fix: remove "Lane" from lane worker page titles and capitalize properly
"lifecycle Lane Workers" -> "Lifecycle Workers"
"iceberg Lane Workers" -> "Iceberg Workers"
* refactor: promote lane items to top-level sidebar menu entries
Move Default, Iceberg, and Lifecycle from a collapsible submenu to
direct top-level items under the WORKERS heading. Removes the
intermediate "Workers" parent link and collapse toggle.
* admin: unify plugin lane routes and handlers
* admin: filter plugin jobs and activities by lane
* admin: reuse plugin UI for worker lane pages
* fix: use ServerAddress.ToGrpcAddress() for filer connections in lifecycle handler
ClusterContext addresses use ServerAddress format (host:port.grpcPort).
Convert to the actual gRPC address via ToGrpcAddress() before dialing,
and add a Ping verification after connecting.
Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"
* fix: resolve ServerAddress gRPC port in iceberg and lifecycle filer connections
ClusterContext addresses use ServerAddress format (host:httpPort.grpcPort).
Both the iceberg and lifecycle handlers now detect the compound format
and extract the gRPC port via ToGrpcAddress() before dialing. Plain
host:port addresses (e.g. from tests) are passed through unchanged.
Fixes: "dial tcp: lookup tcp/8888.18888: unknown port"
* align url
* Potential fix for code scanning alert no. 335: Incorrect conversion between integer types
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* fix: address PR review findings across scheduler lanes and lifecycle handler
- Fix variable shadowing: rename loop var `w` to `worker` in
GetPluginWorkersAPI to avoid shadowing the http.ResponseWriter param
- Fix stale GetSchedulerStatus: aggregate loop states across all lanes
instead of reading never-updated legacy schedulerLoopState
- Scope InProcessJobs to lane in GetLaneSchedulerStatus
- Fix AbortMPUDays=0 treated as unset: change <= 0 to < 0 so 0 disables
- Propagate listing errors in lifecycle bucket walk instead of swallowing
- Implement DeleteMarkerCleanup: scan for S3 delete marker entries and
remove them
- Implement AbortMPUDays: scan .uploads directory and remove stale
multipart uploads older than the configured threshold
- Fix success determination: mark job failed when result.errors > 0
even if no fatal error occurred
- Add regression test for jobTypeLaneMap to catch drift from handler
registrations
* fix: guard against nil result in lifecycle completion and trim filer addresses
- Guard result dereference in completion summary: use local vars
defaulting to 0 when result is nil to prevent panic
- Append trimmed filer addresses instead of originals so whitespace
is not passed to the gRPC dialer
* fix: propagate ctx cancellation from deleteExpiredObjects and add config logging
- deleteExpiredObjects now returns a third error value when the context
is canceled mid-batch; the caller stops processing further batches
and returns the cancellation error to the job completion handler
- readBoolConfig and readInt64Config now log unexpected ConfigValue
types at V(1) for debugging, consistent with readStringConfig
* fix: propagate errors in lifecycle cleanup helpers and use correct delete marker key
- cleanupDeleteMarkers: return error on ctx cancellation and SeaweedList
failures instead of silently continuing
- abortIncompleteMPUs: log SeaweedList errors instead of discarding
- isDeleteMarker: use ExtDeleteMarkerKey ("Seaweed-X-Amz-Delete-Marker")
instead of ExtLatestVersionIsDeleteMarker which is for the parent entry
- batchSize cap: use math.MaxInt instead of math.MaxInt32
* fix: propagate ctx cancellation from abortIncompleteMPUs and log unrecognized bool strings
- abortIncompleteMPUs now returns (aborted, errors, ctxErr) matching
cleanupDeleteMarkers; caller stops on cancellation or listing failure
- readBoolConfig logs unrecognized string values before falling back
* fix: shared per-bucket budget across lifecycle phases and allow cleanup without expired objects
- Thread a shared remaining counter through TTL deletion, delete marker
cleanup, and MPU abort so the total operations per bucket never exceed
MaxDeletesPerBucket
- Remove early return when no TTL-expired objects found so delete marker
cleanup and MPU abort still run
- Add NOTE on cleanupDeleteMarkers about version-safety limitation
---------
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
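The shared per-bucket budget in the final fix above amounts to threading one counter through all three cleanup phases. A sketch under assumed signatures, using the phase names from the commit message:

```go
package lifecycle

import "context"

// phase mirrors the three cleanup phases named above (TTL deletion, delete
// marker cleanup, MPU abort); each consumes from the same shared budget.
type phase func(ctx context.Context, bucket string, remaining *int64) error

// runBucketLifecycle threads one counter through all phases so the total
// operations per bucket never exceed maxDeletes. Signatures are assumptions.
func runBucketLifecycle(ctx context.Context, bucket string, maxDeletes int64, phases []phase) error {
	remaining := maxDeletes
	for _, p := range phases {
		if remaining <= 0 {
			return nil // budget exhausted for this bucket
		}
		if err := p(ctx, bucket, &remaining); err != nil {
			return err
		}
	}
	return nil
}
```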

commit 8cde3d4486 (2 weeks ago)
Add data file compaction to iceberg maintenance (Phase 2) (#8503)
* Add iceberg_maintenance plugin worker handler (Phase 1)
Implement automated Iceberg table maintenance as a new plugin worker job type. The handler scans S3 table buckets for tables needing maintenance and executes operations in the correct Iceberg order: expire snapshots, remove orphan files, and rewrite manifests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add data file compaction to iceberg maintenance handler (Phase 2)
Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata
Add 'compact' operation to maintenance order (runs before expire_snapshots), configurable via target_file_size_bytes and min_input_files thresholds.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix memory exhaustion in mergeParquetFiles by processing files sequentially
Previously all source Parquet files were loaded into memory simultaneously, risking OOM when a compaction bin contained many small files. Now each file is loaded, its rows are streamed into the output writer, and its data is released before the next file is loaded, keeping peak memory proportional to one input file plus the output buffer.
* Validate bucket/namespace/table names against path traversal
Reject names containing '..', '/', or '\' in Execute to prevent directory traversal via crafted job parameters.
* Add filer address failover in iceberg maintenance handler
Try each filer address from cluster context in order instead of only using the first one. This improves resilience when the primary filer is temporarily unreachable.
* Add separate MinManifestsToRewrite config for manifest rewrite threshold
The rewrite_manifests operation was reusing MinInputFiles (meant for compaction bin file counts) as its manifest count threshold. Add a dedicated MinManifestsToRewrite field with its own config UI section and default value (5) so the two thresholds can be tuned independently.
* Fix risky mtime fallback in orphan removal that could delete new files
When entry.Attributes is nil, mtime defaulted to Unix epoch (1970), which would always be older than the safety threshold, causing the file to be treated as eligible for deletion. Skip entries with nil Attributes instead, matching the safer logic in operations.go.
* Fix undefined function references in iceberg_maintenance_handler.go
Use the exported function names (ShouldSkipDetectionByInterval, BuildDetectorActivity, BuildExecutorActivity) matching their definitions in vacuum_handler.go.
* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage
The IcebergMaintenanceHandler and its compaction code in the parent pluginworker package duplicated the logic already present in the iceberg/ subpackage (which self-registers via init()). The old code lacked stale-plan guards, proper path normalization, CAS-based xattr updates, and error-returning parseOperations. Since the registry pattern (default "all") makes the old handler unreachable, remove it entirely. All functionality is provided by iceberg.Handler with the reviewed improvements.
* Fix MinManifestsToRewrite clamping to match UI minimum of 2
The clamp reset values below 2 to the default of 5, contradicting the UI's advertised MinValue of 2. Clamp to 2 instead.
* Sort entries by size descending in splitOversizedBin for better packing
Entries were processed in insertion order which is non-deterministic from map iteration. Sorting largest-first before the splitting loop improves bin packing efficiency by filling bins more evenly.
* Add context cancellation check to drainReader loop
The row-streaming loop in drainReader did not check ctx between iterations, making long compaction merges uncancellable. Check ctx.Done() at the top of each iteration.
* Fix splitOversizedBin to always respect targetSize limit
The minFiles check in the split condition allowed bins to grow past targetSize when they had fewer than minFiles entries, defeating the OOM protection. Now bins always split at targetSize, and a trailing runt with fewer than minFiles entries is merged into the previous bin.
* Add integration tests for iceberg table maintenance plugin worker
Tests start a real weed mini cluster, create S3 buckets and Iceberg table metadata via filer gRPC, then exercise the iceberg.Handler operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against the live filer. A full maintenance cycle test runs all operations in sequence and verifies metadata consistency. Also adds exported method wrappers (testing_api.go) so the integration test package can call the unexported handler methods.
* Fix splitOversizedBin dropping files and add source path to drainReader errors
The runt-merge step could leave leading bins with fewer than minFiles entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop the first 80-byte file). Replace the filter-based approach with an iterative merge that folds any sub-minFiles bin into its smallest neighbor, preserving all eligible files. Also add the source file path to drainReader error messages so callers can identify which Parquet file caused a read/write failure.
* Harden integration test error handling
- s3put: fail immediately on HTTP 4xx/5xx instead of logging and continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check CreateEntryResponse.Error in addition to the RPC error
* go fmt
---------
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
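The splitOversizedBin behavior described across these fixes (sort largest-first, close a bin at targetSize, fold any sub-minFiles bin into a neighbor so no file is dropped) can be sketched as follows. This is an illustration of the commit's description, not the actual code; in particular it folds into an adjacent neighbor rather than the smallest one.

```go
package iceberg

import "sort"

type dataFile struct {
	Path string
	Size int64
}

// packBins groups files into compaction bins no larger than targetSize,
// then merges undersized bins into a neighbor instead of filtering them out.
func packBins(files []dataFile, targetSize int64, minFiles int) [][]dataFile {
	// Sort largest-first so map-iteration order cannot affect packing.
	sort.Slice(files, func(i, j int) bool { return files[i].Size > files[j].Size })
	var bins [][]dataFile
	var cur []dataFile
	var curSize int64
	for _, f := range files {
		if curSize+f.Size > targetSize && len(cur) > 0 {
			bins = append(bins, cur) // always split at targetSize
			cur, curSize = nil, 0
		}
		cur = append(cur, f)
		curSize += f.Size
	}
	if len(cur) > 0 {
		bins = append(bins, cur)
	}
	// Fold any bin with fewer than minFiles entries into a neighbor,
	// preserving all eligible files.
	for i := 0; i < len(bins); {
		if len(bins[i]) >= minFiles || len(bins) == 1 {
			i++
			continue
		}
		j := i - 1
		if i == 0 {
			j = 1
		}
		bins[j] = append(bins[j], bins[i]...)
		bins = append(bins[:i], bins[i+1:]...)
	}
	return bins
}
```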

commit baae672b6f (2 weeks ago)
feat: auto-disable master vacuum when plugin worker is active (#8624)
* feat: auto-disable master vacuum when plugin vacuum worker is active
When a vacuum-capable plugin worker connects to the admin server, the admin server calls DisableVacuum on the master to prevent the automatic scheduled vacuum from conflicting with the plugin worker's vacuum. When the worker disconnects, EnableVacuum is called to restore the default behavior. A safety net in the topology refresh loop re-enables vacuum if the admin server disconnects without cleanup.
* rename isAdminServerConnected to isAdminServerConnectedFunc
* add 5s timeout to DisableVacuum/EnableVacuum gRPC calls
Prevents the monitor goroutine from blocking indefinitely if the master is unresponsive.
* track plugin ownership of vacuum disable to avoid overriding operator
- Add vacuumDisabledByPlugin flag to Topology, set when DisableVacuum is called while admin server is connected (i.e., by plugin monitor)
- Safety net only re-enables vacuum when it was disabled by plugin, not when an operator intentionally disabled it via shell command
- EnableVacuum clears the plugin flag
* extract syncVacuumState for testability, add fake toggler tests
Extract the single sync step into syncVacuumState() with a vacuumToggler interface. Add TestSyncVacuumState with a fake toggler that verifies disable/enable calls on state transitions.
* use atomic.Bool for isDisableVacuum and vacuumDisabledByPlugin
Both fields are written by gRPC handlers and read by the vacuum goroutine, causing a data race. Use atomic.Bool with Store/Load for thread-safe access.
* use explicit by_plugin field instead of connection heuristic
Add by_plugin bool to DisableVacuumRequest proto so the caller declares intent explicitly. The admin server monitor sets it to true; shell commands leave it false. This prevents an operator's intentional disable from being auto-reversed by the safety net.
* use setter for admin server callback instead of function parameter
Move isAdminServerConnected from StartRefreshWritableVolumes parameter to Topology.SetAdminServerConnectedFunc() setter. Keeps the function signature stable and decouples the topology layer from the admin server concept.
* suppress repeated log messages on persistent sync failures
Add retrying parameter to syncVacuumState so the initial state transition is logged at V(0) but subsequent retries of the same transition are silent until the call succeeds.
* clear plugin ownership flag on manual DisableVacuum
Prevents stale plugin flag from causing incorrect auto-enable when an operator manually disables vacuum after a plugin had previously disabled it.
* add by_plugin to EnableVacuumRequest for symmetric ownership tracking
Plugin-driven EnableVacuum now only re-enables if the plugin was the one that disabled it. If an operator manually disabled vacuum after the plugin, the plugin's EnableVacuum is a no-op. This prevents the plugin monitor from overriding operator intent on worker disconnect.
* use cancellable context for monitorVacuumWorker goroutine
Replace context.Background() with a cancellable context stored as bgCancel on AdminServer. Shutdown() calls bgCancel() so monitorVacuumWorker exits cleanly via ctx.Done().
* track operator and plugin vacuum disables independently
Replace single isDisableVacuum flag with two independent flags: vacuumDisabledByOperator and vacuumDisabledByPlugin. Each caller only flips its own flag. The effective disabled state is the OR of both. This prevents a plugin connect/disconnect cycle from overriding an operator's manual disable, and vice versa.
* fix safety net to clear plugin flag, not operator flag
The safety net should call EnableVacuumByPlugin() to clear only the plugin disable flag when the admin server disconnects. The previous call to EnableVacuum() incorrectly cleared the operator flag instead.
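The final design (two independently owned flags whose OR is the effective state, stored as atomic.Bool) is compact enough to sketch. Names are assumptions based on the commit messages:

```go
package topology

import "sync/atomic"

// vacuumState tracks operator and plugin disables independently, so a plugin
// connect/disconnect cycle cannot override an operator's manual disable.
type vacuumState struct {
	disabledByOperator atomic.Bool
	disabledByPlugin   atomic.Bool
}

func (v *vacuumState) DisableByPlugin()   { v.disabledByPlugin.Store(true) }
func (v *vacuumState) EnableByPlugin()    { v.disabledByPlugin.Store(false) }
func (v *vacuumState) DisableByOperator() { v.disabledByOperator.Store(true) }
func (v *vacuumState) EnableByOperator()  { v.disabledByOperator.Store(false) }

// Disabled reports whether the scheduled vacuum should be skipped:
// the effective state is the OR of both ownership flags.
func (v *vacuumState) Disabled() bool {
	return v.disabledByOperator.Load() || v.disabledByPlugin.Load()
}
```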

commit 1bd7a98a4a (3 weeks ago)
simplify plugin scheduler: remove configurable IdleSleepSeconds, use constant 61s
The SchedulerConfig struct and its persistence/API were unnecessary indirection. Replace with a simple constant (reduced from 613s to 61s) so the scheduler re-checks for detectable job types promptly after going idle, improving the clean-install experience.
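Combined with the wake channels from the lanes commit above, the idle behavior reduces to a small loop. A sketch under assumed names; the structure is inferred from the commit messages, not taken from the real code:

```go
package plugin

import "time"

const idleSleep = 61 * time.Second // fixed idle sleep, per the commit above

// laneSchedulerLoop runs one scheduling pass, then sleeps until either a
// config/worker update wakes it or the constant idle sleep elapses.
func laneSchedulerLoop(wakeCh, stopCh <-chan struct{}, runOnce func()) {
	for {
		runOnce()
		select {
		case <-wakeCh: // woken immediately on config or worker updates
		case <-time.After(idleSleep):
		case <-stopCh:
			return
		}
	}
}
```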

commit b3620c7e14 (4 weeks ago)
admin: auto migrating master maintenance scripts to admin_script plugin config (#8509)
* admin: seed admin_script plugin config from master maintenance scripts
When the admin server starts, fetch the maintenance scripts configuration
from the master via GetMasterConfiguration. If the admin_script plugin
worker does not already have a saved config, use the master's scripts as
the default value. This enables seamless migration from master.toml
[master.maintenance] to the admin script plugin worker.
Changes:
- Add maintenance_scripts and maintenance_sleep_minutes fields to
GetMasterConfigurationResponse in master.proto
- Populate the new fields from viper config in master_grpc_server.go
- On admin server startup, fetch the master config and seed the
admin_script plugin config if no config exists yet
- Strip lock/unlock commands from the master scripts since the admin
script worker handles locking automatically
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: address review comments on admin_script seeding
- Replace TOCTOU race (separate Load+Save) with atomic
SaveJobTypeConfigIfNotExists on ConfigStore and Plugin
- Replace ineffective polling loop with single GetMaster call using
30s context timeout, since GetMaster respects context cancellation
- Add unit tests for SaveJobTypeConfigIfNotExists (in-memory + on-disk)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: apply maintenance script defaults in gRPC handler
The gRPC handler for GetMasterConfiguration read maintenance scripts
from viper without calling SetDefault, relying on startAdminScripts
having run first. If the admin server calls GetMasterConfiguration
before startAdminScripts sets the defaults, viper returns empty
strings and the seeding is silently skipped.
Apply SetDefault in the gRPC handler itself so it is self-contained.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "fix: apply maintenance script defaults in gRPC handler"
This reverts commit
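The TOCTOU fix above hinges on making the existence check and the write a single atomic operation. A minimal in-memory sketch of SaveJobTypeConfigIfNotExists; the real store also persists to disk, and the field names here are assumptions:

```go
package plugin

import "sync"

type ConfigStore struct {
	mu      sync.Mutex
	configs map[string][]byte
}

// SaveJobTypeConfigIfNotExists writes cfg only when no config is saved yet,
// performing the check and the write under one lock to avoid the
// separate Load+Save race. It reports whether the write happened.
func (s *ConfigStore) SaveJobTypeConfigIfNotExists(jobType string, cfg []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.configs == nil {
		s.configs = make(map[string][]byte)
	}
	if _, exists := s.configs[jobType]; exists {
		return false
	}
	s.configs[jobType] = cfg
	return true
}
```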

commit 18ccc9b773 (4 weeks ago)
Plugin scheduler: sequential iterations with max runtime (#8496)
* pb: add job type max runtime setting
* plugin: default job type max runtime
* plugin: redesign scheduler loop
* admin ui: update scheduler settings
* plugin: fix scheduler loop state name
* plugin scheduler: restore backlog skip
* plugin scheduler: drop legacy detection helper
* admin api: require scheduler config body
* admin ui: preserve detection interval on save
* plugin scheduler: use job context and drain cancels
* plugin scheduler: respect detection intervals
* plugin scheduler: gate runs and drain queue
* ec test: reuse req/resp vars
* ec test: add scheduler debug logs
* Adjust scheduler idle sleep and initial run delay
* Clear pending job queue before scheduler runs
* Log next detection time in EC integration test
* Improve plugin scheduler debug logging in EC test
* Expose scheduler next detection time
* Log scheduler next detection time in EC test
* Wake scheduler on config or worker updates
* Expose scheduler sleep interval in UI
* Fix scheduler sleep save value selection
* Set scheduler idle sleep default to 613s
* Show scheduler next run time in plugin UI
---------
Co-authored-by: Copilot <copilot@github.com>
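The per-job-type max runtime setting introduced here is naturally expressed as a context deadline around each sequential iteration. A hedged sketch; the function and its signature are illustrative, not the actual implementation:

```go
package plugin

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithMaxRuntime executes one scheduler iteration under a deadline so a
// stuck job cannot stall the sequential loop indefinitely.
func runWithMaxRuntime(parent context.Context, maxRuntime time.Duration,
	run func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, maxRuntime)
	defer cancel()
	err := run(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return fmt.Errorf("job exceeded configured max runtime of %v", maxRuntime)
	}
	return err
}
```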

commit e1e5b4a8a6 (4 weeks ago)
add admin script worker (#8491)
* admin: add plugin lock coordination
* shell: allow bypassing lock checks
* plugin worker: add admin script handler
* mini: include admin_script in plugin defaults
* admin script UI: drop name and enlarge text
* admin script: add default script
* admin_script: make run interval configurable
* plugin: gate other jobs during admin_script runs
* plugin: use last completed admin_script run
* admin: backfill plugin config defaults
* templ
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
* comparable to default version
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
* default to run
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
* format
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
* shell: respect pre-set noLock for fix.replication
* shell: add force no-lock mode for admin scripts
* volume balance worker already exists
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
* admin: expose scheduler status JSON
* shell: add sleep command
* shell: restrict sleep syntax
* Revert "shell: respect pre-set noLock for fix.replication"
This reverts commit

commit a61a2affe3 (4 weeks ago)
Expire stuck plugin jobs (#8492)
* Add stale job expiry and expire API
* Add expire job button
* Add test hook and coverage for ExpirePluginJobAPI
* Document scheduler filtering side effect and reuse helper
* Restore job spec proposal test
* Regenerate plugin template output
---------
Co-authored-by: Copilot <copilot@github.com>
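The stale-job expiry itself is a straightforward age check. A sketch with assumed names and fields (the *time.Time pointer form mirrors the pointer refactor in the plugin-system commit below):

```go
package plugin

import "time"

type jobState string

const (
	stateRunning jobState = "running"
	stateExpired jobState = "expired"
)

type trackedJob struct {
	State     jobState
	UpdatedAt *time.Time
}

// expireStaleJobs moves any job still marked running, whose last update is
// older than maxAge, into a terminal expired state, and returns the count.
func expireStaleJobs(jobs []*trackedJob, now time.Time, maxAge time.Duration) int {
	expired := 0
	for _, j := range jobs {
		if j.State == stateRunning && j.UpdatedAt != nil && now.Sub(*j.UpdatedAt) > maxAge {
			j.State = stateExpired
			expired++
		}
	}
	return expired
}
```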

commit c73e65ad5e (4 weeks ago)
Add customizable plugin display names and weights (#8459)
* feat: add customizable plugin display names and weights
- Add weight field to JobTypeCapability proto message
- Modify ListKnownJobTypes() to return JobTypeInfo with display names and weights
- Modify ListPluginJobTypes() to return JobTypeInfo instead of string
- Sort plugins by weight (descending) then alphabetically
- Update admin API to return enriched job type metadata
- Update plugin UI template to display names instead of IDs
- Consolidate API by reusing existing function names instead of suffixed variants
* perf: optimize plugin job type capability lookup and add null-safe parsing
- Pre-calculate job type capabilities in a map to reduce O(n*m) nested loops to O(n+m) lookup time in ListKnownJobTypes()
- Add parseJobTypeItem() helper function for null-safe job type item parsing
- Refactor plugin.templ to use parseJobTypeItem() in all job type access points (hasJobType, applyInitialNavigation, ensureActiveNavigation, renderTopTabs)
- Deterministic capability resolution by using first worker's capability
* templ
* refactor: use parseJobTypeItem helper consistently in plugin.templ
Replace duplicated job type extraction logic at line 1296-1298 with parseJobTypeItem() helper function for consistency and maintainability.
* improve: prefer richer capability metadata and add null-safety checks
- Improve capability selection in ListKnownJobTypes() to prefer capabilities with non-empty DisplayName and higher Weight across all workers instead of first-wins approach. Handles mixed-version clusters better.
- Add defensive null checks in renderJobTypeSummary() to safely access parseJobTypeItem() result before property access
- Ensures malformed or missing entries won't break the rendering pipeline
* fix: preserve existing DisplayName when merging capabilities
Fix capability merge logic to respect existing DisplayName values:
- If existing has DisplayName but candidate doesn't, preserve existing
- If existing doesn't have DisplayName but candidate does, use candidate
- Only use Weight comparison if DisplayName status is equal
- Prevents higher-weight capabilities with empty DisplayName from overriding capabilities with non-empty DisplayName
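The merge rule spelled out in the last fix is easy to capture directly: a non-empty DisplayName always wins, and Weight only breaks ties when both or neither have a DisplayName. A sketch with a stand-in for the proto message:

```go
package plugin

// capability is a stand-in for the JobTypeCapability proto message.
type capability struct {
	DisplayName string
	Weight      int32
}

// preferCapability implements the merge preference described above.
func preferCapability(existing, candidate capability) capability {
	switch {
	case existing.DisplayName != "" && candidate.DisplayName == "":
		return existing // never let an empty name override a set one
	case existing.DisplayName == "" && candidate.DisplayName != "":
		return candidate
	case candidate.Weight > existing.Weight:
		return candidate // DisplayName status is equal: compare Weight
	default:
		return existing
	}
}
```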

commit 8ec9ff4a12 (1 month ago)
Refactor plugin system and migrate worker runtime (#8369)
* admin: add plugin runtime UI page and route wiring
* pb: add plugin gRPC contract and generated bindings
* admin/plugin: implement worker registry, runtime, monitoring, and config store
* admin/dash: wire plugin runtime and expose plugin workflow APIs
* command: add flags to enable plugin runtime
* admin: rename remaining plugin v2 wording to plugin
* admin/plugin: add detectable job type registry helper
* admin/plugin: add scheduled detection and dispatch orchestration
* admin/plugin: prefetch job type descriptors when workers connect
* admin/plugin: add known job type discovery API and UI
* admin/plugin: refresh design doc to match current implementation
* admin/plugin: enforce per-worker scheduler concurrency limits
* admin/plugin: use descriptor runtime defaults for scheduler policy
* admin/ui: auto-load first known plugin job type on page open
* admin/plugin: bootstrap persisted config from descriptor defaults
* admin/plugin: dedupe scheduled proposals by dedupe key
* admin/ui: add job type and state filters for plugin monitoring
* admin/ui: add per-job-type plugin activity summary
* admin/plugin: split descriptor read API from schema refresh
* admin/ui: keep plugin summary metrics global while tables are filtered
* admin/plugin: retry executor reservation before timing out
* admin/plugin: expose scheduler states for monitoring
* admin/ui: show per-job-type scheduler states in plugin monitor
* pb/plugin: rename protobuf package to plugin
* admin/plugin: rename pluginRuntime wiring to plugin
* admin/plugin: remove runtime naming from plugin APIs and UI
* admin/plugin: rename runtime files to plugin naming
* admin/plugin: persist jobs and activities for monitor recovery
* admin/plugin: lease one detector worker per job type
* admin/ui: show worker load from plugin heartbeats
* admin/plugin: skip stale workers for detector and executor picks
* plugin/worker: add plugin worker command and stream runtime scaffold
* plugin/worker: implement vacuum detect and execute handlers
* admin/plugin: document external vacuum plugin worker starter
* command: update plugin.worker help to reflect implemented flow
* command/admin: drop legacy Plugin V2 label
* plugin/worker: validate vacuum job type and respect min interval
* plugin/worker: test no-op detect when min interval not elapsed
* command/admin: document plugin.worker external process
* plugin/worker: advertise configured concurrency in hello
* command/plugin.worker: add jobType handler selection
* command/plugin.worker: test handler selection by job type
* command/plugin.worker: persist worker id in workingDir
* admin/plugin: document plugin.worker jobType and workingDir flags
* plugin/worker: support cancel request for in-flight work
* plugin/worker: test cancel request acknowledgements
* command/plugin.worker: document workingDir and jobType behavior
* plugin/worker: emit executor activity events for monitor
* plugin/worker: test executor activity builder
* admin/plugin: send last successful run in detection request
* admin/plugin: send cancel request when detect or execute context ends
* admin/plugin: document worker cancel request responsibility
* admin/handlers: expose plugin scheduler states API in no-auth mode
* admin/handlers: test plugin scheduler states route registration
* admin/plugin: keep worker id on worker-generated activity records
* admin/plugin: test worker id propagation in monitor activities
* admin/dash: always initialize plugin service
* command/admin: remove plugin enable flags and default to enabled
* admin/dash: drop pluginEnabled constructor parameter
* admin/plugin UI: stop checking plugin enabled state
* admin/plugin: remove docs for plugin enable flags
* admin/dash: remove unused plugin enabled check method
* admin/dash: fallback to in-memory plugin init when dataDir fails
* admin/plugin API: expose worker gRPC port in status
* command/plugin.worker: resolve admin gRPC port via plugin status
* split plugin UI into overview/configuration/monitoring pages
* Update layout_templ.go
* add volume_balance plugin worker handler
* wire plugin.worker CLI for volume_balance job type
* add erasure_coding plugin worker handler
* wire plugin.worker CLI for erasure_coding job type
* support multi-job handlers in plugin worker runtime
* allow plugin.worker jobType as comma-separated list
* admin/plugin UI: rename to Workers and simplify config view
* plugin worker: queue detection requests instead of capacity reject
* Update plugin_worker.go
* plugin volume_balance: remove force_move/timeout from worker config UI
* plugin erasure_coding: enforce local working dir and cleanup
* admin/plugin UI: rename admin settings to job scheduling
* admin/plugin UI: persist and robustly render detection results
* admin/plugin: record and return detection trace metadata
* admin/plugin UI: show detection process and decision trace
* plugin: surface detector decision trace as activities
* mini: start a plugin worker by default
* admin/plugin UI: split monitoring into detection and execution tabs
* plugin worker: emit detection decision trace for EC and balance
* admin workers UI: split monitoring into detection and execution pages
* plugin scheduler: skip proposals for active assigned/running jobs
* admin workers UI: add job queue tab
* plugin worker: add dummy stress detector and executor job type
* admin workers UI: reorder tabs to detection queue execution
* admin workers UI: regenerate plugin template
* plugin defaults: include dummy stress and add stress tests
* plugin dummy stress: rotate detection selections across runs
* plugin scheduler: remove cross-run proposal dedupe
* plugin queue: track pending scheduled jobs
* plugin scheduler: wait for executor capacity before dispatch
* plugin scheduler: skip detection when waiting backlog is high
* plugin: add disk-backed job detail API and persistence
* admin ui: show plugin job detail modal from job id links
* plugin: generate unique job ids instead of reusing proposal ids
* plugin worker: emit heartbeats on work state changes
* plugin registry: round-robin tied executor and detector picks
* add temporary EC overnight stress runner
* plugin job details: persist and render EC execution plans
* ec volume details: color data and parity shard badges
* shard labels: keep parity ids numeric and color-only distinction
* admin: remove legacy maintenance UI routes and templates
* admin: remove dead maintenance endpoint helpers
* Update layout_templ.go
* remove dummy_stress worker and command support
* refactor plugin UI to job-type top tabs and sub-tabs
* migrate weed worker command to plugin runtime
* remove plugin.worker command and keep worker runtime with metrics
* update helm worker args for jobType and execution flags
* set plugin scheduling defaults to global 16 and per-worker 4
* stress: fix RPC context reuse and remove redundant variables in ec_stress_runner
* admin/plugin: fix lifecycle races, safe channel operations, and terminal state constants
* admin/dash: randomize job IDs and fix priority zero-value overwrite in plugin API
* admin/handlers: implement buffered rendering to prevent response corruption
* admin/plugin: implement debounced persistence flusher and optimize BuildJobDetail memory lookups
* admin/plugin: fix priority overwrite and implement bounded wait in scheduler reserve
* admin/plugin: implement atomic file writes and fix run record side effects
* admin/plugin: use P prefix for parity shard labels in execution plans
* admin/plugin: enable parallel execution for cancellation tests
* admin: refactor time.Time fields to pointers for better JSON omitempty support
* admin/plugin: implement pointer-safe time assignments and comparisons in plugin core
* admin/plugin: fix time assignment and sorting logic in plugin monitor after pointer refactor
* admin/plugin: update scheduler activity tracking to use time pointers
* admin/plugin: fix time-based run history trimming after pointer refactor
* admin/dash: fix JobSpec struct literal in plugin API after pointer refactor
* admin/view: add D/P prefixes to EC shard badges for UI consistency
* admin/plugin: use lifecycle-aware context for schema prefetching
* Update ec_volume_details_templ.go
* admin/stress: fix proposal sorting and log volume cleanup errors
* stress: refine ec stress runner with math/rand and collection name
- Added Collection field to VolumeEcShardsDeleteRequest for correct filename construction.
- Replaced crypto/rand with seeded math/rand PRNG for bulk payloads.
- Added documentation for EcMinAge zero-value behavior.
- Added logging for ignored errors in volume/shard deletion.
* admin: return internal server error for plugin store failures
Changed error status code from 400 Bad Request to 500 Internal Server Error for failures in GetPluginJobDetail to correctly reflect server-side errors.
* admin: implement safe channel sends and graceful shutdown sync
- Added sync.WaitGroup to Plugin struct to manage background goroutines.
- Implemented safeSendCh helper using recover() to prevent panics on closed channels.
- Ensured Shutdown() waits for all background operations to complete.
* admin: robustify plugin monitor with nil-safe time and record init
- Standardized nil-safe assignment for *time.Time pointers (CreatedAt, UpdatedAt, CompletedAt).
- Ensured persistJobDetailSnapshot initializes new records correctly if they don't exist on disk.
- Fixed debounced persistence to trigger immediate write on job completion.
* admin: improve scheduler shutdown behavior and logic guards
- Replaced brittle error string matching with explicit r.shutdownCh selection for shutdown detection.
- Removed redundant nil guard in buildScheduledJobSpec.
- Standardized WaitGroup usage for schedulerLoop.
* admin: implement deep copy for job parameters and atomic write fixes
- Implemented deepCopyGenericValue and used it in cloneTrackedJob to prevent shared state.
- Ensured atomicWriteFile creates parent directories before writing.
* admin: remove unreachable branch in shard classification
Removed an unreachable 'totalShards <= 0' check in classifyShardID as dataShards and parityShards are already guarded.
* admin: secure UI links and use canonical shard constants
- Added rel="noopener noreferrer" to external links for security.
- Replaced magic number 14 with erasure_coding.TotalShardsCount.
- Used renderEcShardBadge for missing shard list consistency.
* admin: stabilize plugin tests and fix regressions
- Composed a robust plugin_monitor_test.go to handle asynchronous persistence.
- Updated all time.Time literals to use timeToPtr helper.
- Added explicit Shutdown() calls in tests to synchronize with debounced writes.
- Fixed syntax errors and orphaned struct literals in tests.
* Potential fix for code scanning alert no. 278: Slice memory allocation with excessive size value
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 283: Uncontrolled data used in path expression
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* admin: finalize refinements for error handling, scheduler, and race fixes
- Standardized HTTP 500 status codes for store failures in plugin_api.go.
- Tracked scheduled detection goroutines with sync.WaitGroup for safe shutdown.
- Fixed race condition in safeSendDetectionComplete by extracting channel under lock.
- Implemented deep copy for JobActivity details.
- Used defaultDirPerm constant in atomicWriteFile.
* test(ec): migrate admin dockertest to plugin APIs
* admin/plugin_api: fix RunPluginJobTypeAPI to return 500 for server-side detection/filter errors
* admin/plugin_api: fix ExecutePluginJobAPI to return 500 for job execution failures
* admin/plugin_api: limit parseProtoJSONBody request body to 1MB to prevent unbounded memory usage
* admin/plugin: consolidate regex to package-level validJobTypePattern; add char validation to sanitizeJobID
* admin/plugin: fix racy Shutdown channel close with sync.Once
* admin/plugin: track sendLoop and recv goroutines in WorkerStream with r.wg
* admin/plugin: document writeProtoFiles atomicity: .pb is source of truth, .json is human-readable only
* admin/plugin: extract activityLess helper to deduplicate nil-safe OccurredAt sort comparators
* test/ec: check http.NewRequest errors to prevent nil req panics
* test/ec: replace deprecated ioutil/math/rand, fix stale step comment 5.1→3.1
* plugin(ec): raise default detection and scheduling throughput limits
* topology: include empty disks in volume list and EC capacity fallback
* topology: remove hard 10-task cap for detection planning
* Update ec_volume_details_templ.go
* adjust default
* fix tests
---------
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
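The atomic file writes mentioned in several of these items follow the standard write-temp-then-rename pattern, with parent-directory creation and a named defaultDirPerm constant as the review comments require. A sketch of the general technique under assumed names, not the actual SeaweedFS code:

```go
package plugin

import (
	"os"
	"path/filepath"
)

const defaultDirPerm = 0o755 // named constant instead of a magic number

// atomicWriteFile writes data to a temp file in the target directory and
// renames it over the destination, so readers never observe a partial file.
func atomicWriteFile(path string, data []byte, perm os.FileMode) error {
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, defaultDirPerm); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(dir, ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}
```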