seaweedfs

Commit Graph

Author	SHA1	Message	Date
Chris Lu	260d9f44a4	less logs	2 days ago
Chris Lu	3481ad22cb	go vet	2 days ago
Chris Lu	4518dc1056	less logs	2 days ago
Chris Lu	349aaa12ef	Switch empty-folder cleanup to bucket policy	3 days ago
Chris Lu	eded9a29b5	more logs	3 days ago
Chris Lu	428ef2b75f	logging	3 days ago
Chris Lu	2737e20916	Fix Spark temp cleanup with implicit directory markers	3 days ago
Chris Lu	74c80aa441	Fix Spark temp marker cleanup in async folder cleaner	3 days ago
Chris Lu	d4f2355b9a	Add actionable empty-folder cleanup diagnostics	3 days ago
Chris Lu	d40154f36e	Revert synchronous folder pruning and add cleanup diagnostics	3 days ago
Chris Lu	bd77d9f9b3	Generalize empty folder cleanup for Spark temp artifacts	3 days ago
Chris Lu	692b3a6e07	Fix Spark _temporary cleanup and add issue #8285 regression test	3 days ago
Chris Lu	403592bb9f	Add Spark Iceberg catalog integration tests and CI support (#8242 ) * Add Spark Iceberg catalog integration tests and CI support Implement comprehensive integration tests for Spark with SeaweedFS Iceberg REST catalog: - Basic CRUD operations (Create, Read, Update, Delete) on Iceberg tables - Namespace (database) management - Data insertion, querying, and deletion - Time travel capabilities via snapshot versioning - Compatible with SeaweedFS S3 and Iceberg REST endpoints Tests mirror the structure of existing Trino integration tests but use Spark's Python SQL API and PySpark for testing. Add GitHub Actions CI job for spark-iceberg-catalog-tests in s3-tables-tests.yml to automatically run Spark integration tests on pull requests. * fmt * Fix Spark integration tests - code review feedback * go mod tidy * Add go mod tidy step to integration test jobs Add 'go mod tidy' step before test runs for all integration test jobs: - s3-tables-tests - iceberg-catalog-tests - trino-iceberg-catalog-tests - spark-iceberg-catalog-tests This ensures dependencies are clean before running tests. * Fix remaining Spark operations test issues Address final code review comments: Setup & Initialization: - Add waitForSparkReady() helper function that polls Spark readiness with backoff instead of hardcoded 10-second sleep - Extract setupSparkTestEnv() helper to reduce boilerplate duplication between TestSparkCatalogBasicOperations and TestSparkTimeTravel - Both tests now use helpers for consistent, reliable setup Assertions & Validation: - Make setup-critical operations (namespace, table creation, initial insert) use t.Fatalf instead of t.Errorf to fail fast - Validate setupSQL output in TestSparkTimeTravel and fail if not 'Setup complete' - Add validation after second INSERT in TestSparkTimeTravel: verify row count increased to 2 before time travel test - Add context to error messages with namespace and tableName params Code Quality: - Remove code duplication between test functions - All critical paths now properly validated - Consistent error handling throughout * Fix go vet errors in S3 Tables tests Fixes: 1. setup_test.go (Spark): - Add missing import: github.com/testcontainers/testcontainers-go/wait - Use wait.ForLog instead of undefined testcontainers.NewLogStrategy - Remove unused strings import 2. trino_catalog_test.go: - Use net.JoinHostPort instead of fmt.Sprintf for address formatting - Properly handles IPv6 addresses by wrapping them in brackets * Use weed mini for simpler SeaweedFS startup Replace complex multi-process startup (master, volume, filer, s3) with single 'weed mini' command that starts all services together. Benefits: - Simpler, more reliable startup - Single weed mini process vs 4 separate processes - Automatic coordination between components - Better port management with no manual coordination Changes: - Remove separate master, volume, filer process startup - Use weed mini with -master.port, -filer.port, -s3.port flags - Keep Iceberg REST as separate service (still needed) - Increase timeout to 15s for port readiness (weed mini startup) - Remove volumePort and filerProcess fields from TestEnvironment - Simplify cleanup to only handle two processes (mini, iceberg rest) * Clean up dead code and temp directory leaks Fixes: 1. Remove dead s3Process field and cleanup: - weed mini bundles S3 gateway, no separate process needed - Removed s3Process field from TestEnvironment - Removed unnecessary s3Process cleanup code 2. Fix temp config directory leak: - Add sparkConfigDir field to TestEnvironment - Store returned configDir in writeSparkConfig - Clean up sparkConfigDir in Cleanup() with os.RemoveAll - Prevents accumulation of temp directories in test runs 3. Simplify Cleanup: - Now handles only necessary processes (weed mini, iceberg rest) - Removes both seaweedfsDataDir and sparkConfigDir - Cleaner shutdown sequence * Use weed mini's built-in Iceberg REST and fix python binary Changes: - Add -s3.port.iceberg flag to weed mini for built-in Iceberg REST Catalog - Remove separate 'weed server' process for Iceberg REST - Remove icebergRestProcess field from TestEnvironment - Simplify Cleanup() to only manage weed mini + Spark - Add port readiness check for iceberg REST from weed mini - Set Spark container Cmd to '/bin/sh -c sleep 3600' to keep it running - Change python to python3 in container.Exec calls This simplifies to truly one all-in-one weed mini process (master, filer, s3, iceberg-rest) plus just the Spark container. * go fmt * clean up * bind on a non-loopback IP for container access, aligned Iceberg metadata saves/locations with table locations, and reworked Spark time travel to use TIMESTAMP AS OF with safe timestamp extraction. * shared mini start * Fixed internal directory creation under /buckets so .objects paths can auto-create without failing bucket-name validation, which restores table bucket object writes * fix path Updated table bucket objects to write under `/buckets/<bucket>` and saved Iceberg metadata there, adjusting Spark time-travel timestamp to committed_at +1s. Rebuilt the weed binary (`go install ./weed`) and confirmed passing tests for Spark and Trino with focused test commands. * Updated table bucket creation to stop creating /buckets/.objects and switched Trino REST warehouse to s3://<bucket> to match Iceberg layout. * Stabilize S3Tables integration tests * Fix timestamp extraction and remove dead code in bucketDir * Use table bucket as warehouse in s3tables tests * Update trino_blog_operations_test.go * adds the CASCADE option to handle any remaining table metadata/files in the schema directory * skip namespace not empty	5 days ago
Chris Lu	c284e51d20	fix: multipart upload ETag calculation (#8238 ) * fix multipart etag * address comments * clean up * clean up * optimization * address comments * unquoted etag * dedup * upgrade * clean * etag * return quoted tag * quoted etag * debug * s3api: unify ETag retrieval and quoting across handlers Refactor newListEntry to take S3ApiServer and use getObjectETag, and update setResponseHeaders to use the same logic. This ensures consistent ETags are returned for both listing and direct access. s3api: implement ListObjects deduplication for versioned buckets Handle duplicate entries between the main path and the .versions directory by prioritizing the latest version when bucket versioning is enabled. * s3api: cleanup stale main file entries during versioned uploads Add explicit deletion of pre-existing "main" files when creating new versions in versioned buckets. This prevents stale entries from appearing in bucket listings and ensures consistency. * s3api: fix cleanup code placement in versioned uploads Correct the placement of rm calls in completeMultipartUpload and putVersionedObject to ensure stale main files are properly deleted during versioned uploads. * s3api: improve getObjectETag fallback for empty ExtETagKey Ensure that when ExtETagKey exists but contains an empty value, the function falls through to MD5/chunk-based calculation instead of returning an empty string. * s3api: fix test files for new newListEntry signature Update test files to use the new newListEntry signature where the first parameter is S3ApiServer. Created mockS3ApiServer to properly test owner display name lookup functionality. s3api: use filer.ETag for consistent Md5 handling in getEtagFromEntry Change getEtagFromEntry fallback to use filer.ETag(entry) instead of filer.ETagChunks to ensure legacy entries with Attributes.Md5 are handled consistently with the rest of the codebase. * s3api: optimize list logic and fix conditional header logging - Hoist bucket versioning check out of per-entry callback to avoid repeated getVersioningState calls - Extract appendOrDedup helper function to eliminate duplicate dedup/append logic across multiple code paths - Change If-Match mismatch logging from glog.Errorf to glog.V(3).Infof and remove DEBUG prefix for consistency * s3api: fix test mock to properly initialize IAM accounts Fixed nil pointer dereference in TestNewListEntryOwnerDisplayName by directly initializing the IdentityAccessManagement.accounts map in the test setup. This ensures newListEntry can properly look up account display names without panicking. * cleanup * s3api: remove premature main file cleanup in versioned uploads Removed incorrect cleanup logic that was deleting main files during versioned uploads. This was causing test failures because it deleted objects that should have been preserved as null versions when versioning was first enabled. The deduplication logic in listing is sufficient to handle duplicate entries without deleting files during upload. * s3api: add empty-value guard to getEtagFromEntry Added the same empty-value guard used in getObjectETag to prevent returning quoted empty strings. When ExtETagKey exists but is empty, the function now falls through to filer.ETag calculation instead of returning "". * s3api: fix listing of directory key objects with matching prefix Revert prefix handling logic to use strings.TrimPrefix instead of checking HasPrefix with empty string result. This ensures that when a directory key object exactly matches the prefix (e.g. prefix="dir/", object="dir/"), it is correctly handled as a regular entry instead of being skipped or incorrectly processed as a common prefix. Also fixed missing variable definition. * s3api: refactor list inline dedup to use appendOrDedup helper Refactored the inline deduplication logic in listFilerEntries to use the shared appendOrDedup helper function. This ensures consistent behavior and reduces code duplication. * test: fix port allocation race in s3tables integration test Updated startMiniCluster to find all required ports simultaneously using findAvailablePorts instead of sequentially. This prevents race conditions where the OS reallocates a port that was just released, causing multiple services (e.g. Filer and Volume) to be assigned the same port and fail to start.	6 days ago
Chris Lu	066410dbd0	Fix S3 Gateway Read Failover #8076 (#8087 ) * fix s3 read failover #8076 - Implement cache invalidation in vidMapClient - Add retry logic in shared PrepareStreamContentWithThrottler - Update S3 Gateway to use FilerClient directly for invalidation support - Remove obsolete simpleMasterClient struct * improve observability for chunk re-lookup failures Added a warning log when volume location re-lookup fails after cache invalidation in PrepareStreamContentWithThrottler. * address code review feedback - Prevent infinite retry loops by comparing old/new URLs before retry - Update fileId2Url map after successful re-lookup for subsequent references - Add comprehensive test coverage for failover logic - Add tests for InvalidateCache method * Fix: prevent data duplication in stream retry and improve VidMap robustness * Cleanup: remove redundant check in InvalidateCache	3 weeks ago
Chris Lu	8880f9932f	filer: auto clean empty implicit s3 folders (#8051 ) * filer: auto clean empty s3 implicit folders Explicitly tag implicitly created S3 folders (parent directories from object uploads) with 'Seaweed-X-Amz-Implicit-Dir'. Update EmptyFolderCleaner to check for this attribute and cache the result efficiently. * filer: correctly handle nil attributes in empty folder cleaner cache * filer: refine implicit tagging logic Prevent tagging buckets as implicit directories. Reduce code duplication. * filer: safeguard GetEntryAttributes against nil entry and not found error * filer: move ErrNotFound handling to EmptyFolderCleaner * filer: add comment to explain level > 3 check for implicit directories	4 weeks ago
Chris Lu	796a911cb3	Prevent bucket renaming in filer, fuse mount, and S3 (#8048 ) * prevent bucket renaming in filer, fuse mount, s3 * refactor CanRename to support context propagation * harden bucket rename validation to fail closed on find error	4 weeks ago
Walnuts	691aea84c3	feat: add TLS configuration options for Cassandra2 store (#7998 ) * feat: add TLS configuration options for Cassandra2 store Signed-off-by: walnuts1018 <r.juglans.1018@gmail.com> * fix: use 9142 port in tls connection Signed-off-by: walnuts1018 <r.juglans.1018@gmail.com> * Align the setting field names with gocql's SSLOpts. Signed-off-by: walnuts1018 <r.juglans.1018@gmail.com> * Removed: store.cluster.Port = 9142 * chore: update gocql dependency to v2 * refactor: improve Cassandra TLS configuration and port logic * docs: update filer.toml scaffold with ssl_enable_host_verification --------- Signed-off-by: walnuts1018 <r.juglans.1018@gmail.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	4 weeks ago
Chris Lu	379c032868	Fix chown Input/output error on large file sets (#7996 ) * Fix chown Input/output error on large file sets (Fixes #7911) Implemented retry logic for MySQL/MariaDB backend to handle transient errors like deadlocks and timeouts. * Fix syntax error: missing closing brace * Refactor: Use %w for error wrapping and errors.As for extraction * Fix: Disable retry logic inside transactions	1 month ago
promalert	9012069bd7	chore: execute goimports to format the code (#7983 ) * chore: execute goimports to format the code Signed-off-by: promalert <promalert@outlook.com> * goimports -w . --------- Signed-off-by: promalert <promalert@outlook.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	1 month ago
Alasdair Macmillan	9778b9589e	Fix unaligned 64-bit atomic operation on ARM32 (#7958 ) (#7959 )	1 month ago
cduk	568f1fe5b1	fix: include DiskType in metadata log volume assignment (#7918 ) When writing metadata logs to /topics/.system/log, the filer was not respecting the disk type configuration from path-specific rules (fs.configure). This caused volume assignment failures when volume servers used a specific disk type (e.g., "ssd") because the assign request defaulted to empty disk type. The fix adds DiskType to the VolumeAssignRequest in the filer's metadata log write path, ensuring that path-specific disk type configurations are properly honored for internal system writes. Fixes errors like: "metadata log write failed /topics/.system/log/...: AssignVolume: failed to find writable volumes for collection" Signed-off-by: Charles Darke <s.cduk@toodevious.com> Co-authored-by: Charles Darke <s.cduk@toodevious.com>	1 month ago
Chris Lu	288ba5fec8	mount: let filer handle chunk deletion decision (#7900 ) * mount: let filer handle chunk deletion decision Remove chunk deletion decision from FUSE mount's Unlink operation. Previously, the mount decided whether to delete chunks based on its locally cached entry's HardLinkCounter, which could be stale. Now always pass isDeleteData=true and let the filer make the authoritative decision based on its own data. This prevents potential inconsistencies when: - The FUSE mount's cached entry is stale - Race conditions occur between multiple mounts - Direct filer operations change hard link counts * filer: check hard link counter before deleting chunks When deleting an entry, only delete the underlying chunks if: 1. It is not a hard link 2. OR it is the last hard link (counter <= 1) This protects against data loss when a client (like FUSE mount) requests chunk deletion for a file that has multiple hard links.	2 months ago
steve.wei	e439e33888	fix(filer): check error from FindEntry (#7878 ) * fix(filer): check error from FindEntry * remove --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2 months ago
Chris Lu	1261e93ef2	fix: comprehensive go vet error fixes and add CI enforcement (#7861 ) * fix: use keyed fields in struct literals - Replace unsafe reflect.StringHeader/SliceHeader with safe unsafe.String/Slice (weed/query/sqltypes/unsafe.go) - Add field names to Type_ScalarType struct literals (weed/mq/schema/schema_builder.go) - Add Duration field name to FlexibleDuration struct literals across test files - Add field names to bson.D struct literals (weed/filer/mongodb/mongodb_store_kv.go) Fixes go vet warnings about unkeyed struct literals. * fix: remove unreachable code - Remove unreachable return statements after infinite for loops - Remove unreachable code after if/else blocks where all paths return - Simplify recursive logic by removing unnecessary for loop (inode_to_path.go) - Fix Type_ScalarType literal to use enum value directly (schema_builder.go) - Call onCompletionFn on stream error (subscribe_session.go) Files fixed: - weed/query/sqltypes/unsafe.go - weed/mq/schema/schema_builder.go - weed/mq/client/sub_client/connect_to_sub_coordinator.go - weed/filer/redis3/ItemList.go - weed/mq/client/agent_client/subscribe_session.go - weed/mq/broker/broker_grpc_pub_balancer.go - weed/mount/inode_to_path.go - weed/util/skiplist/name_list.go * fix: avoid copying lock values in protobuf messages - Use proto.Merge() instead of direct assignment to avoid copying sync.Mutex in S3ApiConfiguration (iamapi_server.go) - Add explicit comments noting that channel-received values are already copies before taking addresses (volume_grpc_client_to_master.go) The protobuf messages contain sync.Mutex fields from the message state, which should not be copied. Using proto.Merge() properly merges messages without copying the embedded mutex. * fix: correct byte array size for uint32 bit shift operations The generateAccountId() function only needs 4 bytes to create a uint32 value. Changed from allocating 8 bytes to 4 bytes to match the actual usage. This fixes go vet warning about shifting 8-bit values (bytes) by more than 8 bits. * fix: ensure context cancellation on all error paths In broker_client_subscribe.go, ensure subscriberCancel() is called on all error return paths: - When stream creation fails - When partition assignment fails - When sending initialization message fails This prevents context leaks when an error occurs during subscriber creation. * fix: ensure subscriberCancel called for CreateFreshSubscriber stream.Send error Ensure subscriberCancel() is called when stream.Send fails in CreateFreshSubscriber. * ci: add go vet step to prevent future lint regressions - Add go vet step to GitHub Actions workflow - Filter known protobuf lock warnings (MessageState sync.Mutex) These are expected in generated protobuf code and are safe - Prevents accumulation of go vet errors in future PRs - Step runs before build to catch issues early * fix: resolve remaining syntax and logic errors in vet fixes - Fixed syntax errors in filer_sync.go caused by missing closing braces - Added missing closing brace for if block and function - Synchronized fixes to match previous commits on branch * fix: add missing return statements to daemon functions - Add 'return false' after infinite loops in filer_backup.go and filer_meta_backup.go - Satisfies declared bool return type signatures - Maintains consistency with other daemon functions (runMaster, runFilerSynchronize, runWorker) - While unreachable, explicitly declares the return satisfies function signature contract * fix: add nil check for onCompletionFn in SubscribeMessageRecord - Check if onCompletionFn is not nil before calling it - Prevents potential panic if nil function is passed - Matches pattern used in other callback functions * docs: clarify unreachable return statements in daemon functions - Add comments documenting that return statements satisfy function signature - Explains that these returns follow infinite loops and are unreachable - Improves code clarity for future maintainers	2 months ago
chrislu	4a764dbb37	fmt	2 months ago
G-OD	504b258258	s3: fix remote object not caching (#7790 ) * s3: fix remote object not caching * s3: address review comments for remote object caching - Fix leading slash in object name by using strings.TrimPrefix - Return cached entry from CacheRemoteObjectToLocalCluster to get updated local chunk locations - Reuse existing helper function instead of inline gRPC call * s3/filer: add singleflight deduplication for remote object caching - Add singleflight.Group to FilerServer to deduplicate concurrent cache operations - Wrap CacheRemoteObjectToLocalCluster with singleflight to ensure only one caching operation runs per object when multiple clients request the same file - Add early-return check for already-cached objects - S3 API calls filer gRPC with timeout and graceful fallback on error - Clear negative bucket cache when bucket is created via weed shell - Add integration tests for remote cache with singleflight deduplication This benefits all clients (S3, HTTP, Hadoop) accessing remote-mounted objects by preventing redundant cache operations and improving concurrent access performance. Fixes: https://github.com/seaweedfs/seaweedfs/discussions/7599 * fix: data race in concurrent remote object caching - Add mutex to protect chunks slice from concurrent append - Add mutex to protect fetchAndWriteErr from concurrent read/write - Fix incorrect error check (was checking assignResult.Error instead of parseErr) - Rename inner variable to avoid shadowing fetchAndWriteErr * fix: address code review comments - Remove duplicate remote caching block in GetObjectHandler, keep only singleflight version - Add mutex protection for concurrent chunk slice and error access (data race fix) - Use lazy initialization for S3 client in tests to avoid panic during package load - Fix markdown linting: add language specifier to code fence, blank lines around tables - Add 'all' target to Makefile as alias for test-with-server - Remove unused 'util' import * style: remove emojis from test files * fix: add defensive checks and sort chunks by offset - Add nil check and type assertion check for singleflight result - Sort chunks by offset after concurrent fetching to maintain file order * fix: improve test diagnostics and path normalization - runWeedShell now returns error for better test diagnostics - Add all targets to .PHONY in Makefile (logs-primary, logs-remote, health) - Strip leading slash from normalizedObject to avoid double slashes in path --------- Co-authored-by: chrislu <chris.lu@gmail.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>	2 months ago
Chris Lu	5a03b5538f	filer: improve FoundationDB performance by disabling batch by default (#7770 ) * filer: improve FoundationDB performance by disabling batch by default This PR addresses a performance issue where FoundationDB filer was achieving only ~757 ops/sec with 12 concurrent S3 clients, despite FDB being capable of 17,000+ ops/sec. Root cause: The write batcher was waiting up to 5ms for each operation to batch, even though S3 semantics require waiting for durability confirmation. This added artificial latency that defeated the purpose of batching. Changes: - Disable write batching by default (batch_enabled = false) - Each write now commits immediately in its own transaction - Reduce batch interval from 5ms to 1ms when batching is enabled - Add batch_enabled config option to toggle behavior - Improve batcher to collect available ops without blocking - Add benchmarks comparing batch vs no-batch performance Benchmark results (16 concurrent goroutines): - With batch: 2,924 ops/sec (342,032 ns/op) - Without batch: 4,625 ops/sec (216,219 ns/op) - Improvement: +58% faster Configuration: - Default: batch_enabled = false (optimal for S3 PUT latency) - For bulk ingestion: set batch_enabled = true Also fixes ARM64 Docker test setup (shell compatibility, fdbserver path). * fix: address review comments - use atomic counter and remove duplicate batcher - Use sync/atomic.Uint64 for unique filenames in concurrent benchmarks - Remove duplicate batcher creation in createBenchmarkStoreWithBatching (initialize() already creates batcher when batchEnabled=true) * fix: add realistic default values to benchmark store helper Set directoryPrefix, timeout, and maxRetryDelay to reasonable defaults for more realistic benchmark conditions.	2 months ago
Chris Lu	59a7c40043	Add keyPrefix support for TiKV store (#7756 ) * Add keyPrefix support for TiKV store Similar to the Redis keyPrefix feature (#7299), this adds keyPrefix support for TiKV stores to enable sharing a single TiKV cluster as metadata store for multitenant SeaweedFS clusters. Changes: - Add keyPrefix field to TikvStore struct - Update Initialize function to read keyPrefix from config - Add getKey method to prepend prefix to all keys - Update generateKey, getNameFromKey, and genDirectoryKeyPrefix methods to be store receiver methods and handle key prefixing - Update filer.toml scaffold with keyPrefix configuration option Fixes #7752 * Fix potential slice corruption in getKey method Use a new slice with proper capacity to avoid modifying the underlying array of store.keyPrefix when appending. * Add keyPrefix validation and defensive bounds check - Add validation in Initialize to reject keyPrefix longer than 256 bytes - Add bounds check in getNameFromKey to prevent panic on malformed keys * Update weed/filer/tikv/tikv_store.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update weed/command/scaffold/filer.toml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2 months ago
Chris Lu	e8b7347031	Reduce memory allocations in hot paths (#7725 ) * filer: reduce allocations in MatchStorageRule Optimize MatchStorageRule to avoid allocations in common cases: - Return singleton emptyPathConf when no rules match (zero allocations) - Return existing rule directly when only one rule matches (zero allocations) - Only allocate and merge when multiple rules match (rare case) Based on heap profile analysis showing 111MB allocated from 1.64M calls to this function during 180 seconds of operation. * filer: add fast path for getActualStore when no path-specific stores Add hasPathSpecificStore flag to FilerStoreWrapper to skip the MatchPrefix() call and []byte(path) conversion when no path-specific stores are configured (the common case). Based on heap profile analysis showing 1.39M calls to this function during 180 seconds of operation, each requiring a string-to-byte slice conversion for the MatchPrefix call. * filer/foundationdb: use sync.Pool for tuple allocation in genKey Use sync.Pool to reuse tuple.Tuple slices in genKey(), reducing allocation overhead for every FoundationDB operation. Based on heap profile analysis showing 102MB allocated from 1.79M calls to genKey() during 180 seconds of operation. The Pack() call still allocates internally, but this reduces the tuple slice allocation overhead by ~50%. * filer: use sync.Pool for protobuf Entry and FuseAttributes Add pooling for filer_pb.Entry and filer_pb.FuseAttributes in EncodeAttributesAndChunks and DecodeAttributesAndChunks to reduce allocations during filer store operations. Changes: - Add pbEntryPool with pre-allocated FuseAttributes - Add EntryAttributeToExistingPb for in-place attribute conversion - Update ToExistingProtoEntry to reuse existing Attributes when available Based on heap profile showing: - EncodeAttributesAndChunks: 69.5MB cumulative - DecodeAttributesAndChunks: 46.5MB cumulative - EntryAttributeToPb: 47.5MB flat allocations * log_buffer: use sync.Pool for LogEntry in readTs Add logEntryPool to reuse filer_pb.LogEntry objects in readTs(), which is called frequently during binary search in ReadFromBuffer. This function only needs the TsNs field from the unmarshaled entry, so pooling the LogEntry avoids repeated allocations. Based on heap profile showing readTs with 188MB cumulative allocations from timestamp lookups during log buffer reads. * pb: reduce gRPC metadata allocations in interceptor Optimize requestIDUnaryInterceptor and WithGrpcClient to reduce metadata allocations on every gRPC request: - Use AppendToOutgoingContext instead of NewOutgoingContext + New() This avoids creating a new map[string]string for single key-value pairs - Check FromIncomingContext return value before using metadata Based on heap profile showing metadata operations contributing 0.45GB (10.5%) of allocations, with requestIDUnaryInterceptor being the main source at 0.44GB cumulative. Expected reduction: ~0.2GB from avoiding map allocations per request. * filer/log_buffer: address code review feedback - Use proto.Reset() instead of manual field clearing in resetLogEntry for more idiomatic and comprehensive state clearing - Add resetPbEntry() call before pool return in error path for consistency with success path in DecodeAttributesAndChunks * log_buffer: reduce PreviousBufferCount from 32 to 4 Reduce the number of retained previous buffers from 32 to 4. Each buffer is 8MB, so this reduces the maximum retained memory from 256MB to 32MB for previous buffers. Most subscribers catch up quickly, so 4 buffers (32MB) should be sufficient while significantly reducing memory footprint. * filer/foundationdb: use defer for tuple pool cleanup in genKey Refactor genKey to use defer for returning the pooled tuple. This ensures the pooled object is always returned even if store.seaweedfsDir.Pack panics, making the code more robust. Also simplifies the code by removing the temporary variable. * filer: early-stop MatchStorageRule prescan after 2 matches Stop the prescan callback after finding 2 matches since we only need to know if there are 0, 1, or multiple matches. This avoids unnecessarily scanning the rest of the trie when many rules exist. * fix: address critical code review issues filer_conf.go: - Remove mutable singleton emptyPathConf that could corrupt shared state - Return fresh copy for no-match case and cloned copy for single-match case - Add clonePathConf helper to create shallow copies safely grpc_client_server.go: - Remove incorrect AppendToOutgoingContext call in server interceptor (that API is for outbound client calls, not server-side handlers) - Rely on request_id.Set and SetTrailer for request ID propagation * fix: treat FilerConf_PathConf as immutable Fix callers that were incorrectly mutating the returned PathConf: - filer_server_handlers_write.go: Use local variable for MaxFileNameLength instead of mutating the shared rule - command_s3_bucket_quota_check.go: Create new PathConf explicitly when modifying config instead of mutating the returned one This allows MatchStorageRule to safely return the singleton or direct references without copying, restoring the memory optimization. Callers must NOT mutate the returned FilerConf_PathConf. filer: add ClonePathConf helper for creating mutable copies Add reusable ClonePathConf function that creates a mutable copy of a PathConf. This is useful when callers need to modify config before calling SetLocationConf. Update command_s3_bucket_quota_check.go to use the new helper. Also fix redundant return statement in DeleteLocationConf. * fmt * filer: fix protobuf pool reset to clear internal fields Address code review feedback: 1. resetPbEntry/resetFuseAttributes: Use struct assignment (e = T{}) instead of field-by-field reset to clear protobuf internal fields (unknownFields, sizeCache) that would otherwise accumulate across pool reuses, causing data corruption or memory bloat. 2. EntryAttributeToExistingPb: Add nil guard for attr parameter to prevent panic if caller passes nil. log_buffer: reset logEntry before pool return in error path For consistency with success path, reset the logEntry before putting it back in the pool in the error path. This prevents the pooled object from holding references to partially unmarshaled data. * filer: optimize MatchStorageRule and document ClonePathConf 1. Avoid double []byte(path) conversion in multi-match case by converting once and reusing pathBytes. 2. Add IMPORTANT comment to ClonePathConf documenting that it must be kept in sync with filer_pb.FilerConf_PathConf fields when the protobuf evolves. * filer/log_buffer: fix data race and use defer for pool cleanup 1. entry_codec.go EncodeAttributesAndChunks: Fix critical data race - proto.Marshal may return a slice sharing memory with the message. Copy the data before returning message to pool to prevent corruption. 2. entry_codec.go DecodeAttributesAndChunks: Use defer for cleaner pool management, ensuring message is always returned to pool. 3. log_buffer.go readTs: Use defer for pool cleanup, removing duplicated resetLogEntry/Put calls in success and error paths. * filer: fix ClonePathConf field order and add comprehensive test 1. Fix field order in ClonePathConf to match protobuf struct definition (WormGracePeriodSeconds before WormRetentionTimeSeconds). 2. Add TestClonePathConf that constructs a fully-populated PathConf, calls ClonePathConf, and asserts equality of all exported fields. This will catch future schema drift when new fields are added. 3. Add TestClonePathConfNil to verify nil handling. * filer: use reflection in ClonePathConf test to detect schema drift Replace hardcoded field comparisons with reflection-based comparison. This automatically catches: 1. New fields added to the protobuf but not copied in ClonePathConf 2. Missing non-zero test values for any exported field The test iterates over all exported fields using reflect and compares src vs clone values, failing if any field differs. * filer: update EntryAttributeToExistingPb comment to reflect nil handling The function safely handles nil attr by returning early, but the comment incorrectly stated 'attr must not be nil'. Update comment to accurately describe the defensive behavior. * Fix review feedback: restore request ID propagation and remove redundant resets 1. grpc_client_server.go: Restore AppendToOutgoingContext for request ID so handlers making downstream gRPC calls will automatically propagate the request ID to downstream services. 2. entry_codec.go: Remove redundant resetPbEntry calls after Get. The defer block ensures reset before Put, so next Get receives clean object. 3. log_buffer.go: Remove redundant resetLogEntry call after Get for same reason - defer already handles reset before Put.	2 months ago
Chris Lu	c153420022	filer: add write batching for FoundationDB store to improve throughput (#7708 ) This addresses issue #7699 where FoundationDB filer store had low throughput (~400-500 obj/s) due to each write operation creating a separate transaction. Changes: - Add writeBatcher that collects multiple writes into batched transactions - New config options: batch_size (default: 100), batch_interval (default: 5ms) - Batching provides ~5.7x throughput improvement (from ~456 to ~2600 obj/s) Benchmark results with different batch sizes: - batch_size=1: ~456 obj/s (baseline, no batching) - batch_size=10: ~2621 obj/s (5.7x improvement) - batch_size=16: ~2514 obj/s (5.5x improvement) - batch_size=100: ~2617 obj/s (5.7x improvement) - batch_size=1000: ~2593 obj/s (5.7x improvement) The batch_interval timer (5ms) ensures writes are flushed promptly even when batch is not full, providing good latency characteristics. Addressed review feedback: - Changed wait=false to wait=true in UpdateEntry/DeleteEntry to properly propagate errors to callers - Fixed timer reset race condition by stopping and draining before reset Fixes #7699	2 months ago
Chris Lu	0cd9f34177	mount: improve EnsureVisited performance with dedup, parallelism, and batching (#7697 ) * mount: add singleflight to deduplicate concurrent EnsureVisited calls When multiple goroutines access the same uncached directory simultaneously, they would all make redundant network requests to the filer. This change uses singleflight.Group to ensure only one goroutine fetches the directory entries while others wait for the result. This fixes a race condition where concurrent lookups or readdir operations on the same uncached directory would: 1. Make duplicate network requests to the filer 2. Insert duplicate entries into LevelDB cache 3. Waste CPU and network bandwidth * mount: fetch parent directories in parallel during EnsureVisited Previously, when accessing a deep path like /a/b/c/d, the parent directories were fetched serially from target to root. This change: 1. Collects all uncached directories from target to root first 2. Fetches them all in parallel using errgroup 3. Relies on singleflight (from previous commit) for deduplication This reduces latency when accessing deep uncached paths, especially in high-latency network environments where parallel requests can significantly improve performance. * mount: add batch inserts for LevelDB meta cache When populating the meta cache from filer, entries were inserted one-by-one into LevelDB. This change: 1. Adds BatchInsertEntries method to LevelDBStore that uses LevelDB's native batch write API 2. Updates MetaCache to keep a direct reference to the LevelDB store for batch operations 3. Modifies doEnsureVisited to collect entries and insert them in batches of 100 entries Batch writes are more efficient because: - Reduces number of individual write operations - Reduces disk syncs - Improves throughput for large directories * mount: fix potential nil dereference in MarkChildrenCached Add missing check for inode existence in inode2path map before accessing the InodeEntry. This prevents a potential nil pointer dereference if the inode exists in path2inode but not in inode2path (which could happen due to race conditions or bugs). This follows the same pattern used in IsChildrenCached which properly checks for existence before accessing the entry. * mount: fix batch flush when last entry is hidden The previous batch insert implementation relied on the isLast flag to flush remaining entries. However, if the last entry is a hidden system entry (like 'topics' or 'etc' in root), the callback returns early and the remaining entries in the batch are never flushed. Fix by: 1. Only flush when batch reaches threshold inside the callback 2. Flush any remaining entries after ReadDirAllEntries completes 3. Use error wrapping instead of logging+returning to avoid duplicate logs 4. Create new slice after flush to allow GC of flushed entries 5. Add documentation for batchInsertSize constant This ensures all entries are properly inserted regardless of whether the last entry is hidden, and prevents memory retention issues. * mount: add context support for cancellation in EnsureVisited Thread context.Context through the batch insert call chain to enable proper cancellation and timeout support: 1. Use errgroup.WithContext() so if one fetch fails, others are cancelled 2. Add context parameter to BatchInsertEntries for consistency with InsertEntry 3. Pass context to ReadDirAllEntries for cancellation during network calls 4. Check context cancellation before starting work in doEnsureVisited 5. Use %w for error wrapping to preserve error types for inspection This prevents unnecessary work when one directory fetch fails and makes the batch operations consistent with the existing context-aware APIs.	2 months ago
Chris Lu	1b13324fb7	fix: skip log files with deleted volumes in filer backup (#7692 ) fix: skip log files with deleted volumes in filer backup (#3720) When filer.backup or filer.meta.backup resumes after being stopped, it may encounter persisted log files stored on volumes that have since been deleted (via volume.deleteEmpty -force). Previously, this caused the backup to get stuck in an infinite retry loop with 'volume X not found' errors. This fix catches 'volume not found' errors when reading log files and skips the problematic file instead of failing. The backup will now: - Log a warning about the missing volume - Skip the problematic log file - Continue with the next log file, allowing progress The VolumeNotFoundPattern regex was already defined but never used - this change puts it to use. Fixes #3720	2 months ago
Chris Lu	5c1de633cb	mount: improve read throughput with parallel chunk fetching (#7627 ) * filer: remove lock contention during chunk download This addresses issue #7504 where a single weed mount FUSE instance does not fully utilize node network bandwidth when reading large files. The SingleChunkCacher was holding a mutex during the entire HTTP download, causing readers to block until the download completed. This serialized chunk reads even when multiple goroutines were downloading in parallel. Changes: - Add sync.Cond to SingleChunkCacher for efficient waiting - Move HTTP download outside the critical section in startCaching() - Use condition variable in readChunkAt() to wait for download completion - Add isComplete flag to track download state Now multiple chunk downloads can proceed truly in parallel, and readers wait efficiently using the condition variable instead of blocking on a mutex held during I/O operations. Ref: #7504 * filer: parallel chunk fetching within doReadAt This addresses issue #7504 by enabling parallel chunk downloads within a single read operation. Previously, doReadAt() processed chunks sequentially in a loop, meaning each chunk had to be fully downloaded before the next one started. This left significant network bandwidth unused when chunks resided on different volume servers. Changes: - Collect all chunk read tasks upfront - Use errgroup to fetch multiple chunks in parallel - Each chunk reads directly into its correct buffer position - Limit concurrency to prefetchCount (min 4) to avoid overwhelming the system - Handle gaps and zero-filling before parallel fetch - Trigger prefetch after parallel reads complete For a read spanning N chunks on different volume servers, this can now utilize up to N times the bandwidth of a single connection. Ref: #7504 * http: direct buffer read to reduce memory copies This addresses issue #7504 by reducing memory copy overhead during chunk downloads. Previously, RetriedFetchChunkData used ReadUrlAsStream which: 1. Allocated a 64KB intermediate buffer 2. Read data in 64KB chunks 3. Called a callback to copy each chunk to the destination For a 16MB chunk, this meant 256 copy operations plus the callback overhead. Profiling showed significant time spent in memmove. Changes: - Add readUrlDirectToBuffer() that reads directly into the destination - Add retriedFetchChunkDataDirect() for unencrypted, non-gzipped chunks - Automatically use direct read path when possible (cipher=nil, gzip=false) - Use http.NewRequestWithContext for proper cancellation For unencrypted chunks (the common case), this eliminates the intermediate buffer entirely, reading HTTP response bytes directly into the final destination buffer. Ref: #7504 * address review comments - Use channel (done) instead of sync.Cond for download completion signaling This integrates better with context cancellation patterns - Remove redundant groupErr check in reader_at.go (errors are already captured in task.err) - Remove buggy URL encoding logic from retriedFetchChunkDataDirect (The existing url.PathEscape on full URL is a pre-existing bug that should be fixed separately) * address review comments (round 2) - Return io.ErrUnexpectedEOF when HTTP response is truncated This prevents silent data corruption from incomplete reads - Simplify errgroup error handling by using g.Wait() error directly Remove redundant task.err field and manual error aggregation loop - Define minReadConcurrency constant instead of magic number 4 Improves code readability and maintainability Note: Context propagation to startCaching() is intentionally NOT changed. The downloaded chunk is a shared resource that may be used by multiple readers. Using context.Background() ensures the download completes even if one reader cancels, preventing data loss for other waiting readers. * http: inject request ID for observability in direct read path Add request_id.InjectToRequest() call to readUrlDirectToBuffer() for consistency with ReadUrlAsStream path. This ensures full-chunk reads carry the same tracing/correlation headers for server logs and metrics. * filer: consistent timestamp handling in sequential read path Use max(ts, task.chunk.ModifiedTsNs) in sequential path to match parallel path behavior. Also update ts before error check so that on failure, the returned timestamp reflects the max of all chunks processed so far. * filer: document why context.Background() is used in startCaching Add comment explaining the intentional design decision: the downloaded chunk is a shared resource that may be used by multiple concurrent readers. Using context.Background() ensures the download completes even if one reader cancels, preventing errors for other waiting readers. * filer: propagate context for reader cancellation Address review comment: pass context through ReadChunkAt call chain so that a reader can cancel its wait for a download. The key distinction is: - Download uses context.Background() - shared resource, always completes - Reader wait uses request context - can be cancelled individually If a reader cancels, it stops waiting and returns ctx.Err(), but the download continues to completion for other readers waiting on the same chunk. This properly handles the shared resource semantics while still allowing individual reader cancellation. * filer: use defer for close(done) to guarantee signal on panic Move close(s.done) to a defer statement at the start of startCaching() to ensure the completion signal is always sent, even if an unexpected panic occurs. This prevents readers from blocking indefinitely. * filer: remove unnecessary code - Remove close(s.cacheStartedCh) in destroy() - the channel is only used for one-time synchronization, closing it provides no benefit - Remove task := task loop variable capture - Go 1.22+ fixed loop variable semantics, this capture is no longer necessary (go.mod specifies Go 1.24.0) * filer: restore fallback to chunkCache when cacher returns no data Fix critical issue where ReadChunkAt would return 0,nil immediately if SingleChunkCacher couldn't provide data for the requested offset, without trying the chunkCache fallback. Now if cacher.readChunkAt returns n=0 and err=nil, we fall through to try chunkCache. * filer: add comprehensive tests for ReaderCache Tests cover: - Context cancellation while waiting for download - Fallback to chunkCache when cacher returns n=0, err=nil - Multiple concurrent readers waiting for same chunk - Partial reads at different offsets - Downloader cleanup when exceeding cache limit - Done channel signaling (no hangs on completion) * filer: prioritize done channel over context cancellation If data is already available (done channel closed), return it even if the reader's context is also cancelled. This avoids unnecessary errors when the download has already completed. * filer: add lookup error test and document test limitations Add TestSingleChunkCacherLookupError to test error handling when lookup fails. Document that full HTTP integration tests for SingleChunkCacher require global HTTP client initialization which is complex in unit tests. The download path is tested via FUSE integration tests. * filer: add tests that exercise SingleChunkCacher concurrency logic Add tests that use blocking lookupFileIdFn to exercise the actual SingleChunkCacher wait/cancellation logic: - TestSingleChunkCacherContextCancellationDuringLookup: tests reader cancellation while lookup is blocked - TestSingleChunkCacherMultipleReadersWaitForDownload: tests multiple readers waiting on the same download - TestSingleChunkCacherOneReaderCancelsOthersContinue: tests that when one reader cancels, other readers continue waiting These tests properly exercise the done channel wait/cancel logic without requiring HTTP calls - the blocking lookup simulates a slow download.	2 months ago
chrislu	8d110b29dd	fmt	2 months ago
Chris Lu	39ba19eea6	filer: async empty folder cleanup via metadata events (#7614 ) * filer: async empty folder cleanup via metadata events Implements asynchronous empty folder cleanup when files are deleted in S3. Key changes: 1. EmptyFolderCleaner - New component that handles folder cleanup: - Uses consistent hashing (LockRing) to determine folder ownership - Each filer owns specific folders, avoiding duplicate cleanup work - Debounces delete events (10s delay) to batch multiple deletes - Caches rough folder counts to skip unnecessary checks - Cancels pending cleanup when new files are created - Handles both file and subdirectory deletions 2. Integration with metadata events: - Listens to both local and remote filer metadata events - Processes create/delete/rename events to track folder state - Only processes folders under /buckets/<bucket>/... 3. Removed synchronous empty folder cleanup from S3 handlers: - DeleteObjectHandler no longer calls DoDeleteEmptyParentDirectories - DeleteMultipleObjectsHandler no longer tracks/cleans directories - Cleanup now happens asynchronously via metadata events Benefits: - Non-blocking: S3 delete requests return immediately - Coordinated: Only one filer (the owner) cleans each folder - Efficient: Batching and caching reduce unnecessary checks - Event-driven: Folder deletion triggers parent folder check automatically * filer: add CleanupQueue data structure for deduplicated folder cleanup CleanupQueue uses a linked list for FIFO ordering and a hashmap for O(1) deduplication. Processing is triggered when: - Queue size reaches maxSize (default 1000), OR - Oldest item exceeds maxAge (default 10 minutes) Key features: - O(1) Add, Remove, Pop, Contains operations - Duplicate folders are ignored (keeps original position/time) - Testable with injectable time function - Thread-safe with mutex protection * filer: use CleanupQueue for empty folder cleanup Replace timer-per-folder approach with queue-based processing: - Use CleanupQueue for deduplication and ordered processing - Process queue when full (1000 items) or oldest item exceeds 10 minutes - Background processor checks queue every 10 seconds - Remove from queue on create events to cancel pending cleanup Benefits: - Bounded memory: queue has max size, not unlimited timers - Efficient: O(1) add/remove/contains operations - Batch processing: handle many folders efficiently - Better for high-volume delete scenarios * filer: CleanupQueue.Add moves duplicate to back with updated time When adding a folder that already exists in the queue: - Remove it from its current position - Add it to the back of the queue - Update the queue time to current time This ensures that folders with recent delete activity are processed later, giving more time for additional deletes to occur. * filer: CleanupQueue uses event time and inserts in sorted order Changes: - Add() now takes eventTime parameter instead of using current time - Insert items in time-sorted order (oldest at front) to handle out-of-order events - When updating duplicate with newer time, reposition to maintain sort order - Ignore updates with older time (keep existing later time) This ensures proper ordering when processing events from distributed filers where event arrival order may not match event occurrence order. * filer: remove unused CleanupQueue functions (SetNowFunc, GetAll) Removed test-only functions: - SetNowFunc: tests now use real time with past event times - GetAll: tests now use Pop() to verify order Kept functions used in production: - Peek: used in filer_notify_read.go - OldestAge: used in empty_folder_cleaner.go logging * filer: initialize cache entry on first delete/create event Previously, roughCount was only updated if the cache entry already existed, but entries were only created during executeCleanup. This meant delete/create events before the first cleanup didn't track the count. Now create the cache entry on first event, so roughCount properly tracks all changes from the start. * filer: skip adding to cleanup queue if roughCount > 0 If the cached roughCount indicates there are still items in the folder, don't bother adding it to the cleanup queue. This avoids unnecessary queue entries and reduces wasted cleanup checks. * filer: don't create cache entry on create event Only update roughCount if the folder is already being tracked. New folders don't need tracking until we see a delete event. * filer: move empty folder cleanup to its own package - Created weed/filer/empty_folder_cleanup package - Defined FilerOperations interface to break circular dependency - Added CountDirectoryEntries method to Filer - Exported IsUnderPath and IsUnderBucketPath helper functions * filer: make isUnderPath and isUnderBucketPath private These helpers are only used within the empty_folder_cleanup package.	2 months ago
Chris Lu	61c0514a1c	filer: add username and keyPrefix support for Redis stores (#7591 ) * filer: add username and keyPrefix support for Redis stores Addresses https://github.com/seaweedfs/seaweedfs/issues/7299 - Add username config option to redis2, redis_cluster2, redis_lua, and redis_lua_cluster stores (sentinel stores already had it) - Add keyPrefix config option to all Redis stores to prefix all keys, useful for Envoy Redis Proxy or multi-tenant Redis setups * refactor: reduce duplication in redis.NewClient creation Address code review feedback by defining redis.Options once and conditionally setting TLSConfig instead of duplicating the entire NewClient call. * filer.toml: add username and keyPrefix to redis2.tmp example	2 months ago
Chris Lu	d48e1e1659	mount: improve read throughput with parallel chunk fetching (#7569 ) * mount: improve read throughput with parallel chunk fetching This addresses issue #7504 where a single weed mount FUSE instance does not fully utilize node network bandwidth when reading large files. Changes: - Add -concurrentReaders mount option (default: 16) to control the maximum number of parallel chunk fetches during read operations - Implement parallel section reading in ChunkGroup.ReadDataAt() using errgroup for better throughput when reading across multiple sections - Enhance ReaderCache with MaybeCacheMany() to prefetch multiple chunks ahead in parallel during sequential reads (now prefetches 4 chunks) - Increase ReaderCache limit dynamically based on concurrentReaders to support higher read parallelism The bottleneck was that chunks were being read sequentially even when they reside on different volume servers. By introducing parallel chunk fetching, a single mount instance can now better saturate available network bandwidth. Fixes: #7504 * fmt * Address review comments: make prefetch configurable, improve error handling Changes: 1. Add DefaultPrefetchCount constant (4) to reader_at.go 2. Add GetPrefetchCount() method to ChunkGroup that derives prefetch count from concurrentReaders (1/4 ratio, min 1, max 8) 3. Pass prefetch count through NewChunkReaderAtFromClient 4. Fix error handling in readDataAtParallel to prioritize errgroup error 5. Update all callers to use DefaultPrefetchCount constant For mount operations, prefetch scales with -concurrentReaders: - concurrentReaders=16 (default) -> prefetch=4 - concurrentReaders=32 -> prefetch=8 (capped) - concurrentReaders=4 -> prefetch=1 For non-mount paths (WebDAV, query engine, MQ), uses DefaultPrefetchCount. * fmt * Refactor: use variadic parameter instead of new function name Use NewChunkGroup with optional concurrentReaders parameter instead of creating a separate NewChunkGroupWithConcurrency function. This maintains backward compatibility - existing callers without the parameter get the default of 16 concurrent readers. * Use explicit concurrentReaders parameter instead of variadic * Refactor: use MaybeCache with count parameter instead of new MaybeCacheMany function * Address nitpick review comments - Add upper bound (128) on concurrentReaders to prevent excessive goroutine fan-out - Cap readerCacheLimit at 256 accordingly - Fix SetChunks: use Lock() instead of RLock() since we are writing to group.sections	3 months ago
Chris Lu	4106fc0436	fix(tikv): improve context propagation and refactor batch delete logic (#7558 ) * fix(tikv): improve context propagation and refactor batch delete logic Address review comments from PR #7557: 1. Replace context.TODO() with ctx in txn.Get calls - Fixes timeout/cancellation propagation in FindEntry - Fixes timeout/cancellation propagation in KvGet 2. Refactor DeleteFolderChildren to use flush helper - Eliminates code duplication - Cleaner and more maintainable These changes ensure proper context propagation throughout all TiKV operations and improve code maintainability. * error formatting	3 months ago
Chris Lu	5287d9f3e3	fix(tikv): replace DeleteRange with transaction-based batch deletes (#7557 ) * fix(tikv): replace DeleteRange with transaction-based batch deletes Fixes #7187 Problem: TiKV's DeleteRange API is a RawKV operation that bypasses transaction isolation. When SeaweedFS filer uses TiKV with txn client and another service uses RawKV client on the same cluster, DeleteFolderChildren can accidentally delete KV pairs from the RawKV client because DeleteRange operates at the raw key level without respecting transaction boundaries. Reproduction: 1. SeaweedFS filer using TiKV txn client for metadata 2. Another service using rawkv client on same TiKV cluster 3. Filer performs batch file deletion via DeleteFolderChildren 4. Result: ~50% of rawkv client's KV pairs get deleted Solution: Replace client.DeleteRange() (RawKV API) with transactional batch deletes using txn.Delete() within transactions. This ensures: - Transaction isolation - operations respect TiKV's MVCC boundaries - Keyspace separation - txn client and RawKV client stay isolated - Proper key handling - keys are copied to avoid iterator reuse issues - Batch processing - deletes batched (10K default) to manage memory Changes: 1. Core data structure: - Removed deleteRangeConcurrency field - Added batchCommitSize field (configurable, default 10000) 2. DeleteFolderChildren rewrite: - Replaced DeleteRange with iterative batch deletes - Added proper transaction lifecycle management - Implemented key copying to avoid iterator buffer reuse - Added batching to prevent memory exhaustion 3. New deleteBatch helper: - Handles transaction creation and lifecycle - Batches deletes within single transaction - Properly commits/rolls back based on context 4. Context propagation: - Updated RunInTxn to accept context parameter - All RunInTxn call sites now pass context - Enables proper timeout/cancellation handling 5. Configuration: - Removed deleterange_concurrency setting - Added batchdelete_count setting (default 10000) All critical review comments from PR #7188 have been addressed: - Proper key copying with append([]byte(nil), key...) - Conditional transaction rollback based on inContext flag - Context propagation for commits - Proper transaction lifecycle management - Configurable batch size Co-authored-by: giftz <giftz@users.noreply.github.com> * fix: remove extra closing brace causing syntax error in tikv_store.go --------- Co-authored-by: giftz <giftz@users.noreply.github.com>	3 months ago
Chris Lu	5075381060	Support multiple filers for S3 and IAM servers with automatic failover (#7550 ) * Support multiple filers for S3 and IAM servers with automatic failover This change adds support for multiple filer addresses in the 'weed s3' and 'weed iam' commands, enabling high availability through automatic failover. Key changes: - Updated S3ApiServerOption.Filer to Filers ([]pb.ServerAddress) - Updated IamServerOption.Filer to Filers ([]pb.ServerAddress) - Modified -filer flag to accept comma-separated addresses - Added getFilerAddress() helper methods for backward compatibility - Updated all filer client calls to support multiple addresses - Uses pb.WithOneOfGrpcFilerClients for automatic failover Usage: weed s3 -filer=localhost:8888,localhost:8889 weed iam -filer=localhost:8888,localhost:8889 The underlying FilerClient already supported multiple filers with health tracking and automatic failover - this change exposes that capability through the command-line interface. * Add filer discovery: treat initial filers as seeds and discover peers from master Enhances FilerClient to automatically discover additional filers in the same filer group by querying the master server. This allows users to specify just a few seed filers, and the client will discover all other filers in the cluster. Key changes to wdclient/FilerClient: - Added MasterClient, FilerGroup, and DiscoveryInterval fields - Added thread-safe filer list management with RWMutex - Implemented discoverFilers() background goroutine - Uses cluster.ListExistingPeerUpdates() to query master for filers - Automatically adds newly discovered filers to the list - Added Close() method to clean up discovery goroutine New FilerClientOption fields: - MasterClient: enables filer discovery from master - FilerGroup: specifies which filer group to discover - DiscoveryInterval: how often to refresh (default 5 minutes) Usage example: masterClient := wdclient.NewMasterClient(...) filerClient := wdclient.NewFilerClient( []pb.ServerAddress{"localhost:8888"}, // seed filers grpcDialOption, dataCenter, &wdclient.FilerClientOption{ MasterClient: masterClient, FilerGroup: "my-group", }, ) defer filerClient.Close() The initial filers act as seeds - the client discovers and adds all other filers in the same group from the master. Discovered filers are added dynamically without removing existing ones (relying on health checks for unavailable filers). * Address PR review comments: implement full failover for IAM operations Critical fixes based on code review feedback: 1. IAM API Failover (Critical): - Replace pb.WithGrpcFilerClient with pb.WithOneOfGrpcFilerClients in: * GetS3ApiConfigurationFromFiler() * PutS3ApiConfigurationToFiler() * GetPolicies() * PutPolicies() - Now all IAM operations support automatic failover across multiple filers 2. Validation Improvements: - Add validation in NewIamApiServerWithStore() to require at least one filer - Add validation in NewS3ApiServerWithStore() to require at least one filer - Add warning log when no filers configured for credential store 3. Error Logging: - Circuit breaker now logs when config load fails instead of silently ignoring - Helps operators understand why circuit breaker limits aren't applied 4. Code Quality: - Use ToGrpcAddress() for filer address in credential store setup - More consistent with rest of codebase and future-proof These changes ensure IAM operations have the same high availability guarantees as S3 operations, completing the multi-filer failover implementation. * Fix IAM manager initialization: remove code duplication, add TODO for HA Addresses review comment on s3api_server.go:145 Changes: - Remove duplicate code for getting first filer address - Extract filerAddr variable once and reuse - Add TODO comment documenting the HA limitation for IAM manager - Document that loadIAMManagerFromConfig and NewS3IAMIntegration need updates to support multiple filers for full HA Note: This is a known limitation when using filer-backed IAM stores. The interfaces need to be updated to accept multiple filer addresses. For now, documenting this limitation clearly. * Document credential store HA limitation with TODO Addresses review comment on auth_credentials.go:149 Changes: - Add TODO comment documenting that SetFilerClient interface needs update for multi-filer support - Add informative log message indicating HA limitation - Document that this is a known limitation for filer-backed credential stores The SetFilerClient interface currently only accepts a single filer address. To properly support HA, the credential store interfaces need to be updated to handle multiple filer addresses. * Track current active filer in FilerClient for better HA Add GetCurrentFiler() method to FilerClient that returns the currently active filer based on the filerIndex which is updated on successful operations. This provides better availability than always using the first filer. Changes: - Add FilerClient.GetCurrentFiler() method that returns current active filer - Update S3ApiServer.getFilerAddress() to use FilerClient's current filer - Add fallback to first filer if FilerClient not yet initialized - Document IAM limitation (doesn't have FilerClient access) Benefits: - Single-filer operations (URLs, ReadFilerConf, etc.) now use the currently active/healthy filer - Better distribution and failover behavior - FilerClient's round-robin and health tracking automatically determines which filer to use * Document ReadFilerConf HA limitation in lifecycle handlers Addresses review comment on s3api_bucket_handlers.go:880 Add comment documenting that ReadFilerConf uses the current active filer from FilerClient (which is better than always using first filer), but doesn't have built-in multi-filer failover. Add TODO to update filer.ReadFilerConf to support multiple filers for complete HA. For now, it uses the currently active/healthy filer tracked by FilerClient which provides reasonable availability. * Document multipart upload URL HA limitation Addresses review comment on s3api_object_handlers_multipart.go:442 Add comment documenting that part upload URLs point to the current active filer (tracked by FilerClient), which is better than always using the first filer but still creates a potential point of failure if that filer becomes unavailable during upload. Suggest TODO solutions: - Use virtual hostname/load balancer for filers - Have S3 server proxy uploads to healthy filers Current behavior provides reasonable availability by using the currently active/healthy filer rather than being pinned to first filer. * Document multipart completion Location URL limitation Addresses review comment on filer_multipart.go:187 Add comment documenting that the Location URL in CompleteMultipartUpload response points to the current active filer (tracked by FilerClient). Note that clients should ideally use the S3 API endpoint rather than this direct URL. If direct access is attempted and the specific filer is unavailable, the request will fail. Current behavior uses the currently active/healthy filer rather than being pinned to the first filer, providing better availability. * Make credential store use current active filer for HA Update FilerEtcStore to use a function that returns the current active filer instead of a fixed address, enabling high availability. Changes: - Add SetFilerAddressFunc() method to FilerEtcStore - Store uses filerAddressFunc instead of fixed filerGrpcAddress - withFilerClient() calls the function to get current active filer - Keep SetFilerClient() for backward compatibility (marked deprecated) - Update S3ApiServer to pass FilerClient.GetCurrentFiler to store Benefits: - Credential store now uses currently active/healthy filer - Automatic failover when filer becomes unavailable - True HA for credential operations - Backward compatible with old SetFilerClient interface This addresses the credential store limitation - no longer pinned to first filer, uses FilerClient's tracked current active filer. * Clarify multipart URL comments: filer address not used for uploads Update comments to reflect that multipart upload URLs are not actually used for upload traffic - uploads go directly to volume servers. Key clarifications: - genPartUploadUrl: Filer address is parsed out, only path is used - CompleteMultipartUpload Location: Informational field per AWS S3 spec - Actual uploads bypass filer proxy and go directly to volume servers The filer address in these URLs is NOT a HA concern because: 1. Part uploads: URL is parsed for path, upload goes to volume servers 2. Location URL: Informational only, clients use S3 endpoint This addresses the observation that S3 uploads don't go through filers, only metadata operations do. * Remove filer address from upload paths - pass path directly Eliminate unnecessary filer address from upload URLs by passing file paths directly instead of full URLs that get immediately parsed. Changes: - Rename genPartUploadUrl() → genPartUploadPath() (returns path only) - Rename toFilerUrl() → toFilerPath() (returns path only) - Update putToFiler() to accept filePath instead of uploadUrl - Remove URL parsing code (no longer needed) - Remove net/url import (no longer used) - Keep old function names as deprecated wrappers for compatibility Benefits: - Cleaner code - no fake URL construction/parsing - No dependency on filer address for internal operations - More accurate naming (these are paths, not URLs) - Eliminates confusion about HA concerns This completely removes the filer address from upload operations - it was never actually used for routing, only parsed for the path. * Remove deprecated functions: use new path-based functions directly Remove deprecated wrapper functions and update all callers to use the new function names directly. Removed: - genPartUploadUrl() → all callers now use genPartUploadPath() - toFilerUrl() → all callers now use toFilerPath() - SetFilerClient() → removed along with fallback code Updated: - s3api_object_handlers_multipart.go: uploadUrl → filePath - s3api_object_handlers_put.go: uploadUrl → filePath, versionUploadUrl → versionFilePath - s3api_object_versioning.go: toFilerUrl → toFilerPath - s3api_object_handlers_test.go: toFilerUrl → toFilerPath - auth_credentials.go: removed SetFilerClient fallback - filer_etc_store.go: removed deprecated SetFilerClient method Benefits: - Cleaner codebase with no deprecated functions - All variable names accurately reflect that they're paths, not URLs - Single interface for credential stores (SetFilerAddressFunc only) All code now consistently uses the new path-based approach. * Fix toFilerPath: remove URL escaping for raw file paths The toFilerPath function should return raw file paths, not URL-escaped paths. URL escaping was needed when the path was embedded in a URL (old toFilerUrl), but now that we pass paths directly to putToFiler, they should be unescaped. This fixes S3 integration test failures: - test_bucket_listv2_encoding_basic - test_bucket_list_encoding_basic - test_bucket_listv2_delimiter_whitespace - test_bucket_list_delimiter_whitespace The tests were failing because paths were double-encoded (escaped when stored, then escaped again when listed), resulting in %252B instead of %2B for '+' characters. Root cause: When we removed URL parsing in putToFiler, we should have also removed URL escaping in toFilerPath since paths are now used directly without URL encoding/decoding. * Add thread safety to FilerEtcStore and clarify credential store comments Address review suggestions for better thread safety and code clarity: 1. Thread Safety: Add RWMutex to FilerEtcStore - Protects filerAddressFunc and grpcDialOption from concurrent access - Initialize() uses write lock when setting function - SetFilerAddressFunc() uses write lock - withFilerClient() uses read lock to get function and dial option - GetPolicies() uses read lock to check if configured 2. Improved Error Messages: - Prefix errors with "filer_etc:" for easier debugging - "filer address not configured" → "filer_etc: filer address function not configured" - "filer address is empty" → "filer_etc: filer address is empty" 3. Clarified Comments: - auth_credentials.go: Clarify that initial setup is temporary - Document that it's updated in s3api_server.go after FilerClient creation - Remove ambiguity about when FilerClient.GetCurrentFiler is used Benefits: - Safe for concurrent credential operations - Clear error messages for debugging - Explicit documentation of initialization order * Enable filer discovery: pass master addresses to FilerClient Fix two critical issues: 1. Filer Discovery Not Working: Master client was not being passed to FilerClient, so peer discovery couldn't work 2. Credential Store Design: Already uses FilerClient via GetCurrentFiler function - this is the correct design for HA Changes: Command (s3.go): - Read master addresses from GetFilerConfiguration response - Pass masterAddresses to S3ApiServerOption - Log master addresses for visibility S3ApiServerOption: - Add Masters []pb.ServerAddress field for discovery S3ApiServer: - Create MasterClient from Masters when available - Pass MasterClient + FilerGroup to FilerClient via options - Enable discovery with 5-minute refresh interval - Log whether discovery is enabled or disabled Credential Store: - Already correctly uses filerClient.GetCurrentFiler via function - This provides HA without tight coupling to FilerClient struct - Function-based design is clean and thread-safe Discovery Flow: 1. S3 command reads filer config → gets masters + filer group 2. S3ApiServer creates MasterClient from masters 3. FilerClient uses MasterClient to query for peer filers 4. Background goroutine refreshes peer list every 5 minutes 5. Credential store uses GetCurrentFiler to get active filer Now filer discovery actually works! �� * Use S3 endpoint in multipart Location instead of filer address * Add multi-filer failover to ReadFilerConf * Address CodeRabbit review: fix buffer reuse and improve lock safety Address two code review suggestions: 1. Fix buffer reuse in ReadFilerConfFromFilers: - Use local []byte data instead of shared buffer - Prevents partial data from failed attempts affecting successful reads - Creates fresh buffer inside callback for masterClient path - More robust to future changes in read helpers 2. Improve lock safety in FilerClient: - Add WithHealth variants that accept health pointer - Get health pointer while holding lock, then release before calling - Eliminates potential for lock confusion (though no actual deadlock existed) - Clearer separation: lock for data access, atomics for health ops Changes: - ReadFilerConfFromFilers: var data []byte, create buf inside callback - shouldSkipUnhealthyFilerWithHealth(health filerHealth) - recordFilerSuccessWithHealth(health filerHealth) - recordFilerFailureWithHealth(health filerHealth) - Keep old functions for backward compatibility (marked deprecated) - Update LookupVolumeIds to use WithHealth variants Benefits: - More robust multi-filer configuration reading - Clearer lock vs atomic operation boundaries - No lock held during health checks (even though atomics don't block) - Better code organization and maintainability * add constant * Fix IAM manager and post policy to use current active filer * Fix critical race condition and goroutine leak * Update weed/s3api/filer_multipart.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Fix compilation error and address code review suggestions Address remaining unresolved comments: 1. Fix compilation error: Add missing net/url import - filer_multipart.go used url.PathEscape without import - Added "net/url" to imports 2. Fix Location URL formatting (all 4 occurrences): - Add missing slash between bucket and key - Use url.PathEscape for bucket names - Use urlPathEscape for object keys - Handles special characters in bucket/key names - Before: http://host/bucketkey - After: http://host/bucket/key (properly escaped) 3. Optimize discovery loop (O(NM) → O(N+M)): - Use map for existing filers (O(1) lookup) - Reduces time holding write lock - Better performance with many filers - Before: Nested loop for each discovered filer - After: Build map once, then O(1) lookups Changes: - filer_multipart.go: Import net/url, fix all Location URLs - filer_client.go: Use map for efficient filer discovery Benefits: - Compiles successfully - Proper URL encoding (handles spaces, special chars) - Faster discovery with less lock contention - Production-ready URL formatting Fix race conditions and make Close() idempotent Address CodeRabbit review #3512078995: 1. Critical: Fix unsynchronized read in error message - Line 584 read len(fc.filerAddresses) without lock - Race with refreshFilerList appending to slice - Fixed: Take RLock to read length safely - Prevents race detector warnings 2. Important: Make Close() idempotent - Closing already-closed channel panics - Can happen with layered cleanup in shutdown paths - Fixed: Use sync.Once to ensure single close - Safe to call Close() multiple times now 3. Nitpick: Add warning for empty filer address - getFilerAddress() can return empty string - Helps diagnose unexpected state - Added: Warning log when no filers available 4. Nitpick: Guard deprecated index-based helpers - shouldSkipUnhealthyFiler, recordFilerSuccess/Failure - Accessed filerHealth without lock (races with discovery) - Fixed: Take RLock and check bounds before array access - Prevents index out of bounds and races Changes: - filer_client.go: - Add closeDiscoveryOnce sync.Once field - Use Do() in Close() for idempotent channel close - Add RLock guards to deprecated index-based helpers - Add bounds checking to prevent panics - Synchronized read of filerAddresses length in error - s3api_server.go: - Add warning log when getFilerAddress returns empty Benefits: - No race conditions (passes race detector) - No panic on double-close - Better error diagnostics - Safe with discovery enabled - Production-hardened shutdown logic * Fix hardcoded http scheme and add panic recovery Address CodeRabbit review #3512114811: 1. Major: Fix hardcoded http:// scheme in Location URLs - Location URLs always used http:// regardless of client connection - HTTPS clients got http:// URLs (incorrect) - Fixed: Detect scheme from request - Check X-Forwarded-Proto header (for proxies) first - Check r.TLS != nil for direct HTTPS - Fallback to http for plain connections - Applied to all 4 CompleteMultipartUploadResult locations 2. Major: Add panic recovery to discovery goroutine - Long-running background goroutine could crash entire process - Panic in refreshFilerList would terminate program - Fixed: Add defer recover() with error logging - Goroutine failures now logged, not fatal 3. Note: Close() idempotency already implemented - Review flagged as duplicate issue - Already fixed in commit `3d7a65c7e` - sync.Once (closeDiscoveryOnce) prevents double-close panic - Safe to call Close() multiple times Changes: - filer_multipart.go: - Add getRequestScheme() helper function - Update all 4 Location URLs to use dynamic scheme - Format: scheme://host/bucket/key (was: http://...) - filer_client.go: - Add panic recovery to discoverFilers() - Log panics instead of crashing Benefits: - Correct scheme (https/http) in Location URLs - Works behind proxies (X-Forwarded-Proto) - No process crashes from discovery failures - Production-hardened background goroutine - Proper AWS S3 API compliance * Fix S3 WithFilerClient to use filer failover Critical fix for multi-filer deployments: Problem: - S3ApiServer.WithFilerClient() was creating direct connections to ONE filer - Used pb.WithGrpcClient() with single filer address - No failover - if that filer failed, ALL operations failed - Caused test failures: "bucket directory not found" - IAM Integration Tests failing with 500 Internal Error Root Cause: - WithFilerClient bypassed filerClient connection management - Always connected to getFilerAddress() (current filer only) - Didn't retry other filers on failure - All getEntry(), updateEntry(), etc. operations failed if current filer down Solution: 1. Added FilerClient.GetAllFilers() method - Returns snapshot of all filer addresses - Thread-safe copy to avoid races 2. Implemented withFilerClientFailover() - Try current filer first (fast path) - On failure, try all other filers - Log successful failover - Return error only if ALL filers fail 3. Updated WithFilerClient() - Use filerClient for failover when available - Fallback to direct connection for testing/init Impact: ✅ All S3 operations now support multi-filer failover ✅ Bucket metadata reads work with any available filer ✅ Entry operations (getEntry, updateEntry) failover automatically ✅ IAM tests should pass now ✅ Production-ready HA support Files Changed: - wdclient/filer_client.go: Add GetAllFilers() method - s3api/s3api_handlers.go: Implement failover logic This fixes the test failure where bucket operations failed when the primary filer was temporarily unavailable during cleanup. * Update current filer after successful failover Address code review: https://github.com/seaweedfs/seaweedfs/pull/7550#pullrequestreview-3512223723 Issue: After successful failover, the current filer index was not updated. This meant every subsequent request would still try the (potentially unhealthy) original filer first, then failover again. Solution: 1. Added FilerClient.SetCurrentFiler(addr) method: - Finds the index of specified filer address - Atomically updates filerIndex to point to it - Thread-safe with RLock 2. Call SetCurrentFiler after successful failover: - Update happens immediately after successful connection - Future requests start with the known-healthy filer - Reduces unnecessary failover attempts Benefits: ✅ Subsequent requests use healthy filer directly ✅ No repeated failover to same unhealthy filer ✅ Better performance - fast path hits healthy filer ✅ Comment now matches actual behavior * Integrate health tracking with S3 failover Address code review suggestion to leverage existing health tracking instead of simple iteration through all filers. Changes: 1. Added address-based health tracking API to FilerClient: - ShouldSkipUnhealthyFiler(addr) - check circuit breaker - RecordFilerSuccess(addr) - reset failure count - RecordFilerFailure(addr) - increment failure count These methods find the filer by address and delegate to existing WithHealth methods for actual health management. 2. Updated withFilerClientFailover to use health tracking: - Record success/failure for every filer attempt - Skip unhealthy filers during failover (circuit breaker) - Only try filers that haven't exceeded failure threshold - Automatic re-check after reset timeout Benefits:* ✅ Circuit breaker prevents wasting time on known-bad filers ✅ Health tracking shared across all operations ✅ Automatic recovery when unhealthy filers come back ✅ Reduced latency - skip filers in failure state ✅ Better visibility with health metrics Behavior: - Try current filer first (fast path) - If fails, record failure and try other HEALTHY filers - Skip filers with failureCount >= threshold (default 3) - Re-check unhealthy filers after resetTimeout (default 30s) - Record all successes/failures for health tracking * Update weed/wdclient/filer_client.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Enable filer discovery with empty filerGroup Empty filerGroup is a valid value representing the default group. The master client can discover filers even when filerGroup is empty. Change: - Remove the filerGroup != "" check in NewFilerClient - Keep only masterClient != nil check - Empty string will be passed to ListClusterNodes API as-is This enables filer discovery to work with the default group. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	3 months ago
tam-i13	b669607fcd	Add error list each entry func (#7485 ) * added error return in type ListEachEntryFunc * return error if errClose * fix fmt.Errorf * fix return errClose * use %w fmt.Errorf * added entry in messege error * add callbackErr in ListDirectoryEntries * fix error * add log * clear err when the scanner stops on io.EOF, so returning err doesn’t surface EOF as a failure. * more info in error * add ctx to logs, error handling * fix return eachEntryFunc * fix * fix log * fix return * fix foundationdb test s * fix eachEntryFunc * fix return resEachEntryFuncErr * Update weed/filer/filer.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update weed/filer/elastic/v7/elastic_store.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update weed/filer/hbase/hbase_store.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update weed/filer/foundationdb/foundationdb_store.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update weed/filer/ydb/ydb_store.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix * add scanErr --------- Co-authored-by: Roman Tamarov <r.tamarov@kryptonite.ru> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> Co-authored-by: chrislu <chris.lu@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	3 months ago
Chris Lu	c1b8d4bf0d	S3: adds FilerClient to use cached volume id (#7518 ) * adds FilerClient to use cached volume id * refactor: MasterClient embeds vidMapClient to eliminate ~150 lines of duplication - Create masterVolumeProvider that implements VolumeLocationProvider - MasterClient now embeds vidMapClient instead of maintaining duplicate cache logic - Removed duplicate methods: LookupVolumeIdsWithFallback, getStableVidMap, etc. - MasterClient still receives real-time updates via KeepConnected streaming - Updates call inherited addLocation/deleteLocation from vidMapClient - Benefits: DRY principle, shared singleflight, cache chain logic reused - Zero behavioral changes - only architectural improvement * refactor: mount uses FilerClient for efficient volume location caching - Add configurable vidMap cache size (default: 5 historical snapshots) - Add FilerClientOption struct for clean configuration * GrpcTimeout: default 5 seconds (prevents hanging requests) * UrlPreference: PreferUrl or PreferPublicUrl * CacheSize: number of historical vidMap snapshots (for volume moves) - NewFilerClient uses option struct for better API extensibility - Improved error handling in filerVolumeProvider.LookupVolumeIds: * Distinguish genuine 'not found' from communication failures * Log volumes missing from filer response * Return proper error context with volume count * Document that filer Locations lacks Error field (unlike master) - FilerClient.GetLookupFileIdFunction() handles URL preference automatically - Mount (WFS) creates FilerClient with appropriate options - Benefits for weed mount: * Singleflight: Deduplicates concurrent volume lookups * Cache history: Old volume locations available briefly when volumes move * Configurable cache depth: Tune for different deployment environments * Battle-tested vidMap cache with cache chain * Better concurrency handling with timeout protection * Improved error visibility and debugging - Old filer.LookupFn() kept for backward compatibility - Performance improvement for mount operations with high concurrency * fix: prevent vidMap swap race condition in LookupFileIdWithFallback - Hold vidMapLock.RLock() during entire vm.LookupFileId() call - Prevents resetVidMap() from swapping vidMap mid-operation - Ensures atomic access to the current vidMap instance - Added documentation warnings to getStableVidMap() about swap risks - Enhanced withCurrentVidMap() documentation for clarity This fixes a subtle race condition where: 1. Thread A: acquires lock, gets vm pointer, releases lock 2. Thread B: calls resetVidMap(), swaps vc.vidMap 3. Thread A: calls vm.LookupFileId() on old/stale vidMap While the old vidMap remains valid (in cache chain), holding the lock ensures we consistently use the current vidMap for the entire operation. * fix: FilerClient supports multiple filer addresses for high availability Critical fix: FilerClient now accepts []ServerAddress instead of single address - Prevents mount failure when first filer is down (regression fix) - Implements automatic failover to remaining filers - Uses round-robin with atomic index tracking (same pattern as WFS.WithFilerClient) - Retries all configured filers before giving up - Updates successful filer index for future requests Changes: - NewFilerClient([]pb.ServerAddress, ...) instead of (pb.ServerAddress, ...) - filerVolumeProvider references FilerClient for failover access - LookupVolumeIds tries all filers with util.Retry pattern - Mount passes all option.FilerAddresses for HA - S3 wraps single filer in slice for API consistency This restores the high availability that existed in the old implementation where mount would automatically failover between configured filers. * fix: restore leader change detection in KeepConnected stream loop Critical fix: Leader change detection was accidentally removed from the streaming loop - Master can announce leader changes during an active KeepConnected stream - Without this check, client continues talking to non-leader until connection breaks - This can lead to stale data or operational errors The check needs to be in TWO places: 1. Initial response (lines 178-187): Detect redirect on first connect 2. Stream loop (lines 203-209): Detect leader changes during active stream Restored the loop check that was accidentally removed during refactoring. This ensures the client immediately reconnects to new leader when announced. * improve: address code review findings on error handling and documentation 1. Master provider now preserves per-volume errors - Surface detailed errors from master (e.g., misconfiguration, deletion) - Return partial results with aggregated errors using errors.Join - Callers can now distinguish specific volume failures from general errors - Addresses issue of losing vidLoc.Error details 2. Document GetMaster initialization contract - Add comprehensive documentation explaining blocking behavior - Clarify that KeepConnectedToMaster must be started first - Provide typical initialization pattern example - Prevent confusing timeouts during warm-up 3. Document partial results API contract - LookupVolumeIdsWithFallback explicitly documents partial results - Clear examples of how to handle result + error combinations - Helps prevent callers from discarding valid partial results 4. Add safeguards to legacy filer.LookupFn - Add deprecation warning with migration guidance - Implement simple 10,000 entry cache limit - Log warning when limit reached - Recommend wdclient.FilerClient for new code - Prevents unbounded memory growth in long-running processes These changes improve API clarity and operational safety while maintaining backward compatibility. * fix: handle partial results correctly in LookupVolumeIdsWithFallback callers Two callers were discarding partial results by checking err before processing the result map. While these are currently single-volume lookups (so partial results aren't possible), the code was fragile and would break if we ever batched multiple volumes together. Changes: - Check result map FIRST, then conditionally check error - If volume is found in result, use it (ignore errors about other volumes) - If volume is NOT found and err != nil, include error context with %w - Add defensive comments explaining the pattern for future maintainers This makes the code: 1. Correct for future batched lookups 2. More informative (preserves underlying error details) 3. Consistent with filer_grpc_server.go which already handles this correctly Example: If looking up ["1", "2", "999"] and only 999 fails, callers looking for volumes 1 or 2 will succeed instead of failing unnecessarily. * improve: address remaining code review findings 1. Lazy initialize FilerClient in mount for proxy-only setups - Only create FilerClient when VolumeServerAccess != "filerProxy" - Avoids wasted work when all reads proxy through filer - filerClient is nil for proxy mode, initialized for direct access 2. Fix inaccurate deprecation comment in filer.LookupFn - Updated comment to reflect current behavior (10k bounded cache) - Removed claim of "unbounded growth" after adding size limit - Still directs new code to wdclient.FilerClient for better features 3. Audit all MasterClient usages for KeepConnectedToMaster - Verified all production callers start KeepConnectedToMaster early - Filer, Shell, Master, Broker, Benchmark, Admin all correct - IAM creates MasterClient but never uses it (harmless) - Test code doesn't need KeepConnectedToMaster (mocks) All callers properly follow the initialization pattern documented in GetMaster(), preventing unexpected blocking or timeouts. * fix: restore observability instrumentation in MasterClient During the refactoring, several important stats counters and logging statements were accidentally removed from tryConnectToMaster. These are critical for monitoring and debugging the health of master client connections. Restored instrumentation: 1. stats.MasterClientConnectCounter("total") - tracks all connection attempts 2. stats.MasterClientConnectCounter(FailedToKeepConnected) - when KeepConnected stream fails 3. stats.MasterClientConnectCounter(FailedToReceive) - when Recv() fails in loop 4. stats.MasterClientConnectCounter(Failed) - when overall gprcErr occurs 5. stats.MasterClientConnectCounter(OnPeerUpdate) - when peer updates detected Additionally restored peer update logging: - "+ filer@host noticed group.type address" for node additions - "- filer@host noticed group.type address" for node removals - Only logs updates matching the client's FilerGroup for noise reduction This information is valuable for: - Monitoring cluster health and connection stability - Debugging cluster membership changes - Tracking master failover and reconnection patterns - Identifying network issues between clients and masters No functional changes - purely observability restoration. * improve: implement gRPC-aware retry for FilerClient volume lookups The previous implementation used util.Retry which only retries errors containing the string "transport". This is insufficient for handling the full range of transient gRPC errors. Changes: 1. Added isRetryableGrpcError() to properly inspect gRPC status codes - Retries: Unavailable, DeadlineExceeded, ResourceExhausted, Aborted - Falls back to string matching for non-gRPC network errors 2. Replaced util.Retry with custom retry loop - 3 attempts with exponential backoff (1s, 1.5s, 2.25s) - Tries all N filers on each attempt (N3 total attempts max) - Fast-fails on non-retryable errors (NotFound, PermissionDenied, etc.) 3. Improved logging - Shows both filer attempt (x/N) and retry attempt (y/3) - Logs retry reason and wait time for debugging Benefits: - Better handling of transient gRPC failures (server restarts, load spikes) - Faster failure for permanent errors (no wasted retries) - More informative logs for troubleshooting - Maintains existing HA failover across multiple filers Example: If all 3 filers return Unavailable (server overload): - Attempt 1: try all 3 filers, wait 1s - Attempt 2: try all 3 filers, wait 1.5s - Attempt 3: try all 3 filers, fail Example: If filer returns NotFound (volume doesn't exist): - Attempt 1: try all 3 filers, fast-fail (no retry) fmt * improve: add circuit breaker to skip known-unhealthy filers The previous implementation tried all filers on every failure, including known-unhealthy ones. This wasted time retrying permanently down filers. Problem scenario (3 filers, filer0 is down): - Last successful: filer1 (saved as filerIndex=1) - Next lookup when filer1 fails: Retry 1: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout) Retry 2: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout) Retry 3: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout) Total wasted: 15 seconds on known-bad filer! Solution: Circuit breaker pattern - Track consecutive failures per filer (atomic int32) - Skip filers with 3+ consecutive failures - Re-check unhealthy filers every 30 seconds - Reset failure count on success New behavior: - filer0 fails 3 times → marked unhealthy - Future lookups skip filer0 for 30 seconds - After 30s, re-check filer0 (allows recovery) - If filer0 succeeds, reset failure count to 0 Benefits: 1. Avoids wasting time on known-down filers 2. Still sticks to last healthy filer (via filerIndex) 3. Allows recovery (30s re-check window) 4. No configuration needed (automatic) Implementation details: - filerHealth struct tracks failureCount (atomic) + lastFailureTime - shouldSkipUnhealthyFiler(): checks if we should skip this filer - recordFilerSuccess(): resets failure count to 0 - recordFilerFailure(): increments count, updates timestamp - Logs when skipping unhealthy filers (V(2) level) Example with circuit breaker: - filer0 down, saved filerIndex=1 (filer1 healthy) - Lookup 1: filer1(ok) → Done (0.01s) - Lookup 2: filer1(fail) → filer2(ok) → Done, save filerIndex=2 (0.01s) - Lookup 3: filer2(fail) → skip filer0 (unhealthy) → filer1(ok) → Done (0.01s) Much better than wasting 15s trying filer0 repeatedly! * fix: OnPeerUpdate should only process updates for matching FilerGroup Critical bug: The OnPeerUpdate callback was incorrectly moved outside the FilerGroup check when restoring observability instrumentation. This caused clients to process peer updates for ALL filer groups, not just their own. Problem: Before: mc.OnPeerUpdate only called for update.FilerGroup == mc.FilerGroup Bug: mc.OnPeerUpdate called for ALL updates regardless of FilerGroup Impact: - Multi-tenant deployments with separate filer groups would see cross-group updates (e.g., group A clients processing group B updates) - Could cause incorrect cluster membership tracking - OnPeerUpdate handlers (like Filer's DLM ring updates) would receive irrelevant updates from other groups Example scenario: Cluster has two filer groups: "production" and "staging" Production filer connects with FilerGroup="production" Incorrect behavior (bug): - Receives "staging" group updates - Incorrectly adds staging filers to production DLM ring - Cross-tenant data access issues Correct behavior (fixed): - Only receives "production" group updates - Only adds production filers to production DLM ring - Proper isolation between groups Fix: Moved mc.OnPeerUpdate(update, time.Now()) back INSIDE the FilerGroup check where it belongs, matching the original implementation. The logging and stats counter were already correctly scoped to matching FilerGroup, so they remain inside the if block as intended. * improve: clarify Aborted error handling in volume lookups Added documentation and logging to address the concern that codes.Aborted might not always be retryable in all contexts. Context-specific justification for treating Aborted as retryable: Volume location lookups (LookupVolume RPC) are simple, read-only operations: - No transactions - No write conflicts - No application-level state changes - Idempotent (safe to retry) In this context, Aborted is most likely caused by: - Filer restarting/recovering (transient) - Connection interrupted mid-request (transient) - Server-side resource cleanup (transient) NOT caused by: - Application-level conflicts (no writes) - Transaction failures (no transactions) - Logical errors (read-only lookup) Changes: 1. Added detailed comment explaining the context-specific reasoning 2. Added V(1) logging when treating Aborted as retryable - Helps detect misclassification if it occurs - Visible in verbose logs for troubleshooting 3. Split switch statement for clarity (one case per line) If future analysis shows Aborted should not be retried, operators will now have visibility via logs to make that determination. The logging provides evidence for future tuning decisions. Alternative approaches considered but not implemented: - Removing Aborted entirely (too conservative for read-only ops) - Message content inspection (adds complexity, no known patterns yet) - Different handling per RPC type (premature optimization) * fix: IAM server must start KeepConnectedToMaster for masterClient usage The IAM server creates and uses a MasterClient but never started KeepConnectedToMaster, which could cause blocking if IAM config files have chunks requiring volume lookups. Problem flow: NewIamApiServerWithStore() → creates masterClient → ❌ NEVER starts KeepConnectedToMaster GetS3ApiConfigurationFromFiler() → filer.ReadEntry(iama.masterClient, ...) → StreamContent(masterClient, ...) if file has chunks → masterClient.GetLookupFileIdFunction() → GetMaster(ctx) ← BLOCKS indefinitely waiting for connection! While IAM config files (identity & policies) are typically small and stored inline without chunks, the code path exists and would block if the files ever had chunks. Fix: Start KeepConnectedToMaster in background goroutine right after creating masterClient, following the documented pattern: mc := wdclient.NewMasterClient(...) go mc.KeepConnectedToMaster(ctx) This ensures masterClient is usable if ReadEntry ever needs to stream chunked content from volume servers. Note: This bug was dormant because IAM config files are small (<256 bytes) and SeaweedFS stores small files inline in Entry.Content, not as chunks. The bug would only manifest if: - IAM config grew > 256 bytes (inline threshold) - Config was stored as chunks on volume servers - ReadEntry called StreamContent - GetMaster blocked indefinitely Now all 9 production MasterClient instances correctly follow the pattern. * fix: data race on filerHealth.lastFailureTime in circuit breaker The circuit breaker tracked lastFailureTime as time.Time, which was written in recordFilerFailure and read in shouldSkipUnhealthyFiler without synchronization, causing a data race. Data race scenario: Goroutine 1: recordFilerFailure(0) health.lastFailureTime = time.Now() // ❌ unsynchronized write Goroutine 2: shouldSkipUnhealthyFiler(0) time.Since(health.lastFailureTime) // ❌ unsynchronized read → RACE DETECTED by -race detector Fix: Changed lastFailureTime from time.Time to int64 (lastFailureTimeNs) storing Unix nanoseconds for atomic access: Write side (recordFilerFailure): atomic.StoreInt64(&health.lastFailureTimeNs, time.Now().UnixNano()) Read side (shouldSkipUnhealthyFiler): lastFailureNs := atomic.LoadInt64(&health.lastFailureTimeNs) if lastFailureNs == 0 { return false } // Never failed lastFailureTime := time.Unix(0, lastFailureNs) time.Since(lastFailureTime) > 30time.Second Benefits: - Atomic reads/writes (no data race) - Efficient (int64 is 8 bytes, always atomic on 64-bit systems) - Zero value (0) naturally means "never failed" - No mutex needed (lock-free circuit breaker) Note: sync/atomic was already imported for failureCount, so no new import needed. fix: create fresh timeout context for each filer retry attempt The timeout context was created once at function start and reused across all retry attempts, causing subsequent retries to run with progressively shorter (or expired) deadlines. Problem flow: Line 244: timeoutCtx, cancel := context.WithTimeout(ctx, 5s) defer cancel() Retry 1, filer 0: client.LookupVolume(timeoutCtx, ...) ← 5s available ✅ Retry 1, filer 1: client.LookupVolume(timeoutCtx, ...) ← 3s left Retry 1, filer 2: client.LookupVolume(timeoutCtx, ...) ← 0.5s left Retry 2, filer 0: client.LookupVolume(timeoutCtx, ...) ← EXPIRED! ❌ Result: Retries always fail with DeadlineExceeded, defeating the purpose of retries. Fix: Moved context.WithTimeout inside the per-filer loop, creating a fresh timeout context for each attempt: for x := 0; x < n; x++ { timeoutCtx, cancel := context.WithTimeout(ctx, fc.grpcTimeout) err := pb.WithGrpcFilerClient(..., func(client) { resp, err := client.LookupVolume(timeoutCtx, ...) ... }) cancel() // Clean up immediately after call } Benefits: - Each filer attempt gets full fc.grpcTimeout (default 5s) - Retries actually have time to complete - No context leaks (cancel called after each attempt) - More predictable timeout behavior Example with fix: Retry 1, filer 0: fresh 5s timeout ✅ Retry 1, filer 1: fresh 5s timeout ✅ Retry 2, filer 0: fresh 5s timeout ✅ Total max time: 3 retries × 3 filers × 5s = 45s (plus backoff) Note: The outer ctx (from caller) still provides overall cancellation if the caller cancels or times out the entire operation. * fix: always reset vidMap cache on master reconnection The previous refactoring removed the else block that resets vidMap when the first message from a newly connected master is not a VolumeLocation. Problem scenario: 1. Client connects to master-1 and builds vidMap cache 2. Master-1 fails, client connects to master-2 3. First message from master-2 is a ClusterNodeUpdate (not VolumeLocation) 4. Old code: vidMap is reset and updated ✅ 5. New code: vidMap is NOT reset ❌ 6. Result: Client uses stale cache from master-1 → data access errors Example flow with bug: Connect to master-2 First message: ClusterNodeUpdate {filer.x added} → No resetVidMap() call → vidMap still has master-1's stale volume locations → Client reads from wrong volume servers → 404 errors Fix: Restored the else block that resets vidMap when first message is not a VolumeLocation: if resp.VolumeLocation != nil { // ... check leader, reset, and update ... } else { // First message is ClusterNodeUpdate or other type // Must still reset to avoid stale data mc.resetVidMap() } This ensures the cache is always cleared when establishing a new master connection, regardless of what the first message type is. Root cause: During the vidMapClient refactoring, this else block was accidentally dropped, making failover behavior fragile and non-deterministic (depends on which message type arrives first from the new master). Impact: - High severity for master failover scenarios - Could cause read failures, 404s, or wrong data access - Only manifests when first message is not VolumeLocation * fix: goroutine and connection leak in IAM server shutdown The IAM server's KeepConnectedToMaster goroutine used context.Background(), which is non-cancellable, causing the goroutine and its gRPC connections to leak on server shutdown. Problem: go masterClient.KeepConnectedToMaster(context.Background()) - context.Background() never cancels - KeepConnectedToMaster goroutine runs forever - gRPC connection to master stays open - No way to stop cleanly on server shutdown Result: Resource leaks when IAM server is stopped Fix: 1. Added shutdownContext and shutdownCancel to IamApiServer struct 2. Created cancellable context in NewIamApiServerWithStore: shutdownCtx, shutdownCancel := context.WithCancel(context.Background()) 3. Pass shutdownCtx to KeepConnectedToMaster: go masterClient.KeepConnectedToMaster(shutdownCtx) 4. Added Shutdown() method to invoke cancel: func (iama IamApiServer) Shutdown() { if iama.shutdownCancel != nil { iama.shutdownCancel() } } 5. Stored masterClient reference on IamApiServer for future use Benefits: - Goroutine stops cleanly when Shutdown() is called - gRPC connections are closed properly - No resource leaks on server restart/stop - Shutdown() is idempotent (safe to call multiple times) Usage (for future graceful shutdown): iamServer, _ := iamapi.NewIamApiServer(...) defer iamServer.Shutdown() // or in signal handler: sigChan := make(chan os.Signal, 1) signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT) go func() { <-sigChan iamServer.Shutdown() os.Exit(0) }() Note: Current command implementations (weed/command/iam.go) don't have shutdown paths yet, but this makes IAM server ready for proper lifecycle management when that infrastructure is added. refactor: remove unnecessary KeepMasterClientConnected wrapper in filer The Filer.KeepMasterClientConnected() method was an unnecessary wrapper that just forwarded to MasterClient.KeepConnectedToMaster(). This wrapper added no value and created inconsistency with other components that call KeepConnectedToMaster directly. Removed: filer.go:178-180 func (fs Filer) KeepMasterClientConnected(ctx context.Context) { fs.MasterClient.KeepConnectedToMaster(ctx) } Updated caller: filer_server.go:181 - go fs.filer.KeepMasterClientConnected(context.Background()) + go fs.filer.MasterClient.KeepConnectedToMaster(context.Background()) Benefits: - Consistent with other components (S3, IAM, Shell, Mount) - Removes unnecessary indirection - Clearer that KeepConnectedToMaster runs in background goroutine - Follows the documented pattern from MasterClient.GetMaster() Note: shell/commands.go was verified and already correctly starts KeepConnectedToMaster in a background goroutine (shell_liner.go:51): go commandEnv.MasterClient.KeepConnectedToMaster(ctx) fix: use client ID instead of timeout for gRPC signature parameter The pb.WithGrpcFilerClient signature parameter is meant to be a client identifier for logging and tracking (added as 'sw-client-id' gRPC metadata in streaming mode), not a timeout value. Problem: timeoutMs := int32(fc.grpcTimeout.Milliseconds()) // 5000 (5 seconds) err := pb.WithGrpcFilerClient(false, timeoutMs, filerAddress, ...) - Passing timeout (5000ms) as signature/client ID - Misuse of API: signature should be a unique client identifier - Timeout is already handled by timeoutCtx passed to gRPC call - Inconsistent with other callers (all use 0 or proper client ID) How WithGrpcFilerClient uses signature parameter: func WithGrpcClient(..., signature int32, ...) { if streamingMode && signature != 0 { md := metadata.New(map[string]string{"sw-client-id": fmt.Sprintf("%d", signature)}) ctx = metadata.NewOutgoingContext(ctx, md) } ... } It's for client identification, not timeout control! Fix: 1. Added clientId int32 field to FilerClient struct 2. Initialize with rand.Int31() in NewFilerClient for unique ID 3. Removed timeoutMs variable (and misleading comment) 4. Use fc.clientId in pb.WithGrpcFilerClient call Before: err := pb.WithGrpcFilerClient(false, timeoutMs, ...) ^^^^^^^^^ Wrong! (5000) After: err := pb.WithGrpcFilerClient(false, fc.clientId, ...) ^^^^^^^^^^^^ Correct! (random int31) Benefits: - Correct API usage (signature = client ID, not timeout) - Timeout still works via timeoutCtx (unchanged) - Consistent with other pb.WithGrpcFilerClient callers - Enables proper client tracking on filer side via gRPC metadata - Each FilerClient instance has unique ID for debugging Examples of correct usage elsewhere: weed/iamapi/iamapi_server.go:145 pb.WithGrpcFilerClient(false, 0, ...) weed/command/s3.go:215 pb.WithGrpcFilerClient(false, 0, ...) weed/shell/commands.go:110 pb.WithGrpcFilerClient(streamingMode, 0, ...) All use 0 (or a proper signature), not a timeout value. * fix: add timeout to master volume lookup to prevent indefinite blocking The masterVolumeProvider.LookupVolumeIds method was using the context directly without a timeout, which could cause it to block indefinitely if the master is slow to respond or unreachable. Problem: err := pb.WithMasterClient(false, p.masterClient.GetMaster(ctx), ...) resp, err := client.LookupVolume(ctx, &master_pb.LookupVolumeRequest{...}) - No timeout on gRPC call to master - Could block indefinitely if master is unresponsive - Inconsistent with FilerClient which uses 5s timeout - This is a fallback path (cache miss) but still needs protection Scenarios where this could hang: 1. Master server under heavy load (slow response) 2. Network issues between client and master 3. Master server hung or deadlocked 4. Master in process of shutting down Fix: timeoutCtx, cancel := context.WithTimeout(ctx, 5time.Second) defer cancel() err := pb.WithMasterClient(false, p.masterClient.GetMaster(timeoutCtx), ...) resp, err := client.LookupVolume(timeoutCtx, &master_pb.LookupVolumeRequest{...}) Benefits: - Prevents indefinite blocking on master lookup - Consistent with FilerClient timeout pattern (5 seconds) - Faster failure detection when master is unresponsive - Caller's context still honored (timeout is in addition, not replacement) - Improves overall system resilience Note: 5 seconds is a reasonable default for volume lookups: - Long enough for normal master response (~10-50ms) - Short enough to fail fast on issues - Matches FilerClient's grpcTimeout default purge * refactor: address code review feedback on comments and style Fixed several code quality issues identified during review: 1. Corrected backoff algorithm description in filer_client.go: - Changed "Exponential backoff" to "Multiplicative backoff with 1.5x factor" - The formula waitTime * 3/2 produces 1s, 1.5s, 2.25s, not exponential 2^n - More accurate terminology prevents confusion 2. Removed redundant nil check in vidmap_client.go: - After the for loop, node is guaranteed to be non-nil - Loop either returns early or assigns non-nil value to node - Simplified: if node != nil { node.cache.Store(nil) } → node.cache.Store(nil) 3. Added startup logging to IAM server for consistency: - Log when master client connection starts - Matches pattern in S3ApiServer (line 100 in s3api_server.go) - Improves operational visibility during startup - Added missing glog import 4. Fixed indentation in filer/reader_at.go: - Lines 76-91 had incorrect indentation (extra tab level) - Line 93 also misaligned - Now properly aligned with surrounding code 5. Updated deprecation comment to follow Go convention: - Changed "DEPRECATED:" to "Deprecated:" (standard Go format) - Tools like staticcheck and IDEs recognize the standard format - Enables automated deprecation warnings in tooling - Better developer experience All changes are cosmetic and do not affect functionality. * fmt * refactor: make circuit breaker parameters configurable in FilerClient The circuit breaker failure threshold (3) and reset timeout (30s) were hardcoded, making it difficult to tune the client's behavior in different deployment environments without modifying the code. Problem: func shouldSkipUnhealthyFiler(index int32) bool { if failureCount < 3 { // Hardcoded threshold return false } if time.Since(lastFailureTime) > 30time.Second { // Hardcoded timeout return false } } Different environments have different needs: - High-traffic production: may want lower threshold (2) for faster failover - Development/testing: may want higher threshold (5) to tolerate flaky networks - Low-latency services: may want shorter reset timeout (10s) - Batch processing: may want longer reset timeout (60s) Solution: 1. Added fields to FilerClientOption: - FailureThreshold int32 (default: 3) - ResetTimeout time.Duration (default: 30s) 2. Added fields to FilerClient: - failureThreshold int32 - resetTimeout time.Duration 3. Applied defaults in NewFilerClient with option override: failureThreshold := int32(3) resetTimeout := 30 time.Second if opt.FailureThreshold > 0 { failureThreshold = opt.FailureThreshold } if opt.ResetTimeout > 0 { resetTimeout = opt.ResetTimeout } 4. Updated shouldSkipUnhealthyFiler to use configurable values: if failureCount < fc.failureThreshold { ... } if time.Since(lastFailureTime) > fc.resetTimeout { ... } Benefits: ✓ Tunable for different deployment environments ✓ Backward compatible (defaults match previous hardcoded values) ✓ No breaking changes to existing code ✓ Better maintainability and flexibility Example usage: // Aggressive failover for low-latency production fc := wdclient.NewFilerClient(filers, dialOpt, dc, &wdclient.FilerClientOption{ FailureThreshold: 2, ResetTimeout: 10 * time.Second, }) // Tolerant of flaky networks in development fc := wdclient.NewFilerClient(filers, dialOpt, dc, &wdclient.FilerClientOption{ FailureThreshold: 5, ResetTimeout: 60 * time.Second, }) * retry parameters * refactor: make retry and timeout parameters configurable Made retry logic and gRPC timeouts configurable across FilerClient and MasterClient to support different deployment environments and network conditions. Problem 1: Hardcoded retry parameters in FilerClient waitTime := time.Second // Fixed at 1s maxRetries := 3 // Fixed at 3 attempts waitTime = waitTime * 3 / 2 // Fixed 1.5x multiplier Different environments have different needs: - Unstable networks: may want more retries (5) with longer waits (2s) - Low-latency production: may want fewer retries (2) with shorter waits (500ms) - Batch processing: may want exponential backoff (2x) instead of 1.5x Problem 2: Hardcoded gRPC timeout in MasterClient timeoutCtx, cancel := context.WithTimeout(ctx, 5time.Second) Master lookups may need different timeouts: - High-latency cross-region: may need 10s timeout - Local network: may use 2s timeout for faster failure detection Solution for FilerClient: 1. Added fields to FilerClientOption: - MaxRetries int (default: 3) - InitialRetryWait time.Duration (default: 1s) - RetryBackoffFactor float64 (default: 1.5) 2. Added fields to FilerClient: - maxRetries int - initialRetryWait time.Duration - retryBackoffFactor float64 3. Updated LookupVolumeIds to use configurable values: waitTime := fc.initialRetryWait maxRetries := fc.maxRetries for retry := 0; retry < maxRetries; retry++ { ... waitTime = time.Duration(float64(waitTime) fc.retryBackoffFactor) } Solution for MasterClient: 1. Added grpcTimeout field to MasterClient (default: 5s) 2. Initialize in NewMasterClient with 5 * time.Second default 3. Updated masterVolumeProvider to use p.masterClient.grpcTimeout Benefits: ✓ Tunable for different network conditions and deployment scenarios ✓ Backward compatible (defaults match previous hardcoded values) ✓ No breaking changes to existing code ✓ Consistent configuration pattern across FilerClient and MasterClient Example usage: // Fast-fail for low-latency production with stable network fc := wdclient.NewFilerClient(filers, dialOpt, dc, &wdclient.FilerClientOption{ MaxRetries: 2, InitialRetryWait: 500 * time.Millisecond, RetryBackoffFactor: 2.0, // Exponential backoff GrpcTimeout: 2 * time.Second, }) // Patient retries for unstable network or batch processing fc := wdclient.NewFilerClient(filers, dialOpt, dc, &wdclient.FilerClientOption{ MaxRetries: 5, InitialRetryWait: 2 * time.Second, RetryBackoffFactor: 1.5, GrpcTimeout: 10 * time.Second, }) Note: MasterClient timeout is currently set at construction time and not user-configurable via NewMasterClient parameters. Future enhancement could add a MasterClientOption struct similar to FilerClientOption. * fix: rename vicCacheLock to vidCacheLock for consistency Fixed typo in variable name for better code consistency and readability. Problem: vidCache := make(map[string]filer_pb.Locations) var vicCacheLock sync.RWMutex // Typo: vic instead of vid vicCacheLock.RLock() locations, found := vidCache[vid] vicCacheLock.RUnlock() The variable name 'vicCacheLock' is inconsistent with 'vidCache'. Both should use 'vid' prefix (volume ID) not 'vic'. Fix: Renamed all 5 occurrences: - var vicCacheLock → var vidCacheLock (line 56) - vicCacheLock.RLock() → vidCacheLock.RLock() (line 62) - vicCacheLock.RUnlock() → vidCacheLock.RUnlock() (line 64) - vicCacheLock.Lock() → vidCacheLock.Lock() (line 81) - vicCacheLock.Unlock() → vidCacheLock.Unlock() (line 91) Benefits: ✓ Consistent variable naming convention ✓ Clearer intent (volume ID cache lock) ✓ Better code readability ✓ Easier code navigation fix: use defer cancel() with anonymous function for proper context cleanup Fixed context cancellation to use defer pattern correctly in loop iteration. Problem: for x := 0; x < n; x++ { timeoutCtx, cancel := context.WithTimeout(ctx, fc.grpcTimeout) err := pb.WithGrpcFilerClient(...) cancel() // Only called on normal return, not on panic } Issues with original approach: 1. If pb.WithGrpcFilerClient panics, cancel() is never called → context leak 2. If callback returns early (though unlikely here), cleanup might be missed 3. Not following Go best practices for context.WithTimeout usage Problem with naive defer in loop: for x := 0; x < n; x++ { timeoutCtx, cancel := context.WithTimeout(ctx, fc.grpcTimeout) defer cancel() // ❌ WRONG: All defers accumulate until function returns } In Go, defer executes when the surrounding function returns, not when the loop iteration ends. This would accumulate n deferred cancel() calls and leak contexts until LookupVolumeIds returns. Solution: Wrap in anonymous function for x := 0; x < n; x++ { err := func() error { timeoutCtx, cancel := context.WithTimeout(ctx, fc.grpcTimeout) defer cancel() // ✅ Executes when anonymous function returns (per iteration) return pb.WithGrpcFilerClient(...) }() } Benefits: ✓ Context always cancelled, even on panic ✓ defer executes after each iteration (not accumulated) ✓ Follows Go best practices for context.WithTimeout ✓ No resource leaks during retry loop execution ✓ Cleaner error handling Reference: Go documentation for context.WithTimeout explicitly shows: ctx, cancel := context.WithTimeout(...) defer cancel() This is the idiomatic pattern that should always be followed. * Can't use defer directly in loop * improve: add data center preference and URL shuffling for consistent performance Added missing data center preference and load distribution (URL shuffling) to ensure consistent performance and behavior across all code paths. Problem 1: PreferPublicUrl path missing DC preference and shuffling Location: weed/wdclient/filer_client.go lines 184-192 The custom PreferPublicUrl implementation was simply iterating through locations and building URLs without considering: 1. Data center proximity (latency optimization) 2. Load distribution across volume servers Before: for _, loc := range locations { url := loc.PublicUrl if url == "" { url = loc.Url } fullUrls = append(fullUrls, "http://"+url+"/"+fileId) } return fullUrls, nil After: var sameDcUrls, otherDcUrls []string dataCenter := fc.GetDataCenter() for _, loc := range locations { url := loc.PublicUrl if url == "" { url = loc.Url } httpUrl := "http://" + url + "/" + fileId if dataCenter != "" && dataCenter == loc.DataCenter { sameDcUrls = append(sameDcUrls, httpUrl) } else { otherDcUrls = append(otherDcUrls, httpUrl) } } rand.Shuffle(len(sameDcUrls), ...) rand.Shuffle(len(otherDcUrls), ...) fullUrls = append(sameDcUrls, otherDcUrls...) Problem 2: Cache miss path missing URL shuffling Location: weed/wdclient/vidmap_client.go lines 95-108 The cache miss path (fallback lookup) was missing URL shuffling, while the cache hit path (vm.LookupFileId) already shuffles URLs. This inconsistency meant: - Cache hit: URLs shuffled → load distributed - Cache miss: URLs not shuffled → first server always hit Before: var sameDcUrls, otherDcUrls []string // ... build URLs ... fullUrls = append(sameDcUrls, otherDcUrls...) return fullUrls, nil After: var sameDcUrls, otherDcUrls []string // ... build URLs ... rand.Shuffle(len(sameDcUrls), ...) rand.Shuffle(len(otherDcUrls), ...) fullUrls = append(sameDcUrls, otherDcUrls...) return fullUrls, nil Benefits: ✓ Reduced latency by preferring same-DC volume servers ✓ Even load distribution across all volume servers ✓ Consistent behavior between cache hit/miss paths ✓ Consistent behavior between PreferUrl and PreferPublicUrl ✓ Matches behavior of existing vidMap.LookupFileId implementation Impact on performance: - Lower read latency (same-DC preference) - Better volume server utilization (load spreading) - No single volume server becomes a hotspot Note: Added math/rand import to vidmap_client.go for shuffle support. * Update weed/wdclient/masterclient.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * improve: call IAM server Shutdown() for best-effort cleanup Added call to iamApiServer.Shutdown() to ensure cleanup happens when possible, and documented the limitations of the current approach. Problem: The Shutdown() method was defined in IamApiServer but never called anywhere, meaning the KeepConnectedToMaster goroutine would continue running even when the IAM server stopped, causing resource leaks. Changes: 1. Store iamApiServer instance in weed/command/iam.go - Changed: _, iamApiServer_err := iamapi.NewIamApiServer(...) - To: iamApiServer, iamApiServer_err := iamapi.NewIamApiServer(...) 2. Added defer call for best-effort cleanup - defer iamApiServer.Shutdown() - This will execute if startIamServer() returns normally 3. Added logging in Shutdown() method - Log when shutdown is triggered for visibility 4. Documented limitations and future improvements - Added note that defer only works for normal function returns - SeaweedFS commands don't currently have signal handling - Suggested future enhancement: add SIGTERM/SIGINT handling Current behavior: - ✓ Cleanup happens if HTTP server fails to start (glog.Fatalf path) - ✓ Cleanup happens if Serve() returns with error (unlikely) - ✗ Cleanup does NOT happen on SIGTERM/SIGINT (process killed) The last case is a limitation of the current command architecture - all SeaweedFS commands (s3, filer, volume, master, iam) lack signal handling for graceful shutdown. This is a systemic issue that affects all services. Future enhancement: To properly handle SIGTERM/SIGINT, the command layer would need: sigChan := make(chan os.Signal, 1) signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT) go func() { httpServer.Serve(listener) // Non-blocking }() <-sigChan glog.V(0).Infof("Received shutdown signal") iamApiServer.Shutdown() httpServer.Shutdown(context.Background()) This would require refactoring the command structure for all services, which is out of scope for this change. Benefits of current approach: ✓ Best-effort cleanup (better than nothing) ✓ Proper cleanup in error paths ✓ Documented for future improvement ✓ Consistent with how other SeaweedFS services handle lifecycle * data racing in test --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	3 months ago
Chris Lu	c6b6ea40e6	filer store: add foundationdb (#7178 ) * add foundationdb * Update foundationdb_store.go * fix * apply the patch * avoid panic on error * address comments * remove extra data * address comments * adds more debug messages * fix range listing * delete with prefix range; list with right start key * fix docker files * use the more idiomatic FoundationDB KeySelectors * address comments * proper errors * fix API versions * more efficient * recursive deletion * clean up * clean up * pagination, one transaction for deletion * error checking * Use fdb.Strinc() to compute the lexicographically next string and create a proper range * fix docker * Update README.md * delete in batches * delete in batches * fix build * add foundationdb build * Updated FoundationDB Version * Fixed glibc/musl Incompatibility (Alpine → Debian) * Update container_foundationdb_version.yml * build SeaweedFS * build tag * address comments * separate transaction * address comments * fix build * empty vs no data * fixes * add go test * Install FoundationDB client libraries * nil compare	3 months ago
Chris Lu	ca84a8a713	S3: Directly read write volume servers (#7481 ) * Lazy Versioning Check, Conditional SSE Entry Fetch, HEAD Request Optimization * revert Reverted the conditional versioning check to always check versioning status Reverted the conditional SSE entry fetch to always fetch entry metadata Reverted the conditional versioning check to always check versioning status Reverted the conditional SSE entry fetch to always fetch entry metadata * Lazy Entry Fetch for SSE, Skip Conditional Header Check * SSE-KMS headers are present, this is not an SSE-C request (mutually exclusive) * SSE-C is mutually exclusive with SSE-S3 and SSE-KMS * refactor * Removed Premature Mutual Exclusivity Check * check for the presence of the X-Amz-Server-Side-Encryption header * not used * fmt * directly read write volume servers * HTTP Range Request Support * set header * md5 * copy object * fix sse * fmt * implement sse * sse continue * fixed the suffix range bug (bytes=-N for "last N bytes") * debug logs * Missing PartsCount Header * profiling * url encoding * test_multipart_get_part * headers * debug * adjust log level * handle part number * Update s3api_object_handlers.go * nil safety * set ModifiedTsNs * remove * nil check * fix sse header * same logic as filer * decode values * decode ivBase64 * s3: Fix SSE decryption JWT authentication and streaming errors Critical fix for SSE (Server-Side Encryption) test failures: 1. JWT Authentication Bug (Root Cause): - Changed from GenJwtForFilerServer to GenJwtForVolumeServer - S3 API now uses correct JWT when directly reading from volume servers - Matches filer's authentication pattern for direct volume access - Fixes 'unexpected EOF' and 500 errors in SSE tests 2. Streaming Error Handling: - Added error propagation in getEncryptedStreamFromVolumes goroutine - Use CloseWithError() to properly communicate stream failures - Added debug logging for streaming errors 3. Response Header Timing: - Removed premature WriteHeader(http.StatusOK) call - Let Go's http package write status automatically on first write - Prevents header lock when errors occur during streaming 4. Enhanced SSE Decryption Debugging: - Added IV/Key validation and logging for SSE-C, SSE-KMS, SSE-S3 - Better error messages for missing or invalid encryption metadata - Added glog.V(2) debugging for decryption setup This fixes SSE integration test failures where encrypted objects could not be retrieved due to volume server authentication failures. The JWT bug was causing volume servers to reject requests, resulting in truncated/empty streams (EOF) or internal errors. * s3: Fix SSE multipart upload metadata preservation Critical fix for SSE multipart upload test failures (SSE-C and SSE-KMS): Root Cause - Incomplete SSE Metadata Copying: The old code only tried to copy 'SeaweedFSSSEKMSKey' from the first part to the completed object. This had TWO bugs: 1. Wrong Constant Name (Key Mismatch Bug): - Storage uses: SeaweedFSSSEKMSKeyHeader = 'X-SeaweedFS-SSE-KMS-Key' - Old code read: SeaweedFSSSEKMSKey = 'x-seaweedfs-sse-kms-key' - Result: SSE-KMS metadata was NEVER copied → 500 errors 2. Missing SSE-C and SSE-S3 Headers: - SSE-C requires: IV, Algorithm, KeyMD5 - SSE-S3 requires: encrypted key data + standard headers - Old code: copied nothing for SSE-C/SSE-S3 → decryption failures Fix - Complete SSE Header Preservation: Now copies ALL SSE headers from first part to completed object: - SSE-C: SeaweedFSSSEIV, CustomerAlgorithm, CustomerKeyMD5 - SSE-KMS: SeaweedFSSSEKMSKeyHeader, AwsKmsKeyId, ServerSideEncryption - SSE-S3: SeaweedFSSSES3Key, ServerSideEncryption Applied consistently to all 3 code paths: 1. Versioned buckets (creates version file) 2. Suspended versioning (creates main object with null versionId) 3. Non-versioned buckets (creates main object) Why This Is Correct: The headers copied EXACTLY match what putToFiler stores during part upload (lines 496-521 in s3api_object_handlers_put.go). This ensures detectPrimarySSEType() can correctly identify encrypted multipart objects and trigger inline decryption with proper metadata. Fixes: TestSSEMultipartUploadIntegration (SSE-C and SSE-KMS subtests) * s3: Add debug logging for versioning state diagnosis Temporary debug logging to diagnose test_versioning_obj_plain_null_version_overwrite_suspended failure. Added glog.V(0) logging to show: 1. setBucketVersioningStatus: when versioning status is changed 2. PutObjectHandler: what versioning state is detected (Enabled/Suspended/none) 3. PutObjectHandler: which code path is taken (putVersionedObject vs putSuspendedVersioningObject) This will help identify if: - The versioning status is being set correctly in bucket config - The cache is returning stale/incorrect versioning state - The switch statement is correctly routing to suspended vs enabled handlers * s3: Enhanced versioning state tracing for suspended versioning diagnosis Added comprehensive logging across the entire versioning state flow: PutBucketVersioningHandler: - Log requested status (Enabled/Suspended) - Log when calling setBucketVersioningStatus - Log success/failure of status change setBucketVersioningStatus: - Log bucket and status being set - Log when config is updated - Log completion with error code updateBucketConfig: - Log versioning state being written to cache - Immediate cache verification after Set - Log if cache verification fails getVersioningState: - Log bucket name and state being returned - Log if object lock forces VersioningEnabled - Log errors This will reveal: 1. If PutBucketVersioning(Suspended) is reaching the handler 2. If the cache update succeeds 3. What state getVersioningState returns during PUT 4. Any cache consistency issues Expected to show why bucket still reports 'Enabled' after 'Suspended' call. * s3: Add SSE chunk detection debugging for multipart uploads Added comprehensive logging to diagnose why TestSSEMultipartUploadIntegration fails: detectPrimarySSEType now logs: 1. Total chunk count and extended header count 2. All extended headers with 'sse'/'SSE'/'encryption' in the name 3. For each chunk: index, SseType, and whether it has metadata 4. Final SSE type counts (SSE-C, SSE-KMS, SSE-S3) This will reveal if: - Chunks are missing SSE metadata after multipart completion - Extended headers are copied correctly from first part - The SSE detection logic is working correctly Expected to show if chunks have SseType=0 (none) or proper SSE types set. * s3: Trace SSE chunk metadata through multipart completion and retrieval Added end-to-end logging to track SSE chunk metadata lifecycle: During Multipart Completion (filer_multipart.go): 1. Log finalParts chunks BEFORE mkFile - shows SseType and metadata 2. Log versionEntry.Chunks INSIDE mkFile callback - shows if mkFile preserves SSE info 3. Log success after mkFile completes During GET Retrieval (s3api_object_handlers.go): 1. Log retrieved entry chunks - shows SseType and metadata after retrieval 2. Log detected SSE type result This will reveal at which point SSE chunk metadata is lost: - If finalParts have SSE metadata but versionEntry.Chunks don't → mkFile bug - If versionEntry.Chunks have SSE metadata but retrieved chunks don't → storage/retrieval bug - If chunks never have SSE metadata → multipart completion SSE processing bug Expected to show chunks with SseType=NONE during retrieval even though they were created with proper SseType during multipart completion. * s3: Fix SSE-C multipart IV base64 decoding bug Critical Bug Found: SSE-C multipart uploads were failing because: Root Cause: - entry.Extended[SeaweedFSSSEIV] stores base64-encoded IV (24 bytes for 16-byte IV) - SerializeSSECMetadata expects raw IV bytes (16 bytes) - During multipart completion, we were passing base64 IV directly → serialization error Error Message: "Failed to serialize SSE-C metadata for chunk in part X: invalid IV length: expected 16 bytes, got 24" Fix: - Base64-decode IV before passing to SerializeSSECMetadata - Added error handling for decode failures Impact: - SSE-C multipart uploads will now correctly serialize chunk metadata - Chunks will have proper SSE metadata for decryption during GET This fixes the SSE-C subtest of TestSSEMultipartUploadIntegration. SSE-KMS still has a separate issue (error code 23) being investigated. * fixes * kms sse * handle retry if not found in .versions folder and should read the normal object * quick check (no retries) to see if the .versions/ directory exists * skip retry if object is not found * explicit update to avoid sync delay * fix map update lock * Remove fmt.Printf debug statements * Fix SSE-KMS multipart base IV fallback to fail instead of regenerating * fmt * Fix ACL grants storage logic * header handling * nil handling * range read for sse content * test range requests for sse objects * fmt * unused code * upload in chunks * header case * fix url * bucket policy error vs bucket not found * jwt handling * fmt * jwt in request header * Optimize Case-Insensitive Prefix Check * dead code * Eliminated Unnecessary Stream Prefetch for Multipart SSE * range sse * sse * refactor * context * fmt * fix type * fix SSE-C IV Mismatch * Fix Headers Being Set After WriteHeader * fix url parsing * propergate sse headers * multipart sse-s3 * aws sig v4 authen * sse kms * set content range * better errors * Update s3api_object_handlers_copy.go * Update s3api_object_handlers.go * Update s3api_object_handlers.go * avoid magic number * clean up * Update s3api_bucket_policy_handlers.go * fix url parsing * context * data and metadata both use background context * adjust the offset * SSE Range Request IV Calculation * adjust logs * IV relative to offset in each part, not the whole file * collect logs * offset * fix offset * fix url * logs * variable * jwt * Multipart ETag semantics: conditionally set object-level Md5 for single-chunk uploads only. * sse * adjust IV and offset * multipart boundaries * ensures PUT and GET operations return consistent ETags * Metadata Header Case * CommonPrefixes Sorting with URL Encoding * always sort * remove the extra PathUnescape call * fix the multipart get part ETag * the FileChunk is created without setting ModifiedTsNs * Sort CommonPrefixes lexicographically to match AWS S3 behavior * set md5 for multipart uploads * prevents any potential data loss or corruption in the small-file inline storage path * compiles correctly * decryptedReader will now be properly closed after use * Fixed URL encoding and sort order for CommonPrefixes * Update s3api_object_handlers_list.go * SSE-x Chunk View Decryption * Different IV offset calculations for single-part vs multipart objects * still too verbose in logs * less logs * ensure correct conversion * fix listing * nil check * minor fixes * nil check * single character delimiter * optimize * range on empty object or zero-length * correct IV based on its position within that part, not its position in the entire object * adjust offset * offset Fetch FULL encrypted chunk (not just the range) Adjust IV by PartOffset/ChunkOffset only Decrypt full chunk Skip in the DECRYPTED stream to reach OffsetInChunk * look breaking * refactor * error on no content * handle intra-block byte skipping * Incomplete HTTP Response Error Handling * multipart SSE * Update s3api_object_handlers.go * address comments * less logs * handling directory * Optimized rejectDirectoryObjectWithoutSlash() to avoid unnecessary lookups * Revert "handling directory" This reverts commit `3a335f0ac3`. * constant * Consolidate nil entry checks in GetObjectHandler * add range tests * Consolidate redundant nil entry checks in HeadObjectHandler * adjust logs * SSE type * large files * large files Reverted the plain-object range test * ErrNoEncryptionConfig * Fixed SSERangeReader Infinite Loop Vulnerability * Fixed SSE-KMS Multipart ChunkReader HTTP Body Leak * handle empty directory in S3, added PyArrow tests * purge unused code * Update s3_parquet_test.py * Update requirements.txt * According to S3 specifications, when both partNumber and Range are present, the Range should apply within the selected part's boundaries, not to the full object. * handle errors * errors after writing header * https * fix: Wait for volume assignment readiness before running Parquet tests The test-implicit-dir-with-server test was failing with an Internal Error because volume assignment was not ready when tests started. This fix adds a check that attempts a volume assignment and waits for it to succeed before proceeding with tests. This ensures that: 1. Volume servers are registered with the master 2. Volume growth is triggered if needed 3. The system can successfully assign volumes for writes Fixes the timeout issue where boto3 would retry 4 times and fail with 'We encountered an internal error, please try again.' * sse tests * store derived IV * fix: Clean up gRPC ports between tests to prevent port conflicts The second test (test-implicit-dir-with-server) was failing because the volume server's gRPC port (18080 = VOLUME_PORT + 10000) was still in use from the first test. The cleanup code only killed HTTP port processes, not gRPC port processes. Added cleanup for gRPC ports in all stop targets: - Master gRPC: MASTER_PORT + 10000 (19333) - Volume gRPC: VOLUME_PORT + 10000 (18080) - Filer gRPC: FILER_PORT + 10000 (18888) This ensures clean state between test runs in CI. * add import * address comments * docs: Add placeholder documentation files for Parquet test suite Added three missing documentation files referenced in test/s3/parquet/README.md: 1. TEST_COVERAGE.md - Documents 43 total test cases (17 Go unit tests, 6 Python integration tests, 20 Python end-to-end tests) 2. FINAL_ROOT_CAUSE_ANALYSIS.md - Explains the s3fs compatibility issue with PyArrow, the implicit directory problem, and how the fix works 3. MINIO_DIRECTORY_HANDLING.md - Compares MinIO's directory handling approach with SeaweedFS's implementation Each file contains: - Title and overview - Key technical details relevant to the topic - TODO sections for future expansion These placeholder files resolve the broken README links and provide structure for future detailed documentation. * clean up if metadata operation failed * Update s3_parquet_test.py * clean up * Update Makefile * Update s3_parquet_test.py * Update Makefile * Handle ivSkip for non-block-aligned offsets * Update README.md * stop volume server faster * stop volume server in 1 second * different IV for each chunk in SSE-S3 and SSE-KMS * clean up if fails * testing upload * error propagation * fmt * simplify * fix copying * less logs * endian * Added marshaling error handling * handling invalid ranges * error handling for adding to log buffer * fix logging * avoid returning too quickly and ensure proper cleaning up * Activity Tracking for Disk Reads * Cleanup Unused Parameters * Activity Tracking for Kafka Publishers * Proper Test Error Reporting * refactoring * less logs * less logs * go fmt * guard it with if entry.Attributes.TtlSec > 0 to match the pattern used elsewhere. * Handle bucket-default encryption config errors explicitly for multipart * consistent activity tracking * obsolete code for s3 on filer read/write handlers * Update weed/s3api/s3api_object_handlers_list.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	3 months ago
Konstantin Lebedev	084b377f87	do delete expired entries on s3 list request (#7426 ) * do delete expired entries on s3 list request https://github.com/seaweedfs/seaweedfs/issues/6837 * disable delete expires s3 entry in filer * pass opt allowDeleteObjectsByTTL to all servers * delete on get and head * add lifecycle expiration s3 tests * fix opt allowDeleteObjectsByTTL for server * fix test lifecycle expiration * fix IsExpired * fix locationPrefix for updateEntriesTTL * fix s3tests * resolv coderabbitai * GetS3ExpireTime on filer * go mod * clear TtlSeconds for volume * move s3 delete expired entry to filer * filer delete meta and data * del unusing func removeExpiredObject * test s3 put * test s3 put multipart * allowDeleteObjectsByTTL by default * fix pipline tests * rm dublicate SeaweedFSExpiresS3 * revert expiration tests * fix updateTTL * rm log * resolv comment * fix delete version object * fix S3Versioning * fix delete on FindEntry * fix delete chunks * fix sqlite not support concurrent writes/reads * move deletion out of listing transaction; delete entries and empty folders * Revert "fix sqlite not support concurrent writes/reads" This reverts commit `5d5da14e0e`. * clearer handling on recursive empty directory deletion * handle listing errors * strut copying * reuse code to delete empty folders * use iterative approach with a queue to avoid recursive WithFilerClient calls * stop a gRPC stream from the client-side callback is to return a specific error, e.g., io.EOF * still issue UpdateEntry when the flag must be added * errors join * join path * cleaner * add context, sort directories by depth (deepest first) to avoid redundant checks * batched operation, refactoring * prevent deleting bucket * constant * reuse code * more logging * refactoring * s3 TTL time * Safety check --------- Co-authored-by: chrislu <chris.lu@gmail.com>	3 months ago
Dmitriy Pavlov	9b6b564235	Filer: Add retry mechanism for failed file deletions (#7402 ) * Filer: Add retry mechanism for failed file deletions Implement a retry queue with exponential backoff for handling transient deletion failures, particularly when volumes are temporarily read-only. Key features: - Automatic retry for retryable errors (read-only volumes, network issues) - Exponential backoff: 5min → 10min → 20min → ... (max 6 hours) - Maximum 10 retry attempts per file before giving up - Separate goroutine processing retry queue every minute - Enhanced logging with retry/permanent error classification This addresses the issue where file deletions fail when volumes are temporarily read-only (tiered volumes, maintenance, etc.) and these deletions were previously lost. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Update weed/filer/filer_deletion.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Filer: Add retry mechanism for failed file deletions Implement a retry queue with exponential backoff for handling transient deletion failures, particularly when volumes are temporarily read-only. Key features: - Automatic retry for retryable errors (read-only volumes, network issues) - Exponential backoff: 5min → 10min → 20min → ... (max 6 hours) - Maximum 10 retry attempts per file before giving up - Separate goroutine processing retry queue every minute - Map-based retry queue for O(1) lookups and deletions - Enhanced logging with retry/permanent error classification - Consistent error detail limiting (max 10 total errors logged) - Graceful shutdown support with quit channel for both processors This addresses the issue where file deletions fail when volumes are temporarily read-only (tiered volumes, maintenance, etc.) and these deletions were previously lost. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Filer: Replace magic numbers with named constants in retry processor Replace hardcoded values with package-level constants for better maintainability: - DeletionRetryPollInterval (1 minute): interval for checking retry queue - DeletionRetryBatchSize (1000): max items to process per iteration This improves code readability and makes configuration changes easier. * Filer: Optimize retry queue with min-heap data structure Replace map-based retry queue with a min-heap for better scalability and deterministic ordering. Performance improvements: - GetReadyItems: O(N) → O(K log N) where K is items retrieved - AddOrUpdate: O(1) → O(log N) (acceptable trade-off) - Early exit when checking ready items (heap top is earliest) - No full iteration over all items while holding lock Benefits: - Deterministic processing order (earliest NextRetryAt first) - Better scalability for large retry queues (thousands of items) - Reduced lock contention duration - Memory efficient (no separate slice reconstruction) Implementation: - Min-heap ordered by NextRetryAt using container/heap - Dual index: heap for ordering + map for O(1) FileId lookups - heap.Fix() used when updating existing items - Comprehensive complexity documentation in comments This addresses the performance bottleneck identified in GetReadyItems where iterating over the entire map with a write lock could block other goroutines in high-failure scenarios. * Filer: Modernize heap interface and improve error handling docs 1. Replace interface{} with any in heap methods - Addresses modern Go style (Go 1.18+) - Improves code readability 2. Enhance isRetryableError documentation - Acknowledge string matching brittleness - Add comprehensive TODO for future improvements: * Use HTTP status codes (503, 429, etc.) * Implement structured error types with errors.Is/As * Extract gRPC status codes * Add error wrapping for better context - Document each error pattern with context - Add defensive check for empty error strings Current implementation remains pragmatic for initial release while documenting a clear path for future robustness improvements. String matching is acceptable for now but should be replaced with structured error checking when refactoring the deletion pipeline. * Filer: Refactor deletion processors for better readability Extract large callback functions into dedicated private methods to improve code organization and maintainability. Changes: 1. Extract processDeletionBatch method - Handles deletion of a batch of file IDs - Classifies errors (success, not found, retryable, permanent) - Manages retry queue additions - Consolidates logging logic 2. Extract processRetryBatch method - Handles retry attempts for previously failed deletions - Processes retry results and updates queue - Symmetric to processDeletionBatch for consistency Benefits: - Main loop functions (loopProcessingDeletion, loopProcessingDeletionRetry) are now concise and focused on orchestration - Business logic is separated into testable methods - Reduced nesting depth improves readability - Easier to understand control flow at a glance - Better separation of concerns The refactored methods follow the single responsibility principle, making the codebase more maintainable and easier to extend. * Update weed/filer/filer_deletion.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Filer: Fix critical retry count bug and add comprehensive error patterns Critical bug fixes from PR review: 1. Fix RetryCount reset bug (CRITICAL) - Problem: When items are re-queued via AddOrUpdate, RetryCount resets to 1, breaking exponential backoff - Solution: Add RequeueForRetry() method that preserves retry state - Impact: Ensures proper exponential backoff progression 2. Add overflow protection in backoff calculation - Check shift amount > 63 to prevent bit-shift overflow - Additional safety: check if delay <= 0 or > MaxRetryDelay - Protects against arithmetic overflow in extreme cases 3. Expand retryable error patterns - Added: timeout, deadline exceeded, context canceled - Added: lookup error/failed (volume discovery issues) - Added: connection refused, broken pipe (network errors) - Added: too many requests, service unavailable (backpressure) - Added: temporarily unavailable, try again (transient errors) - Added: i/o timeout (network timeouts) Benefits: - Retry mechanism now works correctly across restarts - More robust against edge cases and overflow - Better coverage of transient failure scenarios - Improved resilience in high-failure environments Addresses feedback from CodeRabbit and Gemini Code Assist in PR #7402. * Filer: Add persistence docs and comprehensive unit tests Documentation improvements: 1. Document in-memory queue limitation - Acknowledge that retry queue is volatile (lost on restart) - Document trade-offs and future persistence options - Provide clear path for production hardening - Note eventual consistency through main deletion queue Unit test coverage: 1. TestDeletionRetryQueue_AddAndRetrieve - Basic add/retrieve operations - Verify items not ready before delay elapsed 2. TestDeletionRetryQueue_ExponentialBackoff - Verify exponential backoff progression (5m→10m→20m→40m→80m) - Validate delay calculations with timing tolerance 3. TestDeletionRetryQueue_OverflowProtection - Test high retry counts (60+) that could cause overflow - Verify capping at MaxRetryDelay 4. TestDeletionRetryQueue_MaxAttemptsReached - Verify items discarded after MaxRetryAttempts - Confirm proper queue cleanup 5. TestIsRetryableError - Comprehensive error pattern coverage - Test all retryable error types (timeout, connection, lookup, etc.) - Verify non-retryable errors correctly identified 6. TestDeletionRetryQueue_HeapOrdering - Verify min-heap property maintained - Test items processed in NextRetryAt order - Validate heap.Init() integration All tests passing. Addresses PR feedback on testing requirements. * Filer: Add code quality improvements for deletion retry Address PR feedback with minor optimizations: - Add MaxLoggedErrorDetails constant (replaces magic number 10) - Pre-allocate slices and maps in processRetryBatch for efficiency - Improve log message formatting to use constant These changes improve code maintainability and runtime performance without altering functionality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * refactoring retrying * use constant * assert * address comment * refactor * address comments * dedup * process retried deletions * address comment * check in-flight items also; dedup code * refactoring * refactoring * simplify * reset heap * more efficient * add DeletionBatchSize as a constant;Permanent > Retryable > Success > Not Found --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: chrislu <chris.lu@gmail.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>	4 months ago
Chris Lu	6a8c53bc44	Filer: batch deletion operations to return individual error results (#7382 ) * batch deletion operations to return individual error results Modify batch deletion operations to return individual error results instead of one aggregated error, enabling better tracking of which specific files failed to delete (helping reduce orphan file issues). * Simplified logging logic * Optimized nested loop * handles the edge case where the RPC succeeds but connection cleanup fails * simplify * simplify * ignore 'not found' errors here	4 months ago
Chris Lu	263e891da0	Clients to volume server requires JWT tokens for all read operations (#7376 ) * [Admin UI] Login not possible due to securecookie error * avoid 404 favicon * Update weed/admin/dash/auth_middleware.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * address comments * avoid variable over shadowing * log session save error * When jwt.signing.read.key is enabled in security.toml, the volume server requires JWT tokens for all read operations. * reuse fileId * refactor --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	4 months ago
Chris Lu	7d147f238c	avoid repeated reading disk (#7369 ) * avoid repeated reading disk * checks both flush time AND read position advancement * wait on cond * fix reading Gap detection and skipping to earliest memory time Time-based reads that include events at boundary times for first reads (offset ≤ 0) Aggregated subscriber wake-up via ListenersWaits signaling * address comments	4 months ago

1 2 3 4 5 ...

899 Commits (260d9f44a44969c8477c615ecc86f4d237ddfa22)