Sparse delta-file snapshots with copy-on-write in the flusher.
Zero write-path overhead when no snapshot is active.
New: snapshot.go (SnapshotBitmap, SnapshotHeader, delta file I/O)
Modified: flusher.go (flushMu, CoW phase in FlushOnce, PauseAndFlush)
Modified: blockvol.go (Create/Read/Delete/Restore/ListSnapshots, recovery)
Modified: wal_writer.go (Reset for snapshot restore)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
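The copy-on-write idea above can be sketched with a per-block bitmap the flusher consults before overwriting data while a snapshot is active. This is a hypothetical illustration; the names do not necessarily match the actual snapshot.go implementation.

```go
package main

import "fmt"

// SnapshotBitmap tracks, per block, whether the pre-snapshot contents
// have already been preserved in the delta file. Hypothetical sketch.
type SnapshotBitmap struct {
	bits []uint64
}

func NewSnapshotBitmap(nBlocks int) *SnapshotBitmap {
	return &SnapshotBitmap{bits: make([]uint64, (nBlocks+63)/64)}
}

// TestAndSet reports whether the block was already copied to the delta
// file, marking it as copied otherwise. The flusher pays the CoW cost
// only on the first overwrite of each block per snapshot; with no
// snapshot active the bitmap is nil and the write path is untouched.
func (b *SnapshotBitmap) TestAndSet(block int) bool {
	w, m := block/64, uint64(1)<<uint(block%64)
	if b.bits[w]&m != 0 {
		return true
	}
	b.bits[w] |= m
	return false
}

func main() {
	bm := NewSnapshotBitmap(128)
	fmt.Println(bm.TestAndSet(5)) // false: first overwrite, copy old block to delta
	fmt.Println(bm.TestAndSet(5)) // true: already preserved, skip the copy
}
```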
Add ALUA (Asymmetric Logical Unit Access) support to the iSCSI target,
enabling dm-multipath on Linux to automatically detect path state changes
and reroute I/O during HA failover without initiator-side intervention.
- ALUAProvider interface with implicit ALUA (TPGS=0x01)
- INQUIRY byte 5 TPGS bits, VPD 0x83 with NAA+TPG+RTP descriptors
- REPORT TARGET PORT GROUPS handler (MAINTENANCE IN SA=0x0A)
- MAINTENANCE OUT rejection (implicit-only, no SET TPG)
- Standby write rejection (NOT_READY ASC=04h ASCQ=0Bh)
- RoleNone maps to Active/Optimized (standalone single-node compatibility)
- NAA-6 device identifier derived from volume UUID
- -tpg-id flag with [1,65535] validation
- dm-multipath config + setup script (group_by_tpg, ALUA prio)
- 12 unit tests + 16 QA adversarial tests + 4 integration tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
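For reference, the TPGS field advertised in the INQUIRY change above lives in bits 5:4 of byte 5 of the standard INQUIRY data (per SPC-4), with 01b meaning implicit ALUA only. A minimal sketch, with an assumed helper name:

```go
package main

import "fmt"

// tpgsImplicit is the TPGS value advertising implicit ALUA only (01b).
const tpgsImplicit = 0x1

// setTPGS sets the TPGS field, bits 5:4 of byte 5 of standard INQUIRY
// data. Hypothetical helper; shown for illustration only.
func setTPGS(inquiry []byte, tpgs byte) {
	inquiry[5] |= (tpgs & 0x3) << 4
}

func main() {
	inq := make([]byte, 36) // standard INQUIRY response length
	setTPGS(inq, tpgsImplicit)
	fmt.Printf("byte5=0x%02x\n", inq[5]) // byte5=0x10 -> implicit ALUA
}
```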
Test harness for running blockvol iSCSI tests on WSL2 and remote nodes
(m01/m02). Includes Node (SSH/local exec), ISCSIClient (discover/login/

logout), WeedTarget (weed volume server lifecycle), and test suites for
smoke, stress, crash recovery, chaos, perf benchmarks, and apps (fio/dd).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ProcessBlockVolumeAssignments to BlockVolumeStore and wire
AssignmentSource/AssignmentCallback into the heartbeat collector's
Run() loop. Assignments are fetched and applied each tick after
status collection.
Bug fixes:
- BUG-CP4B3-1: TOCTOU between GetBlockVolume and HandleAssignment.
Added withVolume() helper that holds RLock across lookup+operation,
preventing RemoveBlockVolume from closing the volume mid-assignment.
- BUG-CP4B3-2: Data race on callback fields read by Run() goroutine.
Made StatusCallback/AssignmentSource/AssignmentCallback private,
added cbMu mutex and SetXxx() setter methods. Lock held only for
load/store, not during callback execution.
7 dev tests + 13 QA adversarial tests = 20 new tests.
972 total unit tests, all passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
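The BUG-CP4B3-2 fix above follows a standard pattern: load the callback under the mutex, then invoke it outside the lock. A minimal sketch with illustrative names (not the actual collector code):

```go
package main

import (
	"fmt"
	"sync"
)

// Collector sketches the setter pattern: callback fields are private,
// guarded by a mutex, and the lock covers only the load/store.
type Collector struct {
	cbMu           sync.Mutex
	statusCallback func() string
}

func (c *Collector) SetStatusCallback(cb func() string) {
	c.cbMu.Lock()
	defer c.cbMu.Unlock()
	c.statusCallback = cb
}

func (c *Collector) runOnce() string {
	c.cbMu.Lock()
	cb := c.statusCallback // copy under lock
	c.cbMu.Unlock()
	if cb == nil {
		return ""
	}
	return cb() // invoked outside the lock: no deadlock if cb calls Set*
}

func main() {
	c := &Collector{}
	c.SetStatusCallback(func() string { return "ok" })
	fmt.Println(c.runOnce()) // ok
}
```

Holding the lock during the callback instead would risk deadlock if the callback ever re-enters a setter.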
Boundary tests for RoleFromWire, LeaseTTLToWire overflow/clamp/negative,
ToBlockVolumeInfoMessage with primary/stale/closed/concurrent volumes,
BlockVolumeAssignment roundtrip, and heartbeat collection edge cases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SimulatedMaster test helper + 20 assignment sequence tests (8 sequence,
5 failover, 5 adversarial, 2 status). Add BlockVolumeStatus struct and
Status() method. Includes QA test files for CP1-CP4a. 940 total unit tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add master-driven lifecycle operations: promotion, demotion, rebuild,
and split-brain prevention. All testable on Windows with mock TCP.
New files:
- promotion.go: HandleAssignment (single entry point for role changes),
promote (Replica/None -> Primary with durable epoch), demote
(Primary -> Draining -> Stale with drain timeout)
- rebuild.go: RebuildServer (WAL catch-up + full extent streaming),
StartRebuild client (WAL catch-up with full extent fallback,
two-phase rebuild with second catch-up for concurrent writes)
Modified:
- wal_writer.go: ScanFrom() method, ErrWALRecycled sentinel
- repl_proto.go: rebuild message types + RebuildRequest encode/decode
- blockvol.go: assignMu, drainTimeout, rebuildServer fields;
HandleAssignment/StartRebuildServer/StopRebuildServer methods;
rebuild server stop in Close()
- dirty_map.go: Clear() method for full extent rebuild
32 new tests covering WAL scan, promotion/demotion, rebuild server,
rebuild client, split-brain prevention, and full lifecycle scenarios.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
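The split-brain prevention mentioned above typically hinges on a durable epoch: an assignment is rejected unless its epoch is strictly newer than the last one recorded. A simplified sketch under that assumption, with illustrative types:

```go
package main

import "fmt"

type Role int

const (
	RoleNone Role = iota
	RoleReplica
	RolePrimary
)

// Volume sketches epoch-guarded role assignment. Illustrative only;
// the real HandleAssignment also persists the epoch durably.
type Volume struct {
	role  Role
	epoch uint64
}

func (v *Volume) HandleAssignment(role Role, epoch uint64) error {
	if epoch <= v.epoch {
		// A stale controller (or a partitioned peer) cannot demote or
		// promote with an old epoch: this is the split-brain guard.
		return fmt.Errorf("stale epoch %d (current %d): assignment rejected", epoch, v.epoch)
	}
	v.role, v.epoch = role, epoch
	return nil
}

func main() {
	v := &Volume{role: RoleReplica, epoch: 7}
	fmt.Println(v.HandleAssignment(RolePrimary, 8)) // <nil>: fresh epoch accepted
	fmt.Println(v.HandleAssignment(RolePrimary, 8)) // stale epoch error
}
```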
Primary ships WAL entries to replica over TCP (data channel), confirms
durability via barrier RPC (control channel). SyncCache runs local fsync
and replica barrier in parallel via MakeDistributedSync. When replica is
unreachable, shipper enters permanent degraded mode and falls back to
local-only sync (Phase 3 behavior).
Key design: two separate TCP ports (data+control), contiguous LSN
enforcement, epoch equality check, WAL-full retry on replica,
cond.Wait-based barrier with configurable timeout, BarrierFsyncFailed
status code. Close lifecycle: shipper → receiver → drain → committer →
flusher → fd.
New files: repl_proto.go, wal_shipper.go, replica_apply.go,
replica_barrier.go, dist_group_commit.go
Modified: blockvol.go, blockvol_test.go
27 dev tests + 21 QA tests = 48 new tests; 889 total (609 engine + 280
iSCSI), all passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 categories: PDU, Params, Login, Discovery, SCSI, DataIO, Session,
Target, Integration. 2,183 lines. All 219 tests pass (164 dev + 55 QA).
No new production bugs found.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Discovery session nil handler crash: reject SCSI commands with
Reject PDU when s.scsi is nil (discovery sessions have no target).
2. CmdSN window enforcement: validate incoming CmdSN against
[ExpCmdSN, MaxCmdSN] using serial arithmetic. Drop out-of-window
commands per RFC 7143 section 4.2.2.1.
3. Data-Out buffer offset validation: enforce BufferOffset == received
for ordered data (DataPDUInOrder=Yes). Prevents silent corruption
from out-of-order or overlapping data.
4. ImmediateData enforcement: reject immediate data in SCSI command
PDU when negotiated ImmediateData=No.
5. UNMAP descriptor length alignment: reject blockDescLen not a
multiple of 16 bytes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
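The serial arithmetic used for the CmdSN window check (item 2) handles 32-bit wraparound: a <= b iff the signed difference is non-positive. A minimal sketch:

```go
package main

import "fmt"

// serialLTE reports a <= b under 32-bit serial number arithmetic:
// int32(a-b) <= 0 is wraparound-safe, unlike a direct comparison.
func serialLTE(a, b uint32) bool {
	return int32(a-b) <= 0
}

// cmdSNInWindow checks ExpCmdSN <= CmdSN <= MaxCmdSN serially.
func cmdSNInWindow(cmdSN, expCmdSN, maxCmdSN uint32) bool {
	return serialLTE(expCmdSN, cmdSN) && serialLTE(cmdSN, maxCmdSN)
}

func main() {
	fmt.Println(cmdSNInWindow(5, 3, 10))         // true
	fmt.Println(cmdSNInWindow(11, 3, 10))        // false: beyond MaxCmdSN
	fmt.Println(cmdSNInWindow(2, 0xFFFFFFFE, 3)) // true: window wraps past zero
}
```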
The Linux kernel iSCSI initiator pipelines multiple SCSI commands on
the same TCP connection (command queuing). When a write needs R2T for
data beyond the immediate portion, collectDataOut may read a pipelined
SCSI command instead of the expected Data-Out PDU.
Fix: queue non-Data-Out PDUs received during collectDataOut into a
pending buffer. The main dispatch loop drains pending PDUs before
reading from the connection. This correctly handles interleaved
commands during multi-PDU write transfers.
Bug found during WSL2 smoke test: mkfs.ext4 hangs at "Writing
superblocks" because inode table zeroing sends large writes that
exceed FirstBurstLength, triggering R2T while the kernel has already
queued the next command.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
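The pending-buffer fix can be sketched as follows, with illustrative types standing in for the real PDU and connection code:

```go
package main

import "fmt"

type PDU struct{ Opcode byte }

const opDataOut = 0x05

// Conn sketches the fix: non-Data-Out PDUs seen while collecting
// Data-Out are queued, and the dispatch loop drains the queue before
// reading from the wire again. The wire slice stands in for TCP reads.
type Conn struct {
	pending []PDU
	wire    []PDU
}

func (c *Conn) readPDU() PDU {
	p := c.wire[0]
	c.wire = c.wire[1:]
	return p
}

// collectDataOut returns the next Data-Out PDU, deferring anything else
// (e.g. a pipelined SCSI command) to the pending buffer.
func (c *Conn) collectDataOut() PDU {
	for {
		p := c.readPDU()
		if p.Opcode == opDataOut {
			return p
		}
		c.pending = append(c.pending, p)
	}
}

// nextPDU is what the main dispatch loop calls: pending first.
func (c *Conn) nextPDU() PDU {
	if len(c.pending) > 0 {
		p := c.pending[0]
		c.pending = c.pending[1:]
		return p
	}
	return c.readPDU()
}

func main() {
	// A pipelined command (0x01) arrives before the expected Data-Out.
	c := &Conn{wire: []PDU{{0x01}, {opDataOut}, {0x01}}}
	fmt.Printf("0x%02x\n", c.collectDataOut().Opcode) // 0x05: the Data-Out
	fmt.Printf("0x%02x\n", c.nextPDU().Opcode)        // 0x01: the queued command
}
```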
- Skip InitiatorAlias in negotiation (was returning NotUnderstood)
- Capture TargetName in StageLoginOp direct-jump path (iscsiadm skips
security stage, sends CSG=LoginOp directly -- nil SCSIHandler crash)
- Add portalAddr to TargetServer for discovery responses (listener on
[::] is not routable from WSL2 clients)
- Add -portal flag to iscsi-target binary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If local EC scrubbing hits needles whose chunk locations reside entirely
in local shards, we can fully reconstruct them and check CRCs for
data integrity.
* fix LevelDB panic on lazy reload
Implemented a thread-safe reload mechanism using double-checked
locking and a retry loop in Get, Put, and Delete. Added a concurrency
test to verify the fix and prevent regressions.
Fixes #8269
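The double-checked locking described above takes the read lock on the fast path and re-checks after upgrading to the write lock, so only one goroutine performs the reload. A minimal sketch with illustrative names (a map stands in for the LevelDB handle):

```go
package main

import (
	"fmt"
	"sync"
)

// Store sketches the reload pattern: db == nil means "needs (re)load".
type Store struct {
	mu sync.RWMutex
	db map[string]string
}

func (s *Store) getFromDb(key string) (string, bool) {
	// Fast path: read lock, DB already loaded.
	s.mu.RLock()
	if s.db != nil {
		v, ok := s.db[key]
		s.mu.RUnlock()
		return v, ok
	}
	s.mu.RUnlock()

	// Slow path: write lock, double-check before reloading, since
	// another goroutine may have reloaded between our unlock and lock.
	s.mu.Lock()
	if s.db == nil {
		s.db = map[string]string{"k": "v"} // stands in for reopening LevelDB
	}
	v, ok := s.db[key]
	s.mu.Unlock()
	return v, ok
}

func main() {
	s := &Store{}
	v, ok := s.getFromDb("k")
	fmt.Println(v, ok) // v true
}
```

Extracting this into one helper is also what prevents the recursive-RLock deadlock fixed later in this series: Put/Delete call the helper instead of re-acquiring the lock through Get.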
* refactor: use helper for leveldb fix and remove deprecated ioutil
* fix: prevent deadlock by using getFromDb helper
Extracted DB lookup to internal helper to avoid recursive RLock in Put/Delete methods.
Updated Get to use the helper as well.
* fix: resolve syntax error and commit deadlock prevention
Fixed a duplicate function declaration syntax error.
Verified that getFromDb helper correctly prevents recursive RLock scenarios.
* refactor: remove redundant timeout checks
Removed nested `if m.ldbTimeout > 0` checks in Get, Put, and Delete
methods as suggested in PR review.
* Fix disk errors handling in vacuum compaction
When a disk reports IO errors during vacuum compaction (e.g., 'read /mnt/d1/weed/oc_xyz.dat: input/output error'), the vacuum task should signal the error to the master so it can:
1. Drop the faulty volume replica
2. Rebuild the replica from healthy copies
Changes:
- Add checkReadWriteError() calls in vacuum read paths (ReadNeedleBlob, ReadData, ScanVolumeFile) to flag EIO errors in volume.lastIoError
- Preserve error wrapping using %w format instead of %v so EIO propagates correctly
- The existing heartbeat logic will detect lastIoError and remove the bad volume
Fixes issue #8237
* error
* s3: fix health check endpoints returning 404 for HEAD requests #8243
* fix multipart etag
* address comments
* clean up
* clean up
* optimization
* address comments
* unquoted etag
* dedup
* upgrade
* clean
* etag
* return quoted tag
* quoted etag
* debug
* s3api: unify ETag retrieval and quoting across handlers
Refactor newListEntry to take *S3ApiServer and use getObjectETag,
and update setResponseHeaders to use the same logic. This ensures
consistent ETags are returned for both listing and direct access.
* s3api: implement ListObjects deduplication for versioned buckets
Handle duplicate entries between the main path and the .versions
directory by prioritizing the latest version when bucket versioning
is enabled.
* s3api: cleanup stale main file entries during versioned uploads
Add explicit deletion of pre-existing "main" files when creating new
versions in versioned buckets. This prevents stale entries from
appearing in bucket listings and ensures consistency.
* s3api: fix cleanup code placement in versioned uploads
Correct the placement of rm calls in completeMultipartUpload and
putVersionedObject to ensure stale main files are properly deleted
during versioned uploads.
* s3api: improve getObjectETag fallback for empty ExtETagKey
Ensure that when ExtETagKey exists but contains an empty value,
the function falls through to MD5/chunk-based calculation instead
of returning an empty string.
* s3api: fix test files for new newListEntry signature
Update test files to use the new newListEntry signature where the
first parameter is *S3ApiServer. Created mockS3ApiServer to properly
test owner display name lookup functionality.
* s3api: use filer.ETag for consistent Md5 handling in getEtagFromEntry
Change getEtagFromEntry fallback to use filer.ETag(entry) instead of
filer.ETagChunks to ensure legacy entries with Attributes.Md5 are
handled consistently with the rest of the codebase.
* s3api: optimize list logic and fix conditional header logging
- Hoist bucket versioning check out of per-entry callback to avoid
repeated getVersioningState calls
- Extract appendOrDedup helper function to eliminate duplicate
dedup/append logic across multiple code paths
- Change If-Match mismatch logging from glog.Errorf to glog.V(3).Infof
and remove DEBUG prefix for consistency
* s3api: fix test mock to properly initialize IAM accounts
Fixed nil pointer dereference in TestNewListEntryOwnerDisplayName by
directly initializing the IdentityAccessManagement.accounts map in the
test setup. This ensures newListEntry can properly look up account
display names without panicking.
* cleanup
* s3api: remove premature main file cleanup in versioned uploads
Removed incorrect cleanup logic that was deleting main files during
versioned uploads. This was causing test failures because it deleted
objects that should have been preserved as null versions when
versioning was first enabled. The deduplication logic in listing is
sufficient to handle duplicate entries without deleting files during
upload.
* s3api: add empty-value guard to getEtagFromEntry
Added the same empty-value guard used in getObjectETag to prevent
returning quoted empty strings. When ExtETagKey exists but is empty,
the function now falls through to filer.ETag calculation instead of
returning "".
* s3api: fix listing of directory key objects with matching prefix
Revert prefix handling logic to use strings.TrimPrefix instead of
checking HasPrefix with empty string result. This ensures that when a
directory key object exactly matches the prefix (e.g. prefix="dir/",
object="dir/"), it is correctly handled as a regular entry instead of
being skipped or incorrectly processed as a common prefix. Also fixed
missing variable definition.
* s3api: refactor list inline dedup to use appendOrDedup helper
Refactored the inline deduplication logic in listFilerEntries to use the
shared appendOrDedup helper function. This ensures consistent behavior
and reduces code duplication.
* test: fix port allocation race in s3tables integration test
Updated startMiniCluster to find all required ports simultaneously using
findAvailablePorts instead of sequentially. This prevents race conditions
where the OS reallocates a port that was just released, causing multiple
services (e.g. Filer and Volume) to be assigned the same port and fail
to start.
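The simultaneous-allocation fix works because every listener stays open until all ports are chosen, so the OS cannot hand the same ephemeral port out twice. A sketch of how such a helper might look (the real findAvailablePorts may differ):

```go
package main

import (
	"fmt"
	"net"
)

// findAvailablePorts reserves n distinct ports by holding n listeners
// open at once, releasing them only after all ports are recorded.
func findAvailablePorts(n int) ([]int, error) {
	listeners := make([]net.Listener, 0, n)
	ports := make([]int, 0, n)
	defer func() {
		for _, l := range listeners {
			l.Close() // release together, after all ports are chosen
		}
	}()
	for i := 0; i < n; i++ {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			return nil, err
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}

func main() {
	ports, err := findAvailablePorts(3)
	if err != nil {
		panic(err)
	}
	seen := map[int]bool{}
	for _, p := range ports {
		seen[p] = true
	}
	fmt.Println(len(ports), len(seen) == 3) // 3 true: three distinct ports
}
```

There is still a small window between releasing a port and the service binding it, but the ports are at least guaranteed distinct from each other.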
* Add a version token on `GetState()`/`SetState()` RPCs for volume server states.
* Make state version a property of `VolumeServerState` instead of an in-memory counter.
Also extend state atomicity to reads, instead of just writes.
Implement index (fast) scrubbing for regular/EC volumes via `ScrubVolume()`/`ScrubEcVolume()`.
Also rearranges existing index test files for reuse across unit tests for different modules.
* fix concurrent map access in EC shards info #8219
* refactor: simplify Disk.ToDiskInfo to use ecShards snapshot and avoid redundant locking
* refactor: improve GetEcShards with pre-allocation and defer
* feat: Add Iceberg REST Catalog server
Implement Iceberg REST Catalog API on a separate port (default 8181)
that exposes S3 Tables metadata through the Apache Iceberg REST protocol.
- Add new weed/s3api/iceberg package with REST handlers
- Implement /v1/config endpoint returning catalog configuration
- Implement namespace endpoints (list/create/get/head/delete)
- Implement table endpoints (list/create/load/head/delete/update)
- Add -port.iceberg flag to S3 standalone server (s3.go)
- Add -s3.port.iceberg flag to combined server mode (server.go)
- Add -s3.port.iceberg flag to mini cluster mode (mini.go)
- Support prefix-based routing for multiple catalogs
The Iceberg REST server reuses S3 Tables metadata storage under
/table-buckets and enables DuckDB, Spark, and other Iceberg clients
to connect to SeaweedFS as a catalog.
* feat: Add Iceberg Catalog pages to admin UI
Add admin UI pages to browse Iceberg catalogs, namespaces, and tables.
- Add Iceberg Catalog menu item under Object Store navigation
- Create iceberg_catalog.templ showing catalog overview with REST info
- Create iceberg_namespaces.templ listing namespaces in a catalog
- Create iceberg_tables.templ listing tables in a namespace
- Add handlers and routes in admin_handlers.go
- Add Iceberg data provider methods in s3tables_management.go
- Add Iceberg data types in types.go
The Iceberg Catalog pages provide visibility into the same S3 Tables
data through an Iceberg-centric lens, including REST endpoint examples
for DuckDB and PyIceberg.
* test: Add Iceberg catalog integration tests and reorg s3tables tests
- Reorganize existing s3tables tests to test/s3tables/table-buckets/
- Add new test/s3tables/catalog/ for Iceberg REST catalog tests
- Add TestIcebergConfig to verify /v1/config endpoint
- Add TestIcebergNamespaces to verify namespace listing
- Add TestDuckDBIntegration for DuckDB connectivity (requires Docker)
- Update CI workflow to use new test paths
* fix: Generate proper random UUIDs for Iceberg tables
Address code review feedback:
- Replace placeholder UUID with crypto/rand-based UUID v4 generation
- Add detailed TODO comments for handleUpdateTable stub explaining
the required atomic metadata swap implementation
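For reference, a crypto/rand-based UUID v4 sets the version and variant bits per RFC 4122; this sketch (with an assumed helper name) shows the shape of such a generator:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 generates a random UUID: 16 random bytes with the version
// nibble set to 4 and the variant bits set to 10xx (RFC 4122).
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // variant 10xx
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	u, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(u), u[14] == '4') // 36 true: canonical form, version 4
}
```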
* fix: Serve Iceberg on localhost listener when binding to different interface
Address code review feedback: properly serve the localhost listener
when the Iceberg server is bound to a non-localhost interface.
* ci: Add Iceberg catalog integration tests to CI
Add new job to run Iceberg catalog tests in CI, along with:
- Iceberg package build verification
- Iceberg unit tests
- Iceberg go vet checks
- Iceberg format checks
* fix: Address code review feedback for Iceberg implementation
- fix: Replace hardcoded account ID with s3_constants.AccountAdminId in buildTableBucketARN()
- fix: Improve UUID generation error handling with deterministic fallback (timestamp + PID + counter)
- fix: Update handleUpdateTable to return HTTP 501 Not Implemented instead of fake success
- fix: Better error handling in handleNamespaceExists to distinguish 404 from 500 errors
- fix: Use relative URL in template instead of hardcoded localhost:8181
- fix: Add HTTP timeout to test's waitForService function to avoid hangs
- fix: Use dynamic ephemeral ports in integration tests to avoid flaky parallel failures
- fix: Add Iceberg port to final port configuration logging in mini.go
* fix: Address critical issues in Iceberg implementation
- fix: Cache table UUIDs to ensure persistence across LoadTable calls
The UUID now remains stable for the lifetime of the server session.
TODO: For production, UUIDs should be persisted in S3 Tables metadata.
- fix: Remove redundant URL-encoded namespace parsing
mux router already decodes %1F to \x1F before passing to handlers.
Redundant ReplaceAll call could cause bugs with literal %1F in namespace.
* fix: Improve test robustness and reduce code duplication
- fix: Make DuckDB test more robust by failing on unexpected errors
Instead of silently logging errors, now explicitly check for expected
conditions (extension not available) and skip the test appropriately.
- fix: Extract username helper method to reduce duplication
Created getUsername() helper in AdminHandlers to avoid duplicating
the username retrieval logic across Iceberg page handlers.
* fix: Add mutex protection to table UUID cache
Protects concurrent access to the tableUUIDs map with sync.RWMutex.
Uses read-lock for fast path when UUID already cached, and write-lock
for generating new UUIDs. Includes double-check pattern to handle race
condition between read-unlock and write-lock.
* style: fix go fmt errors
* feat(iceberg): persist table UUID in S3 Tables metadata
* feat(admin): configure Iceberg port in Admin UI and commands
* refactor: address review comments (flags, tests, handlers)
- command/mini: fix tracking of explicit s3.port.iceberg flag
- command/admin: add explicit -iceberg.port flag
- admin/handlers: reuse getUsername helper
- tests: use 127.0.0.1 for ephemeral ports and os.Stat for file size check
* test: check error from FileStat in verify_gc_empty_test
* Implement RPC skeleton for regular/EC volumes scrubbing.
See https://github.com/seaweedfs/seaweedfs/issues/8018 for details.
* Minor proto improvements for `ScrubVolume()`, `ScrubEcVolume()`:
- Add fields for scrubbing details in `ScrubVolumeResponse` and `ScrubEcVolumeResponse`,
instead of reporting these through RPC errors.
- Return a list of broken shards when scrubbing EC volumes, via `EcShardInfo`.
* Bootstrap persistent state for volume servers.
This PR implements logic to load/save persistent state information for storages
associated with volume servers, and to report state changes back to masters
via heartbeat messages.
More work ensues!
See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.
* Add volume server RPCs to read and update state flags.
* fix: skip exhausted blocks before creating an interval
* refactor: optimize interval creation and fix logic duplication
* docs: add docstring for LocateData
* refactor: extract moveToNextBlock helper to deduplicate logic
* fix: use int64 for block index comparison to prevent overflow
* test: add unit test for LocateData boundary crossing (issue #8179)
* fix: skip exhausted blocks to prevent negative interval size and panics (issue #8179)
* refactor: apply review suggestions for test maintainability and code style
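The "skip exhausted blocks" fix above can be illustrated with a simplified locate loop: blocks with no bytes remaining at the current offset are skipped rather than emitted, so an interval can never be created with negative size. Illustrative types only, not the actual LocateData implementation:

```go
package main

import "fmt"

type block struct{ start, size int64 }

// locate maps [offset, offset+length) onto blocks, skipping any block
// that is exhausted at the current offset instead of emitting it.
func locate(blocks []block, offset, length int64) []block {
	var out []block
	for i := 0; length > 0 && i < len(blocks); i++ {
		b := blocks[i]
		avail := b.start + b.size - offset
		if avail <= 0 {
			continue // exhausted block: skip, don't create a <=0-size interval
		}
		n := avail
		if n > length {
			n = length
		}
		out = append(out, block{start: offset, size: n})
		offset += n
		length -= n
	}
	return out
}

func main() {
	// Middle block is empty: without the skip, it would yield a
	// zero- or negative-size interval at the boundary.
	blocks := []block{{0, 10}, {10, 0}, {10, 10}}
	for _, iv := range locate(blocks, 5, 10) {
		fmt.Println(iv.start, iv.size)
	}
}
```

Using int64 throughout (rather than int) matches the overflow fix noted above for block index comparisons.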