The iceber/iouring-go SubmitRequests returns a RequestSet interface
which cannot be ranged over directly. Use resultSet.Done() to wait
for all completions, then iterate resultSet.Requests().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace UseIOUring bool with IOBackend IOBackendMode (tri-state):
- "standard" (default): sequential pread/pwrite/fdatasync
- "auto": try io_uring, fall back to standard with warning log
- "io_uring": require io_uring, fail startup if unavailable
NewIOUring now returns ErrIOUringUnavailable instead of silently
falling back — callers decide whether to fail or fall back based
on the requested mode. All mode transitions are logged:
io backend: requested=auto selected=standard reason=...
io backend: requested=io_uring selected=io_uring
CLI: --io-backend=standard|auto|io_uring added to iscsi-target.
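The selection rules above can be sketched as follows (selectBackend and its signature are illustrative, not the real API):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrIOUringUnavailable mirrors the sentinel described above; the
// tryIOUring parameter stands in for calling NewIOUring.
var ErrIOUringUnavailable = errors.New("io_uring unavailable")

// selectBackend resolves a requested IOBackend mode to the backend
// actually used, per the tri-state rules: standard always works, auto
// falls back with a log line, io_uring fails startup if unavailable.
func selectBackend(requested string, tryIOUring func() error) (string, error) {
	switch requested {
	case "standard":
		return "standard", nil
	case "auto":
		if e := tryIOUring(); e != nil {
			fmt.Printf("io backend: requested=auto selected=standard reason=%v\n", e)
			return "standard", nil
		}
		return "io_uring", nil
	case "io_uring":
		if e := tryIOUring(); e != nil {
			return "", e // caller fails startup
		}
		return "io_uring", nil
	}
	return "", fmt.Errorf("unknown io backend %q", requested)
}

func main() {
	sel, _ := selectBackend("auto", func() error { return ErrIOUringUnavailable })
	fmt.Println(sel) // standard
}
```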
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. HIGH: LinkedWriteFsync now uses SubmitLinkRequests (IOSQE_IO_LINK)
instead of SubmitRequests, ensuring write+fdatasync execute as a
linked chain in the kernel. Falls back to sequential on error.
2. HIGH: PreadBatch/PwriteBatch chunk ops by ring capacity to prevent
"too many requests" rejection when dirty map exceeds ring size (256).
3. MED: CloseBatchIO() added to Flusher, called in BlockVol.Close()
after final flush to release io_uring ring / kernel resources.
4. MED: Sync parity — both standard and io_uring paths now use
fdatasync (via platform-specific fdatasync_linux.go / fdatasync_other.go).
Standard path previously used fsync; now matches io_uring semantics.
On non-Linux, fdatasync falls back to fsync (only option available).
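The ring-capacity chunking in item 2 amounts to splitting a batch into submissions that never exceed the SQ size; a minimal sketch (chunkByRing is illustrative, not the real code):

```go
package main

import "fmt"

// chunkByRing splits a batch of operations into submissions no larger
// than the ring's capacity, so a dirty map bigger than the ring (256)
// cannot trigger a "too many requests" rejection.
func chunkByRing(ops []int, ringSize int) [][]int {
	if ringSize <= 0 {
		ringSize = 1
	}
	var chunks [][]int
	for len(ops) > 0 {
		n := ringSize
		if len(ops) < n {
			n = len(ops)
		}
		chunks = append(chunks, ops[:n])
		ops = ops[n:]
	}
	return chunks
}

func main() {
	// 600 dirty extents against a 256-entry ring -> 3 submissions
	fmt.Println(len(chunkByRing(make([]int, 600), 256))) // 3
}
```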
10 batchio tests, all blockvol tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add iouring_linux.go (build-tagged linux && !no_iouring) using
iceber/iouring-go for batched pread/pwrite/fdatasync. Includes
linked write+fsync chain for group commit optimization.
iouring_other.go provides silent fallback to standard on non-Linux.
blockvol.go wires UseIOUring config flag through to flusher BatchIO.
NewIOUring gracefully falls back if kernel lacks io_uring support.
10 batchio tests, all blockvol tests pass unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New package batchio/ with BatchIO interface (PreadBatch, PwriteBatch,
Fsync, LinkedWriteFsync) and standard sequential implementation.
Flusher refactored to use BatchIO: WAL header reads, WAL entry reads,
and extent writes are now batched through the interface. With the
default NewStandard() backend, behavior is identical to before.
UseIOUring config field added for future io_uring opt-in (Linux 5.6+).
9 interface tests, all existing blockvol tests pass unchanged.
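The interface shape and the behavior-preserving standard backend can be sketched like this (Op, BatchIO, and the signatures are assumptions, not the real package API):

```go
package main

import "fmt"

// Op describes one positioned read or write.
type Op struct {
	Off int64
	Buf []byte
}

type fileAt interface {
	ReadAt(p []byte, off int64) (int, error)
	WriteAt(p []byte, off int64) (int, error)
}

// BatchIO sketches the interface shape described above.
type BatchIO interface {
	PreadBatch(f fileAt, ops []Op) error
	PwriteBatch(f fileAt, ops []Op) error
}

// standard executes each op sequentially, which is why the default
// backend leaves behavior identical to before.
type standard struct{}

func (standard) PreadBatch(f fileAt, ops []Op) error {
	for _, op := range ops {
		if _, err := f.ReadAt(op.Buf, op.Off); err != nil {
			return err
		}
	}
	return nil
}

func (standard) PwriteBatch(f fileAt, ops []Op) error {
	for _, op := range ops {
		if _, err := f.WriteAt(op.Buf, op.Off); err != nil {
			return err
		}
	}
	return nil
}

// memFile is a tiny in-memory stand-in for *os.File.
type memFile struct{ data []byte }

func (m *memFile) ReadAt(p []byte, off int64) (int, error)  { return copy(p, m.data[off:]), nil }
func (m *memFile) WriteAt(p []byte, off int64) (int, error) { return copy(m.data[off:], p), nil }

func main() {
	var b BatchIO = standard{}
	f := &memFile{data: make([]byte, 8)}
	_ = b.PwriteBatch(f, []Op{{Off: 2, Buf: []byte("hi")}})
	out := make([]byte, 2)
	_ = b.PreadBatch(f, []Op{{Off: 2, Buf: out}})
	fmt.Println(string(out)) // hi
}
```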
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sparse delta-file snapshots with copy-on-write in the flusher.
Zero write-path overhead when no snapshot is active.
New: snapshot.go (SnapshotBitmap, SnapshotHeader, delta file I/O)
Modified: flusher.go (flushMu, CoW phase in FlushOnce, PauseAndFlush)
Modified: blockvol.go (Create/Read/Delete/Restore/ListSnapshots, recovery)
Modified: wal_writer.go (Reset for snapshot restore)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ALUA (Asymmetric Logical Unit Access) support to the iSCSI target,
enabling dm-multipath on Linux to automatically detect path state changes
and reroute I/O during HA failover without initiator-side intervention.
- ALUAProvider interface with implicit ALUA (TPGS=0x01)
- INQUIRY byte 5 TPGS bits, VPD 0x83 with NAA+TPG+RTP descriptors
- REPORT TARGET PORT GROUPS handler (MAINTENANCE IN SA=0x0A)
- MAINTENANCE OUT rejection (implicit-only, no SET TPG)
- Standby write rejection (NOT_READY ASC=04h ASCQ=0Bh)
- RoleNone maps to Active/Optimized (standalone single-node compatibility)
- NAA-6 device identifier derived from volume UUID
- -tpg-id flag with [1,65535] validation
- dm-multipath config + setup script (group_by_tpg, ALUA prio)
- 12 unit tests + 16 QA adversarial tests + 4 integration tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test harness for running blockvol iSCSI tests on WSL2 and remote nodes
(m01/m02). Includes Node (SSH/local exec), ISCSIClient (discover/login/
logout), WeedTarget (weed volume server lifecycle), and test suites for
smoke, stress, crash recovery, chaos, perf benchmarks, and apps (fio/dd).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ProcessBlockVolumeAssignments to BlockVolumeStore and wire
AssignmentSource/AssignmentCallback into the heartbeat collector's
Run() loop. Assignments are fetched and applied each tick after
status collection.
Bug fixes:
- BUG-CP4B3-1: TOCTOU between GetBlockVolume and HandleAssignment.
Added withVolume() helper that holds RLock across lookup+operation,
preventing RemoveBlockVolume from closing the volume mid-assignment.
- BUG-CP4B3-2: Data race on callback fields read by Run() goroutine.
Made StatusCallback/AssignmentSource/AssignmentCallback private,
added cbMu mutex and SetXxx() setter methods. Lock held only for
load/store, not during callback execution.
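The withVolume() fix for BUG-CP4B3-1 can be sketched as follows (store/volume are illustrative types, not the real ones):

```go
package main

import (
	"fmt"
	"sync"
)

type volume struct{ closed bool }

// store sketches the TOCTOU fix: withVolume holds the read lock across
// both lookup and operation, so a concurrent remove (which needs the
// write lock) cannot close the volume mid-assignment.
type store struct {
	mu   sync.RWMutex
	vols map[string]*volume
}

func (s *store) withVolume(id string, fn func(*volume) error) error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.vols[id]
	if !ok {
		return fmt.Errorf("volume %s not found", id)
	}
	return fn(v) // volume cannot be removed while fn runs
}

func (s *store) remove(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.vols[id]; ok {
		v.closed = true
		delete(s.vols, id)
	}
}

func main() {
	s := &store{vols: map[string]*volume{"v1": {}}}
	err := s.withVolume("v1", func(v *volume) error {
		if v.closed {
			return fmt.Errorf("closed mid-assignment")
		}
		return nil
	})
	fmt.Println(err)
}
```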
7 dev tests + 13 QA adversarial tests = 20 new tests.
972 total unit tests, all passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BlockVolumeHeartbeatCollector periodically collects block volume status
via callback (standalone, no gRPC wiring yet). Store() accessor on
BlockService. Three bugs found by QA and fixed: Stop-before-Run deadlock
(BUG-CP4B2-1), zero interval panic (BUG-CP4B2-2), callback panic crashes
goroutine (BUG-CP4B2-3). 12 new tests (3 dev + 9 QA adversarial).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Boundary tests for RoleFromWire, LeaseTTLToWire overflow/clamp/negative,
ToBlockVolumeInfoMessage with primary/stale/closed/concurrent volumes,
BlockVolumeAssignment roundtrip, and heartbeat collection edge cases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SimulatedMaster test helper + 20 assignment sequence tests (8 sequence,
5 failover, 5 adversarial, 2 status). Add BlockVolumeStatus struct and
Status() method. Includes QA test files for CP1-CP4a. 940 total unit tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add master-driven lifecycle operations: promotion, demotion, rebuild,
and split-brain prevention. All testable on Windows with mock TCP.
New files:
- promotion.go: HandleAssignment (single entry point for role changes),
promote (Replica/None -> Primary with durable epoch), demote
(Primary -> Draining -> Stale with drain timeout)
- rebuild.go: RebuildServer (WAL catch-up + full extent streaming),
StartRebuild client (WAL catch-up with full extent fallback,
two-phase rebuild with second catch-up for concurrent writes)
Modified:
- wal_writer.go: ScanFrom() method, ErrWALRecycled sentinel
- repl_proto.go: rebuild message types + RebuildRequest encode/decode
- blockvol.go: assignMu, drainTimeout, rebuildServer fields;
HandleAssignment/StartRebuildServer/StopRebuildServer methods;
rebuild server stop in Close()
- dirty_map.go: Clear() method for full extent rebuild
32 new tests covering WAL scan, promotion/demotion, rebuild server,
rebuild client, split-brain prevention, and full lifecycle scenarios.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Primary ships WAL entries to replica over TCP (data channel), confirms
durability via barrier RPC (control channel). SyncCache runs local fsync
and replica barrier in parallel via MakeDistributedSync. When the replica is
unreachable, the shipper enters permanent degraded mode and falls back to
local-only sync (Phase 3 behavior).
Key design: two separate TCP ports (data+control), contiguous LSN
enforcement, epoch equality check, WAL-full retry on replica,
cond.Wait-based barrier with configurable timeout, BarrierFsyncFailed
status code. Close lifecycle: shipper → receiver → drain → committer →
flusher → fd.
New files: repl_proto.go, wal_shipper.go, replica_apply.go,
replica_barrier.go, dist_group_commit.go
Modified: blockvol.go, blockvol_test.go
27 dev tests + 21 QA tests = 48 new tests; 889 total (609 engine + 280
iSCSI), all passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 categories: PDU, Params, Login, Discovery, SCSI, DataIO, Session,
Target, Integration. 2,183 lines. All 229 tests pass (164 dev + 55 QA).
No new production bugs found.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Discovery session nil handler crash: reject SCSI commands with
Reject PDU when s.scsi is nil (discovery sessions have no target).
2. CmdSN window enforcement: validate incoming CmdSN against
[ExpCmdSN, MaxCmdSN] using serial arithmetic. Drop out-of-window
commands per RFC 7143 section 4.2.2.1.
3. Data-Out buffer offset validation: enforce BufferOffset == received
for ordered data (DataPDUInOrder=Yes). Prevents silent corruption
from out-of-order or overlapping data.
4. ImmediateData enforcement: reject immediate data in SCSI command
PDU when negotiated ImmediateData=No.
5. UNMAP descriptor length alignment: reject blockDescLen not a
multiple of 16 bytes.
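The CmdSN window check in item 2 depends on serial arithmetic so comparisons stay correct when the 32-bit counter wraps; a minimal sketch:

```go
package main

import "fmt"

// serialLTE reports a <= b under 32-bit serial arithmetic (RFC 1982
// style): the signed difference decides ordering across wraparound.
func serialLTE(a, b uint32) bool {
	return int32(b-a) >= 0
}

// inCmdSNWindow validates ExpCmdSN <= cmdSN <= MaxCmdSN; out-of-window
// commands are dropped per RFC 7143 section 4.2.2.1.
func inCmdSNWindow(cmdSN, expCmdSN, maxCmdSN uint32) bool {
	return serialLTE(expCmdSN, cmdSN) && serialLTE(cmdSN, maxCmdSN)
}

func main() {
	// window straddling the 32-bit wrap: ExpCmdSN=0xFFFFFFFE, MaxCmdSN=2
	fmt.Println(inCmdSNWindow(0, 0xFFFFFFFE, 2), inCmdSNWindow(3, 0xFFFFFFFE, 2)) // true false
}
```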
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Linux kernel iSCSI initiator pipelines multiple SCSI commands on
the same TCP connection (command queuing). When a write needs R2T for
data beyond the immediate portion, collectDataOut may read a pipelined
SCSI command instead of the expected Data-Out PDU.
Fix: queue non-Data-Out PDUs received during collectDataOut into a
pending buffer. The main dispatch loop drains pending PDUs before
reading from the connection. This correctly handles interleaved
commands during multi-PDU write transfers.
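The pending-buffer mechanism can be sketched with toy types (pdu/conn and the method names are illustrative, not the real implementation):

```go
package main

import "fmt"

type pdu struct{ op string }

// conn sketches the fix: non-Data-Out PDUs seen while collecting write
// data are parked in pending; the dispatch loop drains pending before
// reading the socket again.
type conn struct {
	pending []pdu // pipelined PDUs parked during collectDataOut
	wire    []pdu // stands in for the TCP stream
}

// nextPDU is what the main dispatch loop calls: pending first, then wire.
func (c *conn) nextPDU() (pdu, bool) {
	if len(c.pending) > 0 {
		p := c.pending[0]
		c.pending = c.pending[1:]
		return p, true
	}
	if len(c.wire) == 0 {
		return pdu{}, false
	}
	p := c.wire[0]
	c.wire = c.wire[1:]
	return p, true
}

// collectDataOut reads from the wire until a Data-Out PDU appears,
// queueing anything else (e.g. a pipelined SCSI command) for later.
func (c *conn) collectDataOut() (pdu, bool) {
	for len(c.wire) > 0 {
		p := c.wire[0]
		c.wire = c.wire[1:]
		if p.op == "data-out" {
			return p, true
		}
		c.pending = append(c.pending, p)
	}
	return pdu{}, false
}

func main() {
	c := &conn{wire: []pdu{{op: "scsi-cmd"}, {op: "data-out"}}}
	d, _ := c.collectDataOut()
	next, _ := c.nextPDU() // the queued command is not lost
	fmt.Println(d.op, next.op) // data-out scsi-cmd
}
```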
Bug found during WSL2 smoke test: mkfs.ext4 hangs at "Writing
superblocks" because inode table zeroing sends large writes that
exceed FirstBurstLength, triggering R2T while the kernel has already
queued the next command.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip InitiatorAlias in negotiation (was returning NotUnderstood)
- Capture TargetName in StageLoginOp direct-jump path (iscsiadm skips
security stage, sends CSG=LoginOp directly -- nil SCSIHandler crash)
- Add portalAddr to TargetServer for discovery responses (listener on
[::] is not routable from WSL2 clients)
- Add -portal flag to iscsi-target binary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add customizable plugin display names and weights
- Add weight field to JobTypeCapability proto message
- Modify ListKnownJobTypes() to return JobTypeInfo with display names and weights
- Modify ListPluginJobTypes() to return JobTypeInfo instead of string
- Sort plugins by weight (descending) then alphabetically
- Update admin API to return enriched job type metadata
- Update plugin UI template to display names instead of IDs
- Consolidate API by reusing existing function names instead of suffixed variants
* perf: optimize plugin job type capability lookup and add null-safe parsing
- Pre-calculate job type capabilities in a map to reduce O(n*m) nested loops
to O(n+m) lookup time in ListKnownJobTypes()
- Add parseJobTypeItem() helper function for null-safe job type item parsing
- Refactor plugin.templ to use parseJobTypeItem() in all job type access points
(hasJobType, applyInitialNavigation, ensureActiveNavigation, renderTopTabs)
- Deterministic capability resolution by using first worker's capability
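The map pre-calculation described above can be sketched as a single pass over all workers (capability and indexCapabilities are illustrative names):

```go
package main

import "fmt"

type capability struct {
	JobType     string
	DisplayName string
	Weight      int32
}

// indexCapabilities builds a jobType -> capability map in one pass,
// turning the nested O(n*m) scan into O(n+m) build-plus-lookup. The
// first worker's capability wins, giving deterministic resolution.
func indexCapabilities(workers [][]capability) map[string]capability {
	idx := make(map[string]capability)
	for _, caps := range workers {
		for _, c := range caps {
			if _, seen := idx[c.JobType]; !seen {
				idx[c.JobType] = c
			}
		}
	}
	return idx
}

func main() {
	workers := [][]capability{
		{{JobType: "vacuum", DisplayName: "Vacuum", Weight: 10}},
		{{JobType: "vacuum", Weight: 99}, {JobType: "balance", Weight: 5}},
	}
	idx := indexCapabilities(workers)
	fmt.Println(idx["vacuum"].DisplayName, len(idx)) // Vacuum 2
}
```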
* templ
* refactor: use parseJobTypeItem helper consistently in plugin.templ
Replace duplicated job type extraction logic at lines 1296-1298 with the
parseJobTypeItem() helper function for consistency and maintainability.
* improve: prefer richer capability metadata and add null-safety checks
- Improve capability selection in ListKnownJobTypes() to prefer capabilities
with non-empty DisplayName and higher Weight across all workers instead of
first-wins approach. Handles mixed-version clusters better.
- Add defensive null checks in renderJobTypeSummary() to safely access
parseJobTypeItem() result before property access
- Ensures malformed or missing entries won't break the rendering pipeline
* fix: preserve existing DisplayName when merging capabilities
Fix capability merge logic to respect existing DisplayName values:
- If existing has DisplayName but candidate doesn't, preserve existing
- If existing doesn't have DisplayName but candidate does, use candidate
- Only use Weight comparison if DisplayName status is equal
- Prevents higher-weight capabilities with empty DisplayName from
overriding capabilities with non-empty DisplayName
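The merge rule above can be written out directly (merge and the capability struct are illustrative, not the real code):

```go
package main

import "fmt"

type capability struct {
	DisplayName string
	Weight      int32
}

// merge applies the stated rule: a non-empty DisplayName always beats
// an empty one, and Weight only decides when both sides agree on
// DisplayName presence.
func merge(existing, candidate capability) capability {
	eHas, cHas := existing.DisplayName != "", candidate.DisplayName != ""
	switch {
	case eHas && !cHas:
		return existing
	case !eHas && cHas:
		return candidate
	default:
		if candidate.Weight > existing.Weight {
			return candidate
		}
		return existing
	}
}

func main() {
	kept := merge(capability{DisplayName: "Vacuum", Weight: 1}, capability{Weight: 9})
	fmt.Println(kept.DisplayName) // named capability survives a heavier unnamed one
}
```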
* feat: drop table location mapping support
Disable external metadata locations for S3 Tables and remove the table location
mapping index entirely. Table metadata must live under the table bucket paths,
so lookups no longer use mapping directories.
Changes:
- Remove mapping lookup and cache from bucket path resolution
- Reject metadataLocation in CreateTable and UpdateTable
- Remove mapping helpers and tests
* compile
* refactor
* fix: accept metadataLocation in S3 Tables API requests
We removed the external table location mapping feature, but still need to
accept and store metadataLocation values from clients like Trino. The mapping
feature was an internal implementation detail that mapped external buckets to
internal table paths. The metadataLocation field itself is part of the S3 Tables
API and should be preserved.
* fmt
* fix: handle MetadataLocation in UpdateTable requests
Mirror handleCreateTable behavior by updating metadata.MetadataLocation
when req.MetadataLocation is provided in UpdateTable requests. This ensures
table metadata location can be updated, not just set during creation.
* fix: move table location mappings to /etc/s3tables to avoid bucket name validation
Fixes #8362 - table location mappings were stored under /buckets/.table-location-mappings
which fails bucket name validation because it starts with a dot. Moving them to
/etc/s3tables resolves the migration error for upgrades.
Changes:
- Table location mappings now stored under /etc/s3tables
- Ensure parent /etc directory exists before creating /etc/s3tables
- Normal writes go to new location only (no legacy compatibility)
- Removed bucket name validation exception for old location
* refactor: simplify lookupTableLocationMapping by removing redundant mappingPath parameter
The mappingPath function parameter was redundant as the path can be derived
from mappingDir and bucket using path.Join. This simplifies the code and
reduces the risk of path mismatches between parameters.
* Fix S3 signature verification behind reverse proxies
When SeaweedFS is deployed behind a reverse proxy (e.g. nginx, Kong,
Traefik), AWS S3 Signature V4 verification fails because the Host header
the client signed with (e.g. "localhost:9000") differs from the Host
header SeaweedFS receives on the backend (e.g. "seaweedfs:8333").
This commit adds a new -s3.externalUrl parameter (and S3_EXTERNAL_URL
environment variable) that tells SeaweedFS what public-facing URL clients
use to connect. When set, SeaweedFS uses this host value for signature
verification instead of the Host header from the incoming request.
New parameter:
-s3.externalUrl (flag) or S3_EXTERNAL_URL (environment variable)
Example: -s3.externalUrl=http://localhost:9000
Example: S3_EXTERNAL_URL=https://s3.example.com
The environment variable is particularly useful in Docker/Kubernetes
deployments where the external URL is injected via container config.
The flag takes precedence over the environment variable when both are set.
At startup, the URL is parsed and default ports are stripped to match
AWS SDK behavior (port 80 for HTTP, port 443 for HTTPS), so
"http://s3.example.com:80" and "http://s3.example.com" are equivalent.
Bugs fixed:
- Default port stripping was removed by a prior PR, causing signature
mismatches when clients connect on standard ports (80/443)
- X-Forwarded-Port was ignored when X-Forwarded-Host was not present
- Scheme detection now uses proper precedence: X-Forwarded-Proto >
TLS connection > URL scheme > "http"
- Test expectations for standard port stripping were incorrect
- expectedHost field in TestSignatureV4WithForwardedPort was declared
but never actually checked (self-referential test)
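The scheme-detection precedence in the third bug fix amounts to one ordered fallback chain (requestScheme is an illustrative helper):

```go
package main

import "fmt"

// requestScheme applies the stated precedence: X-Forwarded-Proto, then
// the connection's TLS state, then the request URL's scheme, then "http".
func requestScheme(xForwardedProto string, tlsConn bool, urlScheme string) string {
	switch {
	case xForwardedProto != "":
		return xForwardedProto
	case tlsConn:
		return "https"
	case urlScheme != "":
		return urlScheme
	}
	return "http"
}

func main() {
	fmt.Println(requestScheme("https", false, "http")) // proxy header wins
}
```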
* Add Docker integration test for S3 proxy signature verification
Docker Compose setup with nginx reverse proxy to validate that the
-s3.externalUrl parameter (or S3_EXTERNAL_URL env var) correctly
resolves S3 signature verification when SeaweedFS runs behind a proxy.
The test uses nginx proxying port 9000 to SeaweedFS on port 8333,
with X-Forwarded-Host/Port/Proto headers set. SeaweedFS is configured
with -s3.externalUrl=http://localhost:9000 so it uses "localhost:9000"
for signature verification, matching what the AWS CLI signs with.
The test can be run with aws CLI on the host or without it by using
the amazon/aws-cli Docker image with --network host.
Test covers: create-bucket, list-buckets, put-object, head-object,
list-objects-v2, get-object, content round-trip integrity,
delete-object, and delete-bucket — all through the reverse proxy.
* Create s3-proxy-signature-tests.yml
* fix CLI
* fix CI
* Update s3-proxy-signature-tests.yml
* address comments
* Update Dockerfile
* add user
* no need for fuse
* Update s3-proxy-signature-tests.yml
* debug
* weed mini
* fix health check
* health check
* fix health checking
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
Add cosi.bucketClassParameters to allow passing arbitrary parameters
to the default BucketClass resource. This enables use cases like
tiered storage where a diskType parameter needs to be set on the
BucketClass to route objects to specific volume servers.
When bucketClassParameters is empty (default), the BucketClass is
rendered without a parameters block, preserving backward compatibility.
Signed-off-by: Kirill Ilin <stitch14@yandex.ru>
Co-authored-by: Claude <noreply@anthropic.com>
Some filesystems, such as XFS, may over-allocate disk space when using
volume preallocation. Remove this option from the default docker entrypoint
scripts to allow volumes to use only the necessary disk space.
Fixes: https://github.com/seaweedfs/seaweedfs/issues/6465#issuecomment-3964174718
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Make EC detection context aware
* Update register.go
* Speed up EC detection planning
* Add tests for EC detection planner
* optimizations
detection.go: extracted ParseCollectionFilter (exported) and fed it into the detection loop so detection and tracing share the same parsing/whitelisting logic. The detection loop now iterates over a sorted list of volume IDs, checks the context on every iteration, and only sets hasMore when unprocessed groups remain after hitting maxResults, keeping runtime bounded while still scheduling planned tasks before returning the results.
erasure_coding_handler.go: dropped the duplicated inline filter parsing in emitErasureCodingDetectionDecisionTrace in favor of erasurecodingtask.ParseCollectionFilter; the summary suffix logic now only accounts for the hasMore case that can actually happen.
detection_test.go: updated the helper topology builder to use master_pb.VolumeInformationMessage (matching the current protobuf types) and tightened the cancellation/max-results tests so they reliably exercise the detection logic (cancel before calling Detection, and provide enough disks so one result is produced before the limit).
* use working directory
* fix compilation
* fix compilation
* rename
* go vet
* fix getenv
* address comments, fix error
* Fix SFTP file upload failures with JWT filer tokens (issue #8425)
When JWT authentication is enabled for filer operations via jwt.filer_signing.*
configuration, SFTP server file upload requests were rejected because they lacked
JWT authorization headers.
Changes:
- Added JWT signing key and expiration fields to SftpServer struct
- Modified putFile() to generate and include JWT tokens in upload requests
- Enhanced SFTPServiceOptions with JWT configuration fields
- Updated SFTP command startup to load and pass JWT config to service
This allows SFTP uploads to authenticate with JWT-enabled filers, consistent
with how other SeaweedFS components (S3 API, file browser) handle filer auth.
Fixes #8425
* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
TestS3MultipartOperationsInheritPutObjectPermissions verifies that multipart
upload operations (CreateMultipartUpload, UploadPart, ListParts,
CompleteMultipartUpload, AbortMultipartUpload, ListMultipartUploads) work
correctly when a user has only s3:PutObject permission granted.
This test validates the behavior where multipart operations are implicitly
granted when s3:PutObject is authorized, as multipart upload is an
implementation detail of putting objects in S3.