Add counters (total, soft, hard, timeout) and wait-time histogram to
WALAdmission, wired through EngineMetrics and exported as Prometheus
metrics. Six new tests verify all code paths. Nil-safe for backwards
compatibility.
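
A minimal sketch of what nil-safe admission counters could look like. The type and method names (admissionMetrics, recordAdmit) are hypothetical and not taken from the codebase, and the wait-time histogram is reduced to a running sum to keep the example dependency-free:

```go
package main

import (
	"sync/atomic"
	"time"
)

// admissionMetrics is a hypothetical stand-in for the counters described
// above; the real fields are wired through EngineMetrics and not shown here.
type admissionMetrics struct {
	total, soft, hard, timeout atomic.Uint64
	waitNanos                  atomic.Uint64 // running sum; a real histogram would bucket these
}

// recordAdmit is safe to call on a nil receiver, so callers that never
// configured metrics do not need a guard at every call site.
func (m *admissionMetrics) recordAdmit(kind string, wait time.Duration) {
	if m == nil {
		return
	}
	m.total.Add(1)
	switch kind {
	case "soft":
		m.soft.Add(1)
	case "hard":
		m.hard.Add(1)
	case "timeout":
		m.timeout.Add(1)
	}
	if wait > 0 {
		m.waitNanos.Add(uint64(wait))
	}
}
```

Calling recordAdmit on a nil *admissionMetrics is a no-op, which is the property that keeps older callers working unchanged.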
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All three io_uring backends (iceber, giouring, raw) now require explicit
build tags; building with no tag yields the standard backend only. Each
backend registers its name via IOUringImpl so startup logs show the
compiled implementation alongside the requested/selected backend mode.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split iouring_linux.go into three build-tagged implementations:
1. iouring_iceber_linux.go (-tags iouring_iceber)
iceber/iouring-go library. Goroutine-based completion model.
Known -72% write regression due to per-op channel overhead.
2. iouring_giouring_linux.go (-tags iouring_giouring)
pawelgaczynski/giouring — direct liburing port. No goroutines,
no channels. Direct SQE/CQE ring manipulation. Kernel 6.0+.
3. iouring_raw_linux.go (default on Linux, no tags needed)
Raw syscall wrappers — io_uring_setup/io_uring_enter + mmap.
Zero dependencies. ~300 LOC. Kernel 5.6+.
Build commands for benchmarking:
go build -tags iouring_iceber ./... # option A
go build -tags iouring_giouring ./... # option B
go build ./... # option C (raw, default)
go build -tags no_iouring ./... # disable all io_uring
All variants implement the same BatchIO interface. Cross-compile
verified for all four tag combinations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The iceber/iouring-go SubmitRequests returns a RequestSet interface
which cannot be ranged over directly. Use resultSet.Done() to wait
for all completions, then iterate resultSet.Requests().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace UseIOUring bool with IOBackend IOBackendMode (tri-state):
- "standard" (default): sequential pread/pwrite/fdatasync
- "auto": try io_uring, fall back to standard with warning log
- "io_uring": require io_uring, fail startup if unavailable
NewIOUring now returns ErrIOUringUnavailable instead of silently
falling back — callers decide whether to fail or fall back based
on the requested mode. All mode transitions are logged:
io backend: requested=auto selected=standard reason=...
io backend: requested=io_uring selected=io_uring
CLI: --io-backend=standard|auto|io_uring added to iscsi-target.
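
The tri-state selection above can be sketched roughly as follows. IOBackendMode and ErrIOUringUnavailable are named in this message; the constant names, selectBackend, and the probe parameter (standing in for NewIOUring) are assumptions for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

type IOBackendMode string

const (
	IOBackendStandard IOBackendMode = "standard"
	IOBackendAuto     IOBackendMode = "auto"
	IOBackendIOUring  IOBackendMode = "io_uring"
)

var ErrIOUringUnavailable = errors.New("io_uring unavailable")

// selectBackend is a hypothetical helper. probe stands in for NewIOUring and
// returns nil when an io_uring ring could be created.
func selectBackend(requested IOBackendMode, probe func() error) (IOBackendMode, error) {
	switch requested {
	case "", IOBackendStandard:
		return IOBackendStandard, nil
	case IOBackendAuto:
		if err := probe(); err != nil {
			// auto: log the reason and fall back instead of failing startup.
			fmt.Printf("io backend: requested=auto selected=standard reason=%v\n", err)
			return IOBackendStandard, nil
		}
		return IOBackendIOUring, nil
	case IOBackendIOUring:
		if err := probe(); err != nil {
			// io_uring: the caller asked for it explicitly, so surface the error.
			return "", fmt.Errorf("io backend: requested=io_uring: %w", err)
		}
		return IOBackendIOUring, nil
	default:
		return "", fmt.Errorf("unknown io backend %q", requested)
	}
}
```

The point of the design is that the constructor only reports availability; the decision to fail or fall back lives in one place, keyed on the requested mode.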
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. HIGH: LinkedWriteFsync now uses SubmitLinkRequests (IOSQE_IO_LINK)
instead of SubmitRequests, ensuring write+fdatasync execute as a
linked chain in the kernel. Falls back to sequential on error.
2. HIGH: PreadBatch/PwriteBatch chunk ops by ring capacity to prevent
"too many requests" rejection when dirty map exceeds ring size (256).
3. MED: CloseBatchIO() added to Flusher, called in BlockVol.Close()
after final flush to release io_uring ring / kernel resources.
4. MED: Sync parity — both standard and io_uring paths now use
fdatasync (via platform-specific fdatasync_linux.go / fdatasync_other.go).
Standard path previously used fsync; now matches io_uring semantics.
On non-Linux, fdatasync falls back to fsync (only option available).
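
The chunking in item 2 can be illustrated with a small sketch. The op type and submitInChunks are hypothetical names; submit stands in for whatever call hands one batch to the ring:

```go
package main

import "fmt"

// op is a hypothetical positioned I/O request.
type op struct {
	Offset int64
	Buf    []byte
}

// submitInChunks splits ops into slices of at most ringSize entries and
// submits them one after another, so no single submission exceeds the ring.
func submitInChunks(ops []op, ringSize int, submit func([]op) error) error {
	if ringSize <= 0 {
		return fmt.Errorf("ring size must be positive, got %d", ringSize)
	}
	for start := 0; start < len(ops); start += ringSize {
		end := start + ringSize
		if end > len(ops) {
			end = len(ops)
		}
		if err := submit(ops[start:end]); err != nil {
			return fmt.Errorf("chunk [%d:%d]: %w", start, end, err)
		}
	}
	return nil
}
```

With a ring size of 256, a batch of 600 ops is submitted as chunks of 256, 256, and 88.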
10 batchio tests, all blockvol tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add iouring_linux.go (build-tagged linux && !no_iouring) using
iceber/iouring-go for batched pread/pwrite/fdatasync. Includes
linked write+fsync chain for group commit optimization.
iouring_other.go provides silent fallback to standard on non-Linux.
blockvol.go wires UseIOUring config flag through to flusher BatchIO.
NewIOUring gracefully falls back if kernel lacks io_uring support.
10 batchio tests, all blockvol tests pass unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New package batchio/ with BatchIO interface (PreadBatch, PwriteBatch,
Fsync, LinkedWriteFsync) and standard sequential implementation.
Flusher refactored to use BatchIO: WAL header reads, WAL entry reads,
and extent writes are now batched through the interface. With the
default NewStandard() backend, behavior is identical to before.
UseIOUring config field added for future io_uring opt-in (Linux 5.6+).
9 interface tests, all existing blockvol tests pass unchanged.
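
For orientation, a sketch of what such an interface and its sequential implementation might look like. Only the four method names and NewStandard come from this message; the Op type, the parameter lists, and roundTrip are assumptions:

```go
package main

import (
	"fmt"
	"os"
)

// Op is one positioned read or write. The field layout is an assumption.
type Op struct {
	Buf    []byte
	Offset int64
}

// BatchIO uses the method names from this message; signatures are assumed.
type BatchIO interface {
	PreadBatch(f *os.File, ops []Op) error
	PwriteBatch(f *os.File, ops []Op) error
	Fsync(f *os.File) error
	LinkedWriteFsync(f *os.File, buf []byte, offset int64) error
}

type standard struct{}

// NewStandard returns the sequential backend: one syscall per op, in order.
func NewStandard() BatchIO { return standard{} }

func (standard) PreadBatch(f *os.File, ops []Op) error {
	for i, op := range ops {
		if _, err := f.ReadAt(op.Buf, op.Offset); err != nil {
			return fmt.Errorf("pread op %d: %w", i, err)
		}
	}
	return nil
}

func (standard) PwriteBatch(f *os.File, ops []Op) error {
	for i, op := range ops {
		if _, err := f.WriteAt(op.Buf, op.Offset); err != nil {
			return fmt.Errorf("pwrite op %d: %w", i, err)
		}
	}
	return nil
}

func (standard) Fsync(f *os.File) error { return f.Sync() }

// LinkedWriteFsync has no kernel-side linking here: write, then sync.
func (standard) LinkedWriteFsync(f *os.File, buf []byte, offset int64) error {
	if _, err := f.WriteAt(buf, offset); err != nil {
		return err
	}
	return f.Sync()
}

// roundTrip writes two ops to a temp file and reads them back.
func roundTrip() (string, error) {
	f, err := os.CreateTemp("", "batchio-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	bio := NewStandard()
	in := []Op{{Buf: []byte("hello"), Offset: 0}, {Buf: []byte("world"), Offset: 5}}
	if err := bio.PwriteBatch(f, in); err != nil {
		return "", err
	}
	out := []Op{{Buf: make([]byte, 5), Offset: 0}, {Buf: make([]byte, 5), Offset: 5}}
	if err := bio.PreadBatch(f, out); err != nil {
		return "", err
	}
	return string(out[0].Buf) + string(out[1].Buf), nil
}
```

Because callers only see the interface, a batched backend can later replace the sequential one without touching the flusher.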
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sparse delta-file snapshots with copy-on-write in the flusher.
Zero write-path overhead when no snapshot is active.
New: snapshot.go (SnapshotBitmap, SnapshotHeader, delta file I/O)
Modified: flusher.go (flushMu, CoW phase in FlushOnce, PauseAndFlush)
Modified: blockvol.go (Create/Read/Delete/Restore/ListSnapshots, recovery)
Modified: wal_writer.go (Reset for snapshot restore)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ALUA (Asymmetric Logical Unit Access) support to the iSCSI target,
enabling dm-multipath on Linux to automatically detect path state changes
and reroute I/O during HA failover without initiator-side intervention.
- ALUAProvider interface with implicit ALUA (TPGS=0x01)
- INQUIRY byte 5 TPGS bits, VPD 0x83 with NAA+TPG+RTP descriptors
- REPORT TARGET PORT GROUPS handler (MAINTENANCE IN SA=0x0A)
- MAINTENANCE OUT rejection (implicit-only, no SET TPG)
- Standby write rejection (NOT_READY ASC=04h ASCQ=0Bh)
- RoleNone maps to Active/Optimized (standalone single-node compatibility)
- NAA-6 device identifier derived from volume UUID
- -tpg-id flag with [1,65535] validation
- dm-multipath config + setup script (group_by_tpg, ALUA prio)
- 12 unit tests + 16 QA adversarial tests + 4 integration tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test harness for running blockvol iSCSI tests on WSL2 and remote nodes
(m01/M02). Includes Node (SSH/local exec), ISCSIClient (discover/login/
logout), WeedTarget (weed volume server lifecycle), and test suites for
smoke, stress, crash recovery, chaos, perf benchmarks, and apps (fio/dd).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ProcessBlockVolumeAssignments to BlockVolumeStore and wire
AssignmentSource/AssignmentCallback into the heartbeat collector's
Run() loop. Assignments are fetched and applied each tick after
status collection.
Bug fixes:
- BUG-CP4B3-1: TOCTOU between GetBlockVolume and HandleAssignment.
Added withVolume() helper that holds RLock across lookup+operation,
preventing RemoveBlockVolume from closing the volume mid-assignment.
- BUG-CP4B3-2: Data race on callback fields read by Run() goroutine.
Made StatusCallback/AssignmentSource/AssignmentCallback private,
added cbMu mutex and SetXxx() setter methods. Lock held only for
load/store, not during callback execution.
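
The locking pattern from BUG-CP4B3-2 can be sketched as below. cbMu is named above; the collector type, the callback signature, and tick are simplified assumptions:

```go
package main

import "sync"

// collector is a hypothetical, reduced version of the heartbeat collector.
type collector struct {
	cbMu           sync.Mutex
	statusCallback func(status string)
}

// SetStatusCallback stores the callback under cbMu.
func (c *collector) SetStatusCallback(fn func(status string)) {
	c.cbMu.Lock()
	c.statusCallback = fn
	c.cbMu.Unlock()
}

// tick loads the callback under cbMu, then releases the lock before calling
// it, so a slow or re-entrant callback cannot block or deadlock the setters.
func (c *collector) tick(status string) {
	c.cbMu.Lock()
	fn := c.statusCallback
	c.cbMu.Unlock()
	if fn != nil {
		fn(status)
	}
}
```

Since the lock is not held during execution, a callback may itself call SetStatusCallback without deadlocking.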
7 dev tests + 13 QA adversarial tests = 20 new tests.
972 total unit tests, all passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Boundary tests for RoleFromWire, LeaseTTLToWire overflow/clamp/negative,
ToBlockVolumeInfoMessage with primary/stale/closed/concurrent volumes,
BlockVolumeAssignment roundtrip, and heartbeat collection edge cases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SimulatedMaster test helper + 20 assignment sequence tests (8 sequence,
5 failover, 5 adversarial, 2 status). Add BlockVolumeStatus struct and
Status() method. Includes QA test files for CP1-CP4a. 940 total unit tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add master-driven lifecycle operations: promotion, demotion, rebuild,
and split-brain prevention. All testable on Windows with mock TCP.
New files:
- promotion.go: HandleAssignment (single entry point for role changes),
promote (Replica/None -> Primary with durable epoch), demote
(Primary -> Draining -> Stale with drain timeout)
- rebuild.go: RebuildServer (WAL catch-up + full extent streaming),
StartRebuild client (WAL catch-up with full extent fallback,
two-phase rebuild with second catch-up for concurrent writes)
Modified:
- wal_writer.go: ScanFrom() method, ErrWALRecycled sentinel
- repl_proto.go: rebuild message types + RebuildRequest encode/decode
- blockvol.go: assignMu, drainTimeout, rebuildServer fields;
HandleAssignment/StartRebuildServer/StopRebuildServer methods;
rebuild server stop in Close()
- dirty_map.go: Clear() method for full extent rebuild
32 new tests covering WAL scan, promotion/demotion, rebuild server,
rebuild client, split-brain prevention, and full lifecycle scenarios.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Primary ships WAL entries to replica over TCP (data channel), confirms
durability via barrier RPC (control channel). SyncCache runs local fsync
and replica barrier in parallel via MakeDistributedSync. When replica is
unreachable, shipper enters permanent degraded mode and falls back to
local-only sync (Phase 3 behavior).
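
One plausible reading of the parallel sync with degraded fallback, as a sketch. The distSync type and its fields are hypothetical; the actual MakeDistributedSync signature is not shown in this message, and treating a failed barrier as "degrade and succeed locally" is an assumption:

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

var errReplicaUnreachable = errors.New("replica unreachable")

// distSync is a hypothetical holder for the two sync legs.
type distSync struct {
	localSync      func() error
	replicaBarrier func() error
	degraded       atomic.Bool
}

// Sync runs the local sync and the replica barrier in parallel. Once the
// replica has failed, it stays degraded and only the local leg runs.
func (d *distSync) Sync() error {
	if d.degraded.Load() {
		return d.localSync()
	}
	var wg sync.WaitGroup
	var localErr, barrierErr error
	wg.Add(2)
	go func() { defer wg.Done(); localErr = d.localSync() }()
	go func() { defer wg.Done(); barrierErr = d.replicaBarrier() }()
	wg.Wait()
	if localErr != nil {
		return localErr // local durability is mandatory
	}
	if barrierErr != nil {
		d.degraded.Store(true) // assumed policy: degrade, keep serving locally
	}
	return nil
}
```

Running both legs concurrently means the sync latency is the maximum of the two, not their sum.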
Key design: two separate TCP ports (data+control), contiguous LSN
enforcement, epoch equality check, WAL-full retry on replica,
cond.Wait-based barrier with configurable timeout, BarrierFsyncFailed
status code. Close lifecycle: shipper → receiver → drain → committer →
flusher → fd.
New files: repl_proto.go, wal_shipper.go, replica_apply.go,
replica_barrier.go, dist_group_commit.go
Modified: blockvol.go, blockvol_test.go
27 dev tests + 21 QA tests = 48 new tests; 889 total (609 engine + 280
iSCSI), all passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 categories: PDU, Params, Login, Discovery, SCSI, DataIO, Session,
Target, Integration. 2,183 lines. All 229 tests pass (164 dev + 55 QA).
No new production bugs found.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Discovery session nil handler crash: reject SCSI commands with
Reject PDU when s.scsi is nil (discovery sessions have no target).
2. CmdSN window enforcement: validate incoming CmdSN against
[ExpCmdSN, MaxCmdSN] using serial arithmetic. Drop out-of-window
commands per RFC 7143 section 4.2.2.1.
3. Data-Out buffer offset validation: enforce BufferOffset == received
for ordered data (DataPDUInOrder=Yes). Prevents silent corruption
from out-of-order or overlapping data.
4. ImmediateData enforcement: reject immediate data in SCSI command
PDU when negotiated ImmediateData=No.
5. UNMAP descriptor length alignment: reject blockDescLen not a
multiple of 16 bytes.
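
Items 2 and 5 reduce to small predicates. The sketch below assumes RFC 1982-style serial arithmetic on 32-bit sequence numbers; the function names are hypothetical:

```go
package main

// serialLE reports whether a <= b under 32-bit serial number arithmetic,
// which stays correct when the counters wrap around 2^32.
func serialLE(a, b uint32) bool {
	return int32(b-a) >= 0
}

// cmdSNInWindow reports whether cmdSN lies in [expCmdSN, maxCmdSN].
func cmdSNInWindow(cmdSN, expCmdSN, maxCmdSN uint32) bool {
	return serialLE(expCmdSN, cmdSN) && serialLE(cmdSN, maxCmdSN)
}

// unmapDescLenValid reports whether an UNMAP block descriptor data length
// is a whole number of 16-byte descriptors.
func unmapDescLenValid(blockDescLen int) bool {
	return blockDescLen >= 0 && blockDescLen%16 == 0
}
```

The subtraction-then-sign trick is what lets a window such as [0xFFFFFFFE, 2] accept CmdSN 0xFFFFFFFF, 0, and 1 after wraparound.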
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Linux kernel iSCSI initiator pipelines multiple SCSI commands on
the same TCP connection (command queuing). When a write needs R2T for
data beyond the immediate portion, collectDataOut may read a pipelined
SCSI command instead of the expected Data-Out PDU.
Fix: queue non-Data-Out PDUs received during collectDataOut into a
pending buffer. The main dispatch loop drains pending PDUs before
reading from the connection. This correctly handles interleaved
commands during multi-PDU write transfers.
Bug found during WSL2 smoke test: mkfs.ext4 hangs at "Writing
superblocks" because inode table zeroing sends large writes that
exceed FirstBurstLength, triggering R2T while the kernel has already
queued the next command.
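
The pending-buffer fix can be sketched as follows. The pdu and session types and the read function are simplified assumptions; only the opcode values (0x01 SCSI Command, 0x05 SCSI Data-Out) come from the iSCSI specification:

```go
package main

const (
	opSCSICommand = 0x01
	opDataOut     = 0x05
)

// pdu is a hypothetical, reduced PDU: opcode plus data segment.
type pdu struct {
	opcode  byte
	payload []byte
}

// session reads PDUs from the connection via read and parks pipelined
// non-Data-Out PDUs in pending.
type session struct {
	read    func() (pdu, error)
	pending []pdu
}

// collectDataOut gathers want bytes of Data-Out payload. Any other PDU that
// arrives in between is queued instead of being misread as data.
func (s *session) collectDataOut(want int) ([]byte, error) {
	var data []byte
	for len(data) < want {
		p, err := s.read()
		if err != nil {
			return nil, err
		}
		if p.opcode != opDataOut {
			s.pending = append(s.pending, p)
			continue
		}
		data = append(data, p.payload...)
	}
	return data, nil
}

// next is what the dispatch loop calls: drain pending first, then read.
func (s *session) next() (pdu, error) {
	if len(s.pending) > 0 {
		p := s.pending[0]
		s.pending = s.pending[1:]
		return p, nil
	}
	return s.read()
}
```

Draining the queue before touching the connection preserves the order in which the initiator sent its commands.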
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip InitiatorAlias in negotiation (was returning NotUnderstood)
- Capture TargetName in StageLoginOp direct-jump path (iscsiadm skips
security stage, sends CSG=LoginOp directly -- nil SCSIHandler crash)
- Add portalAddr to TargetServer for discovery responses (listener on
[::] is not routable from WSL2 clients)
- Add -portal flag to iscsi-target binary
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
If local EC scrubbing hits needles whose chunk locations reside entirely
in local shards, we can fully reconstruct them and check their CRCs for
data integrity.
* fix LevelDB panic on lazy reload
Implemented a thread-safe reload mechanism using double-checked
locking and a retry loop in Get, Put, and Delete. Added a concurrency
test to verify the fix and prevent regressions.
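
A minimal sketch of double-checked locking with a retry loop, using a map in place of a LevelDB handle. The lazyDB type, its fields, and unload are hypothetical and only illustrate the pattern:

```go
package main

import "sync"

// lazyDB is a hypothetical store whose backing db may be unloaded (set to
// nil) and lazily reopened. open must return a non-nil map.
type lazyDB struct {
	mu    sync.RWMutex
	db    map[string]string
	open  func() map[string]string
	loads int
}

// ensureLoaded checks under the read lock first, then re-checks under the
// write lock so that only one goroutine performs the reload.
func (l *lazyDB) ensureLoaded() {
	l.mu.RLock()
	loaded := l.db != nil
	l.mu.RUnlock()
	if loaded {
		return
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.db == nil { // second check: another goroutine may have reloaded
		l.db = l.open()
		l.loads++
	}
}

// Get retries if the db was unloaded between ensureLoaded and the read lock.
func (l *lazyDB) Get(key string) (string, bool) {
	for {
		l.ensureLoaded()
		l.mu.RLock()
		if l.db == nil {
			l.mu.RUnlock()
			continue
		}
		v, ok := l.db[key]
		l.mu.RUnlock()
		return v, ok
	}
}

// unload simulates the idle-timeout path that drops the handle.
func (l *lazyDB) unload() {
	l.mu.Lock()
	l.db = nil
	l.mu.Unlock()
}
```

The retry loop covers the window in which the handle is dropped after the load check but before the read lock is taken.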
Fixes #8269
* refactor: use helper for leveldb fix and remove deprecated ioutil
* fix: prevent deadlock by using getFromDb helper
Extracted DB lookup to internal helper to avoid recursive RLock in Put/Delete methods.
Updated Get to use the helper as well.
* fix: resolve syntax error and commit deadlock prevention
Fixed a duplicate function declaration syntax error.
Verified that getFromDb helper correctly prevents recursive RLock scenarios.
* refactor: remove redundant timeout checks
Removed nested `if m.ldbTimeout > 0` checks in Get, Put, and Delete
methods as suggested in PR review.
* Fix disk error handling in vacuum compaction
When a disk reports IO errors during vacuum compaction (e.g., 'read /mnt/d1/weed/oc_xyz.dat: input/output error'), the vacuum task should signal the error to the master so it can:
1. Drop the faulty volume replica
2. Rebuild the replica from healthy copies
Changes:
- Add checkReadWriteError() calls in vacuum read paths (ReadNeedleBlob, ReadData, ScanVolumeFile) to flag EIO errors in volume.lastIoError
- Preserve error wrapping using %w format instead of %v so EIO propagates correctly
- The existing heartbeat logic will detect lastIoError and remove the bad volume
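
The difference between %v and %w in the second change can be shown with a short sketch. The helper names and the wrapping message are hypothetical; the path is the example from above:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// diskErr builds the kind of error a failing read returns.
func diskErr() error {
	return &os.PathError{Op: "read", Path: "/mnt/d1/weed/oc_xyz.dat", Err: syscall.EIO}
}

// wrapV flattens the cause into text; the error chain is lost.
func wrapV(err error) error { return fmt.Errorf("read needle blob: %v", err) }

// wrapW keeps the cause in the chain so errors.Is can still find it.
func wrapW(err error) error { return fmt.Errorf("read needle blob: %w", err) }

// isEIO reports whether err or anything it wraps is syscall.EIO.
func isEIO(err error) bool { return errors.Is(err, syscall.EIO) }
```

Here isEIO(wrapV(diskErr())) is false while isEIO(wrapW(diskErr())) is true, which is why a check further up the call stack only sees the I/O error when %w is used.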
Fixes issue #8237
* error
* s3: fix health check endpoints returning 404 for HEAD requests #8243