Debugging Breakthrough: EOF Exception Analysis
Summary
After extensive debugging, we've identified and partially fixed the root cause of the "EOFException: Still have: 78 bytes left" error in Parquet file reads.
Root Cause Analysis
Initial Hypothesis ❌ (Incorrect)
- Thought: File size calculation was wrong (contentLength off by 78 bytes)
- Reality: contentLength was always correct at 1275 bytes
Second Hypothesis ❌ (Partially Correct)
- Thought: FSDataOutputStream.getPos() wasn't delegating to SeaweedOutputStream.getPos()
- Reality: The override was working, but there was a deeper issue
Third Hypothesis ✅ (ROOT CAUSE)
- Problem: SeaweedInputStream.read(ByteBuffer buf) was returning 0 bytes for inline content
- Location: Lines 127-129 in SeaweedInputStream.java
- Bug: When copying inline content from the protobuf entry, bytesRead was never updated
// BEFORE (BUGGY):
if (this.position < Integer.MAX_VALUE && (this.position + len) <= entry.getContent().size()) {
    entry.getContent().substring((int) this.position, (int) (this.position + len)).copyTo(buf);
    // bytesRead stays 0! <-- BUG
} else {
    bytesRead = SeaweedRead.read(...);
}
return (int) bytesRead; // Returns 0 when inline content was copied!
// AFTER (FIXED):
if (this.position < Integer.MAX_VALUE && (this.position + len) <= entry.getContent().size()) {
    entry.getContent().substring((int) this.position, (int) (this.position + len)).copyTo(buf);
    bytesRead = len; // FIX: Update bytesRead after inline copy
} else {
    bytesRead = SeaweedRead.read(...);
}
return (int) bytesRead; // Now returns correct value!
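The same pattern can be reproduced outside SeaweedFS. The sketch below is a minimal, self-contained illustration and not the actual SeaweedInputStream code: the class InlineReadSketch and its method names are hypothetical, a plain byte[] stands in for the protobuf entry content, and advancing the position after the copy is an assumption of the sketch. It puts the buggy and fixed variants side by side.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for the inline-content branch; not the real SeaweedFS class. */
public class InlineReadSketch {
    private final byte[] content; // assumption: stands in for entry.getContent()
    private long position;

    public InlineReadSketch(byte[] content, long position) {
        this.content = content;
        this.position = position;
    }

    /** Copies as many inline bytes as fit into buf and returns that count. */
    private int copyInline(ByteBuffer buf) {
        int available = (int) Math.max(0, content.length - position);
        int len = Math.min(buf.remaining(), available);
        buf.put(content, (int) position, len);
        return len;
    }

    /** Buggy variant: the bytes land in buf, but the reported count stays 0. */
    public int readBuggy(ByteBuffer buf) {
        long bytesRead = 0;
        if (position < content.length) {
            copyInline(buf); // return value ignored, so bytesRead is never updated
        }
        return (int) bytesRead;
    }

    /** Fixed variant: reports the number of bytes copied, or -1 at end of content. */
    public int readFixed(ByteBuffer buf) {
        if (position >= content.length) {
            return -1;
        }
        int len = copyInline(buf);
        position += len; // assumption of this sketch: the read advances the position
        return len;
    }
}
```

With a 1275-byte array and a starting position of 1197, readBuggy reports 0 for a 78-byte buffer even though the buffer has been filled, while readFixed reports 78.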
Why This Caused EOF Errors
- Parquet's readFully() loop, in simplified form:
while (remaining > 0) {
    int read = inputStream.read(buffer, offset, remaining);
    if (read == -1 || read == 0) {
        throw new EOFException("Still have: " + remaining + " bytes left");
    }
    remaining -= read;
}
- Our bug: When read() returned 0 instead of the number of bytes copied, Parquet treated the stream as exhausted
- Result: An EOF exception reporting exactly the number of bytes that were never accounted for
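To see the failure mode in isolation, the following sketch runs a read-fully loop against a stream that always reports 0 bytes. It follows the simplified loop described above and is not copied from Parquet's source; ReadFullySketch, ZeroReportingStream, and tryRead are hypothetical names introduced only for this illustration.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullySketch {
    /** Read-fully loop in the simplified form described above; not Parquet's code. */
    public static void readFully(InputStream in, byte[] buffer, int offset, int len)
            throws IOException {
        int remaining = len;
        while (remaining > 0) {
            int read = in.read(buffer, offset, remaining);
            if (read <= 0) {
                // A count of 0 or -1 is treated as "no more data"
                throw new EOFException("Still have: " + remaining + " bytes left");
            }
            offset += read;
            remaining -= read;
        }
    }

    /** Hypothetical stream that always reports 0 bytes read, like the buggy inline branch. */
    public static class ZeroReportingStream extends InputStream {
        @Override
        public int read() {
            return 0; // single-byte read; unused in this sketch
        }

        @Override
        public int read(byte[] b, int off, int len) {
            return 0; // pretends nothing was read
        }
    }

    /** Returns the exception message, or null if the read completed. */
    public static String tryRead(InputStream in, int len) {
        try {
            readFully(in, new byte[len], 0, len);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }
}
```

Calling tryRead(new ZeroReportingStream(), 78) yields the message "Still have: 78 bytes left", while a well-behaved stream such as a ByteArrayInputStream over 78 bytes completes without error. Note that for java.io.InputStream, returning 0 from read(byte[], int, int) when len is greater than zero already violates the documented contract, which is why callers tend to treat it as a dead stream.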
Fixes Implemented
1. SeaweedInputStream.java (PRIMARY FIX)
- File: other/java/client/src/main/java/seaweedfs/client/SeaweedInputStream.java
- Change: Set bytesRead = len after inline content copy
- Impact: Ensures read() always returns the correct number of bytes read
2. SeaweedOutputStream.java (DIAGNOSTIC)
- File: other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java
- Change: Added comprehensive logging to getPos() with stack traces
- Purpose: Track who calls getPos() and what positions are returned
- Finding: All positions appeared correct in tests
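One way to capture who calls getPos() is to inspect the stack at the call site. The sketch below is an assumption about how such tracking could look, not the logging that was actually added to SeaweedOutputStream: GetPosTraceSketch is a hypothetical class, and it records entries in an in-memory list instead of a logger so that it stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical position tracker that records the caller of each getPos() call. */
public class GetPosTraceSketch {
    private long position;
    private final List<String> callLog = new ArrayList<>();

    /** Simulates bytes being written. */
    public void advance(long n) {
        position += n;
    }

    public long getPos() {
        // Element 0 of the trace is this method; element 1 is its caller.
        StackTraceElement[] stack = new Throwable().getStackTrace();
        String caller = stack.length > 1
                ? stack[1].getClassName() + "." + stack[1].getMethodName()
                : "<unknown>";
        callLog.add("getPos() -> " + position + " called by " + caller);
        return position;
    }

    public List<String> callLog() {
        return callLog;
    }
}
```

Capturing a stack trace on every call is comparatively expensive, so tracking of this kind is best kept as a temporary diagnostic.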
3. SeaweedFileSystem.java (ALREADY FIXED)
- File: other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java
- Change: Override FSDataOutputStream.getPos() to delegate to SeaweedOutputStream
- Verification: Confirmed working with WARN logs
4. Unit Test Added
- File: other/java/client/src/test/java/seaweedfs/client/SeaweedStreamIntegrationTest.java
- Test: testRangeReads()
- Coverage:
  - Range reads at specific offsets (like Parquet footer reads)
  - Sequential readFully() pattern that was failing
  - Multiple small reads vs. large reads
  - The exact 78-byte read at offset 1197 that was failing
Test Results
Before Fix
EOFException: Reached the end of stream. Still have: 78 bytes left
- contentLength: 1275 (correct!)
- reads: position=1197 len=78 bytesRead=0 ❌
After Fix
No EOF exceptions observed
- contentLength: 1275 (correct)
- reads: position=1197 len=78 bytesRead=78 ✅
Why The 78-Byte Offset Was Consistent
The "78 bytes" figure wasn't random: it was the length of the last read() call, which consistently returned 0 instead of the number of bytes copied:
- File size: 1275 bytes
- Last read: position=1197, len=78
- Expected: bytesRead=78
- Actual (before fix): bytesRead=0
- Parquet: "I need 78 more bytes but got EOF!" → EOFException
Commits
- e95f7061a: Fix inline content read bug + add unit test
- c10ae054b: Add SeaweedInputStream constructor logging
- 5c30bc8e7: Add detailed getPos() tracking with stack traces
Next Steps
- Push changes to your branch
- Run CI tests to verify fix works in GitHub Actions
- Monitor for any remaining edge cases
- Remove debug logging once confirmed stable (or reduce to DEBUG level)
- Backport to other SeaweedFS client versions if needed
Key Learnings
- Read the return value: Always ensure functions return the correct value, not just perform side effects
- Buffer operations need tracking: When copying data to buffers, track how much was copied
- Stack traces help: Knowing WHO calls a function helps understand WHEN bugs occur
- Consistent offsets = systematic bug: The 78-byte offset being consistent pointed to a logic error, not data corruption
- Downloaded file was perfect: The fact that parquet-tools could read the downloaded file proved the bug was in the read path, not the write path
Files Modified
other/java/client/src/main/java/seaweedfs/client/SeaweedInputStream.java
other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java
other/java/client/src/main/java/seaweedfs/client/SeaweedRead.java
other/java/client/src/test/java/seaweedfs/client/SeaweedStreamIntegrationTest.java
other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java
other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystemStore.java
other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedHadoopOutputStream.java
References
- Issue: Spark integration tests failing with EOF exception
- Parquet version: 1.16.0
- Spark version: 3.5.0
- SeaweedFS client version: 3.80.1-SNAPSHOT