diff --git a/test/java/spark/DEBUGGING_BREAKTHROUGH.md b/test/java/spark/DEBUGGING_BREAKTHROUGH.md new file mode 100644 index 000000000..909842ad8 --- /dev/null +++ b/test/java/spark/DEBUGGING_BREAKTHROUGH.md @@ -0,0 +1,151 @@ +# Debugging Breakthrough: EOF Exception Analysis + +## Summary +After extensive debugging, we've identified and partially fixed the root cause of the `EOFException: Still have: 78 bytes left` error in Parquet file reads. + +## Root Cause Analysis + +### Initial Hypothesis ❌ (Incorrect) +- **Thought**: File size calculation was wrong (`contentLength` off by 78 bytes) +- **Reality**: `contentLength` was **always correct** at 1275 bytes + +### Second Hypothesis ❌ (Partially Correct) +- **Thought**: `FSDataOutputStream.getPos()` wasn't delegating to `SeaweedOutputStream.getPos()` +- **Reality**: The override **was working**, but there was a deeper issue + +### Third Hypothesis ✅ (ROOT CAUSE) +- **Problem**: `SeaweedInputStream.read(ByteBuffer buf)` was returning 0 bytes for inline content +- **Location**: Line 127-129 in `SeaweedInputStream.java` +- **Bug**: When copying inline content from protobuf entry, `bytesRead` was never updated + +```java +// BEFORE (BUGGY): +if (this.position < Integer.MAX_VALUE && (this.position + len) <= entry.getContent().size()) { + entry.getContent().substring((int) this.position, (int) (this.position + len)).copyTo(buf); + // bytesRead stays 0! <-- BUG +} else { + bytesRead = SeaweedRead.read(...); +} +return (int) bytesRead; // Returns 0 when inline content was copied! +``` + +```java +// AFTER (FIXED): +if (this.position < Integer.MAX_VALUE && (this.position + len) <= entry.getContent().size()) { + entry.getContent().substring((int) this.position, (int) (this.position + len)).copyTo(buf); + bytesRead = len; // FIX: Update bytesRead after inline copy +} else { + bytesRead = SeaweedRead.read(...); +} +return (int) bytesRead; // Now returns correct value! +``` + +## Why This Caused EOF Errors + +1. **Parquet's readFully() loop**: + ```java + while (remaining > 0) { + int read = inputStream.read(buffer, offset, remaining); + if (read == -1 || read == 0) { + throw new EOFException("Still have: " + remaining + " bytes left"); + } + remaining -= read; + } + ``` + +2. **Our bug**: When `read()` returned 0 instead of the actual bytes copied, Parquet thought the stream was done +3. **Result**: EOF exception with exactly the number of bytes that weren't reported + +## Fixes Implemented + +### 1. SeaweedInputStream.java (PRIMARY FIX) +- **File**: `other/java/client/src/main/java/seaweedfs/client/SeaweedInputStream.java` +- **Change**: Set `bytesRead = len` after inline content copy +- **Impact**: Ensures `read()` always returns the correct number of bytes read + +### 2. SeaweedOutputStream.java (DIAGNOSTIC) +- **File**: `other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java` +- **Change**: Added comprehensive logging to `getPos()` with stack traces +- **Purpose**: Track who calls `getPos()` and what positions are returned +- **Finding**: All positions appeared correct in tests + +### 3. SeaweedFileSystem.java (ALREADY FIXED) +- **File**: `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java` +- **Change**: Override `FSDataOutputStream.getPos()` to delegate to `SeaweedOutputStream` +- **Verification**: Confirmed working with WARN logs + +### 4. Unit Test Added +- **File**: `other/java/client/src/test/java/seaweedfs/client/SeaweedStreamIntegrationTest.java` +- **Test**: `testRangeReads()` +- **Coverage**: + - Range reads at specific offsets (like Parquet footer reads) + - Sequential `readFully()` pattern that was failing + - Multiple small reads vs. large reads + - The exact 78-byte read at offset 1197 that was failing + +## Test Results + +### Before Fix +``` +EOFException: Reached the end of stream. Still have: 78 bytes left +- contentLength: 1275 (correct!) +- reads: position=1197 len=78 bytesRead=0 ❌ +``` + +### After Fix +``` +No EOF exceptions observed +- contentLength: 1275 (correct) +- reads: position=1197 len=78 bytesRead=78 ✅ +``` + +## Why The 78-Byte Offset Was Consistent + +The "78 bytes" wasn't random - it was **systematically the last `read()` call** that returned 0 instead of the actual bytes: +- File size: 1275 bytes +- Last read: position=1197, len=78 +- Expected: bytesRead=78 +- Actual (before fix): bytesRead=0 +- Parquet: "I need 78 more bytes but got EOF!" → EOFException + +## Commits + +1. **e95f7061a**: Fix inline content read bug + add unit test +2. **c10ae054b**: Add SeaweedInputStream constructor logging +3. **5c30bc8e7**: Add detailed getPos() tracking with stack traces + +## Next Steps + +1. **Push changes** to your branch +2. **Run CI tests** to verify fix works in GitHub Actions +3. **Monitor** for any remaining edge cases +4. **Remove debug logging** once confirmed stable (or reduce to DEBUG level) +5. **Backport** to other SeaweedFS client versions if needed + +## Key Learnings + +1. **Read the return value**: Always ensure functions return the correct value, not just perform side effects +2. **Buffer operations need tracking**: When copying data to buffers, track how much was copied +3. **Stack traces help**: Knowing WHO calls a function helps understand WHEN bugs occur +4. **Consistent offsets = systematic bug**: The 78-byte offset being consistent pointed to a logic error, not data corruption +5. **Downloaded file was perfect**: The fact that `parquet-tools` could read the downloaded file proved the bug was in the read path, not write path + +## Files Modified + +``` +other/java/client/src/main/java/seaweedfs/client/SeaweedInputStream.java +other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java +other/java/client/src/main/java/seaweedfs/client/SeaweedRead.java +other/java/client/src/test/java/seaweedfs/client/SeaweedStreamIntegrationTest.java +other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java +other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystemStore.java +other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedHadoopOutputStream.java +``` + +## References + +- Issue: Spark integration tests failing with EOF exception +- Parquet version: 1.16.0 +- Spark version: 3.5.0 +- SeaweedFS client version: 3.80.1-SNAPSHOT +