# Parquet EOFException Fix: 78-Byte Discrepancy
## Problem Statement
Spark integration tests were consistently failing with:
```
java.io.EOFException: Reached the end of stream. Still have: 78 bytes left
    at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:112)
```
The error was consistent across all Parquet writes:
- File sizes varied: 684, 693, 696, 707, 1275 bytes
- Missing bytes: **ALWAYS exactly 78 bytes**
- This suggested a systematic offset error, not random data loss
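For context on the message itself: Parquet's read path fills fixed-size buffers from the stream, and the exception reports how many requested bytes were never delivered. A minimal sketch of such a readFully-style loop (an illustration, not the actual `H2SeekableInputStream` code) shows why footer metadata that points past the end of the file always leaves the same number of bytes unread:
```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ReadFullySketch {
    // Read exactly `len` bytes or fail; simplified illustration of the
    // kind of loop that produces "Still have: N bytes left".
    static void readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
        int remaining = len;
        while (remaining > 0) {
            int n = in.read(buf, off + (len - remaining), remaining);
            if (n < 0) {
                // The footer promised more bytes than the file contains;
                // with the bad metadata the shortfall was always 78 bytes.
                throw new EOFException(
                        "Reached the end of stream. Still have: " + remaining + " bytes left");
            }
            remaining -= n;
        }
    }
}
```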
## Root Cause Analysis
### Investigation Steps
1. **Examined Parquet-Java source code** (`~/dev/parquet-java/`):
- Found that the error originates in `H2SeekableInputStream.readFully()`, line 112
- Comment indicates: *"this is probably a bug in the ParquetReader"*
- Parquet is trying to read data based on footer metadata offsets
2. **Traced Parquet writer logic**:
- In `ParquetFileWriter.java`, lines 1027-1029 and 1546:
```java
long beforeHeader = out.getPos();
if (currentChunkFirstDataPage < 0) {
  currentChunkFirstDataPage = beforeHeader;
}
```
- Parquet calls `out.getPos()` to record where column chunks start
- These positions are stored in the file's footer metadata
3. **Identified the disconnect**:
- `out` is Hadoop's `FSDataOutputStream` wrapping `SeaweedHadoopOutputStream`
- `FSDataOutputStream` uses an **internal position counter**
- It does **NOT** call `SeaweedOutputStream.getPos()` automatically (a simplified sketch of this wrapper behavior follows this list)
- Evidence: no `"[DEBUG-2024] getPos() called"` log messages appeared in tests
4. **Confirmed with file download**:
- Successfully downloaded actual Parquet file (1275 bytes)
- Parquet's footer claims data extends to byte 1353 (1275 + 78)
- The footer metadata has incorrect offsets!
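To make the disconnect in step 3 concrete, here is a simplified sketch of the byte-counting wrapper behavior described above. Hadoop's real `FSDataOutputStream` does this through an internal `PositionCache` class; details vary by Hadoop version, so treat this as an illustration:
```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustration of Hadoop-style position caching: the wrapper counts the
// bytes it forwards and reports that count, never asking the wrapped
// stream where it thinks the file ends.
class PositionCountingStream extends FilterOutputStream {
    private long position; // the wrapper's own byte counter

    PositionCountingStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        position++; // counted here, independent of the inner stream
    }

    public long getPos() {
        // This counter is what the writer records in the footer; the
        // wrapped SeaweedOutputStream.getPos() is never consulted.
        return position;
    }
}
```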
### The Mismatch
```
When writing:

┌───────────────────────────────────────────────────────┐
│ Parquet Writer                                        │
│   ↓ write(data)                                       │
│ FSDataOutputStream (Hadoop)                           │
│   - Counts bytes: position = 1353                     │
│   - getPos() returns: 1353  ← Parquet records this!   │
│   ↓ write(data)                                       │
│ SeaweedOutputStream                                   │
│   - Buffers data internally                           │
│   - getPos() returns: position + buffer.position()    │
│   - But FSDataOutputStream NEVER calls this!          │
│   ↓ flush on close()                                  │
│ SeaweedFS Server                                      │
│   - Actually stores: 1275 bytes                       │
└───────────────────────────────────────────────────────┘

Result: Footer says "read from offset 1353" but file only has 1275 bytes!
```
## The Fix
**File**: `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java`
Override `FSDataOutputStream.getPos()` to delegate to our stream:
```java
SeaweedHadoopOutputStream outputStream = (SeaweedHadoopOutputStream)
        seaweedFileSystemStore.createFile(path, overwrite, permission,
                seaweedBufferSize, replicaPlacement);

// Use a custom FSDataOutputStream that delegates getPos() to our stream
return new FSDataOutputStream(outputStream, statistics) {
    @Override
    public long getPos() {
        // Delegate to SeaweedOutputStream's position tracking
        return outputStream.getPos();
    }
};
```
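The anonymous subclass works because `outputStream` is effectively final, so the overridden method can capture it. Since the override lives entirely on the wrapper that SeaweedFS returns, no change to Parquet or Hadoop itself is required.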
### Why This Works
1. **Before**: Parquet calls `FSDataOutputStream.getPos()` → Gets Hadoop's internal counter (wrong!)
2. **After**: Parquet calls `FSDataOutputStream.getPos()` → Delegates to `SeaweedOutputStream.getPos()` → Returns `position + buffer.position()` (correct!)
3. `SeaweedOutputStream.getPos()` correctly accounts for:
- `position`: bytes already flushed to server
- `buffer.position()`: bytes in buffer not yet flushed
- Total: accurate position for metadata
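Based on that description (the actual implementation is in `SeaweedOutputStream.java`, referenced under Related Code below), the delegated method amounts to something like this sketch:
```java
import java.nio.ByteBuffer;

// Sketch of the position accounting described above, not the actual
// seaweedfs.client.SeaweedOutputStream source.
class PositionAccountingSketch {
    private long position; // bytes already flushed to the server
    private final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024); // pending writes
    // (The write/flush path, omitted here, fills `buffer` and advances
    // `position` whenever buffered bytes are sent to the server.)

    public synchronized long getPos() {
        // Flushed bytes plus bytes still buffered locally: the logical
        // end of file that the Parquet footer should record.
        return position + buffer.position();
    }
}
```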
## Testing
The fix will be validated by:
1. The existing `getPos()` logging will now show calls (previously silent)
2. Parquet files should be readable without EOFException
3. The 78-byte discrepancy should disappear
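A minimal round-trip check along these lines can exercise the failure path end to end (the scheme, host, and config key below are illustrative and assume the SeaweedFS Hadoop client is on the classpath):
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetRoundTripCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-eof-check")
                .master("local[2]")
                // Illustrative: register the SeaweedFS Hadoop filesystem.
                .config("spark.hadoop.fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem")
                .getOrCreate();

        String path = "seaweedfs://localhost:8888/spark/parquet-eof-check";

        // Write a small dataset; before the fix the footer recorded
        // offsets past the end of the stored file.
        spark.range(1000).write().mode("overwrite").parquet(path);

        // Read it back; with getPos() delegated correctly this no longer
        // throws "EOFException: ... Still have: 78 bytes left".
        Dataset<Row> df = spark.read().parquet(path);
        System.out.println("rows read: " + df.count());

        spark.stop();
    }
}
```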
## Related Code
- **Parquet Writer**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:1027,1546`
- **Parquet Reader**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:1174,1180`
- **Error Location**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/H2SeekableInputStream.java:112`
- **SeaweedFS Position Tracking**: `other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java:100-108`
## Lessons Learned
1. **Double buffering is dangerous**: When multiple layers track position independently, they can diverge
2. **Read the source**: Examining Parquet-Java and Spark source code was essential to understanding the issue
3. **Systematic errors need systematic analysis**: The consistent 78-byte offset was a clue it wasn't random data loss
4. **Framework integration matters**: Hadoop's `FSDataOutputStream` wrapper behavior must be understood and explicitly handled
## Commit
**SHA**: 9e7ed4868
**Message**: "fix: Override FSDataOutputStream.getPos() to use SeaweedOutputStream position"