Debug Breakthrough: Root Cause Identified
Complete Event Sequence
1. Write Pattern
- writeCalls 1-465: Writing Parquet data
- Last getPos() call: writeCalls=465, returns 1252
→ flushedPosition=0 + bufferPosition=1252 = 1252
- writeCalls 466-470: 5 more writes (8 bytes total)
→ These are footer metadata bytes
→ Parquet does NOT call getPos() after these writes
- close() called:
→ buffer.position()=1260 (1252 + 8)
→ All 1260 bytes flushed to disk
→ File size set to 1260 bytes
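The sequence above can be replayed with a small stand-in stream. This is only a sketch: the class `BufferedPositionDemo`, the 4096-byte buffer, and the in-memory "disk" are hypothetical and are not the real implementation. It collapses the 465 data writes into one 1252-byte write and the 5 footer writes into one 8-byte write, since only the byte counts matter here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a buffered output stream that reports its position. */
public class BufferedPositionDemo {
    private final ByteArrayOutputStream disk = new ByteArrayOutputStream(); // stands in for the file
    private final ByteBuffer buffer = ByteBuffer.allocate(4096);            // assumed buffer size
    private long flushedPosition = 0;

    public void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            if (!buffer.hasRemaining()) {
                flush();
            }
            int n = Math.min(buffer.remaining(), data.length - offset);
            buffer.put(data, offset, n);
            offset += n;
        }
    }

    /** Flushed bytes plus bytes still sitting in the buffer. */
    public long getPos() {
        return flushedPosition + buffer.position();
    }

    private void flush() {
        buffer.flip();
        disk.write(buffer.array(), 0, buffer.limit());
        flushedPosition += buffer.limit();
        buffer.clear();
    }

    /** Flushes everything and returns the final file size. */
    public long close() {
        flush();
        return disk.size();
    }

    /** Replays the observed pattern: returns {last getPos(), final file size}. */
    public static long[] replay() {
        BufferedPositionDemo out = new BufferedPositionDemo();
        out.write(new byte[1252]);        // column chunk data
        long lastGetPos = out.getPos();   // last position the writer asks for
        out.write(new byte[8]);           // trailing bytes, no getPos() afterwards
        long fileSize = out.close();
        return new long[] {lastGetPos, fileSize};
    }
}
```

Under these assumptions `replay()` yields a last reported position of 1252 and a file size of 1260, matching the numbers in the trace.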
2. The Problem
Parquet's write sequence:
- Write column chunk data, calling getPos() after each write → records offsets
- Last getPos() returns 1252
- Write footer metadata (8 bytes) → NO getPos() call!
- Close file → flushes all 1260 bytes
Result: Parquet footer says data ends at 1252, but file actually has 1260 bytes.
3. The Discrepancy
Last getPos(): 1252 bytes (what Parquet recorded in footer)
Actual file: 1260 bytes (what was flushed)
Missing: 8 bytes (footer metadata written without getPos())
4. Why It Fails on Read
When Parquet tries to read the file:
- Footer says column chunks end at offset 1252
- Parquet tries to read from 1252, expecting more data
- But the actual data structure is offset by 8 bytes
- Results in:
EOFException: Still have: 78 bytes left
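To check what the final 8 bytes of a written file actually contain, the sketch below inspects the tail of the file. It relies on the Parquet file layout, in which a file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`. Whether the 8 bytes seen in this trace are exactly that trailer is an assumption these notes have not yet verified, and the class name `ParquetTrailerCheck` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads the 8-byte trailer at the end of a Parquet file image. */
public class ParquetTrailerCheck {

    /**
     * Returns the footer length stored in the last 8 bytes,
     * or -1 if the file is too short or does not end with "PAR1".
     */
    public static int footerLength(byte[] file) {
        if (file.length < 8) {
            return -1;
        }
        ByteBuffer tail = ByteBuffer.wrap(file, file.length - 8, 8)
                .order(ByteOrder.LITTLE_ENDIAN);
        int length = tail.getInt();      // 4-byte little-endian footer length
        byte[] magic = new byte[4];
        tail.get(magic);                 // 4-byte magic
        boolean ok = "PAR1".equals(new String(magic, StandardCharsets.US_ASCII));
        return ok ? length : -1;
    }
}
```

Running this against the 1260-byte file would show whether the trailer is intact, which helps separate a truncated file from incorrect offsets recorded inside the footer.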
5. Key Insight: The "78 bytes"
The 78 bytes are NOT missing data; they point to a metadata mismatch:
- Parquet footer contains incorrect offsets
- These offsets are off by 8 bytes (the final footer writes)
- When reading, Parquet calculates it needs 78 more bytes based on wrong offsets
Root Cause
Parquet assumes getPos() reflects ALL bytes written, even buffered ones.
Our implementation is correct:
public long getPos() {
    return position + buffer.position(); // ✅ Includes buffered data
}
BUT: Parquet writes footer metadata AFTER the last getPos() call, so those 8 bytes
are not accounted for in the footer's offset calculations.
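One way to confirm this ordering is to wrap the output stream and record how many writes arrive after the most recent getPos() call. The sketch below is an assumption-laden illustration, not the instrumentation actually used for the trace above: `WriteTraceStream` and its method names are hypothetical, and it sits on a plain `java.io.OutputStream` rather than a Hadoop stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical tracing wrapper: counts bytes and write calls since the last getPos(). */
public class WriteTraceStream extends FilterOutputStream {
    private long bytesWritten = 0;
    private long writeCalls = 0;
    private long bytesAtLastGetPos = 0;
    private long callsAtLastGetPos = 0;

    public WriteTraceStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten++;
        writeCalls++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        bytesWritten += len;
        writeCalls++;
    }

    /** Reports the position and remembers where the caller last asked. */
    public long getPos() {
        bytesAtLastGetPos = bytesWritten;
        callsAtLastGetPos = writeCalls;
        return bytesWritten;
    }

    public long bytesSinceLastGetPos() {
        return bytesWritten - bytesAtLastGetPos;
    }

    public long writeCallsSinceLastGetPos() {
        return writeCalls - callsAtLastGetPos;
    }

    /** Replays the observed pattern: returns {bytes, write calls} after the last getPos(). */
    public static long[] replay() {
        WriteTraceStream s = new WriteTraceStream(new ByteArrayOutputStream());
        try {
            s.write(new byte[1252]);   // data writes, collapsed into one call
            s.getPos();                // last position query
            for (int i = 0; i < 4; i++) {
                s.write(0);            // four 1-byte writes (split chosen arbitrarily)
            }
            s.write(new byte[4]);      // one 4-byte write, 5 calls and 8 bytes in total
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {s.bytesSinceLastGetPos(), s.writeCallsSinceLastGetPos()};
    }
}
```

If either counter is non-zero at close(), some bytes were written that no recorded offset can cover.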
Why Unit Tests Pass but Spark Fails
Unit tests: Direct writes → immediate getPos() → correct offsets
Spark/Parquet: Complex write sequence → footer written AFTER last getPos() → stale offsets
The Fix
We need to ensure that when Parquet writes its footer, ALL bytes (including those 8 footer bytes) are accounted for in the file position. Options:
- Force flush on getPos() - ensures position is up-to-date
- Override FSDataOutputStream more deeply - intercept all write operations
- Investigate Parquet's footer writing logic - understand why it doesn't call getPos()
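The first option could look roughly like the sketch below. It is a minimal illustration under stated assumptions: `FlushingPositionStream`, the in-memory sink, and the 4096-byte buffer are hypothetical, and a real implementation would flush to the underlying file system instead.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Hypothetical stream whose getPos() flushes first, so the reported position is durable. */
public class FlushingPositionStream {
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the file
    private final ByteBuffer buffer = ByteBuffer.allocate(4096);            // assumed buffer size
    private long flushedPosition = 0;

    public void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            if (!buffer.hasRemaining()) {
                flushBuffer();
            }
            int n = Math.min(buffer.remaining(), data.length - offset);
            buffer.put(data, offset, n);
            offset += n;
        }
    }

    private void flushBuffer() {
        buffer.flip();
        sink.write(buffer.array(), 0, buffer.limit());
        flushedPosition += buffer.limit();
        buffer.clear();
    }

    /** Flushes buffered bytes before answering, so nothing is pending at the reported offset. */
    public long getPos() {
        flushBuffer();
        return flushedPosition;
    }

    public long flushedBytes() {
        return sink.size();
    }

    public long close() {
        flushBuffer();
        return sink.size();
    }

    /** Returns {getPos(), flushed bytes right after getPos(), final size}. */
    public static long[] replay() {
        FlushingPositionStream out = new FlushingPositionStream();
        out.write(new byte[1252]);
        long pos = out.getPos();
        long flushedAtPos = out.flushedBytes();
        out.write(new byte[8]);
        return new long[] {pos, flushedAtPos, out.close()};
    }
}
```

Note that getPos() returns the same number as before (1252 in this replay); what changes is that no bytes remain buffered when the position is reported. Bytes written after the last getPos() call are still flushed only on close(), so this option on its own may not be sufficient.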
Next: Examine how HDFS/S3 FileSystem implementations handle this.