# Debug Breakthrough: Root Cause Identified ## Complete Event Sequence ### 1. Write Pattern ``` - writeCalls 1-465: Writing Parquet data - Last getPos() call: writeCalls=465, returns 1252 → flushedPosition=0 + bufferPosition=1252 = 1252 - writeCalls 466-470: 5 more writes (8 bytes total) → These are footer metadata bytes → Parquet does NOT call getPos() after these writes - close() called: → buffer.position()=1260 (1252 + 8) → All 1260 bytes flushed to disk → File size set to 1260 bytes ``` ### 2. The Problem **Parquet's write sequence:** 1. Write column chunk data, calling `getPos()` after each write → records offsets 2. **Last `getPos()` returns 1252** 3. Write footer metadata (8 bytes) → **NO getPos() call!** 4. Close file → flushes all 1260 bytes **Result**: Parquet footer says data ends at **1252**, but file actually has **1260** bytes. ### 3. The Discrepancy ``` Last getPos(): 1252 bytes (what Parquet recorded in footer) Actual file: 1260 bytes (what was flushed) Missing: 8 bytes (footer metadata written without getPos()) ``` ### 4. Why It Fails on Read When Parquet tries to read the file: - Footer says column chunks end at offset 1252 - Parquet tries to read from 1252, expecting more data - But the actual data structure is offset by 8 bytes - Results in: `EOFException: Still have: 78 bytes left` ### 5. Key Insight: The "78 bytes" The **78 bytes** is NOT missing data — it's a **metadata mismatch**: - Parquet footer contains incorrect offsets - These offsets are off by 8 bytes (the final footer writes) - When reading, Parquet calculates it needs 78 more bytes based on wrong offsets ## Root Cause **Parquet assumes `getPos()` reflects ALL bytes written, even buffered ones.** Our implementation is correct: ```java public long getPos() { return position + buffer.position(); // ✅ Includes buffered data } ``` BUT: Parquet writes footer metadata AFTER the last `getPos()` call, so those 8 bytes are not accounted for in the footer's offset calculations. ## Why Unit Tests Pass but Spark Fails **Unit tests**: Direct writes → immediate getPos() → correct offsets **Spark/Parquet**: Complex write sequence → footer written AFTER last getPos() → stale offsets ## The Fix We need to ensure that when Parquet writes its footer, ALL bytes (including those 8 footer bytes) are accounted for in the file position. Options: 1. **Force flush on getPos()** - ensures position is up-to-date 2. **Override FSDataOutputStream more deeply** - intercept all write operations 3. **Investigate Parquet's footer writing logic** - understand why it doesn't call getPos() Next: Examine how HDFS/S3 FileSystem implementations handle this.