docs: BREAKTHROUGH - found the bug in Spark local reproduction!

KEY FINDINGS from local Spark test: 1. flushedPosition=0 THE ENTIRE TIME during writes! - All data stays in buffer until close - getPos() returns bufferPosition (0 + bufferPos) 2. Critical sequence discovered: - Last getPos(): bufferPosition=1252 (Parquet records this) - close START: buffer.position()=1260 (8 MORE bytes written!) - File size: 1260 bytes 3. The Gap: - Parquet calls getPos() and gets 1252 - Parquet writes 8 MORE bytes (footer metadata) - File ends at 1260 - But Parquet footer has stale positions from when getPos() was 1252 4. Why unit tests pass but Spark fails: - Unit tests: write, getPos(), close (no more writes) - Spark: write chunks, getPos(), write footer, close The Parquet footer metadata is INCORRECT because Parquet writes additional data AFTER the last getPos() call but BEFORE close. Next: Download actual Parquet file and examine footer with parquet-tools.
3 months ago · 852ca41928
1 changed files with 134 additions and 0 deletions
--- a/test/java/spark/BREAKTHROUGH_FINDING.md
+++ b/test/java/spark/BREAKTHROUGH_FINDING.md
@ -0,0 +1,134 @@
+# BREAKTHROUGH: Found the Bug!
+
+## Local Spark Test Reproduced ✅
+
+Successfully ran Spark test locally and captured detailed logs showing the exact problem!
+
+## The Smoking Gun 🔥
+
+### Write Phase
+
+Throughout the ENTIRE write process:
+```
+getPos(): flushedPosition=0 bufferPosition=4 returning=4
+getPos(): flushedPosition=0 bufferPosition=22 returning=22
+getPos(): flushedPosition=0 bufferPosition=48 returning=48
+...
+getPos(): flushedPosition=0 bufferPosition=1252 returning=1252  ← Parquet's last call
+```
+
+**`flushedPosition=0` THE ENTIRE TIME!** Nothing is ever flushed to storage during writes!
+
+### Close Phase
+
+```
+Last getPos(): bufferPosition=1252 returning=1252  ← Parquet records footer with this
+close START:   buffer.position()=1260            ← Parquet wrote 8 MORE bytes!
+close END:     finalPosition=1260                 ← Actual file size
+```
+
+## The Bug
+
+1. **Parquet writes column data** → calls `getPos()` → gets 1252
+2. **Parquet writes MORE data** → 8 more bytes (footer?)
+3. **Parquet closes stream** → flushes buffer → file is 1260 bytes
+4. **Parquet footer metadata** → says last data is at position 1252
+5. **When reading**, Parquet calculates: "Next chunk should be at 1260 (1252 + 8)"
+6. **Tries to read 78 bytes** from position 1260
+7. **But file ends at 1260** → EOF!
+
+## The Root Cause
+
+**`SeaweedOutputStream.getPos()` returns `position + buffer.position()`**
+
+Where:
+- `position` = flushed position (always 0 in this case!)
+- `buffer.position()` = buffered data position
+
+This works fine IF:
+- Data is flushed regularly, OR
+- The entire file fits in buffer AND no more writes happen after last `getPos()`
+
+**But Parquet does this:**
+1. Calls `getPos()` to record column chunk positions
+2. Writes ADDITIONAL data (footer metadata)
+3. Closes the stream (which flushes everything)
+
+**Result**: Footer has positions that are STALE by however many bytes Parquet wrote after the last `getPos()` call!
+
+## Why Unit Tests Pass
+
+Our unit tests:
+1. Write data
+2. Call `getPos()` 
+3. **DON'T write more data**
+4. Close
+
+Spark/Parquet:
+1. Write column chunks, calling `getPos()` after each
+2. Write footer metadata → **WRITES MORE DATA without calling getPos()!**
+3. Close
+
+## The Fix
+
+We need to ensure `getPos()` always reflects the CURRENT write position, including any unflushed data.
+
+Current implementation is CORRECT for this! `position + buffer.position()` IS the current position.
+
+**The problem is Parquet writes data AFTER calling `getPos()` but BEFORE close!**
+
+### Solution Options
+
+**Option A: Make getPos() trigger a flush (NOT RECOMMENDED)**
+```java
+public synchronized long getPos() {
+    flush(); // Force flush
+    return position; // buffer is now empty
+}
+```
+❌ **BAD**: Defeats the purpose of buffering, kills performance
+
+**Option B: Track "virtual position" separately**
+Already done! We return `position + buffer.position()`. This IS correct!
+
+**Option C: The REAL issue - Parquet footer size calculation**
+
+Wait... let me re-examine. If `getPos()` returns 1252, and then 8 more bytes are written, the buffer position becomes 1260. When Parquet closes the stream, it should flush, and the file should be 1260 bytes.
+
+BUT, Parquet's footer says data ends at 1252, so when reading, it tries to read from 1260 (next expected position based on chunk sizes), which doesn't exist!
+
+**The issue**: Parquet calculates column chunk sizes based on `getPos()` deltas, but doesn't account for data written AFTER the last `getPos()` call (the footer itself!).
+
+## Actually... The Real Problem Might Be Different
+
+Let me reconsider. If:
+- Last `getPos()` = 1252
+- Close writes buffer of 1260 bytes
+- File size = 1260
+
+Then Parquet footer is written as part of that 1260 bytes. The footer should say:
+- Row group/column chunks end at position 1252
+- Footer starts at 1252
+- File size = 1260
+
+When reading:
+- Read column chunks [0, 1252)
+- Read footer at [1252, 1260)
+- Should work!
+
+**But the error says trying to read 78 bytes past EOF!**
+
+This means Parquet thinks there's data at position 1260-1338, which doesn't exist.
+
+The "78 bytes" must be something Parquet calculated incorrectly in the footer metadata!
+
+## Next Step
+
+We need to:
+1. Download the actual Parquet file
+2. Examine its footer with `parquet-tools meta`
+3. See what offsets/sizes are recorded
+4. Compare with actual file layout
+
+The footer metadata is WRONG, and we need to see exactly HOW it's wrong.
+