docs: comprehensive analysis of Parquet EOF root cause and fix strategies

Documented complete technical analysis including:

ROOT CAUSE:
- Parquet writes footer metadata AFTER last getPos() call
- 8 bytes written without getPos() being called
- Footer records stale offsets (1252 instead of 1260)
- Results in metadata mismatch → EOF exception on read

FIX OPTIONS (4 approaches analyzed):
1. Flush on getPos() - simple but slow
2. Track virtual position - RECOMMENDED
3. Defer footer metadata - complex
4. Force flush before close - workaround

RECOMMENDED: Option 2 (Virtual Position)
- Add virtualPosition field
- getPos() returns virtualPosition (not position)
- Aligns with Hadoop FSDataOutputStream semantics
- No performance impact

Ready to implement the fix.
pull/7526/head
chrislu, 1 week ago · commit 2d6b571120

test/java/spark/PARQUET_ROOT_CAUSE_AND_FIX.md (new file, 204 lines)

@@ -0,0 +1,204 @@
# Parquet EOF Exception: Root Cause and Fix Strategy
## Executive Summary
**Problem**: `EOFException: Still have: 78 bytes left` when reading Parquet files written to SeaweedFS via Spark.
**Root Cause**: Parquet footer metadata contains stale offsets due to writes occurring AFTER the last `getPos()` call.
**Impact**: All Parquet files written to SeaweedFS via Spark are unreadable.
---
## Technical Details
### The Write Sequence (from debug logs)
```
Write Phase:
- writeCalls 1-465: Parquet data (column chunks, dictionaries, etc.)
- Last getPos(): returns 1252 (flushedPosition=0 + bufferPosition=1252)
Footer Phase:
- writeCalls 466-470: Footer metadata (8 bytes)
- NO getPos() called during this phase!
Close Phase:
- buffer.position() = 1260 bytes
- All 1260 bytes flushed to disk
- File size set to 1260 bytes
```
### The Mismatch
| What | Value | Notes |
|--------------------------|-------|-------|
| Last `getPos()` returned | 1252 | Parquet records this in footer |
| Actual bytes written | 1260 | What's flushed to disk |
| **Gap** | **8** | **Unaccounted footer bytes** |
### Why Reads Fail
1. Parquet footer says: "Column chunk data ends at offset 1252"
2. Actual file structure: Column chunk data ends at offset 1260
3. When reading, Parquet seeks to offset 1252
4. Parquet expects to find data there, but it's 8 bytes off
5. Result: `EOFException: Still have: 78 bytes left`
> The "78 bytes" is Parquet's calculation of how much data it expected vs. what it got, based on incorrect offsets.
---
## Why This Happens
Parquet's footer writing is **decoupled** from `getPos()`:
```java
// Parquet's internal logic (simplified):
// 1. Write column chunk → call getPos() → record offset
// 2. Write more chunks  → call getPos() → record offset
// 3. Write footer metadata (magic bytes, etc.) → NO getPos()!
// 4. Close stream
```
The footer metadata bytes (step 3) are written AFTER Parquet has recorded all offsets.
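A self-contained toy (not SeaweedFS code) that reproduces the timing: the last offset is captured before step 3, so it lags the final file size by exactly the footer bytes:
```java
import java.io.ByteArrayOutputStream;

public class StaleFooterWriteDemo {
    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        out.write(new byte[1252], 0, 1252);  // steps 1-2: column chunk data
        long recordedOffset = out.size();    // getPos() analog; Parquet records 1252

        out.write(new byte[8], 0, 8);        // step 3: footer metadata, no getPos() after

        System.out.println("last recorded offset: " + recordedOffset); // 1252
        System.out.println("final file size:      " + out.size());     // 1260
    }
}
```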
---
## Why Unit Tests Pass but Spark Fails
**Unit tests**:
- Simple write patterns
- Direct, synchronous writes
- `getPos()` called immediately after relevant writes
**Spark/Parquet**:
- Complex write patterns with buffering
- Footer writing decoupled from `getPos()`
- `getPos()` NOT called after final footer writes
---
## Fix Options
### Option 1: Flush on `getPos()` (simple, but has a performance impact)
```java
public synchronized long getPos() {
    if (buffer.position() > 0) {
        writeCurrentBufferToService(); // Force flush
    }
    return position;
}
```
**Pros**:
- Ensures `position` is always accurate
- Simple to implement
**Cons**:
- Performance hit (many small flushes)
- Changes buffering semantics
### Option 2: Track Virtual Position Separately (Recommended)
Keep `position` (flushed) separate from `getPos()` (virtual):
```java
private long position = 0;        // Flushed bytes
private long virtualPosition = 0; // Total bytes written

@Override
public synchronized void write(byte[] data, int off, int length) {
    // ... existing write logic ...
    virtualPosition += length;
}

public synchronized long getPos() {
    return virtualPosition; // Always accurate, no flush needed
}
```
**Pros**:
- No performance impact
- Clean separation of concerns
- `getPos()` always reflects total bytes written
**Cons**:
- Need to track `virtualPosition` across all `write()` overloads (see the sketch below)
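In particular, the single-byte overload is easy to miss. A minimal sketch, under the same assumptions as the snippet above:
```java
@Override
public synchronized void write(int b) throws IOException {
    // ... existing single-byte write logic ...
    virtualPosition += 1; // every overload must advance the virtual position
}
```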
### Option 3: Defer Footer Metadata Update (Complex)
Modify `flushWrittenBytesToServiceInternal()` to account for buffered data:
```java
protected void flushWrittenBytesToServiceInternal(final long offset) {
    long actualOffset = offset + buffer.position(); // Include buffered data
    entry.getAttributes().setFileSize(actualOffset);
    // ...
}
```
**Pros**:
- Minimal code changes
**Cons**:
- Doesn't solve the root cause
- May break other use cases
### Option 4: Force Flush Before Close (Workaround)
Override `close()` to flush before calling super:
```java
@Override
public synchronized void close() throws IOException {
    if (buffer.position() > 0) {
        writeCurrentBufferToService(); // Ensure everything flushed
    }
    super.close();
}
```
**Pros**:
- Simple
- Ensures file size is correct
**Cons**:
- Doesn't fix the `getPos()` staleness issue
- Still has metadata timing problems
---
## Recommended Solution
**Option 2: Track Virtual Position Separately**
This aligns with Hadoop's semantics where `getPos()` should return the total number of bytes written to the stream, regardless of buffering.
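For comparison, Hadoop's `FSDataOutputStream` keeps `getPos()` accurate the same way: an internal wrapper counts bytes on every write rather than asking the underlying stream. A simplified sketch of that pattern (not the actual Hadoop source):
```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Byte-counting wrapper in the spirit of FSDataOutputStream's
// internal position cache (simplified, not actual Hadoop code).
class PositionCachingStream extends FilterOutputStream {
    private long position = 0;

    PositionCachingStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        position += len; // counted whether or not the bytes are flushed yet
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        position++;
    }

    public long getPos() {
        return position;
    }
}
```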
### Implementation Plan
1. Add `virtualPosition` field to `SeaweedOutputStream`
2. Update all `write()` methods to increment `virtualPosition`
3. Change `getPos()` to return `virtualPosition`
4. Keep `position` for internal flush tracking
5. Add unit tests to verify `getPos()` accuracy with buffering (a sketch follows below)
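A sketch of such a test (JUnit; `newTestStream()` is a hypothetical helper standing in for however the suite constructs a `SeaweedOutputStream`):
```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class GetPosAccuracyTest {

    @Test
    public void getPosIncludesBufferedBytes() throws Exception {
        SeaweedOutputStream out = newTestStream(); // hypothetical test helper

        out.write(new byte[1252], 0, 1252);
        assertEquals(1252, out.getPos()); // accurate while data is still buffered

        out.write(new byte[8], 0, 8);     // stand-in for footer bytes, no flush
        assertEquals(1260, out.getPos()); // must include the trailing 8 bytes

        out.close();
    }
}
```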
---
## Next Steps
1. Implement Option 2 (Virtual Position)
2. Test with local Spark reproduction
3. Verify unit tests still pass
4. Run full Spark integration tests in CI
5. Compare behavior with HDFS/S3 implementations
---
## References
- Parquet specification: https://parquet.apache.org/docs/file-format/
- Hadoop `FSDataOutputStream` contract: `getPos()` should return total bytes written
- Related issues: SeaweedFS Spark integration tests failing with EOF exceptions