5.6 KiB

Raw Blame History

Parquet EOF Exception: Complete Debug Session Summary

Timeline

Initial Problem: EOFException: Still have: 78 bytes left when reading Parquet files via Spark
Hypothesis 1: Virtual position tracking issue
Hypothesis 2: Buffering causes offset mismatch
Final Discovery: Parquet's write sequence is fundamentally incompatible with buffered streams

What We Did

Phase 1: Comprehensive Debug Logging

Added WARN-level logging to track every write, flush, and getPos() call
Logged caller stack traces for getPos()
Tracked virtual position, flushed position, and buffer position

Key Finding: Last getPos() returns 1252, but file has 1260 bytes (8-byte gap)

Phase 2: Virtual Position Tracking

Added virtualPosition field to track total bytes written
Updated getPos() to return virtualPosition

Result: ✅ getPos() now returns correct total, but ❌ EOF exception persists

Phase 3: Flush-on-getPos()

Modified getPos() to flush buffer before returning position
Ensures returned position reflects all committed data

Result: ✅ Flushing works, ❌ EOF exception STILL persists

Root Cause: The Fundamental Problem

Parquet's Assumption

Write data → call getPos() → USE returned value immediately
Write more data
Write footer with previously obtained offsets

What Actually Happens

Time 0: Write 1252 bytes
Time 1: getPos() called → flushes → returns 1252
Time 2: Parquet STORES "offset = 1252" in memory
Time 3: Parquet writes footer metadata (8 bytes)
Time 4: Parquet writes footer containing "offset = 1252"
Time 5: close() → flushes all 1260 bytes

Result: Footer says "data at offset 1252"
        But actual file: [data: 0-1252] [footer_meta: 1252-1260]
        When reading: Parquet seeks to 1252, expects data, gets footer → EOF!

The 78-Byte Mystery

The "78 bytes" is NOT missing data. It's Parquet's calculation:

Parquet footer says column chunks are at certain offsets
Those offsets are off by 8 bytes (the footer metadata)
When reading, Parquet calculates it needs 78 more bytes based on wrong offsets
Results in: "Still have: 78 bytes left"

Why Flush-on-getPos() Doesn't Fix It

Even with flushing:

getPos() is called → flushes → returns accurate position (1252)
Parquet uses this value → records "1252" in its internal state
Parquet writes more bytes (footer metadata)
Parquet writes footer with the recorded "1252"
Problem: Those bytes written in step 3 shifted everything!

The issue: Parquet uses the getPos() RETURN VALUE later, not the position at footer-write time.

Why This Works in HDFS

HDFS likely uses one of these strategies:

Unbuffered writes for Parquet - Every byte goes directly to disk
Syncable.hflush() contract - Parquet calls hflush() at critical points
Different internal implementation - HDFS LocalFileSystem might handle this differently

Solutions (Ordered by Viability)

1. Disable Buffering for Parquet (Quick Fix)

if (path.endsWith(".parquet")) {
    this.bufferSize = 1; // Effectively unbuffered
}

Pros: Guaranteed to work
Cons: Poor write performance for Parquet

2. Implement Syncable.hflush() (Proper Fix)

public class SeaweedHadoopOutputStream implements Syncable {
    @Override
    public void hflush() throws IOException {
        writeCurrentBufferToService();
        flushWrittenBytesToService();
    }
}

Requirement: Parquet must call hflush() instead of flush()
Investigation needed: Check Parquet source if it uses Syncable

3. Special getPos() for Parquet (Targeted)

public synchronized long getPos() throws IOException {
    if (path.endsWith(".parquet") && buffer.position() > 0) {
        writeCurrentBufferToService();
    }
    return position;
}

Pros: Only affects Parquet
Cons: Still has the same fundamental issue

4. Post-Write Footer Fix (Complex)

After writing, re-open and fix Parquet footer offsets.
Not recommended: Too fragile

Commits Made

3e754792a - feat: add comprehensive debug logging
2d6b57112 - docs: comprehensive analysis and fix strategies
c1b0aa661 - feat: implement virtual position tracking
9eb71466d - feat: implement flush-on-getPos()

Debug Messages: Key Learnings

Before Any Fix

Last getPos(): flushedPosition=0 bufferPosition=1252 returning=1252
close(): buffer.position()=1260, totalBytesWritten=1260
File size: 1260 bytes ✓
EOF Exception: "Still have: 78 bytes left" ❌

After Virtual Position

getPos(): returning VIRTUAL position=1260
close(): virtualPos=1260, flushedPos=0
File size: 1260 bytes ✓
EOF Exception: "Still have: 78 bytes left" ❌ (unchanged!)

After Flush-on-getPos()

getPos() FLUSHING buffer (1252 bytes)
getPos(): returning position=1252 (all data flushed)
close(): virtualPos=1260, flushedPos=1260
File size: 1260 bytes ✓  
EOF Exception: "Still have: 78 bytes left" ❌ (STILL persists!)

Conclusion

The problem is NOT a bug in SeaweedOutputStream. It's a fundamental incompatibility between:

Parquet's assumption: getPos() returns the exact file offset where next byte will be written
Buffered streams: Data written to buffer, offsets recorded, THEN flushed

Recommended Next Steps:

Check Parquet source: Does it use Syncable.hflush()?
If yes: Implement hflush() properly
If no: Disable buffering for .parquet files

The debugging was successful in identifying the root cause, but the fix requires either:

Changing how Parquet writes (unlikely)
Changing how SeaweedFS buffers Parquet files (feasible)

5.6 KiB Raw Blame History