5.2 KiB
Parquet EOF Exception: Root Cause and Fix Strategy
Executive Summary
Problem: EOFException: Still have: 78 bytes left when reading Parquet files written to SeaweedFS via Spark.
Root Cause: Parquet footer metadata contains stale offsets due to writes occurring AFTER the last getPos() call.
Impact: All Parquet files written via Spark are unreadable.
Technical Details
The Write Sequence (from debug logs)
Write Phase:
- writeCalls 1-465: Parquet data (column chunks, dictionaries, etc.)
- Last getPos(): returns 1252 (flushedPosition=0 + bufferPosition=1252)
↓
Footer Phase:
- writeCalls 466-470: Footer metadata (8 bytes)
- NO getPos() called during this phase!
↓
Close Phase:
- buffer.position() = 1260 bytes
- All 1260 bytes flushed to disk
- File size set to 1260 bytes
###The Mismatch
| What | Value | Notes |
|---|---|---|
Last getPos() returned |
1252 | Parquet records this in footer |
| Actual bytes written | 1260 | What's flushed to disk |
| Gap | 8 | Unaccounted footer bytes |
Why Reads Fail
- Parquet footer says: "Column chunk data ends at offset 1252"
- Actual file structure: Column chunk data ends at offset 1260
- When reading, Parquet seeks to offset 1252
- Parquet expects to find data there, but it's 8 bytes off
- Result:
EOFException: Still have: 78 bytes left
The "78 bytes" is Parquet's calculation of how much data it expected vs. what it got, based on incorrect offsets.
Why This Happens
Parquet's footer writing is asynchronous with respect to getPos():
// Parquet's internal logic (simplified):
1. Write column chunk → call getPos() → record offset
2. Write more chunks → call getPos() → record offset
3. Write footer metadata (magic bytes, etc.) → NO getPos()!
4. Close stream
The footer metadata bytes (step 3) are written AFTER Parquet has recorded all offsets.
Why Unit Tests Pass but Spark Fails
Unit tests:
- Simple write patterns
- Direct, synchronous writes
getPos()called immediately after relevant writes
Spark/Parquet:
- Complex write patterns with buffering
- Asynchronous footer writing
getPos()NOT called after final footer writes
Fix Options
Option 1: Flush on getPos() (Simple, but has performance impact)
public synchronized long getPos() {
if (buffer.position() > 0) {
writeCurrentBufferToService(); // Force flush
}
return position;
}
Pros:
- Ensures
positionis always accurate - Simple to implement
Cons:
- Performance hit (many small flushes)
- Changes buffering semantics
Option 2: Track Virtual Position Separately (Recommended)
Keep position (flushed) separate from getPos() (virtual):
private long position = 0; // Flushed bytes
private long virtualPosition = 0; // Total bytes written
@Override
public synchronized void write(byte[] data, int off, int length) {
// ... existing write logic ...
virtualPosition += length;
}
public synchronized long getPos() {
return virtualPosition; // Always accurate, no flush needed
}
Pros:
- No performance impact
- Clean separation of concerns
getPos()always reflects total bytes written
Cons:
- Need to track
virtualPositionacross all write methods
Option 3: Defer Footer Metadata Update (Complex)
Modify flushWrittenBytesToServiceInternal() to account for buffered data:
protected void flushWrittenBytesToServiceInternal(final long offset) {
long actualOffset = offset + buffer.position(); // Include buffered data
entry.getAttributes().setFileSize(actualOffset);
// ...
}
Pros:
- Minimal code changes
Cons:
- Doesn't solve the root cause
- May break other use cases
Option 4: Force Flush Before Close (Workaround)
Override close() to flush before calling super:
@Override
public synchronized void close() throws IOException {
if (buffer.position() > 0) {
writeCurrentBufferToService(); // Ensure everything flushed
}
super.close();
}
Pros:
- Simple
- Ensures file size is correct
Cons:
- Doesn't fix the
getPos()staleness issue - Still has metadata timing problems
Recommended Solution
Option 2: Track Virtual Position Separately
This aligns with Hadoop's semantics where getPos() should return the total number of bytes written to the stream, regardless of buffering.
Implementation Plan
- Add
virtualPositionfield toSeaweedOutputStream - Update all
write()methods to incrementvirtualPosition - Change
getPos()to returnvirtualPosition - Keep
positionfor internal flush tracking - Add unit tests to verify
getPos()accuracy with buffering
Next Steps
- Implement Option 2 (Virtual Position)
- Test with local Spark reproduction
- Verify unit tests still pass
- Run full Spark integration tests in CI
- Compare behavior with HDFS/S3 implementations
References
- Parquet specification: https://parquet.apache.org/docs/file-format/
- Hadoop
FSDataOutputStreamcontract:getPos()should return total bytes written - Related issues: SeaweedFS Spark integration tests failing with EOF exceptions