
Parquet EOF Exception: Root Cause and Fix Strategy

Executive Summary

Problem: EOFException: Still have: 78 bytes left when reading Parquet files written to SeaweedFS via Spark.

Root Cause: Parquet footer metadata contains stale offsets due to writes occurring AFTER the last getPos() call.

Impact: All Parquet files written to SeaweedFS via Spark are unreadable.


Technical Details

The Write Sequence (from debug logs)

Write Phase:
- writeCalls 1-465: Parquet data (column chunks, dictionaries, etc.)
- Last getPos(): returns 1252 (flushedPosition=0 + bufferPosition=1252)
  ↓
Footer Phase:
- writeCalls 466-470: Footer metadata (8 bytes)
- NO getPos() called during this phase!
  ↓
Close Phase:
- buffer.position() = 1260 bytes
- All 1260 bytes flushed to disk
- File size set to 1260 bytes
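
This sequence can be reproduced directly against the filesystem, with no Parquet in the loop. The harness below is a minimal sketch: it assumes the SeaweedFS Hadoop client is on the classpath, mirrors the byte counts from the debug logs above, and uses a placeholder path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteSequenceRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);  // e.g. seaweedfs://localhost:8888/tmp/repro.bin (placeholder)
        FileSystem fs = path.getFileSystem(conf);

        long lastRecordedPos;
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write(new byte[1252]);       // "write phase": column chunks etc.
            lastRecordedPos = out.getPos();  // the last getPos() Parquet would see
            out.write(new byte[8]);          // "footer phase": no getPos() afterwards
        }                                    // "close phase": everything flushed

        long fileLen = fs.getFileStatus(path).getLen();
        // The 8-byte gap mirrors the footer bytes written after the last getPos().
        System.out.printf("recorded=%d actual=%d gap=%d%n",
                lastRecordedPos, fileLen, fileLen - lastRecordedPos);
    }
}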

The Mismatch

What                     Value   Notes
Last getPos() returned   1252    Parquet records this in footer
Actual bytes written     1260    What's flushed to disk
Gap                      8       Unaccounted footer bytes

Why Reads Fail

  1. Parquet footer says: "Column chunk data ends at offset 1252"
  2. Actual file structure: Column chunk data ends at offset 1260
  3. When reading, Parquet seeks to offset 1252
  4. Parquet expects to find data there, but it's 8 bytes off
  5. Result: EOFException: Still have: 78 bytes left

The "78 bytes" is Parquet's calculation of how much data it expected vs. what it got, based on incorrect offsets.


Why This Happens

Parquet's footer writing is asynchronous with respect to getPos():

// Parquet's internal logic (simplified):
1. Write column chunk  → call getPos() → record offset
2. Write more chunks   → call getPos() → record offset
3. Write footer metadata (magic bytes, etc.) → NO getPos()!
4. Close stream

The footer metadata bytes (step 3) are written AFTER Parquet has recorded all offsets.


Why Unit Tests Pass but Spark Fails

Unit tests:

  • Simple write patterns
  • Direct, synchronous writes
  • getPos() called immediately after relevant writes

Spark/Parquet:

  • Complex write patterns with buffering
  • Asynchronous footer writing
  • getPos() NOT called after final footer writes

Fix Options

Option 1: Flush on getPos() (Simple, but has performance impact)

public synchronized long getPos() {
    if (buffer.position() > 0) {
        writeCurrentBufferToService();  // Force flush so 'position' catches up
    }
    return position;  // Now reflects every byte written so far
}

Pros:

  • Ensures position is always accurate
  • Simple to implement

Cons:

  • Performance hit (many small flushes)
  • Changes buffering semantics

Option 2: Track Virtual Position Separately (Recommended)

Keep the flushed position separate from the virtual position that getPos() returns:

private long position = 0;         // Bytes already flushed to the service
private long virtualPosition = 0;  // Total bytes written, flushed or buffered

@Override
public synchronized void write(byte[] data, int off, int length) throws IOException {
    // ... existing write logic ...
    virtualPosition += length;     // Count bytes as soon as they are accepted
}

public synchronized long getPos() {
    return virtualPosition;  // Always accurate, no flush needed
}

Pros:

  • No performance impact
  • Clean separation of concerns
  • getPos() always reflects total bytes written

Cons:

  • Need to track virtualPosition across all write methods (see the toy sketch below)
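
For a concrete picture of the bookkeeping, the toy class below models the pattern end to end. It is an illustrative stand-in, not SeaweedFS code: the ByteArrayOutputStream backing store and the 8 KiB buffer are assumptions.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class VirtualPositionStream extends OutputStream {
    private final ByteArrayOutputStream service = new ByteArrayOutputStream();  // stand-in backing store
    private final byte[] buffer = new byte[8192];
    private int bufferPos = 0;
    private long position = 0;         // flushed bytes only
    private long virtualPosition = 0;  // every byte ever written

    @Override
    public synchronized void write(int b) throws IOException {
        write(new byte[] { (byte) b }, 0, 1);  // single-byte writes are counted too
    }

    @Override
    public synchronized void write(byte[] data, int off, int len) throws IOException {
        if (len > buffer.length - bufferPos) {
            flushBuffer();                   // make room for the incoming bytes
        }
        if (len > buffer.length) {
            service.write(data, off, len);   // oversized writes bypass the buffer
            position += len;
        } else {
            System.arraycopy(data, off, buffer, bufferPos, len);
            bufferPos += len;
        }
        virtualPosition += len;              // counted even while still buffered
    }

    public synchronized long getPos() {
        return virtualPosition;              // accurate without forcing a flush
    }

    private void flushBuffer() {
        service.write(buffer, 0, bufferPos);
        position += bufferPos;
        bufferPos = 0;
    }

    @Override
    public synchronized void close() throws IOException {
        flushBuffer();                       // final size matches virtualPosition
    }
}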

Option 3: Adjust the File Size During Flush (Partial Fix)

Modify flushWrittenBytesToServiceInternal() to account for buffered data:

protected void flushWrittenBytesToServiceInternal(final long offset) {
    long actualOffset = offset + buffer.position();  // Include buffered data
    entry.getAttributes().setFileSize(actualOffset);
    // ...
}

Pros:

  • Minimal code changes

Cons:

  • Doesn't solve the root cause
  • May break other use cases

Option 4: Force Flush Before Close (Workaround)

Override close() to flush before calling super:

@Override
public synchronized void close() throws IOException {
    if (buffer.position() > 0) {
        writeCurrentBufferToService();  // Ensure everything flushed
    }
    super.close();
}

Pros:

  • Simple
  • Ensures file size is correct

Cons:

  • Doesn't fix the getPos() staleness issue
  • Still has metadata timing problems

Recommended Fix: Option 2 (Track Virtual Position Separately)

This aligns with Hadoop's semantics where getPos() should return the total number of bytes written to the stream, regardless of buffering.
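
As a quick sanity check of that contract, the same write pattern against Hadoop's local filesystem shows buffered bytes counted immediately. A small sketch (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalGetPosCheck {
    public static void main(String[] args) throws Exception {
        FileSystem local = FileSystem.getLocal(new Configuration());
        Path path = new Path("/tmp/getpos-check.bin");  // placeholder
        try (FSDataOutputStream out = local.create(path, true)) {
            out.write(new byte[1252]);
            out.write(new byte[8]);            // no explicit flush
            System.out.println(out.getPos());  // expected: 1260, buffered bytes included
        }
    }
}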

Implementation Plan

  1. Add virtualPosition field to SeaweedOutputStream
  2. Update all write() methods to increment virtualPosition
  3. Change getPos() to return virtualPosition
  4. Keep position for internal flush tracking
  5. Add unit tests to verify getPos() accuracy with buffering (a test sketch follows)
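
A sketch of the step-5 test, written here against the toy VirtualPositionStream from Option 2; the real test would target SeaweedOutputStream and its actual constructor.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class VirtualPositionStreamTest {
    @Test
    public void getPosCountsBufferedBytes() throws Exception {
        VirtualPositionStream out = new VirtualPositionStream();
        out.write(new byte[1252]);         // "data" writes, mirroring the logs
        assertEquals(1252, out.getPos());  // the offset Parquet would record
        out.write(new byte[8]);            // trailing "footer" bytes, still buffered
        assertEquals(1260, out.getPos());  // no staleness: buffered bytes are counted
        out.close();
    }
}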

Next Steps

  1. Implement Option 2 (Virtual Position)
  2. Test with local Spark reproduction
  3. Verify unit tests still pass
  4. Run full Spark integration tests in CI
  5. Compare behavior with HDFS/S3 implementations

References

  • Parquet specification: https://parquet.apache.org/docs/file-format/
  • Hadoop FSDataOutputStream contract: getPos() should return total bytes written
  • Related issues: SeaweedFS Spark integration tests failing with EOF exceptions