
Parquet EOF Exception: Root Cause and Fix Strategy

Executive Summary

Problem: EOFException: Still have: 78 bytes left when reading Parquet files written to SeaweedFS via Spark.

Root Cause: Parquet footer metadata contains stale offsets due to writes occurring AFTER the last getPos() call.

Impact: All Parquet files written to SeaweedFS via Spark are unreadable.


Technical Details

The Write Sequence (from debug logs)

Write Phase:
- writeCalls 1-465: Parquet data (column chunks, dictionaries, etc.)
- Last getPos(): returns 1252 (flushedPosition=0 + bufferPosition=1252)
  ↓
Footer Phase:
- writeCalls 466-470: Footer metadata (8 bytes)
- NO getPos() called during this phase!
  ↓
Close Phase:
- buffer.position() = 1260 bytes
- All 1260 bytes flushed to disk
- File size set to 1260 bytes
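
This sequence can be reproduced directly against the filesystem, with no Parquet in the loop. The harness below is a minimal sketch: it assumes the SeaweedFS Hadoop client is on the classpath, mirrors the byte counts from the debug logs above, and uses a placeholder path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteSequenceRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);  // e.g. seaweedfs://localhost:8888/tmp/repro.bin (placeholder)
        FileSystem fs = path.getFileSystem(conf);

        long lastRecordedPos;
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write(new byte[1252]);       // "write phase": column chunks etc.
            lastRecordedPos = out.getPos();  // the last getPos() Parquet would see
            out.write(new byte[8]);          // "footer phase": no getPos() afterwards
        }                                    // "close phase": everything flushed

        long fileLen = fs.getFileStatus(path).getLen();
        // The 8-byte gap mirrors the footer bytes written after the last getPos().
        System.out.printf("recorded=%d actual=%d gap=%d%n",
                lastRecordedPos, fileLen, fileLen - lastRecordedPos);
    }
}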

The Mismatch

What                     Value   Notes
Last getPos() returned   1252    Parquet records this in footer
Actual bytes written     1260    What's flushed to disk
Gap                      8       Unaccounted footer bytes

Why Reads Fail

  1. Parquet footer says: "Column chunk data ends at offset 1252"
  2. Actual file structure: Column chunk data ends at offset 1260
  3. When reading, Parquet seeks to offset 1252
  4. Parquet expects to find data there, but it's 8 bytes off
  5. Result: EOFException: Still have: 78 bytes left

The "78 bytes" is Parquet's calculation of how much data it expected vs. what it got, based on incorrect offsets.


Why This Happens

Parquet's footer writing is asynchronous with respect to getPos():

// Parquet's internal logic (simplified):
1. Write column chunk  → call getPos() → record offset
2. Write more chunks   → call getPos() → record offset
3. Write footer metadata (magic bytes, etc.) → NO getPos()!
4. Close stream

The footer metadata bytes (step 3) are written AFTER Parquet has recorded all offsets.


Why Unit Tests Pass but Spark Fails

Unit tests:

  • Simple write patterns
  • Direct, synchronous writes
  • getPos() called immediately after relevant writes

Spark/Parquet:

  • Complex write patterns with buffering
  • Asynchronous footer writing
  • getPos() NOT called after final footer writes

Fix Options

Option 1: Flush on getPos() (Simple, but has performance impact)

public synchronized long getPos() {
    if (buffer.position() > 0) {
        writeCurrentBufferToService();  // Force flush so 'position' catches up
    }
    return position;  // Now reflects every byte written so far
}

Pros:

  • Ensures position is always accurate
  • Simple to implement

Cons:

  • Performance hit (many small flushes)
  • Changes buffering semantics

Option 2: Track Virtual Position Separately (Recommended)

Keep the flushed position separate from the virtual position that getPos() returns:

private long position = 0;         // Bytes already flushed to the service
private long virtualPosition = 0;  // Total bytes written, flushed or buffered

@Override
public synchronized void write(byte[] data, int off, int length) throws IOException {
    // ... existing write logic ...
    virtualPosition += length;     // Count bytes as soon as they are accepted
}

public synchronized long getPos() {
    return virtualPosition;  // Always accurate, no flush needed
}

Pros:

  • No performance impact
  • Clean separation of concerns
  • getPos() always reflects total bytes written

Cons:

  • Need to track virtualPosition across all write methods (see the toy sketch below)
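
For a concrete picture of the bookkeeping, the toy class below models the pattern end to end. It is an illustrative stand-in, not SeaweedFS code: the ByteArrayOutputStream backing store and the 8 KiB buffer are assumptions.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class VirtualPositionStream extends OutputStream {
    private final ByteArrayOutputStream service = new ByteArrayOutputStream();  // stand-in backing store
    private final byte[] buffer = new byte[8192];
    private int bufferPos = 0;
    private long position = 0;         // flushed bytes only
    private long virtualPosition = 0;  // every byte ever written

    @Override
    public synchronized void write(int b) throws IOException {
        write(new byte[] { (byte) b }, 0, 1);  // single-byte writes are counted too
    }

    @Override
    public synchronized void write(byte[] data, int off, int len) throws IOException {
        if (len > buffer.length - bufferPos) {
            flushBuffer();                   // make room for the incoming bytes
        }
        if (len > buffer.length) {
            service.write(data, off, len);   // oversized writes bypass the buffer
            position += len;
        } else {
            System.arraycopy(data, off, buffer, bufferPos, len);
            bufferPos += len;
        }
        virtualPosition += len;              // counted even while still buffered
    }

    public synchronized long getPos() {
        return virtualPosition;              // accurate without forcing a flush
    }

    private void flushBuffer() {
        service.write(buffer, 0, bufferPos);
        position += bufferPos;
        bufferPos = 0;
    }

    @Override
    public synchronized void close() throws IOException {
        flushBuffer();                       // final size matches virtualPosition
    }
}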

Option 3: Adjust the File Size During Flush (Partial Fix)

Modify flushWrittenBytesToServiceInternal() to account for buffered data:

protected void flushWrittenBytesToServiceInternal(final long offset) {
    long actualOffset = offset + buffer.position();  // Include buffered data
    entry.getAttributes().setFileSize(actualOffset);
    // ...
}

Pros:

  • Minimal code changes

Cons:

  • Doesn't solve the root cause
  • May break other use cases

Option 4: Force Flush Before Close (Workaround)

Override close() to flush before calling super:

@Override
public synchronized void close() throws IOException {
    if (buffer.position() > 0) {
        writeCurrentBufferToService();  // Ensure everything flushed
    }
    super.close();
}

Pros:

  • Simple
  • Ensures file size is correct

Cons:

  • Doesn't fix the getPos() staleness issue
  • Still has metadata timing problems

Recommended Fix: Option 2 (Track Virtual Position Separately)

This aligns with Hadoop's semantics where getPos() should return the total number of bytes written to the stream, regardless of buffering.
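
As a quick sanity check of that contract, the same write pattern against Hadoop's local filesystem shows buffered bytes counted immediately. A small sketch (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalGetPosCheck {
    public static void main(String[] args) throws Exception {
        FileSystem local = FileSystem.getLocal(new Configuration());
        Path path = new Path("/tmp/getpos-check.bin");  // placeholder
        try (FSDataOutputStream out = local.create(path, true)) {
            out.write(new byte[1252]);
            out.write(new byte[8]);            // no explicit flush
            System.out.println(out.getPos());  // expected: 1260, buffered bytes included
        }
    }
}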

Implementation Plan

  1. Add virtualPosition field to SeaweedOutputStream
  2. Update all write() methods to increment virtualPosition
  3. Change getPos() to return virtualPosition
  4. Keep position for internal flush tracking
  5. Add unit tests to verify getPos() accuracy with buffering (a test sketch follows)
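
A sketch of the step-5 test, written here against the toy VirtualPositionStream from Option 2; the real test would target SeaweedOutputStream and its actual constructor.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class VirtualPositionStreamTest {
    @Test
    public void getPosCountsBufferedBytes() throws Exception {
        VirtualPositionStream out = new VirtualPositionStream();
        out.write(new byte[1252]);         // "data" writes, mirroring the logs
        assertEquals(1252, out.getPos());  // the offset Parquet would record
        out.write(new byte[8]);            // trailing "footer" bytes, still buffered
        assertEquals(1260, out.getPos());  // no staleness: buffered bytes are counted
        out.close();
    }
}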

Next Steps

  1. Implement Option 2 (Virtual Position)
  2. Test with local Spark reproduction
  3. Verify unit tests still pass
  4. Run full Spark integration tests in CI
  5. Compare behavior with HDFS/S3 implementations

References

  • Parquet specification: https://parquet.apache.org/docs/file-format/
  • Hadoop FSDataOutputStream contract: getPos() should return total bytes written
  • Related issues: SeaweedFS Spark integration tests failing with EOF exceptions