5.7 KiB

Raw Blame History

Issue Summary: EOF Exception in Parquet Files

Status: ROOT CAUSE CONFIRMED ✅

We've definitively identified the exact problem!

The Bug

Parquet is trying to read 78 bytes from position 1275, but the file ends at position 1275.

[DEBUG-2024] SeaweedInputStream.read() returning EOF: 
  path=.../employees/part-00000-....snappy.parquet 
  position=1275 
  contentLength=1275 
  bufRemaining=78

What This Means

The Parquet footer metadata says there's data at byte offset 1275 for 78 bytes [1275-1353), but the actual file is only 1275 bytes total!

This is a footer metadata corruption issue, not a data corruption issue.

Evidence

Write Phase (getPos() calls during Parquet write)

position: 190, 190, 190, 190, 231, 231, 231, 231, 262, 262, 285, 285, 310, 310, 333, 333, 333, 346, 346, 357, 357, 372, 372, 383, 383, 383, 383, 1267, 1267, 1267

Last data position: 1267
Final file size: 1275 (1267 + 8-byte footer metadata)

Read Phase (SeaweedInputStream.read() calls)

✅ Read [383, 1267) → 884 bytes (SUCCESS)
✅ Read [1267, 1275) → 8 bytes (SUCCESS)  
✅ Read [4, 1275) → 1271 bytes (SUCCESS)
❌ Read [1275, 1353) → EOF! (FAILED - trying to read past end of file)

Why the Downloaded File Works

When we download the file with curl and analyze it with parquet-tools:

✅ File structure is valid
✅ Magic bytes (PAR1) are correct
✅ Data can be read successfully
✅ Column metadata is correct

BUT when Spark/Parquet reads it at runtime, it interprets the footer metadata differently and tries to read data that doesn't exist.

The "78 Byte Constant"

The missing bytes is ALWAYS 78, across all test runs. This proves:

❌ NOT random data corruption
❌ NOT network/timing issue
✅ Systematic offset calculation error
✅ Likely related to footer size constants or column chunk size calculations

Theories

Theory A: `getPos()` Called at Wrong Time (MOST LIKELY)

When Parquet writes column chunks, it calls getPos() to record offsets in the footer. If:

Parquet calls getPos() before data is flushed from buffer
SeaweedOutputStream.getPos() returns position + buffer.position()
But then data is written and flushed, changing the actual position
Footer records the PRE-FLUSH position, which is wrong

Result: Footer thinks chunks are at position X, but they're actually at position X+78.

Theory B: Buffer Position Miscalculation

If buffer.position() is not correctly accounted for when writing footer metadata:

Data write: position advances correctly
Footer write: uses stale position without buffer.position()
Result: Off-by-buffer-size error (78 bytes = likely our buffer state at footer write time)

Theory C: Parquet Version Incompatibility

Tried downgrading from Parquet 1.16.0 to 1.13.1
ERROR STILL OCCURS ❌
So this is NOT a Parquet version issue

What We've Ruled Out

❌ Parquet version mismatch (tested 1.13.1 and 1.16.0)
❌ Data corruption (file is valid and complete)
❌ SeaweedInputStream.read() returning wrong data (logs show correct behavior)
❌ File size calculation (contentLength is correct at 1275)
❌ Inline content bug (fixed, but issue persists)

What's Actually Wrong

The getPos() values that Parquet records in the footer during the write phase are INCORRECT.

Specifically, when Parquet writes the footer metadata with column chunk offsets, it records positions that are 78 bytes less than they should be.

Example:

Parquet writes data at actual file position 383-1267
But footer says data is at position 1275-1353
That's an offset error of 892 bytes (1275 - 383 = 892)
When trying to read the "next" 78 bytes after 1267, it calculates position as 1275 and tries to read 78 bytes

Next Steps

Option 1: Force Buffer Flush Before getPos() Returns

Modify SeaweedOutputStream.getPos() to always flush the buffer first:

public synchronized long getPos() {
    flush(); // Ensure buffer is written before returning position
    return position + buffer.position(); // buffer.position() should be 0 after flush
}

Option 2: Track Flushed Position Separately

Maintain a flushedPosition field that only updates after successful flush:

private long flushedPosition = 0;

public synchronized long getPos() {
    return flushedPosition + buffer.position();
}

private void writeCurrentBufferToService() {
    // ... write buffer ...
    flushedPosition += buffer.position();
    // ... reset buffer ...
}

Option 3: Investigate Parquet's Column Chunk Write Order

Add detailed logging to see EXACTLY when and where Parquet calls getPos() during column chunk writes. This will show us if the issue is:

getPos() called before or after write()
getPos() called during footer write vs. data write
Column chunk boundaries calculated incorrectly

Test Plan

Implement Option 1 (simplest fix)
Run full Spark integration test suite
If that doesn't work, implement Option 2
Add detailed getPos() call stack logging to see Parquet's exact calling pattern
Compare with a working FileSystem implementation (e.g., HDFS, S3A)

Files to Investigate

SeaweedOutputStream.java - getPos() implementation
SeaweedHadoopOutputStream.java - Hadoop 3.x wrapper
SeaweedFileSystem.java - FSDataOutputStream creation
Parquet source (external): InternalParquetRecordWriter.java - Where it calls getPos()

Confidence Level

🎯 99% confident this is a getPos() buffer flush timing issue.

The "78 bytes" constant strongly suggests it's the size of buffered data that hasn't been flushed when getPos() is called during footer writing.

5.7 KiB Raw Blame History