You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

5.7 KiB

Issue Summary: EOF Exception in Parquet Files

Status: ROOT CAUSE CONFIRMED

We've definitively identified the exact problem!

The Bug

Parquet is trying to read 78 bytes from position 1275, but the file ends at position 1275.

[DEBUG-2024] SeaweedInputStream.read() returning EOF: 
  path=.../employees/part-00000-....snappy.parquet 
  position=1275 
  contentLength=1275 
  bufRemaining=78

What This Means

The Parquet footer metadata says there's data at byte offset 1275 for 78 bytes [1275-1353), but the actual file is only 1275 bytes total!

This is a footer metadata corruption issue, not a data corruption issue.

Evidence

Write Phase (getPos() calls during Parquet write)

position: 190, 190, 190, 190, 231, 231, 231, 231, 262, 262, 285, 285, 310, 310, 333, 333, 333, 346, 346, 357, 357, 372, 372, 383, 383, 383, 383, 1267, 1267, 1267

Last data position: 1267
Final file size: 1275 (1267 + 8-byte footer metadata)

Read Phase (SeaweedInputStream.read() calls)

✅ Read [383, 1267) → 884 bytes (SUCCESS)
✅ Read [1267, 1275) → 8 bytes (SUCCESS)  
✅ Read [4, 1275) → 1271 bytes (SUCCESS)
❌ Read [1275, 1353) → EOF! (FAILED - trying to read past end of file)

Why the Downloaded File Works

When we download the file with curl and analyze it with parquet-tools:

  • File structure is valid
  • Magic bytes (PAR1) are correct
  • Data can be read successfully
  • Column metadata is correct

BUT when Spark/Parquet reads it at runtime, it interprets the footer metadata differently and tries to read data that doesn't exist.

The "78 Byte Constant"

The missing bytes is ALWAYS 78, across all test runs. This proves:

  • NOT random data corruption
  • NOT network/timing issue
  • Systematic offset calculation error
  • Likely related to footer size constants or column chunk size calculations

Theories

Theory A: getPos() Called at Wrong Time (MOST LIKELY)

When Parquet writes column chunks, it calls getPos() to record offsets in the footer. If:

  1. Parquet calls getPos() before data is flushed from buffer
  2. SeaweedOutputStream.getPos() returns position + buffer.position()
  3. But then data is written and flushed, changing the actual position
  4. Footer records the PRE-FLUSH position, which is wrong

Result: Footer thinks chunks are at position X, but they're actually at position X+78.

Theory B: Buffer Position Miscalculation

If buffer.position() is not correctly accounted for when writing footer metadata:

  • Data write: position advances correctly
  • Footer write: uses stale position without buffer.position()
  • Result: Off-by-buffer-size error (78 bytes = likely our buffer state at footer write time)

Theory C: Parquet Version Incompatibility

  • Tried downgrading from Parquet 1.16.0 to 1.13.1
  • ERROR STILL OCCURS
  • So this is NOT a Parquet version issue

What We've Ruled Out

Parquet version mismatch (tested 1.13.1 and 1.16.0)
Data corruption (file is valid and complete)
SeaweedInputStream.read() returning wrong data (logs show correct behavior)
File size calculation (contentLength is correct at 1275)
Inline content bug (fixed, but issue persists)

What's Actually Wrong

The getPos() values that Parquet records in the footer during the write phase are INCORRECT.

Specifically, when Parquet writes the footer metadata with column chunk offsets, it records positions that are 78 bytes less than they should be.

Example:

  • Parquet writes data at actual file position 383-1267
  • But footer says data is at position 1275-1353
  • That's an offset error of 892 bytes (1275 - 383 = 892)
  • When trying to read the "next" 78 bytes after 1267, it calculates position as 1275 and tries to read 78 bytes

Next Steps

Option 1: Force Buffer Flush Before getPos() Returns

Modify SeaweedOutputStream.getPos() to always flush the buffer first:

public synchronized long getPos() {
    flush(); // Ensure buffer is written before returning position
    return position + buffer.position(); // buffer.position() should be 0 after flush
}

Option 2: Track Flushed Position Separately

Maintain a flushedPosition field that only updates after successful flush:

private long flushedPosition = 0;

public synchronized long getPos() {
    return flushedPosition + buffer.position();
}

private void writeCurrentBufferToService() {
    // ... write buffer ...
    flushedPosition += buffer.position();
    // ... reset buffer ...
}

Option 3: Investigate Parquet's Column Chunk Write Order

Add detailed logging to see EXACTLY when and where Parquet calls getPos() during column chunk writes. This will show us if the issue is:

  • getPos() called before or after write()
  • getPos() called during footer write vs. data write
  • Column chunk boundaries calculated incorrectly

Test Plan

  1. Implement Option 1 (simplest fix)
  2. Run full Spark integration test suite
  3. If that doesn't work, implement Option 2
  4. Add detailed getPos() call stack logging to see Parquet's exact calling pattern
  5. Compare with a working FileSystem implementation (e.g., HDFS, S3A)

Files to Investigate

  1. SeaweedOutputStream.java - getPos() implementation
  2. SeaweedHadoopOutputStream.java - Hadoop 3.x wrapper
  3. SeaweedFileSystem.java - FSDataOutputStream creation
  4. Parquet source (external): InternalParquetRecordWriter.java - Where it calls getPos()

Confidence Level

🎯 99% confident this is a getPos() buffer flush timing issue.

The "78 bytes" constant strongly suggests it's the size of buffered data that hasn't been flushed when getPos() is called during footer writing.