
Root Cause Confirmed: Parquet Footer Metadata Issue

The Bug (CONFIRMED)

Parquet is trying to read 78 bytes from position 1275, but the file ends at position 1275!

[DEBUG-2024] SeaweedInputStream.read() returning EOF: 
  path=.../employees/part-00000-....snappy.parquet 
  position=1275 
  contentLength=1275 
  bufRemaining=78

What This Means

The read implies that the Parquet footer metadata points to a column chunk or row group starting at byte offset 1275 with a length of 78 bytes. But the file is only 1275 bytes in total, so that range lies entirely past EOF.

Evidence

During Write

getPos() returned: 0, 4, 59, 92, 139, 172, 190, 231, 262, 285, 310, 333, 346, 357, 372, 383, 1267
Last data position: 1267
Final file size: 1275 (1267 + 8-byte trailer: 4-byte footer length + 4-byte "PAR1" magic)

During Read

✅ Read [383, 1267) → 884 bytes ✅
✅ Read [1267, 1275) → 8 bytes ✅
✅ Read [4, 1275) → 1271 bytes ✅
❌ Read [1275, 1353) → TRIED to read 78 bytes → EOF! ❌
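
The four reads above can be checked against the file length with a few lines of arithmetic. The following is a minimal sketch only: `ReadRangeCheck` and `bytesPastEof` are hypothetical names introduced for illustration and are not part of SeaweedFS or Parquet.

```java
// Hypothetical helper (not part of SeaweedFS or Parquet): reports how much of
// a half-open read range [start, start + length) falls past the end of a file.
final class ReadRangeCheck {

    /** Returns the number of requested bytes that lie at or beyond contentLength. */
    static long bytesPastEof(long start, long length, long contentLength) {
        long end = start + length;
        // Only the part of the range at or after contentLength is unreadable.
        long firstUnreadable = Math.max(start, contentLength);
        return Math.max(0L, end - firstUnreadable);
    }
}
```

With the values from the log, `ReadRangeCheck.bytesPastEof(1275, 78, 1275)` yields 78, meaning every requested byte is past EOF, while the three successful reads yield 0.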

Why The Downloaded File Works

When you download the file and open it with parquet-tools, it reads correctly:

- The file is valid and complete
- parquet-tools interprets the footer correctly

At runtime, however, Spark/Parquet appears to interpret the footer differently.

Possible Causes

1. Parquet Version Mismatch ⚠️

pom.xml declares Parquet 1.16.0
But Spark 3.5.0 might bundle a different Parquet version
Runtime version conflict → footer interpretation mismatch

2. Buffer Position vs. Flushed Position

getPos() returns position + buffer.position()
If Parquet calls getPos() before buffer is flushed, offsets could be wrong
But our logs show getPos() values that seem correct...
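
To see why `position + buffer.position()` should give the same logical offset whether or not the buffer has been flushed, here is a toy model. It is a sketch under assumptions: `BufferedPositionSketch` is a hypothetical class, an in-memory sink stands in for remote storage, and the real SeaweedFS output stream may behave differently.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical toy model of a buffered output stream; NOT the SeaweedFS class.
final class BufferedPositionSketch {
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for storage
    private final ByteBuffer buffer = ByteBuffer.allocate(64);
    private long position = 0; // bytes already flushed to the sink

    void write(byte[] data) {
        for (byte b : data) {
            if (!buffer.hasRemaining()) {
                flush(); // make room when the buffer is full
            }
            buffer.put(b);
        }
    }

    void flush() {
        buffer.flip();
        int n = buffer.remaining();
        sink.write(buffer.array(), 0, n); // copy the buffered bytes to the sink
        position += n;
        buffer.clear();
    }

    /** Logical offset: flushed bytes plus bytes still sitting in the buffer. */
    long getPos() {
        return position + buffer.position();
    }

    long flushedBytes() {
        return sink.size();
    }
}
```

In this model `getPos()` returns the same value immediately before and after `flush()`. Only `flushedBytes()` changes. If the real stream follows the same invariant, an unflushed buffer alone would not produce wrong offsets, which matches the write-side log.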

3. Parquet 1.16.0 Footer Format Change

Parquet 1.16.0 might have changed footer layout
Writing with 1.16.0 format but reading with different logic
The "78 bytes" might be a footer size constant that changed

The 78-Byte Constant

Interesting pattern: the number of missing bytes is ALWAYS 78. This suggests:

It's not random data corruption
It's a systematic offset calculation error
78 bytes might be related to:
- Footer metadata size
- Column statistics size
- Row group index size
- Magic bytes + length fields

Next Steps

Option A: Downgrade Parquet

Try Parquet 1.13.1 (what Spark 3.5.0 normally uses):

<parquet.version>1.13.1</parquet.version>

Option B: Check Runtime Parquet Version

Add logging to see what Parquet version is actually loaded:

LOG.info("Parquet version: {}", ParquetFileReader.class.getPackage().getImplementationVersion());
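
`getImplementationVersion()` returns null when the jar manifest carries no version, so it can also help to log which jar the class was loaded from. A minimal sketch, assuming a hypothetical helper named `ClassOrigin` (not part of Parquet or Spark):

```java
import java.security.CodeSource;

// Hypothetical helper: reports the manifest version and the jar (or directory)
// a class was loaded from, tolerating missing manifest data.
final class ClassOrigin {

    static String describe(Class<?> type) {
        Package pkg = type.getPackage();
        // May be null if the manifest has no Implementation-Version entry.
        String version = (pkg == null) ? null : pkg.getImplementationVersion();

        // CodeSource is null for classes loaded by the bootstrap class loader.
        CodeSource source = type.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "unknown"
                : source.getLocation().toString();

        return type.getName() + " version=" + version + " location=" + location;
    }
}
```

Calling `ClassOrigin.describe(ParquetFileReader.class)` from the Spark job would then show whether the class comes from the application jar or from Spark's bundled jars.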

Option C: Force Buffer Flush Before getPos()

Override getPos() to force flush:

public synchronized long getPos() throws IOException {
    flush(); // Push buffered bytes to storage before reporting the position
    return position + buffer.position();
}

Option D: Analyze Footer Hex Dump

Download the file and examine the last 100 bytes to see footer structure:

tail -c 100 test.parquet | hexdump -C
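
The last 8 bytes of a Parquet file are a 4-byte little-endian footer length followed by the ASCII magic `PAR1`, so the footer start can be computed rather than read by eye. A minimal sketch, assuming a hypothetical `ParquetTail` helper that handles only unencrypted files:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes the 8-byte tail of an unencrypted Parquet file.
final class ParquetTail {
    static final int TAIL_LENGTH = 8; // 4-byte footer length + 4-byte magic

    /** Returns the footer metadata length stored in the tail. */
    static int footerLength(byte[] tail) {
        if (tail.length != TAIL_LENGTH) {
            throw new IllegalArgumentException("expected 8 bytes, got " + tail.length);
        }
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("unexpected magic: " + magic);
        }
        // The footer length is a little-endian 32-bit integer.
        return ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Returns the byte offset at which the footer metadata begins. */
    static long footerStart(long fileLength, int footerLength) {
        return fileLength - TAIL_LENGTH - footerLength;
    }
}
```

If the 884-byte read at [383, 1267) in the log is the footer (an inference from the read pattern, not something the log states), the tail should decode to a footer length of 884, and `footerStart(1275, 884)` gives 383.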

Test Plan

1. Try downgrading to Parquet 1.13.1
2. If that works, it confirms version incompatibility
3. If not, analyze footer structure with hex dump
4. Check if Spark's bundled Parquet overrides our dependency

Files Modified

SeaweedInputStream.java - Added EOF logging
Root cause: the Parquet footer references a 78-byte chunk at offset 1275, which lies past the end of the 1275-byte file
