You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
3.3 KiB
3.3 KiB
Root Cause Confirmed: Parquet Footer Metadata Issue
The Bug (CONFIRMED)
Parquet is trying to read 78 bytes from position 1275, but the file ends at position 1275!
[DEBUG-2024] SeaweedInputStream.read() returning EOF:
path=.../employees/part-00000-....snappy.parquet
position=1275
contentLength=1275
bufRemaining=78
What This Means
The Parquet footer metadata says there's a column chunk or row group at byte offset 1275 that is 78 bytes long. But the file is only 1275 bytes total!
Evidence
During Write
getPos()returned: 0, 4, 59, 92, 139, 172, 190, 231, 262, 285, 310, 333, 346, 357, 372, 383, 1267- Last data position: 1267
- Final file size: 1275 (1267 + 8-byte footer)
During Read
- ✅ Read [383, 1267) → 884 bytes ✅
- ✅ Read [1267, 1275) → 8 bytes ✅
- ✅ Read [4, 1275) → 1271 bytes ✅
- ❌ Read [1275, 1353) → TRIED to read 78 bytes → EOF! ❌
Why The Downloaded File Works
When you download the file and use parquet-tools, it reads correctly because:
- The file IS valid and complete
- parquet-tools can interpret the footer correctly
- But Spark/Parquet at runtime interprets the footer DIFFERENTLY
Possible Causes
1. Parquet Version Mismatch ⚠️
- pom.xml declares Parquet 1.16.0
- But Spark 3.5.0 might bundle a different Parquet version
- Runtime version conflict → footer interpretation mismatch
2. Buffer Position vs. Flushed Position
getPos()returnsposition + buffer.position()- If Parquet calls
getPos()before buffer is flushed, offsets could be wrong - But our logs show getPos() values that seem correct...
3. Parquet 1.16.0 Footer Format Change
- Parquet 1.16.0 might have changed footer layout
- Writing with 1.16.0 format but reading with different logic
- The "78 bytes" might be a footer size constant that changed
The 78-Byte Constant
Interesting pattern: The missing bytes is ALWAYS 78. This suggests:
- It's not random data corruption
- It's a systematic offset calculation error
- 78 bytes might be related to:
- Footer metadata size
- Column statistics size
- Row group index size
- Magic bytes + length fields
Next Steps
Option A: Downgrade Parquet
Try Parquet 1.13.1 (what Spark 3.5.0 normally uses):
<parquet.version>1.13.1</parquet.version>
Option B: Check Runtime Parquet Version
Add logging to see what Parquet version is actually loaded:
LOG.info("Parquet version: {}", ParquetFileReader.class.getPackage().getImplementationVersion());
Option C: Force Buffer Flush Before getPos()
Override getPos() to force flush:
public synchronized long getPos() {
flush(); // Ensure all data is written
return position + buffer.position();
}
Option D: Analyze Footer Hex Dump
Download the file and examine the last 100 bytes to see footer structure:
hexdump -C test.parquet | tail -20
Test Plan
- Try downgrading to Parquet 1.13.1
- If that works, it confirms version incompatibility
- If not, analyze footer structure with hex dump
- Check if Spark's bundled Parquet overrides our dependency
Files Modified
SeaweedInputStream.java- Added EOF logging- Root cause: Parquet footer has offset 1275 for 78-byte chunk that doesn't exist