You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 
 

3.3 KiB

Root Cause Confirmed: Parquet Footer Metadata Issue

The Bug (CONFIRMED)

Parquet is trying to read 78 bytes from position 1275, but the file ends at position 1275!

[DEBUG-2024] SeaweedInputStream.read() returning EOF: 
  path=.../employees/part-00000-....snappy.parquet 
  position=1275 
  contentLength=1275 
  bufRemaining=78

What This Means

The Parquet footer metadata says there's a column chunk or row group at byte offset 1275 that is 78 bytes long. But the file is only 1275 bytes total!

Evidence

During Write

  • getPos() returned: 0, 4, 59, 92, 139, 172, 190, 231, 262, 285, 310, 333, 346, 357, 372, 383, 1267
  • Last data position: 1267
  • Final file size: 1275 (1267 + 8-byte footer)

During Read

  • Read [383, 1267) → 884 bytes
  • Read [1267, 1275) → 8 bytes
  • Read [4, 1275) → 1271 bytes
  • Read [1275, 1353) → TRIED to read 78 bytes → EOF!

Why The Downloaded File Works

When you download the file and use parquet-tools, it reads correctly because:

  • The file IS valid and complete
  • parquet-tools can interpret the footer correctly
  • But Spark/Parquet at runtime interprets the footer DIFFERENTLY

Possible Causes

1. Parquet Version Mismatch ⚠️

  • pom.xml declares Parquet 1.16.0
  • But Spark 3.5.0 might bundle a different Parquet version
  • Runtime version conflict → footer interpretation mismatch

2. Buffer Position vs. Flushed Position

  • getPos() returns position + buffer.position()
  • If Parquet calls getPos() before buffer is flushed, offsets could be wrong
  • But our logs show getPos() values that seem correct...
  • Parquet 1.16.0 might have changed footer layout
  • Writing with 1.16.0 format but reading with different logic
  • The "78 bytes" might be a footer size constant that changed

The 78-Byte Constant

Interesting pattern: The missing bytes is ALWAYS 78. This suggests:

  • It's not random data corruption
  • It's a systematic offset calculation error
  • 78 bytes might be related to:
    • Footer metadata size
    • Column statistics size
    • Row group index size
    • Magic bytes + length fields

Next Steps

Option A: Downgrade Parquet

Try Parquet 1.13.1 (what Spark 3.5.0 normally uses):

<parquet.version>1.13.1</parquet.version>

Option B: Check Runtime Parquet Version

Add logging to see what Parquet version is actually loaded:

LOG.info("Parquet version: {}", ParquetFileReader.class.getPackage().getImplementationVersion());

Option C: Force Buffer Flush Before getPos()

Override getPos() to force flush:

public synchronized long getPos() {
    flush(); // Ensure all data is written
    return position + buffer.position();
}

Download the file and examine the last 100 bytes to see footer structure:

hexdump -C test.parquet | tail -20

Test Plan

  1. Try downgrading to Parquet 1.13.1
  2. If that works, it confirms version incompatibility
  3. If not, analyze footer structure with hex dump
  4. Check if Spark's bundled Parquet overrides our dependency

Files Modified

  • SeaweedInputStream.java - Added EOF logging
  • Root cause: Parquet footer has offset 1275 for 78-byte chunk that doesn't exist