# Root Cause Confirmed: Parquet Footer Metadata Issue ## The Bug (CONFIRMED) Parquet is trying to **read 78 bytes from position 1275**, but the file ends at position 1275! ``` [DEBUG-2024] SeaweedInputStream.read() returning EOF: path=.../employees/part-00000-....snappy.parquet position=1275 contentLength=1275 bufRemaining=78 ``` ## What This Means The Parquet footer metadata says there's a column chunk or row group at byte offset **1275** that is **78 bytes long**. But the file is only 1275 bytes total! ## Evidence ### During Write - `getPos()` returned: 0, 4, 59, 92, 139, 172, 190, 231, 262, 285, 310, 333, 346, 357, 372, 383, 1267 - Last data position: **1267** - Final file size: **1275** (1267 + 8-byte footer) ### During Read - ✅ Read [383, 1267) → 884 bytes ✅ - ✅ Read [1267, 1275) → 8 bytes ✅ - ✅ Read [4, 1275) → 1271 bytes ✅ - ❌ **Read [1275, 1353) → TRIED to read 78 bytes → EOF!** ❌ ## Why The Downloaded File Works When you download the file and use `parquet-tools`, it reads correctly because: - The file IS valid and complete - parquet-tools can interpret the footer correctly - **But Spark/Parquet at runtime interprets the footer DIFFERENTLY** ## Possible Causes ### 1. Parquet Version Mismatch ⚠️ - pom.xml declares Parquet 1.16.0 - But Spark 3.5.0 might bundle a different Parquet version - Runtime version conflict → footer interpretation mismatch ### 2. Buffer Position vs. Flushed Position - `getPos()` returns `position + buffer.position()` - If Parquet calls `getPos()` before buffer is flushed, offsets could be wrong - But our logs show getPos() values that seem correct... ### 3. Parquet 1.16.0 Footer Format Change - Parquet 1.16.0 might have changed footer layout - Writing with 1.16.0 format but reading with different logic - The "78 bytes" might be a footer size constant that changed ## The 78-Byte Constant **Interesting pattern**: The missing bytes is ALWAYS 78. This suggests: - It's not random data corruption - It's a systematic offset calculation error - 78 bytes might be related to: - Footer metadata size - Column statistics size - Row group index size - Magic bytes + length fields ## Next Steps ### Option A: Downgrade Parquet Try Parquet 1.13.1 (what Spark 3.5.0 normally uses): ```xml 1.13.1 ``` ### Option B: Check Runtime Parquet Version Add logging to see what Parquet version is actually loaded: ```java LOG.info("Parquet version: {}", ParquetFileReader.class.getPackage().getImplementationVersion()); ``` ### Option C: Force Buffer Flush Before getPos() Override `getPos()` to force flush: ```java public synchronized long getPos() { flush(); // Ensure all data is written return position + buffer.position(); } ``` ### Option D: Analyze Footer Hex Dump Download the file and examine the last 100 bytes to see footer structure: ```bash hexdump -C test.parquet | tail -20 ``` ## Test Plan 1. Try downgrading to Parquet 1.13.1 2. If that works, it confirms version incompatibility 3. If not, analyze footer structure with hex dump 4. Check if Spark's bundled Parquet overrides our dependency ## Files Modified - `SeaweedInputStream.java` - Added EOF logging - Root cause: Parquet footer has offset 1275 for 78-byte chunk that doesn't exist