# Parquet 78-Byte EOFException - Final Analysis

## Problem Status: **UNRESOLVED after Parquet 1.16.0 upgrade**

### Symptoms

- `EOFException: Reached the end of stream. Still have: 78 bytes left`
- Occurs on **every** Parquet read operation
- **78 bytes is ALWAYS the exact discrepancy** across all files and runs

### What We've Confirmed ✅

1. **SeaweedFS works perfectly**
   - All bytes passed to `write()` are stored correctly
   - `totalBytesWritten == position == chunks == attr` (perfect consistency)
   - File integrity verified at every step
2. **The Parquet footer IS being written**
   - Logs show the complete write sequence, including the footer
   - Small writes at the end (1, 1, 4 bytes) match the Parquet trailer structure
   - All bytes flushed successfully before close
3. **The Parquet 1.16.0 upgrade changes behavior but not the error**
   - **Before (1.13.1):** 684 bytes written → reader expects 762 (missing 78)
   - **After (1.16.0):** 693 bytes written → reader expects 771 (missing 78)
   - 9 more bytes written, but the SAME 78-byte error

### Key Evidence from Logs

```
Parquet 1.13.1:
  write(74 bytes): totalSoFar=679
  🔒 close: totalBytesWritten=684 writeCalls=250
  ✅ Stored: 684 bytes
  ❌ Read expects: 762 bytes (684 + 78)

Parquet 1.16.0:
  write(1 byte):  totalSoFar=688  [FOOTER?]
  write(1 byte):  totalSoFar=689  [FOOTER?]
  write(4 bytes): totalSoFar=693  [FOOTER?]
  🔒 close: totalBytesWritten=693 writeCalls=259
  ✅ Stored: 693 bytes
  ❌ Read expects: 771 bytes (693 + 78)
```

### What This Means

**The 78-byte discrepancy is NOT:**

- ❌ A missing footer (the footer is written)
- ❌ Lost bytes (all bytes are stored)
- ❌ A SeaweedFS bug (byte accounting is perfect)
- ❌ Version-specific (it persists across Parquet versions)

**The 78-byte discrepancy IS:**

- ✅ A **systematic offset calculation error** on the Parquet side
- ✅ Related to how Parquet calculates the **expected file size** or the **column chunk offsets**
- ✅ Consistent across all test cases (not random corruption)

## Hypotheses

### H1: Page Header Size Mismatch

Parquet may be calculating the expected data size including page headers that are later compressed or elided by Snappy.

**Evidence:**
- 78 bytes could be several page headers (typically 8-16 bytes each)
- Compression might eliminate or shrink these headers
- Parquet would calculate sizes pre-compression but read post-compression

**Test:** Try **uncompressed Parquet** (no Snappy)

### H2: Column Chunk Metadata Offset Error

The Parquet footer contains byte offsets to the column chunks. These offsets might be calculated incorrectly.

**Evidence:**
- The footer itself is written correctly (we see the writes)
- But the offsets IN the footer may point 78 bytes beyond the actual data
- The reader then tries to read from those wrong offsets

**Test:** Examine the actual file bytes to inspect the footer content

### H3: FSDataOutputStream Position Tracking

Hadoop's `FSDataOutputStream` wrapper might track position differently from our underlying stream.

**Evidence:**
- Our `getPos()` was NEVER called (suspicious!)
- Parquet must get the position somehow - most likely from FSDataOutputStream directly
- FSDataOutputStream keeps its own byte counter (an internal `PositionCache` wrapper) instead of consulting the wrapped stream, which would explain why our `getPos()` is never invoked
- FSDataOutputStream might report a position before a flush completes

**Test:** Implement the `Seekable` interface, or instrument the position FSDataOutputStream reports (a diagnostic sketch follows)
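One way to probe H3 without touching Parquet internals is an auditing wrapper that counts bytes independently and compares the count against the position the Hadoop wrapper reports. A minimal sketch; the class name and log format are ours, not an existing API, and wiring it into the actual write path would require re-wrapping it (e.g. in a fresh `FSDataOutputStream`) before handing it to Parquet:

```scala
import java.io.OutputStream
import org.apache.hadoop.fs.FSDataOutputStream

// Hypothetical H3 diagnostic: forward all writes to the real stream,
// keep an independent byte count, and log whenever getPos() disagrees.
class PositionAuditStream(delegate: FSDataOutputStream) extends OutputStream {
  private var counted: Long = 0L

  override def write(b: Int): Unit = {
    delegate.write(b)
    counted += 1
    audit()
  }

  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    delegate.write(b, off, len)
    counted += len
    audit()
  }

  override def flush(): Unit = delegate.flush()
  override def close(): Unit = { audit(); delegate.close() }

  // If getPos() ever disagrees with our own count, H3 is confirmed.
  private def audit(): Unit = {
    val reported = delegate.getPos
    if (reported != counted) {
      System.err.println(
        s"position drift: getPos()=$reported counted=$counted delta=${reported - counted}")
    }
  }
}
```

If the counts never diverge, H3 can be ruled out and attention shifts to H2 (bad offsets inside the footer); if they diverge only around flushes, the position report is buffering-sensitive.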
### H4: Dictionary Page Size Accounting

Parquet uses dictionary encoding. Dictionary pages might be:

- Calculated at full size
- Written compressed
- Not accounted for properly

**Evidence:**
- The files are small (good candidates for dictionary encoding)
- 78 bytes is a plausible dictionary overhead
- Consistent across files (dictionaries of similar size)

**Test:** Disable dictionary encoding in the writer

## Recommended Next Steps

### Option 1: File Format Analysis (Definitive)

```bash
# Download a failing Parquet file and examine it
hexdump -C part-00000-xxx.parquet | tail -n 50

# Check the footer structure
parquet-tools meta part-00000-xxx.parquet
parquet-tools dump part-00000-xxx.parquet
```

This will show:

- The actual footer content
- The column chunk offsets
- What the missing region (bytes 693-771) was supposed to contain

### Option 2: Test Configuration Changes

**A) Disable Snappy compression:**

```scala
df.write
  .option("compression", "none")
  .parquet(path)
```

**B) Disable dictionary encoding:**

```scala
df.write
  .option("parquet.enable.dictionary", "false")
  .parquet(path)
```

**C) Larger page sizes:**

```scala
spark.conf.set("parquet.page.size", "2097152") // 2 MB instead of 1 MB
```

### Option 3: Implement the Seekable Interface

Hadoop file systems that work with Parquet often implement `Seekable`:

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.fs.Seekable;

public class SeaweedOutputStream extends OutputStream implements Seekable {
    public long getPos() throws IOException { return position + buffer.position(); }
    public void seek(long pos) throws IOException {
        // Not supported for output streams
        throw new UnsupportedOperationException();
    }
    // Seekable also declares seekToNewSource; it is meaningless for writes
    public boolean seekToNewSource(long targetPos) throws IOException { return false; }
}
```

### Option 4: Compare with a Working Implementation

Run the SAME Spark job against:

- Local HDFS
- S3A
- Azure ABFS

and check whether they produce identical Parquet files or different sizes.

### Option 5: Ask the Parquet Community

This might be a known issue:

- Check [Parquet JIRA](https://issues.apache.org/jira/browse/PARQUET)
- Search for "column chunk offset" bugs
- Ask on the Parquet mailing list

## Why This is Hard to Debug

1. **Black-box problem:** We see the writes going in, but not Parquet's internal size calculations
2. **Wrapper layers:** FSDataOutputStream sits between us and Parquet
3. **Binary format:** The footer can't be inspected without tools
4. **Consistent failure:** No passing edge cases to compare against

## Files for Investigation

Priority files to examine:

1. `part-00000-09a699c4-2299-45f9-8bee-8a8b1e241905.c000.snappy.parquet` (693 bytes, year=2021)
2. Any file from year=2020 (705 bytes)

## Success Criteria for Fix

- [ ] No EOFException on any Parquet file
- [ ] File size matches between write and read
- [ ] All 10 tests pass consistently

## Workaround Options (If No Fix Found)

1. **Use a different format:** Write ORC or Avro instead of Parquet (see the sketch below)
2. **Pad files:** Add 78 bytes of padding to match the expected size (hacky - and since Parquet locates its footer from the end of the file, naive trailing padding would itself break reads)
3. **Fix on read:** Modify SeaweedInputStream to lie about the file size
4. **Different Spark version:** Try Spark 4.0.1 (different Parquet integration)
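Workaround 1 is the least invasive and doubles as a cross-check that the failure is Parquet-specific. A minimal sketch, assuming the same `df`, `spark`, and output `path` as in the failing job:

```scala
// Minimal sketch of workaround 1: write the same data as ORC, then read
// it back through SeaweedFS. If this round-trips cleanly, the stream layer
// is fine and the 78-byte problem is confined to the Parquet path.
df.write
  .option("compression", "snappy")
  .orc(path)

val roundTrip = spark.read.orc(path)
assert(roundTrip.count() == df.count(), "ORC round-trip lost rows")
```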