docs: comprehensive analysis of persistent 78-byte Parquet issue

After Parquet 1.16.0 upgrade:
- Error persists (EOFException: 78 bytes left)
- File sizes changed (684→693, 696→705) but SAME 78-byte gap
- Footer IS being written (logs show complete write sequence)
- All bytes ARE stored correctly (perfect consistency)

Conclusion: This is a systematic error in how Parquet calculates the
expected file size, not a missing-data problem.

Possible causes:
1. Page header size mismatch with Snappy compression
2. Column chunk metadata offset error in footer
3. FSDataOutputStream position tracking issue
4. Dictionary page size accounting problem

Recommended next steps:
1. Try uncompressed Parquet (remove Snappy)
2. Examine actual file bytes with parquet-tools
3. Test with different Spark version (4.0.1)
4. Compare with known-working FS (HDFS, S3A)

The constant 78-byte gap suggests a fixed-size structure that Parquet
accounts for but that is never written, or is written differently than
expected.
test/java/spark/PARQUET_ISSUE_SUMMARY.md

# Parquet 78-Byte EOFException - Final Analysis
## Problem Status: **UNRESOLVED after Parquet 1.16.0 upgrade**
### Symptoms
- `EOFException: Reached the end of stream. Still have: 78 bytes left`
- Occurs on **every** Parquet read operation
- **78 bytes is ALWAYS the exact discrepancy** across all files/runs
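A minimal reproduction sketch of the failure (the filer URL and output path below are placeholders, not the actual test paths):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Placeholder filer URL and path; the real tests use their own bucket layout
val out = "seaweedfs://localhost:8888/test/parquet-out"
val spark = SparkSession.builder().appName("parquet-78-byte-repro").getOrCreate()
val df = spark.range(100).withColumn("year", (lit(2020) + (col("id") % 2)).cast("int"))
df.write.mode("overwrite").partitionBy("year").parquet(out)
// The read back is where the failure surfaces:
// EOFException: Reached the end of stream. Still have: 78 bytes left
spark.read.parquet(out).count()
```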
### What We've Confirmed ✅
1. **SeaweedFS works perfectly**
- All bytes written through `write()` are stored correctly
- `totalBytesWritten == position == chunks == attr` (perfect consistency)
- File integrity verified at every step
2. **Parquet footer IS being written**
- Logs show complete write sequence including footer
- Small writes at end (1, 1, 4 bytes) = Parquet trailer structure
- All bytes flushed successfully before close
3. **Parquet 1.16.0 upgrade changes behavior but not error**
- **Before (1.13.1):** 684 bytes written → expects 762 (missing 78)
- **After (1.16.0):** 693 bytes written → expects 771 (missing 78)
- +9 bytes written, but SAME 78-byte error
### Key Evidence from Logs
```
Parquet 1.13.1:
write(74 bytes): totalSoFar=679
🔒 close: totalBytesWritten=684 writeCalls=250
✅ Stored: 684 bytes
❌ Read expects: 762 bytes (684 + 78)

Parquet 1.16.0:
write(1 byte): totalSoFar=688 [FOOTER?]
write(1 byte): totalSoFar=689 [FOOTER?]
write(4 bytes): totalSoFar=693 [FOOTER?]
🔒 close: totalBytesWritten=693 writeCalls=259
✅ Stored: 693 bytes
❌ Read expects: 771 bytes (693 + 78)
```
### What This Means
**The 78-byte discrepancy is NOT:**
- ❌ Missing footer (footer is written)
- ❌ Lost bytes (all bytes stored)
- ❌ SeaweedFS bug (perfect byte accounting)
- ❌ Version-specific (persists across Parquet versions)
**The 78-byte discrepancy IS:**
- ✅ A **systematic offset calculation error** in Parquet
- ✅ Related to how Parquet calculates **expected file size** or **column chunk offsets**
- ✅ Consistent across all test cases (not random corruption)
## Hypotheses
### H1: Page Header Size Mismatch
Parquet may calculate the expected data size including page headers that end up compressed or elided when Snappy is enabled.
**Evidence:**
- 78 bytes could be multiple page headers (typically 8-16 bytes each)
- Compression might eliminate or reduce these headers
- Parquet calculates size pre-compression, reads post-compression
**Test:** Try **uncompressed Parquet** (no Snappy)
### H2: Column Chunk Metadata Offset Error
Parquet footer contains byte offsets to column chunks. These offsets might be calculated incorrectly.
**Evidence:**
- Footer is written correctly (we see the writes)
- But offsets IN the footer point 78 bytes beyond actual data
- Reader tries to read from these wrong offsets
**Test:** Examine actual file bytes to see footer content
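Beyond parquet-tools, the trailer can be decoded by hand: per the Parquet format, a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. A sketch (the file name is a placeholder):
```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Decode the trailer: ...<footer><4-byte LE footer length><"PAR1">
val bytes = Files.readAllBytes(Paths.get("part-00000-xxx.parquet")) // placeholder name
val n = bytes.length
val magic = new String(bytes, n - 4, 4, StandardCharsets.US_ASCII) // must be "PAR1"
val footerLen = ByteBuffer.wrap(bytes, n - 8, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
println(s"fileLength=$n magic=$magic footerLength=$footerLen")
// The footer occupies [n - 8 - footerLen, n - 8); every offset stored in it should be < n
```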
### H3: FSDataOutputStream Position Tracking
Hadoop's `FSDataOutputStream` wrapper might track position differently than our underlying stream.
**Evidence:**
- `getPos()` was NEVER called (suspicious!)
- Parquet must get position somehow - likely from FSDataOutputStream directly
- FSDataOutputStream might return position before flush
**Test:** Implement `Seekable` interface or check FSDataOutputStream behavior
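A quick probe of position tracking, as a sketch (assumes `spark` is the active session and the SeaweedFS Hadoop client is on the classpath; the URL is a placeholder):
```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Compare FSDataOutputStream.getPos() against the bytes actually written
val fs = FileSystem.get(new URI("seaweedfs://localhost:8888/"), spark.sparkContext.hadoopConfiguration)
val p = new Path("/tmp/pos-probe.bin")
val out = fs.create(p, true)
out.write(new Array[Byte](123))
println(s"getPos=${out.getPos}") // expect 123; anything else implicates position tracking
out.close()
println(s"storedLen=${fs.getFileStatus(p).getLen}") // expect 123 after close
```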
### H4: Dictionary Page Size Accounting
Parquet uses dictionary encoding. Dictionary pages might be:
- Calculated at full size
- Written compressed
- Not accounted for properly
**Evidence:**
- Files are small (good candidates for dictionary encoding)
- 78 bytes is a plausible size for dictionary overhead
- The gap is consistent across files (dictionaries would be of similar size)
**Test:** Disable dictionary encoding in writer
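If the `.option(...)` form shown in Option 2B below doesn't reach the writer, the Hadoop-level key can be set directly; `parquet.enable.dictionary` is ParquetOutputFormat's configuration key. A sketch, with `df` and `path` as used elsewhere in the tests:
```scala
// Disable dictionary encoding via the Hadoop configuration that Parquet reads
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")
df.write.mode("overwrite").parquet(path)
```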
## Recommended Next Steps
### Option 1: File Format Analysis (Definitive)
```bash
# Download a failing Parquet file and examine it
hexdump -C part-00000-xxx.parquet | tail -n 50
# Check footer structure
parquet-tools meta part-00000-xxx.parquet
parquet-tools dump part-00000-xxx.parquet
```
This will show:
- The actual footer content
- The column chunk offsets
- Why the reader expects byte 771 when the file ends at byte 693
### Option 2: Test Configuration Changes
**A) Disable Snappy compression:**
```scala
df.write
  .option("compression", "none")
  .parquet(path)
```
**B) Disable dictionary encoding:**
```scala
df.write
  .option("parquet.enable.dictionary", "false")
  .parquet(path)
```
**C) Larger page sizes:**
```scala
spark.conf.set("parquet.page.size", "2097152") // 2MB instead of 1MB
```
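If a setting doesn't appear to take effect at runtime, it can be pinned at session build time via the `spark.hadoop.*` prefix, which Spark copies into the Hadoop configuration (a sketch):
```scala
import org.apache.spark.sql.SparkSession

// spark.hadoop.* entries are copied into the Hadoop conf seen by the Parquet writer
val spark = SparkSession.builder()
  .appName("parquet-78-byte-test")
  .config("spark.hadoop.parquet.page.size", "2097152")
  .config("spark.hadoop.parquet.enable.dictionary", "false")
  .getOrCreate()
```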
### Option 3: Implement Seekable Interface
Hadoop file systems that work with Parquet often implement `Seekable`:
```java
import java.io.OutputStream;
import org.apache.hadoop.fs.Seekable;

public class SeaweedOutputStream extends OutputStream implements Seekable {
    // position = bytes flushed so far; buffer = pending unflushed bytes
    @Override
    public long getPos() {
        return position + buffer.position();
    }
    @Override
    public void seek(long pos) {
        throw new UnsupportedOperationException("seek not supported for output streams");
    }
    @Override
    public boolean seekToNewSource(long targetPos) {
        return false; // output streams have a single destination
    }
    // write()/flush()/close() and the position/buffer fields elided
}
```
### Option 4: Compare with Working Implementation
Test the SAME Spark job against:
- Local HDFS
- S3A
- Azure ABFS
See if they produce identical Parquet files or different sizes.
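A comparison harness could look like the sketch below (all URIs are placeholders; each backend must already be configured, and `df`/`spark` come from the failing job):
```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Write the same DataFrame to each backend, then compare Parquet file sizes
val targets = Seq(
  "hdfs://namenode:8020/tmp/parquet-cmp",      // placeholder
  "s3a://my-bucket/tmp/parquet-cmp",           // placeholder
  "seaweedfs://localhost:8888/tmp/parquet-cmp" // placeholder
)
for (base <- targets) {
  df.write.mode("overwrite").parquet(base)
  val fs = FileSystem.get(new URI(base), spark.sparkContext.hadoopConfiguration)
  fs.listStatus(new Path(base))
    .filter(_.getPath.getName.endsWith(".parquet"))
    .foreach(s => println(s"$base ${s.getPath.getName} ${s.getLen} bytes"))
}
```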
### Option 5: Parquet Community
This might be a known issue:
- Check [Parquet JIRA](https://issues.apache.org/jira/browse/PARQUET)
- Search for "column chunk offset" bugs
- Ask on Parquet mailing list
## Why This is Hard to Debug
1. **Black box problem:** We can see the writes going in, but not Parquet's internal size calculations
2. **Wrapper layers:** FSDataOutputStream sits between us and Parquet
3. **Binary format:** Can't inspect footer without tools
4. **Consistent failure:** No edge cases to compare against
## Files for Investigation
Priority files to examine:
1. `part-00000-09a699c4-2299-45f9-8bee-8a8b1e241905.c000.snappy.parquet` (693 bytes, year=2021)
2. Any file from year=2020 (705 bytes)
## Success Criteria for Fix
- [ ] No EOFException on any Parquet file
- [ ] File size matches between write and read
- [ ] All 10 tests pass consistently
## Workaround Options (If No Fix Found)
1. **Use different format:** Write as ORC or Avro instead of Parquet
2. **Pad files:** Add 78 bytes of padding to match expected size (hacky)
3. **Fix on read:** Modify SeaweedInputStream to lie about file size
4. **Different Spark version:** Try Spark 4.0.1 (different Parquet integration)