# Parquet 78-Byte EOFException - Final Analysis
**Problem Status:** UNRESOLVED after Parquet 1.16.0 upgrade
## Symptoms
`EOFException: Reached the end of stream. Still have: 78 bytes left`

- Occurs on every Parquet read operation
- 78 bytes is ALWAYS the exact discrepancy across all files/runs
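For context, a minimal read against any affected file reproduces the failure. A sketch using the parquet-hadoop API directly; the class name and `seaweedfs://` path are illustrative, and it assumes the SeaweedFS Hadoop client is on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ReproduceEof {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative path; any failing part file should do
        Path path = new Path("seaweedfs://filer:8888/test/part-00000.snappy.parquet");
        // Reading the footer/row groups fails with:
        // EOFException: Reached the end of stream. Still have: 78 bytes left
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            System.out.println("rows = " + reader.getRecordCount());
        }
    }
}
```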
## What We've Confirmed ✅
- **SeaweedFS works perfectly**
  - All bytes written through `write()` are stored correctly
  - `totalBytesWritten == position == chunks == attr` (perfect consistency)
  - File integrity verified at every step
- **Parquet footer IS being written**
  - Logs show the complete write sequence, including the footer
  - Small writes at the end (1, 1, 4 bytes) match the Parquet trailer structure
  - All bytes flushed successfully before close
- **Parquet 1.16.0 upgrade changes behavior but not the error**
  - Before (1.13.1): 684 bytes written → reader expects 762 (missing 78)
  - After (1.16.0): 693 bytes written → reader expects 771 (missing 78)
  - +9 bytes written, but the SAME 78-byte error
## Key Evidence from Logs
**Parquet 1.13.1:**

```
write(74 bytes): totalSoFar=679
🔒 close: totalBytesWritten=684 writeCalls=250
✅ Stored: 684 bytes
❌ Read expects: 762 bytes (684 + 78)
```
**Parquet 1.16.0:**

```
write(1 byte): totalSoFar=688 [FOOTER?]
write(1 byte): totalSoFar=689 [FOOTER?]
write(4 bytes): totalSoFar=693 [FOOTER?]
🔒 close: totalBytesWritten=693 writeCalls=259
✅ Stored: 693 bytes
❌ Read expects: 771 bytes (693 + 78)
```
## What This Means
The 78-byte discrepancy is NOT:
- ❌ Missing footer (footer is written)
- ❌ Lost bytes (all bytes stored)
- ❌ SeaweedFS bug (perfect byte accounting)
- ❌ Version-specific (persists across Parquet versions)
The 78-byte discrepancy IS:
- ✅ A systematic offset calculation error in Parquet
- ✅ Related to how Parquet calculates expected file size or column chunk offsets
- ✅ Consistent across all test cases (not random corruption)
## Hypotheses
### H1: Page Header Size Mismatch

Parquet might calculate the expected data size including page headers that end up compressed or elided under Snappy compression.

**Evidence:**
- 78 bytes could be several page headers (typically 8-16 bytes each)
- Compression might eliminate or shrink these headers
- Parquet calculates sizes pre-compression but reads post-compression

**Test:** Try uncompressed Parquet, no Snappy (see Option 2A below)
### H2: Column Chunk Metadata Offset Error

The Parquet footer contains byte offsets to the column chunks. These offsets might be calculated incorrectly.

**Evidence:**
- The footer itself is written correctly (we see the writes)
- But the offsets IN the footer may point 78 bytes beyond the actual data
- The reader then tries to read from those wrong offsets

**Test:** Examine the actual file bytes to inspect the footer content (see Option 1 below)
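For reference when inspecting the bytes, the overall Parquet file layout (per the Parquet format spec) is:

```
"PAR1" magic (4 bytes)
<row group data ...>
<FileMetaData footer, Thrift compact protocol>
<footer length (4 bytes, little-endian)>
"PAR1" magic (4 bytes)
```

The column chunk offsets live inside FileMetaData, so a wrong offset changes no byte counts on write but sends the reader past EOF.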
### H3: FSDataOutputStream Position Tracking

Hadoop's FSDataOutputStream wrapper might track the stream position differently than our underlying stream does.

**Evidence:**
- `getPos()` was NEVER called on our stream (suspicious!)
- Parquet must get the position somehow, most likely from FSDataOutputStream directly
- FSDataOutputStream might report a position before data is flushed
- Notably, FSDataOutputStream keeps its own byte counter (an internal PositionCache) rather than delegating getPos() to the wrapped stream, which would explain why ours is never called

**Test:** Implement the Seekable interface (see Option 3 below) or probe FSDataOutputStream directly, as sketched next.
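A minimal probe sketch for this test; the class name and path are illustrative, and it assumes direct access to the Hadoop FileSystem backed by SeaweedFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionProbe {
    public static void main(String[] args) throws Exception {
        Path p = new Path(args[0]); // e.g. seaweedfs://filer:8888/tmp/probe.bin
        FileSystem fs = p.getFileSystem(new Configuration());
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.write(new byte[684]);
            // If this prints anything other than 684, the wrapper's
            // position bookkeeping diverges from the bytes written
            System.out.println("getPos() after write = " + out.getPos());
        }
        System.out.println("stored length = " + fs.getFileStatus(p).getLen());
    }
}
```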
### H4: Dictionary Page Size Accounting

Parquet uses dictionary encoding. Dictionary pages might be:
- Calculated at full size
- Written compressed
- Not accounted for properly

**Evidence:**
- These are small files (good candidates for dictionary encoding)
- 78 bytes is plausible for dictionary overhead
- The discrepancy is consistent across files (dictionaries of similar size)

**Test:** Disable dictionary encoding in the writer (see Option 2B below)
## Recommended Next Steps
### Option 1: File Format Analysis (Definitive)

```bash
# Download a failing Parquet file and examine it
hexdump -C part-00000-xxx.parquet | tail -n 50

# Check footer structure
parquet-tools meta part-00000-xxx.parquet
parquet-tools dump part-00000-xxx.parquet
```

This will show:
- Actual footer content
- Column chunk offsets
- What sits at the end of the file (byte 693) versus the expected size (771)
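If parquet-tools is not at hand, a small sketch can read the 8-byte trailer directly and compute where the footer claims to start; the class name is illustrative and it assumes Hadoop FileSystem access to the file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TrailerCheck {
    public static void main(String[] args) throws Exception {
        Path p = new Path(args[0]); // e.g. the 693-byte part file
        FileSystem fs = p.getFileSystem(new Configuration());
        long len = fs.getFileStatus(p).getLen();   // what SeaweedFS reports
        byte[] tail = new byte[8];                 // footer length + magic
        try (FSDataInputStream in = fs.open(p)) {
            in.readFully(len - 8, tail);
        }
        int footerLen = (tail[0] & 0xff) | (tail[1] & 0xff) << 8
                      | (tail[2] & 0xff) << 16 | (tail[3] & 0xff) << 24;
        System.out.println("stored length    = " + len);
        System.out.println("footer length    = " + footerLen);
        System.out.println("footer starts at = " + (len - 8 - footerLen));
        System.out.println("magic OK         = " + (tail[4] == 'P'
                && tail[5] == 'A' && tail[6] == 'R' && tail[7] == '1'));
    }
}
```

If the trailing magic and footer length check out, the problem is almost certainly the offsets inside the footer (H2) rather than truncation.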
### Option 2: Test Configuration Changes

**A) Disable Snappy compression** (tests H1):

```scala
df.write
  .option("compression", "none")
  .parquet(path)
```

**B) Disable dictionary encoding** (tests H4):

```scala
df.write
  .option("parquet.enable.dictionary", "false")
  .parquet(path)
```

**C) Larger page sizes:**

```scala
spark.conf.set("parquet.page.size", "2097152") // 2 MB instead of 1 MB
```

If the 78-byte gap shrinks or disappears under any of these, the matching hypothesis (H1 or H4) is implicated.
### Option 3: Implement the Seekable Interface

Hadoop file systems that work with Parquet often implement Seekable. A minimal sketch of the change, assuming the existing `position` (bytes flushed) and `buffer` (pending writes) fields; note that org.apache.hadoop.fs.Seekable also requires seekToNewSource:

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.fs.Seekable;
public class SeaweedOutputStream extends OutputStream implements Seekable {
    // existing fields and write(...) methods elided
    @Override
    public long getPos() throws IOException {
        return position + buffer.position(); // flushed + still-buffered bytes
    }
    @Override
    public void seek(long pos) throws IOException {
        throw new UnsupportedOperationException("seek not supported on output streams");
    }
    @Override
    public boolean seekToNewSource(long targetPos) throws IOException {
        throw new UnsupportedOperationException();
    }
}
```
### Option 4: Compare with Working Implementation
Test the SAME Spark job against:
- Local HDFS
- S3A
- Azure ABFS
See if they produce identical Parquet files or different sizes.
### Option 5: Parquet Community
This might be a known issue:
- Check Parquet JIRA
- Search for "column chunk offset" bugs
- Ask on Parquet mailing list
## Why This is Hard to Debug

- **Black box problem:** We see the writes going in, but not Parquet's internal size calculations
- **Wrapper layers:** FSDataOutputStream sits between our stream and Parquet
- **Binary format:** The footer can't be inspected without tools
- **Consistent failure:** No passing cases to compare against
## Files for Investigation

Priority files to examine:
- `part-00000-09a699c4-2299-45f9-8bee-8a8b1e241905.c000.snappy.parquet` (693 bytes, year=2021)
- Any file from year=2020 (705 bytes)
## Success Criteria for Fix
- No EOFException on any Parquet file
- File size matches between write and read
- All 10 tests pass consistently
## Workaround Options (If No Fix Found)

- **Use a different format:** Write ORC or Avro instead of Parquet
- **Pad files:** Add 78 bytes of padding to match the expected size (hacky)
- **Fix on read:** Modify SeaweedInputStream to report the expected (larger) file size
- **Different Spark version:** Try Spark 4.0.1 (different Parquet integration)