# Parquet 78-Byte EOFException - Final Analysis
**Problem Status:** UNRESOLVED after Parquet 1.16.0 upgrade
## Symptoms
`EOFException: Reached the end of stream. Still have: 78 bytes left`

- Occurs on every Parquet read operation
- 78 bytes is ALWAYS the exact discrepancy across all files/runs
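For context, a minimal read against any affected file reproduces the failure. A sketch using the parquet-hadoop API directly; the class name and `seaweedfs://` path are illustrative, and it assumes the SeaweedFS Hadoop client is on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ReproduceEof {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative path; any failing part file should do
        Path path = new Path("seaweedfs://filer:8888/test/part-00000.snappy.parquet");
        // Reading the footer/row groups fails with:
        // EOFException: Reached the end of stream. Still have: 78 bytes left
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
            System.out.println("rows = " + reader.getRecordCount());
        }
    }
}
```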
## What We've Confirmed ✅
- **SeaweedFS works perfectly**
  - All bytes written through `write()` are stored correctly
  - `totalBytesWritten == position == chunks == attr` (perfect consistency)
  - File integrity verified at every step
- **Parquet footer IS being written**
  - Logs show the complete write sequence, including the footer
  - Small writes at the end (1, 1, 4 bytes) match the Parquet trailer structure
  - All bytes flushed successfully before close
- **Parquet 1.16.0 upgrade changes behavior but not the error**
  - Before (1.13.1): 684 bytes written → reader expects 762 (missing 78)
  - After (1.16.0): 693 bytes written → reader expects 771 (missing 78)
  - +9 bytes written, but the SAME 78-byte error
## Key Evidence from Logs
**Parquet 1.13.1:**

```
write(74 bytes): totalSoFar=679
🔒 close: totalBytesWritten=684 writeCalls=250
✅ Stored: 684 bytes
❌ Read expects: 762 bytes (684 + 78)
```
**Parquet 1.16.0:**

```
write(1 byte): totalSoFar=688 [FOOTER?]
write(1 byte): totalSoFar=689 [FOOTER?]
write(4 bytes): totalSoFar=693 [FOOTER?]
🔒 close: totalBytesWritten=693 writeCalls=259
✅ Stored: 693 bytes
❌ Read expects: 771 bytes (693 + 78)
```
## What This Means
The 78-byte discrepancy is NOT:
- ❌ Missing footer (footer is written)
- ❌ Lost bytes (all bytes stored)
- ❌ SeaweedFS bug (perfect byte accounting)
- ❌ Version-specific (persists across Parquet versions)
The 78-byte discrepancy IS:
- ✅ A systematic offset calculation error in Parquet
- ✅ Related to how Parquet calculates expected file size or column chunk offsets
- ✅ Consistent across all test cases (not random corruption)
## Hypotheses
### H1: Page Header Size Mismatch

Parquet might calculate the expected data size including page headers that end up compressed or elided under Snappy compression.

**Evidence:**
- 78 bytes could be several page headers (typically 8-16 bytes each)
- Compression might eliminate or shrink these headers
- Parquet calculates sizes pre-compression but reads post-compression

**Test:** Try uncompressed Parquet, no Snappy (see Option 2A below)
### H2: Column Chunk Metadata Offset Error

The Parquet footer contains byte offsets to the column chunks. These offsets might be calculated incorrectly.

**Evidence:**
- The footer itself is written correctly (we see the writes)
- But the offsets IN the footer may point 78 bytes beyond the actual data
- The reader then tries to read from those wrong offsets

**Test:** Examine the actual file bytes to inspect the footer content (see Option 1 below)
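For reference when inspecting the bytes, the overall Parquet file layout (per the Parquet format spec) is:

```
"PAR1" magic (4 bytes)
<row group data ...>
<FileMetaData footer, Thrift compact protocol>
<footer length (4 bytes, little-endian)>
"PAR1" magic (4 bytes)
```

The column chunk offsets live inside FileMetaData, so a wrong offset changes no byte counts on write but sends the reader past EOF.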
### H3: FSDataOutputStream Position Tracking

Hadoop's FSDataOutputStream wrapper might track the stream position differently than our underlying stream does.

**Evidence:**
- `getPos()` was NEVER called on our stream (suspicious!)
- Parquet must get the position somehow, most likely from FSDataOutputStream directly
- FSDataOutputStream might report a position before data is flushed
- Notably, FSDataOutputStream keeps its own byte counter (an internal PositionCache) rather than delegating getPos() to the wrapped stream, which would explain why ours is never called

**Test:** Implement the Seekable interface (see Option 3 below) or probe FSDataOutputStream directly, as sketched next.
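A minimal probe sketch for this test; the class name and path are illustrative, and it assumes direct access to the Hadoop FileSystem backed by SeaweedFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionProbe {
    public static void main(String[] args) throws Exception {
        Path p = new Path(args[0]); // e.g. seaweedfs://filer:8888/tmp/probe.bin
        FileSystem fs = p.getFileSystem(new Configuration());
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.write(new byte[684]);
            // If this prints anything other than 684, the wrapper's
            // position bookkeeping diverges from the bytes written
            System.out.println("getPos() after write = " + out.getPos());
        }
        System.out.println("stored length = " + fs.getFileStatus(p).getLen());
    }
}
```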
### H4: Dictionary Page Size Accounting

Parquet uses dictionary encoding. Dictionary pages might be:
- Calculated at full size
- Written compressed
- Not accounted for properly

**Evidence:**
- These are small files (good candidates for dictionary encoding)
- 78 bytes is plausible for dictionary overhead
- The discrepancy is consistent across files (dictionaries of similar size)

**Test:** Disable dictionary encoding in the writer (see Option 2B below)
## Recommended Next Steps
### Option 1: File Format Analysis (Definitive)

```bash
# Download a failing Parquet file and examine it
hexdump -C part-00000-xxx.parquet | tail -n 50

# Check footer structure
parquet-tools meta part-00000-xxx.parquet
parquet-tools dump part-00000-xxx.parquet
```

This will show:
- Actual footer content
- Column chunk offsets
- What sits at the end of the file (byte 693) versus the expected size (771)
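If parquet-tools is not at hand, a small sketch can read the 8-byte trailer directly and compute where the footer claims to start; the class name is illustrative and it assumes Hadoop FileSystem access to the file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TrailerCheck {
    public static void main(String[] args) throws Exception {
        Path p = new Path(args[0]); // e.g. the 693-byte part file
        FileSystem fs = p.getFileSystem(new Configuration());
        long len = fs.getFileStatus(p).getLen();   // what SeaweedFS reports
        byte[] tail = new byte[8];                 // footer length + magic
        try (FSDataInputStream in = fs.open(p)) {
            in.readFully(len - 8, tail);
        }
        int footerLen = (tail[0] & 0xff) | (tail[1] & 0xff) << 8
                      | (tail[2] & 0xff) << 16 | (tail[3] & 0xff) << 24;
        System.out.println("stored length    = " + len);
        System.out.println("footer length    = " + footerLen);
        System.out.println("footer starts at = " + (len - 8 - footerLen));
        System.out.println("magic OK         = " + (tail[4] == 'P'
                && tail[5] == 'A' && tail[6] == 'R' && tail[7] == '1'));
    }
}
```

If the trailing magic and footer length check out, the problem is almost certainly the offsets inside the footer (H2) rather than truncation.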
### Option 2: Test Configuration Changes

**A) Disable Snappy compression** (tests H1):

```scala
df.write
  .option("compression", "none")
  .parquet(path)
```

**B) Disable dictionary encoding** (tests H4):

```scala
df.write
  .option("parquet.enable.dictionary", "false")
  .parquet(path)
```

**C) Larger page sizes:**

```scala
spark.conf.set("parquet.page.size", "2097152") // 2 MB instead of 1 MB
```

If the 78-byte gap shrinks or disappears under any of these, the matching hypothesis (H1 or H4) is implicated.
### Option 3: Implement the Seekable Interface

Hadoop file systems that work with Parquet often implement Seekable. A minimal sketch of the change, assuming the existing `position` (bytes flushed) and `buffer` (pending writes) fields; note that org.apache.hadoop.fs.Seekable also requires seekToNewSource:

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.fs.Seekable;
public class SeaweedOutputStream extends OutputStream implements Seekable {
    // existing fields and write(...) methods elided
    @Override
    public long getPos() throws IOException {
        return position + buffer.position(); // flushed + still-buffered bytes
    }
    @Override
    public void seek(long pos) throws IOException {
        throw new UnsupportedOperationException("seek not supported on output streams");
    }
    @Override
    public boolean seekToNewSource(long targetPos) throws IOException {
        throw new UnsupportedOperationException();
    }
}
```
### Option 4: Compare with Working Implementation
Test the SAME Spark job against:
- Local HDFS
- S3A
- Azure ABFS
See if they produce identical Parquet files or different sizes.
### Option 5: Parquet Community
This might be a known issue:
- Check Parquet JIRA
- Search for "column chunk offset" bugs
- Ask on Parquet mailing list
## Why This is Hard to Debug

- **Black box problem:** We see the writes going in, but not Parquet's internal size calculations
- **Wrapper layers:** FSDataOutputStream sits between our stream and Parquet
- **Binary format:** The footer can't be inspected without tools
- **Consistent failure:** No passing cases to compare against
## Files for Investigation

Priority files to examine:
- `part-00000-09a699c4-2299-45f9-8bee-8a8b1e241905.c000.snappy.parquet` (693 bytes, year=2021)
- Any file from year=2020 (705 bytes)
## Success Criteria for Fix
- No EOFException on any Parquet file
- File size matches between write and read
- All 10 tests pass consistently
## Workaround Options (If No Fix Found)

- **Use a different format:** Write ORC or Avro instead of Parquet
- **Pad files:** Add 78 bytes of padding to match the expected size (hacky)
- **Fix on read:** Modify SeaweedInputStream to report the expected (larger) file size
- **Different Spark version:** Try Spark 4.0.1 (different Parquet integration)