Parquet 78-Byte EOFException - Final Analysis

Problem Status: UNRESOLVED after Parquet 1.16.0 upgrade

Symptoms

  • EOFException: Reached the end of stream. Still have: 78 bytes left
  • Occurs on every Parquet read operation
  • 78 bytes is ALWAYS the exact discrepancy across all files/runs

What We've Confirmed

  1. SeaweedFS works perfectly

    • All bytes written through write() are stored correctly
    • totalBytesWritten == position == chunks == attr (perfect consistency)
    • File integrity verified at every step
  2. Parquet footer IS being written

    • Logs show complete write sequence including footer
    • Small writes at the end (1, 1, 4 bytes) are consistent with the Parquet trailer (4-byte footer length followed by the 4-byte PAR1 magic)
    • All bytes flushed successfully before close
  3. The Parquet 1.16.0 upgrade changes the byte counts but not the error

    • Before (1.13.1): 684 bytes written → expects 762 (missing 78)
    • After (1.16.0): 693 bytes written → expects 771 (missing 78)
    • +9 bytes written, but SAME 78-byte error

Key Evidence from Logs

Parquet 1.13.1:
write(74 bytes): totalSoFar=679
🔒 close: totalBytesWritten=684 writeCalls=250
✅ Stored: 684 bytes
❌ Read expects: 762 bytes (684 + 78)

Parquet 1.16.0:
write(1 byte): totalSoFar=688 [FOOTER?]
write(1 byte): totalSoFar=689 [FOOTER?]
write(4 bytes): totalSoFar=693 [FOOTER?]
🔒 close: totalBytesWritten=693 writeCalls=259
✅ Stored: 693 bytes  
❌ Read expects: 771 bytes (693 + 78)
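
The arithmetic behind both runs can be checked mechanically. The sketch below pulls the two numbers out of the close and read log lines quoted above and subtracts them. `GapCheck` and its method names are made up for this illustration, and the regular expressions assume the log format shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this note; not part of SeaweedFS or Parquet.
class GapCheck {
    private static final Pattern WRITTEN = Pattern.compile("totalBytesWritten=(\\d+)");
    private static final Pattern EXPECTED = Pattern.compile("Read expects: (\\d+) bytes");

    // Extracts the first captured number from a log line, or throws if absent.
    static long extract(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("pattern not found in: " + line);
        }
        return Long.parseLong(m.group(1));
    }

    // Gap between what the reader expects and what the writer reported at close.
    static long gap(String closeLine, String readLine) {
        return extract(EXPECTED, readLine) - extract(WRITTEN, closeLine);
    }

    public static void main(String[] args) {
        long gap1131 = gap("close: totalBytesWritten=684 writeCalls=250",
                           "Read expects: 762 bytes (684 + 78)");
        long gap1160 = gap("close: totalBytesWritten=693 writeCalls=259",
                           "Read expects: 771 bytes (693 + 78)");
        // Prints: 1.13.1 gap = 78, 1.16.0 gap = 78
        System.out.println("1.13.1 gap = " + gap1131 + ", 1.16.0 gap = " + gap1160);
    }
}
```

Running the same check over every test file's log would confirm whether 78 really is constant or only constant for these two samples.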

What This Means

The 78-byte discrepancy is NOT:

  • Missing footer (footer is written)
  • Lost bytes (all bytes stored)
  • SeaweedFS storage bug (byte accounting is consistent end to end)
  • Version-specific (persists across Parquet 1.13.1 and 1.16.0)

The 78-byte discrepancy IS:

  • A systematic offset or size calculation error on the write path (in Parquet itself or in the stream layers it writes through)
  • Related to how the expected file size or column chunk offsets are calculated
  • Consistent across all test cases (not random corruption)

Hypotheses

H1: Page Header Size Mismatch

Parquet might be calculating the expected data size with page headers included, while those headers end up smaller or absent in the Snappy-compressed output.

Evidence:

  • 78 bytes could be multiple page headers (typically 8-16 bytes each)
  • Compression might eliminate or reduce these headers
  • Parquet may calculate the size pre-compression but read post-compression

Test: Try uncompressed Parquet (no Snappy)

H2: Column Chunk Metadata Offset Error

Parquet footer contains byte offsets to column chunks. These offsets might be calculated incorrectly.

Evidence:

  • The footer itself is written (we see the writes)
  • But the offsets in the footer may point 78 bytes beyond the actual data
  • The reader would then try to read from these wrong offsets

Test: Examine actual file bytes to see footer content

H3: FSDataOutputStream Position Tracking

Hadoop's FSDataOutputStream wrapper might track position differently than our underlying stream.

Evidence:

  • getPos() was NEVER called (suspicious!)
  • Parquet must get position somehow - likely from FSDataOutputStream directly
  • FSDataOutputStream might return position before flush

Test: Implement Seekable interface or check FSDataOutputStream behavior
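
To make H3 concrete, the sketch below shows how a buffering stream can report two different positions for the same sequence of writes, depending on whether buffered bytes are counted. `BufferingStream`, `flushedOnlyPos`, and `logicalPos` are hypothetical names for illustration only. This is not SeaweedFS or Hadoop code, and it does not show that the 78 bytes come from this effect.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Hypothetical stand-in for a buffering output stream; not SeaweedFS code.
class BufferingStream extends OutputStream {
    private final OutputStream sink;
    private final ByteBuffer buffer;
    private long flushedPosition = 0; // bytes already handed to the sink

    BufferingStream(OutputStream sink, int bufferSize) {
        this.sink = sink;
        this.buffer = ByteBuffer.allocate(bufferSize);
    }

    @Override
    public void write(int b) throws IOException {
        if (!buffer.hasRemaining()) {
            flush(); // make room before accepting the next byte
        }
        buffer.put((byte) b);
    }

    @Override
    public void flush() throws IOException {
        sink.write(buffer.array(), 0, buffer.position());
        flushedPosition += buffer.position();
        buffer.clear();
        sink.flush();
    }

    // Position that ignores bytes still sitting in the buffer.
    long flushedOnlyPos() {
        return flushedPosition;
    }

    // Position that includes buffered bytes: the logical write offset.
    long logicalPos() {
        return flushedPosition + buffer.position();
    }

    public static void main(String[] args) throws IOException {
        BufferingStream out = new BufferingStream(new ByteArrayOutputStream(), 1024);
        out.write(new byte[100]); // inherited write(byte[]) loops over write(int)
        System.out.println(out.flushedOnlyPos()); // 0: nothing flushed yet
        System.out.println(out.logicalPos());     // 100: what a writer would expect
        out.flush();
        System.out.println(out.flushedOnlyPos()); // 100 after the flush
    }
}
```

If any layer on the write path hands Parquet the flushed-only value while recording offsets, every offset taken before a flush would be short by the buffered amount, which is the kind of constant skew worth ruling out.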

H4: Dictionary Page Size Accounting

Parquet uses dictionary encoding. Dictionary pages might be:

  • Calculated at full size
  • Written compressed
  • Not accounted for properly

Evidence:

  • Small files (good candidates for dictionary encoding)
  • 78 bytes reasonable for dictionary overhead
  • Consistent across files (dictionaries similar size)

Test: Disable dictionary encoding in writer

Option 1: File Format Analysis (Definitive)

# Download a failing Parquet file and examine it
hexdump -C part-00000-xxx.parquet | tail -n 50

# Check footer structure
parquet-tools meta part-00000-xxx.parquet
parquet-tools dump part-00000-xxx.parquet

This will show:

  • Actual footer content
  • Column chunk offsets
  • What sits at the end of the 693-byte file, and what the reader expects to find between offsets 693 and 771
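
If parquet-tools is not at hand, the trailer can be checked directly. A Parquet file ends with a 4-byte little-endian footer length followed by the 4-byte magic `PAR1`, so the footer start is `fileLength - 8 - footerLength`. The sketch below reads only that trailer and does not decode the Thrift footer. `ParquetTail` is a name made up for this note.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for this note; reads the trailer only, not the footer.
class ParquetTail {
    // Returns the offset where the Thrift footer starts, based only on the
    // 8-byte trailer: [footer length: int32 little-endian]["PAR1"].
    static long footerStart(byte[] file) {
        if (file.length < 12) {
            throw new IllegalArgumentException("too short to be a Parquet file");
        }
        String magic = new String(file, file.length - 4, 4, StandardCharsets.US_ASCII);
        if (!magic.equals("PAR1")) {
            throw new IllegalArgumentException("trailing magic is not PAR1: " + magic);
        }
        int footerLen = ByteBuffer.wrap(file, file.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        long start = (long) file.length - 8 - footerLen;
        if (footerLen < 0 || start < 4) {
            throw new IllegalArgumentException("implausible footer length: " + footerLen);
        }
        return start;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ParquetTail <file.parquet>");
            return;
        }
        byte[] file = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("file length  = " + file.length);
        System.out.println("footer start = " + footerStart(file));
    }
}
```

If the trailer of the stored 693-byte file parses cleanly, the file is structurally complete and the question narrows to which offsets inside the footer reach past the end of the data.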

Option 2: Test Configuration Changes

A) Disable Snappy compression:

df.write
  .option("compression", "none")
  .parquet(path)

B) Disable dictionary encoding:

df.write
  .option("parquet.enable.dictionary", "false")
  .parquet(path)

C) Larger page sizes:

spark.conf.set("parquet.page.size", "2097152") // 2MB instead of 1MB

Option 3: Implement Seekable Interface

Hadoop file systems that work with Parquet often implement Seekable:

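import java.io.OutputStream;
import org.apache.hadoop.fs.Seekable;

public class SeaweedOutputStream extends OutputStream implements Seekable {
    // Existing fields (position, buffer) and write()/flush()/close() omitted.
    @Override
    public long getPos() {
        return position + buffer.position(); // flushed bytes + buffered bytes
    }

    @Override
    public void seek(long pos) {
        throw new UnsupportedOperationException(); // not supported for output streams
    }

    @Override
    public boolean seekToNewSource(long targetPos) {
        return false; // required by Seekable; no alternate source exists
    }
}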

Option 4: Compare with Working Implementation

Test the SAME Spark job against:

  • Local HDFS
  • S3A
  • Azure ABFS

See if they produce identical Parquet files or different sizes.
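
A minimal way to run that comparison once both outputs are copied to local disk is sketched below. It reports the two sizes and the first differing offset using `Files.mismatch`, which requires Java 12 or later. `CompareOutputs` is a hypothetical name. Byte-for-byte equality between backends is an assumption here and may not hold even when both files are valid, so the size difference and the offset where the files diverge are the useful signals.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for this note; compares two local copies of the output.
class CompareOutputs {
    // Describes how two files differ: sizes plus the first differing offset.
    // Files.mismatch returns -1 when the contents are identical.
    static String describe(Path a, Path b) {
        try {
            long sizeA = Files.size(a);
            long sizeB = Files.size(b);
            long firstDiff = Files.mismatch(a, b);
            return "sizes " + sizeA + " vs " + sizeB + ", first mismatch at "
                    + (firstDiff < 0 ? "none" : String.valueOf(firstDiff));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java CompareOutputs <seaweedfs-copy> <reference-copy>");
            return;
        }
        System.out.println(describe(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

A reference file that is exactly 78 bytes longer than the SeaweedFS copy would point at bytes never reaching the stream. Equal sizes with different footer bytes would point at offset calculation instead.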

Option 5: Parquet Community

This might be a known issue:

  • Check Parquet JIRA
  • Search for "column chunk offset" bugs
  • Ask on Parquet mailing list

Why This is Hard to Debug

  1. Black box problem: We see the writes going in, but not Parquet's internal size and offset calculations
  2. Wrapper layers: FSDataOutputStream sits between us and Parquet
  3. Binary format: Can't inspect footer without tools
  4. Consistent failure: No edge cases to compare against

Files for Investigation

Priority files to examine:

  1. part-00000-09a699c4-2299-45f9-8bee-8a8b1e241905.c000.snappy.parquet (693 bytes, year=2021)
  2. Any file from year=2020 (705 bytes)

Success Criteria for Fix

  • No EOFException on any Parquet file
  • File size matches between write and read
  • All 10 tests pass consistently

Workaround Options (If No Fix Found)

  1. Use different format: Write as ORC or Avro instead of Parquet
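  2. Pad files: Add 78 bytes of padding to match expected size (hacky; padding appended after the trailer would displace the footer length and magic that readers look for at the end of the file)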
  3. Fix on read: Modify SeaweedInputStream to lie about file size
  4. Different Spark version: Try Spark 4.0.1 (different Parquet integration)