5.0 KiB

Raw Blame History

Local Spark Reproduction - Complete Analysis

Summary

Successfully reproduced the Parquet EOF exception locally and identified the exact bug pattern!

Test Results

Unit Tests (GetPosBufferTest)

✅ ALL 3 TESTS PASS - Including the exact 78-byte buffered scenario

Spark Integration Test

❌ FAILS - EOFException: Still have: 78 bytes left

Root Cause Identified

The Critical Discovery

Throughout the ENTIRE Parquet file write:

getPos(): flushedPosition=0 bufferPosition=1252 ← Parquet's last getPos() call
close START: buffer.position()=1260            ← 8 MORE bytes were written!
close END: finalPosition=1260                   ← Actual file size

Problem: Data never flushes during write - it ALL stays in the buffer until close!

The Bug Sequence

Parquet writes column data
- Calls getPos() after each chunk → gets positions like 4, 22, 48, ..., 1252
- Records these in memory for the footer
Parquet writes footer metadata
- Writes 8 MORE bytes (footer size, offsets, etc.)
- Buffer now has 1260 bytes total
- BUT doesn't call getPos() again!
Parquet closes stream
- Flush sends all 1260 bytes to storage
- File is 1260 bytes
Footer metadata problem
- Footer says "last data at position 1252"
- But actual file is 1260 bytes
- Footer itself is at bytes [1252-1260)
When reading
- Parquet reads footer: "data ends at 1252"
- Calculates: "next chunk must be at 1260"
- Tries to read 78 bytes from position 1260
- File ends at 1260 → EOF!

Why The "78 Bytes" Is Consistent

The "78 bytes missing" is NOT random. It's likely:

A specific Parquet structure size (row group index, column index, bloom filter, etc.)
Or the sum of several small structures that Parquet expects

The key is that Parquet's footer metadata has incorrect offsets because:

Offsets were recorded via getPos() calls
But additional data was written AFTER the last getPos() call
Footer doesn't account for this delta

The Deeper Issue

SeaweedOutputStream.getPos() implementation is CORRECT:

public long getPos() {
    return position + buffer.position();
}

This accurately returns the current write position including buffered data.

The problem: Parquet calls getPos() to record positions, then writes MORE data without calling getPos() again before close!

Comparison: Unit Tests vs Spark

Unit Tests (Pass ✅)

1. write(data1)
2. getPos() → 100
3. write(data2)  
4. getPos() → 300
5. write(data3)
6. getPos() → 378
7. close() → flush 378 bytes
   File size = 378 ✅

Spark/Parquet (Fail ❌)

1. write(column_chunk_1)
2. getPos() → 100  ← recorded in footer
3. write(column_chunk_2)
4. getPos() → 300  ← recorded in footer
5. write(column_chunk_3)
6. getPos() → 1252 ← recorded in footer
7. write(footer_metadata) → +8 bytes
8. close() → flush 1260 bytes
   File size = 1260
   Footer says: data at [0-1252], but actual [0-1260] ❌

Potential Solutions

Option 1: Hadoop Convention - Wrap Position

Many Hadoop FileSystems track a "wrapping" position that gets updated on every write:

private long writePosition = 0;

@Override
public void write(byte[] b, int off, int len) {
    super.write(b, off, len);
    writePosition += len;
}

@Override  
public long getPos() {
    return writePosition; // Always accurate, even if not flushed
}

Option 2: Force Parquet To Call getPos() Before Footer

Not feasible - we can't modify Parquet's behavior.

Option 3: The Current Implementation Should Work!

Actually, position + buffer.position() DOES give the correct position including unflushed data!

Let me verify: if buffer has 1260 bytes and position=0, then getPos() returns 1260. That's correct!

SO WHY DOES THE LAST getPos() RETURN 1252 INSTEAD OF 1260?

The Real Question

Looking at our logs:

Last getPos(): bufferPosition=1252 
close START: buffer.position()=1260

There's an 8-byte gap! Between the last getPos() call and close(), Parquet wrote 8 more bytes.

This is EXPECTED behavior - Parquet writes footer data after recording positions!

The Actual Problem

The issue is that Parquet:

Builds row group metadata with positions from getPos() calls
Writes column chunk data
Writes footer with those positions
But the footer itself takes space!

When reading, Parquet sees "row group ends at 1252" and tries to read from there, but the footer is also at 1252, creating confusion.

This should work fine in HDFS/S3 - so what's different about SeaweedFS?

Next Steps

Compare with HDFS - How does HDFS handle this?
Examine actual Parquet file - Download and use parquet-tools meta to see footer structure
Check if it's a file size mismatch - Does filer report wrong file size?
Verify chunk boundaries - Are chunks recorded correctly in the entry?

The bug is subtle and related to how Parquet calculates offsets vs. how SeaweedFS reports them!

5.0 KiB Raw Blame History