3.9 KiB

Raw Blame History

Test Results Summary

Unit Tests: ✅ ALL PASS

Created GetPosBufferTest with 3 comprehensive tests that specifically target the Parquet EOF issue:

Test 1: testGetPosWithBufferedData()

✅ PASSED - Tests basic getPos() behavior with multiple writes and buffer management.

Test 2: testGetPosWithSmallWrites()

✅ PASSED - Simulates Parquet's pattern of many small writes with frequent getPos() calls.

Test 3: testGetPosWithExactly78BytesBuffered()

✅ PASSED - The critical test that reproduces the EXACT bug scenario!

Results:

Position after 1000 bytes + flush: 1000
Position with 78 bytes BUFFERED (not flushed): 1078  ✅
Actual file size: 1078  ✅
Bytes read at position 1000: 78  ✅
SUCCESS: getPos() correctly includes buffered data!

Key Finding

getPos() works correctly in unit tests but Spark tests still fail!

This proves:

✅ SeaweedOutputStream.getPos() returns position + buffer.position() correctly
✅ Files are written with correct sizes
✅ Data can be read back at correct positions
✅ The 78-byte buffered scenario works perfectly

Spark Integration Tests: ❌ STILL FAIL

BUT the FSDataOutputStream.getPos() override IS being called in Spark:

25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 0
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 4
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 22
...
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 190

And the EOF error still occurs:

position=1275 contentLength=1275 bufRemaining=78

The Mystery

If getPos() is:

✅ Implemented correctly (unit tests pass)
✅ Being called by Spark (logs show it)
✅ Returning correct values (logs show reasonable positions)

Then why does Parquet still think there are 78 bytes to read at position 1275?

Possible Explanations

Theory 1: Parquet footer writing happens AFTER stream close

When the stream closes, it flushes the buffer. If Parquet writes the footer metadata BEFORE the final flush but AFTER getting getPos(), the footer could have stale positions.

Theory 2: Buffer position mismatch at close time

The unit tests show position 1078 with 78 bytes buffered. But when the stream closes and flushes, those 78 bytes get written. If the footer is written based on pre-flush positions, it would be off by 78 bytes.

Theory 3: Parquet caches getPos() values

Parquet might call getPos() once per column chunk and cache the value. If it caches the value BEFORE the buffer is flushed, but uses it AFTER, the offset would be wrong.

Theory 4: Multiple streams or file copies

Spark might be writing to a temporary file, then copying/moving it. If the metadata from the first write is used but the second file is what's read, sizes would mismatch.

Next Steps

Add logging to close() - See exact sequence of operations when stream closes
Add logging to flush() - See when buffer is actually flushed vs. when getPos() is called
Check Parquet source - Understand EXACTLY when it calls getPos() vs. when it writes footer
Compare with HDFS - How does HDFS handle this? Does it have special logic?

Hypothesis

The most likely scenario is that Parquet's InternalParquetRecordWriter:

Calls getPos() to record column chunk end positions → Gets 1197 (1275 - 78)
Continues writing more data (78 bytes) to buffer
Closes the stream, which flushes buffer (adds 78 bytes)
Final file size: 1275 bytes
But footer says last chunk ends at 1197
So when reading, it tries to read chunk from [1197, 1275) which is correct
BUT it ALSO tries to read [1275, 1353) because it thinks there's MORE data!

The "78 bytes missing" might actually be "78 bytes DOUBLE-COUNTED" in the footer metadata!

3.9 KiB Raw Blame History