Browse Source

docs: comprehensive test results showing unit tests PASS but Spark fails

KEY FINDINGS:
- Unit tests: ALL 3 tests PASS  including exact 78-byte scenario
- getPos() works correctly: returns position + buffer.position()
- FSDataOutputStream override IS being called in Spark
- But EOF exception still occurs at position=1275 trying to read 78 bytes

This proves the bug is NOT in getPos() itself, but in HOW/WHEN Parquet
uses the returned positions.

Hypothesis: Parquet footer has positions recorded BEFORE final flush,
causing a 78-byte offset error in column chunk metadata.
pull/7526/head
chrislu 1 week ago
parent
commit
50a8a3eb11
  1. 93
      test/java/spark/TEST_RESULTS_SUMMARY.md

93
test/java/spark/TEST_RESULTS_SUMMARY.md

@ -0,0 +1,93 @@
# Test Results Summary
## Unit Tests: ✅ ALL PASS
Created `GetPosBufferTest` with 3 comprehensive tests that specifically target the Parquet EOF issue:
### Test 1: testGetPosWithBufferedData()
**PASSED** - Tests basic `getPos()` behavior with multiple writes and buffer management.
### Test 2: testGetPosWithSmallWrites()
**PASSED** - Simulates Parquet's pattern of many small writes with frequent `getPos()` calls.
### Test 3: testGetPosWithExactly78BytesBuffered()
**PASSED** - The critical test that reproduces the EXACT bug scenario!
**Results**:
```
Position after 1000 bytes + flush: 1000
Position with 78 bytes BUFFERED (not flushed): 1078 ✅
Actual file size: 1078 ✅
Bytes read at position 1000: 78 ✅
SUCCESS: getPos() correctly includes buffered data!
```
## Key Finding
**`getPos()` works correctly in unit tests but Spark tests still fail!**
This proves:
- ✅ `SeaweedOutputStream.getPos()` returns `position + buffer.position()` correctly
- ✅ Files are written with correct sizes
- ✅ Data can be read back at correct positions
- ✅ The 78-byte buffered scenario works perfectly
## Spark Integration Tests: ❌ STILL FAIL
**BUT** the `FSDataOutputStream.getPos()` override **IS** being called in Spark:
```
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 0
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 4
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 22
...
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 190
```
And the EOF error still occurs:
```
position=1275 contentLength=1275 bufRemaining=78
```
## The Mystery
If `getPos()` is:
1. ✅ Implemented correctly (unit tests pass)
2. ✅ Being called by Spark (logs show it)
3. ✅ Returning correct values (logs show reasonable positions)
**Then why does Parquet still think there are 78 bytes to read at position 1275?**
## Possible Explanations
### Theory 1: Parquet footer writing happens AFTER stream close
When the stream closes, it flushes the buffer. If Parquet writes the footer metadata BEFORE the final flush but AFTER getting `getPos()`, the footer could have stale positions.
### Theory 2: Buffer position mismatch at close time
The unit tests show position 1078 with 78 bytes buffered. But when the stream closes and flushes, those 78 bytes get written. If the footer is written based on pre-flush positions, it would be off by 78 bytes.
### Theory 3: Parquet caches getPos() values
Parquet might call `getPos()` once per column chunk and cache the value. If it caches the value BEFORE the buffer is flushed, but uses it AFTER, the offset would be wrong.
### Theory 4: Multiple streams or file copies
Spark might be writing to a temporary file, then copying/moving it. If the metadata from the first write is used but the second file is what's read, sizes would mismatch.
## Next Steps
1. **Add logging to close()** - See exact sequence of operations when stream closes
2. **Add logging to flush()** - See when buffer is actually flushed vs. when getPos() is called
3. **Check Parquet source** - Understand EXACTLY when it calls getPos() vs. when it writes footer
4. **Compare with HDFS** - How does HDFS handle this? Does it have special logic?
## Hypothesis
The most likely scenario is that Parquet's `InternalParquetRecordWriter`:
1. Calls `getPos()` to record column chunk end positions → Gets 1197 (1275 - 78)
2. Continues writing more data (78 bytes) to buffer
3. Closes the stream, which flushes buffer (adds 78 bytes)
4. Final file size: 1275 bytes
5. But footer says last chunk ends at 1197
6. So when reading, it tries to read chunk from [1197, 1275) which is correct
7. BUT it ALSO tries to read [1275, 1353) because it thinks there's MORE data!
**The "78 bytes missing" might actually be "78 bytes DOUBLE-COUNTED"** in the footer metadata!
Loading…
Cancel
Save