docs: comprehensive test results showing unit tests PASS but Spark fails
KEY FINDINGS:
- Unit tests: ALL 3 tests PASS ✅ including exact 78-byte scenario
- getPos() works correctly: returns position + buffer.position()
- FSDataOutputStream override IS being called in Spark
- But EOF exception still occurs at position=1275 trying to read 78 bytes
This proves the bug is NOT in getPos() itself, but in HOW/WHEN Parquet
uses the returned positions.
Hypothesis: Parquet footer has positions recorded BEFORE final flush,
causing a 78-byte offset error in column chunk metadata.
# Test Results Summary

## Unit Tests: ✅ ALL PASS

Created `GetPosBufferTest` with 3 comprehensive tests that specifically target the Parquet EOF issue:

### Test 1: testGetPosWithBufferedData()

✅ **PASSED** - Tests basic `getPos()` behavior with multiple writes and buffer management.

### Test 2: testGetPosWithSmallWrites()

✅ **PASSED** - Simulates Parquet's pattern of many small writes with frequent `getPos()` calls.

### Test 3: testGetPosWithExactly78BytesBuffered()

✅ **PASSED** - The critical test that reproduces the EXACT bug scenario!

**Results**:

```
Position after 1000 bytes + flush: 1000
Position with 78 bytes BUFFERED (not flushed): 1078 ✅
Actual file size: 1078 ✅
Bytes read at position 1000: 78 ✅
SUCCESS: getPos() correctly includes buffered data!
```
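
For reference, here is a minimal, self-contained sketch of the pattern Test 3 exercises. `BufferedPosStream` below is illustrative only, not the real `SeaweedOutputStream`; it just reproduces the `position + buffer.position()` semantics the test verifies:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Illustrative stand-in (NOT the real SeaweedOutputStream): a stream that
// buffers writes and whose getPos() must count the still-buffered bytes.
class BufferedPosStream {
    private final ByteArrayOutputStream backing = new ByteArrayOutputStream();
    private final ByteBuffer buffer = ByteBuffer.allocate(8192);
    private long position = 0; // bytes already flushed to backing storage

    void write(byte[] data) {
        buffer.put(data); // assumes data fits in the buffer, for brevity
    }

    void flush() {
        buffer.flip();
        byte[] chunk = new byte[buffer.remaining()];
        buffer.get(chunk);
        backing.write(chunk, 0, chunk.length);
        position += chunk.length;
        buffer.clear();
    }

    // The behavior under test: flushed bytes PLUS buffered bytes.
    long getPos() {
        return position + buffer.position();
    }

    long fileSize() {
        return backing.size();
    }
}

public class GetPosSketch {
    public static void main(String[] args) {
        BufferedPosStream out = new BufferedPosStream();
        out.write(new byte[1000]);
        out.flush();                 // flushed position = 1000, buffer empty
        out.write(new byte[78]);     // 78 bytes buffered, NOT flushed
        System.out.println("getPos() with 78 bytes buffered: " + out.getPos()); // 1078
        out.flush();
        System.out.println("file size after final flush: " + out.fileSize());  // 1078
    }
}
```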

## Key Finding

**`getPos()` works correctly in unit tests, but Spark tests still fail!**

This proves:

- ✅ `SeaweedOutputStream.getPos()` returns `position + buffer.position()` correctly
- ✅ Files are written with correct sizes
- ✅ Data can be read back at correct positions
- ✅ The 78-byte buffered scenario works perfectly

## Spark Integration Tests: ❌ STILL FAIL

**BUT** the `FSDataOutputStream.getPos()` override **IS** being called in Spark:

```
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 0
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 4
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 22
...
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 190
```
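
The logs suggest the filesystem hands Spark an `FSDataOutputStream` subclass whose `getPos()` delegates to the underlying buffered stream. A hedged sketch of what that wiring could look like (everything except the Hadoop and SLF4J classes is an assumption, not the actual SeaweedFS code):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.LongSupplier;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical wiring sketch, not the actual SeaweedFS code: an
// FSDataOutputStream subclass whose getPos() delegates to the underlying
// buffered stream instead of Hadoop's byte-counting PositionCache.
public class PosDelegatingOutputStream extends FSDataOutputStream {
    private static final Logger LOG =
            LoggerFactory.getLogger(PosDelegatingOutputStream.class);
    private final LongSupplier posSupplier; // e.g. seaweedOut::getPos

    public PosDelegatingOutputStream(OutputStream out, LongSupplier posSupplier)
            throws IOException {
        super(out, null); // null FileSystem.Statistics, for brevity
        this.posSupplier = posSupplier;
    }

    @Override
    public long getPos() {
        long pos = posSupplier.getAsLong(); // flushed position + buffered bytes
        LOG.warn("[DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: {}", pos);
        return pos;
    }
}
```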

And the EOF error still occurs:

```
position=1275 contentLength=1275 bufRemaining=78
```
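
Read literally, the message says a positional read is starting at an offset equal to the file's content length, so there is nothing left to serve. A minimal sketch of the kind of guard that would produce this (variable names mirror the log; the real read path may differ):

```java
import java.io.EOFException;

// Illustrative only: the guard a positional reader would trip when asked
// for 78 more bytes starting exactly at the end of a 1275-byte file.
public class EofCheckSketch {
    public static void main(String[] args) throws EOFException {
        long position = 1275;      // where the reader wants to start
        long contentLength = 1275; // actual file size
        int bufRemaining = 78;     // bytes the reader still expects

        if (position >= contentLength) {
            throw new EOFException("position=" + position
                    + " contentLength=" + contentLength
                    + " bufRemaining=" + bufRemaining);
        }
    }
}
```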

## The Mystery

If `getPos()` is:

1. ✅ Implemented correctly (unit tests pass)
2. ✅ Being called by Spark (logs show it)
3. ✅ Returning correct values (logs show reasonable positions)

**Then why does Parquet still think there are 78 bytes to read at position 1275?**

## Possible Explanations

### Theory 1: Parquet footer writing happens AFTER stream close

When the stream closes, it flushes the buffer. If Parquet writes the footer metadata BEFORE the final flush but AFTER its last `getPos()` call, the footer could have stale positions.

### Theory 2: Buffer position mismatch at close time

The unit tests show position 1078 with 78 bytes buffered. But when the stream closes and flushes, those 78 bytes get written. If the footer is written based on pre-flush positions, it would be off by 78 bytes.

### Theory 3: Parquet caches getPos() values

Parquet might call `getPos()` once per column chunk and cache the value. If it caches the value BEFORE the buffer is flushed but uses it AFTER, the offset would be wrong.
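
To make Theories 2 and 3 concrete, here is a hedged timeline sketch of the stale-position failure mode they describe, using the unit-test numbers (illustrative only, not Parquet's actual code):

```java
// Illustrative timeline (not Parquet's actual code) of the stale-position
// failure mode in Theories 2 and 3, using the unit-test numbers.
public class StaleOffsetTimeline {
    public static void main(String[] args) {
        long flushedBytes = 1000;  // bytes already on storage after a flush
        long bufferedBytes = 78;   // bytes still sitting in the write buffer

        // Correct getPos(): counts the buffered tail (what the unit tests verify).
        long correctPos = flushedBytes + bufferedBytes;  // 1078

        // Failure mode: a consumer records a position that ignores the
        // buffered tail (or reuses a value cached before these writes).
        long stalePos = flushedBytes;                    // 1000

        // close() flushes the buffer; the file ends up 1078 bytes long.
        flushedBytes += bufferedBytes;
        bufferedBytes = 0;

        System.out.println("final file size:  " + flushedBytes);            // 1078
        System.out.println("correct position: " + correctPos);              // 1078
        System.out.println("stale position:   " + stalePos);                // 1000
        System.out.println("offset error:     " + (correctPos - stalePos)); // 78
    }
}
```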

### Theory 4: Multiple streams or file copies

Spark might be writing to a temporary file, then copying/moving it. If the metadata from the first write is used but the second file is what's read, sizes would mismatch.

## Next Steps

1. **Add logging to close()** - See the exact sequence of operations when the stream closes (see the sketch after this list)
2. **Add logging to flush()** - See when the buffer is actually flushed vs. when getPos() is called
3. **Check Parquet source** - Understand EXACTLY when it calls getPos() vs. when it writes the footer
4. **Compare with HDFS** - How does HDFS handle this? Does it have special logic?
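
A hedged sketch of what that instrumentation could look like. The field and helper names are assumptions, not the real `SeaweedOutputStream` members:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical instrumentation sketch: log the flush/close ordering so the
// footer-vs-flush sequence becomes visible in the Spark driver logs.
// Field and helper names are assumptions, not the real SeaweedOutputStream.
class InstrumentedStream {
    private static final Logger LOG = LoggerFactory.getLogger(InstrumentedStream.class);

    private long position = 0;     // bytes already flushed
    private int bufferedBytes = 0; // bytes waiting in the write buffer

    void flush() {
        LOG.warn("[DEBUG-2024] flush(): position={} buffered={} -> flushing",
                position, bufferedBytes);
        position += bufferedBytes; // stand-in for the real buffer drain
        bufferedBytes = 0;
    }

    void close() {
        LOG.warn("[DEBUG-2024] close(): position={} buffered={} (getPos()={})",
                position, bufferedBytes, getPos());
        flush();
        LOG.warn("[DEBUG-2024] close() done: final position={}", position);
    }

    long getPos() {
        return position + bufferedBytes;
    }
}
```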

## Hypothesis

The most likely scenario is that Parquet's `InternalParquetRecordWriter`:

1. Calls `getPos()` to record column chunk end positions → gets 1197 (1275 - 78)
2. Continues writing more data (78 bytes) to the buffer
3. Closes the stream, which flushes the buffer (adds 78 bytes)
4. Final file size: 1275 bytes
5. But the footer says the last chunk ends at 1197
6. So when reading, it tries to read the chunk from [1197, 1275), which is correct
7. BUT it ALSO tries to read [1275, 1353) because it thinks there's MORE data!

**The "78 bytes missing" might actually be "78 bytes DOUBLE-COUNTED" in the footer metadata!**