|
|
@ -0,0 +1,93 @@ |
|
|
|
|
|
# Test Results Summary |
|
|
|
|
|
|
|
|
|
|
|
## Unit Tests: ✅ ALL PASS |
|
|
|
|
|
|
|
|
|
|
|
Created `GetPosBufferTest` with 3 comprehensive tests that specifically target the Parquet EOF issue: |
|
|
|
|
|
|
|
|
|
|
|
### Test 1: testGetPosWithBufferedData() |
|
|
|
|
|
✅ **PASSED** - Tests basic `getPos()` behavior with multiple writes and buffer management. |
|
|
|
|
|
|
|
|
|
|
|
### Test 2: testGetPosWithSmallWrites() |
|
|
|
|
|
✅ **PASSED** - Simulates Parquet's pattern of many small writes with frequent `getPos()` calls. |
|
|
|
|
|
|
|
|
|
|
|
### Test 3: testGetPosWithExactly78BytesBuffered() |
|
|
|
|
|
✅ **PASSED** - The critical test that reproduces the EXACT bug scenario! |
|
|
|
|
|
|
|
|
|
|
|
**Results**: |
|
|
|
|
|
``` |
|
|
|
|
|
Position after 1000 bytes + flush: 1000 |
|
|
|
|
|
Position with 78 bytes BUFFERED (not flushed): 1078 ✅ |
|
|
|
|
|
Actual file size: 1078 ✅ |
|
|
|
|
|
Bytes read at position 1000: 78 ✅ |
|
|
|
|
|
SUCCESS: getPos() correctly includes buffered data! |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
## Key Finding |
|
|
|
|
|
|
|
|
|
|
|
**`getPos()` works correctly in unit tests but Spark tests still fail!** |
|
|
|
|
|
|
|
|
|
|
|
This proves: |
|
|
|
|
|
- ✅ `SeaweedOutputStream.getPos()` returns `position + buffer.position()` correctly |
|
|
|
|
|
- ✅ Files are written with correct sizes |
|
|
|
|
|
- ✅ Data can be read back at correct positions |
|
|
|
|
|
- ✅ The 78-byte buffered scenario works perfectly |
|
|
|
|
|
|
|
|
|
|
|
## Spark Integration Tests: ❌ STILL FAIL |
|
|
|
|
|
|
|
|
|
|
|
**BUT** the `FSDataOutputStream.getPos()` override **IS** being called in Spark: |
|
|
|
|
|
``` |
|
|
|
|
|
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 0 |
|
|
|
|
|
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 4 |
|
|
|
|
|
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 22 |
|
|
|
|
|
... |
|
|
|
|
|
25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 190 |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
And the EOF error still occurs: |
|
|
|
|
|
``` |
|
|
|
|
|
position=1275 contentLength=1275 bufRemaining=78 |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
## The Mystery |
|
|
|
|
|
|
|
|
|
|
|
If `getPos()` is: |
|
|
|
|
|
1. ✅ Implemented correctly (unit tests pass) |
|
|
|
|
|
2. ✅ Being called by Spark (logs show it) |
|
|
|
|
|
3. ✅ Returning correct values (logs show reasonable positions) |
|
|
|
|
|
|
|
|
|
|
|
**Then why does Parquet still think there are 78 bytes to read at position 1275?** |
|
|
|
|
|
|
|
|
|
|
|
## Possible Explanations |
|
|
|
|
|
|
|
|
|
|
|
### Theory 1: Parquet footer writing happens AFTER stream close |
|
|
|
|
|
When the stream closes, it flushes the buffer. If Parquet writes the footer metadata BEFORE the final flush but AFTER getting `getPos()`, the footer could have stale positions. |
|
|
|
|
|
|
|
|
|
|
|
### Theory 2: Buffer position mismatch at close time |
|
|
|
|
|
The unit tests show position 1078 with 78 bytes buffered. But when the stream closes and flushes, those 78 bytes get written. If the footer is written based on pre-flush positions, it would be off by 78 bytes. |
|
|
|
|
|
|
|
|
|
|
|
### Theory 3: Parquet caches getPos() values |
|
|
|
|
|
Parquet might call `getPos()` once per column chunk and cache the value. If it caches the value BEFORE the buffer is flushed, but uses it AFTER, the offset would be wrong. |
|
|
|
|
|
|
|
|
|
|
|
### Theory 4: Multiple streams or file copies |
|
|
|
|
|
Spark might be writing to a temporary file, then copying/moving it. If the metadata from the first write is used but the second file is what's read, sizes would mismatch. |
|
|
|
|
|
|
|
|
|
|
|
## Next Steps |
|
|
|
|
|
|
|
|
|
|
|
1. **Add logging to close()** - See exact sequence of operations when stream closes |
|
|
|
|
|
2. **Add logging to flush()** - See when buffer is actually flushed vs. when getPos() is called |
|
|
|
|
|
3. **Check Parquet source** - Understand EXACTLY when it calls getPos() vs. when it writes footer |
|
|
|
|
|
4. **Compare with HDFS** - How does HDFS handle this? Does it have special logic? |
|
|
|
|
|
|
|
|
|
|
|
## Hypothesis |
|
|
|
|
|
|
|
|
|
|
|
The most likely scenario is that Parquet's `InternalParquetRecordWriter`: |
|
|
|
|
|
1. Calls `getPos()` to record column chunk end positions → Gets 1197 (1275 - 78) |
|
|
|
|
|
2. Continues writing more data (78 bytes) to buffer |
|
|
|
|
|
3. Closes the stream, which flushes buffer (adds 78 bytes) |
|
|
|
|
|
4. Final file size: 1275 bytes |
|
|
|
|
|
5. But footer says last chunk ends at 1197 |
|
|
|
|
|
6. So when reading, it tries to read chunk from [1197, 1275) which is correct |
|
|
|
|
|
7. BUT it ALSO tries to read [1275, 1353) because it thinks there's MORE data! |
|
|
|
|
|
|
|
|
|
|
|
**The "78 bytes missing" might actually be "78 bytes DOUBLE-COUNTED"** in the footer metadata! |
|
|
|
|
|
|