From 50a8a3eb11af29ab97f6d4f9424c19ac2e14d293 Mon Sep 17 00:00:00 2001
From: chrislu
Date: Mon, 24 Nov 2025 00:35:11 -0800
Subject: [PATCH] docs: comprehensive test results showing unit tests PASS but Spark fails
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

KEY FINDINGS:
- Unit tests: ALL 3 tests PASS ✅ including exact 78-byte scenario
- getPos() works correctly: returns position + buffer.position()
- FSDataOutputStream override IS being called in Spark
- But EOF exception still occurs at position=1275 trying to read 78 bytes

This proves the bug is NOT in getPos() itself, but in HOW/WHEN Parquet uses the returned positions.

Hypothesis: Parquet footer has positions recorded BEFORE final flush, causing a 78-byte offset error in column chunk metadata.
---
 test/java/spark/TEST_RESULTS_SUMMARY.md | 93 +++++++++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 test/java/spark/TEST_RESULTS_SUMMARY.md

diff --git a/test/java/spark/TEST_RESULTS_SUMMARY.md b/test/java/spark/TEST_RESULTS_SUMMARY.md
new file mode 100644
index 000000000..a2373b421
--- /dev/null
+++ b/test/java/spark/TEST_RESULTS_SUMMARY.md
@@ -0,0 +1,93 @@
+# Test Results Summary
+
+## Unit Tests: ✅ ALL PASS
+
+Created `GetPosBufferTest` with 3 comprehensive tests that specifically target the Parquet EOF issue:
+
+### Test 1: testGetPosWithBufferedData()
+✅ **PASSED** - Tests basic `getPos()` behavior with multiple writes and buffer management.
+
+### Test 2: testGetPosWithSmallWrites()
+✅ **PASSED** - Simulates Parquet's pattern of many small writes with frequent `getPos()` calls.
+
+### Test 3: testGetPosWithExactly78BytesBuffered()
+✅ **PASSED** - The critical test that reproduces the EXACT bug scenario!
+
+**Results**:
+```
+Position after 1000 bytes + flush: 1000
+Position with 78 bytes BUFFERED (not flushed): 1078 ✅
+Actual file size: 1078 ✅
+Bytes read at position 1000: 78 ✅
+SUCCESS: getPos() correctly includes buffered data!
+```
+
+## Key Finding
+
+**`getPos()` works correctly in unit tests but Spark tests still fail!**
+
+This proves:
+- ✅ `SeaweedOutputStream.getPos()` returns `position + buffer.position()` correctly
+- ✅ Files are written with correct sizes
+- ✅ Data can be read back at correct positions
+- ✅ The 78-byte buffered scenario works perfectly
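+
+For reference, the behavior these tests verify can be sketched as follows. This is a simplified illustration, not the actual `SeaweedOutputStream` code; the class name is invented, and the only point is that `getPos()` must report flushed bytes plus the bytes still sitting in the write buffer.
+
+```java
+import java.nio.ByteBuffer;
+
+// Minimal sketch (NOT the real SeaweedOutputStream): getPos() counts
+// bytes already flushed PLUS bytes still waiting in the write buffer.
+public class GetPosSketch {
+    private long flushedPosition = 0;                                 // bytes already flushed to the file
+    private final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);  // pending, unflushed bytes
+
+    public void write(byte[] data) {
+        buffer.put(data);                                             // buffered, not yet on disk
+    }
+
+    public void flush() {
+        flushedPosition += buffer.position();                         // pretend the buffer hits the file
+        buffer.clear();
+    }
+
+    public long getPos() {
+        return flushedPosition + buffer.position();                   // flushed + buffered
+    }
+
+    public static void main(String[] args) {
+        GetPosSketch out = new GetPosSketch();
+        out.write(new byte[1000]);
+        out.flush();                                                  // 1000 bytes "on disk"
+        out.write(new byte[78]);                                      // 78 bytes still buffered
+        System.out.println(out.getPos());                             // prints 1078, matching Test 3
+    }
+}
+```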
+
+## Spark Integration Tests: ❌ STILL FAIL
+
+**BUT** the `FSDataOutputStream.getPos()` override **IS** being called in Spark:
+```
+25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 0
+25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 4
+25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 22
+...
+25/11/24 08:18:56 WARN SeaweedFileSystem: [DEBUG-2024] FSDataOutputStream.getPos() override called! Returning: 190
+```
+
+And the EOF error still occurs:
+```
+position=1275 contentLength=1275 bufRemaining=78
+```
+
+## The Mystery
+
+If `getPos()` is:
+1. ✅ Implemented correctly (unit tests pass)
+2. ✅ Being called by Spark (logs show it)
+3. ✅ Returning correct values (logs show reasonable positions)
+
+**Then why does Parquet still think there are 78 bytes to read at position 1275?**
+
+## Possible Explanations
+
+### Theory 1: Parquet footer writing happens AFTER stream close
+When the stream closes, it flushes the buffer. If Parquet writes the footer metadata BEFORE the final flush but AFTER getting `getPos()`, the footer could contain stale positions.
+
+### Theory 2: Buffer position mismatch at close time
+The unit tests show position 1078 with 78 bytes buffered. But when the stream closes and flushes, those 78 bytes get written. If the footer is written based on pre-flush positions, it would be off by 78 bytes.
+
+### Theory 3: Parquet caches getPos() values
+Parquet might call `getPos()` once per column chunk and cache the value. If it caches the value BEFORE the buffer is flushed but uses it AFTER, the offset would be wrong.
+
+### Theory 4: Multiple streams or file copies
+Spark might be writing to a temporary file, then copying/moving it. If the metadata from the first write is used but the second file is what gets read, the sizes would mismatch.
+
+## Next Steps
+
+1. **Add logging to close()** - See the exact sequence of operations when the stream closes
+2. **Add logging to flush()** - See when the buffer is actually flushed vs. when `getPos()` is called
+3. **Check Parquet source** - Understand EXACTLY when it calls `getPos()` vs. when it writes the footer
+4. **Compare with HDFS** - How does HDFS handle this? Does it have special logic?
+
+## Hypothesis
+
+The most likely scenario is that Parquet's `InternalParquetRecordWriter`:
+1. Calls `getPos()` to record column chunk end positions → gets 1197 (1275 - 78)
+2. Continues writing more data (78 bytes) to the buffer
+3. Closes the stream, which flushes the buffer (adds 78 bytes)
+4. Final file size: 1275 bytes
+5. But the footer says the last chunk ends at 1197
+6. So when reading, it tries to read the chunk from [1197, 1275), which is correct
+7. BUT it ALSO tries to read [1275, 1353) because it thinks there's MORE data!
+
+**The "78 bytes missing" might actually be "78 bytes DOUBLE-COUNTED"** in the footer metadata!
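+
+As a sanity check on the arithmetic, here is a tiny simulation of one way to read that double-counting idea. The class and variable names are invented for this sketch and are not Parquet's real metadata model; only the numbers come from the observed failure (`position=1275 contentLength=1275 bufRemaining=78`).
+
+```java
+// Back-of-the-envelope illustration of the double-counting hypothesis.
+// The "footer" model here is invented for the sketch, NOT Parquet's layout.
+public class DoubleCountSketch {
+    public static void main(String[] args) {
+        long fileLength = 1275;          // contentLength reported in the EOF error
+        long bufferedBytes = 78;         // bytes sitting in the buffer when the offset was taken
+        long chunkLength = 78;           // length of the last column chunk
+
+        // If the recorded chunk offset already includes the 78 buffered bytes...
+        long recordedChunkStart = 1197 + bufferedBytes;   // = 1275
+
+        // ...the reader then adds the chunk length on top of it:
+        long readFrom = recordedChunkStart;                // 1275
+        long readTo = recordedChunkStart + chunkLength;    // 1353
+
+        System.out.printf("file length    : %d%n", fileLength);
+        System.out.printf("attempted read : [%d, %d)%n", readFrom, readTo);
+        System.out.printf("past EOF by    : %d bytes%n", readTo - fileLength); // 78 -> EOFException
+    }
+}
+```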