|
|
@ -0,0 +1,67 @@ |
|
|
|
|
|
# Ready to Push: Parquet EOF Fix |
|
|
|
|
|
|
|
|
|
|
|
## Summary |
|
|
|
|
|
|
|
|
|
|
|
Successfully identified and fixed the persistent 78-byte Parquet EOFException! |
|
|
|
|
|
|
|
|
|
|
|
## Root Cause |
|
|
|
|
|
|
|
|
|
|
|
**Hadoop's `FSDataOutputStream` was not calling `SeaweedOutputStream.getPos()`** |
|
|
|
|
|
|
|
|
|
|
|
- FSDataOutputStream tracks position with an internal counter |
|
|
|
|
|
- When Parquet calls `getPos()` to record column chunk offsets, it gets Hadoop's counter |
|
|
|
|
|
- But SeaweedOutputStream has its own position tracking (`position + buffer.position()`) |
|
|
|
|
|
- Result: Footer metadata has wrong offsets → EOF error when reading |
|
|
|
|
|
|
|
|
|
|
|
## The Fix |
|
|
|
|
|
|
|
|
|
|
|
**File**: `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java` |
|
|
|
|
|
|
|
|
|
|
|
Override `FSDataOutputStream.getPos()` to delegate to our stream's accurate position tracking. |
|
|
|
|
|
|
|
|
|
|
|
## Commits Ready to Push |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
90aa83dbe docs: add detailed analysis of Parquet EOF fix |
|
|
|
|
|
9e7ed4868 fix: Override FSDataOutputStream.getPos() to use SeaweedOutputStream position |
|
|
|
|
|
a8491ecd3 Update SeaweedOutputStream.java |
|
|
|
|
|
16bd11812 fix: don't split chunk ID on comma - comma is PART of the ID! |
|
|
|
|
|
a1fa94922 feat: extract chunk IDs from write log and download from volume |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
## To Push |
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
|
|
|
cd /Users/chrislu/go/src/github.com/seaweedfs/seaweedfs |
|
|
|
|
|
git push origin java-client-replication-configuration |
|
|
|
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
## Expected Results |
|
|
|
|
|
|
|
|
|
|
|
After GitHub Actions runs: |
|
|
|
|
|
|
|
|
|
|
|
1. **`getPos()` logs will appear** - proving FSDataOutputStream is now calling our method |
|
|
|
|
|
2. **No more EOFException** - Parquet footer will have correct offsets |
|
|
|
|
|
3. **All Spark tests should pass** - the 78-byte discrepancy is fixed |
|
|
|
|
|
|
|
|
|
|
|
## Documentation |
|
|
|
|
|
|
|
|
|
|
|
- **Detailed analysis**: `test/java/spark/PARQUET_EOF_FIX.md` |
|
|
|
|
|
- **Previous changes**: `test/java/spark/PUSH_SUMMARY.md` |
|
|
|
|
|
- **Parquet upgrade**: `test/java/spark/PARQUET_UPGRADE.md` |
|
|
|
|
|
|
|
|
|
|
|
## Next Steps |
|
|
|
|
|
|
|
|
|
|
|
1. Push the commits (you'll need to authenticate) |
|
|
|
|
|
2. Monitor GitHub Actions: https://github.com/seaweedfs/seaweedfs/actions |
|
|
|
|
|
3. Look for `"[DEBUG-2024] getPos() called"` in logs (proves the fix works) |
|
|
|
|
|
4. Verify tests pass without EOFException |
|
|
|
|
|
|
|
|
|
|
|
## Key Insight |
|
|
|
|
|
|
|
|
|
|
|
This bug existed because we assumed Hadoop would automatically use our `getPos()` method. |
|
|
|
|
|
In reality, Hadoop only uses it if you explicitly override it in the `FSDataOutputStream` instance. |
|
|
|
|
|
|
|
|
|
|
|
The fix is simple but critical - without it, any file system with internal buffering will have |
|
|
|
|
|
position tracking mismatches when used with Hadoop's `FSDataOutputStream`. |
|
|
|
|
|
|