docs: push instructions for Parquet EOF fix

pull/7526/head
chrislu 1 week ago
parent
commit
58d4d61f89
  1. 67
      test/java/spark/READY_TO_PUSH.md

@ -0,0 +1,67 @@
# Ready to Push: Parquet EOF Fix
## Summary
Identified the cause of the persistent Parquet `EOFException` (a 78-byte position discrepancy) and committed a fix. Verification in CI is still pending.
## Root Cause
**Hadoop's `FSDataOutputStream` was not calling `SeaweedOutputStream.getPos()`**
- FSDataOutputStream tracks position with an internal counter
- When Parquet calls `getPos()` to record column chunk offsets, it gets Hadoop's counter
- But SeaweedOutputStream has its own position tracking (`position + buffer.position()`)
- Result: Footer metadata has wrong offsets → EOF error when reading
## The Fix
**File**: `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java`
Override `FSDataOutputStream.getPos()` to delegate to our stream's accurate position tracking.
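The delegation pattern can be sketched without Hadoop on the classpath. The classes below are hypothetical stand-ins, not the real `FSDataOutputStream` or `SeaweedOutputStream`: `CountingWrapper` answers `getPos()` from its own counter, and `BufferingStream` reports flushed bytes plus buffered bytes. The five bytes written directly to the inner stream are only a contrived way to make the two counters diverge. How the actual 78-byte gap arose is not reproduced here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class GetPosDelegationSketch {

    // Hypothetical stand-in for a buffering stream such as SeaweedOutputStream:
    // its position is the flushed byte count plus bytes still held in the buffer.
    static class BufferingStream extends OutputStream {
        private final ByteArrayOutputStream flushed = new ByteArrayOutputStream();
        private final byte[] buffer = new byte[4];
        private int buffered = 0;

        @Override
        public void write(int b) {
            if (buffered == buffer.length) {
                flushed.write(buffer, 0, buffered);
                buffered = 0;
            }
            buffer[buffered++] = (byte) b;
        }

        long getPos() {
            return flushed.size() + buffered;
        }
    }

    // Hypothetical stand-in for a wrapper that answers getPos() from its own counter.
    static class CountingWrapper extends OutputStream {
        private final OutputStream out;
        private long count = 0;

        CountingWrapper(OutputStream out) {
            this.out = out;
        }

        @Override
        public void write(int b) throws IOException {
            out.write(b);
            count++;
        }

        // The wrapper's own view of the position.
        final long countedPos() {
            return count;
        }

        // Default behaviour: report the wrapper's counter.
        long getPos() {
            return count;
        }
    }

    // Returns {wrapper counter, position reported after the override}.
    static long[] run() throws IOException {
        BufferingStream inner = new BufferingStream();
        // Contrived divergence: bytes the wrapper never sees.
        for (int i = 0; i < 5; i++) {
            inner.write(i);
        }

        // The fix pattern: override getPos() so it asks the inner stream.
        CountingWrapper wrapper = new CountingWrapper(inner) {
            @Override
            long getPos() {
                return inner.getPos();
            }
        };
        for (int i = 0; i < 3; i++) {
            wrapper.write(i);
        }
        return new long[] { wrapper.countedPos(), wrapper.getPos() };
    }

    public static void main(String[] args) throws IOException {
        long[] r = run();
        System.out.println("wrapper counter:     " + r[0]);
        System.out.println("delegated getPos(): " + r[1]);
    }
}
```

In this toy setup the wrapper's counter reads 3 while the delegated `getPos()` reads 8. A caller that records offsets from the counter would write positions that do not match where the bytes actually sit in the inner stream.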
## Commits Ready to Push
```text
90aa83dbe docs: add detailed analysis of Parquet EOF fix
9e7ed4868 fix: Override FSDataOutputStream.getPos() to use SeaweedOutputStream position
a8491ecd3 Update SeaweedOutputStream.java
16bd11812 fix: don't split chunk ID on comma - comma is PART of the ID!
a1fa94922 feat: extract chunk IDs from write log and download from volume
```
## To Push
```bash
cd /Users/chrislu/go/src/github.com/seaweedfs/seaweedfs
git push origin java-client-replication-configuration
```
## Expected Results
After GitHub Actions runs:
1. **`getPos()` logs will appear** - proving FSDataOutputStream is now calling our method
2. **No more EOFException** - Parquet footer will have correct offsets
3. **All Spark tests should pass** - the 78-byte discrepancy is fixed
## Documentation
- **Detailed analysis**: `test/java/spark/PARQUET_EOF_FIX.md`
- **Previous changes**: `test/java/spark/PUSH_SUMMARY.md`
- **Parquet upgrade**: `test/java/spark/PARQUET_UPGRADE.md`
## Next Steps
1. Push the commits (you'll need to authenticate)
2. Monitor GitHub Actions: https://github.com/seaweedfs/seaweedfs/actions
3. Look for `"[DEBUG-2024] getPos() called"` in logs (proves the fix works)
4. Verify tests pass without EOFException
## Key Insight
This bug existed because we assumed Hadoop would automatically use our `getPos()` method.
In reality, `FSDataOutputStream` answers `getPos()` from its own counter unless `getPos()` is explicitly overridden on the `FSDataOutputStream` instance.
The fix is small but critical: without it, a file system whose output stream tracks its own buffered position can report
mismatched positions when used with Hadoop's `FSDataOutputStream`.