# Flush-on-getPos() Implementation: Status

## Implementation

Added flush-on-`getPos()` logic to `SeaweedOutputStream`:
```java
public synchronized long getPos() throws IOException {
    // Flush buffer before returning position
    if (buffer.position() > 0) {
        writeCurrentBufferToService();
    }
    return position; // Now accurate after flush
}
```
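As a quick illustration of the intended behavior (a sketch only; the demo method and byte counts are invented for the example, and the `SeaweedOutputStream` is obtained however the caller normally creates it):

```java
import java.io.IOException;
import seaweedfs.client.SeaweedOutputStream;

class GetPosDemo {
    // Illustrative only -- not a test from the repository.
    static void demonstrate(SeaweedOutputStream out) throws IOException {
        byte[] chunk = new byte[100];
        out.write(chunk);        // bytes land in the internal buffer first
        long pos = out.getPos(); // the buffer is flushed before returning,
                                 // so pos accounts for all 100 bytes
                                 // (previously it could return a stale value)
    }
}
```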
## Test Results

### ✅ What Works

- Flushing is happening: logs show "FLUSHING buffer (X bytes)" before each `getPos()` call
- Many small flushes: each `getPos()` call flushes whatever is in the buffer
- File size is correct: `FileStatus` reports length=1260 bytes ✓
- File is written successfully: the Parquet file exists and has the correct size
### ❌ What Still Fails

The EOF exception persists on read: `EOFException: Reached the end of stream. Still have: 78 bytes left`
## Root Cause: Deeper Than Expected

The problem is not just about `getPos()` returning stale values. Even with flush-on-getPos(), the failure sequence is:

1. Parquet writes column chunks → calls `getPos()` → gets the flushed position
2. Parquet records these offsets internally, in memory
3. Parquet writes more data (dictionary pages, headers, etc.)
4. Parquet writes the footer containing the offsets recorded in step 2
5. Problem: the recorded offsets reflect the moment they were captured, but the subsequent writes shift everything
## The Real Issue: Relative vs. Absolute Offsets

Parquet's write pattern:

1. Write A (100 bytes) → `getPos()` returns 100 → Parquet records "A is at offset 100"
2. Write B (50 bytes) → `getPos()` returns 150 → Parquet records "B is at offset 150"
3. Write dictionary → no `getPos()` call!
4. Write footer → contains "A at 100, B at 150"

But the actual file structure is:

`[A: 0-100] [B: 100-150] [dict: 150-160] [footer: 160-end]`

When reading, Parquet seeks to offset 100 (expecting A), but that is where B is. Result: EOF exception.
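The "Still have: N bytes left" failure is what happens when a reader seeks to a recorded offset and then tries to read more bytes than remain in the file. A minimal, self-contained illustration with plain `java.io` (not SeaweedFS or Parquet code; the file name and sizes are arbitrary, chosen so the read comes up 78 bytes short like the error above):

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class EofDemo {
    public static void main(String[] args) throws IOException {
        // A 1260-byte file, mirroring the file size reported above
        Path file = Files.createTempFile("eof-demo", ".bin");
        Files.write(file, new byte[1260]);

        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(1138);                // seek to an offset the footer claims is valid
            byte[] chunk = new byte[200];  // only 122 bytes remain, so the read
            raf.readFully(chunk);          // comes up 78 bytes short
        } catch (EOFException e) {
            System.out.println("Hit EOF before the expected chunk ended: " + e);
        }
    }
}
```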
## Why Flush-on-getPos() Doesn't Help

Even though we flush on `getPos()`, Parquet:

1. Records the offset *value* (e.g., "100")
2. Writes more data *after* recording it but *before* writing the footer
3. Writes a footer containing the recorded values (which are now stale)
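A compressed sketch of that pattern (illustrative only; the class and method names are invented here and this is not Parquet's writer code):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;

// Mimics the sequence described above: offsets are captured via getPos()
// and held in memory until the footer is written at the end.
class OffsetRecordingWriter {
    private final FSDataOutputStream out;
    private final List<Long> chunkOffsets = new ArrayList<>();

    OffsetRecordingWriter(FSDataOutputStream out) {
        this.out = out;
    }

    void writeChunk(byte[] chunk) throws IOException {
        out.write(chunk);
        chunkOffsets.add(out.getPos()); // value recorded now, used much later
    }

    void finish(byte[] dictionary) throws IOException {
        out.write(dictionary);             // more bytes written after the offsets were captured
        for (long offset : chunkOffsets) { // the "footer": serializes the values recorded
            out.writeLong(offset);         // earlier, not the positions as they stand now
        }
        out.close();
    }
}
```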
## The Fundamental Problem

Parquet assumes an unbuffered stream where:

- `getPos()` returns the exact byte offset in the final file
- No data is written between the `getPos()` call and the footer write

SeaweedFS uses a buffered stream where:

- Data is written to a buffer first, then flushed
- Multiple operations can happen between `getPos()` calls
- The footer metadata itself is written after Parquet has recorded all offsets
## Why This Works in HDFS/S3

They likely use one of these approaches:

- Completely unbuffered writes for Parquet: every write goes directly to storage
- The `Syncable.hflush()` contract: Parquet calls `hflush()` at key points (see the interface sketch below)
- Different file format handling: a special case for Parquet writes
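For reference, the `Syncable` contract mentioned in the second bullet looks roughly like this (paraphrased; consult `org.apache.hadoop.fs.Syncable` in the Hadoop version on the classpath for the authoritative definition):

```java
import java.io.IOException;

// Paraphrased sketch of org.apache.hadoop.fs.Syncable -- not the actual source.
public interface Syncable {
    /** Flush data out of the client's buffers to the service. */
    void hflush() throws IOException;

    /** Like hflush(), but also asks the service to persist the data durably. */
    void hsync() throws IOException;
}
```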
## Next Steps: Possible Solutions

### Option A: Disable Buffering for Parquet

```java
if (path.endsWith(".parquet")) {
    this.bufferSize = 1; // Effectively unbuffered
}
```

**Pros:** guaranteed correct offsets
**Cons:** terrible performance
### Option B: Implement Syncable.hflush()

Have the stream implement `Syncable` so Parquet can call `hflush()` instead of a plain `flush()`:

```java
@Override
public void hflush() throws IOException {
    writeCurrentBufferToService();
    flushWrittenBytesToService();
}
```

**Pros:** clean, follows the Hadoop contract
**Cons:** requires Parquet/Spark to actually call `hflush()` (they might not)
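If the hdfs3 stream implemented `Syncable`, the call would travel through the `FSDataOutputStream` returned by `SeaweedFileSystem.create()`: Hadoop's `FSDataOutputStream` delegates `hflush()` to the wrapped stream when that stream implements `Syncable`, and otherwise falls back to a plain `flush()`. A caller-side sketch of where the call would be exercised (the data being written is arbitrary):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class HflushSketch {
    // Sketch only: shows where hflush() would be exercised if Option B were adopted.
    static void writeWithHflush(FileSystem fs, Path path, byte[] data) throws IOException {
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write(data);
            out.hflush(); // reaches the SeaweedFS stream only if it implements Syncable
        }
    }
}
```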
### Option C: Post-Process Parquet Files

After writing, re-read the file and fix the footer offsets:

`// After close, update footer with correct offsets`

**Pros:** no performance impact during the write
**Cons:** complex, fragile

### Option D: Investigate Parquet Footer Writing

Read the Parquet source to understand when it writes the footer relative to its `getPos()` calls; maybe we can intercept at the right moment.
## Recommendation

Check whether Parquet/Spark uses `Syncable.hflush()`:

- Look at the Parquet writer source code
- Check whether it calls `hflush()` or just `flush()`
- If it uses `hflush()`, implement it properly (Option B)
- If not, we may need Option A (disable buffering)
## Files Modified

- `other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java`
  - Added the flush in `getPos()`
  - Changed the return value to `position` (after the flush)
- `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java`
  - Updated the FSDataOutputStream wrappers to handle `IOException`
## Status

- ✅ Flush-on-getPos() implemented
- ✅ Flushing is working (logs confirm it)
- ❌ EOF exception persists
- ⏭️ Need to investigate Parquet's footer-writing mechanism

The fix is not complete; the problem is more fundamental than we initially thought.