
Flush-on-getPos() Implementation: Status

Implementation

Added flush-on-getPos() logic to SeaweedOutputStream:

public synchronized long getPos() throws IOException {
    // Flush any buffered bytes first, so the reported position
    // reflects everything written so far
    if (buffer.position() > 0) {
        writeCurrentBufferToService();
    }
    return position; // accurate at the moment of the call, after the flush
}

Test Results

What Works

  1. Flushing is happening: Logs show "FLUSHING buffer (X bytes)" before each getPos() call
  2. Many small flushes: Each getPos() call flushes whatever is in the buffer
  3. File size is correct: FileStatus shows length=1260 bytes ✓
  4. File is written successfully: The parquet file exists and has the correct size

What Still Fails

EOF Exception PERSISTS: EOFException: Reached the end of stream. Still have: 78 bytes left

Root Cause: Deeper Than Expected

The problem is NOT just about getPos() returning stale values. Even with flush-on-getPos():

  1. Parquet writes column chunks → calls getPos() → gets flushed position
  2. Parquet internally records these offsets in memory
  3. Parquet writes more data (dictionary, headers, etc.)
  4. Parquet writes footer containing the RECORDED offsets (from step 2)
  5. Problem: The recorded offsets are snapshots taken at the moment of capture; writes made afterwards are not reflected in them, so the footer no longer matches the final byte layout

The Real Issue: Relative vs. Absolute Offsets

Parquet's write pattern:

Write A (100 bytes) → getPos() returns 100 → Parquet records "A is at offset 100"
Write B (50 bytes)  → getPos() returns 150 → Parquet records "B is at offset 150"
Write dictionary    → No getPos()!
Write footer        → Contains: "A at 100, B at 150"

But the actual file structure is:
[A: 0-100] [B: 100-150] [dict: 150-160] [footer: 160-end]

When reading:
Parquet seeks to offset 100 (expecting A) → But that's where B is!
Result: EOF exception
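The write pattern above can be reproduced with a toy buffered stream (hypothetical classes, sketched here for illustration; not the real SeaweedOutputStream). It shows the core mechanic: offsets captured via getPos() are plain snapshots, and writes made after the capture are invisible to them.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Toy buffered stream: data goes to an in-memory buffer first,
// then is flushed to the "service" (here, another byte array).
class ToyBufferedStream {
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();   // the "service"
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(); // write buffer
    private long position = 0; // bytes flushed so far

    void write(byte[] data) { buffer.write(data, 0, data.length); }

    // flush-on-getPos(), as implemented above
    long getPos() {
        flush();
        return position;
    }

    void flush() {
        byte[] pending = buffer.toByteArray();
        file.write(pending, 0, pending.length);
        position += pending.length;
        buffer.reset();
    }

    long fileLength() { flush(); return file.size(); }
}

public class OffsetSnapshotDemo {
    public static void main(String[] args) {
        ToyBufferedStream out = new ToyBufferedStream();
        List<Long> recorded = new ArrayList<>();

        out.write(new byte[100]);    // column chunk A
        recorded.add(out.getPos()); // snapshot: 100
        out.write(new byte[50]);     // column chunk B
        recorded.add(out.getPos()); // snapshot: 150
        out.write(new byte[10]);     // dictionary -- written with no getPos()!

        // The "footer" would be built from the recorded snapshots, which
        // say nothing about the 10 dictionary bytes written after them.
        System.out.println("recorded = " + recorded);             // [100, 150]
        System.out.println("final length = " + out.fileLength()); // 160
    }
}
```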

Why Flush-on-getPos() Doesn't Help

Even though we flush on getPos(), Parquet:

  1. Records the offset VALUE (e.g., "100")
  2. Writes more data AFTER recording but BEFORE writing footer
  3. Footer contains the recorded values (which are now stale)

The Fundamental Problem

Parquet assumes an unbuffered stream where:

  • getPos() returns the EXACT byte offset in the final file
  • No data will be written between when getPos() is called and when the footer is written

SeaweedFS uses a buffered stream where:

  • Data is written to buffer first, then flushed
  • Multiple operations can happen between getPos() calls
  • Footer metadata itself gets written AFTER Parquet records all offsets
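The unbuffered contract in the first list can be stated as an invariant: after every write, getPos() equals the number of bytes physically written, with no window where buffered data makes the two diverge. A minimal sketch (a hypothetical CountingOutputStream for illustration; not Hadoop's or Guava's class of the same name):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch of the invariant an unbuffered stream upholds:
// getPos() == bytes physically written, at all times.
class CountingOutputStream extends OutputStream {
    private final OutputStream delegate;
    private long position = 0;

    CountingOutputStream(OutputStream delegate) { this.delegate = delegate; }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b); // goes straight through -- no buffer
        position++;        // so the position is always exact
    }

    public long getPos() { return position; }
}

public class ContractDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        CountingOutputStream out = new CountingOutputStream(sink);
        out.write(new byte[100]);
        // The invariant holds after every single write:
        System.out.println(out.getPos() == sink.size()); // true
    }
}
```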

Why This Works in HDFS/S3

They likely use one of these approaches:

  1. Completely unbuffered for Parquet - Every write goes directly to disk
  2. Syncable.hflush() contract - Parquet calls hflush() at key points
  3. Different file format handling - Special case for Parquet writes

Next Steps: Possible Solutions

Option A: Disable Buffering for Parquet

if (path.endsWith(".parquet")) {
    this.bufferSize = 1; // Effectively unbuffered
}

Pros: Guaranteed correct offsets
Cons: Terrible performance

Option B: Implement Syncable.hflush()

Make Parquet call hflush() instead of just flush():

@Override
public void hflush() throws IOException {
    writeCurrentBufferToService();
    flushWrittenBytesToService();
}

Pros: Clean, follows Hadoop contract
Cons: Requires Parquet/Spark to use hflush() (they might not)
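One way to find out what the framework actually calls is to instrument the stream and count flush() vs hflush() invocations. The sketch below uses a local stand-in for the Syncable interface (the real one is org.apache.hadoop.fs.Syncable in hadoop-common); ProbeStream is a hypothetical probe, not production code:

```java
import java.io.IOException;
import java.io.OutputStream;

// Local stand-in for org.apache.hadoop.fs.Syncable (same method names).
interface Syncable {
    void hflush() throws IOException;
    void hsync() throws IOException;
}

// Instrumented stream that counts flush() vs hflush()/hsync() calls;
// wrapping the real output path with something like this would show
// which method Parquet/Spark actually invokes at commit points.
class ProbeStream extends OutputStream implements Syncable {
    int flushCalls = 0, hflushCalls = 0, hsyncCalls = 0;

    @Override public void write(int b) { /* discard; probe only */ }
    @Override public void flush() { flushCalls++; }
    @Override public void hflush() { hflushCalls++; }
    @Override public void hsync() { hsyncCalls++; }
}

public class SyncableProbe {
    public static void main(String[] args) throws IOException {
        ProbeStream s = new ProbeStream();
        s.flush();  // plain flush: no durability/visibility guarantee
        s.hflush(); // Syncable contract: data visible to new readers
        System.out.println("flush=" + s.flushCalls + " hflush=" + s.hflushCalls);
        // prints: flush=1 hflush=1
    }
}
```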

Option C: Post-Process Parquet Files

After writing, re-read and fix the footer offsets:

// After close, update footer with correct offsets

Pros: No performance impact during write
Cons: Complex, fragile
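Any post-processing would start from the Parquet file tail: per the Parquet format, a file ends with the footer (Thrift-encoded FileMetaData), a 4-byte little-endian footer length, and the 4-byte magic "PAR1". A minimal trailer reader (illustrative sketch only; rewriting the offsets would additionally require decoding and re-encoding the Thrift footer, which is why this option is fragile):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Parquet files end with: [footer bytes][4-byte LE footer length]["PAR1"].
public class ParquetTail {
    // Returns the footer length encoded in the last 8 bytes of the file.
    static int footerLength(byte[] file) {
        String magic = new String(file, file.length - 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("not a Parquet file tail");
        }
        return ByteBuffer.wrap(file, file.length - 8, 4)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .getInt();
    }

    public static void main(String[] args) {
        // Synthetic tail: 10 fake footer bytes, length field = 10, then PAR1.
        byte[] tail = ByteBuffer.allocate(18).order(ByteOrder.LITTLE_ENDIAN)
                .put(new byte[10])  // fake footer payload
                .putInt(10)         // footer length, little-endian
                .put("PAR1".getBytes(StandardCharsets.US_ASCII))
                .array();
        System.out.println(footerLength(tail)); // 10
    }
}
```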

Either way, look at the Parquet source code to understand WHEN it writes the footer relative to its getPos() calls. We may be able to intercept at the right moment.

Recommendation

Check if Parquet/Spark uses Syncable.hflush():

  1. Look at Parquet writer source code
  2. Check if it calls hflush() or just flush()
  3. If it uses hflush(), implement it properly
  4. If not, we may need Option A (disable buffering)

Files Modified

  • other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java

    • Added flush in getPos()
    • Changed return to position (after flush)
  • other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java

    • Updated FSDataOutputStream wrappers to handle IOException

Status

  • ✅ Flush-on-getPos() implemented
  • ✅ Flushing is working (logs confirm)
  • ❌ EOF exception persists
  • ⏭️ Need to investigate Parquet's footer writing mechanism

The fix is not complete. The problem is more fundamental than we initially thought.