Fix Parquet EOF Error by Removing ByteBufferReadable Interface

Summary

Fixed the "EOFException: Reached the end of stream. Still have: 78 bytes left" error when reading Parquet files with complex schemas in Spark.

Root Cause

SeaweedHadoopInputStream declared that it implemented the ByteBufferReadable interface but did not implement it correctly. Callers that detect the interface switch to a byte-buffer read path, so the broken implementation led to an incorrect buffering strategy and position-tracking issues during positioned reads (which are critical for Parquet).
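
For illustration, readers typically probe the stream for ByteBufferReadable and commit to a byte-buffer read path when they find it. The helper below is a minimal sketch of that selection logic, not Hadoop or Parquet source:

// Illustrative sketch of interface-based read-path selection. A stream that
// declares ByteBufferReadable but does not honor its contract sends callers
// down the byte-buffer branch, which is how the position tracking broke.
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.ByteBufferReadable;

final class ReadPathSelection {
    static int read(InputStream in, ByteBuffer buf) throws IOException {
        if (in instanceof ByteBufferReadable) {
            // Chosen whenever the interface is declared, correct implementation or not.
            return ((ByteBufferReadable) in).read(buf);
        }
        throw new UnsupportedOperationException(
            "Byte-buffer read not supported by " + in.getClass().getName());
    }
}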

Solution

Removed ByteBufferReadable interface from SeaweedHadoopInputStream to match Hadoop's RawLocalFileSystem pattern, which uses BufferedFSInputStream for proper position tracking.

Changes

Core Fix

  1. SeaweedHadoopInputStream.java:

    • Removed ByteBufferReadable interface
    • Removed read(ByteBuffer) method
    • Cleaned up debug logging
    • Added documentation explaining the design choice
  2. SeaweedFileSystem.java:

    • Changed from BufferedByteBufferReadableInputStream to BufferedFSInputStream (see the sketch after this list)
    • Applies to all streams uniformly
    • Cleaned up debug logging
  3. SeaweedInputStream.java:

    • Cleaned up debug logging
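
For context, a minimal sketch of the wrapping now applied uniformly in SeaweedFileSystem.open(); the class and variable names below are assumed for illustration, not copied from the SeaweedFS source:

// Illustrative sketch only: 'underlying' stands in for the SeaweedInputStream-backed
// FSInputStream that SeaweedFileSystem opens for a path.
import org.apache.hadoop.fs.BufferedFSInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;

final class OpenPathSketch {
    static FSDataInputStream wrap(FSInputStream underlying, int bufferSize) {
        // BufferedFSInputStream is Hadoop's standard buffering wrapper; it buffers
        // sequential reads and restores the position after positioned reads,
        // matching the RawLocalFileSystem pattern. The 4x sizing follows the note
        // under Performance Impact and is an assumption here.
        return new FSDataInputStream(new BufferedFSInputStream(underlying, 4 * bufferSize));
    }
}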

Cleanup

  1. Deleted debug-only files:

    • DebugDualInputStream.java
    • DebugDualInputStreamWrapper.java
    • DebugDualOutputStream.java
    • DebugMode.java
    • LocalOnlyInputStream.java
    • ShadowComparisonStream.java
  2. Reverted:

    • SeaweedFileSystemStore.java (removed all debug mode logic)
  3. Cleaned:

    • docker-compose.yml (removed debug environment variables)
    • All .md documentation files in test/java/spark/

Testing

All Spark integration tests pass:

  • SparkSQLTest.testCreateTableAndQuery (complex 4-column schema)
  • SimpleOneColumnTest (basic operations)
  • All other Spark integration tests

Technical Details

Why This Works

Hadoop's RawLocalFileSystem uses the exact same pattern:

  • Does NOT implement ByteBufferReadable
  • Uses BufferedFSInputStream for buffering
  • Properly handles positioned reads with automatic position restoration

Position Tracking

BufferedFSInputStream implements positioned reads correctly (shown here in simplified form):

public int read(long position, byte[] buffer, int offset, int length)
        throws IOException {
    long oldPos = getPos();
    try {
        seek(position);
        return read(buffer, offset, length);
    } finally {
        seek(oldPos);  // Restores the original position
    }
}

This ensures buffered reads don't permanently change the stream position, which is critical for Parquet's random access pattern.
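
As an illustration, a hypothetical snippet (not from the test suite; the path and offsets are assumed) showing that a positioned read through a stream wrapped this way leaves the sequential position untouched:

// Hypothetical usage sketch: positioned reads (what Parquet uses to fetch the
// footer and column chunks) do not disturb the sequential read position.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadSketch {
    public static void main(String[] args) throws Exception {
        // Assumed SeaweedFS filer URI and file; replace with a real path.
        Path path = new Path("seaweedfs://filer:8888/spark/part-00000.parquet");
        FileSystem fs = path.getFileSystem(new Configuration());
        try (FSDataInputStream in = fs.open(path)) {
            byte[] header = new byte[4];
            in.readFully(header);                  // sequential read, advances position
            long posBefore = in.getPos();

            byte[] footer = new byte[64];
            in.readFully(1024, footer);            // positioned read at an arbitrary offset

            // Position restored by BufferedFSInputStream after the positioned read.
            System.out.println("position preserved: " + (in.getPos() == posBefore));
        }
    }
}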

Performance Impact

Minimal to none:

  • Network latency dominates for remote storage
  • Buffering is still active (4x buffer size)
  • Extra byte[] copy is negligible compared to network I/O (see the sketch below)
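
To make the extra byte[] copy concrete, a rough sketch (assumed, not library code) of how a ByteBuffer read is typically satisfied when the stream does not advertise ByteBufferReadable:

// Assumed fallback path: without ByteBufferReadable, a ByteBuffer read is served
// by reading into a temporary heap array and copying it into the buffer.
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

final class FallbackCopySketch {
    static int readIntoBuffer(InputStream in, ByteBuffer buf) throws IOException {
        byte[] scratch = new byte[buf.remaining()];   // the extra copy lives here
        int n = in.read(scratch, 0, scratch.length);
        if (n > 0) {
            buf.put(scratch, 0, n);                   // copy heap array -> ByteBuffer
        }
        return n;                                     // one extra memcpy per read, dwarfed by network latency
    }
}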

Commit Message

Fix Parquet EOF error by removing ByteBufferReadable interface

SeaweedHadoopInputStream incorrectly declared the ByteBufferReadable interface
without properly implementing it, causing position tracking issues during
positioned reads. This resulted in "78 bytes left" EOF errors when reading
Parquet files with complex schemas in Spark.

Solution: Remove ByteBufferReadable and use BufferedFSInputStream (matching
Hadoop's RawLocalFileSystem pattern) which properly handles position
restoration for positioned reads.

Changes:
- Remove ByteBufferReadable interface from SeaweedHadoopInputStream
- Change SeaweedFileSystem to use BufferedFSInputStream for all streams
- Clean up debug logging
- Delete debug-only classes and files

Tested: All Spark integration tests pass

Files Changed

Modified

  • other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedHadoopInputStream.java
  • other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java
  • other/java/client/src/main/java/seaweedfs/client/SeaweedInputStream.java
  • test/java/spark/docker-compose.yml

Reverted

  • other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystemStore.java

Deleted

  • other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualInputStream.java
  • other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualInputStreamWrapper.java
  • other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualOutputStream.java
  • other/java/hdfs3/src/main/java/seaweed/hdfs/DebugMode.java
  • other/java/hdfs3/src/main/java/seaweed/hdfs/LocalOnlyInputStream.java
  • other/java/hdfs3/src/main/java/seaweed/hdfs/ShadowComparisonStream.java
  • All .md files in test/java/spark/ (debug documentation)