# Fix Parquet EOF Error by Removing ByteBufferReadable Interface
## Summary

Fixed the `EOFException: Reached the end of stream. Still have: 78 bytes left` error when reading Parquet files with complex schemas in Spark.
## Root Cause

`SeaweedHadoopInputStream` declared that it implemented the `ByteBufferReadable` interface but did not implement it properly. Callers that trust the declared capability then choose an incorrect buffering strategy, and position tracking breaks during positioned reads, which Parquet depends on for random access.
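To see why an advertised-but-broken `ByteBufferReadable` is harmful, consider how Hadoop-based readers pick a read path. The sketch below is illustrative only, not the actual Parquet or SeaweedFS source; `ReadPathSelector` and `readInto` are hypothetical names, and only the standard `FSDataInputStream` and `ByteBufferReadable` APIs are assumed:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical reader-side capability check (names are illustrative).
class ReadPathSelector {
    static int readInto(FSDataInputStream in, ByteBuffer buf) throws IOException {
        if (in.getWrappedStream() instanceof ByteBufferReadable) {
            // Fast path: readers trust the declared capability. If the
            // implementation mis-tracks position, reads return wrong bytes
            // or hit spurious EOFs (the bug described above).
            return in.read(buf);
        }
        // Fallback path: plain byte[] reads, which BufferedFSInputStream
        // handles with correct position tracking.
        byte[] tmp = new byte[buf.remaining()];
        int n = in.read(tmp, 0, tmp.length);
        if (n > 0) {
            buf.put(tmp, 0, n);
        }
        return n;
    }
}
```

Removing the interface steers readers onto the fallback path, which is exactly what the fix relies on.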
## Solution

Removed the `ByteBufferReadable` interface from `SeaweedHadoopInputStream` to match Hadoop's `RawLocalFileSystem` pattern, which uses `BufferedFSInputStream` for proper position tracking.
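A minimal sketch of the resulting `open()` shape, assuming only the standard Hadoop `BufferedFSInputStream(FSInputStream, int)` constructor; this illustrates the pattern rather than reproducing the actual `SeaweedFileSystem` code:

```java
import org.apache.hadoop.fs.BufferedFSInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;

class OpenSketch {
    // Wrap the raw stream the way RawLocalFileSystem does: buffering is
    // provided by BufferedFSInputStream, which keeps seek()/getPos()
    // consistent with its buffer, so positioned reads stay correct.
    static FSDataInputStream wrap(FSInputStream raw, int bufferSize) {
        return new FSDataInputStream(new BufferedFSInputStream(raw, bufferSize));
    }
}
```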
## Changes

### Core Fix

- `SeaweedHadoopInputStream.java`:
  - Removed `ByteBufferReadable` interface
  - Removed `read(ByteBuffer)` method
  - Cleaned up debug logging
  - Added documentation explaining the design choice
- `SeaweedFileSystem.java`:
  - Changed from `BufferedByteBufferReadableInputStream` to `BufferedFSInputStream`
  - Applies to all streams uniformly
  - Cleaned up debug logging
- `SeaweedInputStream.java`:
  - Cleaned up debug logging
### Cleanup

- Deleted debug-only files:
  - `DebugDualInputStream.java`
  - `DebugDualInputStreamWrapper.java`
  - `DebugDualOutputStream.java`
  - `DebugMode.java`
  - `LocalOnlyInputStream.java`
  - `ShadowComparisonStream.java`
- Reverted `SeaweedFileSystemStore.java` (removed all debug mode logic)
- Cleaned `docker-compose.yml` (removed debug environment variables) and all `.md` documentation files in `test/java/spark/`
## Testing

All Spark integration tests pass:

- ✅ `SparkSQLTest.testCreateTableAndQuery` (complex 4-column schema)
- ✅ `SimpleOneColumnTest` (basic operations)
- ✅ All other Spark integration tests
## Technical Details

### Why This Works

Hadoop's `RawLocalFileSystem` uses the exact same pattern:

- Does NOT implement `ByteBufferReadable`
- Uses `BufferedFSInputStream` for buffering
- Properly handles positioned reads with automatic position restoration
### Position Tracking

`BufferedFSInputStream` implements positioned reads correctly:

```java
public int read(long position, byte[] buffer, int offset, int length)
        throws IOException {
    long oldPos = getPos();
    try {
        seek(position);
        return read(buffer, offset, length);
    } finally {
        seek(oldPos); // Restores position!
    }
}
```
This ensures buffered reads don't permanently change the stream position, which is critical for Parquet's random access pattern.
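As a hedged usage sketch of the access pattern this protects (assuming only the standard `FSDataInputStream` API; `PositionedReadDemo` is a hypothetical name): Parquet-style readers interleave positioned reads of the footer and column chunks with sequential reads, and expect `getPos()` to be unaffected by the positioned ones.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

class PositionedReadDemo {
    static void check(FSDataInputStream in, long footerOffset) throws IOException {
        long before = in.getPos();
        byte[] footer = new byte[8];
        // Positioned read: internally seek(position), read, then restore.
        in.readFully(footerOffset, footer);
        // The sequential position must be unchanged afterwards.
        assert in.getPos() == before;
    }
}
```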
### Performance Impact

Minimal to none:

- Network latency dominates for remote storage
- Buffering is still active (4x buffer size)
- The extra `byte[]` copy is negligible compared to network I/O
## Commit Message

```
Fix Parquet EOF error by removing ByteBufferReadable interface

SeaweedHadoopInputStream incorrectly declared the ByteBufferReadable
interface without properly implementing it, causing position tracking
issues during positioned reads. This resulted in "78 bytes left" EOF
errors when reading Parquet files with complex schemas in Spark.

Solution: Remove ByteBufferReadable and use BufferedFSInputStream
(matching Hadoop's RawLocalFileSystem pattern), which properly handles
position restoration for positioned reads.

Changes:
- Remove ByteBufferReadable interface from SeaweedHadoopInputStream
- Change SeaweedFileSystem to use BufferedFSInputStream for all streams
- Clean up debug logging
- Delete debug-only classes and files

Tested: All Spark integration tests pass
```
## Files Changed

### Modified

- `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedHadoopInputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java`
- `other/java/client/src/main/java/seaweedfs/client/SeaweedInputStream.java`
- `test/java/spark/docker-compose.yml`

### Reverted

- `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystemStore.java`

### Deleted

- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualInputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualInputStreamWrapper.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualOutputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugMode.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/LocalOnlyInputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/ShadowComparisonStream.java`
- All `.md` files in `test/java/spark/` (debug documentation)