# Fix Parquet EOF Error by Removing ByteBufferReadable Interface
## Summary

Fixed the `EOFException: Reached the end of stream. Still have: 78 bytes left` error when reading Parquet files with complex schemas in Spark.
## Root Cause

`SeaweedHadoopInputStream` declared that it implemented the `ByteBufferReadable` interface but did not implement it properly. Callers that trust the declared capability then choose an incorrect buffering strategy, and position tracking breaks during positioned reads, which Parquet depends on for random access.
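To see why an advertised-but-broken `ByteBufferReadable` is harmful, consider how Hadoop-based readers pick a read path. The sketch below is illustrative only, not the actual Parquet or SeaweedFS source; `ReadPathSelector` and `readInto` are hypothetical names, and only the standard `FSDataInputStream` and `ByteBufferReadable` APIs are assumed:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical reader-side capability check (names are illustrative).
class ReadPathSelector {
    static int readInto(FSDataInputStream in, ByteBuffer buf) throws IOException {
        if (in.getWrappedStream() instanceof ByteBufferReadable) {
            // Fast path: readers trust the declared capability. If the
            // implementation mis-tracks position, reads return wrong bytes
            // or hit spurious EOFs (the bug described above).
            return in.read(buf);
        }
        // Fallback path: plain byte[] reads, which BufferedFSInputStream
        // handles with correct position tracking.
        byte[] tmp = new byte[buf.remaining()];
        int n = in.read(tmp, 0, tmp.length);
        if (n > 0) {
            buf.put(tmp, 0, n);
        }
        return n;
    }
}
```

Removing the interface steers readers onto the fallback path, which is exactly what the fix relies on.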
## Solution

Removed the `ByteBufferReadable` interface from `SeaweedHadoopInputStream` to match Hadoop's `RawLocalFileSystem` pattern, which uses `BufferedFSInputStream` for proper position tracking.
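A minimal sketch of the resulting `open()` shape, assuming only the standard Hadoop `BufferedFSInputStream(FSInputStream, int)` constructor; this illustrates the pattern rather than reproducing the actual `SeaweedFileSystem` code:

```java
import org.apache.hadoop.fs.BufferedFSInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;

class OpenSketch {
    // Wrap the raw stream the way RawLocalFileSystem does: buffering is
    // provided by BufferedFSInputStream, which keeps seek()/getPos()
    // consistent with its buffer, so positioned reads stay correct.
    static FSDataInputStream wrap(FSInputStream raw, int bufferSize) {
        return new FSDataInputStream(new BufferedFSInputStream(raw, bufferSize));
    }
}
```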
## Changes

### Core Fix

- `SeaweedHadoopInputStream.java`:
  - Removed `ByteBufferReadable` interface
  - Removed `read(ByteBuffer)` method
  - Cleaned up debug logging
  - Added documentation explaining the design choice
- `SeaweedFileSystem.java`:
  - Changed from `BufferedByteBufferReadableInputStream` to `BufferedFSInputStream`
  - Applies to all streams uniformly
  - Cleaned up debug logging
- `SeaweedInputStream.java`:
  - Cleaned up debug logging
### Cleanup

- Deleted debug-only files:
  - `DebugDualInputStream.java`
  - `DebugDualInputStreamWrapper.java`
  - `DebugDualOutputStream.java`
  - `DebugMode.java`
  - `LocalOnlyInputStream.java`
  - `ShadowComparisonStream.java`
- Reverted `SeaweedFileSystemStore.java` (removed all debug mode logic)
- Cleaned `docker-compose.yml` (removed debug environment variables) and all `.md` documentation files in `test/java/spark/`
## Testing

All Spark integration tests pass:

- ✅ `SparkSQLTest.testCreateTableAndQuery` (complex 4-column schema)
- ✅ `SimpleOneColumnTest` (basic operations)
- ✅ All other Spark integration tests
## Technical Details

### Why This Works

Hadoop's `RawLocalFileSystem` uses the exact same pattern:

- Does NOT implement `ByteBufferReadable`
- Uses `BufferedFSInputStream` for buffering
- Properly handles positioned reads with automatic position restoration
### Position Tracking

`BufferedFSInputStream` implements positioned reads correctly:

```java
public int read(long position, byte[] buffer, int offset, int length)
        throws IOException {
    long oldPos = getPos();
    try {
        seek(position);
        return read(buffer, offset, length);
    } finally {
        seek(oldPos); // Restores position!
    }
}
```
This ensures buffered reads don't permanently change the stream position, which is critical for Parquet's random access pattern.
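As a hedged usage sketch of the access pattern this protects (assuming only the standard `FSDataInputStream` API; `PositionedReadDemo` is a hypothetical name): Parquet-style readers interleave positioned reads of the footer and column chunks with sequential reads, and expect `getPos()` to be unaffected by the positioned ones.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

class PositionedReadDemo {
    static void check(FSDataInputStream in, long footerOffset) throws IOException {
        long before = in.getPos();
        byte[] footer = new byte[8];
        // Positioned read: internally seek(position), read, then restore.
        in.readFully(footerOffset, footer);
        // The sequential position must be unchanged afterwards.
        assert in.getPos() == before;
    }
}
```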
### Performance Impact

Minimal to none:

- Network latency dominates for remote storage
- Buffering is still active (4x buffer size)
- The extra `byte[]` copy is negligible compared to network I/O
## Commit Message

```
Fix Parquet EOF error by removing ByteBufferReadable interface

SeaweedHadoopInputStream incorrectly declared the ByteBufferReadable
interface without properly implementing it, causing position tracking
issues during positioned reads. This resulted in "78 bytes left" EOF
errors when reading Parquet files with complex schemas in Spark.

Solution: Remove ByteBufferReadable and use BufferedFSInputStream
(matching Hadoop's RawLocalFileSystem pattern), which properly handles
position restoration for positioned reads.

Changes:
- Remove ByteBufferReadable interface from SeaweedHadoopInputStream
- Change SeaweedFileSystem to use BufferedFSInputStream for all streams
- Clean up debug logging
- Delete debug-only classes and files

Tested: All Spark integration tests pass
```
## Files Changed

### Modified

- `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedHadoopInputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java`
- `other/java/client/src/main/java/seaweedfs/client/SeaweedInputStream.java`
- `test/java/spark/docker-compose.yml`

### Reverted

- `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystemStore.java`

### Deleted

- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualInputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualInputStreamWrapper.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugDualOutputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/DebugMode.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/LocalOnlyInputStream.java`
- `other/java/hdfs3/src/main/java/seaweed/hdfs/ShadowComparisonStream.java`
- All `.md` files in `test/java/spark/` (debug documentation)