# Parquet EOFException Fix: 78-Byte Discrepancy
## Problem Statement
Spark integration tests were consistently failing with:
```
java.io.EOFException: Reached the end of stream. Still have: 78 bytes left
	at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:112)
```
The error was consistent across all Parquet writes:
- File sizes varied: 684, 693, 696, 707, 1275 bytes
- Missing bytes: **always** exactly 78
- This suggested a systematic offset error, not random data loss
## Root Cause Analysis
### Investigation Steps
1. **Examined the Parquet-Java source code** (`~/dev/parquet-java/`):
   - Found the error originates in `H2SeekableInputStream.readFully()`, line 112
   - A comment there indicates: "this is probably a bug in the ParquetReader"
   - Parquet is trying to read data based on footer metadata offsets
2. **Traced the Parquet writer logic**:
   - In `ParquetFileWriter.java`, lines 1027-1029 and 1546:

     ```java
     long beforeHeader = out.getPos();
     if (currentChunkFirstDataPage < 0) {
       currentChunkFirstDataPage = beforeHeader;
     }
     ```

   - Parquet calls `out.getPos()` to record where each column chunk starts
   - These positions are stored in the file's footer metadata
3. **Identified the disconnect**:
   - `out` is Hadoop's `FSDataOutputStream` wrapping `SeaweedHadoopOutputStream`
   - `FSDataOutputStream` uses an internal position counter
   - It does NOT call `SeaweedOutputStream.getPos()` automatically
   - Evidence: no `"[DEBUG-2024] getPos() called"` log messages appeared in tests (the probe sketch after this list reproduces this)
4. **Confirmed with a file download**:
   - Successfully downloaded the actual Parquet file (1275 bytes)
   - Parquet's footer claims the data extends to byte 1353 (1275 + 78)
   - The footer metadata has incorrect offsets!
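A minimal probe sketch of the third finding, runnable against `hadoop-common` (the `GetPosProbe` class is hypothetical, invented for illustration; it is not SeaweedFS code). It shows that `FSDataOutputStream` answers `getPos()` from its own byte counter and never calls a `getPos()` method defined on the wrapped stream:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.fs.FSDataOutputStream;

// Hypothetical probe stream: exposes a getPos() shaped like
// SeaweedOutputStream's, and logs if anyone ever calls it.
public class GetPosProbe extends OutputStream {
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    @Override
    public void write(int b) throws IOException {
        sink.write(b);
    }

    // Never invoked by Hadoop's wrapper.
    public long getPos() {
        System.out.println("[probe] getPos() called");
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        try (FSDataOutputStream out = new FSDataOutputStream(new GetPosProbe(), null)) {
            out.write(new byte[100]);
            // Prints 100 from Hadoop's internal counter; the "[probe] getPos()
            // called" line never appears, matching the silent "[DEBUG-2024]"
            // logging observed in the integration tests.
            System.out.println("FSDataOutputStream.getPos() = " + out.getPos());
        }
    }
}
```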
### The Mismatch
When writing:
```
┌─────────────────────────────────────────────────────────────┐
│ Parquet Writer                                              │
│   ↓ write(data)                                             │
│ FSDataOutputStream (Hadoop)                                 │
│   - Counts bytes: position = 1353                           │
│   - getPos() returns: 1353  ← Parquet records this!         │
│   ↓ write(data)                                             │
│ SeaweedOutputStream                                         │
│   - Buffers data internally                                 │
│   - getPos() returns: position + buffer.position()          │
│   - But FSDataOutputStream NEVER calls this!                │
│   ↓ flush on close()                                        │
│ SeaweedFS Server                                            │
│   - Actually stores: 1275 bytes                             │
└─────────────────────────────────────────────────────────────┘
```
**Result**: the footer says the data extends to offset 1353, but the file only has 1275 bytes!
## The Fix
**File**: `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java`

Override `FSDataOutputStream.getPos()` to delegate to our stream:
```java
SeaweedHadoopOutputStream outputStream = (SeaweedHadoopOutputStream)
    seaweedFileSystemStore.createFile(path, overwrite, permission,
        seaweedBufferSize, replicaPlacement);

// Use a custom FSDataOutputStream that delegates getPos() to our stream
return new FSDataOutputStream(outputStream, statistics) {
    @Override
    public long getPos() {
        // Delegate to SeaweedOutputStream's position tracking
        return outputStream.getPos();
    }
};
```
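A note on the design: the anonymous subclass leaves the wrapped write path untouched and changes only how position queries are answered. It relies on `getPos()` being overridable (not `final`) in the Hadoop version in use, and on `outputStream` being effectively final so the anonymous class can capture it.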
### Why This Works
- **Before**: Parquet calls `FSDataOutputStream.getPos()` → gets Hadoop's internal counter (wrong!)
- **After**: Parquet calls `FSDataOutputStream.getPos()` → delegates to `SeaweedOutputStream.getPos()` → returns `position + buffer.position()` (correct!)
- `SeaweedOutputStream.getPos()` correctly accounts for both components (sketched below):
  - `position`: bytes already flushed to the server
  - `buffer.position()`: bytes in the buffer, not yet flushed
  - Total: an accurate position for the footer metadata
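A minimal sketch of that bookkeeping, assuming field names consistent with the description above (the class name and buffer size are invented; the authoritative version is `SeaweedOutputStream.java`, lines 100-108, listed under Related Code):

```java
import java.nio.ByteBuffer;

// Illustrative only: not the actual SeaweedOutputStream implementation.
class PositionTrackingSketch {
    private long position;                                            // bytes already flushed to the server
    private final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024);  // pending, unflushed bytes

    public synchronized long getPos() {
        // Flushed bytes plus bytes still sitting in the local buffer:
        // the true logical file length at the moment of the query.
        return position + buffer.position();
    }
}
```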
## Testing
The fix will be validated by:

- The existing `getPos()` logging will now show calls (previously silent)
- Parquet files should be readable without `EOFException` (see the round-trip sketch below)
- The 78-byte discrepancy should disappear
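A round-trip sketch of the kind of Spark check involved; the path, port, master, and config wiring here are assumptions for a local test, not the actual integration-test setup. Before the fix, the read side would fail with the `EOFException` quoted above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetRoundTripCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-roundtrip")
                .master("local[1]")
                // Assumed wiring: route the seaweedfs:// scheme to the fixed FileSystem.
                .config("spark.hadoop.fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem")
                .getOrCreate();

        // Assumed test location; any path served by SeaweedFileSystem works.
        String path = "seaweedfs://localhost:8888/test/roundtrip.parquet";

        Dataset<Row> df = spark.range(100).toDF("id");
        df.write().mode("overwrite").parquet(path);

        // Previously threw: java.io.EOFException: Reached the end of stream.
        // Still have: 78 bytes left
        long count = spark.read().parquet(path).count();
        System.out.println("rows read back: " + count);

        spark.stop();
    }
}
```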
## Related Code
- **Parquet writer**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:1027,1546`
- **Parquet reader**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:1174,1180`
- **Error location**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/H2SeekableInputStream.java:112`
- **SeaweedFS position tracking**: `other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java:100-108`
## Lessons Learned
- **Double buffering is dangerous**: when multiple layers track position independently, they can diverge
- **Read the source**: examining the Parquet-Java and Spark source code was essential to understanding the issue
- **Systematic errors need systematic analysis**: the consistent 78-byte offset was the clue that this wasn't random data loss
- **Framework integration matters**: Hadoop's `FSDataOutputStream` wrapper behavior must be understood and explicitly handled
## Commit
- **SHA**: `9e7ed4868`
- **Message**: "fix: Override FSDataOutputStream.getPos() to use SeaweedOutputStream position"