# Parquet EOFException Fix: 78-Byte Discrepancy
## Problem Statement
Spark integration tests were consistently failing with:
```
java.io.EOFException: Reached the end of stream. Still have: 78 bytes left
    at org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:112)
```
The error was consistent across all Parquet writes:
- File sizes varied: 684, 693, 696, 707, 1275 bytes
- Missing bytes: **ALWAYS exactly 78 bytes**
- This suggested a systematic offset error, not random data loss
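For context on the message itself: Parquet's read path fills fixed-size buffers from the stream, and the exception reports how many requested bytes were never delivered. A minimal sketch of such a readFully-style loop (an illustration, not the actual `H2SeekableInputStream` code) shows why footer metadata that points past the end of the file always leaves the same number of bytes unread:
```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ReadFullySketch {
    // Read exactly `len` bytes or fail; simplified illustration of the
    // kind of loop that produces "Still have: N bytes left".
    static void readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
        int remaining = len;
        while (remaining > 0) {
            int n = in.read(buf, off + (len - remaining), remaining);
            if (n < 0) {
                // The footer promised more bytes than the file contains;
                // with the bad metadata the shortfall was always 78 bytes.
                throw new EOFException(
                        "Reached the end of stream. Still have: " + remaining + " bytes left");
            }
            remaining -= n;
        }
    }
}
```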
## Root Cause Analysis
### Investigation Steps
1. **Examined Parquet-Java source code** (`~/dev/parquet-java/`):
- Found that the error originates in `H2SeekableInputStream.readFully()`, line 112
- Comment indicates: *"this is probably a bug in the ParquetReader"*
- Parquet is trying to read data based on footer metadata offsets
2. **Traced Parquet writer logic**:
- In `ParquetFileWriter.java`, lines 1027-1029 and 1546:
```java
long beforeHeader = out.getPos();
if (currentChunkFirstDataPage < 0) {
  currentChunkFirstDataPage = beforeHeader;
}
```
- Parquet calls `out.getPos()` to record where column chunks start
- These positions are stored in the file's footer metadata
3. **Identified the disconnect**:
- `out` is Hadoop's `FSDataOutputStream` wrapping `SeaweedHadoopOutputStream`
- `FSDataOutputStream` uses an **internal position counter**
- It does **NOT** call `SeaweedOutputStream.getPos()` automatically (a simplified sketch of this wrapper behavior follows this list)
- Evidence: no `"[DEBUG-2024] getPos() called"` log messages appeared in tests
4. **Confirmed with file download**:
- Successfully downloaded actual Parquet file (1275 bytes)
- Parquet's footer claims data extends to byte 1353 (1275 + 78)
- The footer metadata has incorrect offsets!
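To make the disconnect in step 3 concrete, here is a simplified sketch of the byte-counting wrapper behavior described above. Hadoop's real `FSDataOutputStream` does this through an internal `PositionCache` class; details vary by Hadoop version, so treat this as an illustration:
```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustration of Hadoop-style position caching: the wrapper counts the
// bytes it forwards and reports that count, never asking the wrapped
// stream where it thinks the file ends.
class PositionCountingStream extends FilterOutputStream {
    private long position; // the wrapper's own byte counter

    PositionCountingStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        position++; // counted here, independent of the inner stream
    }

    public long getPos() {
        // This counter is what the writer records in the footer; the
        // wrapped SeaweedOutputStream.getPos() is never consulted.
        return position;
    }
}
```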
### The Mismatch
```
When writing:

┌───────────────────────────────────────────────────────┐
│ Parquet Writer                                        │
│   ↓ write(data)                                       │
│ FSDataOutputStream (Hadoop)                           │
│   - Counts bytes: position = 1353                     │
│   - getPos() returns: 1353  ← Parquet records this!   │
│   ↓ write(data)                                       │
│ SeaweedOutputStream                                   │
│   - Buffers data internally                           │
│   - getPos() returns: position + buffer.position()    │
│   - But FSDataOutputStream NEVER calls this!          │
│   ↓ flush on close()                                  │
│ SeaweedFS Server                                      │
│   - Actually stores: 1275 bytes                       │
└───────────────────────────────────────────────────────┘

Result: Footer says "read from offset 1353" but file only has 1275 bytes!
```
## The Fix
**File**: `other/java/hdfs3/src/main/java/seaweed/hdfs/SeaweedFileSystem.java`
Override `FSDataOutputStream.getPos()` to delegate to our stream:
```java
SeaweedHadoopOutputStream outputStream = (SeaweedHadoopOutputStream)
        seaweedFileSystemStore.createFile(path, overwrite, permission,
                seaweedBufferSize, replicaPlacement);

// Use a custom FSDataOutputStream that delegates getPos() to our stream
return new FSDataOutputStream(outputStream, statistics) {
    @Override
    public long getPos() {
        // Delegate to SeaweedOutputStream's position tracking
        return outputStream.getPos();
    }
};
```
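The anonymous subclass works because `outputStream` is effectively final, so the overridden method can capture it. Since the override lives entirely on the wrapper that SeaweedFS returns, no change to Parquet or Hadoop itself is required.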
### Why This Works
1. **Before**: Parquet calls `FSDataOutputStream.getPos()` → Gets Hadoop's internal counter (wrong!)
2. **After**: Parquet calls `FSDataOutputStream.getPos()` → Delegates to `SeaweedOutputStream.getPos()` → Returns `position + buffer.position()` (correct!)
3. `SeaweedOutputStream.getPos()` correctly accounts for:
- `position`: bytes already flushed to server
- `buffer.position()`: bytes in buffer not yet flushed
- Total: accurate position for metadata
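Based on that description (the actual implementation is in `SeaweedOutputStream.java`, referenced under Related Code below), the delegated method amounts to something like this sketch:
```java
import java.nio.ByteBuffer;

// Sketch of the position accounting described above, not the actual
// seaweedfs.client.SeaweedOutputStream source.
class PositionAccountingSketch {
    private long position; // bytes already flushed to the server
    private final ByteBuffer buffer = ByteBuffer.allocate(8 * 1024); // pending writes
    // (The write/flush path, omitted here, fills `buffer` and advances
    // `position` whenever buffered bytes are sent to the server.)

    public synchronized long getPos() {
        // Flushed bytes plus bytes still buffered locally: the logical
        // end of file that the Parquet footer should record.
        return position + buffer.position();
    }
}
```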
## Testing
The fix will be validated by:
1. The existing `getPos()` logging will now show calls (previously silent)
2. Parquet files should be readable without EOFException
3. The 78-byte discrepancy should disappear
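A minimal round-trip check along these lines can exercise the failure path end to end (the scheme, host, and config key below are illustrative and assume the SeaweedFS Hadoop client is on the classpath):
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetRoundTripCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-eof-check")
                .master("local[2]")
                // Illustrative: register the SeaweedFS Hadoop filesystem.
                .config("spark.hadoop.fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem")
                .getOrCreate();

        String path = "seaweedfs://localhost:8888/spark/parquet-eof-check";

        // Write a small dataset; before the fix the footer recorded
        // offsets past the end of the stored file.
        spark.range(1000).write().mode("overwrite").parquet(path);

        // Read it back; with getPos() delegated correctly this no longer
        // throws "EOFException: ... Still have: 78 bytes left".
        Dataset<Row> df = spark.read().parquet(path);
        System.out.println("rows read: " + df.count());

        spark.stop();
    }
}
```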
## Related Code
- **Parquet Writer**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java:1027,1546`
- **Parquet Reader**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:1174,1180`
- **Error Location**: `parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/H2SeekableInputStream.java:112`
- **SeaweedFS Position Tracking**: `other/java/client/src/main/java/seaweedfs/client/SeaweedOutputStream.java:100-108`
## Lessons Learned
1. **Double buffering is dangerous**: When multiple layers track position independently, they can diverge
2. **Read the source**: Examining Parquet-Java and Spark source code was essential to understanding the issue
3. **Systematic errors need systematic analysis**: The consistent 78-byte offset was a clue it wasn't random data loss
4. **Framework integration matters**: Hadoop's `FSDataOutputStream` wrapper behavior must be understood and explicitly handled
## Commit
**SHA**: 9e7ed4868
**Message**: "fix: Override FSDataOutputStream.getPos() to use SeaweedOutputStream position"