docs: comprehensive analysis of persistent 78-byte Parquet issue
After Parquet 1.16.0 upgrade:
- Error persists (EOFException: 78 bytes left)
- File sizes changed (684→693, 696→705) but SAME 78-byte gap
- Footer IS being written (logs show complete write sequence)
- All bytes ARE stored correctly (perfect consistency)

Conclusion: This is a systematic offset calculation error in how Parquet calculates expected file size, not a missing data problem.

Possible causes:
1. Page header size mismatch with Snappy compression
2. Column chunk metadata offset error in footer
3. FSDataOutputStream position tracking issue
4. Dictionary page size accounting problem

Recommended next steps:
1. Try uncompressed Parquet (remove Snappy)
2. Examine actual file bytes with parquet-tools
3. Test with different Spark version (4.0.1)
4. Compare with known-working FS (HDFS, S3A)

The 78-byte constant suggests a fixed structure size that Parquet accounts for but isn't actually written or is written differently.

pull/7526/head
1 changed files with 199 additions and 0 deletions
@@ -0,0 +1,199 @@

# Parquet 78-Byte EOFException - Final Analysis

## Problem Status: **UNRESOLVED after Parquet 1.16.0 upgrade**

### Symptoms
- `EOFException: Reached the end of stream. Still have: 78 bytes left`
- Occurs on **every** Parquet read operation
- **78 bytes is ALWAYS the exact discrepancy** across all files/runs

### What We've Confirmed ✅

1. **SeaweedFS works perfectly**
   - All bytes written through `write()` are stored correctly
   - `totalBytesWritten == position == chunks == attr` (perfect consistency)
   - File integrity verified at every step

2. **Parquet footer IS being written**
   - Logs show the complete write sequence, including the footer
   - Small writes at the end (1, 1, 4 bytes) look like the Parquet trailer structure
   - All bytes flushed successfully before close

3. **Parquet 1.16.0 upgrade changes the file size but not the error**
   - **Before (1.13.1):** 684 bytes written → expects 762 (missing 78)
   - **After (1.16.0):** 693 bytes written → expects 771 (missing 78)
   - +9 bytes written, but SAME 78-byte error

### Key Evidence from Logs

```
Parquet 1.13.1:
  write(74 bytes): totalSoFar=679
  🔒 close: totalBytesWritten=684 writeCalls=250
  ✅ Stored: 684 bytes
  ❌ Read expects: 762 bytes (684 + 78)

Parquet 1.16.0:
  write(1 byte): totalSoFar=688 [FOOTER?]
  write(1 byte): totalSoFar=689 [FOOTER?]
  write(4 bytes): totalSoFar=693 [FOOTER?]
  🔒 close: totalBytesWritten=693 writeCalls=259
  ✅ Stored: 693 bytes
  ❌ Read expects: 771 bytes (693 + 78)
```
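
The arithmetic behind the constant gap can be checked directly from the figures in the log excerpt. The snippet below is only a sanity check on those numbers; the `observations` dictionary is an illustrative name, not part of any tool.

```python
# Bytes written vs. bytes the reader expects, copied from the log excerpt.
observations = {
    "1.13.1": {"written": 684, "expected": 762},
    "1.16.0": {"written": 693, "expected": 771},
}

# Gap per Parquet version: how many bytes the reader wants beyond what was stored.
gaps = {version: o["expected"] - o["written"] for version, o in observations.items()}

# How much the upgrade changed the written size.
written_delta = observations["1.16.0"]["written"] - observations["1.13.1"]["written"]

for version, gap in gaps.items():
    print(f"Parquet {version}: reader expects {gap} bytes more than were written")
print(f"Upgrade changed the written size by {written_delta:+d} bytes")
```

The written size moves with the Parquet version while the gap does not, which is why a fixed-size structure or a fixed position lag looks more plausible than lost data.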

### What This Means

**The 78-byte discrepancy is NOT:**
- ❌ Missing footer (footer is written)
- ❌ Lost bytes (all bytes stored)
- ❌ SeaweedFS bug (perfect byte accounting)
- ❌ Version-specific (persists across Parquet 1.13.1 and 1.16.0)

**The 78-byte discrepancy IS:**
- ✅ A **systematic offset calculation error** in Parquet
- ✅ Related to how Parquet calculates **expected file size** or **column chunk offsets**
- ✅ Consistent across all test cases (not random corruption)

## Hypotheses

### H1: Page Header Size Mismatch
Parquet might be computing the expected data size with page headers included, while those headers end up smaller or absent in the Snappy-compressed output.

**Evidence:**
- 78 bytes could be multiple page headers (typically 8-16 bytes each)
- Compression might eliminate or reduce these headers
- Parquet calculates size pre-compression, reads post-compression

**Test:** Try **uncompressed Parquet** (no Snappy)

### H2: Column Chunk Metadata Offset Error
The Parquet footer contains byte offsets to column chunks. These offsets might be calculated incorrectly.

**Evidence:**
- Footer is written correctly (we see the writes)
- But offsets IN the footer point 78 bytes beyond the actual data
- Reader tries to read from these wrong offsets

**Test:** Examine actual file bytes to see footer content
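
Once the footer has been decoded (for example with `parquet-tools meta`), H2 can be tested by comparing each column chunk's byte range with the stored file length. The sketch below assumes the offsets and compressed sizes were copied out by hand; the `ColumnChunk` type, the `chunks_past_eof` helper, and the sample values are all hypothetical and not taken from a real file.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    name: str
    start_offset: int     # first byte of the chunk, as recorded in the footer
    compressed_size: int  # total compressed size, as recorded in the footer


def chunks_past_eof(file_length, chunks):
    """Return (name, overshoot) for every chunk whose range ends beyond file_length."""
    overshoots = []
    for chunk in chunks:
        end = chunk.start_offset + chunk.compressed_size
        if end > file_length:
            overshoots.append((chunk.name, end - file_length))
    return overshoots


# Hypothetical footer values, chosen only to illustrate a 78-byte overshoot.
sample = [ColumnChunk("id", 4, 100), ColumnChunk("name", 104, 667)]
print(chunks_past_eof(693, sample))
```

If a real file shows an overshoot of exactly 78 bytes on its last chunk, the footer offsets (or the positions they were derived from) are the place to look.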

### H3: FSDataOutputStream Position Tracking
Hadoop's `FSDataOutputStream` wrapper might track position differently than our underlying stream.

**Evidence:**
- `getPos()` was NEVER called (suspicious!)
- Parquet must get position somehow - likely from FSDataOutputStream directly
- FSDataOutputStream might return position before flush

**Test:** Implement `Seekable` interface or check FSDataOutputStream behavior
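
To make the H3 failure mode concrete, here is a toy model of a buffered output stream whose reported position can lag behind the bytes it has accepted. It is not SeaweedFS or Hadoop code: the `BufferedSink` class is hypothetical, and the 615/78 split is chosen only to reproduce the observed gap, since nothing in the logs confirms where a buffer boundary falls.

```python
class BufferedSink:
    """Toy model of an output stream that buffers writes before flushing."""

    def __init__(self):
        self.flushed = 0           # bytes already handed to storage
        self.buffer = bytearray()  # bytes accepted by write() but not yet flushed

    def write(self, data):
        self.buffer.extend(data)

    def flush(self):
        self.flushed += len(self.buffer)
        self.buffer.clear()

    def pos_flushed_only(self):
        # Position that ignores buffered bytes (the suspected bug shape).
        return self.flushed

    def pos_including_buffer(self):
        # Position that counts every byte accepted so far.
        return self.flushed + len(self.buffer)


sink = BufferedSink()
sink.write(b"x" * 615)
sink.flush()
sink.write(b"y" * 78)  # accepted, but still sitting in the buffer

lag = sink.pos_including_buffer() - sink.pos_flushed_only()
print(f"position lag while data is buffered: {lag} bytes")
```

A writer that records offsets from a lagging position would produce metadata that disagrees with the final file by a constant amount, which matches the shape of the symptom even though it does not prove the cause.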

### H4: Dictionary Page Size Accounting
Parquet uses dictionary encoding. Dictionary pages might be:
- Calculated at full size
- Written compressed
- Not accounted for properly

**Evidence:**
- Small files (good candidates for dictionary encoding)
- 78 bytes reasonable for dictionary overhead
- Consistent across files (dictionaries similar size)

**Test:** Disable dictionary encoding in writer

## Recommended Next Steps

### Option 1: File Format Analysis (Definitive)
```bash
# Download a failing Parquet file and examine it
hexdump -C part-00000-xxx.parquet | tail -n 50

# Check footer structure
parquet-tools meta part-00000-xxx.parquet
parquet-tools dump part-00000-xxx.parquet
```

This will show:
- Actual footer content
- Column chunk offsets
- What the reader expects to find between byte 693 and byte 771, a range that lies past the end of the stored file
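
The trailer can also be checked without `parquet-tools`. A Parquet file starts with the 4-byte magic `PAR1` and ends with the footer metadata, a 4-byte little-endian footer length, and `PAR1` again. The sketch below parses only that trailer; `read_trailer` is a hypothetical helper, and the input is a synthetic stand-in rather than one of the failing files.

```python
import struct

MAGIC = b"PAR1"


def read_trailer(data):
    """Return (footer_length, footer_start) parsed from the last 8 bytes of a Parquet file."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic at start or end")
    (footer_length,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_length
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_length, footer_start


# Synthetic stand-in: magic, 10 data bytes, 5 footer bytes, footer length, magic.
fake = MAGIC + b"d" * 10 + b"f" * 5 + struct.pack("<I", 5) + MAGIC
print(read_trailer(fake))
```

For a real file, pass `open(path, "rb").read()` as `data`. If the trailer parses cleanly on a 693-byte file, the footer itself is intact and the 78-byte gap must come from offsets or sizes recorded inside it.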

### Option 2: Test Configuration Changes

**A) Disable Snappy compression:**
```scala
df.write
  .option("compression", "none")
  .parquet(path)
```

**B) Disable dictionary encoding:**
```scala
df.write
  .option("parquet.enable.dictionary", "false")
  .parquet(path)
```

**C) Larger page sizes:**
```scala
spark.conf.set("parquet.page.size", "2097152") // 2 MiB instead of the 1 MiB default
```

### Option 3: Implement Seekable Interface

Hadoop file systems that work with Parquet often implement `Seekable`:

```java
import java.io.OutputStream;
import org.apache.hadoop.fs.Seekable;

// Sketch: write(int), buffering, and flush logic are omitted for brevity.
public class SeaweedOutputStream extends OutputStream implements Seekable {
    public long getPos() {
        // Flushed bytes plus bytes still sitting in the write buffer
        return position + buffer.position();
    }

    public void seek(long pos) {
        // Not supported for output streams
        throw new UnsupportedOperationException();
    }

    public boolean seekToNewSource(long targetPos) {
        return false; // required by Seekable; an output stream has no alternate source
    }
}
```

### Option 4: Compare with Working Implementation

Test the SAME Spark job against:
- Local HDFS
- S3A
- Azure ABFS

Check whether they produce byte-identical Parquet files or files of different sizes.
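
A byte-level comparison makes that check mechanical. The sketch below reports both sizes and the first offset at which two byte strings differ; `first_difference` and `compare_files` are hypothetical helpers, and the inputs are synthetic rather than real Parquet output.

```python
def first_difference(a, b):
    """Return the offset of the first differing byte, or None if the common prefix is identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    return None


def compare_files(a, b):
    """Summarize size and content differences between two files' bytes."""
    return {
        "size_a": len(a),
        "size_b": len(b),
        "size_delta": len(b) - len(a),
        "first_difference": first_difference(a, b),
    }


# Synthetic inputs; in practice read both files with open(path, "rb").read().
report = compare_files(b"PAR1abcPAR1", b"PAR1abXYZPAR1")
print(report)
```

Note that files written by different runs may legitimately differ in metadata, so a size delta of exactly 78 bytes would be the more telling signal than any single differing byte.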

### Option 5: Parquet Community

This might be a known issue:
- Check [Parquet JIRA](https://issues.apache.org/jira/browse/PARQUET)
- Search for "column chunk offset" bugs
- Ask on Parquet mailing list

## Why This is Hard to Debug

1. **Black box problem:** We see the writes going in, but not Parquet's internal size and offset calculations
2. **Wrapper layers:** FSDataOutputStream sits between us and Parquet
3. **Binary format:** Can't inspect footer without tools
4. **Consistent failure:** Every file fails the same way, so there is no passing case to compare against

## Files for Investigation

Priority files to examine:
1. `part-00000-09a699c4-2299-45f9-8bee-8a8b1e241905.c000.snappy.parquet` (693 bytes, year=2021)
2. Any file from year=2020 (705 bytes)

## Success Criteria for Fix

- [ ] No EOFException on any Parquet file
- [ ] File size matches between write and read
- [ ] All 10 tests pass consistently

## Workaround Options (If No Fix Found)

1. **Use different format:** Write as ORC or Avro instead of Parquet
2. **Pad files:** Add 78 bytes of padding to match expected size (hacky)
3. **Fix on read:** Modify SeaweedInputStream to lie about file size
4. **Different Spark version:** Try Spark 4.0.1 (different Parquet integration)