Revert "docs: comprehensive analysis of persistent 78-byte Parquet issue"

This reverts commit 8e5f1d60ee.

# Parquet 78-Byte EOFException - Final Analysis

## Problem Status: **UNRESOLVED after Parquet 1.16.0 upgrade**

### Symptoms

- `EOFException: Reached the end of stream. Still have: 78 bytes left`
- Occurs on **every** Parquet read operation
- **78 bytes is ALWAYS the exact discrepancy** across all files/runs

### What We've Confirmed ✅

1. **SeaweedFS works perfectly**
   - All bytes written through `write()` are stored correctly
   - `totalBytesWritten == position == chunks == attr` (perfect consistency)
   - File integrity verified at every step

2. **The Parquet footer IS being written**
   - Logs show the complete write sequence, including the footer
   - The small writes at the end (1, 1, 4 bytes) match the Parquet trailer structure
   - All bytes are flushed successfully before close

3. **The Parquet 1.16.0 upgrade changes behavior but not the error**
   - **Before (1.13.1):** 684 bytes written → reader expects 762 (missing 78)
   - **After (1.16.0):** 693 bytes written → reader expects 771 (missing 78)
   - 9 more bytes are written, but the SAME 78-byte error remains

### Key Evidence from Logs

```
Parquet 1.13.1:
write(74 bytes): totalSoFar=679
🔒 close: totalBytesWritten=684 writeCalls=250
✅ Stored: 684 bytes
❌ Read expects: 762 bytes (684 + 78)

Parquet 1.16.0:
write(1 byte): totalSoFar=688 [FOOTER?]
write(1 byte): totalSoFar=689 [FOOTER?]
write(4 bytes): totalSoFar=693 [FOOTER?]
🔒 close: totalBytesWritten=693 writeCalls=259
✅ Stored: 693 bytes
❌ Read expects: 771 bytes (693 + 78)
```

### What This Means

**The 78-byte discrepancy is NOT:**
- ❌ A missing footer (the footer is written)
- ❌ Lost bytes (all bytes are stored)
- ❌ A SeaweedFS bug (byte accounting is perfectly consistent)
- ❌ Version-specific (it persists across Parquet versions)

**The 78-byte discrepancy IS:**
- ✅ A **systematic offset calculation error** in Parquet
- ✅ Related to how Parquet calculates the **expected file size** or **column chunk offsets**
- ✅ Consistent across all test cases (not random corruption)

## Hypotheses

### H1: Page Header Size Mismatch

Parquet might be calculating the expected data size including page headers that are actually compressed or elided under Snappy compression.

**Evidence:**
- 78 bytes could be several page headers (typically 8-16 bytes each)
- Compression might eliminate or shrink these headers
- Parquet may calculate the size pre-compression but read post-compression

**Test:** Try **uncompressed Parquet** (no Snappy); see Option 2A below.

### H2: Column Chunk Metadata Offset Error

The Parquet footer contains byte offsets to the column chunks. These offsets might be calculated incorrectly.

**Evidence:**
- The footer is written correctly (we see the writes)
- But the offsets IN the footer would point 78 bytes beyond the actual data
- The reader then tries to read from these wrong offsets

**Test:** Examine the actual file bytes to see the footer content, e.g. with the sketch below.
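
To make that test concrete, here is a minimal sketch using the parquet-mr footer API (Scala 2.13 with `parquet-hadoop` on the classpath; the `FooterDump` name and the path argument are hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.jdk.CollectionConverters._

// Hypothetical helper: dump every column chunk offset recorded in the footer,
// so they can be compared against the bytes actually stored.
object FooterDump {
  def main(args: Array[String]): Unit = {
    val conf   = new Configuration()
    val input  = HadoopInputFile.fromPath(new Path(args(0)), conf)
    val reader = ParquetFileReader.open(input)
    try {
      println(s"fileLength=${input.getLength}")
      for (block <- reader.getFooter.getBlocks.asScala;
           col   <- block.getColumns.asScala) {
        println(s"column=${col.getPath} dictPageOffset=${col.getDictionaryPageOffset} " +
                s"firstDataPageOffset=${col.getFirstDataPageOffset} totalSize=${col.getTotalSize}")
      }
    } finally reader.close()
  }
}
```

If any column's `firstDataPageOffset + totalSize` lands beyond the 693 bytes actually stored, H2 is confirmed.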

### H3: FSDataOutputStream Position Tracking

Hadoop's `FSDataOutputStream` wrapper might track the position differently than our underlying stream does.

**Evidence:**
- `getPos()` on our stream was NEVER called (suspicious!)
- Parquet must get the position somehow, most likely from FSDataOutputStream directly
- FSDataOutputStream might return a position that does not yet include unflushed bytes

**Test:** Implement the `Seekable` interface (see Option 3) or check FSDataOutputStream's behavior directly, as sketched below.
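
One cheap way to run this check without SeaweedFS in the loop: wrap a plain in-memory stream in `FSDataOutputStream`, the same wrapper a Hadoop `FileSystem` would use, and watch how `getPos()` advances relative to the bytes handed to `write()`. A minimal sketch (the `PosCheck` name is hypothetical; the write sizes are taken from the logs above; passing `null` statistics is fine for a local experiment):

```scala
import java.io.ByteArrayOutputStream
import org.apache.hadoop.fs.FSDataOutputStream

// Hypothetical experiment: does FSDataOutputStream's position cache count
// every byte passed to write(), or can it lag behind?
object PosCheck {
  def main(args: Array[String]): Unit = {
    val raw = new ByteArrayOutputStream()
    val out = new FSDataOutputStream(raw, null) // no Statistics needed locally
    var counted = 0L
    for (n <- Seq(74, 1, 1, 4)) { // write sizes seen in the logs above
      out.write(new Array[Byte](n))
      counted += n
      println(s"counted=$counted getPos=${out.getPos}")
    }
    out.close()
  }
}
```

If `getPos()` tracks `counted` exactly here but the footer offsets are still wrong in production, the discrepancy more likely lives in how `SeaweedOutputStream` buffers bytes beneath the wrapper.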

### H4: Dictionary Page Size Accounting

Parquet uses dictionary encoding. Dictionary pages might be:
- Calculated at full size
- Written compressed
- Not accounted for properly

**Evidence:**
- The files are small (good candidates for dictionary encoding)
- 78 bytes is a reasonable size for dictionary overhead
- The error is consistent across files (whose dictionaries are of similar size)

**Test:** Disable dictionary encoding in the writer; see Option 2B below. The footer dump under H2 also shows whether dictionary pages exist at all (via `dictPageOffset`).

## Recommended Next Steps

### Option 1: File Format Analysis (Definitive)

```bash
# Download a failing Parquet file and examine it
hexdump -C part-00000-xxx.parquet | tail -n 50

# Check the footer structure
parquet-tools meta part-00000-xxx.parquet
parquet-tools dump part-00000-xxx.parquet
```

This will show:
- The actual footer content
- The column chunk offsets
- What the file contains at byte 693 versus the expected 771
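
If `parquet-tools` isn't available, the 8-byte trailer can be decoded by hand: per the Parquet format spec, the file ends with a 4-byte little-endian footer length followed by the `PAR1` magic. A minimal sketch (the `TrailerCheck` name is hypothetical; it assumes the file was first copied locally):

```scala
import java.nio.file.{Files, Paths}
import java.nio.{ByteBuffer, ByteOrder}

// Decode the Parquet trailer: ...data/footer... [footerLen: int32 LE]["PAR1"]
object TrailerCheck {
  def main(args: Array[String]): Unit = {
    val bytes   = Files.readAllBytes(Paths.get(args(0)))
    val trailer = bytes.takeRight(8)
    val footerLen = ByteBuffer.wrap(trailer, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
    val magic     = new String(trailer, 4, 4, "US-ASCII")
    // The footer starts at fileLength - 8 - footerLen; everything before it is page data.
    println(s"fileLength=${bytes.length} footerLength=$footerLen magic=$magic " +
            s"footerStart=${bytes.length - 8 - footerLen}")
  }
}
```

A sane `magic=PAR1` with a plausible `footerStart` would rule out a truncated trailer and point the blame at the offsets inside the footer.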

### Option 2: Test Configuration Changes

**A) Disable Snappy compression:**
```scala
df.write
  .option("compression", "none")
  .parquet(path)
```

**B) Disable dictionary encoding:**
```scala
df.write
  .option("parquet.enable.dictionary", "false")
  .parquet(path)
```

**C) Larger page sizes:**
```scala
spark.conf.set("parquet.page.size", "2097152") // 2MB instead of 1MB
```
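
These toggles can be combined into one quick test matrix; a sketch for `spark-shell` (assumes `spark` and a small `df` are already in scope, and the output path is hypothetical):

```scala
import scala.util.Try

// Hypothetical test matrix: write with each option set, then read back and
// see which combination (if any) makes the 78-byte EOFException disappear.
val configs = Seq(
  Map("compression" -> "none"),
  Map("parquet.enable.dictionary" -> "false"),
  Map("compression" -> "none", "parquet.enable.dictionary" -> "false")
)
configs.zipWithIndex.foreach { case (opts, i) =>
  val path = s"/tmp/parquet-78-test-$i" // hypothetical output path
  df.write.options(opts).mode("overwrite").parquet(path)
  val readBack = Try(spark.read.parquet(path).count())
  println(s"opts=$opts readBack=$readBack")
}
```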

### Option 3: Implement Seekable Interface

Hadoop file systems that work with Parquet often implement `Seekable`:

```java
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.fs.Seekable;

public class SeaweedOutputStream extends OutputStream implements Seekable {
    public long getPos() {
        // flushed bytes plus bytes still sitting in the write buffer
        return position + buffer.position();
    }

    public void seek(long pos) throws IOException {
        // Not supported for output streams
        throw new UnsupportedOperationException();
    }

    public boolean seekToNewSource(long targetPos) throws IOException {
        return false; // required by Seekable; meaningless for a write stream
    }

    // existing fields and write()/flush()/close() methods unchanged
}
```

Note that `getPos()` must include bytes still sitting in the write buffer: if it reported only flushed bytes, every offset recorded in the footer could be off by the buffered amount, which would produce exactly this kind of constant discrepancy.

### Option 4: Compare with Working Implementation

Test the SAME Spark job against:
- Local HDFS
- S3A
- Azure ABFS

See whether they produce identical Parquet files or different sizes; a sketch follows.
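
A sketch of the comparison (the filesystem URIs are placeholders for whatever the cluster actually has configured, and the SeaweedFS scheme in particular is an assumption):

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical comparison: write the same single-partition output everywhere,
// then list the part-file sizes side by side.
val targets = Seq(
  "hdfs://namenode:9000/tmp/parquet-cmp", // placeholder URIs
  "s3a://test-bucket/parquet-cmp",
  "seaweedfs://filer:8888/parquet-cmp"
)
targets.foreach { base =>
  df.coalesce(1).write.mode("overwrite").parquet(base)
  val fs = FileSystem.get(new URI(base), spark.sparkContext.hadoopConfiguration)
  fs.globStatus(new Path(base + "/part-*")).foreach { st =>
    println(s"${st.getPath}: ${st.getLen} bytes")
  }
}
```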

### Option 5: Parquet Community

This might be a known issue:
- Check [Parquet JIRA](https://issues.apache.org/jira/browse/PARQUET)
- Search for "column chunk offset" bugs
- Ask on the Parquet mailing list

## Why This is Hard to Debug

1. **Black box problem:** We see the writes going in, but not Parquet's internal calculations
2. **Wrapper layers:** FSDataOutputStream sits between our stream and Parquet
3. **Binary format:** The footer can't be inspected without tools
4. **Consistent failure:** There are no passing edge cases to compare against

## Files for Investigation

Priority files to examine:
1. `part-00000-09a699c4-2299-45f9-8bee-8a8b1e241905.c000.snappy.parquet` (693 bytes, year=2021)
2. Any file from year=2020 (705 bytes)

## Success Criteria for Fix

- [ ] No EOFException on any Parquet file
- [ ] File size matches between write and read
- [ ] All 10 tests pass consistently

## Workaround Options (If No Fix Found)

1. **Use a different format:** Write ORC or Avro instead of Parquet
2. **Pad files:** Add 78 bytes of padding to match the expected size (hacky)
3. **Fix on read:** Modify SeaweedInputStream to lie about the file size
4. **Different Spark version:** Try Spark 4.0.1 (different Parquet integration)