
feat: add detailed offset analysis for 78-byte discrepancy

SUCCESS: File downloaded and readable! Now analyzing WHY Parquet expects 78 more bytes.

Added analysis:
1. Parse footer length from last 8 bytes
2. Extract column chunk offsets from parquet-tools meta
3. Compare actual file size with expected size from metadata
4. Identify if offsets are pointing beyond actual data
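Step 1 can be sketched as a small shell helper (a minimal sketch; `parquet_footer_len` is a hypothetical name, and it assumes the standard Parquet trailer: a 4-byte little-endian footer length followed by the `PAR1` magic):

```shell
# Print the footer length from a Parquet file's last 8 bytes
# (uint32 little-endian length + "PAR1" magic).
parquet_footer_len() {
  local le
  le=$(tail -c 8 "$1" | head -c 4 | xxd -p)
  # Byte-swap the little-endian hex, then convert to decimal
  echo $((16#$(echo "$le" | sed 's/\(..\)\(..\)\(..\)\(..\)/\4\3\2\1/')))
}
```

With the footer length in hand, the data region must end at `file_size - footer_len - 8`, which is what step 3 compares against the offsets from step 2.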

This will reveal:
- Are column chunk offsets incorrectly calculated during write?
- Is the footer claiming data that doesn't exist?
- Where exactly are the missing 78 bytes supposed to be?
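The third question reduces to an offset check, sketched here as a hypothetical helper (it assumes offsets appear as `OFF:<n>` in the `parquet-tools meta` dump; the exact field name varies by parquet-tools version):

```shell
# List metadata offsets that point past the end of the data file.
offsets_past_eof() {  # usage: offsets_past_eof meta.txt file.parquet
  local size
  size=$(($(wc -c < "$2")))
  grep -o 'OFF:[0-9]*' "$1" | cut -d: -f2 | while read -r off; do
    if [ "$off" -gt "$size" ]; then
      echo "$off"
    fi
  done
}
```

Any offset this prints is a claim about bytes that were never written, which would pin the bug on the write path rather than the download.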

The file is already uploaded as an artifact for deeper local analysis.
Branch: pull/7526/head
Author: chrislu, 1 week ago
Parent commit: 0afe330b4e
1 changed file with 29 additions:
.github/workflows/spark-integration-tests.yml
@@ -181,6 +181,35 @@ jobs:
echo ""
echo "=== Try reading data ==="
parquet-tools show test.parquet || echo "parquet-tools show failed"
echo ""
echo "=== CRITICAL ANALYSIS: Where are the missing 78 bytes? ==="
echo "Actual file size: $FILE_SIZE bytes"
# Parse footer to find what size Parquet thinks the file should be
echo ""
echo "Reading footer length (last 8 bytes)..."
# Footer length is a 4-byte little-endian uint32 just before the PAR1 magic
FOOTER_LEN_HEX=$(tail -c 8 test.parquet | head -c 4 | xxd -p)
FOOTER_LEN=$((16#$(echo "$FOOTER_LEN_HEX" | sed 's/\(..\)\(..\)\(..\)\(..\)/\4\3\2\1/')))
echo "Footer length: $FOOTER_LEN bytes (raw LE hex: $FOOTER_LEN_HEX)"
# Get the highest offset from column metadata
echo ""
echo "Examining column chunk offsets from metadata..."
parquet-tools meta test.parquet > meta.txt 2>&1 || true
cat meta.txt
echo ""
echo "Analyzing offset pattern..."
grep -i "offset" meta.txt || echo "No offset info"
echo ""
echo "Expected file size based on Parquet metadata:"
echo " If Parquet reader expects $((FILE_SIZE + 78)) bytes,"
echo " then column chunks claim offsets beyond actual data"
echo ""
echo "=== Download the file as artifact for local analysis ==="
ls -lh test.parquet
else
echo "FAILED: Could not download chunk"
fi
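The size reconciliation the diff is driving at can be written down directly (a sketch; `MAX_END` stands for the largest `offset + compressed_size` claimed by any column chunk, which the workflow does not extract yet):

```shell
# Expected total file size = end of last column chunk
#                          + footer length + 8 trailer bytes
# (4-byte footer length field + 4-byte "PAR1" magic).
expected_size() {  # usage: expected_size MAX_END FOOTER_LEN
  echo $(( $1 + $2 + 8 ))
}
```

If `expected_size` comes out exactly 78 bytes above the actual size, the writer recorded offsets (or a footer) for data it never flushed.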
