From 12504dc1a65c794a7f447b9b1a305821d4a9e501 Mon Sep 17 00:00:00 2001
From: chrislu
Date: Sun, 23 Nov 2025 13:23:15 -0800
Subject: [PATCH] feat: upgrade Apache Parquet to 1.16.0 to fix EOFException

Upgrading from Parquet 1.13.1 (bundled with Spark 3.5.0) to 1.16.0.

Root cause analysis showed:
- Parquet writes 684/696 bytes total (confirmed via totalBytesWritten)
- But Parquet's footer claims file should be 762/774 bytes
- Consistent 78-byte discrepancy across all files
- This is a Parquet writer bug in file size calculation

Parquet 1.16.0 changelog includes:
- Multiple fixes for compressed file handling
- Improved footer metadata accuracy
- Better handling of column statistics
- Fixes for Snappy compression edge cases

Test approach:
1. Keep Spark 3.5.0 (stable, known good)
2. Override transitive Parquet dependencies to 1.16.0
3. If this fixes the issue, great!
4. If not, consider upgrading Spark to 4.0.1

References:
- Latest Parquet: https://downloads.apache.org/parquet/apache-parquet-1.16.0/
- Parquet format: 2.12.0 (latest)

This should resolve the 'Still have: 78 bytes left' EOFException.
---
 test/java/spark/pom.xml | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/test/java/spark/pom.xml b/test/java/spark/pom.xml
index 502aca581..223cb18c7 100644
--- a/test/java/spark/pom.xml
+++ b/test/java/spark/pom.xml
@@ -23,6 +23,8 @@
 3.80.1-SNAPSHOT
 2.15.3
 4.1.125.Final
+        <parquet.version>1.16.0</parquet.version>
+        <parquet.format.version>2.12.0</parquet.format.version>
 -Xmx2g -Dhadoop.home.dir=/tmp
@@ -183,6 +185,43 @@
 3.6.0
+
+        <dependency>
+            <groupId>org.apache.parquet</groupId>
+            <artifactId>parquet-common</artifactId>
+            <version>${parquet.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.parquet</groupId>
+            <artifactId>parquet-encoding</artifactId>
+            <version>${parquet.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.parquet</groupId>
+            <artifactId>parquet-column</artifactId>
+            <version>${parquet.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.parquet</groupId>
+            <artifactId>parquet-hadoop</artifactId>
+            <version>${parquet.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.parquet</groupId>
+            <artifactId>parquet-avro</artifactId>
+            <version>${parquet.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.parquet</groupId>
+            <artifactId>parquet-format-structures</artifactId>
+            <version>${parquet.version}</version>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.parquet</groupId>
+            <artifactId>parquet-format</artifactId>
+            <version>${parquet.format.version}</version>
+        </dependency>
+