
feat: upgrade Apache Parquet to 1.16.0 to fix EOFException

Upgrading from Parquet 1.13.1 (bundled with Spark 3.5.0) to 1.16.0.

Root cause analysis showed:
- Parquet writes 684 or 696 bytes in total, depending on the file (confirmed via totalBytesWritten)
- But Parquet's footer claims the file should be 762 or 774 bytes, respectively
- Consistent 78-byte discrepancy across all files (762 - 684 = 78, 774 - 696 = 78)
- This points to a Parquet writer bug in file size calculation

Parquet 1.16.0 changelog includes:
- Multiple fixes for compressed file handling
- Improved footer metadata accuracy
- Better handling of column statistics
- Fixes for Snappy compression edge cases

Test approach:
1. Keep Spark 3.5.0 (stable, known good)
2. Override transitive Parquet dependencies to 1.16.0
3. If this fixes the issue, great!
4. If not, consider upgrading Spark to 4.0.1
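
To make step 3 checkable, the same tests can be run against the old Parquet version and compared. A minimal sketch, assuming a hypothetical profile id `parquet-baseline` (not part of this commit) that resets the `parquet.version` property to the 1.13.1 release bundled with Spark 3.5.0:

```xml
<!-- Hypothetical profile, not part of this commit: re-run the tests on the old Parquet version. -->
<profiles>
    <profile>
        <id>parquet-baseline</id>
        <properties>
            <!-- Properties of an active profile override the project-level <properties> value. -->
            <parquet.version>1.13.1</parquet.version>
        </properties>
    </profile>
</profiles>
```

Running `mvn -Pparquet-baseline test` and then plain `mvn test` would show whether the 78-byte EOFException follows the Parquet version. `mvn dependency:tree -Dincludes=org.apache.parquet` lists which Parquet versions were actually resolved in each case.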

References:
- Latest Parquet: https://downloads.apache.org/parquet/apache-parquet-1.16.0/
- Parquet format: 2.12.0 (latest)

This should resolve the 'Still have: 78 bytes left' EOFException.
pull/7526/head
chrislu 6 days ago
parent
commit
12504dc1a6
test/java/spark/pom.xml (39 lines changed)

@@ -23,6 +23,8 @@
         <seaweedfs.hadoop3.client.version>3.80.1-SNAPSHOT</seaweedfs.hadoop3.client.version>
         <jackson.version>2.15.3</jackson.version>
         <netty.version>4.1.125.Final</netty.version>
+        <parquet.version>1.16.0</parquet.version>
+        <parquet.format.version>2.12.0</parquet.format.version>
         <surefire.jvm.args>
             -Xmx2g
             -Dhadoop.home.dir=/tmp
@@ -183,6 +185,43 @@
                 <version>3.6.0</version>
             </dependency>
+            <!-- Apache Parquet - Upgrade to latest for bug fixes -->
+            <dependency>
+                <groupId>org.apache.parquet</groupId>
+                <artifactId>parquet-common</artifactId>
+                <version>${parquet.version}</version>
+            </dependency>
+            <dependency>
+                <groupId>org.apache.parquet</groupId>
+                <artifactId>parquet-encoding</artifactId>
+                <version>${parquet.version}</version>
+            </dependency>
+            <dependency>
+                <groupId>org.apache.parquet</groupId>
+                <artifactId>parquet-column</artifactId>
+                <version>${parquet.version}</version>
+            </dependency>
+            <dependency>
+                <groupId>org.apache.parquet</groupId>
+                <artifactId>parquet-hadoop</artifactId>
+                <version>${parquet.version}</version>
+            </dependency>
+            <dependency>
+                <groupId>org.apache.parquet</groupId>
+                <artifactId>parquet-avro</artifactId>
+                <version>${parquet.version}</version>
+            </dependency>
+            <dependency>
+                <groupId>org.apache.parquet</groupId>
+                <artifactId>parquet-format-structures</artifactId>
+                <version>${parquet.version}</version>
+            </dependency>
+            <dependency>
+                <groupId>org.apache.parquet</groupId>
+                <artifactId>parquet-format</artifactId>
+                <version>${parquet.format.version}</version>
+            </dependency>
         </dependencies>
     </dependencyManagement>
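
The entries above sit in `dependencyManagement`, so they pin versions without adding anything to the classpath. Parquet artifacts that arrive transitively through Spark should resolve to the managed version with no further changes. If a module needs Parquet classes directly, a declaration along these lines would pick up the pinned version. This is a sketch only; the choice of `parquet-hadoop` as the direct dependency is an assumption, not something this commit adds:

```xml
<!-- Illustrative only: a direct dependency that inherits its version from dependencyManagement. -->
<dependencies>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-hadoop</artifactId>
        <!-- No <version> element: Maven resolves it from the managed ${parquet.version}. -->
    </dependency>
</dependencies>
```

Leaving the version out keeps `parquet.version` as the single place to change when testing a different Parquet release.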
