Issue: Files written successfully but truncated when read back
Error: 'EOFException: Reached the end of stream. Still have: 78 bytes left'
Root cause: Potential race condition between write completion and read
- File metadata updated before all chunks fully flushed
- Spark immediately reads after write without ensuring sync
- Parquet reader gets incomplete file
Solutions applied:
1. Disable filesystem cache to avoid stale file handles
- spark.hadoop.fs.seaweedfs.impl.disable.cache=true
2. Enable explicit flush/sync on write (if supported by client)
- spark.hadoop.fs.seaweed.write.flush.sync=true
3. Add SPARK_SUBMIT_OPTS for cache disabling
These settings ensure:
- Files are fully flushed before close() returns
- No cached file handles with stale metadata
- Fresh reads always get current file state
Note: If issue persists, may need to add explicit delay between
write and read, or investigate seaweedfs-hadoop3-client flush behavior.
- Set Parquet I/O loggers to OFF (completely disabled)
- Add log4j.configuration system property to ensure config is used
- Override Spark's default log4j configuration
- Prevents thousands of record-level DEBUG messages in CI logs
- Set org.apache.parquet to WARN level
- Set org.apache.parquet.io to ERROR level
- Suppress RecordConsumerLoggingWrapper and MessageColumnIO DEBUG logs
- Reduces CI log noise from thousands of record-level messages
- Keeps important error messages visible
1. Add explicit permissions (least privilege):
- contents: read
- checks: write (for test reports)
- pull-requests: write (for PR comments)
2. Extract duplicate build steps into shared 'build-deps' job:
- Eliminates duplication between spark-tests and spark-example
- Build artifacts are uploaded and reused by dependent jobs
- Reduces CI time and ensures consistency
3. Fix spark-example service startup verification:
- Match robust approach from spark-tests job
- Add explicit timeout and failure handling
- Verify all services (master, volume, filer)
- Include diagnostic logging on failure
- Prevents silent failures and obscure errors
These changes improve maintainability, security, and reliability
of the Spark integration test workflow.
- In testLargeDataset(), add orderBy("value") before calling first()
- Parquet files don't guarantee row order, so first() on unordered
DataFrame can return any row, making assertions flaky
- Sorting by 'value' ensures the first row is always the one with
value=0, making the test deterministic and reliable