Mirror/seaweedfs - seaweedfs

Author	SHA1	Message	Date
chrislu	780a1fd059	fix: add file sync and cache settings to prevent EOF on read. Issue: files are written successfully but truncated when read back. Error: 'EOFException: Reached the end of stream. Still have: 78 bytes left'. Root cause: potential race condition between write completion and read: file metadata is updated before all chunks are fully flushed; Spark reads immediately after the write without ensuring sync; the Parquet reader gets an incomplete file. Solutions applied: 1. Disable filesystem cache to avoid stale file handles (spark.hadoop.fs.seaweedfs.impl.disable.cache=true). 2. Enable explicit flush/sync on write, if supported by client (spark.hadoop.fs.seaweed.write.flush.sync=true). 3. Add SPARK_SUBMIT_OPTS for cache disabling. These settings ensure: files are fully flushed before close() returns; no cached file handles with stale metadata; fresh reads always get the current file state. Note: if the issue persists, may need to add an explicit delay between write and read, or investigate seaweedfs-hadoop3-client flush behavior.	4 months ago
chrislu	342705c99e	fmt	4 months ago
chrislu	150deefdc0	fix: aggressively suppress Parquet DEBUG logging. Set Parquet I/O loggers to OFF (completely disabled); add the log4j.configuration system property to ensure the config is used; override Spark's default log4j configuration. Prevents thousands of record-level DEBUG messages in CI logs.	4 months ago
chrislu	707e7732a7	fix: suppress verbose Parquet DEBUG logging. Set org.apache.parquet to WARN level; set org.apache.parquet.io to ERROR level; suppress RecordConsumerLoggingWrapper and MessageColumnIO DEBUG logs. Reduces CI log noise from thousands of record-level messages while keeping important error messages visible.	4 months ago
chrislu	786f5de7bb	ci: refactor Spark workflow for DRY and robustness. 1. Add explicit permissions (least privilege): contents: read; checks: write (for test reports); pull-requests: write (for PR comments). 2. Extract duplicate build steps into a shared 'build-deps' job: eliminates duplication between spark-tests and spark-example; build artifacts are uploaded and reused by dependent jobs; reduces CI time and ensures consistency. 3. Fix spark-example service startup verification: match the robust approach from the spark-tests job; add explicit timeout and failure handling; verify all services (master, volume, filer); include diagnostic logging on failure; prevents silent failures and obscure errors. These changes improve maintainability, security, and reliability of the Spark integration test workflow.	4 months ago
chrislu	b35463c8b4	spark: fix flaky test by sorting DataFrame before first(). In testLargeDataset(), add orderBy("value") before calling first(). Parquet files don't guarantee row order, so first() on an unordered DataFrame can return any row, making assertions flaky. Sorting by 'value' ensures the first row is always the one with value=0, making the test deterministic and reliable.	4 months ago
chrislu	89a6d42cee	Complete Spark integration test suite	4 months ago
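
The two Spark properties named in commit 780a1fd059 could be passed to a job roughly as sketched below. This is a minimal illustration, not the repository's actual setup. The helper name `to_spark_submit_args` and the dictionary name are hypothetical. Only the two property keys come from the commit message, which itself notes that the flush/sync key applies only "if supported by client". The SPARK_SUBMIT_OPTS value from item 3 is not given in the commit, so it is left out.

```python
# Sketch only: render the settings listed in commit 780a1fd059 as
# spark-submit "--conf key=value" arguments. The helper and variable names
# are hypothetical; the property keys are copied from the commit message.

SEAWEEDFS_READ_AFTER_WRITE_CONF = {
    # Avoid reusing a cached FileSystem instance with stale file handles.
    "spark.hadoop.fs.seaweedfs.impl.disable.cache": "true",
    # Request explicit flush/sync on write (per the commit: only if the
    # client supports it).
    "spark.hadoop.fs.seaweed.write.flush.sync": "true",
}


def to_spark_submit_args(conf):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments as they would appear on a spark-submit command line.
    print(" ".join(to_spark_submit_args(SEAWEEDFS_READ_AFTER_WRITE_CONF)))
```

Keeping the settings in one mapping makes it easy to drop the flush/sync key if the client turns out not to honor it.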