14 changed files with 1942 additions and 0 deletions

     31  test/java/spark/.gitignore
    273  test/java/spark/CI_SETUP.md
     76  test/java/spark/Makefile
    361  test/java/spark/README.md
     72  test/java/spark/docker-compose.yml
    159  test/java/spark/pom.xml
    147  test/java/spark/quick-start.sh
     44  test/java/spark/run-tests.sh
    143  test/java/spark/src/main/java/seaweed/spark/SparkSeaweedFSExample.java
    216  test/java/spark/src/test/java/seaweed/spark/SparkReadWriteTest.java
    236  test/java/spark/src/test/java/seaweed/spark/SparkSQLTest.java
    126  test/java/spark/src/test/java/seaweed/spark/SparkTestBase.java
     18  test/java/spark/src/test/resources/log4j.properties
     40  test/java/spark/test-one.sh
test/java/spark/.gitignore
@@ -0,0 +1,31 @@
# Maven
target/
pom.xml.tag
pom.xml.releaseBackup
pom.xml.versionsBackup
pom.xml.next
release.properties
dependency-reduced-pom.xml
buildNumber.properties
.mvn/timing.properties

# IDE
.idea/
*.iml
.vscode/
.classpath
.project
.settings/

# Spark
spark-warehouse/
metastore_db/
derby.log

# Logs
*.log

# OS
.DS_Store
Thumbs.db
test/java/spark/CI_SETUP.md
@@ -0,0 +1,273 @@
# GitHub Actions CI/CD Setup

## Overview

The Spark integration tests are configured to run automatically via GitHub Actions.

## Workflow File

**Location**: `.github/workflows/spark-integration-tests.yml`

## Triggers

The workflow runs automatically on:

1. **Push to master/main** - When code is pushed to the main branches
2. **Pull Requests** - When PRs target master/main
3. **Manual Trigger** - Via workflow_dispatch in the GitHub UI

The workflow only runs when changes are detected in:
- `test/java/spark/**`
- `other/java/hdfs2/**`
- `other/java/hdfs3/**`
- `other/java/client/**`
- The workflow file itself
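In the workflow file, these triggers and path filters would look roughly like the following (a sketch for orientation; consult `.github/workflows/spark-integration-tests.yml` for the exact configuration):

```yaml
on:
  push:
    branches: [master, main]
    paths:
      - 'test/java/spark/**'
      - 'other/java/hdfs2/**'
      - 'other/java/hdfs3/**'
      - 'other/java/client/**'
      - '.github/workflows/spark-integration-tests.yml'
  pull_request:
    branches: [master, main]
    paths:
      - 'test/java/spark/**'
      - 'other/java/hdfs2/**'
      - 'other/java/hdfs3/**'
      - 'other/java/client/**'
      - '.github/workflows/spark-integration-tests.yml'
  workflow_dispatch:
```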

## Jobs

### Job 1: spark-tests (Required)
**Duration**: ~5-10 minutes

Steps:
1. ✓ Checkout code
2. ✓ Setup JDK 11
3. ✓ Start SeaweedFS (master, volume, filer)
4. ✓ Build project
5. ✓ Run all integration tests (10 tests)
6. ✓ Upload test results
7. ✓ Publish test report
8. ✓ Cleanup

**Test Coverage**:
- SparkReadWriteTest: 6 tests
- SparkSQLTest: 4 tests

### Job 2: spark-example (Optional)
**Duration**: ~5 minutes
**Runs**: Only on push/manual trigger (not on PRs)

Steps:
1. ✓ Checkout code
2. ✓ Setup JDK 11
3. ✓ Download Apache Spark 3.5.0 (cached)
4. ✓ Start SeaweedFS
5. ✓ Build project
6. ✓ Run example Spark application
7. ✓ Verify output
8. ✓ Cleanup

### Job 3: summary (Status Check)
**Duration**: < 1 minute

Provides an overall test status summary.

## Viewing Results

### In GitHub UI

1. Go to the **Actions** tab in your GitHub repository
2. Click on the **Spark Integration Tests** workflow
3. View individual workflow runs
4. Check test reports and logs

### Status Badge

Add this badge to your README.md to show the workflow status:

```markdown
[![Spark Integration Tests](https://github.com/seaweedfs/seaweedfs/actions/workflows/spark-integration-tests.yml/badge.svg)](https://github.com/seaweedfs/seaweedfs/actions/workflows/spark-integration-tests.yml)
```

### Test Reports

After each run:
- Test results are uploaded as artifacts (retained for 30 days)
- Detailed JUnit reports are published
- Logs are available for each step

## Configuration

### Environment Variables

Set in the workflow:
```yaml
env:
  SEAWEEDFS_TEST_ENABLED: true
  SEAWEEDFS_FILER_HOST: localhost
  SEAWEEDFS_FILER_PORT: 8888
  SEAWEEDFS_FILER_GRPC_PORT: 18888
```

### Timeouts

- spark-tests job: 30 minutes max
- spark-example job: 20 minutes max

## Troubleshooting CI Failures

### SeaweedFS Connection Issues

**Symptom**: Tests fail with connection refused

**Check**:
1. View the SeaweedFS logs in the workflow output
2. Look for the "Display SeaweedFS logs on failure" step
3. Verify the health check succeeded

**Solution**: The workflow already includes retry logic and health checks.

### Test Failures

**Symptom**: Tests pass locally but fail in CI

**Check**:
1. Download test artifacts from the workflow run
2. Review the detailed surefire reports
3. Check for timing issues or resource constraints

**Common Issues**:
- Docker startup timing (already handled with 30 retries)
- Network issues (retry logic included)
- Resource limits (CI has sufficient memory)

### Build Failures

**Symptom**: Maven build fails

**Check**:
1. Verify dependencies are available
2. Check the Maven cache
3. Review the build logs

### Example Application Failures

**Note**: This job is optional and only runs on push/manual trigger.

**Check**:
1. Verify Spark was downloaded and cached correctly
2. Check the spark-submit logs
3. Verify the SeaweedFS output directory

## Manual Workflow Trigger

To manually run the workflow:

1. Go to the **Actions** tab
2. Select **Spark Integration Tests**
3. Click the **Run workflow** button
4. Select a branch
5. Click **Run workflow**

This is useful for:
- Testing changes before pushing
- Re-running failed tests
- Testing with different configurations

## Local Testing Matching CI

To run tests locally in an environment that matches CI:

```bash
# Use the same Docker setup as CI
cd test/java/spark
docker-compose up -d seaweedfs-master seaweedfs-volume seaweedfs-filer

# Wait for services (same as CI)
for i in {1..30}; do
  curl -f http://localhost:8888/ && break
  sleep 2
done

# Run tests (same environment variables as CI)
export SEAWEEDFS_TEST_ENABLED=true
export SEAWEEDFS_FILER_HOST=localhost
export SEAWEEDFS_FILER_PORT=8888
export SEAWEEDFS_FILER_GRPC_PORT=18888
mvn test -B

# Cleanup
docker-compose down -v
```
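The 30×2-second polling loop above is the same wait-for-service pattern the workflow uses. A small Python sketch of it, for illustration only (the function and parameter names are made up, not from the test suite):

```python
import time

def wait_for(check, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll `check` up to `attempts` times, `delay` seconds apart.

    Mirrors the shell loop above: return True as soon as the
    service answers, False if it never comes up in time.
    """
    for _ in range(attempts):
        if check():
            return True
        sleep(delay)
    return False

# Simulate a filer that only answers on the third probe,
# without actually sleeping between attempts.
probes = {"n": 0}
def fake_check():
    probes["n"] += 1
    return probes["n"] >= 3

ready = wait_for(fake_check, sleep=lambda _: None)
```

With 30 attempts and a 2-second delay, the worst case is a one-minute wait before the loop gives up, which matches the CI behavior described above.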

## Maintenance

### Updating the Spark Version

To update to a newer Spark version:

1. Update `pom.xml`: change `<spark.version>`
2. Update the workflow: change the Spark download URL
3. Test locally first
4. Create a PR to test in CI

### Updating the Java Version

1. Update `pom.xml`: change `<maven.compiler.source>` and `<maven.compiler.target>`
2. Update the workflow: change the JDK version in the `setup-java` steps
3. Test locally
4. Update the README with the new requirements

### Adding New Tests

New test classes are automatically discovered and run by the workflow.
Just ensure they:
- Extend `SparkTestBase`
- Use `skipIfTestsDisabled()`
- Are in the correct package

## CI Performance

### Typical Run Times

| Job | Duration | Can Fail Build? |
|-----|----------|-----------------|
| spark-tests | 5-10 min | Yes |
| spark-example | 5 min | No (optional) |
| summary | < 1 min | Only if tests fail |

### Optimizations

The workflow includes:
- ✓ Maven dependency caching
- ✓ Spark binary caching
- ✓ Parallel job execution
- ✓ Smart path filtering
- ✓ Docker layer caching

### Resource Usage

- Memory: ~4GB per job
- Disk: ~2GB (cached)
- Network: ~500MB (first run)

## Security Considerations

- No secrets required (tests use default ports)
- Runs in an isolated Docker environment
- Cleanup removes all test data
- No external services are accessed

## Future Enhancements

Potential improvements:
- [ ] Matrix testing (multiple Spark versions)
- [ ] Performance benchmarking
- [ ] Code coverage reporting
- [ ] Integration with larger datasets
- [ ] Multi-node Spark cluster testing

## Support

If CI tests fail:

1. Check the workflow logs in GitHub Actions
2. Download test artifacts for detailed reports
3. Try reproducing locally using the "Local Testing Matching CI" section above
4. Review recent changes in the failing paths
5. Check the SeaweedFS logs in the workflow output

For persistent issues:
- Open an issue with a link to the workflow run
- Include the test failure logs
- Note whether it passes locally
test/java/spark/Makefile
@@ -0,0 +1,76 @@
.PHONY: help build test test-local test-docker clean run-example docker-up docker-down verify-seaweedfs

help:
	@echo "SeaweedFS Spark Integration Tests"
	@echo ""
	@echo "Available targets:"
	@echo "  build        - Build the project"
	@echo "  test         - Run integration tests (requires SeaweedFS running)"
	@echo "  test-local   - Run tests against local SeaweedFS"
	@echo "  test-docker  - Run tests in Docker with SeaweedFS"
	@echo "  run-example  - Run the example Spark application"
	@echo "  docker-up    - Start SeaweedFS in Docker"
	@echo "  docker-down  - Stop SeaweedFS Docker containers"
	@echo "  clean        - Clean build artifacts"

build:
	mvn clean package

# Each recipe line runs in its own shell, so an `export` in one line does
# not carry over to the next; set the variable on the mvn invocation itself.
test:
	SEAWEEDFS_TEST_ENABLED=$${SEAWEEDFS_TEST_ENABLED:-true} mvn test

test-local:
	@echo "Testing against local SeaweedFS (localhost:8888)..."
	./run-tests.sh

test-docker:
	@echo "Running tests in Docker..."
	docker-compose up --build --abort-on-container-exit spark-tests
	docker-compose down

docker-up:
	@echo "Starting SeaweedFS in Docker..."
	docker-compose up -d seaweedfs-master seaweedfs-volume seaweedfs-filer
	@echo "Waiting for services to be ready..."
	@sleep 5
	@echo "SeaweedFS is ready!"
	@echo "  Master: http://localhost:9333"
	@echo "  Filer:  http://localhost:8888"

docker-down:
	@echo "Stopping SeaweedFS Docker containers..."
	docker-compose down -v

run-example:
	@echo "Running example application..."
	@if ! command -v spark-submit > /dev/null; then \
		echo "Error: spark-submit not found. Please install Apache Spark."; \
		exit 1; \
	fi
	spark-submit \
		--class seaweed.spark.SparkSeaweedFSExample \
		--master local[2] \
		--conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
		--conf spark.hadoop.fs.seaweed.filer.host=localhost \
		--conf spark.hadoop.fs.seaweed.filer.port=8888 \
		--conf spark.hadoop.fs.seaweed.filer.port.grpc=18888 \
		--conf spark.hadoop.fs.seaweed.replication="" \
		target/seaweedfs-spark-integration-tests-1.0-SNAPSHOT.jar \
		seaweedfs://localhost:8888/spark-example-output

clean:
	mvn clean
	@echo "Build artifacts cleaned"

verify-seaweedfs:
	@echo "Verifying SeaweedFS connection..."
	@curl -f http://localhost:8888/ > /dev/null 2>&1 && \
		echo "✓ SeaweedFS filer is accessible" || \
		(echo "✗ SeaweedFS filer is not accessible at http://localhost:8888"; exit 1)

.DEFAULT_GOAL := help
test/java/spark/README.md
@@ -0,0 +1,361 @@
# SeaweedFS Spark Integration Tests

Comprehensive integration tests for Apache Spark with the SeaweedFS HDFS client.

## Overview

This test suite validates that Apache Spark works correctly with SeaweedFS as the storage backend, covering:

- **Data I/O**: Reading and writing data in various formats (Parquet, CSV, JSON)
- **Spark SQL**: Complex SQL queries, joins, aggregations, and window functions
- **Partitioning**: Partitioned writes and partition pruning
- **Performance**: Large dataset operations

## Prerequisites

### 1. Running SeaweedFS

Start SeaweedFS with the default ports:

```bash
# Terminal 1: Start master
weed master

# Terminal 2: Start volume server
weed volume -mserver=localhost:9333

# Terminal 3: Start filer
weed filer -master=localhost:9333
```

Verify the services are running:
- Master: http://localhost:9333
- Filer HTTP: http://localhost:8888
- Filer gRPC: localhost:18888

### 2. Java and Maven

- Java 8 or higher
- Maven 3.6 or higher

### 3. Apache Spark (for standalone execution)

Download and extract Apache Spark 3.5.0:

```bash
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar xzf spark-3.5.0-bin-hadoop3.tgz
export SPARK_HOME=$(pwd)/spark-3.5.0-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
```

## Building

```bash
mvn clean package
```

This creates:
- Fat JAR (shaded, with dependencies): `target/seaweedfs-spark-integration-tests-1.0-SNAPSHOT.jar`
- Thin JAR (the pre-shade original): `target/original-seaweedfs-spark-integration-tests-1.0-SNAPSHOT.jar`
|||
## Running Integration Tests |
|||
|
|||
### Quick Test |
|||
|
|||
Run all integration tests (requires running SeaweedFS): |
|||
|
|||
```bash |
|||
# Enable integration tests |
|||
export SEAWEEDFS_TEST_ENABLED=true |
|||
|
|||
# Run all tests |
|||
mvn test |
|||
``` |
|||
|
|||
### Run Specific Test |
|||
|
|||
```bash |
|||
export SEAWEEDFS_TEST_ENABLED=true |
|||
|
|||
# Run only read/write tests |
|||
mvn test -Dtest=SparkReadWriteTest |
|||
|
|||
# Run only SQL tests |
|||
mvn test -Dtest=SparkSQLTest |
|||
``` |
|||
|
|||
### Custom SeaweedFS Configuration |
|||
|
|||
If your SeaweedFS is running on a different host or port: |
|||
|
|||
```bash |
|||
export SEAWEEDFS_TEST_ENABLED=true |
|||
export SEAWEEDFS_FILER_HOST=my-seaweedfs-host |
|||
export SEAWEEDFS_FILER_PORT=8888 |
|||
export SEAWEEDFS_FILER_GRPC_PORT=18888 |
|||
|
|||
mvn test |
|||
``` |
|||
|
|||
### Skip Tests |
|||
|
|||
By default, tests are skipped if `SEAWEEDFS_TEST_ENABLED` is not set: |
|||
|
|||
```bash |
|||
mvn test # Tests will be skipped with message |
|||
``` |
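The skip gate boils down to a single environment-variable check. A Python sketch of the equivalent logic, for illustration only (the real check lives in `SparkTestBase`'s `skipIfTestsDisabled()`; this helper name is made up):

```python
import os

def tests_enabled(env=None):
    """Return True only when SEAWEEDFS_TEST_ENABLED is set to "true"."""
    env = os.environ if env is None else env
    return env.get("SEAWEEDFS_TEST_ENABLED", "").strip().lower() == "true"

enabled = tests_enabled({"SEAWEEDFS_TEST_ENABLED": "true"})
skipped = not tests_enabled({})  # variable unset: tests are skipped
```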

## Running the Example Application

### Local Mode

Run the example application in Spark local mode:

```bash
spark-submit \
  --class seaweed.spark.SparkSeaweedFSExample \
  --master local[2] \
  --conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
  --conf spark.hadoop.fs.seaweed.filer.host=localhost \
  --conf spark.hadoop.fs.seaweed.filer.port=8888 \
  --conf spark.hadoop.fs.seaweed.filer.port.grpc=18888 \
  --conf spark.hadoop.fs.seaweed.replication="" \
  target/seaweedfs-spark-integration-tests-1.0-SNAPSHOT.jar \
  seaweedfs://localhost:8888/spark-example-output
```

### Cluster Mode

For production Spark clusters:

```bash
spark-submit \
  --class seaweed.spark.SparkSeaweedFSExample \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
  --conf spark.hadoop.fs.seaweed.filer.host=seaweedfs-filer \
  --conf spark.hadoop.fs.seaweed.filer.port=8888 \
  --conf spark.hadoop.fs.seaweed.filer.port.grpc=18888 \
  --conf spark.hadoop.fs.seaweed.replication=001 \
  --conf spark.executor.instances=4 \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  target/seaweedfs-spark-integration-tests-1.0-SNAPSHOT.jar \
  seaweedfs://seaweedfs-filer:8888/spark-output
```

## Configuration

### SeaweedFS Configuration Options

Configure Spark to use SeaweedFS through the Hadoop configuration:

| Property | Description | Default | Example |
|----------|-------------|---------|---------|
| `spark.hadoop.fs.seaweedfs.impl` | FileSystem implementation class | - | `seaweed.hdfs.SeaweedFileSystem` |
| `spark.hadoop.fs.seaweed.filer.host` | SeaweedFS filer hostname | `localhost` | `seaweedfs-filer` |
| `spark.hadoop.fs.seaweed.filer.port` | SeaweedFS filer HTTP port | `8888` | `8888` |
| `spark.hadoop.fs.seaweed.filer.port.grpc` | SeaweedFS filer gRPC port | `18888` | `18888` |
| `spark.hadoop.fs.seaweed.replication` | Replication strategy | (uses HDFS default) | `001`, `""` (filer default) |
| `spark.hadoop.fs.seaweed.buffer.size` | Buffer size for I/O | `4MB` | `8388608` |

### Replication Configuration Priority

1. **Non-empty value** (e.g., `001`) - uses that specific replication
2. **Empty string** (`""`) - uses the SeaweedFS filer's default replication
3. **Not configured** - uses Hadoop/Spark's replication parameter
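That priority order can be expressed as a tiny resolution function. A Python sketch for illustration only (the real logic lives in the Java HDFS client; these names are made up):

```python
def resolve_replication(configured, hadoop_replication):
    """Pick the effective replication per the priority order above.

    configured         -- value of fs.seaweed.replication, or None if unset
    hadoop_replication -- the replication Hadoop/Spark passed in
    Returns the replication string to use, where "" means
    "let the SeaweedFS filer apply its default".
    """
    if configured is None:   # 3. not configured: defer to Hadoop/Spark
        return hadoop_replication
    return configured        # 1./2. explicit value, or "" for filer default

# Priority 1: an explicit value wins
assert resolve_replication("001", "000") == "001"
# Priority 2: empty string defers to the filer default
assert resolve_replication("", "000") == ""
# Priority 3: unset falls back to Hadoop/Spark's parameter
assert resolve_replication(None, "000") == "000"
```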

## Test Coverage

### SparkReadWriteTest

- ✓ Write and read Parquet files
- ✓ Write and read CSV files with headers
- ✓ Write and read JSON files
- ✓ Partitioned data writes with partition pruning
- ✓ Append mode operations
- ✓ Large dataset handling (10,000+ rows)

### SparkSQLTest

- ✓ Create tables and run SELECT queries
- ✓ Aggregation queries (GROUP BY, SUM, AVG)
- ✓ JOIN operations between datasets
- ✓ Window functions (RANK, PARTITION BY)

## Continuous Integration

### GitHub Actions

A GitHub Actions workflow is configured at `.github/workflows/spark-integration-tests.yml` that automatically:
- Runs on push/PR to `master`/`main` when Spark or HDFS code changes
- Starts SeaweedFS in Docker
- Runs all integration tests
- Runs the example application
- Uploads test reports
- Can be triggered manually via workflow_dispatch

The workflow includes two jobs:
1. **spark-tests**: Runs all integration tests (10 tests)
2. **spark-example**: Runs the example Spark application

View the workflow status in the GitHub Actions tab of the repository.

### CI-Friendly Test Execution

```bash
# In a CI environment
./scripts/start-seaweedfs.sh   # Start SeaweedFS in the background
export SEAWEEDFS_TEST_ENABLED=true
mvn clean test
./scripts/stop-seaweedfs.sh    # Cleanup
```

### Docker-Based Testing

Use docker-compose for isolated testing:

```bash
docker-compose up -d seaweedfs-master seaweedfs-volume seaweedfs-filer
export SEAWEEDFS_TEST_ENABLED=true
mvn test
docker-compose down
```

## Troubleshooting

### Tests Are Skipped

**Symptom**: Tests show "Skipping test - SEAWEEDFS_TEST_ENABLED not set"

**Solution**:
```bash
export SEAWEEDFS_TEST_ENABLED=true
mvn test
```

### Connection Refused Errors

**Symptom**: `java.net.ConnectException: Connection refused`

**Solution**:
1. Verify SeaweedFS is running:
   ```bash
   curl http://localhost:8888/
   ```

2. Check whether the ports are accessible:
   ```bash
   netstat -an | grep 8888
   netstat -an | grep 18888
   ```

### ClassNotFoundException: seaweed.hdfs.SeaweedFileSystem

**Symptom**: Spark cannot find the SeaweedFS FileSystem implementation

**Solution**:
1. Ensure the SeaweedFS HDFS client is on your classpath
2. For spark-submit, add the JAR:
   ```bash
   spark-submit --jars /path/to/seaweedfs-hadoop3-client-*.jar ...
   ```

### Out of Memory Errors

**Symptom**: `java.lang.OutOfMemoryError: Java heap space`

**Solution**:
```bash
mvn test -DargLine="-Xmx4g"
```

For spark-submit:
```bash
spark-submit --driver-memory 4g --executor-memory 4g ...
```

### gRPC Version Conflicts

**Symptom**: `java.lang.NoSuchMethodError` related to gRPC

**Solution**: Ensure consistent gRPC versions. The project uses Spark 3.5.0-compatible versions.

## Performance Tips

1. **Increase the buffer size** for large files:
   ```bash
   --conf spark.hadoop.fs.seaweed.buffer.size=8388608
   ```

2. **Use appropriate replication** for your cluster:
   ```bash
   --conf spark.hadoop.fs.seaweed.replication=001
   ```

3. **Enable partition pruning** by partitioning data on commonly filtered columns

4. **Use columnar formats** (Parquet) for better performance
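The byte counts these tips use are plain MiB arithmetic: `8388608` is 8 MiB, and the documented 4MB default corresponds to `4194304`:

```python
def mib(n):
    """Byte count for n mebibytes, as expected by fs.seaweed.buffer.size."""
    return n * 1024 * 1024

assert mib(8) == 8388608   # the value in tip 1 above
assert mib(4) == 4194304   # the documented 4MB default
```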

## Additional Examples

### PySpark with SeaweedFS

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkSeaweedFS") \
    .config("spark.hadoop.fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem") \
    .config("spark.hadoop.fs.seaweed.filer.host", "localhost") \
    .config("spark.hadoop.fs.seaweed.filer.port", "8888") \
    .config("spark.hadoop.fs.seaweed.filer.port.grpc", "18888") \
    .getOrCreate()

# Write data
df = spark.range(1000)
df.write.parquet("seaweedfs://localhost:8888/pyspark-output")

# Read data
df_read = spark.read.parquet("seaweedfs://localhost:8888/pyspark-output")
df_read.show()
```

### Scala with SeaweedFS

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ScalaSeaweedFS")
  .config("spark.hadoop.fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem")
  .config("spark.hadoop.fs.seaweed.filer.host", "localhost")
  .config("spark.hadoop.fs.seaweed.filer.port", "8888")
  .config("spark.hadoop.fs.seaweed.filer.port.grpc", "18888")
  .getOrCreate()

// Write data
val df = spark.range(1000)
df.write.parquet("seaweedfs://localhost:8888/scala-output")

// Read data
val dfRead = spark.read.parquet("seaweedfs://localhost:8888/scala-output")
dfRead.show()
```

## Contributing

When adding new tests:

1. Extend `SparkTestBase` for new test classes
2. Use `skipIfTestsDisabled()` in test methods
3. Clean up test data in tearDown
4. Add documentation to this README
5. Ensure the tests work in the CI environment

## License

Same as the SeaweedFS project.
test/java/spark/docker-compose.yml
@@ -0,0 +1,72 @@
version: '3.8'

services:
  seaweedfs-master:
    image: seaweedfs:test-grpc-v1.77.0
    container_name: seaweedfs-spark-master
    ports:
      - "9333:9333"
      - "19333:19333"
    command: "master -ip=seaweedfs-master -ip.bind=0.0.0.0"
    networks:
      - seaweedfs-spark

  seaweedfs-volume:
    image: seaweedfs:test-grpc-v1.77.0
    container_name: seaweedfs-spark-volume
    ports:
      - "8080:8080"
      - "18080:18080"
    command: "volume -mserver=seaweedfs-master:9333 -ip.bind=0.0.0.0 -port=8080"
    depends_on:
      - seaweedfs-master
    networks:
      - seaweedfs-spark

  seaweedfs-filer:
    image: seaweedfs:test-grpc-v1.77.0
    container_name: seaweedfs-spark-filer
    ports:
      - "8888:8888"
      - "18888:18888"
    command: ["filer", "-master=seaweedfs-master:9333", "-ip.bind=0.0.0.0", "-port=8888", "-port.grpc=18888"]
    depends_on:
      - seaweedfs-master
      - seaweedfs-volume
    networks:
      - seaweedfs-spark
    environment:
      - GRPC_GO_LOG_VERBOSITY_LEVEL=99
      - GRPC_GO_LOG_SEVERITY_LEVEL=info
      - GLOG_v=4
      - GLOG_logtostderr=true
      - GODEBUG=http2debug=2
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8888/"]
      interval: 5s
      timeout: 3s
      retries: 10
      start_period: 10s

  spark-tests:
    image: maven:3.9-eclipse-temurin-17
    container_name: seaweedfs-spark-tests
    network_mode: "host"
    volumes:
      - .:/workspace
      - ~/.m2:/root/.m2
    working_dir: /workspace
    environment:
      - SEAWEEDFS_TEST_ENABLED=true
      - SEAWEEDFS_FILER_HOST=localhost
      - SEAWEEDFS_FILER_PORT=8888
      - SEAWEEDFS_FILER_GRPC_PORT=18888
      - HADOOP_HOME=/tmp
    command: sh -c "sleep 30 && mvn test"
    mem_limit: 4g
    cpus: 2

networks:
  seaweedfs-spark:
    driver: bridge
test/java/spark/pom.xml
@@ -0,0 +1,159 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.seaweedfs</groupId>
    <artifactId>seaweedfs-spark-integration-tests</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>SeaweedFS Spark Integration Tests</name>
    <description>Integration tests for Apache Spark with the SeaweedFS HDFS client</description>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <spark.version>3.5.0</spark.version>
        <hadoop.version>3.3.6</hadoop.version>
        <scala.binary.version>2.12</scala.binary.version>
        <junit.version>4.13.2</junit.version>
    </properties>

    <dependencies>
        <!-- Spark Core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- Hadoop Client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- SeaweedFS Hadoop3 Client -->
        <dependency>
            <groupId>com.seaweedfs</groupId>
            <artifactId>seaweedfs-hadoop3-client</artifactId>
            <version>3.80</version>
        </dependency>

        <!-- Testing -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>

        <!-- Logging -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.36</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.36</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <skipTests>${skipTests}</skipTests>
                    <includes>
                        <include>**/*Test.java</include>
                    </includes>
                    <argLine>
                        -Xmx2g
                        -Dhadoop.home.dir=/tmp
                        --add-opens=java.base/java.lang=ALL-UNNAMED
                        --add-opens=java.base/java.lang.invoke=ALL-UNNAMED
                        --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
                        --add-opens=java.base/java.io=ALL-UNNAMED
                        --add-opens=java.base/java.net=ALL-UNNAMED
                        --add-opens=java.base/java.nio=ALL-UNNAMED
                        --add-opens=java.base/java.util=ALL-UNNAMED
                        --add-opens=java.base/java.util.concurrent=ALL-UNNAMED
                        --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
                        --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
                        --add-opens=java.base/sun.nio.cs=ALL-UNNAMED
                        --add-opens=java.base/sun.security.action=ALL-UNNAMED
                        --add-opens=java.base/sun.util.calendar=ALL-UNNAMED
                        --add-exports=java.base/sun.nio.ch=ALL-UNNAMED
                    </argLine>
                    <environmentVariables>
                        <HADOOP_HOME>/tmp</HADOOP_HOME>
                    </environmentVariables>
                </configuration>
            </plugin>

            <!-- Shade plugin to create a fat jar for spark-submit -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>seaweed.spark.SparkSeaweedFSExample</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
test/java/spark/quick-start.sh
@@ -0,0 +1,147 @@
#!/bin/bash

set -e

echo "=== SeaweedFS Spark Integration Tests Quick Start ==="
echo ""

# Check if SeaweedFS is running
check_seaweedfs() {
    echo "Checking if SeaweedFS is running..."
    if curl -f http://localhost:8888/ > /dev/null 2>&1; then
        echo "✓ SeaweedFS filer is accessible at http://localhost:8888"
        return 0
    else
        echo "✗ SeaweedFS filer is not accessible"
        return 1
    fi
}

# Start SeaweedFS with Docker if not running
start_seaweedfs() {
    echo ""
    echo "Starting SeaweedFS with Docker..."
    docker-compose up -d seaweedfs-master seaweedfs-volume seaweedfs-filer

    echo "Waiting for SeaweedFS to be ready..."
    for i in {1..30}; do
        if curl -f http://localhost:8888/ > /dev/null 2>&1; then
            echo "✓ SeaweedFS is ready!"
            return 0
        fi
        echo -n "."
        sleep 2
    done

    echo ""
    echo "✗ SeaweedFS failed to start"
    return 1
}

# Build the project
build_project() {
    echo ""
    echo "Building the project..."
    mvn clean package -DskipTests
    echo "✓ Build completed"
}

# Run tests
run_tests() {
    echo ""
    echo "Running integration tests..."
    export SEAWEEDFS_TEST_ENABLED=true
    mvn test
    echo "✓ Tests completed"
}

# Run example
run_example() {
    echo ""
    echo "Running example application..."

    if ! command -v spark-submit > /dev/null; then
        echo "⚠ spark-submit not found. Skipping example application."
        echo "To run the example, install Apache Spark and try: make run-example"
        return 0
    fi

    spark-submit \
        --class seaweed.spark.SparkSeaweedFSExample \
        --master local[2] \
        --conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \
        --conf spark.hadoop.fs.seaweed.filer.host=localhost \
        --conf spark.hadoop.fs.seaweed.filer.port=8888 \
        --conf spark.hadoop.fs.seaweed.filer.port.grpc=18888 \
        target/seaweedfs-spark-integration-tests-1.0-SNAPSHOT.jar \
        seaweedfs://localhost:8888/spark-quickstart-output

    echo "✓ Example completed"
}

# Cleanup
cleanup() {
    echo ""
    echo "Cleaning up..."
    docker-compose down -v
    echo "✓ Cleanup completed"
}

# Main execution
main() {
    # Check if Docker is available
    if ! command -v docker > /dev/null; then
        echo "Error: Docker is not installed or not in PATH"
        exit 1
    fi

    # Check if Maven is available
    if ! command -v mvn > /dev/null; then
        echo "Error: Maven is not installed or not in PATH"
        exit 1
    fi

    # Check if SeaweedFS is running; if not, start it
    if ! check_seaweedfs; then
        read -p "Do you want to start SeaweedFS with Docker? (y/n) " -n 1 -r
        echo
        if [[ $REPLY =~ ^[Yy]$ ]]; then
            start_seaweedfs || exit 1
        else
            echo "Please start SeaweedFS manually and rerun this script."
            exit 1
        fi
    fi

    # Build project
    build_project || exit 1

    # Run tests
    run_tests || exit 1

    # Run example if Spark is available
    run_example

    echo ""
    echo "=== Quick Start Completed Successfully! ==="
    echo ""
    echo "Next steps:"
    echo "  - View test results in target/surefire-reports/"
    echo "  - Check example output at http://localhost:8888/"
    echo "  - Run 'make help' for more options"
    echo "  - Read README.md for detailed documentation"
    echo ""

    read -p "Do you want to stop SeaweedFS? (y/n) " -n 1 -r
    echo
    if [[ $REPLY =~ ^[Yy]$ ]]; then
        cleanup
|||
fi |
|||
} |
|||
|
|||
# Handle Ctrl+C |
|||
trap 'cleanup; exit 130' INT |
|||
|
|||
# Run main |
|||
main |
|||
|
|||
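The polling loop in `start_seaweedfs` generalizes to any readiness probe. A minimal sketch, assuming a hypothetical `wait_for` helper that takes an attempt budget plus the probe command (quick-start.sh itself hardcodes curl against the filer and sleeps 2s between attempts):

```shell
#!/bin/bash
# Generic readiness poll: retry a probe command until it succeeds
# or the attempt budget is exhausted.
wait_for() {
  local attempts="$1"; shift
  local i
  for i in $(seq 1 "$attempts"); do
    if "$@" > /dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "timed out after $attempts attempts"
  return 1
}

# A probe that succeeds immediately, for illustration; quick-start.sh
# would pass: wait_for 30 curl -f http://localhost:8888/
wait_for 3 true    # → "ready after 1 attempt(s)"
```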
@ -0,0 +1,44 @@ |
|||
#!/bin/bash |
|||
|
|||
set -e |
|||
|
|||
echo "=== SeaweedFS Spark Integration Tests Runner ===" |
|||
echo "" |
|||
|
|||
# Check if SeaweedFS is running |
|||
check_seaweedfs() { |
|||
if curl -f http://localhost:8888/ > /dev/null 2>&1; then |
|||
echo "✓ SeaweedFS filer is accessible at http://localhost:8888" |
|||
return 0 |
|||
else |
|||
echo "✗ SeaweedFS filer is not accessible" |
|||
return 1 |
|||
fi |
|||
} |
|||
|
|||
# Main |
|||
if ! check_seaweedfs; then |
|||
echo "" |
|||
echo "Please start SeaweedFS first. You can use:" |
|||
echo " cd test/java/spark && docker-compose up -d" |
|||
echo "Or:" |
|||
echo " make docker-up" |
|||
exit 1 |
|||
fi |
|||
|
|||
echo "" |
|||
echo "Running Spark integration tests..." |
|||
echo "" |
|||
|
|||
export SEAWEEDFS_TEST_ENABLED=true |
|||
export SEAWEEDFS_FILER_HOST=localhost |
|||
export SEAWEEDFS_FILER_PORT=8888 |
|||
export SEAWEEDFS_FILER_GRPC_PORT=18888 |
|||
|
|||
# Run tests |
|||
mvn test "$@" |
|||
|
|||
echo "" |
|||
echo "✓ Test run completed" |
|||
echo "View detailed reports in: target/surefire-reports/" |
|||
|
|||
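run-tests.sh exports fixed endpoint values. A hedged alternative sketch using shell default-expansion, so a caller's pre-exported values win and the localhost defaults apply only as a fallback (the variable names match the script above; the echo lines are illustrative):

```shell
#!/bin/bash
# Assign the local defaults only when the caller has not already
# exported these variables.
: "${SEAWEEDFS_FILER_HOST:=localhost}"
: "${SEAWEEDFS_FILER_PORT:=8888}"
: "${SEAWEEDFS_FILER_GRPC_PORT:=18888}"

echo "filer HTTP: http://${SEAWEEDFS_FILER_HOST}:${SEAWEEDFS_FILER_PORT}"
echo "filer gRPC: ${SEAWEEDFS_FILER_HOST}:${SEAWEEDFS_FILER_GRPC_PORT}"
```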
@ -0,0 +1,143 @@ |
|||
package seaweed.spark; |
|||
|
|||
import org.apache.spark.sql.Dataset; |
|||
import org.apache.spark.sql.Row; |
|||
import org.apache.spark.sql.SaveMode; |
|||
import org.apache.spark.sql.SparkSession; |
|||
|
|||
/** |
|||
* Example Spark application demonstrating SeaweedFS integration. |
|||
* |
|||
* This can be submitted to a Spark cluster using spark-submit. |
|||
* |
|||
* Example usage: |
|||
* spark-submit \ |
|||
* --class seaweed.spark.SparkSeaweedFSExample \ |
|||
* --master local[2] \ |
|||
* --conf spark.hadoop.fs.seaweedfs.impl=seaweed.hdfs.SeaweedFileSystem \ |
|||
* --conf spark.hadoop.fs.seaweed.filer.host=localhost \ |
|||
* --conf spark.hadoop.fs.seaweed.filer.port=8888 \ |
|||
* --conf spark.hadoop.fs.seaweed.filer.port.grpc=18888 \ |
|||
* target/seaweedfs-spark-integration-tests-1.0-SNAPSHOT.jar \ |
|||
* seaweedfs://localhost:8888/output |
|||
*/ |
|||
public class SparkSeaweedFSExample { |
|||
|
|||
public static void main(String[] args) { |
|||
if (args.length < 1) { |
|||
System.err.println("Usage: SparkSeaweedFSExample <output-path>"); |
|||
System.err.println("Example: seaweedfs://localhost:8888/spark-output"); |
|||
System.exit(1); |
|||
} |
|||
|
|||
String outputPath = args[0]; |
|||
|
|||
// Create Spark session |
|||
SparkSession spark = SparkSession.builder() |
|||
.appName("SeaweedFS Spark Example") |
|||
.getOrCreate(); |
|||
|
|||
try { |
|||
System.out.println("=== SeaweedFS Spark Integration Example ===\n"); |
|||
|
|||
// Example 1: Generate data and write to SeaweedFS |
|||
System.out.println("1. Generating sample data..."); |
|||
Dataset<Row> data = spark.range(0, 1000) |
|||
.selectExpr( |
|||
"id", |
|||
"id * 2 as doubled", |
|||
"CAST(rand() * 100 AS INT) as random_value" |
|||
); |
|||
|
|||
System.out.println(" Generated " + data.count() + " rows"); |
|||
data.show(5); |
|||
|
|||
// Write as Parquet |
|||
String parquetPath = outputPath + "/data.parquet"; |
|||
System.out.println("\n2. Writing data to SeaweedFS as Parquet..."); |
|||
System.out.println(" Path: " + parquetPath); |
|||
|
|||
data.write() |
|||
.mode(SaveMode.Overwrite) |
|||
.parquet(parquetPath); |
|||
|
|||
System.out.println(" ✓ Write completed"); |
|||
|
|||
// Read back and verify |
|||
System.out.println("\n3. Reading data back from SeaweedFS..."); |
|||
Dataset<Row> readData = spark.read().parquet(parquetPath); |
|||
System.out.println(" Read " + readData.count() + " rows"); |
|||
|
|||
// Perform aggregation |
|||
System.out.println("\n4. Performing aggregation..."); |
|||
Dataset<Row> stats = readData.selectExpr( |
|||
"COUNT(*) as count", |
|||
"AVG(random_value) as avg_random", |
|||
"MAX(doubled) as max_doubled" |
|||
); |
|||
|
|||
stats.show(); |
|||
|
|||
// Write aggregation results |
|||
String statsPath = outputPath + "/stats.parquet"; |
|||
System.out.println("5. Writing stats to: " + statsPath); |
|||
stats.write() |
|||
.mode(SaveMode.Overwrite) |
|||
.parquet(statsPath); |
|||
|
|||
// Create a partitioned dataset |
|||
System.out.println("\n6. Creating partitioned dataset..."); |
|||
Dataset<Row> partitionedData = data.selectExpr( |
|||
"*", |
|||
"CAST(id % 10 AS INT) as partition_key" |
|||
); |
|||
|
|||
String partitionedPath = outputPath + "/partitioned.parquet"; |
|||
System.out.println(" Path: " + partitionedPath); |
|||
|
|||
partitionedData.write() |
|||
.mode(SaveMode.Overwrite) |
|||
.partitionBy("partition_key") |
|||
.parquet(partitionedPath); |
|||
|
|||
System.out.println(" ✓ Partitioned write completed"); |
|||
|
|||
// Read specific partition |
|||
System.out.println("\n7. Reading specific partition (partition_key=0)..."); |
|||
Dataset<Row> partition0 = spark.read() |
|||
.parquet(partitionedPath) |
|||
.filter("partition_key = 0"); |
|||
|
|||
System.out.println(" Partition 0 contains " + partition0.count() + " rows"); |
|||
partition0.show(5); |
|||
|
|||
// SQL example |
|||
System.out.println("\n8. Using Spark SQL..."); |
|||
readData.createOrReplaceTempView("seaweedfs_data"); |
|||
|
|||
Dataset<Row> sqlResult = spark.sql( |
|||
"SELECT " + |
|||
" CAST(id / 100 AS INT) as bucket, " + |
|||
" COUNT(*) as count, " + |
|||
" AVG(random_value) as avg_random " + |
|||
"FROM seaweedfs_data " + |
|||
"GROUP BY CAST(id / 100 AS INT) " + |
|||
"ORDER BY bucket" |
|||
); |
|||
|
|||
System.out.println(" Bucketed statistics:"); |
|||
sqlResult.show(); |
|||
|
|||
System.out.println("\n=== Example completed successfully! ==="); |
|||
System.out.println("Output location: " + outputPath); |
|||
|
|||
} catch (Exception e) { |
|||
System.err.println("Error: " + e.getMessage()); |
|||
e.printStackTrace(); |
|||
System.exit(1); |
|||
} finally { |
|||
spark.stop(); |
|||
} |
|||
} |
|||
} |
|||
|
|||
@ -0,0 +1,216 @@ |
|||
package seaweed.spark; |
|||
|
|||
import org.apache.spark.sql.Dataset; |
|||
import org.apache.spark.sql.Row; |
|||
import org.apache.spark.sql.SaveMode; |
|||
import org.junit.Test; |
|||
|
|||
import java.util.Arrays; |
|||
import java.util.List; |
|||
|
|||
import static org.junit.Assert.*; |
|||
|
|||
/** |
|||
* Integration tests for Spark read/write operations with SeaweedFS. |
|||
*/ |
|||
public class SparkReadWriteTest extends SparkTestBase { |
|||
|
|||
@Test |
|||
public void testWriteAndReadParquet() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create test data |
|||
List<Person> people = Arrays.asList( |
|||
new Person("Alice", 30), |
|||
new Person("Bob", 25), |
|||
new Person("Charlie", 35) |
|||
); |
|||
|
|||
Dataset<Row> df = spark.createDataFrame(people, Person.class); |
|||
|
|||
// Write to SeaweedFS |
|||
String outputPath = getTestPath("people.parquet"); |
|||
df.write().mode(SaveMode.Overwrite).parquet(outputPath); |
|||
|
|||
// Read back from SeaweedFS |
|||
Dataset<Row> readDf = spark.read().parquet(outputPath); |
|||
|
|||
// Verify |
|||
assertEquals(3, readDf.count()); |
|||
assertEquals(2, readDf.columns().length); |
|||
|
|||
List<Row> results = readDf.collectAsList(); |
|||
assertTrue(results.stream().anyMatch(r -> "Alice".equals(r.getAs("name")) && (Integer)r.getAs("age") == 30)); |
|||
assertTrue(results.stream().anyMatch(r -> "Bob".equals(r.getAs("name")) && (Integer)r.getAs("age") == 25)); |
|||
assertTrue(results.stream().anyMatch(r -> "Charlie".equals(r.getAs("name")) && (Integer)r.getAs("age") == 35)); |
|||
} |
|||
|
|||
@Test |
|||
public void testWriteAndReadCSV() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create test data |
|||
List<Person> people = Arrays.asList( |
|||
new Person("Alice", 30), |
|||
new Person("Bob", 25) |
|||
); |
|||
|
|||
Dataset<Row> df = spark.createDataFrame(people, Person.class); |
|||
|
|||
// Write to SeaweedFS as CSV |
|||
String outputPath = getTestPath("people.csv"); |
|||
df.write().mode(SaveMode.Overwrite).option("header", "true").csv(outputPath); |
|||
|
|||
// Read back from SeaweedFS |
|||
Dataset<Row> readDf = spark.read().option("header", "true").option("inferSchema", "true").csv(outputPath); |
|||
|
|||
// Verify |
|||
assertEquals(2, readDf.count()); |
|||
assertEquals(2, readDf.columns().length); |
|||
} |
|||
|
|||
@Test |
|||
public void testWriteAndReadJSON() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create test data |
|||
List<Person> people = Arrays.asList( |
|||
new Person("Alice", 30), |
|||
new Person("Bob", 25), |
|||
new Person("Charlie", 35) |
|||
); |
|||
|
|||
Dataset<Row> df = spark.createDataFrame(people, Person.class); |
|||
|
|||
// Write to SeaweedFS as JSON |
|||
String outputPath = getTestPath("people.json"); |
|||
df.write().mode(SaveMode.Overwrite).json(outputPath); |
|||
|
|||
// Read back from SeaweedFS |
|||
Dataset<Row> readDf = spark.read().json(outputPath); |
|||
|
|||
// Verify |
|||
assertEquals(3, readDf.count()); |
|||
assertEquals(2, readDf.columns().length); |
|||
} |
|||
|
|||
@Test |
|||
public void testWritePartitionedData() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create test data with multiple years |
|||
List<PersonWithYear> people = Arrays.asList( |
|||
new PersonWithYear("Alice", 30, 2020), |
|||
new PersonWithYear("Bob", 25, 2021), |
|||
new PersonWithYear("Charlie", 35, 2020), |
|||
new PersonWithYear("David", 28, 2021) |
|||
); |
|||
|
|||
Dataset<Row> df = spark.createDataFrame(people, PersonWithYear.class); |
|||
|
|||
// Write partitioned data to SeaweedFS |
|||
String outputPath = getTestPath("people_partitioned"); |
|||
df.write().mode(SaveMode.Overwrite).partitionBy("year").parquet(outputPath); |
|||
|
|||
// Read back from SeaweedFS |
|||
Dataset<Row> readDf = spark.read().parquet(outputPath); |
|||
|
|||
// Verify |
|||
assertEquals(4, readDf.count()); |
|||
|
|||
// Verify partition filtering works |
|||
Dataset<Row> filtered2020 = readDf.filter("year = 2020"); |
|||
assertEquals(2, filtered2020.count()); |
|||
|
|||
Dataset<Row> filtered2021 = readDf.filter("year = 2021"); |
|||
assertEquals(2, filtered2021.count()); |
|||
} |
|||
|
|||
@Test |
|||
public void testAppendMode() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
String outputPath = getTestPath("people_append.parquet"); |
|||
|
|||
// Write first batch |
|||
List<Person> batch1 = Arrays.asList( |
|||
new Person("Alice", 30), |
|||
new Person("Bob", 25) |
|||
); |
|||
Dataset<Row> df1 = spark.createDataFrame(batch1, Person.class); |
|||
df1.write().mode(SaveMode.Overwrite).parquet(outputPath); |
|||
|
|||
// Append second batch |
|||
List<Person> batch2 = Arrays.asList( |
|||
new Person("Charlie", 35), |
|||
new Person("David", 28) |
|||
); |
|||
Dataset<Row> df2 = spark.createDataFrame(batch2, Person.class); |
|||
df2.write().mode(SaveMode.Append).parquet(outputPath); |
|||
|
|||
// Read back and verify |
|||
Dataset<Row> readDf = spark.read().parquet(outputPath); |
|||
assertEquals(4, readDf.count()); |
|||
} |
|||
|
|||
@Test |
|||
public void testLargeDataset() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create a larger dataset |
|||
Dataset<Row> largeDf = spark.range(0, 10000) |
|||
.selectExpr("id as value", "id * 2 as doubled"); |
|||
|
|||
String outputPath = getTestPath("large_dataset.parquet"); |
|||
largeDf.write().mode(SaveMode.Overwrite).parquet(outputPath); |
|||
|
|||
// Read back and verify |
|||
Dataset<Row> readDf = spark.read().parquet(outputPath); |
|||
assertEquals(10000, readDf.count()); |
|||
|
|||
// Verify the smallest row; order the read back first, since a |
|||
// Parquet scan makes no row-ordering guarantee |
|||
Row firstRow = readDf.orderBy("value").first(); |
|||
assertEquals(0L, firstRow.getLong(0)); |
|||
assertEquals(0L, firstRow.getLong(1)); |
|||
} |
|||
|
|||
// Test data classes |
|||
public static class Person implements java.io.Serializable { |
|||
private String name; |
|||
private int age; |
|||
|
|||
public Person() {} |
|||
|
|||
public Person(String name, int age) { |
|||
this.name = name; |
|||
this.age = age; |
|||
} |
|||
|
|||
public String getName() { return name; } |
|||
public void setName(String name) { this.name = name; } |
|||
public int getAge() { return age; } |
|||
public void setAge(int age) { this.age = age; } |
|||
} |
|||
|
|||
public static class PersonWithYear implements java.io.Serializable { |
|||
private String name; |
|||
private int age; |
|||
private int year; |
|||
|
|||
public PersonWithYear() {} |
|||
|
|||
public PersonWithYear(String name, int age, int year) { |
|||
this.name = name; |
|||
this.age = age; |
|||
this.year = year; |
|||
} |
|||
|
|||
public String getName() { return name; } |
|||
public void setName(String name) { this.name = name; } |
|||
public int getAge() { return age; } |
|||
public void setAge(int age) { this.age = age; } |
|||
public int getYear() { return year; } |
|||
public void setYear(int year) { this.year = year; } |
|||
} |
|||
} |
|||
|
|||
@ -0,0 +1,236 @@ |
|||
package seaweed.spark; |
|||
|
|||
import org.apache.spark.sql.Dataset; |
|||
import org.apache.spark.sql.Row; |
|||
import org.apache.spark.sql.SaveMode; |
|||
import org.junit.Test; |
|||
|
|||
import java.util.Arrays; |
|||
import java.util.List; |
|||
|
|||
import static org.junit.Assert.*; |
|||
|
|||
/** |
|||
* Integration tests for Spark SQL operations with SeaweedFS. |
|||
*/ |
|||
public class SparkSQLTest extends SparkTestBase { |
|||
|
|||
@Test |
|||
public void testCreateTableAndQuery() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create test data |
|||
List<Employee> employees = Arrays.asList( |
|||
new Employee(1, "Alice", "Engineering", 100000), |
|||
new Employee(2, "Bob", "Sales", 80000), |
|||
new Employee(3, "Charlie", "Engineering", 120000), |
|||
new Employee(4, "David", "Sales", 75000) |
|||
); |
|||
|
|||
Dataset<Row> df = spark.createDataFrame(employees, Employee.class); |
|||
|
|||
// Write to SeaweedFS |
|||
String tablePath = getTestPath("employees"); |
|||
df.write().mode(SaveMode.Overwrite).parquet(tablePath); |
|||
|
|||
// Create temporary view |
|||
Dataset<Row> employeesDf = spark.read().parquet(tablePath); |
|||
employeesDf.createOrReplaceTempView("employees"); |
|||
|
|||
// Run SQL queries |
|||
Dataset<Row> engineeringEmployees = spark.sql( |
|||
"SELECT name, salary FROM employees WHERE department = 'Engineering'" |
|||
); |
|||
|
|||
assertEquals(2, engineeringEmployees.count()); |
|||
|
|||
Dataset<Row> highPaidEmployees = spark.sql( |
|||
"SELECT name, salary FROM employees WHERE salary > 90000" |
|||
); |
|||
|
|||
assertEquals(2, highPaidEmployees.count()); |
|||
} |
|||
|
|||
@Test |
|||
public void testAggregationQueries() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create sales data |
|||
List<Sale> sales = Arrays.asList( |
|||
new Sale("2024-01", "Product A", 100), |
|||
new Sale("2024-01", "Product B", 150), |
|||
new Sale("2024-02", "Product A", 120), |
|||
new Sale("2024-02", "Product B", 180), |
|||
new Sale("2024-03", "Product A", 110) |
|||
); |
|||
|
|||
Dataset<Row> df = spark.createDataFrame(sales, Sale.class); |
|||
|
|||
// Write to SeaweedFS |
|||
String tablePath = getTestPath("sales"); |
|||
df.write().mode(SaveMode.Overwrite).parquet(tablePath); |
|||
|
|||
// Create temporary view |
|||
Dataset<Row> salesDf = spark.read().parquet(tablePath); |
|||
salesDf.createOrReplaceTempView("sales"); |
|||
|
|||
// Aggregate query |
|||
Dataset<Row> monthlySales = spark.sql( |
|||
"SELECT month, SUM(amount) as total FROM sales GROUP BY month ORDER BY month" |
|||
); |
|||
|
|||
List<Row> results = monthlySales.collectAsList(); |
|||
assertEquals(3, results.size()); |
|||
assertEquals("2024-01", results.get(0).getString(0)); |
|||
assertEquals(250, results.get(0).getLong(1)); |
|||
} |
|||
|
|||
@Test |
|||
public void testJoinOperations() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create employee data |
|||
List<Employee> employees = Arrays.asList( |
|||
new Employee(1, "Alice", "Engineering", 100000), |
|||
new Employee(2, "Bob", "Sales", 80000) |
|||
); |
|||
|
|||
// Create department data |
|||
List<Department> departments = Arrays.asList( |
|||
new Department("Engineering", "Building Products"), |
|||
new Department("Sales", "Selling Products") |
|||
); |
|||
|
|||
Dataset<Row> empDf = spark.createDataFrame(employees, Employee.class); |
|||
Dataset<Row> deptDf = spark.createDataFrame(departments, Department.class); |
|||
|
|||
// Write to SeaweedFS |
|||
String empPath = getTestPath("employees_join"); |
|||
String deptPath = getTestPath("departments_join"); |
|||
|
|||
empDf.write().mode(SaveMode.Overwrite).parquet(empPath); |
|||
deptDf.write().mode(SaveMode.Overwrite).parquet(deptPath); |
|||
|
|||
// Read back and create views |
|||
spark.read().parquet(empPath).createOrReplaceTempView("emp"); |
|||
spark.read().parquet(deptPath).createOrReplaceTempView("dept"); |
|||
|
|||
// Join query |
|||
Dataset<Row> joined = spark.sql( |
|||
"SELECT e.name, e.salary, d.description " + |
|||
"FROM emp e JOIN dept d ON e.department = d.name" |
|||
); |
|||
|
|||
assertEquals(2, joined.count()); |
|||
|
|||
List<Row> results = joined.collectAsList(); |
|||
assertTrue(results.stream().anyMatch(r -> |
|||
"Alice".equals(r.getString(0)) && "Building Products".equals(r.getString(2)) |
|||
)); |
|||
} |
|||
|
|||
@Test |
|||
public void testWindowFunctions() { |
|||
skipIfTestsDisabled(); |
|||
|
|||
// Create employee data with salaries |
|||
List<Employee> employees = Arrays.asList( |
|||
new Employee(1, "Alice", "Engineering", 100000), |
|||
new Employee(2, "Bob", "Engineering", 120000), |
|||
new Employee(3, "Charlie", "Sales", 80000), |
|||
new Employee(4, "David", "Sales", 90000) |
|||
); |
|||
|
|||
Dataset<Row> df = spark.createDataFrame(employees, Employee.class); |
|||
|
|||
String tablePath = getTestPath("employees_window"); |
|||
df.write().mode(SaveMode.Overwrite).parquet(tablePath); |
|||
|
|||
Dataset<Row> employeesDf = spark.read().parquet(tablePath); |
|||
employeesDf.createOrReplaceTempView("employees_ranked"); |
|||
|
|||
// Window function query - rank employees by salary within department |
|||
Dataset<Row> ranked = spark.sql( |
|||
"SELECT name, department, salary, " + |
|||
"RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank " + |
|||
"FROM employees_ranked" |
|||
); |
|||
|
|||
assertEquals(4, ranked.count()); |
|||
|
|||
// Verify Bob has rank 1 in Engineering (highest salary) |
|||
List<Row> results = ranked.collectAsList(); |
|||
Row bobRow = results.stream() |
|||
.filter(r -> "Bob".equals(r.getString(0))) |
|||
.findFirst() |
|||
.orElse(null); |
|||
|
|||
assertNotNull(bobRow); |
|||
assertEquals(1, bobRow.getInt(3)); |
|||
} |
|||
|
|||
// Test data classes |
|||
public static class Employee implements java.io.Serializable { |
|||
private int id; |
|||
private String name; |
|||
private String department; |
|||
private int salary; |
|||
|
|||
public Employee() {} |
|||
|
|||
public Employee(int id, String name, String department, int salary) { |
|||
this.id = id; |
|||
this.name = name; |
|||
this.department = department; |
|||
this.salary = salary; |
|||
} |
|||
|
|||
public int getId() { return id; } |
|||
public void setId(int id) { this.id = id; } |
|||
public String getName() { return name; } |
|||
public void setName(String name) { this.name = name; } |
|||
public String getDepartment() { return department; } |
|||
public void setDepartment(String department) { this.department = department; } |
|||
public int getSalary() { return salary; } |
|||
public void setSalary(int salary) { this.salary = salary; } |
|||
} |
|||
|
|||
public static class Sale implements java.io.Serializable { |
|||
private String month; |
|||
private String product; |
|||
private int amount; |
|||
|
|||
public Sale() {} |
|||
|
|||
public Sale(String month, String product, int amount) { |
|||
this.month = month; |
|||
this.product = product; |
|||
this.amount = amount; |
|||
} |
|||
|
|||
public String getMonth() { return month; } |
|||
public void setMonth(String month) { this.month = month; } |
|||
public String getProduct() { return product; } |
|||
public void setProduct(String product) { this.product = product; } |
|||
public int getAmount() { return amount; } |
|||
public void setAmount(int amount) { this.amount = amount; } |
|||
} |
|||
|
|||
public static class Department implements java.io.Serializable { |
|||
private String name; |
|||
private String description; |
|||
|
|||
public Department() {} |
|||
|
|||
public Department(String name, String description) { |
|||
this.name = name; |
|||
this.description = description; |
|||
} |
|||
|
|||
public String getName() { return name; } |
|||
public void setName(String name) { this.name = name; } |
|||
public String getDescription() { return description; } |
|||
public void setDescription(String description) { this.description = description; } |
|||
} |
|||
} |
|||
|
|||
@ -0,0 +1,126 @@ |
|||
package seaweed.spark; |
|||
|
|||
import org.apache.hadoop.conf.Configuration; |
|||
import org.apache.spark.SparkConf; |
|||
import org.apache.spark.sql.SparkSession; |
|||
import org.junit.After; |
|||
import org.junit.Before; |
|||
|
|||
import java.io.IOException; |
|||
|
|||
/** |
|||
* Base class for Spark integration tests with SeaweedFS. |
|||
* |
|||
* These tests require a running SeaweedFS cluster. |
|||
* Set environment variable SEAWEEDFS_TEST_ENABLED=true to enable these tests. |
|||
*/ |
|||
public abstract class SparkTestBase { |
|||
|
|||
protected SparkSession spark; |
|||
protected static final String TEST_ROOT = "/test-spark"; |
|||
protected static final boolean TESTS_ENABLED = |
|||
"true".equalsIgnoreCase(System.getenv("SEAWEEDFS_TEST_ENABLED")); |
|||
|
|||
// SeaweedFS connection settings |
|||
protected static final String SEAWEEDFS_HOST = |
|||
System.getenv().getOrDefault("SEAWEEDFS_FILER_HOST", "localhost"); |
|||
protected static final String SEAWEEDFS_PORT = |
|||
System.getenv().getOrDefault("SEAWEEDFS_FILER_PORT", "8888"); |
|||
protected static final String SEAWEEDFS_GRPC_PORT = |
|||
System.getenv().getOrDefault("SEAWEEDFS_FILER_GRPC_PORT", "18888"); |
|||
|
|||
@Before |
|||
public void setUpSpark() throws IOException { |
|||
if (!TESTS_ENABLED) { |
|||
return; |
|||
} |
|||
|
|||
SparkConf sparkConf = new SparkConf() |
|||
.setAppName("SeaweedFS Integration Test") |
|||
.setMaster("local[1]") // Single thread to avoid concurrent gRPC issues |
|||
.set("spark.driver.host", "localhost") |
|||
.set("spark.sql.warehouse.dir", getSeaweedFSPath("/spark-warehouse")) |
|||
// SeaweedFS configuration |
|||
.set("spark.hadoop.fs.defaultFS", String.format("seaweedfs://%s:%s", SEAWEEDFS_HOST, SEAWEEDFS_PORT)) |
|||
.set("spark.hadoop.fs.seaweedfs.impl", "seaweed.hdfs.SeaweedFileSystem") |
|||
.set("spark.hadoop.fs.seaweed.impl", "seaweed.hdfs.SeaweedFileSystem") |
|||
.set("spark.hadoop.fs.seaweed.filer.host", SEAWEEDFS_HOST) |
|||
.set("spark.hadoop.fs.seaweed.filer.port", SEAWEEDFS_PORT) |
|||
.set("spark.hadoop.fs.seaweed.filer.port.grpc", SEAWEEDFS_GRPC_PORT) |
|||
.set("spark.hadoop.fs.AbstractFileSystem.seaweedfs.impl", "seaweed.hdfs.SeaweedAbstractFileSystem") |
|||
// Set replication to empty string to use filer default |
|||
.set("spark.hadoop.fs.seaweed.replication", "") |
|||
// Smaller buffer to reduce load |
|||
.set("spark.hadoop.fs.seaweed.buffer.size", "1048576") // 1MB |
|||
// Reduce parallelism |
|||
.set("spark.default.parallelism", "1") |
|||
.set("spark.sql.shuffle.partitions", "1") |
|||
// Simpler output committer |
|||
.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2") |
|||
.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol") |
|||
// Disable speculative execution to reduce load |
|||
.set("spark.speculation", "false"); |
|||
|
|||
spark = SparkSession.builder() |
|||
.config(sparkConf) |
|||
.getOrCreate(); |
|||
|
|||
// Clean up test directory |
|||
cleanupTestDirectory(); |
|||
} |
|||
|
|||
@After |
|||
public void tearDownSpark() { |
|||
if (!TESTS_ENABLED || spark == null) { |
|||
return; |
|||
} |
|||
|
|||
try { |
|||
// Try to cleanup but don't fail if it doesn't work |
|||
cleanupTestDirectory(); |
|||
} catch (Exception e) { |
|||
System.err.println("Cleanup failed: " + e.getMessage()); |
|||
} finally { |
|||
try { |
|||
spark.stop(); |
|||
} catch (Exception e) { |
|||
System.err.println("Spark stop failed: " + e.getMessage()); |
|||
} |
|||
spark = null; |
|||
} |
|||
} |
|||
|
|||
protected String getSeaweedFSPath(String path) { |
|||
return String.format("seaweedfs://%s:%s%s", SEAWEEDFS_HOST, SEAWEEDFS_PORT, path); |
|||
} |
|||
|
|||
protected String getTestPath(String subPath) { |
|||
return getSeaweedFSPath(TEST_ROOT + "/" + subPath); |
|||
} |
|||
|
|||
private void cleanupTestDirectory() { |
|||
if (spark != null) { |
|||
try { |
|||
Configuration conf = spark.sparkContext().hadoopConfiguration(); |
|||
org.apache.hadoop.fs.FileSystem fs = org.apache.hadoop.fs.FileSystem.get( |
|||
java.net.URI.create(getSeaweedFSPath("/")), conf); |
|||
org.apache.hadoop.fs.Path testPath = new org.apache.hadoop.fs.Path(TEST_ROOT); |
|||
if (fs.exists(testPath)) { |
|||
fs.delete(testPath, true); |
|||
} |
|||
} catch (Exception e) { |
|||
// Suppress cleanup errors - they shouldn't fail tests |
|||
// Common in distributed systems with eventual consistency |
|||
System.err.println("Warning: cleanup failed (non-critical): " + e.getMessage()); |
|||
} |
|||
} |
|||
} |
|||
|
|||
protected void skipIfTestsDisabled() { |
|||
if (!TESTS_ENABLED) { |
|||
System.out.println("Skipping test - SEAWEEDFS_TEST_ENABLED not set"); |
|||
org.junit.Assume.assumeTrue("SEAWEEDFS_TEST_ENABLED not set", false); |
|||
} |
|||
} |
|||
} |
|||
|
|||
@ -0,0 +1,18 @@ |
|||
# Set root logger level |
|||
log4j.rootLogger=WARN, console |
|||
|
|||
# Console appender |
|||
log4j.appender.console=org.apache.log4j.ConsoleAppender |
|||
log4j.appender.console.target=System.err |
|||
log4j.appender.console.layout=org.apache.log4j.PatternLayout |
|||
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n |
|||
|
|||
# Set log levels for specific packages |
|||
log4j.logger.org.apache.spark=WARN |
|||
log4j.logger.org.apache.hadoop=WARN |
|||
log4j.logger.seaweed=INFO |
|||
|
|||
# Suppress unnecessary warnings |
|||
log4j.logger.org.apache.spark.util.Utils=ERROR |
|||
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR |
|||
|
|||
@ -0,0 +1,40 @@ |
|||
#!/bin/bash |
|||
|
|||
# Run a single test method for quick iteration |
|||
|
|||
set -e |
|||
|
|||
if [ $# -eq 0 ]; then |
|||
echo "Usage: ./test-one.sh <TestClass>#<methodName>" |
|||
echo "" |
|||
echo "Examples:" |
|||
echo " ./test-one.sh SparkReadWriteTest#testWriteAndReadParquet" |
|||
echo " ./test-one.sh SparkSQLTest#testCreateTableAndQuery" |
|||
echo "" |
|||
exit 1 |
|||
fi |
|||
|
|||
# Check if SeaweedFS is running |
|||
if ! curl -f http://localhost:8888/ > /dev/null 2>&1; then |
|||
echo "✗ SeaweedFS filer is not accessible at http://localhost:8888" |
|||
echo "" |
|||
echo "Please start SeaweedFS first:" |
|||
echo " docker-compose up -d" |
|||
echo "" |
|||
exit 1 |
|||
fi |
|||
|
|||
echo "✓ SeaweedFS filer is accessible" |
|||
echo "" |
|||
echo "Running test: $1" |
|||
echo "" |
|||
|
|||
# Set environment variables |
|||
export SEAWEEDFS_TEST_ENABLED=true |
|||
export SEAWEEDFS_FILER_HOST=localhost |
|||
export SEAWEEDFS_FILER_PORT=8888 |
|||
export SEAWEEDFS_FILER_GRPC_PORT=18888 |
|||
|
|||
# Run the specific test |
|||
mvn test -Dtest="$1" |
|||
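test-one.sh passes its argument straight through to Maven. A small illustrative guard for the `<TestClass>#<methodName>` shape that Surefire's `-Dtest` expects; the `valid_test_ref` helper is hypothetical, not part of the script above:

```shell
#!/bin/bash
# Accept only "ClassName#methodName": a Java identifier on each
# side of a single '#'.
valid_test_ref() {
  [[ "$1" =~ ^[A-Za-z_][A-Za-z0-9_]*#[A-Za-z_][A-Za-z0-9_]*$ ]]
}

valid_test_ref "SparkReadWriteTest#testWriteAndReadParquet" && echo "ok"
valid_test_ref "missing-method" || echo "rejected"
```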
|
|||