Refactor S3 integration tests to use weed mini (#7877)
* Refactor S3 integration tests to use weed mini
* Fix weed mini flags for sse and parquet tests
* Fix IAM test startup: remove -iam.config flag from weed mini
* Enhance logging in IAM Makefile to debug startup failure
* Simplify weed mini flags and checks in S3 tests (IAM, Parquet, SSE, Copying)
* Simplify weed mini flags and checks in all S3 tests
* Fix IAM tests: use -s3.iam.config for weed mini
* Replace timeout command with portable loop in IAM Makefile
* Standardize portable loop-based readiness checks in all S3 Makefiles
* Define SERVER_DIR in retention Makefile
* Fix versioning and retention Makefiles: remove unsupported weed mini flags
* fix filer_group test
* fix cors
* emojis
* fix sse
* fix retention
* fixes
* fix
* fixes
* fix parquet
* fixes
* fix
* clean up
* avoid duplicated debug server
* Update .gitignore
* simplify
* clean up
* add credentials
* bind
* delay
* Update Makefile
* Update Makefile
* check ready
* delay
* update remote credentials
* Update Makefile
* clean up
* kill
* Update Makefile
* update credentials
21 changed files with 240 additions and 853 deletions
- 2   .gitignore
- 18  test/fuse_integration/Makefile
- 5   test/s3/compatibility/run.sh
- 48  test/s3/copying/Makefile
- 46  test/s3/cors/Makefile
- 59  test/s3/filer_group/Makefile
- 1   test/s3/filer_group/test_config.json
- 99  test/s3/iam/Makefile
- 172 test/s3/parquet/CROSS_FILESYSTEM_COMPATIBILITY.md
- 58  test/s3/parquet/FINAL_ROOT_CAUSE_ANALYSIS.md
- 70  test/s3/parquet/MINIO_DIRECTORY_HANDLING.md
- 164 test/s3/parquet/Makefile
- 46  test/s3/parquet/TEST_COVERAGE.md
- 1   test/s3/parquet/test_implicit_directory_fix.py
- 36  test/s3/remote_cache/Makefile
- 15  test/s3/remote_cache/remote_cache_test.go
- 31  test/s3/retention/Makefile
- 115 test/s3/sse/Makefile
- 36  test/s3/tagging/Makefile
- 67  test/s3/versioning/Makefile
- 4   weed/command/mini.go
test/s3/parquet/CROSS_FILESYSTEM_COMPATIBILITY.md
@@ -1,172 +0,0 @@
# Cross-Filesystem Compatibility Test Results

## Overview

This document summarizes the cross-filesystem compatibility testing between **s3fs** and **PyArrow native S3 filesystem** implementations when working with SeaweedFS.

## Test Purpose

Verify that Parquet files written using one filesystem implementation (s3fs or PyArrow native S3) can be correctly read using the other implementation, confirming true file format compatibility.

## Test Methodology

### Test Matrix

The test performs the following combinations:

1. **Write with s3fs → Read with PyArrow native S3**
2. **Write with PyArrow native S3 → Read with s3fs**

For each direction, the test (see the sketch after this list):

- Creates a sample PyArrow table with multiple data types (int64, string, float64, bool)
- Writes the Parquet file using one filesystem implementation
- Reads the Parquet file using the other filesystem implementation
- Verifies data integrity by comparing:
  - Row counts
  - Schema equality
  - Data contents (after sorting by ID to handle row order differences)
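A minimal sketch of one direction of this round trip (write with s3fs, read with PyArrow native S3), assuming the default endpoint, credentials, and bucket documented under Environment Variables below; the real test lives in `test_cross_filesystem_compatibility.py`:

```python
import pyarrow as pa
import pyarrow.dataset as pads
import pyarrow.fs as pafs
import pyarrow.parquet as pq
import s3fs

# Sample table with the data types exercised by the test.
table = pa.table({
    "id": pa.array([1, 2, 3, 4, 5], type=pa.int64()),
    "name": ["a", "b", "c", "d", "e"],
    "value": pa.array([0.1, 0.2, 0.3, 0.4, 0.5], type=pa.float64()),
    "flag": [True, False, True, False, True],
})

# Write with s3fs ...
fs = s3fs.S3FileSystem(
    key="some_access_key1",
    secret="some_secret_key1",
    client_kwargs={"endpoint_url": "http://localhost:8333"},
)
pads.write_dataset(table, "test-parquet-bucket/cross_fs",
                   format="parquet", filesystem=fs)

# ... then read back with the PyArrow native S3 filesystem.
s3 = pafs.S3FileSystem(
    access_key="some_access_key1",
    secret_key="some_secret_key1",
    endpoint_override="localhost:8333",
    scheme="http",
)
result = pq.read_table("test-parquet-bucket/cross_fs", filesystem=s3)

# Sort by id before comparing, since row order is not guaranteed.
assert result.num_rows == table.num_rows
assert result.schema.equals(table.schema)
assert result.sort_by("id").equals(table.sort_by("id"))
```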
### File Sizes Tested

- **Small files**: 5 rows (quick validation)
- **Large files**: 200,000 rows (multi-row-group validation)

## Test Results

### ✅ Small Files (5 rows)

| Write Method | Read Method | Result | Read Function Used |
|--------------|-------------|--------|--------------------|
| s3fs | PyArrow native S3 | ✅ PASS | pq.read_table |
| PyArrow native S3 | s3fs | ✅ PASS | pq.read_table |

**Status**: **ALL TESTS PASSED**

### Large Files (200,000 rows)

Large file testing requires adequate volume capacity in SeaweedFS. With the default volume settings (50MB max size), tests may hit capacity limits because several large test files are created simultaneously.

**Recommendation**: For large file testing, increase `VOLUME_MAX_SIZE_MB` in the Makefile, or run with `TEST_QUICK=1` during development and validation.
## Key Findings

### ✅ Full Compatibility Confirmed

**Files written with s3fs and PyArrow native S3 filesystem are fully compatible and can be read by either implementation.**

This confirms that:

1. **Identical Parquet Format**: Both s3fs and PyArrow native S3 use the same underlying PyArrow library to generate Parquet files, resulting in identical file formats at the binary level.

2. **S3 API Compatibility**: SeaweedFS's S3 implementation handles both filesystem backends correctly, with proper:
   - Object creation (PutObject)
   - Object reading (GetObject)
   - Directory handling (implicit directories)
   - Multipart uploads (for larger files)

3. **Metadata Consistency**: File metadata, schemas, and data integrity are preserved across both write and read operations, regardless of which filesystem implementation is used.

## Implementation Details

### Common Write Path

Both implementations use PyArrow's `pads.write_dataset()` function:

```python
import pyarrow.dataset as pads
import pyarrow.fs as pafs
import s3fs

# s3fs approach
fs = s3fs.S3FileSystem(...)
pads.write_dataset(table, path, format="parquet", filesystem=fs)

# PyArrow native approach
s3 = pafs.S3FileSystem(...)
pads.write_dataset(table, path, format="parquet", filesystem=s3)
```
### Multiple Read Methods Tested

The test attempts reads using multiple PyArrow methods:

- `pq.read_table()` - Direct table reading
- `pq.ParquetDataset()` - Dataset-based reading
- `pads.dataset()` - PyArrow dataset API

All methods successfully read files written by either filesystem implementation.
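A sketch of all three read paths, reusing the `s3` native filesystem object and dataset path from the round-trip sketch above:

```python
import pyarrow.dataset as pads
import pyarrow.parquet as pq

path = "test-parquet-bucket/cross_fs"  # dataset written earlier

t1 = pq.read_table(path, filesystem=s3)                               # direct table read
t2 = pq.ParquetDataset(path, filesystem=s3).read()                    # dataset-based read
t3 = pads.dataset(path, format="parquet", filesystem=s3).to_table()  # dataset API

assert t1.num_rows == t2.num_rows == t3.num_rows
```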
## Practical Implications

### For Users

1. **Flexibility**: Users can choose either s3fs or PyArrow native S3 based on their preferences:
   - **s3fs**: More mature, widely used, familiar API
   - **PyArrow native**: Pure PyArrow solution, fewer dependencies

2. **Interoperability**: Teams using different tools can seamlessly share Parquet datasets stored in SeaweedFS

3. **Migration**: Easy to migrate between filesystem implementations without data conversion

### For SeaweedFS

1. **S3 Compatibility**: Confirms SeaweedFS's S3 implementation is compatible with major Python data science tools

2. **Implicit Directory Handling**: The implicit directory fix works correctly for both filesystem implementations

3. **Standard Compliance**: SeaweedFS handles S3 operations in a way that's compatible with AWS S3 behavior
## Running the Tests

### Quick Test (Recommended for Development)

```bash
cd test/s3/parquet
TEST_QUICK=1 make test-cross-fs-with-server
```

### Full Test (All File Sizes)

```bash
cd test/s3/parquet
make test-cross-fs-with-server
```

### Manual Test (Assuming Server is Running)

```bash
cd test/s3/parquet
make setup-python
make start-seaweedfs-ci

# In another terminal
TEST_QUICK=1 make test-cross-fs

# Cleanup
make stop-seaweedfs-safe
```
## Environment Variables

The test supports customization through environment variables (consumed as sketched below):

- `S3_ENDPOINT_URL`: S3 endpoint (default: `http://localhost:8333`)
- `S3_ACCESS_KEY`: Access key (default: `some_access_key1`)
- `S3_SECRET_KEY`: Secret key (default: `some_secret_key1`)
- `BUCKET_NAME`: Bucket name (default: `test-parquet-bucket`)
- `TEST_QUICK`: Run only small tests (default: `0`, set to `1` for quick mode)
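A sketch of how a Python test can pick these up with the documented defaults (illustrative helper, not the exact code in the test scripts):

```python
import os

S3_ENDPOINT_URL = os.environ.get("S3_ENDPOINT_URL", "http://localhost:8333")
S3_ACCESS_KEY = os.environ.get("S3_ACCESS_KEY", "some_access_key1")
S3_SECRET_KEY = os.environ.get("S3_SECRET_KEY", "some_secret_key1")
BUCKET_NAME = os.environ.get("BUCKET_NAME", "test-parquet-bucket")
TEST_QUICK = os.environ.get("TEST_QUICK", "0") == "1"  # skip large-file tests when set
```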
## Conclusion

The cross-filesystem compatibility tests demonstrate that **Parquet files written via s3fs and PyArrow native S3 filesystem are completely interchangeable**. This validates that:

1. The Parquet file format is implementation-agnostic
2. SeaweedFS's S3 API correctly handles both filesystem backends
3. Users have full flexibility in choosing their preferred filesystem implementation

This compatibility rests on:

- PyArrow's consistent file format generation
- SeaweedFS's robust S3 API implementation
- Proper handling of S3 semantics (especially implicit directories)

---

**Test Implementation**: `test_cross_filesystem_compatibility.py`
**Last Updated**: November 21, 2024
**Status**: ✅ All critical tests passing
test/s3/parquet/FINAL_ROOT_CAUSE_ANALYSIS.md
@@ -1,58 +0,0 @@
# Final Root Cause Analysis

## Overview

This document provides a deep technical analysis of the s3fs compatibility issue with PyArrow Parquet datasets on SeaweedFS, and the solution implemented to resolve it.

## Root Cause

When PyArrow writes datasets using `write_dataset()`, it creates implicit directory structures by writing files without explicit directory markers. However, some S3 workflows may create 0-byte directory markers.

### The Problem

1. **PyArrow writes dataset files** without creating explicit directory objects
2. **s3fs calls HEAD** on the directory path to check if it exists
3. **If HEAD returns 200** with `Content-Length: 0`, s3fs interprets it as a file (not a directory)
4. **PyArrow fails** when trying to read, reporting "Parquet file size is 0 bytes"

### AWS S3 Behavior

AWS S3 returns **404 Not Found** for implicit directories (directories that only exist because they have children but no explicit marker object). This allows s3fs to fall back to LIST operations to detect the directory.
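A minimal client-side sketch of this behavior using boto3, assuming the endpoint and credentials documented in the parquet test docs; the bucket and key names here are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8333",
    aws_access_key_id="some_access_key1",
    aws_secret_access_key="some_secret_key1",
)

# "dataset" exists only implicitly, via children such as dataset/part-0.parquet.
try:
    s3.head_object(Bucket="test-parquet-bucket", Key="dataset")
    print("HEAD returned 200 - s3fs would treat this as a 0-byte file")
except ClientError as e:
    if e.response["Error"]["Code"] == "404":
        # Matches AWS S3: s3fs falls back to a LIST to detect the directory.
        resp = s3.list_objects_v2(Bucket="test-parquet-bucket",
                                  Prefix="dataset/", MaxKeys=1)
        print("implicit directory exists:", resp["KeyCount"] > 0)
```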
## The Solution

### Implementation

Modified the S3 API HEAD handler in `weed/s3api/s3api_object_handlers.go` to:

1. **Check if object ends with `/`**: Explicit directory markers return 200 as before
2. **Check if object has children**: If a 0-byte object has children in the filer, treat it as an implicit directory
3. **Return 404 for implicit directories**: This matches AWS S3 behavior and triggers s3fs's LIST fallback

### Code Changes

The fix is implemented in the `HeadObjectHandler` function with logic (sketched below) to:

- Detect implicit directories by checking for child entries
- Return 404 (NoSuchKey) for implicit directories
- Preserve existing behavior for explicit directory markers and regular files
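The actual handler is Go code in `weed/s3api/s3api_object_handlers.go`; the decision it applies can be sketched in Python as illustrative pseudocode (the names here are hypothetical, not the real identifiers):

```python
def head_object_status(key, entry, has_children):
    """Return the HTTP status a HEAD request should get for this key.

    entry: object metadata from the filer, or None if no entry exists.
    has_children: callable that asks the filer whether the key has child entries.
    """
    if entry is None:
        return 404   # no such object at all
    if key.endswith("/"):
        return 200   # explicit directory marker: unchanged behavior
    if entry.size == 0 and has_children(key):
        return 404   # implicit directory: match AWS S3, trigger s3fs's LIST fallback
    return 200       # regular file
```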
## Performance Considerations

### Optimization: Child Check Cache

- Child existence checks are performed via filer LIST operations
- Results could be cached for frequently accessed paths
- Trade-off between consistency and performance

### Impact

- Minimal performance impact for normal file operations
- Slight overhead for HEAD requests on implicit directories (one additional LIST call)
- Overall improvement in PyArrow compatibility outweighs the minor performance cost

## TODO

- [ ] Add detailed benchmarking results comparing before/after fix
- [ ] Document edge cases discovered during implementation
- [ ] Add architectural diagrams showing the request flow
- [ ] Document alternative solutions considered and why they were rejected
- [ ] Add performance profiling data for child existence checks
test/s3/parquet/MINIO_DIRECTORY_HANDLING.md
@@ -1,70 +0,0 @@
# MinIO Directory Handling Comparison

## Overview

This document compares how MinIO handles directory markers versus SeaweedFS's implementation, and explains the different approaches to S3 directory semantics.

## MinIO's Approach

MinIO handles implicit directories similarly to AWS S3:

1. **No explicit directory objects**: Directories are implicit, defined only by object key prefixes
2. **HEAD on directory returns 404**: Consistent with AWS S3 behavior
3. **LIST operations reveal directories**: Directories are discovered through delimiter-based LIST operations
4. **Automatic prefix handling**: MinIO automatically recognizes prefixes as directories

### MinIO Implementation Details

- Uses in-memory metadata for fast prefix lookups
- Optimized for LIST operations with the common delimiter (`/`)
- No persistent directory objects in the storage layer
- Directories "exist" as long as they contain objects

## SeaweedFS Approach

SeaweedFS uses a filer-based approach with real directory entries:

### Before the Fix

1. **Explicit directory objects**: Could create 0-byte objects as directory markers
2. **HEAD returns 200**: Even for implicit directories
3. **Caused s3fs issues**: s3fs interpreted 0-byte HEAD responses as empty files

### After the Fix

The hybrid behavior is illustrated in the sketch after this list.

1. **Hybrid approach**: Supports both explicit markers (with `/` suffix) and implicit directories
2. **HEAD returns 404 for implicit directories**: Matches AWS S3 and MinIO behavior
3. **Filer integration**: Uses the filer's directory metadata to detect implicit directories
4. **s3fs compatibility**: Triggers the proper LIST fallback behavior
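A minimal boto3 sketch of the hybrid behavior after the fix, assuming a running SeaweedFS S3 endpoint with the test credentials; the bucket name is hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="http://localhost:8333",
                  aws_access_key_id="some_access_key1",
                  aws_secret_access_key="some_secret_key1")
bucket = "test-bucket"

# Explicit marker: a 0-byte object whose key ends with "/" still answers HEAD with 200.
s3.put_object(Bucket=bucket, Key="explicit-dir/")
s3.head_object(Bucket=bucket, Key="explicit-dir/")  # 200 OK

# Implicit directory: exists only because a child object lives under the prefix.
s3.put_object(Bucket=bucket, Key="implicit-dir/file.txt", Body=b"x")
try:
    s3.head_object(Bucket=bucket, Key="implicit-dir")
except ClientError as e:
    assert e.response["Error"]["Code"] == "404"  # matches AWS S3 / MinIO
```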
## Key Differences

| Aspect | MinIO | SeaweedFS (After Fix) |
|--------|-------|----------------------|
| Directory Storage | No persistent objects | Filer directory entries |
| Implicit Directory HEAD | 404 Not Found | 404 Not Found |
| Explicit Marker HEAD | Not applicable | 200 OK (with `/` suffix) |
| Child Detection | Prefix scan | Filer LIST operation |
| Performance | In-memory lookups | Filer gRPC calls |

## Implementation Considerations

### Advantages of SeaweedFS Approach

- Integrates with existing filer metadata
- Supports both implicit and explicit directories
- Preserves directory metadata and attributes
- Compatible with POSIX filer semantics

### Trade-offs

- Additional filer communication overhead for HEAD requests
- Complexity of supporting both directory paradigms
- Performance depends on filer efficiency

## TODO

- [ ] Add performance benchmark comparison: MinIO vs SeaweedFS
- [ ] Document edge cases where behaviors differ
- [ ] Add example request/response traces for both systems
- [ ] Document migration path for users moving from MinIO to SeaweedFS
- [ ] Add compatibility matrix for different S3 clients
test/s3/parquet/TEST_COVERAGE.md
@@ -1,46 +0,0 @@
# Test Coverage Documentation

## Overview

This document provides comprehensive test coverage documentation for the SeaweedFS S3 Parquet integration tests.

## Test Categories

### Unit Tests (Go)

- 17 test cases covering S3 API handlers
- Tests for implicit directory handling
- HEAD request behavior validation
- Located in: `weed/s3api/s3api_implicit_directory_test.go`

### Integration Tests (Python)

- 6 test cases for the implicit directory fix
- Tests HEAD request behavior on directory markers
- s3fs directory detection validation
- PyArrow dataset read compatibility
- Located in: `test_implicit_directory_fix.py`

### End-to-End Tests (Python)

- 20 test cases combining write and read methods (see the sketch after this list)
- Small file tests (5 rows): 10 test combinations
- Large file tests (200,000 rows): 10 test combinations
- Tests multiple write methods: `pads.write_dataset`, `pq.write_table+s3fs`
- Tests multiple read methods: `pads.dataset`, `pq.ParquetDataset`, `pq.read_table`, `s3fs+direct`, `s3fs+buffered`
- Located in: `s3_parquet_test.py`
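The 20 end-to-end combinations come from crossing 2 write methods with 5 read methods at each of the 2 file sizes (10 per size); a sketch of enumerating such a matrix, using the method names listed above:

```python
from itertools import product

write_methods = ["pads.write_dataset", "pq.write_table+s3fs"]
read_methods = ["pads.dataset", "pq.ParquetDataset", "pq.read_table",
                "s3fs+direct", "s3fs+buffered"]
row_counts = [5, 200_000]  # small and large file tests

matrix = list(product(row_counts, write_methods, read_methods))
assert len(matrix) == 20  # 10 combinations per file size
```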
## Coverage Summary

| Test Type | Count | Status |
|-----------|-------|--------|
| Unit Tests (Go) | 17 | ✅ Pass |
| Integration Tests (Python) | 6 | ✅ Pass |
| End-to-End Tests (Python) | 20 | ✅ Pass |
| **Total** | **43** | **✅ All Pass** |

## TODO

- [ ] Add detailed test execution time metrics
- [ ] Document test data generation strategies
- [ ] Add code coverage percentages for Go tests
- [ ] Document edge cases and corner cases tested
- [ ] Add performance benchmarking results