# PyArrow Parquet S3 Compatibility Tests

This directory contains tests for PyArrow Parquet compatibility with the SeaweedFS S3 API, including the implicit directory detection fix.

## Overview

**Status**: ✅ **All PyArrow methods work correctly with SeaweedFS**

SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using `write_dataset()`, it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to fall back to LIST-based discovery.

## Quick Start

### Running the Example Script

```bash
# Start SeaweedFS server
make start-seaweedfs-ci

# Run the example script
python3 example_pyarrow_native.py

# Or with uv (if available)
uv run example_pyarrow_native.py

# Stop the server when done
make stop-seaweedfs-safe
```

### Running Tests

```bash
# Set up the Python environment
make setup-python

# Run all tests with server (small and large files)
make test-with-server

# Run quick tests with small files only (faster for development)
make test-quick

# Run implicit directory fix tests
make test-implicit-dir-with-server

# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server

# Run SSE-S3 encryption tests
make test-sse-s3-compat

# Clean up
make clean
```

### Using PyArrow with SeaweedFS

#### Option 1: Using s3fs (recommended for compatibility)

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import s3fs

# Configure s3fs
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False
)

# Write dataset (creates directory structure)
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=fs)

# Read dataset (all methods work!)
dataset = pads.dataset('bucket/dataset', filesystem=fs)       # ✅
table = pq.read_table('bucket/dataset', filesystem=fs)        # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs)  # ✅
```

#### Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
    access_key='your_access_key',
    secret_key='your_secret_key',
    endpoint_override='localhost:8333',
    scheme='http',
    allow_bucket_creation=True,
    allow_bucket_deletion=True
)

# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)

# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3)        # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3)  # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3)       # ✅
```
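Either filesystem also handles partitioned datasets. The sketch below is illustrative rather than part of the test suite: it reuses the Option 2 connection settings, and the bucket and dataset paths are placeholders.

```python
import pyarrow as pa
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Same illustrative connection settings as Option 2 above.
s3 = pafs.S3FileSystem(access_key='your_access_key', secret_key='your_secret_key',
                       endpoint_override='localhost:8333', scheme='http',
                       allow_bucket_creation=True)

table = pa.table({'year': [2024, 2024, 2025], 'value': ['a', 'b', 'c']})

# Write one subdirectory per distinct 'year' value (directory partitioning).
pads.write_dataset(table, 'bucket/partitioned', filesystem=s3,
                   partitioning=['year'],
                   existing_data_behavior='overwrite_or_ignore')

# Read it back; the partition column is reconstructed from the paths.
dataset = pads.dataset('bucket/partitioned', filesystem=s3, partitioning=['year'])
print(dataset.to_table().sort_by('value').to_pydict())
```

Directory partitioning creates one subdirectory per distinct `year` value, which is exactly the kind of layout that exercises the implicit directory handling described below.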
## Test Files

### Main Test Suite

- **`s3_parquet_test.py`** - Comprehensive PyArrow test suite
  - Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
  - Uses the s3fs library for S3 operations
  - All tests pass with the implicit directory fix ✅

### PyArrow Native S3 Tests

- **`test_pyarrow_native_s3.py`** - PyArrow's native S3 filesystem tests
  - Tests PyArrow's built-in S3FileSystem (`pyarrow.fs.S3FileSystem`)
  - Pure PyArrow solution without the s3fs dependency
  - Tests 3 read methods × 2 dataset sizes = 6 scenarios
  - All tests pass ✅
- **`test_sse_s3_compatibility.py`** - SSE-S3 encryption compatibility tests
  - Tests PyArrow native S3 with SSE-S3 server-side encryption
  - Tests 5 different file sizes (10 to 500,000 rows)
  - Verifies that multipart upload encryption works correctly
  - All tests pass ✅

### Implicit Directory Tests

- **`test_implicit_directory_fix.py`** - Specific tests for the implicit directory fix
  - Tests HEAD request behavior
  - Tests s3fs directory detection
  - Tests PyArrow dataset reading
  - All 6 tests pass ✅

### Examples

- **`example_pyarrow_native.py`** - Simple standalone example
  - Demonstrates PyArrow's native S3 filesystem usage
  - Can be run with `uv run` or regular Python
  - Minimal dependencies (pyarrow, boto3)

### Configuration

- **`Makefile`** - Build and test automation
- **`requirements.txt`** - Python dependencies (pyarrow, s3fs, boto3)
- **`.gitignore`** - Ignore patterns for test artifacts

## Documentation

### Technical Documentation

- **`TEST_COVERAGE.md`** - Comprehensive test coverage documentation
  - Unit tests (Go): 17 test cases
  - Integration tests (Python): 6 test cases
  - End-to-end tests (Python): 20 test cases
- **`FINAL_ROOT_CAUSE_ANALYSIS.md`** - Deep technical analysis
  - Root cause of the s3fs compatibility issue
  - How the implicit directory fix works
  - Performance considerations
- **`MINIO_DIRECTORY_HANDLING.md`** - Comparison with MinIO
  - How MinIO handles directory markers
  - Differences in implementation approaches

## The Implicit Directory Fix

### Problem

When PyArrow writes datasets with `write_dataset()`, it may create 0-byte directory markers. s3fs's `info()` method issues a HEAD request on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".

### Solution

SeaweedFS now returns 404 for HEAD requests on implicit directories (0-byte objects or directories with children, when requested without a trailing slash). This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.

### Implementation

The fix is implemented in `weed/s3api/s3api_object_handlers.go`:

- `HeadObjectHandler` - Returns 404 for implicit directories
- `hasChildren` - Helper function that checks whether a path has children

See the source code for detailed inline documentation.
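The behavior is easy to observe with a bare S3 client. The following hedged sketch uses boto3 with the same illustrative credentials and endpoint as the examples above; the bucket and key names are placeholders, and the bucket is assumed to exist:

```python
import boto3
from botocore.exceptions import ClientError

# Illustrative connection settings; see the examples above.
s3 = boto3.client('s3', endpoint_url='http://localhost:8333',
                  aws_access_key_id='your_access_key',
                  aws_secret_access_key='your_secret_key')

# Create a child object, making 'dataset' an implicit directory.
s3.put_object(Bucket='bucket', Key='dataset/part-0.parquet', Body=b'data')

# HEAD on the implicit directory (no trailing slash) should return 404 ...
try:
    s3.head_object(Bucket='bucket', Key='dataset')
except ClientError as e:
    print(e.response['Error']['Code'])  # expected: '404'

# ... which forces clients like s3fs to fall back to LIST-based discovery.
resp = s3.list_objects_v2(Bucket='bucket', Prefix='dataset/', MaxKeys=1)
print(resp['KeyCount'])  # expected: 1
```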
### Test Coverage

- **Unit tests** (Go): `weed/s3api/s3api_implicit_directory_test.go`
  - Run: `cd weed/s3api && go test -v -run TestImplicitDirectory`
- **Integration tests** (Python): `test_implicit_directory_fix.py`
  - Run: `cd test/s3/parquet && make test-implicit-dir-with-server`
- **End-to-end tests** (Python): `s3_parquet_test.py`
  - Run: `cd test/s3/parquet && make test-with-server`

## Makefile Targets

```bash
# Setup
make setup-python                   # Create Python virtual environment and install dependencies
make build-weed                     # Build SeaweedFS binary

# Testing
make test                           # Run full tests (assumes server is already running)
make test-with-server               # Run full PyArrow test suite with server (small + large files)
make test-quick                     # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server  # Run implicit directory tests with server
make test-native-s3                 # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server     # Run PyArrow native S3 tests with server management
make test-sse-s3-compat             # Run comprehensive SSE-S3 encryption compatibility tests

# Server Management
make start-seaweedfs-ci             # Start SeaweedFS in background (CI mode)
make stop-seaweedfs-safe            # Stop SeaweedFS gracefully
make clean                          # Clean up all test artifacts

# Development
make help                           # Show all available targets
```

## Continuous Integration

The tests run automatically in GitHub Actions on every push/PR that affects S3 or filer code.

**Workflow**: `.github/workflows/s3-parquet-tests.yml`

**Test Matrix**:

- Python versions: 3.9, 3.11, 3.12
- PyArrow integration tests (s3fs): 20 test combinations
- PyArrow native S3 tests: 6 test scenarios ✅ **NEW**
- SSE-S3 encryption tests: 5 file sizes ✅ **NEW**
- Implicit directory fix tests: 6 test scenarios
- Go unit tests: 17 test cases

**Test Steps** (run for each Python version):

1. Build SeaweedFS
2. Run PyArrow Parquet integration tests (`make test-with-server`)
3. Run implicit directory fix tests (`make test-implicit-dir-with-server`)
4. Run PyArrow native S3 filesystem tests (`make test-native-s3-with-server`) ✅ **NEW**
5. Run SSE-S3 encryption compatibility tests (`make test-sse-s3-compat`) ✅ **NEW**
6. Run Go unit tests for implicit directory handling

**Triggers**:

- Push/PR to master (when `weed/s3api/**` or `weed/filer/**` changes)
- Manual trigger via the GitHub UI (workflow_dispatch)

## Requirements

- Python 3.8+
- PyArrow 22.0.0+
- s3fs 2024.12.0+
- boto3 1.40.0+
- SeaweedFS (latest)

## AWS S3 Compatibility

The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:

- AWS S3 typically doesn't create directory markers for implicit directories
- HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
- SeaweedFS now matches this behavior for implicit directories with children

## Edge Cases Handled

- ✅ **Implicit directories with children** → 404 (forces LIST-based discovery)
- ✅ **Empty files (0-byte, no children)** → 200 (legitimate empty file)
- ✅ **Empty directories (no children)** → 200 (legitimate empty directory)
- ✅ **Explicit directory requests (trailing slash)** → 200 (normal directory behavior)
- ✅ **Versioned buckets** → skip the implicit directory check (versioned semantics)
- ✅ **Regular files** → 200 (normal file behavior)
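These cases can also be verified at the s3fs layer. A minimal sketch, again with illustrative connection settings, assuming a dataset has already been written under `bucket/dataset`:

```python
import s3fs

# Illustrative connection settings; see Option 1 above.
fs = s3fs.S3FileSystem(key='your_access_key', secret='your_secret_key',
                       endpoint_url='http://localhost:8333', use_ssl=False)
fs.invalidate_cache()  # avoid stale cached listings

# Because HEAD on 'bucket/dataset' returns 404, s3fs falls back to a LIST
# and correctly classifies the path as a directory rather than a 0-byte file.
info = fs.info('bucket/dataset')
print(info['type'])  # expected: 'directory'
```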
## Performance

The implicit directory check adds minimal overhead:

- Only triggered for 0-byte objects or directory paths requested without a trailing slash
- Cost: one LIST operation with Limit=1 (~1-5ms)
- No impact on regular file operations

## Contributing

When adding new tests:

1. Add test cases to the appropriate test file
2. Update `TEST_COVERAGE.md`
3. Run the full test suite to ensure no regressions
4. Update this README if adding new functionality

## References

- [PyArrow Documentation](https://arrow.apache.org/docs/python/parquet.html)
- [s3fs Documentation](https://s3fs.readthedocs.io/)
- [SeaweedFS S3 API](https://github.com/seaweedfs/seaweedfs/wiki/Amazon-S3-API)
- [AWS S3 API Reference](https://docs.aws.amazon.com/AmazonS3/latest/API/)

---

**Last Updated**: November 19, 2025

**Status**: All tests passing ✅