PyArrow Parquet S3 Compatibility Tests

This directory contains tests for PyArrow Parquet compatibility with the SeaweedFS S3 API, including the implicit directory detection fix.

Overview

Status: All PyArrow methods work correctly with SeaweedFS

SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using write_dataset(), it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to use LIST-based discovery.
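
For example, after a write_dataset() call the bucket may contain a layout like this (illustrative):

bucket/dataset                  <- 0-byte directory marker (implicit directory)
bucket/dataset/part-0.parquet   <- actual data file

A HEAD request on bucket/dataset now returns 404 because the path has children, so s3fs falls back to LIST and correctly reports it as a directory.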

Quick Start

Running the Example Script

# Start SeaweedFS server
make start-seaweedfs-ci

# Run the example script
python3 example_pyarrow_native.py

# Or with uv (if available)
uv run example_pyarrow_native.py

# Stop the server when done
make stop-seaweedfs-safe

Running Tests

# Setup Python environment
make setup-python

# Run all tests with server (small and large files)
make test-with-server

# Run quick tests with small files only (faster for development)
make test-quick

# Run implicit directory fix tests
make test-implicit-dir-with-server

# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server

# Run cross-filesystem compatibility tests (s3fs ↔ PyArrow native)
make test-cross-fs-with-server

# Run SSE-S3 encryption tests
make test-sse-s3-compat

# Clean up
make clean

Using PyArrow with SeaweedFS

Option 1: Using s3fs

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import s3fs

# Configure s3fs
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False
)

# Write dataset (creates directory structure)
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=fs)

# Read dataset (all methods work!)
dataset = pads.dataset('bucket/dataset', filesystem=fs)  # ✅
table = pq.read_table('bucket/dataset', filesystem=fs)   # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs)  # ✅
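
To sanity-check what write_dataset() created, s3fs can list the dataset prefix; a minimal sketch (the exact file names and info fields are illustrative):

# Optional: verify the dataset layout.
# With the implicit directory fix, info() reports the dataset path
# as a directory rather than a 0-byte file.
print(fs.ls('bucket/dataset'))    # e.g. ['bucket/dataset/part-0.parquet']
print(fs.info('bucket/dataset'))  # e.g. {'type': 'directory', ...}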

Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
    access_key='your_access_key',
    secret_key='your_secret_key',
    endpoint_override='localhost:8333',
    scheme='http',
    allow_bucket_creation=True,
    allow_bucket_deletion=True
)

# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)

# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3)  # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3)  # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3)  # ✅
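
If the target bucket does not exist yet, it can be created through the same filesystem object; a minimal sketch that relies on the allow_bucket_creation=True flag set above:

# Create the target bucket before the first write.
s3.create_dir('bucket')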

Test Files

Main Test Suite

  • s3_parquet_test.py - Comprehensive PyArrow test suite
    • Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
    • Uses s3fs library for S3 operations
    • All tests pass with the implicit directory fix

PyArrow Native S3 Tests

  • test_pyarrow_native_s3.py - PyArrow's native S3 filesystem tests

    • Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
    • Pure PyArrow solution without s3fs dependency
    • Tests 3 read methods × 2 dataset sizes = 6 scenarios
    • All tests pass
  • test_sse_s3_compatibility.py - SSE-S3 encryption compatibility tests

    • Tests PyArrow native S3 with SSE-S3 server-side encryption (see the sketch after this list)
    • Tests 5 different file sizes (10 to 500,000 rows)
    • Verifies multipart upload encryption works correctly
    • All tests pass
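
One way to exercise SSE-S3 from Python is with boto3 (a sketch only; endpoint and credentials are placeholders, and the real test logic lives in test_sse_s3_compatibility.py):

import boto3

s3c = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

# Upload with SSE-S3 requested, then confirm the object reports
# AES256 server-side encryption on HEAD.
s3c.put_object(Bucket='bucket', Key='enc.bin', Body=b'data',
               ServerSideEncryption='AES256')
head = s3c.head_object(Bucket='bucket', Key='enc.bin')
assert head.get('ServerSideEncryption') == 'AES256'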

Cross-Filesystem Compatibility Tests

  • test_cross_filesystem_compatibility.py - Verifies cross-compatibility between s3fs and PyArrow native S3
    • Tests write with s3fs → read with PyArrow native S3
    • Tests write with PyArrow native S3 → read with s3fs
    • Tests 2 directions × 3 read methods × 2 dataset sizes = 12 scenarios
    • Validates that files written by either filesystem can be read by the other (sketched below)
    • All tests pass
    • See CROSS_FILESYSTEM_COMPATIBILITY.md for detailed test results and analysis
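
A condensed sketch of one direction of the round trip these tests exercise (the test file covers both directions and additional read methods):

import pyarrow as pa
import pyarrow.dataset as pads
import pyarrow.fs as pafs
import s3fs

fs = s3fs.S3FileSystem(key='your_access_key', secret='your_secret_key',
                       endpoint_url='http://localhost:8333', use_ssl=False)
s3 = pafs.S3FileSystem(access_key='your_access_key', secret_key='your_secret_key',
                       endpoint_override='localhost:8333', scheme='http')

# Write with s3fs, then read the same dataset back with
# PyArrow's native S3 filesystem.
table = pa.table({'id': [1, 2, 3]})
pads.write_dataset(table, 'bucket/crossfs', filesystem=fs)
roundtrip = pads.dataset('bucket/crossfs', filesystem=s3).to_table()
assert roundtrip.num_rows == table.num_rows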

Implicit Directory Tests

  • test_implicit_directory_fix.py - Specific tests for the implicit directory fix
    • Tests HEAD request behavior
    • Tests s3fs directory detection
    • Tests PyArrow dataset reading
    • All 6 tests pass

Examples

  • example_pyarrow_native.py - Simple standalone example
    • Demonstrates PyArrow's native S3 filesystem usage
    • Can be run with uv run or regular Python
    • Minimal dependencies (pyarrow, boto3)

Configuration

  • Makefile - Build and test automation
  • requirements.txt - Python dependencies (pyarrow, s3fs, boto3)
  • .gitignore - Ignore patterns for test artifacts

Documentation

Technical Documentation

  • TEST_COVERAGE.md - Comprehensive test coverage documentation

    • Unit tests (Go): 17 test cases
    • Integration tests (Python): 6 test cases
    • End-to-end tests (Python): 20 test cases
  • FINAL_ROOT_CAUSE_ANALYSIS.md - Deep technical analysis

    • Root cause of the s3fs compatibility issue
    • How the implicit directory fix works
    • Performance considerations
  • CROSS_FILESYSTEM_COMPATIBILITY.md - Cross-filesystem compatibility test results NEW

    • Validates s3fs ↔ PyArrow native S3 interoperability
    • Confirms files written by either can be read by the other
    • Test methodology and detailed results
  • MINIO_DIRECTORY_HANDLING.md - Comparison with MinIO

    • How MinIO handles directory markers
    • Differences in implementation approaches

The Implicit Directory Fix

Problem

When PyArrow writes datasets with write_dataset(), it may create 0-byte directory markers. s3fs's info() method calls HEAD on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".

Solution

SeaweedFS now returns 404 for HEAD requests on implicit directories (0-byte objects or directories with children, when requested without a trailing slash). This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.
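
The new behavior is easy to observe directly with boto3; a minimal sketch, assuming a dataset was written to bucket/dataset as in the examples above:

import boto3
from botocore.exceptions import ClientError

s3c = boto3.client('s3', endpoint_url='http://localhost:8333',
                   aws_access_key_id='your_access_key',
                   aws_secret_access_key='your_secret_key')

# HEAD on the implicit directory (no trailing slash, has children) -> 404.
try:
    s3c.head_object(Bucket='bucket', Key='dataset')
except ClientError as e:
    assert e.response['Error']['Code'] == '404'

# LIST still finds the children, which is how s3fs's fallback
# correctly classifies 'dataset' as a directory.
resp = s3c.list_objects_v2(Bucket='bucket', Prefix='dataset/', MaxKeys=1)
assert resp['KeyCount'] >= 1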

Implementation

The fix is implemented in weed/s3api/s3api_object_handlers.go:

  • HeadObjectHandler - Returns 404 for implicit directories
  • hasChildren - Helper function to check if a path has children

See the source code for detailed inline documentation.

Test Coverage

  • Unit tests (Go): weed/s3api/s3api_implicit_directory_test.go

    • Run: cd weed/s3api && go test -v -run TestImplicitDirectory
  • Integration tests (Python): test_implicit_directory_fix.py

    • Run: cd test/s3/parquet && make test-implicit-dir-with-server
  • End-to-end tests (Python): s3_parquet_test.py

    • Run: cd test/s3/parquet && make test-with-server

Makefile Targets

# Setup
make setup-python          # Create Python virtual environment and install dependencies
make build-weed           # Build SeaweedFS binary

# Testing
make test                 # Run full tests (assumes server is already running)
make test-with-server     # Run full PyArrow test suite with server (small + large files)
make test-quick           # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server  # Run implicit directory tests with server
make test-native-s3       # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server  # Run PyArrow native S3 tests with server management
make test-cross-fs        # Run cross-filesystem compatibility tests (assumes server is running)
make test-cross-fs-with-server  # Run cross-filesystem compatibility tests with server management
make test-sse-s3-compat   # Run comprehensive SSE-S3 encryption compatibility tests

# Server Management
make start-seaweedfs-ci   # Start SeaweedFS in background (CI mode)
make stop-seaweedfs-safe  # Stop SeaweedFS gracefully
make clean                # Clean up all test artifacts

# Development
make help                 # Show all available targets

Continuous Integration

The tests are automatically run in GitHub Actions on every push/PR that affects S3 or filer code:

Workflow: .github/workflows/s3-parquet-tests.yml

Test Matrix:

  • Python versions: 3.9, 3.11, 3.12
  • PyArrow integration tests (s3fs): 20 test combinations
  • PyArrow native S3 tests: 6 test scenarios
  • Cross-filesystem compatibility tests: 12 test scenarios NEW
  • SSE-S3 encryption tests: 5 file sizes
  • Implicit directory fix tests: 6 test scenarios
  • Go unit tests: 17 test cases

Test Steps (run for each Python version):

  1. Build SeaweedFS
  2. Run PyArrow Parquet integration tests (make test-with-server)
  3. Run implicit directory fix tests (make test-implicit-dir-with-server)
  4. Run PyArrow native S3 filesystem tests (make test-native-s3-with-server)
  5. Run cross-filesystem compatibility tests (make test-cross-fs-with-server) NEW
  6. Run SSE-S3 encryption compatibility tests (make test-sse-s3-compat)
  7. Run Go unit tests for implicit directory handling

Triggers:

  • Push/PR to master (when weed/s3api/** or weed/filer/** changes)
  • Manual trigger via GitHub UI (workflow_dispatch)

Requirements

  • Python 3.8+
  • PyArrow 22.0.0+
  • s3fs 2024.12.0+
  • boto3 1.40.0+
  • SeaweedFS (latest)

AWS S3 Compatibility

The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:

  • AWS S3 typically doesn't create directory markers for implicit directories
  • HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
  • SeaweedFS now matches this behavior for implicit directories with children

Edge Cases Handled

  • Implicit directories with children → 404 (forces LIST-based discovery)
  • Empty files (0-byte, no children) → 200 (legitimate empty file)
  • Empty directories (no children) → 200 (legitimate empty directory)
  • Explicit directory requests (trailing slash) → 200 (normal directory behavior)
  • Versioned buckets → skip implicit directory check (versioned semantics)
  • Regular files → 200 (normal file behavior)

Performance

The implicit directory check adds minimal overhead:

  • Only triggered for 0-byte objects or directories requested without a trailing slash
  • Cost: One LIST operation with Limit=1 (~1-5ms)
  • No impact on regular file operations

Contributing

When adding new tests:

  1. Add test cases to the appropriate test file
  2. Update TEST_COVERAGE.md
  3. Run the full test suite to ensure no regressions
  4. Update this README if adding new functionality

Last Updated: November 19, 2025
Status: All tests passing