PyArrow Parquet S3 Compatibility Tests

This directory contains tests for PyArrow Parquet compatibility with the SeaweedFS S3 API, including the implicit directory detection fix.

Overview

Status: All PyArrow methods work correctly with SeaweedFS

SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using write_dataset(), it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to use LIST-based discovery.
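
For example, after a write_dataset() call the bucket may contain a layout like this (illustrative):

bucket/dataset                  <- 0-byte directory marker (implicit directory)
bucket/dataset/part-0.parquet   <- actual data file

A HEAD request on bucket/dataset now returns 404 because the path has children, so s3fs falls back to LIST and correctly reports it as a directory.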

Quick Start

Running the Example Script

# Start SeaweedFS server
make start-seaweedfs-ci

# Run the example script
python3 example_pyarrow_native.py

# Or with uv (if available)
uv run example_pyarrow_native.py

# Stop the server when done
make stop-seaweedfs-safe

Running Tests

# Setup Python environment
make setup-python

# Run all tests with server (small and large files)
make test-with-server

# Run quick tests with small files only (faster for development)
make test-quick

# Run implicit directory fix tests
make test-implicit-dir-with-server

# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server

# Run cross-filesystem compatibility tests (s3fs ↔ PyArrow native)
make test-cross-fs-with-server

# Run SSE-S3 encryption tests
make test-sse-s3-compat

# Clean up
make clean

Using PyArrow with SeaweedFS

Option 1: Using s3fs

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import s3fs

# Configure s3fs
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False
)

# Write dataset (creates directory structure)
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=fs)

# Read dataset (all methods work!)
dataset = pads.dataset('bucket/dataset', filesystem=fs)  # ✅
table = pq.read_table('bucket/dataset', filesystem=fs)   # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs)  # ✅
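
To sanity-check what write_dataset() created, s3fs can list the dataset prefix; a minimal sketch (the exact file names and info fields are illustrative):

# Optional: verify the dataset layout.
# With the implicit directory fix, info() reports the dataset path
# as a directory rather than a 0-byte file.
print(fs.ls('bucket/dataset'))    # e.g. ['bucket/dataset/part-0.parquet']
print(fs.info('bucket/dataset'))  # e.g. {'type': 'directory', ...}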

Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs

# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
    access_key='your_access_key',
    secret_key='your_secret_key',
    endpoint_override='localhost:8333',
    scheme='http',
    allow_bucket_creation=True,
    allow_bucket_deletion=True
)

# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)

# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3)  # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3)  # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3)  # ✅
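
If the target bucket does not exist yet, it can be created through the same filesystem object; a minimal sketch that relies on the allow_bucket_creation=True flag set above:

# Create the target bucket before the first write.
s3.create_dir('bucket')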

Test Files

Main Test Suite

  • s3_parquet_test.py - Comprehensive PyArrow test suite
    • Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
    • Uses s3fs library for S3 operations
    • All tests pass with the implicit directory fix

PyArrow Native S3 Tests

  • test_pyarrow_native_s3.py - PyArrow's native S3 filesystem tests

    • Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
    • Pure PyArrow solution without s3fs dependency
    • Tests 3 read methods × 2 dataset sizes = 6 scenarios
    • All tests pass
  • test_sse_s3_compatibility.py - SSE-S3 encryption compatibility tests

    • Tests PyArrow native S3 with SSE-S3 server-side encryption (see the sketch after this list)
    • Tests 5 different file sizes (10 to 500,000 rows)
    • Verifies multipart upload encryption works correctly
    • All tests pass
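
One way to exercise SSE-S3 from Python is with boto3 (a sketch only; endpoint and credentials are placeholders, and the real test logic lives in test_sse_s3_compatibility.py):

import boto3

s3c = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

# Upload with SSE-S3 requested, then confirm the object reports
# AES256 server-side encryption on HEAD.
s3c.put_object(Bucket='bucket', Key='enc.bin', Body=b'data',
               ServerSideEncryption='AES256')
head = s3c.head_object(Bucket='bucket', Key='enc.bin')
assert head.get('ServerSideEncryption') == 'AES256'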

Cross-Filesystem Compatibility Tests

  • test_cross_filesystem_compatibility.py - Verifies cross-compatibility between s3fs and PyArrow native S3
    • Tests write with s3fs → read with PyArrow native S3
    • Tests write with PyArrow native S3 → read with s3fs
    • Tests 2 directions × 3 read methods × 2 dataset sizes = 12 scenarios
    • Validates that files written by either filesystem can be read by the other (sketched below)
    • All tests pass
    • See CROSS_FILESYSTEM_COMPATIBILITY.md for detailed test results and analysis
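
A condensed sketch of one direction of the round trip these tests exercise (the test file covers both directions and additional read methods):

import pyarrow as pa
import pyarrow.dataset as pads
import pyarrow.fs as pafs
import s3fs

fs = s3fs.S3FileSystem(key='your_access_key', secret='your_secret_key',
                       endpoint_url='http://localhost:8333', use_ssl=False)
s3 = pafs.S3FileSystem(access_key='your_access_key', secret_key='your_secret_key',
                       endpoint_override='localhost:8333', scheme='http')

# Write with s3fs, then read the same dataset back with
# PyArrow's native S3 filesystem.
table = pa.table({'id': [1, 2, 3]})
pads.write_dataset(table, 'bucket/crossfs', filesystem=fs)
roundtrip = pads.dataset('bucket/crossfs', filesystem=s3).to_table()
assert roundtrip.num_rows == table.num_rows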

Implicit Directory Tests

  • test_implicit_directory_fix.py - Specific tests for the implicit directory fix
    • Tests HEAD request behavior
    • Tests s3fs directory detection
    • Tests PyArrow dataset reading
    • All 6 tests pass

Examples

  • example_pyarrow_native.py - Simple standalone example
    • Demonstrates PyArrow's native S3 filesystem usage
    • Can be run with uv run or regular Python
    • Minimal dependencies (pyarrow, boto3)

Configuration

  • Makefile - Build and test automation
  • requirements.txt - Python dependencies (pyarrow, s3fs, boto3)
  • .gitignore - Ignore patterns for test artifacts

Documentation

Technical Documentation

  • TEST_COVERAGE.md - Comprehensive test coverage documentation

    • Unit tests (Go): 17 test cases
    • Integration tests (Python): 6 test cases
    • End-to-end tests (Python): 20 test cases
  • FINAL_ROOT_CAUSE_ANALYSIS.md - Deep technical analysis

    • Root cause of the s3fs compatibility issue
    • How the implicit directory fix works
    • Performance considerations
  • CROSS_FILESYSTEM_COMPATIBILITY.md - Cross-filesystem compatibility test results NEW

    • Validates s3fs ↔ PyArrow native S3 interoperability
    • Confirms files written by either can be read by the other
    • Test methodology and detailed results
  • MINIO_DIRECTORY_HANDLING.md - Comparison with MinIO

    • How MinIO handles directory markers
    • Differences in implementation approaches

The Implicit Directory Fix

Problem

When PyArrow writes datasets with write_dataset(), it may create 0-byte directory markers. s3fs's info() method calls HEAD on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".

Solution

SeaweedFS now returns 404 for HEAD requests on implicit directories (0-byte objects or directories with children, when requested without a trailing slash). This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.
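
The new behavior is easy to observe directly with boto3; a minimal sketch, assuming a dataset was written to bucket/dataset as in the examples above:

import boto3
from botocore.exceptions import ClientError

s3c = boto3.client('s3', endpoint_url='http://localhost:8333',
                   aws_access_key_id='your_access_key',
                   aws_secret_access_key='your_secret_key')

# HEAD on the implicit directory (no trailing slash, has children) -> 404.
try:
    s3c.head_object(Bucket='bucket', Key='dataset')
except ClientError as e:
    assert e.response['Error']['Code'] == '404'

# LIST still finds the children, which is how s3fs's fallback
# correctly classifies 'dataset' as a directory.
resp = s3c.list_objects_v2(Bucket='bucket', Prefix='dataset/', MaxKeys=1)
assert resp['KeyCount'] >= 1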

Implementation

The fix is implemented in weed/s3api/s3api_object_handlers.go:

  • HeadObjectHandler - Returns 404 for implicit directories
  • hasChildren - Helper function to check if a path has children

See the source code for detailed inline documentation.

Test Coverage

  • Unit tests (Go): weed/s3api/s3api_implicit_directory_test.go

    • Run: cd weed/s3api && go test -v -run TestImplicitDirectory
  • Integration tests (Python): test_implicit_directory_fix.py

    • Run: cd test/s3/parquet && make test-implicit-dir-with-server
  • End-to-end tests (Python): s3_parquet_test.py

    • Run: cd test/s3/parquet && make test-with-server

Makefile Targets

# Setup
make setup-python          # Create Python virtual environment and install dependencies
make build-weed           # Build SeaweedFS binary

# Testing
make test                 # Run full tests (assumes server is already running)
make test-with-server     # Run full PyArrow test suite with server (small + large files)
make test-quick           # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server  # Run implicit directory tests with server
make test-native-s3       # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server  # Run PyArrow native S3 tests with server management
make test-cross-fs        # Run cross-filesystem compatibility tests (assumes server is running)
make test-cross-fs-with-server  # Run cross-filesystem compatibility tests with server management
make test-sse-s3-compat   # Run comprehensive SSE-S3 encryption compatibility tests

# Server Management
make start-seaweedfs-ci   # Start SeaweedFS in background (CI mode)
make stop-seaweedfs-safe  # Stop SeaweedFS gracefully
make clean                # Clean up all test artifacts

# Development
make help                 # Show all available targets

Continuous Integration

The tests are automatically run in GitHub Actions on every push/PR that affects S3 or filer code:

Workflow: .github/workflows/s3-parquet-tests.yml

Test Matrix:

  • Python versions: 3.9, 3.11, 3.12
  • PyArrow integration tests (s3fs): 20 test combinations
  • PyArrow native S3 tests: 6 test scenarios
  • Cross-filesystem compatibility tests: 12 test scenarios NEW
  • SSE-S3 encryption tests: 5 file sizes
  • Implicit directory fix tests: 6 test scenarios
  • Go unit tests: 17 test cases

Test Steps (run for each Python version):

  1. Build SeaweedFS
  2. Run PyArrow Parquet integration tests (make test-with-server)
  3. Run implicit directory fix tests (make test-implicit-dir-with-server)
  4. Run PyArrow native S3 filesystem tests (make test-native-s3-with-server)
  5. Run cross-filesystem compatibility tests (make test-cross-fs-with-server) NEW
  6. Run SSE-S3 encryption compatibility tests (make test-sse-s3-compat)
  7. Run Go unit tests for implicit directory handling

Triggers:

  • Push/PR to master (when weed/s3api/** or weed/filer/** changes)
  • Manual trigger via GitHub UI (workflow_dispatch)

Requirements

  • Python 3.8+
  • PyArrow 22.0.0+
  • s3fs 2024.12.0+
  • boto3 1.40.0+
  • SeaweedFS (latest)

AWS S3 Compatibility

The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:

  • AWS S3 typically doesn't create directory markers for implicit directories
  • HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
  • SeaweedFS now matches this behavior for implicit directories with children

Edge Cases Handled

  • Implicit directories with children → 404 (forces LIST-based discovery)
  • Empty files (0-byte, no children) → 200 (legitimate empty file)
  • Empty directories (no children) → 200 (legitimate empty directory)
  • Explicit directory requests (trailing slash) → 200 (normal directory behavior)
  • Versioned buckets → skip implicit directory check (versioned semantics)
  • Regular files → 200 (normal file behavior)

Performance

The implicit directory check adds minimal overhead:

  • Only triggered for 0-byte objects or directories requested without a trailing slash
  • Cost: One LIST operation with Limit=1 (~1-5ms)
  • No impact on regular file operations

Contributing

When adding new tests:

  1. Add test cases to the appropriate test file
  2. Update TEST_COVERAGE.md
  3. Run the full test suite to ensure no regressions
  4. Update this README if adding new functionality

Last Updated: November 19, 2025
Status: All tests passing