Final Root Cause Analysis
Overview
This document provides a deep technical analysis of the s3fs compatibility issue with PyArrow Parquet datasets on SeaweedFS, and the solution implemented to resolve it.
Root Cause
PyArrow's write_dataset() writes data files under a common prefix and never creates explicit directory markers, so the dataset's directories are implicit. Some S3 workflows, however, create 0-byte objects at directory paths, and such a marker can be mistaken for an empty file.
The Problem
- PyArrow writes dataset files without creating explicit directory objects
- s3fs calls HEAD on the directory path to check if it exists
- If HEAD returns 200 with Content-Length: 0, s3fs interprets it as a file (not a directory)
- PyArrow fails when trying to read, reporting "Parquet file size is 0 bytes"
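The failure mode above can be illustrated with a small model. This is a simplified sketch of the decision a client makes from a HEAD response, not s3fs's actual code; HeadResponse and classify_from_head are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeadResponse:
    """Hypothetical, minimal view of an S3 HEAD response."""
    status: int                           # HTTP status code
    content_length: Optional[int] = None  # Content-Length header, if present


def classify_from_head(resp: HeadResponse) -> str:
    """Model how a client might interpret a HEAD response (assumption)."""
    if resp.status == 200:
        # Any 200 looks like an object. With Content-Length: 0 it looks
        # like an empty file, which is what breaks the Parquet read.
        return "file"
    if resp.status == 404:
        # No object at this key: the client has to LIST the prefix
        # before it can tell a directory from a missing path.
        return "needs-list"
    return "error"


# A 0-byte object at a directory path is classified as a file.
print(classify_from_head(HeadResponse(200, 0)))
```

In this model the server's only way to steer the client toward directory detection is to answer 404, which is the behavior the fix adopts.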
AWS S3 Behavior
AWS S3 returns 404 Not Found for implicit directories (directories that exist only because objects are stored beneath them, with no explicit marker object of their own). This allows s3fs to fall back to LIST operations to detect the directory.
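A minimal sketch of that HEAD-then-LIST fallback, using an in-memory dict as a stand-in for a bucket. The functions list_prefix and resolve are hypothetical and only model the logic; they are not s3fs or S3 API calls.

```python
from typing import Dict, List


def list_prefix(bucket: Dict[str, int], prefix: str) -> List[str]:
    """Stand-in for an S3 LIST: every key that starts with the prefix."""
    return sorted(k for k in bucket if k.startswith(prefix))


def resolve(bucket: Dict[str, int], path: str) -> str:
    """Classify a path the way a HEAD-then-LIST client would (assumption)."""
    key = path.rstrip("/")
    if key in bucket:
        # HEAD would return 200: an object is stored at exactly this key.
        return "file"
    # HEAD would return 404, so fall back to listing under "key/".
    if list_prefix(bucket, key + "/"):
        return "directory"
    return "missing"


# The bucket maps object keys to sizes; "ds" itself has no object.
bucket = {"ds/part-0.parquet": 1024}
print(resolve(bucket, "ds"))
```

Listing with a trailing slash on the prefix matters: without it, a key such as dsx/a would wrongly make ds look like a directory.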
The Solution
Implementation
Modified the S3 API HEAD handler in weed/s3api/s3api_object_handlers.go to:
- Check if object ends with /: Explicit directory markers return 200 as before
- Check if object has children: If a 0-byte object has children in the filer, treat it as an implicit directory
- Return 404 for implicit directories: This matches AWS S3 behavior and triggers s3fs's LIST fallback
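The three rules above can be sketched as a single decision function. This is a language-neutral model written in Python, not the Go code in the handler; head_status, entry_size, and has_children are hypothetical names, and the real handler works on filer entries rather than plain sizes.

```python
from typing import Callable, Optional


def head_status(
    key: str,
    entry_size: Optional[int],
    has_children: Callable[[str], bool],
) -> int:
    """Return the HTTP status a HEAD request would get under the fix (model)."""
    if entry_size is None:
        # Nothing is stored at this key.
        return 404
    if key.endswith("/"):
        # Explicit directory marker: unchanged behavior.
        return 200
    if entry_size == 0 and has_children(key):
        # 0-byte object with children: treat as an implicit directory so
        # the client falls back to LIST.
        return 404
    # Regular file, including a genuinely empty one.
    return 200


# The child check is only invoked for 0-byte objects without a trailing slash.
print(head_status("ds", 0, lambda k: True))
print(head_status("ds/part-0.parquet", 1024, lambda k: True))
```

Passing the child check as a callable keeps it lazy: in this model, non-empty objects and explicit markers never trigger the extra lookup.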
Code Changes
The fix is implemented in the HeadObjectHandler function with logic to:
- Detect implicit directories by checking for child entries
- Return 404 (NoSuchKey) for implicit directories
- Preserve existing behavior for explicit directory markers and regular files
Performance Considerations
Optimization: Child Check Cache
- Child existence checks are performed via filer LIST operations
- Results could be cached for frequently accessed paths
- Trade-off between consistency and performance
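One way such a cache could look is a small time-to-live (TTL) map, sketched below. Nothing here is part of the implemented fix: ChildCheckCache, its parameters, and the 5-second TTL are assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class ChildCheckCache:
    """Hypothetical TTL cache in front of a child-existence check."""

    def __init__(
        self,
        check: Callable[[str], bool],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check      # the expensive lookup, e.g. a filer LIST
        self._ttl = ttl_seconds  # how long a cached answer stays valid
        self._clock = clock      # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def has_children(self, path: str) -> bool:
        now = self._clock()
        cached = self._entries.get(path)
        if cached is not None and now - cached[0] < self._ttl:
            # Fresh enough: may be stale relative to the filer.
            return cached[1]
        result = self._check(path)
        self._entries[path] = (now, result)
        return result


cache = ChildCheckCache(lambda path: True)
print(cache.has_children("bucket/ds"))
```

The TTL is where the consistency trade-off shows up: a longer TTL saves more LIST calls, but a path that gains or loses children can be misclassified until its entry expires.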
Impact
- Minimal performance impact for normal file operations
- Slight overhead for HEAD requests on implicit directories (one additional LIST call)
- Overall improvement in PyArrow compatibility outweighs minor performance cost
TODO
- Add detailed benchmarking results comparing before/after fix
- Document edge cases discovered during implementation
- Add architectural diagrams showing the request flow
- Document alternative solutions considered and why they were rejected
- Add performance profiling data for child existence checks