Final Root Cause Analysis

Overview

This document provides a deep technical analysis of the s3fs compatibility issue with PyArrow Parquet datasets on SeaweedFS, and the solution implemented to resolve it.

Root Cause

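When PyArrow writes a dataset with write_dataset(), it writes the data files directly and never creates explicit directory marker objects, so the directories exist only implicitly as key prefixes. Some S3 workflows, however, do create 0-byte directory markers, and such a 0-byte entry at the directory path is what s3fs misreads.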

The Problem

  1. PyArrow writes dataset files without creating explicit directory objects
  2. s3fs calls HEAD on the directory path to check if it exists
  3. If HEAD returns 200 with Content-Length: 0, s3fs interprets it as a file (not a directory)
  4. PyArrow fails when trying to read, reporting "Parquet file size is 0 bytes"
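
The failure mode in steps 2 and 3 can be illustrated with a small model of a client that trusts HEAD before anything else. This is a simplified sketch, not the actual s3fs code: the HeadResponse type and the classify_from_head function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeadResponse:
    """Hypothetical stand-in for the parts of a HEAD response that matter here."""
    status: int
    content_length: Optional[int] = None


def classify_from_head(resp: HeadResponse) -> str:
    """Classify a path the way a HEAD-first client would (simplified model)."""
    if resp.status == 200:
        # Any successful HEAD is taken as proof of an object, so a
        # Content-Length of 0 reads as an empty file, not a directory.
        return "file"
    if resp.status == 404:
        # Nothing is stored at this exact key: only a LIST can tell
        # whether the path is a directory.
        return "needs_list"
    return "error"


# A 0-byte entry at the dataset path is classified as an empty file.
print(classify_from_head(HeadResponse(status=200, content_length=0)))
# A 404 pushes the client towards the LIST fallback.
print(classify_from_head(HeadResponse(status=404)))
```

Under this model, a 200 response with Content-Length: 0 never reaches the LIST step, which is why the reader ends up trying to open a 0-byte "Parquet file".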

AWS S3 Behavior

AWS S3 returns 404 Not Found for implicit directories, that is, directories that exist only because other objects share their prefix and that have no explicit marker object of their own. Because the HEAD request fails, s3fs falls back to a LIST operation to detect the directory.
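
The HEAD-then-LIST sequence described above can be sketched against an in-memory stand-in for a bucket. FakeBucket and is_directory are hypothetical names used for illustration; the sketch assumes a flat key namespace and does not call any real S3 or s3fs API.

```python
from typing import Dict, List, Optional, Tuple


class FakeBucket:
    """In-memory stand-in for a bucket: maps object keys to sizes in bytes."""

    def __init__(self, objects: Dict[str, int]) -> None:
        self.objects = dict(objects)

    def head(self, key: str) -> Tuple[int, Optional[int]]:
        # HEAD only succeeds for a key that is actually stored.
        if key in self.objects:
            return 200, self.objects[key]
        return 404, None

    def list_prefix(self, prefix: str) -> List[str]:
        # LIST returns every stored key that starts with the prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))


def is_directory(bucket: FakeBucket, path: str) -> bool:
    """Detect a directory by trying HEAD first and LIST second."""
    status, _ = bucket.head(path)
    if status == 200:
        return False  # an object exists at this exact key
    # HEAD returned 404, so fall back to LIST under "path/".
    prefix = path.rstrip("/") + "/"
    return len(bucket.list_prefix(prefix)) > 0


# Only a data file is stored; "dataset" exists purely as a prefix.
bucket = FakeBucket({"dataset/part-0.parquet": 1024})
print(bucket.head("dataset"))
print(is_directory(bucket, "dataset"))
```

Here the 404 from HEAD is what makes the LIST fallback run, so the implicit directory is detected correctly.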

The Solution

Implementation

The S3 API HEAD handler in weed/s3api/s3api_object_handlers.go was modified to:

  1. Check whether the object key ends with /: explicit directory markers still return 200, as before
  2. Check whether the object has children: a 0-byte object with child entries in the filer is treated as an implicit directory
  3. Return 404 for implicit directories: this matches AWS S3 behavior and triggers the LIST fallback in s3fs
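
The three steps can be modelled as a single decision function. The sketch below is written in Python for brevity and is not the Go code from s3api_object_handlers.go: head_status, has_children, and the stored dictionary are hypothetical names, and the child lookup is reduced to a prefix scan over an in-memory key set.

```python
from typing import Callable, Optional


def head_status(key: str, size: Optional[int],
                has_children: Callable[[str], bool]) -> int:
    """Return the HTTP status a HEAD on `key` would get in this model.

    size is the stored entry's size in bytes, or None if no entry exists.
    """
    if size is None:
        return 404  # nothing stored at this key
    if key.endswith("/"):
        return 200  # explicit directory marker: unchanged behavior
    if size == 0 and has_children(key):
        return 404  # implicit directory: force the client's LIST fallback
    return 200  # regular file, including a genuinely empty one


# Hypothetical filer contents used to exercise the function.
stored = {
    "dataset": 0,
    "dataset/part-0.parquet": 1024,
    "empty.txt": 0,
    "markers/": 0,
}


def has_children(key: str) -> bool:
    prefix = key + "/"
    return any(k.startswith(prefix) for k in stored)


for key in ("dataset", "dataset/part-0.parquet", "empty.txt", "markers/", "missing"):
    print(key, head_status(key, stored.get(key), has_children))
```

Note that a 0-byte object without children, such as empty.txt in this example, still returns 200, so legitimate empty files are unaffected.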

Code Changes

The fix is implemented in the HeadObjectHandler function with logic to:

  • Detect implicit directories by checking for child entries
  • Return 404 (NoSuchKey) for implicit directories
  • Preserve existing behavior for explicit directory markers and regular files

Performance Considerations

Optimization: Child Check Cache

  • Child existence checks are performed via filer LIST operations
  • Results could be cached for frequently accessed paths
  • Caching trades consistency for performance: a cached result can be stale if children are created or deleted after the check
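
One possible shape for such a cache is a small time-to-live (TTL) map in front of the child check. This is a sketch of the idea only and does not describe anything implemented in SeaweedFS: ChildCheckCache, its ttl_seconds parameter, and the injected check function are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class ChildCheckCache:
    """Cache child-existence results for a short, fixed time window."""

    def __init__(self, check: Callable[[str], bool], ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check      # the expensive lookup (e.g. a filer LIST)
        self._ttl = ttl_seconds  # how long a cached answer is trusted
        self._clock = clock      # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def has_children(self, path: str) -> bool:
        now = self._clock()
        cached = self._entries.get(path)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough: skip the lookup
        result = self._check(path)
        self._entries[path] = (now, result)
        return result
```

A shorter TTL narrows the window in which a stale answer can be served, at the cost of more lookups; a production version would also need locking and a size bound, both omitted here.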

Impact

  • Minimal performance impact for normal file operations
  • Slight overhead for HEAD requests on implicit directories (one additional LIST call)
  • Overall improvement in PyArrow compatibility outweighs minor performance cost

TODO

  • Add detailed benchmarking results comparing before/after fix
  • Document edge cases discovered during implementation
  • Add architectural diagrams showing the request flow
  • Document alternative solutions considered and why they were rejected
  • Add performance profiling data for child existence checks