* s3tables: add Iceberg file layout validation for table buckets
This PR adds file layout validation for table buckets to enforce Apache
Iceberg table structure. Files uploaded to table buckets must conform
to the expected Iceberg layout:
- metadata/ directory: contains metadata files (*.json, *.avro)
  - v*.metadata.json (table metadata)
  - snap-*.avro (snapshot manifests)
  - *-m*.avro (manifest files)
  - version-hint.text
- data/ directory: contains data files (*.parquet, *.orc, *.avro)
  - Supports partition paths (e.g., year=2024/month=01/)
  - Supports bucket subdirectories
The validator exports functions for use by the S3 API:
- IsTableBucketPath: checks if a path is under /table-buckets/
- GetTableInfoFromPath: extracts bucket/namespace/table from path
- ValidateTableBucketUpload: validates file layout for table bucket uploads
- ValidateTableBucketUploadWithClient: validates with filer client access
Invalid uploads receive an InvalidIcebergLayout error response.
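For reviewers, the layout rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: checkLayout, ErrInvalidIcebergLayout, and the exact regular expressions are hypothetical names and patterns chosen to mirror the bullet list, and the sketch only treats a path as a directory when it ends with a slash.

```go
package main

import (
	"errors"
	"regexp"
	"strings"
)

// ErrInvalidIcebergLayout is a hypothetical sentinel standing in for the
// InvalidIcebergLayout error response.
var ErrInvalidIcebergLayout = errors.New("InvalidIcebergLayout")

// Patterns are compiled once at package level. They are assumptions that
// approximate the file names listed in the commit message.
var (
	metadataFilePatterns = []*regexp.Regexp{
		regexp.MustCompile(`^v\d+\.metadata\.json$`), // table metadata
		regexp.MustCompile(`^snap-.+\.avro$`),        // snapshot manifests
		regexp.MustCompile(`^.+-m\d+\.avro$`),        // manifest files
		regexp.MustCompile(`^version-hint\.text$`),
	}
	dataFilePattern = regexp.MustCompile(`^.+\.(parquet|orc|avro)$`)
)

// checkLayout validates a path relative to the table root.
func checkLayout(rel string) error {
	switch {
	case strings.HasPrefix(rel, "metadata/"):
		name := strings.TrimPrefix(rel, "metadata/")
		if name == "" {
			return nil // the metadata/ directory marker itself
		}
		if strings.Contains(name, "/") {
			return ErrInvalidIcebergLayout // metadata/ is kept flat
		}
		for _, p := range metadataFilePatterns {
			if p.MatchString(name) {
				return nil
			}
		}
		return ErrInvalidIcebergLayout
	case strings.HasPrefix(rel, "data/"):
		segs := strings.Split(strings.TrimPrefix(rel, "data/"), "/")
		for _, dir := range segs[:len(segs)-1] {
			if dir == "" {
				return ErrInvalidIcebergLayout // empty segment, e.g. data//x
			}
		}
		name := segs[len(segs)-1]
		if name == "" || dataFilePattern.MatchString(name) {
			return nil // directory marker or a recognised data file
		}
		return ErrInvalidIcebergLayout
	}
	return ErrInvalidIcebergLayout
}
```

Under these assumptions, checkLayout("data/year=2024/month=01/part-0.parquet") returns nil, while checkLayout("metadata/notes.txt") returns ErrInvalidIcebergLayout.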
* Address review comments: regex performance, error handling, stricter patterns
* Fix validateMetadataFile and validateDataFile to handle subdirectories and directory creation
* Fix error handling, metadata validation, reduce code duplication
* Fix empty remainingPath handling for directory paths
* Refactor: unify validateMetadataFile and validateDataFile
* Refactor: extract UUID pattern constant
* fix: allow Iceberg partition and directory paths without trailing slashes
Modified validateFile to correctly handle directory paths that do not end with a trailing slash.
This ensures that paths like 'data/year=2024' are validated as directories if they match
partition or subdirectory patterns, rather than being incorrectly rejected as invalid files.
Added comprehensive test cases for various directory and partition path combinations.
* refactor: use standard path package and idiomatic returns
Simplified directory and filename extraction in validateFile by using the standard
path package (aliased as pathpkg). This improves readability and avoids manual string
manipulation. Also updated GetTableInfoFromPath to use naked returns for named
return values, aligning with Go conventions for short functions.
* feat: enforce strict Iceberg top-level directories and metadata restrictions
Implemented strict validation for Iceberg layout:
- Bare top-level keys like 'metadata' and 'data' are now rejected; they must have
a trailing slash or a subpath.
- Subdirectories under 'metadata/' are now prohibited to enforce the flat structure
required by Iceberg.
- Updated the test suite with negative test cases and ensured proper formatting.
* feat: allow table root directory markers in ValidateTableBucketUpload
Modified ValidateTableBucketUpload to short-circuit and return nil when the
relative path within a table is empty. This occurs when a trailing slash is
used on the table directory (e.g., /table-buckets/mybucket/myns/mytable/).
Added a test case 'table dir with slash' to verify this behavior.
* test: add regression cases for metadata subdirs and table markers
Enforced a strictly flat structure for the metadata directory by removing the
"directory without trailing slash" fallback in validateFile for metadata.
Added regression test cases:
- metadata/nested (must fail)
- /table-buckets/.../mytable/ (must pass)
Verified all tests pass.
* feat: reject double slashes in Iceberg table paths
Modified validateDirectoryPath to return an error when encountering empty path
segments, effectively rejecting double slashes like 'data//file.parquet'.
Updated validateFile to use manual path splitting instead of the 'path' package
for intermediate directories to ensure redundant slashes are not auto-cleaned
before validation. Added regression tests for various double slash scenarios.
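The reason for avoiding the 'path' package here can be shown with a small sketch. path.Dir and path.Clean collapse repeated slashes, so a check that runs on their output never sees the empty segment; splitting on '/' keeps it visible. The helper names hasEmptySegment and dirOf are hypothetical and are not taken from iceberg_layout.go.

```go
package main

import (
	"path"
	"strings"
)

// hasEmptySegment reports whether a relative path contains an empty segment.
// A single trailing slash is treated as a directory marker and ignored.
// Note that an empty input also yields true in this sketch.
func hasEmptySegment(p string) bool {
	p = strings.TrimSuffix(p, "/")
	for _, seg := range strings.Split(p, "/") {
		if seg == "" {
			return true
		}
	}
	return false
}

// dirOf wraps path.Dir to show that the standard package cleans its result,
// which would hide a double slash from any later check.
func dirOf(p string) string {
	return path.Dir(p)
}
```

With these helpers, dirOf("data//file.parquet") yields "data", whereas hasEmptySegment("data//file.parquet") still reports the redundant slash.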
* refactor: separate isMetadata logic in validateDirectoryPath
Following reviewer feedback, refactored validateDirectoryPath to explicitly
separate the handling of metadata and data paths. This improves readability
and clarifies the function's intent while maintaining the strict validation rules
and double-slash rejection previously implemented.
* feat: validate bucket, namespace, and table path segments
Updated ValidateTableBucketUpload to ensure that bucket, namespace, and table
segments in the path are non-empty. This prevents invalid paths like
'/table-buckets//myns/mytable/...' from being accepted during upload.
Added regression tests for various empty segment scenarios.
* Update weed/s3api/s3tables/iceberg_layout.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* feat: block double-slash bypass in table relative paths
Added a guard in ValidateTableBucketUpload to reject tableRelativePath if it
starts with a '/' or contains '//'. This ensures that paths like
'/table-buckets/b/ns/t//data/file.parquet' are properly rejected and cannot
bypass the layout validation. Added regression tests to verify.
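The guard amounts to something like the sketch below. It assumes that, once bucket, namespace, and table have been stripped from '/table-buckets/b/ns/t//data/file.parquet', the remaining relative path is '/data/file.parquet'. The name checkTableRelativePath and the error value are hypothetical.

```go
package main

import (
	"errors"
	"strings"
)

// errBadRelativePath is a hypothetical error for this sketch only.
var errBadRelativePath = errors.New("invalid table relative path")

// checkTableRelativePath rejects relative paths that begin with a slash or
// contain an empty segment, before any layout rules are applied.
func checkTableRelativePath(rel string) error {
	if strings.HasPrefix(rel, "/") || strings.Contains(rel, "//") {
		return errBadRelativePath
	}
	return nil
}
```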
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1. go get aqwari.net/xml/cmd/xsdgen
2. Add EncodingType element for ListBucketResult in AmazonS3.xsd
3. xsdgen -o s3api_xsd_generated.go -pkg s3api AmazonS3.xsd
4. Remove empty Grantee struct in s3api_xsd_generated.go
5. Remove xmlns: sed s'/http:\/\/s3.amazonaws.com\/doc\/2006-03-01\/\ //' s3api_xsd_generated.go