From 1406a30cdd0559598f9ddb2ca613bb94dc5a320a Mon Sep 17 00:00:00 2001
From: Chris Lu
Date: Mon, 9 Feb 2026 23:37:52 -0800
Subject: [PATCH] docs: add stage-create support design and rollout plan

---
 weed/s3api/iceberg/STAGE_CREATE_DESIGN.md | 132 ++++++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 weed/s3api/iceberg/STAGE_CREATE_DESIGN.md

diff --git a/weed/s3api/iceberg/STAGE_CREATE_DESIGN.md b/weed/s3api/iceberg/STAGE_CREATE_DESIGN.md
new file mode 100644
index 000000000..d84a150fc
--- /dev/null
+++ b/weed/s3api/iceberg/STAGE_CREATE_DESIGN.md
@@ -0,0 +1,132 @@
+# Iceberg Stage-Create Support Design
+
+## Problem
+`stage-create=true` currently cannot be fulfilled safely with the existing create path because table registration and metadata-file persistence are coupled. The prior behavior risked partial state (catalog entry without expected metadata file).
+
+## Goals
+- Implement Iceberg-compatible staged table creation semantics.
+- Avoid partial table state on failures.
+- Keep conflict behavior consistent with current commit path (`409 CommitFailedException`).
+- Preserve backward compatibility for non-staged creates.
+
+## Non-Goals
+- Supporting all speculative client workflows in one step.
+- Adding cross-process transactional primitives in filer/S3Tables.
+
+## Proposed Semantics
+
+### 1) `POST /v1/.../namespaces/{ns}/tables` with `stage-create=true`
+- Validate request as normal create request.
+- Build initial metadata (`v1`) exactly like normal create.
+- Persist `v1.metadata.json` to table metadata directory.
+- Do **not** call `S3Tables.CreateTable`.
+- Return `200` with `LoadTableResult` (`metadata-location` = `.../metadata/v1.metadata.json`).
+
+This produces a staged metadata root that can be committed later, while avoiding catalog visibility before commit.
+
+### 2) `POST /v1/.../namespaces/{ns}/tables/{table}` commit path
+- Keep current logic for existing tables.
+- Add create-commit flow for missing table:
+  - If `GetTable` returns not found:
+    - Require at least one requirement of type `assert-create`.
+    - Build base metadata from staged `v1` (if present) or synthetic metadata using resolved table location.
+    - Apply updates and requirements.
+    - Write new metadata file (typically `v2` if staged `v1` exists, otherwise `v1`).
+    - Finalize with `S3Tables.CreateTable` (not `UpdateTable`).
+- Map `already exists` / conflict to `409 CommitFailedException`.
+- Best-effort cleanup of newly written metadata file on any create-finalization failure.
+
+## Data and State Model
+
+### Metadata files
+- Staged create writes `metadata/v1.metadata.json` at final table location.
+- Commit writes next version and updates catalog pointer by creating table entry.
+
+### Optional staged-intent marker (recommended)
+- Store small JSON marker under a dedicated internal prefix, e.g.:
+  - `/buckets//.iceberg_staged///.json`
+- Fields:
+  - `table_uuid`
+  - `location`
+  - `created_at`
+  - `expires_at`
+- Purpose:
+  - Better observability and cleanup of abandoned staged creates.
+  - Diagnostics for duplicate staged attempts.
+
+If marker is skipped in v1 implementation, staged metadata file alone is still sufficient for functional behavior.
+
+## API and Error Contract
+
+### Create with stage-create
+- Success: `200` with metadata payload.
+- Invalid request: `400 BadRequestException`.
+- Metadata file persistence failure: `500 InternalServerError`.
+
+### Commit finalize (table missing + assert-create)
+- Requirement failure: `409 CommitFailedException`.
+- Concurrent create detected (`CreateTable` conflict/already exists): `409 CommitFailedException`.
+- Other backend failures: `500 InternalServerError` (after best-effort metadata cleanup).
+
+## Concurrency and Failure Handling
+- Finalization uses `CreateTable` as the single catalog-visibility gate.
+- On conflict, clean up newly written metadata file best-effort.
+- Add bounded retry only when appropriate for conflict races.
+- Preserve existing deterministic UUID behavior for retries within one request.
+
+## Security and Authorization
+- Stage-create request uses same auth checks as current create path.
+- Finalize commit uses same auth checks as create/update table operations.
+- No staged state should bypass policy checks.
+
+## Implementation Plan
+
+### Phase 1: Functional staged create
+1. Remove `NotImplemented` rejection for `stage-create=true`.
+2. In `handleCreateTable`:
+   - write metadata file,
+   - skip `CreateTable`,
+   - return `LoadTableResult`.
+3. In `handleUpdateTable`:
+   - support not-found + `assert-create` as create-finalization flow using `CreateTable`.
+4. Add cleanup logic on finalize failures.
+
+### Phase 2: Hardening
+1. Add optional staged-intent marker + TTL janitor.
+2. Add metrics/counters:
+   - staged create count
+   - staged finalize success/failure
+   - staged cleanup failures
+
+### Phase 3: Interop polish
+1. Validate behavior against Spark/Flink/Trino create transaction flows.
+2. Document client expectations around staged metadata retention.
+
+## Test Plan
+
+### Unit tests
+- `handleCreateTable` with `stage-create=true`:
+  - returns success payload,
+  - does not invoke `CreateTable`,
+  - writes `v1.metadata.json`.
+- Commit with missing table + `assert-create`:
+  - creates catalog entry via `CreateTable`.
+- Commit without `assert-create` on missing table:
+  - returns `404` or `409` (choose one and lock behavior).
+- Conflict on finalize:
+  - returns `409 CommitFailedException`,
+  - cleanup attempted.
+
+### Integration tests
+- End-to-end staged create -> commit finalize -> load table.
+- Concurrent finalize race from two clients: exactly one success.
+- Abandoned staged create cleanup behavior (if marker/TTL implemented).
+
+## Rollout
+- Gate behind a feature flag initially, e.g. `ICEBERG_ENABLE_STAGE_CREATE=true`.
+- Default OFF for one release cycle if risk-sensitive; otherwise ON with clear release note.
+
+## Open Questions
+- Should commit-on-missing-table without `assert-create` be `404` or `409`?
+- Do we require staged-intent marker in phase 1, or defer to phase 2?
+- Should staged metadata be deleted automatically if never finalized within TTL?
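
Reviewer note: the create-finalization branch of the commit path (table missing + `assert-create`) could be sketched roughly as below. This is a minimal illustration, not the code this patch will produce. The `Catalog` and `MetadataStore` interfaces, the sentinel errors, the in-memory fakes, and `finalizeCreate` itself are hypothetical names invented for the sketch. The status for a missing `assert-create` is a placeholder, since the design leaves `404` vs `409` open.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel error; the real S3Tables client may signal conflicts differently.
var ErrAlreadyExists = errors.New("table already exists")

// Placeholder only: the design leaves 404 vs 409 as an open question.
const statusMissingAssertCreate = http.StatusConflict

// Catalog is an assumed, minimal view of the CreateTable operation named in the design.
type Catalog interface {
	CreateTable(name, metadataLocation string) error
}

// MetadataStore is an assumed abstraction over metadata-file persistence.
type MetadataStore interface {
	Exists(path string) bool
	Write(path string) error
	Delete(path string) error
}

func hasAssertCreate(requirements []string) bool {
	for _, r := range requirements {
		if r == "assert-create" {
			return true
		}
	}
	return false
}

// finalizeCreate sketches the flow once GetTable has already returned not found.
// It returns the HTTP status that the error contract in the design calls for.
func finalizeCreate(cat Catalog, store MetadataStore, table, location string, requirements []string) int {
	if !hasAssertCreate(requirements) {
		return statusMissingAssertCreate
	}
	// Typically v2 if a staged v1 exists, otherwise v1.
	next := location + "/metadata/v1.metadata.json"
	if store.Exists(next) {
		next = location + "/metadata/v2.metadata.json"
	}
	if err := store.Write(next); err != nil {
		return http.StatusInternalServerError
	}
	// CreateTable is the single catalog-visibility gate.
	if err := cat.CreateTable(table, next); err != nil {
		// Best-effort cleanup; assumes this attempt owns the file it just wrote.
		_ = store.Delete(next)
		if errors.Is(err, ErrAlreadyExists) {
			return http.StatusConflict // 409 CommitFailedException
		}
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

// In-memory fakes, used only to exercise the sketch.
type memCatalog struct{ tables map[string]string }

func (c *memCatalog) CreateTable(name, loc string) error {
	if _, ok := c.tables[name]; ok {
		return ErrAlreadyExists
	}
	c.tables[name] = loc
	return nil
}

type memStore struct{ files map[string]bool }

func (s *memStore) Exists(p string) bool  { return s.files[p] }
func (s *memStore) Write(p string) error  { s.files[p] = true; return nil }
func (s *memStore) Delete(p string) error { delete(s.files, p); return nil }

func main() {
	cat := &memCatalog{tables: map[string]string{}}
	staged := &memStore{files: map[string]bool{"loc/metadata/v1.metadata.json": true}}
	reqs := []string{"assert-create"}

	fmt.Println(finalizeCreate(cat, staged, "ns.t", "loc", reqs)) // 200: staged v1 present, v2 written

	other := &memStore{files: map[string]bool{}}
	fmt.Println(finalizeCreate(cat, other, "ns.t", "loc", reqs)) // 409: table already registered
}
```

The sketch writes the metadata file before calling `CreateTable`, so a failed finalize leaves at most an orphaned file rather than a catalog entry without metadata, which is the partial state the design sets out to avoid.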