1 changed files with 132 additions and 0 deletions
@ -0,0 +1,132 @@ |
|||||
|
# Iceberg Stage-Create Support Design |
||||
|
|
||||
|
## Problem |
||||
|
`stage-create=true` currently cannot be fulfilled safely with the existing create path because table registration and metadata-file persistence are coupled. The prior behavior risked partial state (catalog entry without expected metadata file). |
||||
|
|
||||
|
## Goals |
||||
|
- Implement Iceberg-compatible staged table creation semantics. |
||||
|
- Avoid partial table state on failures. |
||||
|
- Keep conflict behavior consistent with current commit path (`409 CommitFailedException`). |
||||
|
- Preserve backward compatibility for non-staged creates. |
||||
|
|
||||
|
## Non-Goals |
||||
|
- Supporting all speculative client workflows in one step. |
||||
|
- Adding cross-process transactional primitives in filer/S3Tables. |
||||
|
|
||||
|
## Proposed Semantics |
||||
|
|
||||
|
### 1) `POST /v1/.../namespaces/{ns}/tables` with `stage-create=true` |
||||
|
- Validate request as normal create request. |
||||
|
- Build initial metadata (`v1`) exactly like normal create. |
||||
|
- Persist `v1.metadata.json` to table metadata directory. |
||||
|
- Do **not** call `S3Tables.CreateTable`. |
||||
|
- Return `200` with `LoadTableResult` (`metadata-location` = `.../metadata/v1.metadata.json`). |
||||
|
|
||||
|
This produces a staged metadata root that can be committed later, while avoiding catalog visibility before commit. |
||||
|
|
||||
|
### 2) `POST /v1/.../namespaces/{ns}/tables/{table}` commit path |
||||
|
- Keep current logic for existing tables. |
||||
|
- Add create-commit flow for missing table: |
||||
|
- If `GetTable` returns not found: |
||||
|
- Require at least one requirement of type `assert-create`. |
||||
|
- Build base metadata from staged `v1` (if present) or synthetic metadata using resolved table location. |
||||
|
- Apply updates and requirements. |
||||
|
- Write new metadata file (typically `v2` if staged `v1` exists, otherwise `v1`). |
||||
|
- Finalize with `S3Tables.CreateTable` (not `UpdateTable`). |
||||
|
- Map `already exists` / conflict to `409 CommitFailedException`. |
||||
|
- Best-effort cleanup of newly written metadata file on any create-finalization failure. |
||||
|
|
||||
|
## Data and State Model |
||||
|
|
||||
|
### Metadata files |
||||
|
- Staged create writes `metadata/v1.metadata.json` at final table location. |
||||
|
- Commit writes next version and updates catalog pointer by creating table entry. |
||||
|
|
||||
|
### Optional staged-intent marker (recommended) |
||||
|
- Store small JSON marker under a dedicated internal prefix, e.g.: |
||||
|
- `/buckets/<bucket>/.iceberg_staged/<namespace>/<table>/<uuid>.json` |
||||
|
- Fields: |
||||
|
- `table_uuid` |
||||
|
- `location` |
||||
|
- `created_at` |
||||
|
- `expires_at` |
||||
|
- Purpose: |
||||
|
- Better observability and cleanup of abandoned staged creates. |
||||
|
- Diagnostics for duplicate staged attempts. |
||||
|
|
||||
|
If marker is skipped in v1 implementation, staged metadata file alone is still sufficient for functional behavior. |
||||
|
|
||||
|
## API and Error Contract |
||||
|
|
||||
|
### Create with stage-create |
||||
|
- Success: `200` with metadata payload. |
||||
|
- Invalid request: `400 BadRequestException`. |
||||
|
- Metadata file persistence failure: `500 InternalServerError`. |
||||
|
|
||||
|
### Commit finalize (table missing + assert-create) |
||||
|
- Requirement failure: `409 CommitFailedException`. |
||||
|
- Concurrent create detected (`CreateTable` conflict/already exists): `409 CommitFailedException`. |
||||
|
- Other backend failures: `500 InternalServerError` (after best-effort metadata cleanup). |
||||
|
|
||||
|
## Concurrency and Failure Handling |
||||
|
- Finalization uses `CreateTable` as the single catalog-visibility gate. |
||||
|
- On conflict, clean up newly written metadata file best-effort. |
||||
|
- Add bounded retry only when appropriate for conflict races. |
||||
|
- Preserve existing deterministic UUID behavior for retries within one request. |
||||
|
|
||||
|
## Security and Authorization |
||||
|
- Stage-create request uses same auth checks as current create path. |
||||
|
- Finalize commit uses same auth checks as create/update table operations. |
||||
|
- No staged state should bypass policy checks. |
||||
|
|
||||
|
## Implementation Plan |
||||
|
|
||||
|
### Phase 1: Functional staged create |
||||
|
1. Remove `NotImplemented` rejection for `stage-create=true`. |
||||
|
2. In `handleCreateTable`: |
||||
|
- write metadata file, |
||||
|
- skip `CreateTable`, |
||||
|
- return `LoadTableResult`. |
||||
|
3. In `handleUpdateTable`: |
||||
|
- support not-found + `assert-create` as create-finalization flow using `CreateTable`. |
||||
|
4. Add cleanup logic on finalize failures. |
||||
|
|
||||
|
### Phase 2: Hardening |
||||
|
1. Add optional staged-intent marker + TTL janitor. |
||||
|
2. Add metrics/counters: |
||||
|
- staged create count |
||||
|
- staged finalize success/failure |
||||
|
- staged cleanup failures |
||||
|
|
||||
|
### Phase 3: Interop polish |
||||
|
1. Validate behavior against Spark/Flink/Trino create transaction flows. |
||||
|
2. Document client expectations around staged metadata retention. |
||||
|
|
||||
|
## Test Plan |
||||
|
|
||||
|
### Unit tests |
||||
|
- `handleCreateTable` with `stage-create=true`: |
||||
|
- returns success payload, |
||||
|
- does not invoke `CreateTable`, |
||||
|
- writes `v1.metadata.json`. |
||||
|
- Commit with missing table + `assert-create`: |
||||
|
- creates catalog entry via `CreateTable`. |
||||
|
- Commit without `assert-create` on missing table: |
||||
|
- returns `404` or `409` (choose one and lock behavior). |
||||
|
- Conflict on finalize: |
||||
|
- returns `409 CommitFailedException`, |
||||
|
- cleanup attempted. |
||||
|
|
||||
|
### Integration tests |
||||
|
- End-to-end staged create -> commit finalize -> load table. |
||||
|
- Concurrent finalize race from two clients: exactly one success. |
||||
|
- Abandoned staged create cleanup behavior (if marker/TTL implemented). |
||||
|
|
||||
|
## Rollout |
||||
|
- Gate behind a feature flag initially, e.g. `ICEBERG_ENABLE_STAGE_CREATE=true`. |
||||
|
- Default OFF for one release cycle if risk-sensitive; otherwise ON with clear release note. |
||||
|
|
||||
|
## Open Questions |
||||
|
- Should commit-on-missing-table without `assert-create` be `404` or `409`? |
||||
|
- Do we require staged-intent marker in phase 1, or defer to phase 2? |
||||
|
- Should staged metadata be deleted automatically if never finalized within TTL? |
||||
Write
Preview
Loading…
Cancel
Save
Reference in new issue