3 changed files with 0 additions and 800 deletions
@@ -1,242 +0,0 @@
# Bucket Policy Engine Integration - Complete

## Summary

Successfully integrated the `policy_engine` package to evaluate bucket policies for **all requests** (both anonymous and authenticated). This provides comprehensive AWS S3-compatible bucket policy support.

## What Changed

### 1. **New File: `s3api_bucket_policy_engine.go`**
Created a wrapper around `policy_engine.PolicyEngine` (sketched below) to:
- Load bucket policies from filer entries
- Sync policies from the bucket config cache
- Evaluate policies for any request (bucket, object, action, principal)
- Return structured results (allowed, evaluated, error)
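
The shape of that wrapper is roughly the following. This is a minimal sketch: the type and method names (`BucketPolicyEngine`, `EvaluatePolicy`, `PolicyEvaluationResult`) and the small interface standing in for `policy_engine.PolicyEngine` are assumptions, not the actual API in `s3api_bucket_policy_engine.go`.

```go
package s3api

// compiledPolicyEvaluator stands in for policy_engine.PolicyEngine in this sketch.
type compiledPolicyEvaluator interface {
    EvaluateForBucket(bucket, object, action, principal string) (allowed bool, matched bool, err error)
}

// BucketPolicyEngine wraps the compiled policy engine and keeps one compiled
// policy per bucket, synced from the bucket config cache.
type BucketPolicyEngine struct {
    engine compiledPolicyEvaluator
}

// PolicyEvaluationResult reports whether a policy produced a decision and what it was.
type PolicyEvaluationResult struct {
    Allowed   bool  // an explicit Allow statement matched
    Evaluated bool  // a policy existed and a statement matched
    Err       error // load/parse/evaluation failure (treated as deny)
}

// EvaluatePolicy answers "does the bucket policy say anything about this request?"
// for any combination of bucket, object key, S3 action, and principal ARN.
func (e *BucketPolicyEngine) EvaluatePolicy(bucket, object, action, principal string) PolicyEvaluationResult {
    allowed, matched, err := e.engine.EvaluateForBucket(bucket, object, action, principal)
    return PolicyEvaluationResult{Allowed: allowed, Evaluated: matched, Err: err}
}
```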

### 2. **Modified: `s3api_server.go`**
- Added `policyEngine *BucketPolicyEngine` field to `S3ApiServer` struct
- Initialized the policy engine in `NewS3ApiServerWithStore()`
- Linked `IdentityAccessManagement` back to `S3ApiServer` for policy evaluation

### 3. **Modified: `auth_credentials.go`**
- Added `s3ApiServer *S3ApiServer` field to `IdentityAccessManagement` struct
- Added `buildPrincipalARN()` helper to convert identities to AWS ARN format
- **Integrated bucket policy evaluation into the authentication flow:**
  - Policies are now checked **before** IAM/identity-based permissions
  - Explicit `Deny` in bucket policy blocks access immediately
  - Explicit `Allow` in bucket policy grants access and **bypasses IAM checks** (enables cross-account access)
  - If no policy exists, the request falls through to normal IAM checks
  - Policy evaluation errors result in access denial (fail-closed security)

### 4. **Modified: `s3api_bucket_config.go`**
- Added policy engine sync when bucket configs are loaded
- Ensures policies are loaded into the engine for evaluation

### 5. **Modified: `auth_credentials_subscribe.go`**
- Added policy engine sync when bucket metadata changes
- Keeps the policy engine up to date via event-driven updates (this path and the config-load path above are sketched below)
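
Both sync points funnel into the same idea: whenever a bucket's configuration (including its policy JSON, if any) is loaded or changes, push it into the engine. A minimal sketch; `policySyncer`, `SyncBucketPolicy`, and `RemoveBucketPolicy` are hypothetical names used only to illustrate the hook, not the real functions.

```go
package s3api

// policySyncer is the subset of the policy engine the sync hook needs.
type policySyncer interface {
    SyncBucketPolicy(bucket, policyJSON string) error
    RemoveBucketPolicy(bucket string)
}

// syncPolicyFromConfig is called from both places described above: when a
// bucket config is first loaded, and when a metadata-change event for the
// bucket arrives via the subscription.
func syncPolicyFromConfig(engine policySyncer, bucket, policyJSON string) error {
    if policyJSON == "" {
        // No policy on the bucket: make sure the engine forgets any stale one.
        engine.RemoveBucketPolicy(bucket)
        return nil
    }
    return engine.SyncBucketPolicy(bucket, policyJSON)
}
```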

## How It Works

### Anonymous Requests
```
1. Request comes in (no credentials)
2. Check ACL-based public access → if public, allow
3. Check bucket policy for anonymous ("*") access → if allowed, allow
4. Otherwise, deny
```

### Authenticated Requests (NEW!)
```
1. Request comes in (with credentials)
2. Authenticate user → get Identity
3. Build principal ARN (e.g., "arn:aws:iam::123456:user/bob")
4. Check bucket policy:
   - If DENY → reject immediately
   - If ALLOW → grant access immediately (bypasses IAM checks)
   - If no policy or no matching statements → continue to step 5
5. Check IAM/identity-based permissions (only if not already allowed by bucket policy)
6. Allow or deny based on identity permissions
```
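
In code form, this ordering is essentially a three-way switch on the policy result before the existing identity check runs. A condensed sketch; the real logic lives inside `IdentityAccessManagement` in `auth_credentials.go`, and the names below (`authorizeRequest`, `policyDecision`) are illustrative, not the actual identifiers.

```go
package s3api

type policyDecision int

const (
    policyDeny       policyDecision = iota // an explicit Deny statement matched
    policyAllow                            // an explicit Allow statement matched
    policyNoDecision                       // no policy, or no matching statement
)

// buildPrincipalARN mirrors the helper described above: anonymous requests
// evaluate as "*", authenticated ones as an IAM user ARN.
func buildPrincipalARN(accountID, userName string) string {
    if userName == "" {
        return "*"
    }
    return "arn:aws:iam::" + accountID + ":user/" + userName
}

// authorizeRequest applies the ordering from the list above: bucket policy
// first (deny wins, allow short-circuits IAM), then identity-based checks.
func authorizeRequest(policyResult policyDecision, policyErr error, iamAllows bool) bool {
    if policyErr != nil {
        return false // fail closed on evaluation errors
    }
    switch policyResult {
    case policyDeny:
        return false // explicit Deny always rejects
    case policyAllow:
        return true // explicit Allow bypasses the IAM check
    default:
        return iamAllows // fall through to identity-based permissions
    }
}
```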

## Policy Evaluation Flow

```
┌─────────────────────────────────────────────────────────┐
│ Request (GET /bucket/file)                              │
└───────────────────────────┬─────────────────────────────┘
                            │
                ┌───────────▼──────────┐
                │  Authenticate User   │
                │   (or Anonymous)     │
                └───────────┬──────────┘
                            │
                ┌───────────▼──────────────────────────────┐
                │ Build Principal ARN                      │
                │  - Anonymous: "*"                        │
                │  - User: "arn:aws:iam::123456:user/bob"  │
                └───────────┬──────────────────────────────┘
                            │
                ┌───────────▼──────────────────────────────┐
                │ Evaluate Bucket Policy (PolicyEngine)    │
                │  - Action: "s3:GetObject"                │
                │  - Resource: "arn:aws:s3:::bucket/file"  │
                │  - Principal: (from above)               │
                └───────────┬──────────────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
         DENY │       ALLOW │             │ NO POLICY
              │             │             │
              ▼             ▼             ▼
       Reject Request  Grant Access    Continue
                                          │
                      ┌───────────────────┘
                      │
         ┌────────────▼─────────────┐
         │   IAM/Identity Check     │
         │    (identity.canDo)      │
         └────────────┬─────────────┘
                      │
            ┌─────────┴─────────┐
            │                   │
      ALLOW │              DENY │
            ▼                   ▼
      Grant Access        Reject Request
```

## Example Policies That Now Work

### 1. **Public Read Access** (Anonymous)
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::mybucket/*"
  }]
}
```
- Anonymous users can read all objects
- Authenticated users are also evaluated against this policy. If they don't match an explicit `Allow` for this action, they will fall back to their own IAM permissions

### 2. **Grant Access to Specific User** (Authenticated)
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:user/bob"},
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::mybucket/shared/*"
  }]
}
```
- User "bob" can read/write objects under the `shared/` prefix
- Other users cannot (unless granted by their IAM policies)

### 3. **Deny Access to Specific Path** (Both)
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::mybucket/confidential/*"
  }]
}
```
- **No one** can access objects under the `confidential/` prefix
- Denies override all other allows (AWS policy evaluation rules)

## Performance Characteristics

### Policy Loading
- **Cold start**: Policy loaded from filer → parsed → compiled → cached
- **Warm path**: Policy retrieved from `BucketConfigCache` (already parsed)
- **Updates**: Event-driven sync via metadata subscription (real-time)

### Policy Evaluation
- **Compiled policies**: Pre-compiled regex patterns and matchers
- **Pattern cache**: Regex patterns cached with LRU eviction (max 1000 entries)
- **Fast path**: Common patterns (`*`, exact matches) optimized
- **Case sensitivity**: Actions case-insensitive, resources case-sensitive (AWS-compatible)

### Overhead
- **Anonymous requests**: Minimal (policy already checked, now using compiled engine)
- **Authenticated requests**: ~1-2ms added for policy evaluation (compiled patterns)
- **No policy**: Near-zero overhead (quick indeterminate check)

## Testing

All tests pass:
```
✅ TestBucketPolicyValidationBasics
✅ TestPrincipalMatchesAnonymous
✅ TestActionToS3Action
✅ TestResourceMatching
✅ TestMatchesPatternRegexEscaping (security tests)
✅ TestActionMatchingCaseInsensitive
✅ TestResourceMatchingCaseSensitive
✅ All policy_engine package tests (30+ tests)
```

## Security Improvements

1. **Regex Metacharacter Escaping**: Patterns like `*.json` properly match only files ending in `.json` (not `filexjson`)
2. **Case-Insensitive Actions**: S3 actions matched case-insensitively per AWS spec
3. **Case-Sensitive Resources**: Resource paths matched case-sensitively for security
4. **Pattern Cache Size Limit**: Prevents DoS attacks via unbounded cache growth
5. **Principal Validation**: Supports `[]string` for manually constructed policies

## AWS Compatibility

The implementation follows AWS S3 bucket policy evaluation rules:
1. **Explicit Deny** always wins (checked first)
2. **Explicit Allow** grants access (checked second)
3. **Default Deny** if no matching statements (implicit)
4. Bucket policies work alongside IAM policies (both are evaluated)

## Files Changed

```
Modified:
  weed/s3api/auth_credentials.go            (+47 lines)
  weed/s3api/auth_credentials_subscribe.go  (+8 lines)
  weed/s3api/s3api_bucket_config.go         (+8 lines)
  weed/s3api/s3api_server.go                (+5 lines)

New:
  weed/s3api/s3api_bucket_policy_engine.go  (115 lines)
```

## Migration Notes

- **Backward Compatible**: Existing setups without bucket policies work unchanged
- **No Breaking Changes**: All existing ACL and IAM-based authorization still works
- **Additive Feature**: Bucket policies are an additional layer of authorization
- **Performance**: Minimal impact on existing workloads

## Future Enhancements

Potential improvements (not implemented yet):
- [ ] Condition support (IP address, time-based, etc.) - already in policy_engine
- [ ] Cross-account policies (different AWS accounts)
- [ ] Policy validation API endpoint
- [ ] Policy simulation/testing tool
- [ ] Metrics for policy evaluations (allow/deny counts)

## Conclusion

Bucket policies now work for **all requests** in the SeaweedFS S3 API:
- ✅ Anonymous requests (public access)
- ✅ Authenticated requests (user-specific policies)
- ✅ High performance (compiled policies, caching)
- ✅ AWS-compatible (follows AWS evaluation rules)
- ✅ Secure (proper escaping, case sensitivity)

The integration is complete, tested, and ready for use!

@@ -1,413 +0,0 @@
# SeaweedFS Task Distribution System Design

## Overview

This document describes the design of a distributed task management system for SeaweedFS that handles Erasure Coding (EC) and vacuum operations through a scalable admin server and worker process architecture.

## System Architecture

### High-Level Components

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Master      │◄──►│   Admin Server   │◄──►│     Workers     │
│                 │    │                  │    │                 │
│ - Volume Info   │    │ - Task Discovery │    │ - Task Exec     │
│ - Shard Status  │    │ - Task Assign    │    │ - Progress      │
│ - Heartbeats    │    │ - Progress Track │    │ - Error Report  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         │                      │                       │
         ▼                      ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Volume Servers  │    │  Volume Monitor  │    │ Task Execution  │
│                 │    │                  │    │                 │
│ - Store Volumes │    │ - Health Check   │    │ - EC Convert    │
│ - EC Shards     │    │ - Usage Stats    │    │ - Vacuum Clean  │
│ - Report Status │    │ - State Sync     │    │ - Status Report │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

## 1. Admin Server Design

### 1.1 Core Responsibilities

- **Task Discovery**: Scan volumes to identify EC and vacuum candidates
- **Worker Management**: Track available workers and their capabilities
- **Task Assignment**: Match tasks to optimal workers
- **Progress Tracking**: Monitor in-progress tasks for capacity planning
- **State Reconciliation**: Sync with master server for volume state updates

### 1.2 Task Discovery Engine

```go
type TaskDiscoveryEngine struct {
    masterClient  MasterClient
    volumeScanner VolumeScanner
    taskDetectors map[TaskType]TaskDetector
    scanInterval  time.Duration
}

type VolumeCandidate struct {
    VolumeID   uint32
    Server     string
    Collection string
    TaskType   TaskType
    Priority   TaskPriority
    Reason     string
    DetectedAt time.Time
    Parameters map[string]interface{}
}
```

**EC Detection Logic**:
- Find volumes >= 95% full and idle for > 1 hour
- Exclude volumes already in EC format
- Exclude volumes with ongoing operations
- Prioritize by collection and age

**Vacuum Detection Logic**:
- Find volumes with garbage ratio > 30%
- Exclude read-only volumes
- Exclude volumes with recent vacuum operations
- Prioritize by garbage percentage (both detection rule sets are sketched below)
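
A sketch of how these two detectors could encode the thresholds above. The constants come straight from the bullet points; the `VolumeInfo` fields and function names are assumptions made for illustration, in the same fragment style as the other snippets in this document.

```go
// Sketch only: field and function names are illustrative.
type VolumeInfo struct {
    FullnessRatio  float64       // used bytes / volume size limit
    IdleFor        time.Duration // time since the last write
    GarbageRatio   float64       // deleted bytes / used bytes
    IsECEncoded    bool
    IsReadOnly     bool
    HasActiveTask  bool
    LastVacuumedAt time.Time
}

// needsErasureCoding applies the EC detection rules: >= 95% full,
// idle for more than an hour, not already EC encoded, no ongoing operation.
func needsErasureCoding(v VolumeInfo) bool {
    return v.FullnessRatio >= 0.95 &&
        v.IdleFor > time.Hour &&
        !v.IsECEncoded &&
        !v.HasActiveTask
}

// needsVacuum applies the vacuum detection rules: garbage ratio above 30%,
// writable, not vacuumed recently, no ongoing operation.
func needsVacuum(v VolumeInfo, minInterval time.Duration) bool {
    return v.GarbageRatio > 0.30 &&
        !v.IsReadOnly &&
        !v.HasActiveTask &&
        time.Since(v.LastVacuumedAt) > minInterval
}
```

Candidates that pass these checks would then be wrapped into `VolumeCandidate` values and queued with the priorities described above.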

### 1.3 Worker Registry & Management

```go
type WorkerRegistry struct {
    workers        map[string]*Worker
    capabilities   map[TaskType][]*Worker
    lastHeartbeat  map[string]time.Time
    taskAssignment map[string]*Task
    mutex          sync.RWMutex
}

type Worker struct {
    ID            string
    Address       string
    Capabilities  []TaskType
    MaxConcurrent int
    CurrentLoad   int
    Status        WorkerStatus
    LastSeen      time.Time
    Performance   WorkerMetrics
}
```

### 1.4 Task Assignment Algorithm

```go
type TaskScheduler struct {
    registry           *WorkerRegistry
    taskQueue          *PriorityQueue
    inProgressTasks    map[string]*InProgressTask
    volumeReservations map[uint32]*VolumeReservation
}

// Worker Selection Criteria:
// 1. Has required capability (EC or Vacuum)
// 2. Available capacity (CurrentLoad < MaxConcurrent)
// 3. Best performance history for task type
// 4. Lowest current load
// 5. Geographically close to volume server (optional)
```
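
The criteria above translate into a filter-then-rank pass over the registered workers. A sketch in the same fragment style; `selectWorker` is an illustrative name, and it assumes `WorkerMetrics` exposes a per-task-type success rate.

```go
// Sketch only: applies the five selection criteria listed above.
// (Real code would hold registry.mutex for reading.)
func (s *TaskScheduler) selectWorker(taskType TaskType) *Worker {
    var best *Worker
    for _, w := range s.registry.capabilities[taskType] { // criterion 1: capability
        if w.CurrentLoad >= w.MaxConcurrent { // criterion 2: available capacity
            continue
        }
        if best == nil {
            best = w
            continue
        }
        // Criteria 3 and 4: prefer the better history for this task type,
        // break ties on the lower current load.
        // (Assumes WorkerMetrics has a SuccessRate(TaskType) float64 method.)
        if w.Performance.SuccessRate(taskType) > best.Performance.SuccessRate(taskType) ||
            (w.Performance.SuccessRate(taskType) == best.Performance.SuccessRate(taskType) &&
                w.CurrentLoad < best.CurrentLoad) {
            best = w
        }
        // Criterion 5 (locality to the volume server) would act as a final tie-breaker.
    }
    return best
}
```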

## 2. Worker Process Design

### 2.1 Worker Architecture

```go
type MaintenanceWorker struct {
    id              string
    config          *WorkerConfig
    adminClient     AdminClient
    taskExecutors   map[TaskType]TaskExecutor
    currentTasks    map[string]*RunningTask
    registry        *TaskRegistry
    heartbeatTicker *time.Ticker
    requestTicker   *time.Ticker
}
```

### 2.2 Task Execution Framework

```go
type TaskExecutor interface {
    Execute(ctx context.Context, task *Task) error
    EstimateTime(task *Task) time.Duration
    ValidateResources(task *Task) error
    GetProgress() float64
    Cancel() error
}

type ErasureCodingExecutor struct {
    volumeClient VolumeServerClient
    progress     float64
    cancelled    bool
}

type VacuumExecutor struct {
    volumeClient VolumeServerClient
    progress     float64
    cancelled    bool
}
```

### 2.3 Worker Capabilities & Registration

```go
type WorkerCapabilities struct {
    SupportedTasks   []TaskType
    MaxConcurrent    int
    ResourceLimits   ResourceLimits
    PreferredServers []string // Affinity for specific volume servers
}

type ResourceLimits struct {
    MaxMemoryMB    int64
    MaxDiskSpaceMB int64
    MaxNetworkMbps int64
    MaxCPUPercent  float64
}
```

## 3. Task Lifecycle Management

### 3.1 Task States

```go
type TaskState string

const (
    TaskStatePending    TaskState = "pending"
    TaskStateAssigned   TaskState = "assigned"
    TaskStateInProgress TaskState = "in_progress"
    TaskStateCompleted  TaskState = "completed"
    TaskStateFailed     TaskState = "failed"
    TaskStateCancelled  TaskState = "cancelled"
    TaskStateStuck      TaskState = "stuck"     // Taking too long
    TaskStateDuplicate  TaskState = "duplicate" // Detected duplicate
)
```

### 3.2 Progress Tracking & Monitoring

```go
type InProgressTask struct {
    Task           *Task
    WorkerID       string
    StartedAt      time.Time
    LastUpdate     time.Time
    Progress       float64
    EstimatedEnd   time.Time
    VolumeReserved bool // Reserved for capacity planning
}

type TaskMonitor struct {
    inProgressTasks  map[string]*InProgressTask
    timeoutChecker   *time.Ticker
    stuckDetector    *time.Ticker
    duplicateChecker *time.Ticker
}
```

## 4. Volume Capacity Reconciliation

### 4.1 Volume State Tracking

```go
type VolumeStateManager struct {
    masterClient      MasterClient
    inProgressTasks   map[uint32]*InProgressTask // VolumeID -> Task
    committedChanges  map[uint32]*VolumeChange   // Changes not yet in master
    reconcileInterval time.Duration
}

type VolumeChange struct {
    VolumeID         uint32
    ChangeType       ChangeType // "ec_encoding", "vacuum_completed"
    OldCapacity      int64
    NewCapacity      int64
    TaskID           string
    CompletedAt      time.Time
    ReportedToMaster bool
}
```

### 4.2 Shard Assignment Integration

When the master needs to assign shards, it must consider:
1. **Current volume state** from its own records
2. **In-progress capacity changes** from the admin server
3. **Committed but unreported changes** from the admin server

```go
type CapacityOracle struct {
    adminServer AdminServerClient
    masterState *MasterVolumeState
    updateFreq  time.Duration
}

func (o *CapacityOracle) GetAdjustedCapacity(volumeID uint32) int64 {
    baseCapacity := o.masterState.GetCapacity(volumeID)

    // Adjust for in-progress tasks
    if task := o.adminServer.GetInProgressTask(volumeID); task != nil {
        switch task.Type {
        case TaskTypeErasureCoding:
            // EC reduces effective capacity
            return baseCapacity / 2 // Simplified
        case TaskTypeVacuum:
            // Vacuum may increase available space
            return baseCapacity + int64(float64(baseCapacity)*0.3)
        }
    }

    // Adjust for completed but unreported changes
    if change := o.adminServer.GetPendingChange(volumeID); change != nil {
        return change.NewCapacity
    }

    return baseCapacity
}
```

## 5. Error Handling & Recovery

### 5.1 Worker Failure Scenarios

```go
type FailureHandler struct {
    taskRescheduler *TaskRescheduler
    workerMonitor   *WorkerMonitor
    alertManager    *AlertManager
}

// Failure Scenarios:
// 1. Worker becomes unresponsive (heartbeat timeout)
// 2. Task execution fails (reported by worker)
// 3. Task gets stuck (progress timeout)
// 4. Duplicate task detection
// 5. Resource exhaustion
```

### 5.2 Recovery Strategies

**Worker Timeout Recovery**:
- Mark worker as inactive after 3 missed heartbeats
- Reschedule all assigned tasks to other workers
- Clean up any partial state

**Task Stuck Recovery**:
- Detect tasks with no progress for > 2x estimated time
- Cancel stuck task and mark volume for cleanup
- Reschedule if retry count < max_retries (see the stuck-detection sketch below)
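
A minimal sketch of that timeout rule, built on the `InProgressTask` fields from section 3.2 (the helper names and the retry bookkeeping are assumptions):

```go
// Sketch only: illustrative helpers for the stuck-task rule above.
func isStuck(t *InProgressTask, estimated time.Duration, now time.Time) bool {
    // No progress update for more than 2x the estimated duration.
    return now.Sub(t.LastUpdate) > 2*estimated
}

func shouldReschedule(retryCount, maxRetries int) bool {
    return retryCount < maxRetries
}
```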

**Duplicate Task Prevention**:
```go
type DuplicateDetector struct {
    activeFingerprints map[string]bool // VolumeID+TaskType
    recentCompleted    *LRUCache       // Recently completed tasks
}

func (d *DuplicateDetector) IsTaskDuplicate(task *Task) bool {
    fingerprint := fmt.Sprintf("%d-%s", task.VolumeID, task.Type)
    return d.activeFingerprints[fingerprint] ||
        d.recentCompleted.Contains(fingerprint)
}
```

## 6. Simulation & Testing Framework

### 6.1 Failure Simulation

```go
type TaskSimulator struct {
    scenarios map[string]SimulationScenario
}

type SimulationScenario struct {
    Name            string
    WorkerCount     int
    VolumeCount     int
    FailurePatterns []FailurePattern
    Duration        time.Duration
}

type FailurePattern struct {
    Type        FailureType // "worker_timeout", "task_stuck", "duplicate"
    Probability float64     // 0.0 to 1.0
    Timing      TimingSpec  // When during task execution
    Duration    time.Duration
}
```

### 6.2 Test Scenarios

**Scenario 1: Worker Timeout During EC**
- Start EC task on 30GB volume
- Kill worker at 50% progress
- Verify task reassignment
- Verify no duplicate EC operations

**Scenario 2: Stuck Vacuum Task**
- Start vacuum on high-garbage volume
- Simulate worker hanging at 75% progress
- Verify timeout detection and cleanup
- Verify volume state consistency

**Scenario 3: Duplicate Task Prevention**
- Submit same EC task from multiple sources
- Verify only one task executes
- Verify proper conflict resolution

**Scenario 4: Master-Admin State Divergence**
- Create in-progress EC task
- Simulate master restart
- Verify state reconciliation
- Verify shard assignment accounts for in-progress work

## 7. Performance & Scalability

### 7.1 Metrics & Monitoring

```go
type SystemMetrics struct {
    TasksPerSecond    float64
    WorkerUtilization float64
    AverageTaskTime   time.Duration
    FailureRate       float64
    QueueDepth        int
    VolumeStatesSync  bool
}
```

### 7.2 Scalability Considerations

- **Horizontal Worker Scaling**: Add workers without admin server changes
- **Admin Server HA**: Master-slave admin servers for fault tolerance
- **Task Partitioning**: Partition tasks by collection or datacenter
- **Batch Operations**: Group similar tasks for efficiency

## 8. Implementation Plan

### Phase 1: Core Infrastructure
1. Admin server basic framework
2. Worker registration and heartbeat
3. Simple task assignment
4. Basic progress tracking

### Phase 2: Advanced Features
1. Volume state reconciliation
2. Sophisticated worker selection
3. Failure detection and recovery
4. Duplicate prevention

### Phase 3: Optimization & Monitoring
1. Performance metrics
2. Load balancing algorithms
3. Capacity planning integration
4. Comprehensive monitoring

This design provides a robust, scalable foundation for distributed task management in SeaweedFS while maintaining consistency with the existing architecture patterns.

@@ -1,145 +0,0 @@
# SQL Query Engine Feature, Dev, and Test Plan

This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics.

## Feature Plan

**1. Goal**

To provide a SQL querying interface for SeaweedFS, enabling analytics on existing MQ topics. This enables:
- Basic querying with SELECT, WHERE, and aggregations on MQ topics
- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE)
- In-place analytics on Parquet-stored messages without data movement

**2. Key Features**

* **Schema Discovery and Metadata:**
    * `SHOW DATABASES` - List all MQ namespaces
    * `SHOW TABLES` - List all topics in a namespace
    * `DESCRIBE table_name` - Show topic schema details
    * Automatic schema detection from existing Parquet data
* **Basic Query Engine:**
    * `SELECT` support with `WHERE`, `LIMIT`, `OFFSET`
    * Aggregation functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`
    * Temporal queries with timestamp-based filtering
* **User Interfaces:**
    * New CLI command `weed sql` with interactive shell mode
    * Optional: Web UI for query execution and result visualization
* **Output Formats:**
    * JSON (default), CSV, Parquet for result sets
    * Streaming results for large queries
    * Pagination support for result navigation

## Development Plan

**3. Data Source Integration**

* **MQ Topic Connector (Primary):**
    * Build on existing `weed/mq/logstore/read_parquet_to_log.go`
    * Implement efficient Parquet scanning with predicate pushdown (sketched below)
    * Support schema evolution and backward compatibility
    * Handle partition-based parallelism for scalable queries
* **Schema Registry Integration:**
    * Extend `weed/mq/schema/schema.go` for SQL metadata operations
    * Read existing topic schemas for query planning
    * Handle schema evolution during query execution
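
How the MQ topic connector could expose predicate pushdown to the query planner is sketched below. Everything here is hypothetical planning-level code: the type and interface names are invented for illustration and are not existing SeaweedFS APIs; the real connector would build on `read_parquet_to_log.go`.

```go
// Sketch only: a possible connector contract for the planned SQL engine.

// Predicate is a simple filter the scanner can push down into the Parquet
// reader, e.g. timestamp > 1640995200000.
type Predicate struct {
    Column   string
    Operator string // "=", "!=", ">", ">=", "<", "<="
    Value    interface{}
}

// TopicScan describes one scan over a topic's Parquet files.
type TopicScan struct {
    Namespace  string      // maps to a SQL database
    Topic      string      // maps to a SQL table
    Columns    []string    // projection
    Predicates []Predicate // pushed-down filters
    Limit      int         // 0 means no limit
}

// Row is one decoded message.
type Row map[string]interface{}

// TopicConnector is what the query planner would call. Results are delivered
// through a callback so large result sets can stream without being buffered,
// and each topic partition can be scanned by its own goroutine.
type TopicConnector interface {
    Scan(scan TopicScan, each func(Row) error) error
}
```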

**4. API & CLI Integration**

* **CLI Command:**
    * New `weed sql` command with interactive shell mode (similar to `weed shell`)
    * Support for script execution and result formatting
    * Connection management for remote SeaweedFS clusters
* **gRPC API:**
    * Add SQL service to existing MQ broker gRPC interface
    * Enable efficient query execution with streaming results

## Example Usage Scenarios

**Scenario 1: Schema Discovery and Metadata**
```sql
-- List all namespaces (databases)
SHOW DATABASES;

-- List topics in a namespace
USE my_namespace;
SHOW TABLES;

-- View topic structure and discovered schema
DESCRIBE user_events;
```

**Scenario 2: Data Querying**
```sql
-- Basic filtering and projection
SELECT user_id, event_type, timestamp
FROM user_events
WHERE timestamp > 1640995200000
LIMIT 100;

-- Aggregation queries
SELECT COUNT(*) as event_count
FROM user_events
WHERE timestamp >= 1640995200000;

-- More aggregation examples
SELECT MAX(timestamp), MIN(timestamp)
FROM user_events;
```

**Scenario 3: Analytics & Monitoring**
```sql
-- Basic analytics
SELECT COUNT(*) as total_events
FROM user_events
WHERE timestamp >= 1640995200000;

-- Simple monitoring
SELECT AVG(response_time) as avg_response
FROM api_logs
WHERE timestamp >= 1640995200000;
```

## Architecture Overview

```
SQL Query Flow:
  1. Parse SQL                         2. Plan & Optimize     3. Execute Query
┌─────────────┐    ┌──────────────┐    ┌─────────────────┐    ┌──────────────┐
│   Client    │    │  SQL Parser  │    │  Query Planner  │    │  Execution   │
│   (CLI)     │──→ │  PostgreSQL  │──→ │  & Optimizer    │──→ │  Engine      │
│             │    │  (Custom)    │    │                 │    │              │
└─────────────┘    └──────────────┘    └─────────────────┘    └──────────────┘
                                                │                      ▲
                                                │ Schema Lookup        │ Data Access
                                                ▼                      │
     ┌─────────────────────────────────────────────────────────────┐  │
     │                       Schema Catalog                        │  │
     │  • Namespace → Database mapping                             │  │
     │  • Topic → Table mapping                                    │  │
     │  • Schema version management                                │  │
     └─────────────────────────────────────────────────────────────┘  │
                                                ▲                      │
                                                │ Metadata             │
                                                │                      │
┌─────────────────────────────────────────────────────────────────────────────┐
│                              MQ Storage Layer                               │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐      │
│  │   Topic A   │   │   Topic B   │   │   Topic C   │   │     ...     │      │
│  │  (Parquet)  │   │  (Parquet)  │   │  (Parquet)  │   │  (Parquet)  │      │
│  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘      │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Success Metrics

* **Feature Completeness:** Support for all specified SELECT operations and metadata commands
* **Performance:**
    * **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records
    * **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records
    * **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows
* **Scalability:** Handle topics with millions of messages efficiently