# Bucket Policy Engine Integration - Complete

## Summary

Successfully integrated the `policy_engine` package to evaluate bucket policies for **all requests**, both anonymous and authenticated. This provides comprehensive AWS S3-compatible bucket policy support.

## What Changed

### 1. **New File: `s3api_bucket_policy_engine.go`**

Created a wrapper around `policy_engine.PolicyEngine` to:

- Load bucket policies from filer entries
- Sync policies from the bucket config cache
- Evaluate policies for any request (bucket, object, action, principal)
- Return structured results (allowed, evaluated, error)

### 2. **Modified: `s3api_server.go`**

- Added a `policyEngine *BucketPolicyEngine` field to the `S3ApiServer` struct
- Initialized the policy engine in `NewS3ApiServerWithStore()`
- Linked `IdentityAccessManagement` back to `S3ApiServer` for policy evaluation

### 3. **Modified: `auth_credentials.go`**

- Added an `s3ApiServer *S3ApiServer` field to the `IdentityAccessManagement` struct
- Added a `buildPrincipalARN()` helper to convert identities to AWS ARN format
- **Integrated bucket policy evaluation into the authentication flow:**
  - Policies are now checked **before** IAM/identity-based permissions
  - An explicit `Deny` in a bucket policy blocks access immediately
  - An explicit `Allow` in a bucket policy grants access and **bypasses IAM checks** (enabling cross-account access)
  - If no policy exists, the request falls through to the normal IAM checks
  - Policy evaluation errors result in access denial (fail-close security)
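
The ARN-building step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the signature and account-ID handling of `buildPrincipalARN()` here are hypothetical, not the actual SeaweedFS helper.

```go
package main

import "fmt"

// buildPrincipalARN maps an authenticated identity to an AWS-style principal
// ARN, and an anonymous request to the wildcard principal "*". The parameters
// are illustrative assumptions for this sketch.
func buildPrincipalARN(accountID, userName string) string {
	if userName == "" {
		return "*" // anonymous requests match the wildcard principal
	}
	return fmt.Sprintf("arn:aws:iam::%s:user/%s", accountID, userName)
}

func main() {
	fmt.Println(buildPrincipalARN("123456789012", "bob")) // arn:aws:iam::123456789012:user/bob
	fmt.Println(buildPrincipalARN("", ""))                // *
}
```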
|
||||
|
|
||||
### 4. **Modified: `s3api_bucket_config.go`** |
|
||||
- Added policy engine sync when bucket configs are loaded |
|
||||
- Ensures policies are loaded into the engine for evaluation |
|
||||
|
|
||||
### 5. **Modified: `auth_credentials_subscribe.go`** |
|
||||
- Added policy engine sync when bucket metadata changes |
|
||||
- Keeps the policy engine up-to-date via event-driven updates |
|
||||
|
|
||||
## How It Works |
|
||||
|
|
||||
### Anonymous Requests |
|
||||
``` |
|
||||
1. Request comes in (no credentials) |
|
||||
2. Check ACL-based public access → if public, allow |
|
||||
3. Check bucket policy for anonymous ("*") access → if allowed, allow |
|
||||
4. Otherwise, deny |
|
||||
``` |
|
||||
|
|
||||
### Authenticated Requests (NEW!) |
|
||||
``` |
|
||||
1. Request comes in (with credentials) |
|
||||
2. Authenticate user → get Identity |
|
||||
3. Build principal ARN (e.g., "arn:aws:iam::123456:user/bob") |
|
||||
4. Check bucket policy: |
|
||||
- If DENY → reject immediately |
|
||||
- If ALLOW → grant access immediately (bypasses IAM checks) |
|
||||
- If no policy or no matching statements → continue to step 5 |
|
||||
5. Check IAM/identity-based permissions (only if not already allowed by bucket policy) |
|
||||
6. Allow or deny based on identity permissions |
|
||||
``` |
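
The decision ordering above can be condensed into a small sketch. The type and function names here are illustrative, not the actual SeaweedFS API:

```go
package main

import "fmt"

// PolicyResult mirrors the three possible bucket-policy outcomes.
type PolicyResult int

const (
	PolicyDeny          PolicyResult = iota // explicit Deny: reject immediately
	PolicyAllow                             // explicit Allow: grant, bypassing IAM
	PolicyIndeterminate                     // no policy, or no matching statement
)

// authorize applies the ordering described above: Deny wins, Allow bypasses
// the IAM check, and an indeterminate result falls through to IAM.
func authorize(policy PolicyResult, iamAllows bool) bool {
	switch policy {
	case PolicyDeny:
		return false
	case PolicyAllow:
		return true
	default:
		return iamAllows
	}
}

func main() {
	fmt.Println(authorize(PolicyDeny, true))          // false: explicit deny wins
	fmt.Println(authorize(PolicyAllow, false))        // true: policy allow bypasses IAM
	fmt.Println(authorize(PolicyIndeterminate, true)) // true: falls back to IAM
}
```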

## Policy Evaluation Flow

```
┌─────────────────────────────────────────────────────────┐
│               Request (GET /bucket/file)                │
└───────────────────────────┬─────────────────────────────┘
                            │
                ┌───────────▼──────────┐
                │  Authenticate User   │
                │   (or Anonymous)     │
                └───────────┬──────────┘
                            │
        ┌───────────────────▼──────────────────────┐
        │  Build Principal ARN                     │
        │  - Anonymous: "*"                        │
        │  - User: "arn:aws:iam::123456:user/bob"  │
        └───────────────────┬──────────────────────┘
                            │
        ┌───────────────────▼──────────────────────┐
        │  Evaluate Bucket Policy (PolicyEngine)   │
        │  - Action: "s3:GetObject"                │
        │  - Resource: "arn:aws:s3:::bucket/file"  │
        │  - Principal: (from above)               │
        └───────────────────┬──────────────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
         DENY │       ALLOW │   NO POLICY │
              ▼             ▼             ▼
       Reject Request Grant Access    Continue
                                          │
                            ┌─────────────┘
                            │
               ┌────────────▼─────────────┐
               │    IAM/Identity Check    │
               │     (identity.canDo)     │
               └────────────┬─────────────┘
                            │
                  ┌─────────┴─────────┐
                  │                   │
            ALLOW │              DENY │
                  ▼                   ▼
            Grant Access       Reject Request
```

## Example Policies That Now Work

### 1. **Public Read Access** (Anonymous)

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::mybucket/*"
  }]
}
```

- Anonymous users can read all objects
- Authenticated users are also evaluated against this policy; if they don't match an explicit `Allow` for the requested action, they fall back to their own IAM permissions

### 2. **Grant Access to Specific User** (Authenticated)

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:user/bob"},
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::mybucket/shared/*"
  }]
}
```

- User "bob" can read and write objects under the `/shared/` prefix
- Other users cannot (unless granted by their own IAM policies)

### 3. **Deny Access to Specific Path** (Both)

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::mybucket/confidential/*"
  }]
}
```

- **No one** can access `/confidential/` objects
- Denies override all other allows (per AWS policy evaluation rules)

## Performance Characteristics

### Policy Loading

- **Cold start**: Policy loaded from filer → parsed → compiled → cached
- **Warm path**: Policy retrieved from `BucketConfigCache` (already parsed)
- **Updates**: Event-driven sync via metadata subscription (real-time)

### Policy Evaluation

- **Compiled policies**: Pre-compiled regex patterns and matchers
- **Pattern cache**: Regex patterns cached with LRU eviction (max 1000 entries)
- **Fast path**: Common patterns (`*`, exact matches) are optimized
- **Case sensitivity**: Actions are case-insensitive, resources case-sensitive (AWS-compatible)

### Overhead

- **Anonymous requests**: Minimal (policy was already checked; it now uses the compiled engine)
- **Authenticated requests**: ~1-2ms added for policy evaluation (compiled patterns)
- **No policy**: Near-zero overhead (quick indeterminate check)

## Testing

All tests pass:

```
✅ TestBucketPolicyValidationBasics
✅ TestPrincipalMatchesAnonymous
✅ TestActionToS3Action
✅ TestResourceMatching
✅ TestMatchesPatternRegexEscaping (security tests)
✅ TestActionMatchingCaseInsensitive
✅ TestResourceMatchingCaseSensitive
✅ All policy_engine package tests (30+ tests)
```

## Security Improvements

1. **Regex Metacharacter Escaping**: Patterns like `*.json` properly match only files ending in `.json` (not `filexjson`)
2. **Case-Insensitive Actions**: S3 actions are matched case-insensitively per the AWS spec
3. **Case-Sensitive Resources**: Resource paths are matched case-sensitively for security
4. **Pattern Cache Size Limit**: Prevents DoS via unbounded cache growth
5. **Principal Validation**: Supports `[]string` for manually constructed policies
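
The escaping and case-sensitivity behavior listed above can be sketched with Go's standard `regexp` package. This is an illustrative reconstruction, not the actual `policy_engine` code:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// patternToRegexp escapes every regex metacharacter in an S3-style pattern
// and then translates only the wildcards * and ?, so "*.json" matches
// "a.json" but not "axjson". Actions are compiled case-insensitively;
// resources are not.
func patternToRegexp(pattern string, caseInsensitive bool) *regexp.Regexp {
	quoted := regexp.QuoteMeta(pattern)           // "." becomes literal "\."
	quoted = strings.ReplaceAll(quoted, `\*`, ".*")
	quoted = strings.ReplaceAll(quoted, `\?`, ".")
	expr := "^" + quoted + "$"
	if caseInsensitive {
		expr = "(?i)" + expr
	}
	return regexp.MustCompile(expr)
}

func main() {
	re := patternToRegexp("*.json", false)
	fmt.Println(re.MatchString("data.json")) // true
	fmt.Println(re.MatchString("dataxjson")) // false: the dot is literal
}
```

A production engine would additionally cache the compiled patterns (the LRU cache mentioned above) rather than recompiling per request.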

## AWS Compatibility

The implementation follows AWS S3 bucket policy evaluation rules:

1. **Explicit Deny** always wins (checked first)
2. **Explicit Allow** grants access (checked second)
3. **Default Deny** applies if no statements match (implicit)
4. Bucket policies work alongside IAM policies (both are evaluated)

## Files Changed

```
Modified:
  weed/s3api/auth_credentials.go           (+47 lines)
  weed/s3api/auth_credentials_subscribe.go (+8 lines)
  weed/s3api/s3api_bucket_config.go        (+8 lines)
  weed/s3api/s3api_server.go               (+5 lines)

New:
  weed/s3api/s3api_bucket_policy_engine.go (115 lines)
```

## Migration Notes

- **Backward Compatible**: Existing setups without bucket policies work unchanged
- **No Breaking Changes**: All existing ACL- and IAM-based authorization still works
- **Additive Feature**: Bucket policies are an additional layer of authorization
- **Performance**: Minimal impact on existing workloads

## Future Enhancements

Potential improvements (not implemented yet):

- [ ] Condition support (IP address, time-based, etc.) - already in policy_engine
- [ ] Cross-account policies (different AWS accounts)
- [ ] Policy validation API endpoint
- [ ] Policy simulation/testing tool
- [ ] Metrics for policy evaluations (allow/deny counts)

## Conclusion

Bucket policies now work for **all requests** in the SeaweedFS S3 API:

- ✅ Anonymous requests (public access)
- ✅ Authenticated requests (user-specific policies)
- ✅ High performance (compiled policies, caching)
- ✅ AWS-compatible (follows AWS evaluation rules)
- ✅ Secure (proper escaping, correct case sensitivity)

The integration is complete, tested, and ready for use!

---

# SeaweedFS Task Distribution System Design

## Overview

This document describes the design of a distributed task management system for SeaweedFS that handles Erasure Coding (EC) and vacuum operations through a scalable admin server and worker process architecture.

## System Architecture

### High-Level Components

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     Master      │◄──►│   Admin Server   │◄──►│     Workers     │
│                 │    │                  │    │                 │
│ - Volume Info   │    │ - Task Discovery │    │ - Task Exec     │
│ - Shard Status  │    │ - Task Assign    │    │ - Progress      │
│ - Heartbeats    │    │ - Progress Track │    │ - Error Report  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         ▼                      ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Volume Servers  │    │  Volume Monitor  │    │ Task Execution  │
│                 │    │                  │    │                 │
│ - Store Volumes │    │ - Health Check   │    │ - EC Convert    │
│ - EC Shards     │    │ - Usage Stats    │    │ - Vacuum Clean  │
│ - Report Status │    │ - State Sync     │    │ - Status Report │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

## 1. Admin Server Design

### 1.1 Core Responsibilities

- **Task Discovery**: Scan volumes to identify EC and vacuum candidates
- **Worker Management**: Track available workers and their capabilities
- **Task Assignment**: Match tasks to optimal workers
- **Progress Tracking**: Monitor in-progress tasks for capacity planning
- **State Reconciliation**: Sync with the master server for volume state updates

### 1.2 Task Discovery Engine

```go
type TaskDiscoveryEngine struct {
	masterClient  MasterClient
	volumeScanner VolumeScanner
	taskDetectors map[TaskType]TaskDetector
	scanInterval  time.Duration
}

type VolumeCandidate struct {
	VolumeID   uint32
	Server     string
	Collection string
	TaskType   TaskType
	Priority   TaskPriority
	Reason     string
	DetectedAt time.Time
	Parameters map[string]interface{}
}
```

**EC Detection Logic**:
- Find volumes that are >= 95% full and idle for more than 1 hour
- Exclude volumes already in EC format
- Exclude volumes with ongoing operations
- Prioritize by collection and age

**Vacuum Detection Logic**:
- Find volumes with a garbage ratio > 30%
- Exclude read-only volumes
- Exclude volumes with recent vacuum operations
- Prioritize by garbage percentage

### 1.3 Worker Registry & Management

```go
type WorkerRegistry struct {
	workers        map[string]*Worker
	capabilities   map[TaskType][]*Worker
	lastHeartbeat  map[string]time.Time
	taskAssignment map[string]*Task
	mutex          sync.RWMutex
}

type Worker struct {
	ID            string
	Address       string
	Capabilities  []TaskType
	MaxConcurrent int
	CurrentLoad   int
	Status        WorkerStatus
	LastSeen      time.Time
	Performance   WorkerMetrics
}
```

### 1.4 Task Assignment Algorithm

```go
type TaskScheduler struct {
	registry           *WorkerRegistry
	taskQueue          *PriorityQueue
	inProgressTasks    map[string]*InProgressTask
	volumeReservations map[uint32]*VolumeReservation
}

// Worker Selection Criteria:
// 1. Has the required capability (EC or Vacuum)
// 2. Available capacity (CurrentLoad < MaxConcurrent)
// 3. Best performance history for the task type
// 4. Lowest current load
// 5. Geographically close to the volume server (optional)
```
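
The selection criteria above can be sketched as a filter-then-rank step. This is an illustrative sketch: capability filtering (criterion 1) is assumed to have already happened, and the field names are simplified:

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a trimmed-down view of a Worker for selection purposes.
type candidate struct {
	ID          string
	Load        int     // CurrentLoad
	Max         int     // MaxConcurrent
	SuccessRate float64 // performance history for this task type, 0..1
}

// pickWorker filters to workers with spare capacity, then prefers the best
// success rate, breaking ties by the lowest current load.
func pickWorker(workers []candidate) (candidate, bool) {
	var eligible []candidate
	for _, w := range workers {
		if w.Load < w.Max {
			eligible = append(eligible, w)
		}
	}
	if len(eligible) == 0 {
		return candidate{}, false
	}
	sort.Slice(eligible, func(i, j int) bool {
		if eligible[i].SuccessRate != eligible[j].SuccessRate {
			return eligible[i].SuccessRate > eligible[j].SuccessRate
		}
		return eligible[i].Load < eligible[j].Load
	})
	return eligible[0], true
}

func main() {
	w, _ := pickWorker([]candidate{
		{ID: "w1", Load: 2, Max: 2, SuccessRate: 0.99}, // at capacity, filtered out
		{ID: "w2", Load: 1, Max: 4, SuccessRate: 0.90},
		{ID: "w3", Load: 0, Max: 4, SuccessRate: 0.90},
	})
	fmt.Println(w.ID) // w3: same success rate as w2, lower load
}
```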

## 2. Worker Process Design

### 2.1 Worker Architecture

```go
type MaintenanceWorker struct {
	id              string
	config          *WorkerConfig
	adminClient     AdminClient
	taskExecutors   map[TaskType]TaskExecutor
	currentTasks    map[string]*RunningTask
	registry        *TaskRegistry
	heartbeatTicker *time.Ticker
	requestTicker   *time.Ticker
}
```

### 2.2 Task Execution Framework

```go
type TaskExecutor interface {
	Execute(ctx context.Context, task *Task) error
	EstimateTime(task *Task) time.Duration
	ValidateResources(task *Task) error
	GetProgress() float64
	Cancel() error
}

type ErasureCodingExecutor struct {
	volumeClient VolumeServerClient
	progress     float64
	cancelled    bool
}

type VacuumExecutor struct {
	volumeClient VolumeServerClient
	progress     float64
	cancelled    bool
}
```

### 2.3 Worker Capabilities & Registration

```go
type WorkerCapabilities struct {
	SupportedTasks   []TaskType
	MaxConcurrent    int
	ResourceLimits   ResourceLimits
	PreferredServers []string // affinity for specific volume servers
}

type ResourceLimits struct {
	MaxMemoryMB    int64
	MaxDiskSpaceMB int64
	MaxNetworkMbps int64
	MaxCPUPercent  float64
}
```

## 3. Task Lifecycle Management

### 3.1 Task States

```go
type TaskState string

const (
	TaskStatePending    TaskState = "pending"
	TaskStateAssigned   TaskState = "assigned"
	TaskStateInProgress TaskState = "in_progress"
	TaskStateCompleted  TaskState = "completed"
	TaskStateFailed     TaskState = "failed"
	TaskStateCancelled  TaskState = "cancelled"
	TaskStateStuck      TaskState = "stuck"     // taking too long
	TaskStateDuplicate  TaskState = "duplicate" // detected duplicate
)
```

### 3.2 Progress Tracking & Monitoring

```go
type InProgressTask struct {
	Task           *Task
	WorkerID       string
	StartedAt      time.Time
	LastUpdate     time.Time
	Progress       float64
	EstimatedEnd   time.Time
	VolumeReserved bool // reserved for capacity planning
}

type TaskMonitor struct {
	inProgressTasks  map[string]*InProgressTask
	timeoutChecker   *time.Ticker
	stuckDetector    *time.Ticker
	duplicateChecker *time.Ticker
}
```

## 4. Volume Capacity Reconciliation

### 4.1 Volume State Tracking

```go
type VolumeStateManager struct {
	masterClient      MasterClient
	inProgressTasks   map[uint32]*InProgressTask // VolumeID -> Task
	committedChanges  map[uint32]*VolumeChange   // changes not yet in master
	reconcileInterval time.Duration
}

type VolumeChange struct {
	VolumeID         uint32
	ChangeType       ChangeType // "ec_encoding", "vacuum_completed"
	OldCapacity      int64
	NewCapacity      int64
	TaskID           string
	CompletedAt      time.Time
	ReportedToMaster bool
}
```

### 4.2 Shard Assignment Integration

When the master needs to assign shards, it must consider:

1. **Current volume state** from its own records
2. **In-progress capacity changes** from the admin server
3. **Committed but unreported changes** from the admin server

```go
type CapacityOracle struct {
	adminServer AdminServerClient
	masterState *MasterVolumeState
	updateFreq  time.Duration
}

func (o *CapacityOracle) GetAdjustedCapacity(volumeID uint32) int64 {
	baseCapacity := o.masterState.GetCapacity(volumeID)

	// Adjust for in-progress tasks.
	if task := o.adminServer.GetInProgressTask(volumeID); task != nil {
		switch task.Type {
		case TaskTypeErasureCoding:
			// EC reduces effective capacity.
			return baseCapacity / 2 // simplified
		case TaskTypeVacuum:
			// Vacuum may increase available space.
			return baseCapacity + int64(float64(baseCapacity)*0.3)
		}
	}

	// Adjust for completed but not yet reported changes.
	if change := o.adminServer.GetPendingChange(volumeID); change != nil {
		return change.NewCapacity
	}

	return baseCapacity
}
```

## 5. Error Handling & Recovery

### 5.1 Worker Failure Scenarios

```go
type FailureHandler struct {
	taskRescheduler *TaskRescheduler
	workerMonitor   *WorkerMonitor
	alertManager    *AlertManager
}

// Failure Scenarios:
// 1. Worker becomes unresponsive (heartbeat timeout)
// 2. Task execution fails (reported by worker)
// 3. Task gets stuck (progress timeout)
// 4. Duplicate task detection
// 5. Resource exhaustion
```

### 5.2 Recovery Strategies

**Worker Timeout Recovery**:
- Mark the worker as inactive after 3 missed heartbeats
- Reschedule all of its assigned tasks to other workers
- Clean up any partial state

**Task Stuck Recovery**:
- Detect tasks with no progress for more than 2x the estimated time
- Cancel the stuck task and mark the volume for cleanup
- Reschedule if retry count < max_retries

**Duplicate Task Prevention**:

```go
type DuplicateDetector struct {
	activeFingerprints map[string]bool // VolumeID+TaskType
	recentCompleted    *LRUCache       // recently completed tasks
}

func (d *DuplicateDetector) IsTaskDuplicate(task *Task) bool {
	fingerprint := fmt.Sprintf("%d-%s", task.VolumeID, task.Type)
	return d.activeFingerprints[fingerprint] ||
		d.recentCompleted.Contains(fingerprint)
}
```

## 6. Simulation & Testing Framework

### 6.1 Failure Simulation

```go
type TaskSimulator struct {
	scenarios map[string]SimulationScenario
}

type SimulationScenario struct {
	Name            string
	WorkerCount     int
	VolumeCount     int
	FailurePatterns []FailurePattern
	Duration        time.Duration
}

type FailurePattern struct {
	Type        FailureType // "worker_timeout", "task_stuck", "duplicate"
	Probability float64     // 0.0 to 1.0
	Timing      TimingSpec  // when during task execution
	Duration    time.Duration
}
```

### 6.2 Test Scenarios

**Scenario 1: Worker Timeout During EC**
- Start an EC task on a 30GB volume
- Kill the worker at 50% progress
- Verify task reassignment
- Verify no duplicate EC operations

**Scenario 2: Stuck Vacuum Task**
- Start a vacuum on a high-garbage volume
- Simulate the worker hanging at 75% progress
- Verify timeout detection and cleanup
- Verify volume state consistency

**Scenario 3: Duplicate Task Prevention**
- Submit the same EC task from multiple sources
- Verify only one task executes
- Verify proper conflict resolution

**Scenario 4: Master-Admin State Divergence**
- Create an in-progress EC task
- Simulate a master restart
- Verify state reconciliation
- Verify shard assignment accounts for in-progress work

## 7. Performance & Scalability

### 7.1 Metrics & Monitoring

```go
type SystemMetrics struct {
	TasksPerSecond    float64
	WorkerUtilization float64
	AverageTaskTime   time.Duration
	FailureRate       float64
	QueueDepth        int
	VolumeStatesSync  bool
}
```

### 7.2 Scalability Considerations

- **Horizontal Worker Scaling**: Add workers without admin server changes
- **Admin Server HA**: Master-slave admin servers for fault tolerance
- **Task Partitioning**: Partition tasks by collection or datacenter
- **Batch Operations**: Group similar tasks for efficiency

## 8. Implementation Plan

### Phase 1: Core Infrastructure
1. Admin server basic framework
2. Worker registration and heartbeat
3. Simple task assignment
4. Basic progress tracking

### Phase 2: Advanced Features
1. Volume state reconciliation
2. Sophisticated worker selection
3. Failure detection and recovery
4. Duplicate prevention

### Phase 3: Optimization & Monitoring
1. Performance metrics
2. Load balancing algorithms
3. Capacity planning integration
4. Comprehensive monitoring

This design provides a robust, scalable foundation for distributed task management in SeaweedFS while maintaining consistency with the existing architecture patterns.

---

# SQL Query Engine Feature, Dev, and Test Plan

This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics.

## Feature Plan

**1. Goal**

To provide a SQL querying interface for SeaweedFS, enabling analytics on existing MQ topics. This enables:
- Basic querying with SELECT, WHERE, and aggregations on MQ topics
- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE)
- In-place analytics on Parquet-stored messages without data movement

**2. Key Features**

* **Schema Discovery and Metadata:**
    * `SHOW DATABASES` - list all MQ namespaces
    * `SHOW TABLES` - list all topics in a namespace
    * `DESCRIBE table_name` - show topic schema details
    * Automatic schema detection from existing Parquet data
* **Basic Query Engine:**
    * `SELECT` support with `WHERE`, `LIMIT`, `OFFSET`
    * Aggregation functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`
    * Temporal queries with timestamp-based filtering
* **User Interfaces:**
    * New CLI command `weed sql` with an interactive shell mode
    * Optional: web UI for query execution and result visualization
* **Output Formats:**
    * JSON (default), CSV, and Parquet for result sets
    * Streaming results for large queries
    * Pagination support for result navigation

## Development Plan

**3. Data Source Integration**

* **MQ Topic Connector (Primary):**
    * Build on the existing `weed/mq/logstore/read_parquet_to_log.go`
    * Implement efficient Parquet scanning with predicate pushdown
    * Support schema evolution and backward compatibility
    * Handle partition-based parallelism for scalable queries
* **Schema Registry Integration:**
    * Extend `weed/mq/schema/schema.go` for SQL metadata operations
    * Read existing topic schemas for query planning
    * Handle schema evolution during query execution
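
The predicate-pushdown idea mentioned above can be illustrated with a tiny sketch: Parquet row groups carry min/max column statistics, so a timestamp filter can skip whole row groups without reading them. The `rowGroupStats` type here is a hypothetical stand-in, not an actual parquet library API:

```go
package main

import "fmt"

// rowGroupStats models the min/max timestamp statistics a Parquet row group
// carries (illustrative names only).
type rowGroupStats struct {
	MinTs, MaxTs int64
}

// canSkip applies pushdown for a "timestamp > t" filter: a row group whose
// maximum timestamp is at or below the bound cannot contain matching rows,
// so the scanner can skip reading it entirely.
func canSkip(rg rowGroupStats, greaterThan int64) bool {
	return rg.MaxTs <= greaterThan
}

func main() {
	old := rowGroupStats{MinTs: 1000, MaxTs: 2000}
	recent := rowGroupStats{MinTs: 1500, MaxTs: 3000}
	fmt.Println(canSkip(old, 2000))    // true: nothing after the bound
	fmt.Println(canSkip(recent, 2000)) // false: must scan this group
}
```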

**4. API & CLI Integration**

* **CLI Command:**
    * New `weed sql` command with an interactive shell mode (similar to `weed shell`)
    * Support for script execution and result formatting
    * Connection management for remote SeaweedFS clusters
* **gRPC API:**
    * Add an SQL service to the existing MQ broker gRPC interface
    * Enable efficient query execution with streaming results

## Example Usage Scenarios

**Scenario 1: Schema Discovery and Metadata**
```sql
-- List all namespaces (databases)
SHOW DATABASES;

-- List topics in a namespace
USE my_namespace;
SHOW TABLES;

-- View topic structure and discovered schema
DESCRIBE user_events;
```

**Scenario 2: Data Querying**
```sql
-- Basic filtering and projection
SELECT user_id, event_type, timestamp
FROM user_events
WHERE timestamp > 1640995200000
LIMIT 100;

-- Aggregation queries
SELECT COUNT(*) as event_count
FROM user_events
WHERE timestamp >= 1640995200000;

-- More aggregation examples
SELECT MAX(timestamp), MIN(timestamp)
FROM user_events;
```

**Scenario 3: Analytics & Monitoring**
```sql
-- Basic analytics
SELECT COUNT(*) as total_events
FROM user_events
WHERE timestamp >= 1640995200000;

-- Simple monitoring
SELECT AVG(response_time) as avg_response
FROM api_logs
WHERE timestamp >= 1640995200000;
```

## Architecture Overview

```
SQL Query Flow:
  1. Parse SQL           2. Plan & Optimize        3. Execute Query

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐     ┌──────────────┐
│   Client    │     │  SQL Parser  │     │  Query Planner  │     │  Execution   │
│   (CLI)     │ ──→ │  PostgreSQL  │ ──→ │  & Optimizer    │ ──→ │  Engine      │
│             │     │  (Custom)    │     │                 │     │              │
└─────────────┘     └──────────────┘     └─────────────────┘     └──────────────┘
                           │                                            │
                           │ Schema Lookup                              │ Data Access
                           ▼                                            ▼
      ┌─────────────────────────────────────────────────────────────┐
      │                       Schema Catalog                        │
      │  • Namespace → Database mapping                             │
      │  • Topic → Table mapping                                    │
      │  • Schema version management                                │
      └─────────────────────────────────────────────────────────────┘
                           ▲
                           │ Metadata
                           │
┌─────────────────────────────────────────────────────────────────────────────┐
│                              MQ Storage Layer                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Topic A   │  │   Topic B   │  │   Topic C   │  │     ...     │         │
│  │  (Parquet)  │  │  (Parquet)  │  │  (Parquet)  │  │  (Parquet)  │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Success Metrics

* **Feature Completeness:** Support for all specified SELECT operations and metadata commands
* **Performance:**
    * **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records
    * **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records
    * **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows
* **Scalability:** Handle topics with millions of messages efficiently