You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
chrislu
3a5b5ea02c
improve: add circuit breaker to skip known-unhealthy filers
The previous implementation tried all filers on every failure, including
known-unhealthy ones. This wasted time retrying permanently down filers.
Problem scenario (3 filers, filer0 is down):
- Last successful: filer1 (saved as filerIndex=1)
- Next lookup when filer1 fails:
Retry 1: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
Retry 2: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
Retry 3: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
Total wasted: 15 seconds on known-bad filer!
Solution: Circuit breaker pattern
- Track consecutive failures per filer (atomic int32)
- Skip filers with 3+ consecutive failures
- Re-check unhealthy filers every 30 seconds
- Reset failure count on success
New behavior:
- filer0 fails 3 times → marked unhealthy
- Future lookups skip filer0 for 30 seconds
- After 30s, re-check filer0 (allows recovery)
- If filer0 succeeds, reset failure count to 0
Benefits:
1. Avoids wasting time on known-down filers
2. Still sticks to last healthy filer (via filerIndex)
3. Allows recovery (30s re-check window)
4. No configuration needed (automatic)
Implementation details:
- filerHealth struct tracks failureCount (atomic) + lastFailureTime
- shouldSkipUnhealthyFiler(): checks if we should skip this filer
- recordFilerSuccess(): resets failure count to 0
- recordFilerFailure(): increments count, updates timestamp
- Logs when skipping unhealthy filers (V(2) level)
Example with circuit breaker:
- filer0 down, saved filerIndex=1 (filer1 healthy)
- Lookup 1: filer1(ok) → Done (0.01s)
- Lookup 2: filer1(fail) → filer2(ok) → Done, save filerIndex=2 (0.01s)
- Lookup 3: filer2(fail) → skip filer0 (unhealthy) → filer1(ok) → Done (0.01s)
Much better than wasting 15s trying filer0 repeatedly!
|
2 months ago |
| .. |
|
exclusive_locks
|
ExclusiveLocker only create one renew goroutine (#6269)
|
1 year ago |
|
net2
|
convert error fromating to %w everywhere (#6995)
|
6 months ago |
|
resource_pool
|
convert error fromating to %w everywhere (#6995)
|
6 months ago |
|
filer_client.go
|
improve: add circuit breaker to skip known-unhealthy filers
|
2 months ago |
|
masterclient.go
|
fix: restore observability instrumentation in MasterClient
|
2 months ago |
|
vid_map.go
|
Fix masterclient vidmap race condition (#7412)
|
3 months ago |
|
vid_map_test.go
|
Implement SRV lookups for filer (#4767)
|
2 years ago |
|
vidmap_client.go
|
fix: handle partial results correctly in LookupVolumeIdsWithFallback callers
|
2 months ago |