chrislu
3a5b5ea02c
improve: add circuit breaker to skip known-unhealthy filers
The previous implementation tried all filers on every failure, including
known-unhealthy ones. This wasted time retrying permanently down filers.
Problem scenario (3 filers, filer0 is down):
- Last successful: filer1 (saved as filerIndex=1)
- Next lookup when filer1 fails:
    Retry 1: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
    Retry 2: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
    Retry 3: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
Total wasted: 3 retries × 5s = 15 seconds on a known-bad filer!
Solution: Circuit breaker pattern
- Track consecutive failures per filer (atomic int32)
- Skip filers with 3+ consecutive failures
- Re-check unhealthy filers every 30 seconds
- Reset failure count on success
New behavior:
- filer0 fails 3 times → marked unhealthy
- Future lookups skip filer0 for 30 seconds
- After 30s, re-check filer0 (allows recovery)
- If filer0 succeeds, reset failure count to 0
Benefits:
1. Avoids wasting time on known-down filers
2. Still sticks to last healthy filer (via filerIndex)
3. Allows recovery (30s re-check window)
4. No configuration needed (automatic)
Implementation details:
- filerHealth struct tracks failureCount (atomic) + lastFailureTime
- shouldSkipUnhealthyFiler(): checks if we should skip this filer
- recordFilerSuccess(): resets failure count to 0
- recordFilerFailure(): increments count, updates timestamp
- Logs when skipping unhealthy filers (V(2) level)
Example with circuit breaker:
- filer0 down, saved filerIndex=1 (filer1 healthy)
- Lookup 1: filer1(ok) → Done (0.01s)
- Lookup 2: filer1(fail) → filer2(ok) → Done, save filerIndex=2 (0.01s)
- Lookup 3: filer2(fail) → skip filer0 (unhealthy) → filer1(ok) → Done (0.01s)
Much better than wasting 15s trying filer0 repeatedly!
2 months ago

| Name | Last commit | Updated |
|---|---|---|
| .. | | |
| admin | muted texts | 2 months ago |
| cluster | adds FilerClient to use cached volume id | 2 months ago |
| command | backup: handle volume not found when backing up (#7465) | 2 months ago |
| credential | Filer Store: postgres backend support pgbouncer (#7077) | 5 months ago |
| filer | improve: address remaining code review findings | 2 months ago |
| filer_client | Clean up logs and deprecated functions (#7339) | 3 months ago |
| glog | Add Kafka Gateway (#7231) | 3 months ago |
| iam | S3: Enforce bucket policy (#7471) | 2 months ago |
| iamapi | Clean up logs and deprecated functions (#7339) | 3 months ago |
| images | Migrates from disintegration/imaging c2019 to cognusion/imaging c2024. (#5533) | 2 years ago |
| kms | S3 API: Add integration with KMS providers (#7152) | 5 months ago |
| mount | improve: address remaining code review findings | 2 months ago |
| mq | S3: Directly read write volume servers (#7481) | 2 months ago |
| notification | fix: dead letter message log message (#7072) | 5 months ago |
| operation | S3: Directly read write volume servers (#7481) | 2 months ago |
| pb | S3: Directly read write volume servers (#7481) | 2 months ago |
| query | Fix date string parsing bug for the SQL Engine. (#7446) | 2 months ago |
| remote_storage | Filer: Fixed critical bugs in the Azure SDK migration (PR #7310) (#7401) | 2 months ago |
| replication | Filer: Fixed critical bugs in the Azure SDK migration (PR #7310) (#7401) | 2 months ago |
| s3api | fix: FilerClient supports multiple filer addresses for high availability | 2 months ago |
| security | remove spoof-able request header (#7103) | 5 months ago |
| sequence | remove unused function | 2 years ago |
| server | filer store: add foundationdb (#7178) | 2 months ago |
| sftpd | S3 API: Advanced IAM System (#7160) | 4 months ago |
| shell | Account Info (#7507) | 2 months ago |
| static | Fix Broken Links (#5287) | 2 years ago |
| stats | [volume] refactor and add metrics for flight upload and download data limit condition (#6920) | 6 months ago |
| storage | Volume Server: avoid aggressive volume assignment (#7501) | 2 months ago |
| telemetry | convert error fromating to %w everywhere (#6995) | 6 months ago |
| topology | master: fix negative active volumes (#7440) | 2 months ago |
| util | S3: Directly read write volume servers (#7481) | 2 months ago |
| wdclient | improve: add circuit breaker to skip known-unhealthy filers | 2 months ago |
| worker | go fmt | 3 months ago |
| Makefile | test versioning also (#7000) | 6 months ago |
| weed.go | set exit status | 10 months ago |