The previous implementation tried all filers on every failure, including known-unhealthy ones, wasting time retrying permanently-down filers.

Problem scenario (3 filers, filer0 is down):
- Last successful: filer1 (saved as filerIndex=1)
- Next lookup when filer1 fails:
  Retry 1: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
  Retry 2: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
  Retry 3: filer1(fail) → filer2(fail) → filer0(fail, wastes 5s timeout)
- Total wasted: 15 seconds on a known-bad filer!

Solution: circuit breaker pattern
- Track consecutive failures per filer (atomic int32)
- Skip filers with 3+ consecutive failures
- Re-check unhealthy filers every 30 seconds
- Reset the failure count on success

New behavior:
- filer0 fails 3 times → marked unhealthy
- Future lookups skip filer0 for 30 seconds
- After 30s, re-check filer0 (allows recovery)
- If filer0 succeeds, reset its failure count to 0

Benefits:
1. Avoids wasting time on known-down filers
2. Still sticks to the last healthy filer (via filerIndex)
3. Allows recovery (30s re-check window)
4. No configuration needed (automatic)

Implementation details (see the sketch after this message):
- filerHealth struct tracks failureCount (atomic) + lastFailureTime
- shouldSkipUnhealthyFiler(): checks whether to skip this filer
- recordFilerSuccess(): resets the failure count to 0
- recordFilerFailure(): increments the count, updates the timestamp
- Logs when skipping unhealthy filers (V(2) level)

Example with circuit breaker:
- filer0 down, saved filerIndex=1 (filer1 healthy)
- Lookup 1: filer1(ok) → Done (0.01s)
- Lookup 2: filer1(fail) → filer2(ok) → Done, save filerIndex=2 (0.01s)
- Lookup 3: filer2(fail) → skip filer0 (unhealthy) → filer1(ok) → Done (0.01s)

Much better than wasting 15s trying filer0 repeatedly!
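Below is a minimal Go sketch of the per-filer circuit breaker described above. It is not the actual diff: the package name, field layout (storing the last failure time as unix nanoseconds), and exact signatures are assumptions for illustration. Only the helper names (filerHealth, shouldSkipUnhealthyFiler, recordFilerSuccess, recordFilerFailure), the atomic counter, and the 3-failure / 30-second thresholds come from this change description.

```go
// Sketch of the circuit breaker described above; names and thresholds from the
// change description, struct layout and package are illustrative assumptions.
package wdclient

import (
	"sync/atomic"
	"time"
)

const (
	maxConsecutiveFailures = 3                // skip a filer after this many consecutive failures
	unhealthyRecheckWindow = 30 * time.Second // re-check an unhealthy filer after this long
)

// filerHealth tracks consecutive failures for one filer.
type filerHealth struct {
	failureCount    int32 // consecutive failures, updated atomically
	lastFailureNano int64 // unix nanoseconds of the most recent failure, updated atomically
}

// shouldSkipUnhealthyFiler reports whether lookups should skip this filer:
// it has failed 3+ times in a row and its last failure was within the last 30s.
func (h *filerHealth) shouldSkipUnhealthyFiler() bool {
	if atomic.LoadInt32(&h.failureCount) < maxConsecutiveFailures {
		return false
	}
	lastFailure := time.Unix(0, atomic.LoadInt64(&h.lastFailureNano))
	return time.Since(lastFailure) < unhealthyRecheckWindow
}

// recordFilerSuccess resets the failure count so the filer is considered healthy again.
func (h *filerHealth) recordFilerSuccess() {
	atomic.StoreInt32(&h.failureCount, 0)
}

// recordFilerFailure increments the consecutive-failure count and stamps the failure time.
func (h *filerHealth) recordFilerFailure() {
	atomic.AddInt32(&h.failureCount, 1)
	atomic.StoreInt64(&h.lastFailureNano, time.Now().UnixNano())
}
```

With this in place, the lookup loop can keep its existing "stick to the last good filerIndex" behavior and simply call shouldSkipUnhealthyFiler() before dialing each candidate, recording success or failure after the attempt.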