The previous implementation used util.Retry which only retries errors
containing the string "transport". This is insufficient for handling
the full range of transient gRPC errors.
Changes:
1. Added isRetryableGrpcError() to properly inspect gRPC status codes
- Retries: Unavailable, DeadlineExceeded, ResourceExhausted, Aborted
- Falls back to string matching for non-gRPC network errors
2. Replaced util.Retry with custom retry loop
- 3 attempts with exponential backoff (1s, 1.5s, 2.25s)
- Tries all N filers on each attempt (N*3 total attempts max)
- Fast-fails on non-retryable errors (NotFound, PermissionDenied, etc.)
3. Improved logging
- Shows both filer attempt (x/N) and retry attempt (y/3)
- Logs retry reason and wait time for debugging
Benefits:
- Better handling of transient gRPC failures (server restarts, load spikes)
- Faster failure for permanent errors (no wasted retries)
- More informative logs for troubleshooting
- Maintains existing HA failover across multiple filers
Example: If all 3 filers return Unavailable (server overload):
- Attempt 1: try all 3 filers, wait 1s
- Attempt 2: try all 3 filers, wait 1.5s
- Attempt 3: try all 3 filers, fail
Example: If filer returns NotFound (volume doesn't exist):
- Attempt 1: try all 3 filers, fast-fail (no retry)