Browse Source

improve: clarify Aborted error handling in volume lookups

Added documentation and logging to address the concern that codes.Aborted
might not always be retryable in all contexts.

Context-specific justification for treating Aborted as retryable:

Volume location lookups (LookupVolume RPC) are simple, read-only operations:
  - No transactions
  - No write conflicts
  - No application-level state changes
  - Idempotent (safe to retry)

In this context, Aborted is most likely caused by:
  - Filer restarting/recovering (transient)
  - Connection interrupted mid-request (transient)
  - Server-side resource cleanup (transient)

NOT caused by:
  - Application-level conflicts (no writes)
  - Transaction failures (no transactions)
  - Logical errors (read-only lookup)

Changes:
1. Added detailed comment explaining the context-specific reasoning
2. Added V(1) logging when treating Aborted as retryable
   - Helps detect misclassification if it occurs
   - Visible in verbose logs for troubleshooting
3. Split switch statement for clarity (one case per line)

If future analysis shows Aborted should not be retried, operators will
now have visibility via logs to make that determination. The logging
provides evidence for future tuning decisions.

Alternative approaches considered but not implemented:
  - Removing Aborted entirely (too conservative for read-only ops)
  - Message content inspection (adds complexity, no known patterns yet)
  - Different handling per RPC type (premature optimization)
pull/7518/head
chrislu 2 weeks ago
parent
commit
8ef04a4a84
  1. 24
      weed/wdclient/filer_client.go

24
weed/wdclient/filer_client.go

@ -156,23 +156,35 @@ func (fc *FilerClient) GetLookupFileIdFunction() LookupFileIdFunctionType {
} }
// isRetryableGrpcError checks if a gRPC error is transient and should be retried // isRetryableGrpcError checks if a gRPC error is transient and should be retried
//
// Note on codes.Aborted: While Aborted can indicate application-level conflicts
// (e.g., transaction failures), in the context of volume location lookups (which
// are simple read-only operations with no transactions), Aborted is more likely
// to indicate transient server issues during restart/recovery. We include it here
// for volume lookups but log it for visibility in case misclassification occurs.
func isRetryableGrpcError(err error) bool { func isRetryableGrpcError(err error) bool {
if err == nil { if err == nil {
return false return false
} }
// Check gRPC status code // Check gRPC status code
st, ok := status.FromError(err) st, ok := status.FromError(err)
if ok { if ok {
switch st.Code() { switch st.Code() {
case codes.Unavailable, // Server unavailable (temporary)
codes.DeadlineExceeded, // Request timeout
codes.ResourceExhausted, // Rate limited or overloaded
codes.Aborted: // Operation aborted (might succeed on retry)
case codes.Unavailable: // Server unavailable (temporary)
return true
case codes.DeadlineExceeded: // Request timeout
return true
case codes.ResourceExhausted: // Rate limited or overloaded
return true
case codes.Aborted:
// Aborted during read-only volume lookups is likely transient
// (e.g., filer restarting), but log for visibility
glog.V(1).Infof("Treating Aborted as retryable for volume lookup: %v", err)
return true return true
} }
} }
// Fallback to string matching for non-gRPC errors (e.g., network errors) // Fallback to string matching for non-gRPC errors (e.g., network errors)
errStr := err.Error() errStr := err.Error()
return strings.Contains(errStr, "transport") || return strings.Contains(errStr, "transport") ||

Loading…
Cancel
Save