* filer.sync: add exponential backoff on unexpected EOF during replication
When the source volume server drops connections under high traffic,
filer.sync retries aggressively (every 1-6s), hammering the already
overloaded source. This adds a longer exponential backoff (10s to 2min)
specifically for "unexpected EOF" errors, reducing pressure on the
source while still retrying indefinitely until success.
Also adds more logging throughout the replication path:
- Log source URL and error at V(0) when ReadPart or io.ReadAll fails
- Log content-length and byte counts at V(4) on success
- Log backoff duration in retry messages
Fixes#8542
* filer.sync: extract backoff helper and fix 2-minute cap
- Extract nextEofBackoff() and isEofError() helpers to deduplicate
the backoff logic between fetchAndWrite and uploadManifestChunk
- Fix the cap: previously 80s would double to 160s and pass the
< 2min check uncapped. Now doubles first, then clamps to 2min.
* filer.sync: log source URL instead of empty upload URL on read errors
UploadUrl is not populated until after the reader is consumed, so the
V(0) and V(4) logs were printing an empty string. Add SourceUrl field
to UploadOption and populate it from the HTTP response in fetchAndWrite.
* filer.sync: guard isEofError against nil error
* filer.sync: use errors.Is for EOF detection, fix log wording
- Replace broad substring matching ("read input", "unexpected EOF")
with errors.Is(err, io.ErrUnexpectedEOF) and errors.Is(err, io.EOF)
so only actual EOF errors trigger the longer backoff
- Fix awkward log phrasing: "interrupted replicate" → "interrupted
while replicating"
* filer.sync: remove EOF backoff from uploadManifestChunk
uploadManifestChunk reads from an in-memory bytes.Reader, so any EOF
errors there are from the destination side, not a broken source stream.
The long source-oriented backoff is inappropriate; let RetryUntil
handle destination retries at its normal cadence.
---------
Co-authored-by: Copilot <copilot@github.com>
* Added global http client
* Added Do func for global http client
* Changed the code to use the global http client
* Fix http client in volume uploader
* Fixed pkg name
* Fixed http util funcs
* Fixed http client for bench_filer_upload
* Fixed http client for stress_filer_upload
* Fixed http client for filer_server_handlers_proxy
* Fixed http client for command_fs_merge_volumes
* Fixed http client for command_fs_merge_volumes and command_volume_fsck
* Fixed http client for s3api_server
* Added init global client for main funcs
* Rename global_client to client
* Changed:
- fixed NewHttpClient;
- added CheckIsHttpsClientEnabled func
- updated security.toml in scaffold
* Reduce the visibility of some functions in the util/http/client pkg
* Added the loadSecurityConfig function
* Use util.LoadSecurityConfiguration() in NewHttpClient func
* remove old raft servers if they don't answer to pings for too long
add ping durations as options
rename ping fields
fix some todos
get masters through masterclient
raft remove server from leader
use raft servers to ping them
CheckMastersAlive for hashicorp raft only
* prepare blocking ping
* pass waitForReady as param
* pass waitForReady through all functions
* waitForReady works
* refactor
* remove unneeded params
* rollback unneeded changes
* fix
Running mount outside of the cluster would not need to expose all the volume servers to outside of the cluster. The chunk read and write will go through the filer.