The condition was inverted: it cached lookups that returned errors
instead of successful lookups. As a result, every replicated write
made a gRPC call to the master to look up the volume location,
adding roughly one second of latency to writeToReplicas.
The bug particularly affected TTL volumes because:
- More unique volumes are created (separate pools per TTL)
- Volumes expire and get recreated frequently
- Each new volume requires a fresh lookup (cache miss)
- Higher volume churn = more cache misses = more master lookups
With this fix, successful lookups are cached for 10 minutes,
reducing replication latency from ~1s to ~10ms for cached volumes.
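For illustration, here is a minimal sketch of the inverted condition and the fix. The cache and lookup helpers below (locationCache, lookupVolume, cacheLookup) are assumptions for the sake of a runnable example, not the actual SeaweedFS code:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative cache; the real code uses its own volume-id cache type.
type locationCache struct {
	entries map[string]cacheEntry
}

type cacheEntry struct {
	locations []string
	expiresAt time.Time
}

func (c *locationCache) Set(vid string, locations []string, ttl time.Duration) {
	c.entries[vid] = cacheEntry{locations: locations, expiresAt: time.Now().Add(ttl)}
}

// lookupVolume stands in for the gRPC lookup against the master.
func lookupVolume(vid string) ([]string, error) {
	return []string{"127.0.0.1:8080"}, nil
}

func cacheLookup(c *locationCache, vid string) {
	locations, err := lookupVolume(vid)

	// Buggy version: the error case was cached, so successful lookups
	// were never cached and every replicated write hit the master.
	//   if err != nil { c.Set(vid, locations, 10*time.Minute) }

	// Fixed version: cache only successful lookups, for 10 minutes.
	if err == nil {
		c.Set(vid, locations, 10*time.Minute)
	}
}

func main() {
	c := &locationCache{entries: map[string]cacheEntry{}}
	cacheLookup(c, "3")
	fmt.Println(c.entries["3"].locations)
}
```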
* Added a context parameter to the MasterClient's methods to avoid endless loops
* Restored the WithClient function; added a WithClientCustomGetMaster function
* Hid unused ctx arguments
* Used a common context for the KeepConnectedToMaster and WaitUntilConnected functions (see the sketch after this list)
* Changed the context termination check in the tryConnectToMaster function
* Added a child context to the tryConnectToMaster function
* Added a common context for the KeepConnectedToMaster and WaitUntilConnected functions in the benchmark
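The sketch below shows the idea of sharing one context between the connect loop and the waiter. The masterClient type and the method bodies are stand-ins for wdclient.MasterClient, and the signatures are assumptions about the updated API, not the exact SeaweedFS code:

```go
package main

import (
	"context"
	"time"
)

// Stand-in for the master client; only the context handling is the point here.
type masterClient struct{}

func (mc *masterClient) KeepConnectedToMaster(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return // cancelling the shared context ends the reconnect loop
		case <-time.After(time.Second):
			// try to (re)connect to the master here
		}
	}
}

func (mc *masterClient) WaitUntilConnected(ctx context.Context) {
	select {
	case <-ctx.Done():
	case <-time.After(100 * time.Millisecond):
		// the real client would block here until a master connection exists
	}
}

func main() {
	// One context shared by both calls, so a single cancel stops the
	// background reconnect loop and unblocks any waiter.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	mc := &masterClient{}
	go mc.KeepConnectedToMaster(ctx)
	mc.WaitUntilConnected(ctx)
}
```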
Originally there were only url (ip + port) and publicUrl. Because the
ip was used to listen for the http service, it had less flexibility,
and the volume server had to be accessed via publicUrl.
Recently we added ip.bind for binding the http service.
With this change, url can be used to connect to volume servers, and
publicUrl becomes a free-form piece of url information; it does not
even need to be unique.
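A small sketch of the resulting split, assuming an illustrative Location struct and addressFor helper (not the exact SeaweedFS types): url is what servers use to reach each other, publicUrl is whatever external clients should be handed.

```go
package main

import "fmt"

// Illustrative only: mirrors the idea of a volume server location.
type Location struct {
	Url       string // ip + port, used to connect to the volume server
	PublicUrl string // free-form, returned to outside clients
}

// addressFor picks url for internal traffic (e.g. replication) and
// publicUrl for responses handed back to external clients.
func addressFor(loc Location, internal bool) string {
	if internal {
		return loc.Url
	}
	return loc.PublicUrl
}

func main() {
	loc := Location{Url: "10.0.0.5:8080", PublicUrl: "files.example.com"}
	fmt.Println(addressFor(loc, true))  // 10.0.0.5:8080
	fmt.Println(addressFor(loc, false)) // files.example.com
}
```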
Walk had to be added to NeedleMap and CompactMap so that WalkKeys and WalkValues could be added to volume. These are needed to iterate through all the needles stored in a volume, which was dump's purpose.
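A minimal sketch of the callback-based iteration this enables. The NeedleValue fields, the compactMap type, and the WalkKeys/WalkValues signatures below are assumptions for illustration, not the exact SeaweedFS definitions:

```go
package main

import "fmt"

// Illustrative needle entry; the real type carries key, offset, and size.
type NeedleValue struct {
	Key    uint64
	Offset uint32
	Size   uint32
}

// compactMap stands in for CompactMap; Walk visits every stored entry.
type compactMap struct {
	entries []NeedleValue
}

func (m *compactMap) Walk(fn func(NeedleValue) error) error {
	for _, nv := range m.entries {
		if err := fn(nv); err != nil {
			return err
		}
	}
	return nil
}

// volume wraps the map; WalkKeys and WalkValues build on the map's Walk,
// which is what lets a dump-style tool visit every needle in a volume.
type volume struct {
	nm *compactMap
}

func (v *volume) WalkKeys(fn func(key uint64) error) error {
	return v.nm.Walk(func(nv NeedleValue) error { return fn(nv.Key) })
}

func (v *volume) WalkValues(fn func(nv NeedleValue) error) error {
	return v.nm.Walk(fn)
}

func main() {
	v := &volume{nm: &compactMap{entries: []NeedleValue{{Key: 1, Size: 42}}}}
	_ = v.WalkKeys(func(key uint64) error {
		fmt.Println("key:", key)
		return nil
	})
}
```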