fix: make snapshotting process more responsive (#3759)

* fix: improve BreakStalledFlowsInShard heuristic

Before this change, we wrote whatever record chunk we pulled from the channel in a single call.
This is problematic for large chunks: a 1GB chunk, for example, might take 10 seconds to write.

Recently we added a replication breaker on the master side that breaks the full sync after
a predefined threshold has passed (10 seconds by default). To make this breaker more robust,
we now write chunks of up to 1MB and update last_write_time_ns_ more frequently.
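
A minimal sketch of the chunked write described above, not the actual Dragonfly code. The names
kMaxChunkSize, WriteFn, NowNs and WriteInChunks are hypothetical; only the 1MB limit and the idea
of refreshing the last-write timestamp after every piece come from this commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical names for illustration; not taken from the Dragonfly sources.
constexpr size_t kMaxChunkSize = 1 << 20;  // 1MB upper bound per write call

using WriteFn = std::function<void(std::string_view)>;

// Monotonic clock reading in nanoseconds.
inline uint64_t NowNs() {
  using namespace std::chrono;
  return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
}

// Writes `record` in pieces of at most `max_chunk` bytes and refreshes
// `last_write_time_ns` after every piece, so a stall detector sees progress
// even while a very large record is still being written.
// Returns the number of pieces written.
inline size_t WriteInChunks(std::string_view record, const WriteFn& write,
                            uint64_t& last_write_time_ns,
                            size_t max_chunk = kMaxChunkSize) {
  assert(max_chunk > 0);
  size_t pieces = 0;
  while (!record.empty()) {
    size_t len = std::min(record.size(), max_chunk);
    write(record.substr(0, len));
    record.remove_prefix(len);
    last_write_time_ns = NowNs();
    ++pieces;
  }
  return pieces;
}
```

With a 1GB record and the default limit this issues 1024 writes instead of one, so the timestamp
is refreshed after every 1MB rather than once at the end.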

Also, we added more logs to make replication delays on both sides more visible,
including logs for when the master side breaks the replication.

Unfortunately, this did not make BreakStalledFlowsInShard more robust, because the
problem moved to the replica side, which may take 20+ seconds to parse huge values.
Therefore, I increased the threshold for breaking the replication to 30s.
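
The effect of the new threshold can be sketched as a simple staleness check. This is an
illustration under assumptions, not the real BreakStalledFlowsInShard logic: IsFlowStalled and
kNsPerMs are hypothetical names, and the only facts taken from this commit are the millisecond
timeout (10000 before, 30000 after) and the nanosecond last-write timestamp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration; not the actual Dragonfly heuristic.
constexpr uint64_t kNsPerMs = 1'000'000;

// Returns true when no write progress was recorded for longer than timeout_ms.
inline bool IsFlowStalled(uint64_t now_ns, uint64_t last_write_time_ns,
                          uint32_t timeout_ms) {
  if (now_ns <= last_write_time_ns)
    return false;  // progress was just recorded, or the clock did not advance
  return now_ns - last_write_time_ns > uint64_t{timeout_ms} * kNsPerMs;
}
```

Under this check, a replica that spends 20 seconds parsing a huge value would trip a 10000ms
timeout but not a 30000ms one.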

Finally, instrument the GetMetrics call, as it sometimes takes more than 1 second.
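
One way to instrument a slow call is to time it against a threshold. The helper below is a
hypothetical sketch (TookLongerThan is not a Dragonfly function) and says nothing about how
GetMetrics is actually instrumented in this commit.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper for illustration: runs `fn` and reports whether it
// took longer than `threshold`, measured with a monotonic clock.
template <typename Fn>
bool TookLongerThan(Fn&& fn, std::chrono::nanoseconds threshold) {
  auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  auto elapsed = std::chrono::steady_clock::now() - start;
  return elapsed > threshold;
}
```

A caller could log a warning whenever this returns true for a threshold of one second.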

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Roman Gershman 2024-09-22 17:05:28 +03:00 committed by GitHub
parent 2e9b133ea0
commit f1f8ee17dc
6 changed files with 56 additions and 13 deletions


@@ -13,7 +13,7 @@
 using namespace facade;
-ABSL_FLAG(uint32_t, replication_timeout, 10000,
+ABSL_FLAG(uint32_t, replication_timeout, 30000,
           "Time in milliseconds to wait for the replication writes being stuck.");
 ABSL_FLAG(uint32_t, replication_stream_output_limit, 64_KB,