* remove DbSlice mutex
* add ConditionFlag in SliceSnapshot
* disable compression when big value serialization is on
* add metrics
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
There are actually a few failures fixed in this PR, only one of which is a test bug:
* `db_slice_->Traverse()` can yield, causing `fiber_cancelled_`'s value to change
* When a migration is cancelled, it may never finish `WaitForInflightToComplete()`, because its `in_flight_bytes_` will never reach the destination due to the cancellation
* `IterateMap()` with numeric keys/values overwrote the key's buffer with the value's buffer (see the sketch below)
Fixes #4207
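For context, a minimal sketch of that last bug class, with hypothetical names (no claim about the real code's structure): when both key and value are numeric and get decoded into the *same* scratch buffer, the key's string view ends up aliasing the value's bytes by the time the callback runs. The fix is one buffer per field:

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string_view>

// Hypothetical reconstruction: numeric fields are decoded to strings
// in a scratch buffer before the callback sees them.
using KvCb = std::function<void(std::string_view key, std::string_view val)>;

static size_t DecodeNumber(int64_t n, char* buf, size_t cap) {
  return snprintf(buf, cap, "%lld", static_cast<long long>(n));
}

void VisitEntry(int64_t key_num, int64_t val_num, const KvCb& cb) {
  char key_buf[32], val_buf[32];  // fix: one scratch buffer per field;
  // decoding both into key_buf would make `key` alias the value's bytes.
  size_t klen = DecodeNumber(key_num, key_buf, sizeof key_buf);
  size_t vlen = DecodeNumber(val_num, val_buf, sizeof val_buf);
  cb({key_buf, klen}, {val_buf, vlen});
}

int main() {
  VisitEntry(42, 777, [](std::string_view k, std::string_view v) {
    printf("%.*s => %.*s\n", int(k.size()), k.data(), int(v.size()), v.data());
  });
}
```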
* feat: Huge values breakdown in cluster migration
Before this PR we used `RESTORE` commands for transferring data between
source and target nodes in cluster slots migration.
While this _works_, it has the side effect of consuming 2x memory for huge
values (e.g. if a single key's value takes 10GB, serializing it will
take 20GB or even 30GB).
With this PR we break down huge keys into multiple commands (`RPUSH`,
`HSET`, etc), respecting the existing `--serialization_max_chunk_size`
flag.
Part of #4100
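As an illustration only (hypothetical helper, not the real serializer), splitting a huge list into bounded `RPUSH` commands looks roughly like this; because each emitted command's payload stays under the chunk budget, peak memory is bounded too:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: split a huge list into several RPUSH commands,
// each carrying at most `max_chunk_bytes` of element payload.
std::vector<std::vector<std::string>> ChunkList(
    const std::string& key, const std::vector<std::string>& elems,
    size_t max_chunk_bytes) {
  std::vector<std::vector<std::string>> cmds;
  std::vector<std::string> cur = {"RPUSH", key};
  size_t cur_bytes = 0;
  for (const auto& e : elems) {
    if (cur.size() > 2 && cur_bytes + e.size() > max_chunk_bytes) {
      cmds.push_back(std::move(cur));  // flush the full chunk
      cur = {"RPUSH", key};
      cur_bytes = 0;
    }
    cur.push_back(e);
    cur_bytes += e.size();
  }
  if (cur.size() > 2) cmds.push_back(std::move(cur));
  return cmds;
}
```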
* fix: improve BreakStalledFlowsInShard heuristic
Before this change, we wrote whatever record chunks we pulled from the channel in a single call.
This can be problematic for 1GB chunks, for example, which might take 10 seconds to write.
Recently we added a replication breaker on the master side that breaks the full sync after
a predefined threshold has passed (10 seconds by default). To improve the robustness of this
breaker, we now write chunks of up to 1MB and update last_write_time_ns_ more frequently.
We also added more logs to make replication delays visible on both sides, including logs
when the master breaks the replication.
Unfortunately, this alone did not make BreakStalledFlowsInShard more robust, because the
problem moved to the replica side, which may take 20+ seconds to parse huge values.
Therefore, I increased the threshold for breaking the replication to 30s.
Finally, we instrumented the GetMetrics call, as it sometimes takes more than one second.
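A minimal sketch of the write loop, with hypothetical names except `last_write_time_ns_` (the real I/O is elided): slicing the outgoing blob keeps each blocking write short and keeps the progress timestamp fresh for the breaker.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string_view>

constexpr size_t kFlushChunk = 1 << 20;  // 1MB slices

struct FlowWriter {
  int64_t last_write_time_ns_ = 0;
  size_t total_written = 0;

  void WriteSome(std::string_view slice) {  // stand-in for a socket write
    total_written += slice.size();
  }

  void WriteChunked(std::string_view blob) {
    while (!blob.empty()) {
      size_t n = blob.size() < kFlushChunk ? blob.size() : kFlushChunk;
      WriteSome(blob.substr(0, n));
      blob.remove_prefix(n);
      last_write_time_ns_ =  // refresh progress so the breaker sees activity
          std::chrono::steady_clock::now().time_since_epoch().count();
    }
  }
};
```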
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: add timeout for replication sockets
The master will stop the replication flow if writes cannot progress for more than K milliseconds.
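A sketch of that check, under the same simplifications as above (names hypothetical; `timeout_ms` plays the role of K):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

struct ReplicaFlow {
  std::atomic<int64_t> last_write_ns{0};  // updated on each write slice
  std::atomic<bool> broken{false};
};

// Periodically invoked watchdog: break the flow if writes made no
// progress for more than `timeout_ms`.
void MaybeBreakFlow(ReplicaFlow* flow, int64_t timeout_ms) {
  int64_t now_ns =
      std::chrono::steady_clock::now().time_since_epoch().count();
  if (now_ns - flow->last_write_ns.load() > timeout_ms * 1'000'000) {
    flow->broken.store(true);  // the replica will have to re-sync
  }
}
```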
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
DashTable::Traverse is error-prone when the passed callback preempts, because the segment might change underneath it. This is problematic: we need atomicity while traversing segments with preemption. The fix is to add Traverse in DbSlice and protect the traversal via ThreadLocalMutex (see the sketch after the change list below).
* add ConditionFlag to DbSlice
* add Traverse in DbSlice and protect it with the ConditionFlag
* remove condition flag from snapshot
* remove condition flag from streamer
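A minimal sketch of the guarded traversal, with types simplified and a plain mutex standing in for Dragonfly's ThreadLocalMutex:

```cpp
#include <cstdint>
#include <functional>
#include <mutex>

struct DbSliceSketch {
  // Stand-in for ThreadLocalMutex (a fiber-aware mutex in the real code).
  std::mutex traverse_mu;

  // All preemptible traversals serialize here, so a callback that yields
  // cannot interleave with another traversal mutating the same segment.
  uint64_t Traverse(uint64_t cursor, const std::function<void()>& bucket_cb) {
    std::unique_lock lk(traverse_mu);  // held across potential preemption
    // cursor = dash_table_.Traverse(cursor, bucket_cb);  // real call, elided
    bucket_cb();
    return cursor;
  }
};
```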
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
Before this change it was possible to issue several concurrent AsyncWrite requests.
But these are not atomic, which leads to replication stream corruption.
Now we wait for the previous request to finish before sending the next one.
ThrottleIfNeeded now also takes the pending buffer size into account when throttling.
Fixes #3329
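A sketch of the serialization logic, with fiber primitives and I/O elided (all names hypothetical apart from the spirit of `ThrottleIfNeeded`):

```cpp
#include <cstddef>
#include <string>

class StreamerSketch {
 public:
  void AsyncWrite(std::string buf) {
    WaitForInflight();  // never two socket writes in flight at once
    in_flight_bytes_ = buf.size();
    // io_.AsyncWrite(std::move(buf), [this] { in_flight_bytes_ = 0; });
  }

  // Throttling now also counts bytes that are queued but not yet sent.
  void ThrottleIfNeeded() {
    if (pending_bytes_ + in_flight_bytes_ > kLimit) WaitForInflight();
  }

 private:
  void WaitForInflight() { /* fiber-blocking wait for in_flight_bytes_==0 */ }

  size_t in_flight_bytes_ = 0;
  size_t pending_bytes_ = 0;
  static constexpr size_t kLimit = 64 << 20;
};
```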
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
**The Bug**
Before this fix, source nodes would send `FIN` entries to target nodes
(in all thread flows), and would then send a `DFLYMIGRATE ACK` command
to verify that all flows received the `FIN` in time.
If they didn't, the source node would retry this logic in a loop, until
successful.
The problem is that, in some rare cases, one or more of the flows would
indeed be in a `FIN` state, _but of a previous `FIN` that is already
outdated_. If that's indeed the case, all data between that `FIN` and
the next `FIN`(s) will be lost.
**The Fix**
We already have an attempt id that we send in the `DFLYMIGRATE ACK`
command, and return it in the response. This fix utilizes the same
attempt id to be sent to all flows, and then when joined, we make sure
we join on the correct (latest) attempt id.
Unfortunately, we can't use the `FIN` opcode for this, because the protocol
does not carry any additional metadata for that opcode. I chose to use LSN
because it has exactly the fields that we need, and one could possibly
think of Log Sequence Number as an attempt id, but I could change that
if it's unclear or too hacky.
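A sketch of the resulting join condition, heavily simplified (structures hypothetical): completion is only accepted when every flow saw the final record of the *latest* attempt.

```cpp
#include <cstdint>
#include <vector>

struct FlowState {
  int64_t last_attempt = -1;  // attempt id carried by the final LSN record
};

// A stale FIN from a previous attempt leaves last_attempt behind, so it
// can no longer be mistaken for completion of the current attempt.
bool AllFlowsFinished(const std::vector<FlowState>& flows,
                      int64_t current_attempt) {
  for (const auto& f : flows) {
    if (f.last_attempt != current_attempt) return false;
  }
  return true;
}
```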
**Testing**
To reproduce this, one needs to lower
`--slot_migration_connection_timeout_ms` significantly, say to 500ms.
Without the fix, this would fail on my laptop roughly every other run.
With this fix, it runs hundreds of times and never reproduces.
* fix(cluster): Support `FLUSHALL` while slot migration is in progress
Fixes#3132
Also do a small refactor to move cancellation logic into
`RestoreStreamer`.
* chore: Streamer is rewritten with async interface
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Send the journal LSN to the replica and compare it against the number of records received on the replica side.
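A minimal sketch of the replica-side comparison, with hypothetical names:

```cpp
#include <cstdint>
#include <cstdio>

struct FlowCheck {
  uint64_t master_lsn = 0;        // LSN the master says it has journaled
  uint64_t records_received = 0;  // journal records this flow has seen
};

// If the counts diverge, records were lost or duplicated on the wire.
void VerifyFlow(const FlowCheck& f) {
  if (f.records_received != f.master_lsn) {
    std::fprintf(stderr, "journal gap: lsn=%llu received=%llu\n",
                 (unsigned long long)f.master_lsn,
                 (unsigned long long)f.records_received);
  }
}
```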
Signed-off-by: kostas <kostas@dragonflydb.io>
Co-authored-by: adi_holden <adi@dragonflydb.io>
* feat(cluster): add tx execution in cluster_shard_migration
refactor(replication): move code that is common for cluster and
replica into a separate file, add full-sync-cut cmd
* feat(cluster): Add `RestoreStreamer`.
`RestoreStreamer`, like `JournalStreamer`, streams journal changes to a
sink. However, in addition, it traverses the DB like `RdbSerializer` and
sends existing entries as `RESTORE` commands.
Adding it required a bit of plumbing to get all journal changes to be
slot-aware.
In a follow-up PR I will remove the now-unneeded `SerializerBase`. (A rough sketch of the streamer's shape follows the change list below.)
* Fix build
* Fix bug
* Remove unimplemented function
* Iterate DB, drop support for db1+
* Send FULL-SYNC-CUT
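A rough sketch of the streamer's shape as described above, with all real types elided (the traversal, slot filter, and command building are stand-ins, not the actual API):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

struct RestoreStreamerSketch {
  std::function<void(std::string)> sink;  // serialized commands land here

  // Live journal changes are forwarded, like JournalStreamer does.
  void OnJournalChange(std::string cmd) { sink(std::move(cmd)); }

  // Snapshot phase: walk the table, emit existing entries as RESTORE.
  void Run() {
    uint64_t cursor = 0;
    do {
      // cursor = table_.Traverse(cursor, [&](const Entry& e) {
      //   if (InMigratedSlots(e.key))        // journal/data are slot-aware
      //     sink(BuildRestoreCmd(e));        // RESTORE key ttl <blob>
      // });
    } while (cursor != 0);
    sink("FULL-SYNC-CUT");  // marks the end of the snapshot phase
  }
};
```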
* feat(replication): Use a ring buffer with messages to serve replication (a minimal sketch follows the change list below).
* Fix libraries dep graph
* Address PR feedback
* nits
* add a comment
* Lower the default log length
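A minimal sketch of such a ring buffer, assuming (not verified against the real code) that entries are indexed by LSN and that overwritten entries force a full re-sync:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class JournalRing {
 public:
  explicit JournalRing(size_t cap) : ring_(cap) {}

  uint64_t Append(std::string msg) {
    ring_[next_lsn_ % ring_.size()] = std::move(msg);
    return next_lsn_++;
  }

  std::optional<std::string> Get(uint64_t lsn) const {
    if (lsn >= next_lsn_ || next_lsn_ - lsn > ring_.size())
      return std::nullopt;  // already overwritten: replica must re-sync
    return ring_[lsn % ring_.size()];
  }

 private:
  std::vector<std::string> ring_;
  uint64_t next_lsn_ = 0;
};
```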
* feat: Use journal LSNs for absolute replication offsets
* 1 - Address small CR comments
2 - Simplify the offset accounting so that we send the correct offset
in `SliceSnapshot::Stop` instead of counting in RdbLoader. This
allows us to revert the changes to slice journaling of EXEC
commands, for example.
* Store int with absl::little_endian (see the example below)
* Document the offset management
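For the `absl::little_endian` item, a small self-contained example (note that `absl/base/internal/endian.h` is technically an internal Abseil header): storing the offset with a fixed byte order makes the on-wire bytes identical regardless of host endianness.

```cpp
#include <cstdint>
#include <cstdio>

#include "absl/base/internal/endian.h"

int main() {
  uint64_t lsn = 123456789;
  char buf[8];
  absl::little_endian::Store64(buf, lsn);            // serialize
  uint64_t back = absl::little_endian::Load64(buf);  // deserialize
  printf("%llu\n", (unsigned long long)back);
}
```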
Remove mentions of Boost.Fibers and fibers_ext.
Done in preparation for switching to the helio-native fb2 implementation.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
This change removes most mentions of boost::fibers and util::fibers_ext.
Instead it introduces a "core/fibers.h" file that incorporates most of
the primitives under the dfly namespace. This is done in preparation for
switching from Boost.Fibers to helio-native fibers.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>