docs(rdb): More work on the rdb snapshotting document.

Also add some debugging commands to control the replication flow.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

@@ -36,22 +36,90 @@ and replay it by distributing entries among `K` replica shards. After all the sn
they will continue with replaying the change log (stable state replication), which is outside the scope
of this document.

## Relaxed point-in-time (TBD)
When DF saves its snapshot file on disk, it maintains snapshot isolation by applying a virtual cut
through all the process shards. Snapshotting may take time, during which DF may process many write requests.
These mutations won't be part of the snapshot, because the cut captures the data up to the point when snapshotting
**started**. This is perfect for backups. I call this variation *conservative snapshotting*.
However, when we perform snapshotting for replication, we would like to produce a snapshot
that includes all the data up to the point in time when the snapshotting **finishes**. I call
this *relaxed snapshotting*. The reason for relaxed snapshotting is to avoid keeping a changelog
of all mutations during the snapshot creation.
As a side comment, we could, in theory, support the same (relaxed)
semantics for file snapshots, but it's not necessary, and it might increase the snapshot sizes.
The snapshotting phase (full-sync) can take a lot of time, and it adds memory pressure on the system.
Keeping the change-log aside during the full-sync phase would only add more pressure.
We achieve relaxed snapshotting by pushing the changes into the replication sockets without saving them aside.
Of course, we still need point-in-time consistency
in order to know when the snapshotting finished and the stable state replication started.
## Conservative and relaxed snapshotting variations
Both algorithms maintain a scanning process (fiber) that iteratively goes over the main dictionary
and serializes its data. Before starting the process, the SnapshotShard captures
the change epoch of its shard (this epoch is increased with each write request).
```cpp
SnapshotShard.epoch = shard.epoch++;
```
For the sake of simplicity, we can assume that each entry in the shard maintains its own version counter.
By capturing the epoch number we establish a cut: all entries with `version <= SnapshotShard.epoch`
have not been serialized yet and were not modified by the concurrent writes.
The DashTable iteration algorithm guarantees convergence and coverage (each entry is visited at least once),
but it does not guarantee that each entry is visited *exactly once*.
Therefore, we use entry versions for two things: 1) to avoid serializing the same entry multiple times,
and 2) to correctly serialize entries that are about to be changed by concurrent writes.
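To make the cut semantics concrete, here is a small worked example; the epoch values are hypothetical and chosen only for illustration:

```cpp
// Suppose shard.epoch == 100 when the snapshot starts.
// The cut is captured as: SnapshotShard.epoch = shard.epoch++;   // cut = 100, shard.epoch -> 101
//
// entry "a", version 95:  95 <= 100, so the scan still has to serialize it.
//                         After serialization its version becomes 101 (cut + 1),
//                         so a repeated visit by the DashTable iteration skips it.
// entry "b" is written concurrently: it gets version shard.epoch++ (>= 101 > 100),
//                         so the cut treats it as already handled by the write path.
```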
Serialization Fiber:
```cpp
for (entry : table) {
  if (entry.version <= cut.epoch) {
    entry.version = cut.epoch + 1;  // mark as serialized so repeated visits skip it
    SendToSerializationSink(entry);
  }
}
```
To allow concurrent writes during the snapshotting phase, we set up a hook that is triggered on each
entry mutation in the table:
OnWriteHook:
```cpp
...
if (entry.version <= cut.version) {
  SendToSerializationSink(entry);  // push the old value before it is overwritten
}
...
entry = new_entry;
entry.version = shard.epoch++;  // guaranteed to become > cut.version
```
Please note that this hook maintains point-in-time semantics for the conservative variation by pushing
the previous value of the entry into the sink before changing it.
However, for the relaxed point-in-time, we do not have to store the old value.
Therefore, we can do the following:
OnWriteHook:
```cpp
if (entry.version <= cut.version) {
  SendToSerializationSink(new_entry);  // do not have to send the old value
} else {
  // Keep sending the changes.
  SendToSerializationSink(IncrementalDiff(entry, new_entry));
}
entry = new_entry;
entry.version = shard.epoch++;
```
The change data is sent along with the rest of the contents, and it requires extending
the existing rdb format to support differential operations (hset, append, etc.).
The Serialization Fiber loop is the same for this variation.
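For illustration only, here is a minimal sketch of what such a differential record could look like on the wire. The opcode names and the `DiffSink` interface are hypothetical and are not part of the existing rdb format:

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical opcodes for differential operations; the real extension may use
// different names and encodings.
enum class DiffOpcode : uint8_t { kHSet, kAppend, kExpire };

// Minimal sink interface, assumed here only to make the sketch self-contained.
struct DiffSink {
  virtual void WriteOpcode(DiffOpcode op) = 0;
  virtual void WriteString(std::string_view s) = 0;
  virtual ~DiffSink() = default;
};

// Emits a single differential HSET record: key, field and the new field value.
inline void SendHSetDiff(DiffSink* sink, std::string_view key,
                         std::string_view field, std::string_view value) {
  sink->WriteOpcode(DiffOpcode::kHSet);
  sink->WriteString(key);
  sink->WriteString(field);
  sink->WriteString(value);
}
```

A replica could then apply each such record as the corresponding partial write on the already-loaded key, which keeps the full-sync stream close in spirit to the change log used during stable state replication.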