* remove DbSlice mutex
* add ConditionFlag in SliceSnapshot
* disable compression when big value serialization is on
* add metrics
---------
Signed-off-by: kostas <kostas@dragonflydb.io>
Specifically:
* `INFO REPLICATION` does not list the replicas, but does still show
`connected_slaves`
* `INFO SERVER` does not show `thread_count` and `os`
Fixes #4173
* fix: enforce load limits when loading snapshot
Prevent loading snapshots whose used memory is higher than the max memory limit.
1. Store the used-memory metadata only inside the summary file.
2. Load the summary file before loading anything else, and if the used memory is higher,
abort the load (see the sketch below).
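A minimal sketch of the guard, with illustrative names (`SummaryInfo`, `used_mem_bytes`) rather than Dragonfly's actual types:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical metadata parsed from the "<basename>-summary.dfs" file.
struct SummaryInfo {
  uint64_t used_mem_bytes = 0;  // memory used by the server that produced the snapshot
};

// The summary file is read first, so the load can be aborted before any shard
// file is touched if the snapshot would not fit into the configured limit.
void CheckLoadLimits(const SummaryInfo& summary, uint64_t max_memory_limit) {
  if (summary.used_mem_bytes > max_memory_limit) {
    throw std::runtime_error(
        "snapshot used memory (" + std::to_string(summary.used_mem_bytes) +
        ") exceeds maxmemory limit (" + std::to_string(max_memory_limit) + ")");
  }
}
```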
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: optimize info command
The INFO command has high latency when returning all the sections,
but often only a single section is required. Specifically,
the SERVER and REPLICATION sections are often fetched by clients
or management components.
This PR:
1. Removes any hops for the `INFO SERVER` command (see the sketch after this list).
2. Removes some redundant stats.
3. Prints latency stats around the GetMetrics call if it took too long.
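A rough sketch of the idea behind item 1, with placeholder helpers rather than the actual server code: a SERVER-only request is answered from state local to the connection thread, while other sections still pay for the cross-shard metrics aggregation.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Placeholder types/helpers; the real server aggregates far more state.
struct Metrics {
  uint64_t total_commands = 0;
};

static std::string BuildServerSection() {        // answered from local state, no hops
  return "# Server\r\n";
}

static Metrics AggregateMetricsAcrossShards() {  // the expensive, hop-per-shard path
  return Metrics{};
}

std::string InfoReply(std::string_view section) {
  if (section == "SERVER")
    return BuildServerSection();
  Metrics m = AggregateMetricsAcrossShards();
  return "# Stats\r\ntotal_commands_processed:" + std::to_string(m.total_commands) + "\r\n";
}
```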
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* Update src/server/server_family.cc
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
Signed-off-by: Roman Gershman <romange@gmail.com>
* chore: remove GetMetrics dependency from the REPLICATION section
Also, address comments
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: clang build
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
* chore: change Namespaces to be a global pointer
Before this change, the namespaces object was defined as a global object.
However, it has a non-trivial destructor that is called after main() exits,
and it is quite dangerous to define non-POD objects globally.
For example, if we used LOG(INFO) inside the Clear function, that would crash Dragonfly on exit.
This PR changes it to be a global pointer.
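A minimal illustration of the pattern, using the names from this PR only as placeholders:

```cpp
struct Namespaces {
  ~Namespaces() { /* non-trivial: e.g. logging here would run after main() exited */ }
};

// Problematic: destroyed during static deinitialization, after main() returns,
// when facilities such as the logger may already be torn down.
// Namespaces namespaces;

// Safer: a plain pointer is POD; destruction happens only if and when we choose.
Namespaces* namespaces = nullptr;

int main() {
  namespaces = new Namespaces();
  // ... run the server ...
  // Either delete explicitly while the process is still fully alive,
  // or intentionally leak and let the OS reclaim memory on exit.
  delete namespaces;
}
```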
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Not everything is supported yet, but manual load/save is.
1. Run dragonfly with `--dir gs://bucket/path`
2. In redis-cli:
a. SET foo bar
b. SAVE DF gsdump
c. DFLY LOAD gs://bucket/path/gsdump-summary.dfs
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
The S3 and file expansion logic had some duplicate code.
This PR refactors it before adding GCS support.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat: introduce metrics/logs of when pipelining is being throttled
Fixes #3999, following up on the discussion at #3997.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: get rid of MutableSlice
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: comments
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
This PR introduces the `DEBUG RECVSIZE ENABLE|DISABLE|tid`
command, which allows tracking of request sizes.
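A hypothetical sketch of the kind of per-thread tracking this enables; the real implementation and histogram layout may differ:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Collection is toggled globally (ENABLE/DISABLE); the histogram of a given
// thread id can then be dumped on demand.
class RecvSizeTracker {
 public:
  void Enable(bool on) { enabled_.store(on, std::memory_order_relaxed); }

  void Record(size_t bytes) {
    if (!enabled_.load(std::memory_order_relaxed))
      return;
    size_t bucket = 0;
    while (bytes >>= 1)  // log2 bucketing: 0-1, 2-3, 4-7, ...
      ++bucket;
    ++buckets_[bucket < buckets_.size() ? bucket : buckets_.size() - 1];
  }

 private:
  std::atomic<bool> enabled_{false};
  std::array<uint64_t, 32> buckets_{};
};
```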
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
The problem: we used file writes in non-direct mode when writing snapshots in epoll mode.
As a result, lots of data was cached in OS memory. Then, during the rename of
"xxx.dfs.tmp" into "xxx.dfs", the OS flushes the file caches and the thread
gets stuck in the rename system call for a long time.
The fix: use DIRECT mode and avoid caching the data in OS caches at all.
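A minimal POSIX sketch of the direct-mode idea, not Dragonfly's actual I/O layer; direct I/O requires aligned buffers, offsets and sizes:

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cstdlib>
#include <cstring>

// O_DIRECT bypasses the page cache, so nothing is left for a later rename()
// to flush. The write length is padded to the alignment and trimmed afterwards.
int WriteSnapshotChunk(const char* path, const char* data, size_t len) {
  int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0)
    return -1;

  constexpr size_t kAlign = 4096;
  size_t padded = (len + kAlign - 1) / kAlign * kAlign;

  void* buf = nullptr;
  if (posix_memalign(&buf, kAlign, padded) != 0) {
    close(fd);
    return -1;
  }
  memset(buf, 0, padded);
  memcpy(buf, data, len);

  ssize_t written = write(fd, buf, padded);
  if (written == static_cast<ssize_t>(padded))
    ftruncate(fd, len);  // drop the zero padding at the tail

  free(buf);
  close(fd);
  return written == static_cast<ssize_t>(padded) ? 0 : -1;
}
```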
Fixes #3895
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Use an intrusive queue that allows batching of scheduling calls instead of handling each call separately.
This optimization improves latency and throughput by 3-5%.
In addition, we expose batching statistics in the transaction section of INFO.
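An illustrative sketch of the batching pattern (a lock-free MPSC intrusive list), not the exact queue used in the code:

```cpp
#include <atomic>

struct ScheduleNode {
  ScheduleNode* next = nullptr;
  // ... payload: pointer to the transaction to schedule ...
};

class ScheduleQueue {
 public:
  // Returns true if the queue was empty, i.e. the consumer needs a wakeup.
  bool Push(ScheduleNode* n) {
    ScheduleNode* head = head_.load(std::memory_order_relaxed);
    do {
      n->next = head;
    } while (!head_.compare_exchange_weak(head, n, std::memory_order_release,
                                          std::memory_order_relaxed));
    return head == nullptr;
  }

  // Consumer side: grab everything accumulated since the last drain in one shot.
  // The list comes back in reverse push order; reverse it if FIFO matters.
  ScheduleNode* PopAll() {
    return head_.exchange(nullptr, std::memory_order_acquire);
  }

 private:
  std::atomic<ScheduleNode*> head_{nullptr};
};
```

The producer learns from `Push()` whether a wakeup is needed at all, and the consumer pays one atomic exchange per batch instead of one handoff per scheduling call.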
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Today, some of the failures to load an RDB file passed via
`--dbfilename` cause Dragonfly to terminate with an error code. This is
ok and works as expected.
The problem is that the same code path is used for `DFLY LOAD`, which
means that if there's an error loading the file (such as corrupted
file), Dragonfly will exit instead of returning an error code to the
client.
This change fixes that by exiting only in the code path which loads on
init.
Note to reviewer: apparently we can't call `Future::Get()` more than
once, as the first call resets the state of the future and drops the
previously saved value, so we use a Fiber here instead.
* chore: Forbid replicating a replica
We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.
Fixes #3679
* `replicaof_mu_`
* chore: some renames + fix a typo in RETURN_ON_BAD_STATUS
Renames in transaction.h - no functional changes.
Fix a typo in error.h following #3758
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* fix: improve BreakStalledFlowsInShard heuristic
Before this change, we wrote in a single call whatever record chunks we pulled from the channel.
This can be problematic for 1GB chunks, for example, which might take 10 sec to write.
Lately we added a replication breaker on the master side that breaks the full sync after
a predefined threshold has passed, 10 sec by default. To improve the robustness of this
breaker, we now write chunks of up to 1MB and update last_write_time_ns_ more frequently.
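A minimal sketch of the chunked-write loop, with `Sink` standing in for the replication socket:

```cpp
#include <chrono>
#include <cstdint>
#include <string_view>

// Instead of one write for a potentially huge record, write at most 1MB at a
// time and refresh the watchdog timestamp after every chunk so the stall
// detector sees steady progress.
constexpr size_t kFlushChunkSize = 1u << 20;  // 1MB

inline uint64_t NowNs() {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

template <typename Sink>
void WriteInChunks(Sink& sink, std::string_view blob, uint64_t& last_write_time_ns) {
  while (!blob.empty()) {
    size_t n = blob.size() < kFlushChunkSize ? blob.size() : kFlushChunkSize;
    sink.Write(blob.substr(0, n));
    last_write_time_ns = NowNs();  // the stall-breaker heuristic reads this
    blob.remove_prefix(n);
  }
}
```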
Also, we added more logs to make replication delays on both sides more visible,
including logs for when replication is broken on the master side.
Unfortunately, this did not help make BreakStalledFlowsInShard more robust, because the
problem now moved to the replica side, which may take 20+ seconds to parse huge values.
Therefore, I increased the threshold for breaking the replication to 30s.
Finally, instrument the GetMetrics call, as it sometimes takes more than 1 sec.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
For some cases, this map can grow indefinitely.
This change makes it less detailed by making sure that the number of possible keys is bounded.
It can still provide a good summary of the nature of exec transactions.
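One possible bounding scheme, shown only as an illustration of the idea:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Once the map reaches a fixed capacity, new keys fold into a single catch-all
// bucket, so memory stays bounded while the most common shapes keep their own counters.
class BoundedCounterMap {
 public:
  explicit BoundedCounterMap(size_t max_keys) : max_keys_(max_keys) {
    counts_["other"] = 0;  // the catch-all bucket always exists
  }

  void Inc(const std::string& key) {
    auto it = counts_.find(key);
    if (it != counts_.end()) {
      ++it->second;
    } else if (counts_.size() < max_keys_) {
      counts_[key] = 1;
    } else {
      ++counts_["other"];
    }
  }

 private:
  size_t max_keys_;
  std::unordered_map<std::string, uint64_t> counts_;
};
```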
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat: add slave_repl_offset to the replication section.
In Valkey, slave_repl_offset denotes the replication offset on the replica side during the stable sync phase.
During the full sync phase it appears with a value of 0.
In Dragonfly this field appears only after full sync has completed, thus it allows
checking whether Dragonfly has reached the stable sync phase. The value of this field describes the cumulative progress
of all the replication flows and does not directly correspond to master-side metrics.
In addition, this PR fixes a bug in the wait_available_async() function in our replication tests.
This function is intended to wait until a replica reaches stable state, and it did so by sending PINGs until they stop
responding with a LOADING error; the implicit assumption is that the replica is already in the full sync state.
However, it can happen that master_link_status is "up" but the replica has not reached the full sync state yet, and the PING will succeed
just because wait_available_async() was called before full sync started. The whole approach of polling the state is fragile.
Now we use `slave_repl_offset` explicitly to check whether the replica has reached stable state.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: simplify wait_available_async
* chore: comments
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Stop supporting DflyVersion::VER0, which is more than a year old.
In addition, rename Metrics fields to make them clearer.
General improvements and a fix for the reconnect metric.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat(cluster): Allow appending RDB to existing store
The goal of this PR is to support loading multiple RDB files into a single server, for example when migrating from a Valkey cluster to Dragonfly with a different number of nodes.
It makes the following changes:
* Removes `DEBUG LOAD`, as we already have `DFLY LOAD`
* Adds `APPEND` option to `DFLY LOAD` (i.e. `DFLY LOAD <filename> APPEND`) that loads an RDB without first flushing the data store, overriding existing keys
* Does not load keys belonging to unowned slots, if in cluster mode (see the sketch below)
Fixes #2840
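A sketch of the per-key slot filter behind the last item; the slot math follows the Redis/Valkey cluster spec, while `IsSlotOwned()` is a hypothetical stand-in for the real ownership lookup:

```cpp
#include <cstdint>
#include <string_view>

constexpr uint16_t kTotalSlots = 16384;

inline uint16_t Crc16(std::string_view s) {  // CRC-16/XMODEM, as used for hash slots
  uint16_t crc = 0;
  for (unsigned char c : s) {
    crc ^= static_cast<uint16_t>(c) << 8;
    for (int i = 0; i < 8; ++i)
      crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                           : static_cast<uint16_t>(crc << 1);
  }
  return crc;
}

inline uint16_t KeySlot(std::string_view key) {
  size_t open = key.find('{');  // hashtag rule: hash only {...} if non-empty
  if (open != std::string_view::npos) {
    size_t close = key.find('}', open + 1);
    if (close != std::string_view::npos && close > open + 1)
      key = key.substr(open + 1, close - open - 1);
  }
  return Crc16(key) % kTotalSlots;
}

inline bool IsSlotOwned(uint16_t /*slot*/) {
  return true;  // placeholder; the real code consults the local cluster config
}

inline bool ShouldLoadKey(std::string_view key, bool cluster_mode) {
  return !cluster_mode || IsSlotOwned(KeySlot(key));
}
```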
* chore: reduce pipelining latency by reusing existing shard fibers
To prove the benefits, run `./dfly_bench --pipeline=50 -n 20000 --ratio 0:1 --qps=0 --key_maximum=1`
Before: the average pipelining latency was 10ms.
After: the average pipelining latency is 5ms.
Avg latency: pipelined_latency_usec / total_pipelined_squashed_commands.
Also, improved the counting of squashed commands so that only commands that were actually squashed are counted.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>