Commit graph

427 commits

Author SHA1 Message Date
Kostas Kyrimis
267d5ab370
chore: remove DbSlice mutex and add ConditionFlag in SliceSnapshot (#4073)
* remove DbSlice mutex
* add ConditionFlag in SliceSnapshot
* disable compression when big value serialization is on
* add metrics

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-12-05 13:24:23 +02:00
Shahar Mike
95f2320825
chore: Hide managed service info in INFO (#4248)
Specifically:
* `INFO REPLICATION` does not list the replicas, but does still show
  `connected_slaves`
* `INFO SERVER` does not show `thread_count` and `os`

Fixes #4173
2024-12-03 16:09:13 +02:00
Roman Gershman
010bd8add4
chore: change the interface of stream and server commands (#4219) 2024-11-28 18:44:01 +02:00
Borys
43c83d29fa
feat: cluster migrations restarts immediately if timeout happens (#4081)
* feat: cluster migrations restarts immediately if timeout happens

* feat: add DEBUG MIGRATION PAUSE command
2024-11-25 16:02:22 +02:00
Roman Gershman
0e7ae34fe4
fix: enforce load limits when loading snapshot (#4136)
* fix: enforce load limits when loading snapshot

Prevent loading snapshots with used memory higher than max memory limit.

1. Store the used memory metadata only inside the summary file
2. Load the summary file before loading anything else, and if the used-memory is higher,
   abort the load.
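
The two steps above can be sketched roughly as follows. This is an illustration only, not Dragonfly's code: `SnapshotSummary`, its `used_memory` field, and `check_load_allowed` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SnapshotSummary:
    """Hypothetical stand-in for the metadata stored in the summary file."""
    used_memory: int  # bytes used by the dataset when the snapshot was taken


def check_load_allowed(summary: SnapshotSummary, max_memory: int) -> bool:
    """Return True if the load may proceed, False if it should be aborted.

    The summary is inspected before any other snapshot file is read, so a
    snapshot that cannot fit is rejected before any data is loaded.
    """
    return summary.used_memory <= max_memory
```

A caller would read the summary first and stop as soon as `check_load_allowed` returns `False`.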
---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-11-20 06:12:47 +02:00
Borys
4e7800f94f
fix: UB during cmd squashing reply size calculation (#4149)
* fix: UB during cmd squashing reply size calculation

* feat: add prometheus metric commands_squashing_replies_bytes
2024-11-19 13:40:30 +02:00
Borys
e16ef838e4
feat: add INFO memory section for squashing replies memory consuming (#4147)
* feat: add INFO memory section for squashing replies memory consuming

* refactor: address comments
2024-11-18 21:16:41 +02:00
Roman Gershman
8bd2b9ed3e
chore: optimize info command (#4137)
* chore: optimize info command

    Info command has a large latency when returning all the sections.
    But often a single section is required. Specifically,
    SERVER and REPLICATION sections are often fetched by clients
    or management components.

    This PR:
    1. Removes any hops for "INFO SERVER" command.
    2. Removes some redundant stats.
    3. Prints latency stats around GetMetrics command if it took too long.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* Update src/server/server_family.cc

Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
Signed-off-by: Roman Gershman <romange@gmail.com>

* chore: remove GetMetrics dependency from the REPLICATION section

Also, address comments

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* fix: clang build

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>
Co-authored-by: Shahar Mike <chakaz@users.noreply.github.com>
2024-11-17 13:33:29 +02:00
Roman Gershman
be96e6cf99
chore: change Namespaces to be a global pointer (#4032)
* chore: change Namespaces to be a global pointer

Previously, the namespaces object was defined globally.
However, it has a non-trivial d'tor that is called after main exits.
It's quite dangerous to define non-POD objects globally.
For example, if we used LOG(INFO) inside the Clear function, that would crash dragonfly on exit.

This PR changes it to be a global pointer.

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-11-10 10:45:53 +00:00
Vladislav
eadce55b67
chore: remove old io (#3953)
* chore: Remove old IO

* fix: fix last error accounting
* chore: use unique_ptr<char> in MGetResponse storage

---------

Signed-off-by: Vladislav Oleshko <vlad@dragonflydb.io>
2024-11-10 11:56:41 +02:00
adiholden
2d49a28c15
fix(server): handle running script load inside multi (#4074)
Signed-off-by: adi_holden <adi@dragonflydb.io>
2024-11-10 09:34:40 +02:00
Roman Gershman
7df8c268d8
chore: eliminate redundant ConnectionContext arguments (#4065) 2024-11-05 10:40:04 +02:00
Roman Gershman
d10e76b408
chore: support load/save from GCS (#4006)
Not everything is supported, but manual load/save is supported.

1. Run dragonfly with `--dir gs://bucket/path`
2. In redis-cli:
   a. SET foo bar
   b. SAVE DF gsdump
   c. DFLY LOAD gs://bucket/path/gsdump-summary.dfs

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-30 13:57:58 +02:00
Roman Gershman
6f6897cef1
chore: pass RedisReplyBuilder explicitly from dragonfly connection (#4009)
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-29 14:52:09 +02:00
Roman Gershman
92be74f4e4
fix: build break in search_family (#4008)
Also perform additional clean-up of cntx->reply_builder() - Part11

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-28 17:40:27 +02:00
Roman Gershman
1bdd56c973
chore: pass SinkReplyBuilder and Transaction explicitly. Part10 (#3998) 2024-10-28 16:18:52 +02:00
Roman Gershman
c2710604de
chore: refactor snapshot expanding logic (#4003)
S3 and file expansion logic had some duplicate code.
This PR refactors it before adding GCS support.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-28 10:08:10 +02:00
Roman Gershman
b0d52c69ba
feat: introduce metrics/logs of when pipelining is being throttled (#4000)
* feat: introduce metrics/logs of when pipelining is being throttled

Fixes #3999 following up on discussion at #3997.
---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-28 09:15:04 +02:00
Roman Gershman
7035606b4b
chore: pass SinkReplyBuilder and Transaction explicitly. Part6 (#3987) 2024-10-24 18:47:18 +03:00
Roman Gershman
132ffe0920
chore: reduce dependency of debug/memory commands on ConnectionContext (#3977)
chore: reduce dependency of debug/dfly/memory commands on ConnectionContext
2024-10-24 10:24:18 +03:00
Roman Gershman
4aa0ca1ef7
chore: get rid of MutableSlice (#3952)
* chore: get rid of MutableSlice

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* chore: comments

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-23 21:50:39 +03:00
Roman Gershman
f0c30a6d59
feat: track request sizes histograms (#3951)
This PR introduces "DEBUG RECVSIZE ENABLE|DISABLE|tid"
command that allows tracking of request sizes.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-20 19:54:34 +03:00
Roman Gershman
14220a6a20
chore: get rid of ToUpper/ToLower mutations on arguments (#3950)
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-18 23:23:59 +03:00
Roman Gershman
5ab32b97d9
chore(refactoring): header clean ups (#3943)
Move privately used header code to cc files. Remove redundant includes.
No functional changes.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-18 12:47:26 +03:00
adiholden
a1830e1b5e
feat(server): use listpack node encoding for list (#3914)
Signed-off-by: adi_holden <adi@dragonflydb.io>
2024-10-15 13:55:26 +03:00
Roman Gershman
4012ad1855
fix: prevents Dragonfly from blocking in epoll during snapshotting (#3911)
The problem - we used file write in non-direct mode when writing snapshots in epoll mode.
As a result - lots of data was cached into OS memory. But then during the rename operation,
when we rename "xxx.dfs.tmp" into "xxx.dfs", the OS flushes the file caches and the thread
is stuck in the rename system call for a long time.

The fix - to use DIRECT mode and to avoid caching the data into OS caches at all.
Fixes #3895

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-12 18:26:12 +03:00
Roman Gershman
5d2c308c99
chore: schedule chains (#3819)
Use intrusive queue that allows batching of scheduling calls instead of handling each call separately.
This optimization improves latency and throughput by 3-5%.
In addition, we expose batching statistics in the info transaction block.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-11 22:41:31 +03:00
Shahar Mike
50a7f2bcb1
fix: Do not kill Dragonfly on failed DFLY LOAD (#3892)
Today, some of the failures to load an RDB file passed via
`--dbfilename` cause Dragonfly to terminate with an error code. This is
ok and works as expected.

The problem is that the same code path is used for `DFLY LOAD`, which
means that if there's an error loading the file (such as corrupted
file), Dragonfly will exit instead of returning an error code to the
client.

This change fixes that, by only exiting in the code path which loads on
init.
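
A minimal sketch of the idea, assuming a shared loader that reports failures instead of exiting. All names here (`LoadError`, `load_snapshot`, `load_on_init`, `handle_load_command`) are hypothetical and do not mirror the actual Dragonfly code.

```python
import sys


class LoadError(Exception):
    """Raised by the shared load path instead of terminating the process."""


def load_snapshot(path, reader):
    """Shared code path: load the file and raise LoadError on failure."""
    try:
        return reader(path)
    except (OSError, ValueError) as exc:
        raise LoadError(f"failed to load {path}: {exc}") from exc


def load_on_init(path, reader):
    """Startup path: a failed load is fatal, so exit with an error code."""
    try:
        return load_snapshot(path, reader)
    except LoadError as exc:
        print(exc, file=sys.stderr)
        sys.exit(1)


def handle_load_command(path, reader):
    """Command path: report the failure to the client and keep running."""
    try:
        load_snapshot(path, reader)
        return "OK"
    except LoadError as exc:
        return f"ERR {exc}"
```

Only `load_on_init` decides to exit; the command handler turns the same failure into an error reply.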

Note to reviewer: apparently we can't call `Future::Get()` more than
once, as the first call resets the state of the future and drops the
previously saved value, so we use a Fiber here instead.
2024-10-08 14:47:31 +03:00
Kostas Kyrimis
a5d34adc4c
chore: remove goto statements (#3791)
* replace goto statements with lambda calls

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-09-25 16:08:31 +03:00
Shahar Mike
526bce4222
chore: Forbid replicating a replica (#3779)
* chore: Forbid replicating a replica

We do not support connecting a replica to a replica, but before this PR
we allowed doing so. This PR disables that behavior.

Fixes #3679

* `replicaof_mu_`
2024-09-24 13:42:22 +00:00
Roman Gershman
b7b4cabacc
chore: some renames + fix a typo in RETURN_ON_BAD_STATUS (#3763)
* chore: some renames + fix a typo in RETURN_ON_BAD_STATUS

Renames in transaction.h - no functional changes.
Fix a typo in error.h following  #3758
---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-09-23 13:16:50 +03:00
Roman Gershman
f1f8ee17dc
fix: make snapshotting process more responsive (#3759)
* fix: improve BreakStalledFlowsInShard heuristic

Before this change - we wrote in a single call whatever record chunks we pulled from the channel.
This can be problematic for 1GB chunks for example, which might take 10sec to write.

Lately we added a replication breaker on the master side that breaks the full sync after
a predefined threshold has passed. By default it was 10sec. To improve the robustness of this
breaker, we now write chunks of up to 1MB and update last_write_time_ns_ more frequently.
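
The chunked write can be illustrated with a short sketch. It is not the actual implementation: `write_in_chunks`, `sink`, and `on_progress` are hypothetical names, and the callback stands in for updating a field such as last_write_time_ns_.

```python
import time

CHUNK_SIZE = 1 << 20  # 1MB, the upper bound on a single write


def write_in_chunks(sink, data: bytes, on_progress, chunk_size: int = CHUNK_SIZE) -> int:
    """Write data to sink in pieces of at most chunk_size bytes.

    on_progress is called with a monotonic timestamp (nanoseconds) after each
    piece, so a watchdog sees progress even while a huge record is written.
    Returns the number of pieces written.
    """
    view = memoryview(data)
    pieces = 0
    for start in range(0, len(view), chunk_size):
        sink.write(view[start:start + chunk_size])
        on_progress(time.monotonic_ns())
        pieces += 1
    return pieces
```

With a single large write the timestamp would be refreshed once, at the end; with bounded pieces it is refreshed after every 1MB at most.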

Also, we added more logs to make replication delays on both sides more visible.
We also added logs of breaking the replication on the master side.

Unfortunately, this did not help make BreakStalledFlowsInShard more robust because now the
problem moved to the replica side, which may take 20+ seconds to parse huge values.
Therefore, I increased the threshold for breaking the replication to 30s.

Finally, instrument the GetMetrics call, as it sometimes takes more than 1 sec.

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-09-22 17:05:28 +03:00
adiholden
4d38271efa
feat(server): introduce rss oom limit (#3702)
* introduce rss denyoom limit

Signed-off-by: adi_holden <adi@dragonflydb.io>
2024-09-22 13:28:24 +03:00
Andy Dunstall
b9ff6934e8
fix: fix s3 load snapshot (#3717) 2024-09-17 07:17:24 +01:00
Roman Gershman
bdc578acef
chore: limit number of descriptors in the exec map (#3688)
For some cases, this map can grow indefinitely.
This change makes it less detailed by making sure that the number of possible keys is bounded.
It can still provide a good summary of the nature of exec transactions.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-09-10 07:50:30 +00:00
Shahar Mike
b10a4a5348
feat(server): Support CLIENT SETINFO (#3673)
Add support for `CLIENT SETINFO <LIB-NAME | LIB-VER>` and also return
that as part of `CLIENT LIST`, like Valkey.

Fixes #3137
2024-09-09 11:03:05 +03:00
Shahar Mike
1306a91bda
chore: Add CLIENT ID command (#3672)
We already adhere to all requirements, we just need to return the id
when the command is issued :)

Fixes #3651
2024-09-08 22:00:53 +03:00
Borys
a1e9ee1b6d
CmdArgParser improvement (#3633)
* feat: add processing of tail args into CmdArgParser::Check
* refactor: rename CmdArgParser::Switch to Map
* feat: add CheckMap method into CmdArgParser
2024-09-05 15:30:54 +03:00
Borys
d40e9088ae
refactor: remove extra code from CmdArgParser (#3619)
* refactor: remove extra code from CmdArgParser
2024-09-03 07:04:05 +00:00
Roman Gershman
dd0effac6f
feat: add slave_repl_offset to the replication section. (#3596)
* feat: add slave_repl_offset to the replication section.

In Valkey slave_repl_offset denotes the replication offset on the replica side during stable sync phase.
During fullsync phase it appears with 0 value.

In Dragonfly this field appears only after full sync has completed, thus it allows
to check whether Dragonfly reached stable sync phase. The value of this field describes the cumulative progress
of all the replication flows and it does not directly correspond to master side metrics.

In addition, this PR fixes the bug in wait_available_async() function in our replication tests.
This function is intended to wait until a replica reaches stable state, and it did so by sending pings until they no longer
respond with a LOADING error, hence the assumption is that the replica is in full sync state already.

However it can happen that master_link_status is "up" but replica has not reached full sync state, and the PING will succeed
just because wait_available_async() was called before full sync started. The whole approach of polling the state is fragile.

Now we use `slave_repl_offset` explicitly to see if the replica reaches stable state.
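
As a rough sketch of such a check, assuming the usual `key:value` line layout of INFO output: the helpers `parse_info` and `reached_stable_sync` are hypothetical and are not the test suite's actual code.

```python
def parse_info(text: str) -> dict:
    """Parse INFO-style output ("key:value" lines, "#" section headers)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def reached_stable_sync(info_replication: str) -> bool:
    """True once the slave_repl_offset field appears in the replication section.

    Relies on the behavior described above: the field is only present after
    full sync has completed.
    """
    return "slave_repl_offset" in parse_info(info_replication)
```

A waiting loop would fetch the replication section repeatedly and stop once `reached_stable_sync` returns `True`, instead of inferring the state from PING replies.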

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* chore: simplify wait_available_async

* chore: comments

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-30 18:58:07 +03:00
Kostas Kyrimis
41f7b611d0
chore: enable -Werror=thread-safety and add missing annotations (part 2/2) (#3595)
* add missing annotations
* small mutex fixes
* enable -Werror=thread-safety for clang builds

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-08-30 15:42:30 +03:00
Kostas Kyrimis
0705bbb536
feat(acl): add pub/sub (#3574)
* add support for pub/sub
* add tests
---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-08-30 15:41:28 +03:00
Stepan Bagritsevich
a22eff15dc
fix(server_family): Remove search indexes during the FLUSHALL command (#3539)
* fix(server_family): Add removal of search indexes to the FLUSHALL command

fixes dragonflydf#3532

---------

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>
Signed-off-by: Stepan Bagritsevich <stefan@dragonflydb.io>
2024-08-30 08:26:14 +03:00
Borys
88229cf365
refactor: remove toUpper() from cmd_arg_parser (#3599)
* refactor: remove usage of toUpper() from cmd_arg_parser

* refactor: remove CmdArgParser::NextUpper
2024-08-29 15:19:52 +03:00
Roman Gershman
0ee52c9d35
chore: remove DflyVersion::VER0 (#3593)
Stop supporting DflyVersion::VER0 from more than a year ago.
In addition, rename Metrics fields to make them clearer.
General improvements and fix the reconnect metric.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-28 18:21:53 +03:00
Kostas Kyrimis
839b1be82d
chore: add -Wthread-analysis and annotate (part 1/2) (#3502)
* enable -Wthread-analysis
* add missing annotations
* small fixes

---------

Signed-off-by: kostas <kostas@dragonflydb.io>
2024-08-26 18:22:38 +03:00
Stepan Bagritsevich
80c3579596
feat(server_family): Add backup/restore Prometheus metrics (#3520)
* feat(server_family): Add backup/restore Prometheus metrics

fixes dragonflydb#3210

---------

Signed-off-by: Stepan Bagritsevich <bagr.stepan@gmail.com>
2024-08-24 00:36:31 +03:00
Shahar Mike
ad3ebf61d2
feat(cluster): Allow appending RDB to existing store (#3505)
* feat(cluster): Allow appending RDB to existing store

The goal of this PR is to support the loading of multiple RDB files into a single server, like when migrating from a Valkey cluster to Dragonfly with a different number of nodes.

It makes the following changes:

* Removes `DEBUG LOAD`, as we already have `DFLY LOAD`
* Adds `APPEND` option to `DFLY LOAD` (i.e. `DFLY LOAD <filename> APPEND`) that loads an RDB without first flushing the data store, overriding existing keys
* Does not load keys belonging to unowned slots, if in cluster mode
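
The semantics listed above can be modeled with a plain dictionary. This is a toy sketch under stated assumptions: `load_rdb` and the `owns_key` predicate are hypothetical, and a real store is not a Python dict.

```python
def load_rdb(store: dict, incoming: dict, append: bool = False, owns_key=None) -> dict:
    """Model of loading a snapshot into an existing store.

    append=False: the store is flushed first.
    append=True:  existing data is kept and incoming keys override old values.
    owns_key:     optional predicate modeling cluster mode; keys for which it
                  returns False (unowned slots) are skipped.
    """
    if not append:
        store.clear()
    for key, value in incoming.items():
        if owns_key is not None and not owns_key(key):
            continue
        store[key] = value
    return store
```

Loading several files with `append=True` one after another merges them into one store, with later files winning on key conflicts.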

Fixes #2840
2024-08-15 14:56:40 +03:00
Roman Gershman
93f6773297
chore: reduce pipelining latency by reusing existing shard fibers (#3494)
* chore: reduce pipelining latency by reusing existing shard fibers

To prove the benefits, run `./dfly_bench --pipeline=50   -n 20000  --ratio 0:1  --qps=0  --key_maximum=1`
Before: the average pipelining latency was 10ms
After: the average pipelining latency is 5ms.
Avg latency: pipelined_latency_usec / total_pipelined_squashed_commands
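
The formula above, written out as a small helper. The function name is hypothetical, and the zero-division guard is an added assumption for the case where no commands were squashed yet.

```python
def avg_pipelined_latency_usec(pipelined_latency_usec: int,
                               total_pipelined_squashed_commands: int) -> float:
    """Average latency per squashed pipelined command, in microseconds."""
    if total_pipelined_squashed_commands == 0:
        return 0.0  # assumption: report 0 when nothing has been counted
    return pipelined_latency_usec / total_pipelined_squashed_commands
```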

Also, improved counting of squashed commands - to count actual squashed ones.
---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-14 14:45:54 +03:00
Stepan Bagritsevich
c756023332
feat: Expose replica_reconnect_count for Prometheus metrics (#3370) 2024-08-13 12:34:01 +02:00