Commit graph

133 commits

Author SHA1 Message Date
Roman Gershman
6d30baa20b
chore: Pipelining fixes (#4994)
Fixes #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
   Now we yield consistently based on the CPU time spent since the last resume point (via a flag with sane defaults).
2. Increases socket read buffer size effectively allowing processing more requests in bulk.
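
The time-based yielding in point 1 can be pictured with a small sketch. This is not the Dragonfly implementation (which is C++). The names `YieldPolicy`, `process_batch` and `budget_usec` are hypothetical, and an injectable clock stands in for the CPU-time measurement.

```python
import time


class YieldPolicy:
    """Hypothetical helper: yield once the time since the last resume exceeds a budget."""

    def __init__(self, budget_usec, clock=time.perf_counter_ns):
        self.budget_ns = budget_usec * 1000
        self.clock = clock
        self.resumed_at = clock()

    def should_yield(self):
        # Compare the time elapsed since the last resume point with the budget.
        return self.clock() - self.resumed_at >= self.budget_ns

    def on_resume(self):
        # Called right after the task is scheduled again.
        self.resumed_at = self.clock()


def process_batch(requests, policy, handle, yield_fn):
    """Handle every request, yielding only when the time budget is used up."""
    yields = 0
    for req in requests:
        handle(req)
        if policy.should_yield():
            yield_fn()
            policy.on_resume()
            yields += 1
    return yields
```

With a budget like this, a batch of cheap requests is processed without interruption, while a batch of expensive ones still gives other tasks a chance to run.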

`./dragonfly  --cluster_mode=emulated`
latencies (usec) for pipeline sizes 80-199:
p50: 1887, p75: 2367, p90: 2897, p99: 6266

`./dragonfly  --cluster_mode=emulated --experimental_cluster_shard_by_slot`
latencies (usec) for pipeline sizes 80-199:
p50: 813, p75: 976, p90: 1216, p99: 3528

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-04-27 20:48:02 +03:00
Tarun Pothulapati
7d0530656f
feat(tools/replay): Add pipeline latency distribution data (#4990)
feat(replay): add latency distributions

* also add avg latency
* also include pipeline range
* display both at the end
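
As an illustration of the kind of summary described above, here is a minimal Python sketch that computes an average and nearest-rank percentiles from a list of latencies. It is not the replay tool's code (that tool is written in Go). The function names and the chosen percentiles are assumptions.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p in 1..100 over a sorted, non-empty list."""
    n = len(sorted_values)
    rank = max((p * n + 99) // 100, 1)  # integer ceil(p * n / 100)
    return sorted_values[rank - 1]


def summarize(latencies_usec, percentiles=(50, 75, 90, 99)):
    """Return the average latency plus the requested percentiles."""
    ordered = sorted(latencies_usec)
    summary = {"avg": sum(ordered) / len(ordered)}
    for p in percentiles:
        summary[f"p{p}"] = percentile(ordered, p)
    return summary
```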
2025-04-24 19:23:43 +03:00
Roman Gershman
7ffe812967
feat(dfly_bench): allow regulated throughput in 3 modes (#4962)
* feat(dfly_bench): allow regulated throughput in 3 modes

1. Coordinated omission - with --qps=0, each request is sent and then we wait for the response and so on.
   For pipeline mode, k requests are sent and then we wait for them to return to send another k
2. qps > 0: we schedule sending requests at frequency "qps" per connection but if pending requests count crosses a limit
   we slow down by throttling request sending. This mode enables gentle uncoordinated omission, where the schedule
   converges to the real throughput capacity of the backend (if it's slower than the target throughput).
3. qps < 0: similar to (2), but does not adjust its scheduling and may overload the server
   if target QPS is too high.
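
The three modes can be summarized as a small decision function. This is a hedged Python sketch rather than the dfly_bench code (which is C++). `send_decision` and `pending_limit` are hypothetical names, and the exact limit comparison is an assumption.

```python
def send_decision(qps: int, pending: int, pending_limit: int) -> str:
    """Decide what a connection does at its next scheduled send slot.

    Returns "send", "wait" (block until replies arrive) or "throttle" (slow down).
    """
    if qps == 0:
        # Mode 1: closed loop. Send the next request (or batch of k requests)
        # only after everything outstanding has been answered.
        return "send" if pending == 0 else "wait"
    if qps > 0 and pending > pending_limit:
        # Mode 2: scheduled sends that back off once too many requests are pending.
        return "throttle"
    # Mode 2 below the limit, or mode 3 (qps < 0): always follow the schedule.
    return "send"


def send_interval_sec(qps: int):
    """Time between scheduled sends per connection, or None in closed-loop mode."""
    return None if qps == 0 else 1.0 / abs(qps)
```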

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* chore: change pipelining and coordinated omission logic

Before this change, uncoordinated omission only worked without pipelining.
Now, in pipelining mode, we send a burst of P requests and then:
a) For coordinated omission - wait for all of them to complete before proceeding
   further
b) For non-coordinated omission - we sleep to pace our single connection throughput as
   defined by the qps setting.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-04-21 09:56:33 +03:00
Roman Gershman
220f20bac6
feat: expose table capacities instead of number of buckets (#4956)
Also, add a local dashboard demonstrating prime table load per db.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-04-18 10:30:04 +03:00
Roman Gershman
5a2192dfdf
fix: local dashboard show rapid changes in QPS (#4886)
Helps investigating #4787

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-04-03 20:12:26 +03:00
Roman Gershman
460690855e
fix: add version id for dev container builds (#4878)
now it looks like this:
```
> docker run --rm ghcr.io/dragonflydb/dragonfly-dev:ubuntu-f767d82 --version
dragonfly f767d82-f767d82ce78ccbc90ddfb525f4ad4bd9aafcfbed
```

fixes #4830

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-04-02 20:21:19 +03:00
Roman Gershman
2a3a1567b9
feat(cluster_mgr): add populate command (#4816)
* feat(cluster_mgr): add populate command

We further simplify the code around cluster config
Also - add a command that populates all the cluster ranges in the cluster
using the "populate" command. `--size` and `--valsize` arguments are also added.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* chore: fixes

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-03-25 10:47:10 +02:00
Roman Gershman
72fb25694b
chore(cluster_mgr): introduce SlotRange class (#4814)
Before: slot merging/splitting logic was mixed with business logic.
Also, slots were represented as dictionary, which made the code less readable.
Now, SlotRange handles the low-level logic, which makes the high-level code simpler
to understand.
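
To illustrate the idea of moving merge and split logic into a dedicated type, here is a minimal sketch. It is not the actual `SlotRange` class from cluster_mgr. The method names (`touches`, `merge`, `remove`) and the helper `merge_ranges` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class SlotRange:
    start: int
    end: int  # inclusive

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError(f"invalid range {self.start}-{self.end}")

    def touches(self, other):
        """True if the two ranges overlap or are directly adjacent."""
        return self.start <= other.end + 1 and other.start <= self.end + 1

    def merge(self, other):
        if not self.touches(other):
            raise ValueError("ranges are disjoint")
        return SlotRange(min(self.start, other.start), max(self.end, other.end))

    def remove(self, other):
        """Split self around other and return the parts that remain."""
        if other.end < self.start or other.start > self.end:
            return [self]
        parts = []
        if other.start > self.start:
            parts.append(SlotRange(self.start, other.start - 1))
        if other.end < self.end:
            parts.append(SlotRange(other.end + 1, self.end))
        return parts


def merge_ranges(ranges):
    """Collapse overlapping or adjacent ranges into a sorted minimal list."""
    merged = []
    for r in sorted(ranges):
        if merged and merged[-1].touches(r):
            merged[-1] = merged[-1].merge(r)
        else:
            merged.append(r)
    return merged
```

High-level code can then work with lists of ranges without handling boundary arithmetic itself.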

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-03-23 08:41:44 +02:00
Roman Gershman
e01aec2a21
fix(dfly_bench): track hit rate for mget command (#4723)
Also, clean up the code a bit, reduce nesting.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-03-07 07:09:59 +00:00
mkaruza
debb2eb9e8
feat(cluster_mgr): Add argument to set path to dragonfly binary (#4695)
Add optional argument to cluster_mgr script so that we can run cluster with different builds.

Signed-off-by: mkaruza <mario@dragonflydb.io>
2025-03-04 12:52:24 +01:00
Roman Gershman
52d88c2372
chore: introduce docker release pipeline (#4618)
* chore: introduce docker release pipeline

The whole flow is reimplemented using native arm64/amd64 runners.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

* Update .github/workflows/docker-release2.yml

Co-authored-by: Kostas Kyrimis  <kostas@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>

* chore: comments

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Signed-off-by: Roman Gershman <romange@gmail.com>
Co-authored-by: Kostas Kyrimis <kostas@dragonflydb.io>
2025-02-17 12:24:24 +02:00
Roman Gershman
e433ef87bf
fix: debian path in dragonfly.service (#4594)
Split the rpm service file from debian.
Fixes #4593

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-02-12 09:18:51 +02:00
Roman Gershman
b0b9a72dbd
feat: introduce more options for traffic logger (#4571)
1. Provide clear usage instructions
2. Add "pace" option, which when false, sends traffic as quickly as possible (default true).
3. Add skip option that sometimes can be useful to remove unneeded noise

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-02-07 11:10:13 +02:00
Roman Gershman
bafb427a09
fix: rpm package setup (#4506)
Also, fix the deadlock problem on shutdown on Oracle Linux 5.15
Fixes #4505

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2025-01-26 12:40:33 +02:00
Roman Gershman
904775cfe6
chore: new docker build pipeline (#4503)
Our previous weekly pipeline used qemu, was very slow and over-complicated.
This one uses matrix with proper parallelization and the latest arm64 github runners.

Now it takes less than 30 minutes to build everything,
so let's make it daily.
2025-01-26 12:03:42 +02:00
Roman Gershman
6265f52bff
feat(dev): allow monitoring a valkey server on localhost (#4467)
2025-01-18 10:46:14 +02:00
Roman Gershman
95cd9dfb4c
chore: update helio and improve our stack overflow resiliency (#4349)
1. Run CI/Regression tests with HELIO_STACK_CHECK=4096.
   This will crash if a fiber stack usage goes below this limit.
2. Increase shard queue stack size to 64KB
3. Increase fiber stack size to 40KB on Debug builds.
4. Updated helio has some changes around the TLS socket code.
   In addition we add a helper script to generate self-signed certificates helpful for local development work.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-12-23 08:13:45 +00:00
Roman Gershman
904d21d666
fix: add content-type for metrics response (#4340)
chore: add content-type for metrics response.

Also, update the local stack to use prometheus 3.0
Finally, hex-escape arguments when logging an error for a command.

Fixes #4277

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-12-18 19:12:00 +00:00
Borys
d6f2b76666
fix: cluster_mgr script (#4210)
2024-11-27 14:09:19 +00:00
Roman Gershman
63742dd0cf
fix: stop using openssl for container healthchecks (#4181)
Dragonfly responds to ASCII-based requests on the TLS port with:
`-ERR Bad TLS header, double check if you enabled TLS for your client.`

Therefore, it is now possible to test both TLS and non-TLS ports with a plain-text PING.
Fixes #4171
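
A plain-text probe along these lines could look like the sketch below. It is an illustration, not the healthcheck script shipped in the container. `classify_ping_reply` and `probe` are hypothetical names, and treating every other reply as unhealthy is an assumption.

```python
import socket


def classify_ping_reply(reply: bytes) -> str:
    """Map the first bytes of the reply to a plain-text PING onto a port type."""
    if reply.startswith(b"+PONG"):
        return "plain"  # non-TLS port answered normally
    if reply.startswith(b"-ERR Bad TLS header"):
        return "tls"  # TLS port is alive and rejected the plain-text request
    return "unhealthy"


def probe(host: str, port: int, timeout: float = 1.0) -> str:
    """Send a plain-text PING and classify whatever comes back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            return classify_ping_reply(sock.recv(256))
    except OSError:
        # Connection refused, timeout, reset, and similar failures.
        return "unhealthy"
```

Either of the first two outcomes shows that the server is accepting connections, so no TLS client such as openssl is needed for the check.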

Also, blacklist the bloom-filter test that Dragonfly does not support yet.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-11-25 17:41:17 +02:00
s13k
ff2359af30
fix(tools): Prevent dragonfly.logrotate from stopping logrotate service (#4176)
Update dragonfly.logrotate

If multiple logs are being rotated and one of them fails (due to exit 1), the other logs that follow won't be rotated either, unless logrotate is run again.

If you want to prevent the rotation of a specific log file and not affect the rest of the logs, you'll want to handle the condition properly to ensure that logrotate doesn't abort due to the failure of the prerotate script.

To prevent the rotation of a specific log file without causing issues for other logs, you can use exit 0 to prevent rotation cleanly or design your prerotate script to handle conditions carefully.

Signed-off-by: s13k <s13k@pm.me>
2024-11-24 17:27:05 +00:00
Sebastian Struß
cfca3e798d
adjusted grafana dashboard to be more user friendly (#4165)
2024-11-24 09:16:00 +02:00
dependabot[bot]
86b64d910a
chore(deps): bump github.com/redis/go-redis/v9 from 9.5.1 to 9.7.0 in /tools/replay (#4062)
chore(deps): bump github.com/redis/go-redis/v9 in /tools/replay

Bumps [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) from 9.5.1 to 9.7.0.
- [Release notes](https://github.com/redis/go-redis/releases)
- [Changelog](https://github.com/redis/go-redis/blob/master/CHANGELOG.md)
- [Commits](https://github.com/redis/go-redis/compare/v9.5.1...v9.7.0)

---
updated-dependencies:
- dependency-name: github.com/redis/go-redis/v9
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-04 22:31:01 +02:00
dependabot[bot]
ceb474fbda
chore(deps): bump numpy from 1.24.1 to 2.1.3 in /tools (#4063)
Bumps [numpy](https://github.com/numpy/numpy) from 1.24.1 to 2.1.3.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst)
- [Commits](https://github.com/numpy/numpy/compare/v1.24.1...v2.1.3)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-04 22:30:34 +02:00
Roman Gershman
4012ad1855
fix: prevents Dragonfly from blocking in epoll during snapshotting (#3911)
The problem - we used file write in non-direct mode when writing snapshots in epoll mode.
As a result - lots of data was cached into OS memory. But then during the rename operation,
when we rename "xxx.dfs.tmp" into "xxx.dfs", the OS flushes the file caches and the thread
is stuck in OS system call rename for a long time.

The fix - to use DIRECT mode and to avoid caching the data into OS caches at all.
Fixes #3895

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-10-12 18:26:12 +03:00
Roman Gershman
c9a2334f6d
fix: allow the healthcheck run in non-privileged containers as well (#3731)
fix: allow the healthcheck running in non-privileged containers as well

Fixes #3644 (again).

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-09-20 05:41:06 +00:00
Shahar Mike
1c6be62a0b
fix: Fix cluster_mgr.py (#3730)
We updated the reply of `SLOT-MIGRATION-STATUS`, so `cluster_mgr.py`
needs to be adjusted as well.
2024-09-18 11:44:15 +03:00
Roman Gershman
3cdc8fa128
chore: add a script that parses allocator tracking logs (#3687)
2024-09-10 07:26:44 +00:00
Tarun Pothulapati
65f96e3bb5
fix(docker/healthcheck): run netstat port retrieval command as dfly (#3647)
* fix(docker/healthcheck): run netstat port retrieval command as dfly
2024-09-04 14:34:35 +00:00
Sebastian Struß
06f6dcafcd
fix(grafana): Fix grafana dragonfly dashboard datasource (#3608)
fix: grafana dragonfly dashboard datasource
2024-08-30 17:15:51 +00:00
dependabot[bot]
e8a8d534f9
chore(deps): bump gopkg.in/yaml.v3 from 3.0.0-20210107192922-496545a6307b to 3.0.0 in /tools/replay (#3603)
chore(deps): bump gopkg.in/yaml.v3 in /tools/replay

Bumps gopkg.in/yaml.v3 from 3.0.0-20210107192922-496545a6307b to 3.0.0.

---
updated-dependencies:
- dependency-name: gopkg.in/yaml.v3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-08-29 16:40:37 +03:00
Roman Gershman
cec3659b51
fix: named volume permissions in docker (#3518)
Fixes #2917

The problem is described in this "working as intended" issue https://github.com/moby/moby/issues/3124
So the advised approach of using the "USER dfly" directive does not really work, because it requires
that the host also define a 'dfly' user with the same id. That is an unrealistic expectation.

Therefore, we revert the fix done in #1775 and follow valkey approach:
https://github.com/valkey-io/valkey-container/blob/mainline/docker-entrypoint.sh#L12

1. we run the entrypoint in the container as root which later spawns the dragonfly process
2. if we run as root:
   a. we chown files under /data to dfly.
   b. use setpriv to exec ourselves as dfly.
3. if we do not run as root we execute the docker command.

So even though the process starts as root, the server runs as dfly, and the elevated permissions
of the bootstrap part are used only to fix the volume access.

While we are at it, we also switched to setpriv following the change of https://github.com/valkey-io/valkey-container/pull/24/files

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-22 11:33:29 +03:00
Vladislav
84a697dd75
chore(traffic logger): use pipelining and print/analyze commands (#3527)
Add run, print, analyze commands to traffic logger; add support for pipelines
2024-08-20 09:32:15 +03:00
Roman Gershman
93f6773297
chore: reduce pipelining latency by reusing existing shard fibers (#3494)
* chore: reduce pipelining latency by reusing existing shard fibers

To prove the benefits, run `./dfly_bench --pipeline=50   -n 20000  --ratio 0:1  --qps=0  --key_maximum=1`
Before: the average pipelining latency was 10ms
After: the average pipelining latency is 5ms.
Avg latency: pipelined_latency_usec / total_pipelined_squashed_commands
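
The average above can be computed from two samples of the counters, for example one taken before and one after a benchmark run. This is a minimal sketch that assumes both values are cumulative counters. The function name and the dictionary layout are hypothetical.

```python
def avg_pipeline_latency_usec(before: dict, after: dict) -> float:
    """Average latency per squashed pipelined command between two samples."""
    latency = after["pipelined_latency_usec"] - before["pipelined_latency_usec"]
    commands = (
        after["total_pipelined_squashed_commands"]
        - before["total_pipelined_squashed_commands"]
    )
    if commands <= 0:
        return 0.0  # nothing was pipelined in this interval
    return latency / commands
```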

Also, improved counting of squashed commands - to count actual squashed ones.
---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-08-14 14:45:54 +03:00
Borys
48a28c3ea3
refactor: set info_replication_valkey_compatible=true (#3467)
* refactor: set info_replication_valkey_compatible=true
* test: mark test_cluster_replication_migration as skipped because it's broken
2024-08-08 21:42:58 +03:00
Shahar Mike
38fba1d398
fix: cluster_mgr.py to use CLUSTER MYID (#3444)
2024-08-05 07:29:31 +00:00
adiholden
e3eb8518fd
feat(test): Improve benchmark workflow (#3330)
Signed-off-by: adi_holden <adi@dragonflydb.io>
2024-07-17 14:34:48 +03:00
Roman Gershman
374a5f529e
chore: print effective QPS of the server. (#3274)
Also refactor ReceiveFB into multiple functions.
Finally, fix the memcached command in local monitoring stack.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-07-07 06:26:14 +00:00
Roman Gershman
8240c7f19e
chore(monitoring): add more dashboards + memcached (#3268)
2024-07-05 07:12:13 +00:00
Shahar Mike
5b731f163c
feat(cluster_mgr): Fix migration action (#3124)
2024-06-04 13:27:42 +03:00
Shahar Mike
bcbcc5a2c6
feat(cluster_mgr): Take over command (#3120)
2024-06-04 11:39:08 +03:00
Shahar Mike
6e6c91aeaf
feat(cluster_mgr): Improvements to cluster_mgr.py (#3118)
Make sure attached node is in right mode
Enable detaching nodes
2024-06-03 19:05:17 +00:00
Roman Gershman
0394387a5f
chore: export pipeline related metrics (#3104)
* chore: export pipeline related metrics

Export in /metrics
1. Total pipeline queue length
2. Total pipeline commands
3. Total pipelined duration

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

---------

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
2024-05-30 19:10:35 +03:00
Shahar Mike
d1e3c82eaa
feat(cluster_mgr): Allow attaching replicas (#3105)
2024-05-30 15:29:58 +03:00
Vladislav
fd5ece09fb
chore: small replayer fixes (#3081)
2024-05-25 22:48:29 +03:00
Roman Gershman
8a0007d761
chore: add replication memory stats to the dashboard (#3065)
2024-05-22 08:11:54 +03:00
Jirapong Pansak
3babe99cf6
<chore>!: Update grafana panel (#3064)
update panel
2024-05-19 15:56:44 +00:00
Roman Gershman
fd74fd5b4b
chore: Export replication memory stats (#3062)
2024-05-18 22:40:14 +03:00
Borys
3dd6c4959c
feat: add defragment command (#3003)
* feat: add defragment command and improve auto defragmentation algorithm
2024-05-08 14:26:42 +03:00
adiholden
186ff31e29
Fix benchmark (#3017)
* fix(benchmark): fix lag check

Signed-off-by: adi_holden <adi@dragonflydb.io>

---------

Signed-off-by: adi_holden <adi@dragonflydb.io>
2024-05-06 18:38:13 +03:00