Fixes #4998.
1. Reduces aggressive yielding when reading multiple requests, since it hampers pipeline efficiency.
Now we yield consistently based on CPU time spent since the last resume point (via a flag with sane defaults); see the sketch below.
2. Increases the socket read buffer size, effectively allowing more requests to be processed in bulk.
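A minimal sketch of the yield-budget idea, with an illustrative budget parameter and a placeholder yield call rather than the actual flag name or fiber API used in the code:
```
#include <cstdint>
#include <ctime>

// Illustrative only: instead of yielding after every parsed request, yield
// when this fiber has burned more than `budget_usec` of CPU time since the
// last resume point.
class CpuBudgetYielder {
 public:
  explicit CpuBudgetYielder(uint64_t budget_usec)
      : budget_usec_(budget_usec), start_usec_(ThreadCpuUsec()) {}

  // Called between requests; returns true if we actually yielded.
  bool MaybeYield() {
    if (ThreadCpuUsec() - start_usec_ < budget_usec_)
      return false;        // still within budget - keep draining the socket
    YieldToOtherFibers();  // placeholder for the real fiber yield call
    start_usec_ = ThreadCpuUsec();  // new resume point
    return true;
  }

 private:
  static uint64_t ThreadCpuUsec() {
    timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return uint64_t(ts.tv_sec) * 1'000'000 + ts.tv_nsec / 1'000;
  }

  static void YieldToOtherFibers() { /* stub */ }

  uint64_t budget_usec_;
  uint64_t start_usec_;
};
```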
`./dragonfly --cluster_mode=emulated`
latencies (usec) for pipeline sizes 80-199:
p50: 1887, p75: 2367, p90: 2897, p99: 6266
`./dragonfly --cluster_mode=emulated --experimental_cluster_shard_by_slot`
latencies (usec) for pipeline sizes 80-199:
p50: 813, p75: 976, p90: 1216, p99: 3528
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat(dfly_bench): allow regulated throughput in 3 modes
1. Coordinated omission - with --qps=0, each request is sent and then we wait for the response, and so on.
In pipeline mode, k requests are sent and then we wait for them to return before sending another k.
2. qps > 0: we schedule sending requests at frequency "qps" per connection, but if the pending request count crosses a limit,
we slow down by throttling request sending. This mode enables gentle uncoordinated omission, where the schedule
converges to the real throughput capacity of the backend (if it's slower than the target throughput); see the sketch below.
3. qps < 0: similar to (2), but does not adjust its scheduling and may overload the server
if the target QPS is too high.
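A rough sketch of the pacing logic in mode (2); the helper names and parameters here are hypothetical stand-ins, not the actual dfly_bench internals:
```
#include <chrono>
#include <cstdint>
#include <thread>

static void SendRequest(uint64_t) {}              // stub: send one request
static uint64_t DrainOneResponse() { return 1; }  // stub: block for one reply

// Per-connection loop: requests are scheduled every 1/qps seconds, but when
// too many responses are outstanding we throttle instead of piling up more
// requests, so the schedule converges to the backend's real capacity.
void RunConnection(double qps, uint64_t pending_limit, uint64_t total_requests) {
  using namespace std::chrono;
  const auto interval =
      duration_cast<steady_clock::duration>(duration<double>(1.0 / qps));
  auto next_send = steady_clock::now();
  uint64_t pending = 0;  // requests sent but not yet answered

  for (uint64_t i = 0; i < total_requests; ++i) {
    while (pending >= pending_limit)  // gentle uncoordinated omission
      pending -= DrainOneResponse();
    std::this_thread::sleep_until(next_send);
    SendRequest(i);
    ++pending;
    next_send += interval;  // mode (3) keeps this schedule even when behind
  }
}
```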
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: change pipelining and coordinated omission logic
Before this change, uncoordinated omission only worked without pipelining.
Now, in pipelining mode we send a burst of P requests and then:
a) For coordinated omission - wait for all of them to complete before proceeding
further.
b) For non-coordinated omission - we sleep to pace our single-connection throughput as
defined by the qps setting (see the sketch below).
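In simplified form (hypothetical helpers, not the real benchmark code), the per-connection loop becomes:
```
#include <chrono>
#include <cstdint>
#include <thread>

static void SendBurst(uint64_t) {}  // stub: send P pipelined requests
static void AwaitBurst() {}         // stub: wait for all replies of the burst

// After each burst of P requests:
//  a) coordinated omission: block until the whole burst completes;
//  b) uncoordinated omission: sleep so bursts leave this connection at a rate
//     of qps/P per second, regardless of how fast replies come back.
void PipelinedLoop(uint64_t p, double qps, bool coordinated, uint64_t bursts) {
  using namespace std::chrono;
  const auto burst_interval =
      duration_cast<steady_clock::duration>(duration<double>(p / qps));
  auto next_burst = steady_clock::now();

  for (uint64_t i = 0; i < bursts; ++i) {
    SendBurst(p);
    if (coordinated) {
      AwaitBurst();                    // (a)
    } else {
      next_burst += burst_interval;    // (b) pace by wall clock only
      std::this_thread::sleep_until(next_burst);
    }
  }
}
```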
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Now it looks like this:
```
> docker run --rm ghcr.io/dragonflydb/dragonfly-dev:ubuntu-f767d82 --version
dragonfly f767d82-f767d82ce78ccbc90ddfb525f4ad4bd9aafcfbed
```
Fixes #4830
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* feat(cluster_mgr): add populate command
We further simplify the code around cluster config.
Also, add a command that populates all the slot ranges in the cluster
using the "populate" command. `--size` and `--valsize` arguments are also added.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: fixes
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Before: slot merging/splitting logic was mixed with business logic.
Also, slots were represented as a dictionary, which made the code less readable.
Now, SlotRange handles the low-level logic, which makes the high-level code simpler
to understand.
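As an illustration of the kind of low-level helper this refactor introduces (the interface below is a sketch, not the actual SlotRange API):
```
#include <algorithm>
#include <cstdint>
#include <optional>

// Illustrative contiguous slot range [start, end] with the merge/split
// primitives that used to be interleaved with the business logic.
struct SlotRange {
  uint16_t start;
  uint16_t end;

  bool Contains(uint16_t slot) const { return start <= slot && slot <= end; }

  // Two ranges can be merged only if they overlap or touch.
  std::optional<SlotRange> Merge(const SlotRange& o) const {
    if (o.start > end + 1 || start > o.end + 1)
      return std::nullopt;
    return SlotRange{std::min(start, o.start), std::max(end, o.end)};
  }

  // Split the range at `slot`, keeping [start, slot] and returning the rest.
  std::optional<SlotRange> SplitAfter(uint16_t slot) {
    if (!Contains(slot) || slot == end)
      return std::nullopt;
    SlotRange rest{static_cast<uint16_t>(slot + 1), end};
    end = slot;
    return rest;
  }
};
```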
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
1. Provide clear usage instructions.
2. Add a "pace" option which, when false, sends traffic as quickly as possible (default: true).
3. Add a "skip" option that can sometimes be useful for removing unneeded noise.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Our previous weekly pipeline used QEMU and was very slow and over-complicated.
This one uses a matrix with proper parallelization and the latest arm64 GitHub runners.
Now it takes less than 30 minutes to build everything, so let's make it daily.
1. Run CI/Regression tests with HELIO_STACK_CHECK=4096.
This will crash if a fiber's available stack space goes below this limit.
2. Increase shard queue stack size to 64KB.
3. Increase fiber stack size to 40KB on Debug builds.
4. Updated helio has some changes around the TLS socket code.
In addition, we add a helper script to generate self-signed certificates, helpful for local development work.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
chore: add content-type for metrics response.
Also, update the local stack to use Prometheus 3.0.
Finally, hex-escape arguments when logging an error for a command.
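Sketched, the hex-escaping amounts to something like this (illustrative helper, not the exact function used in the code):
```
#include <cctype>
#include <cstdio>
#include <string>
#include <string_view>

// Illustrative: printable bytes pass through, everything else becomes \xNN,
// so binary command arguments do not corrupt the log line.
std::string HexEscape(std::string_view arg) {
  std::string out;
  out.reserve(arg.size());
  for (unsigned char c : arg) {
    if (std::isprint(c) && c != '\\') {
      out.push_back(static_cast<char>(c));
    } else {
      char buf[5];
      std::snprintf(buf, sizeof(buf), "\\x%02x", static_cast<unsigned>(c));
      out.append(buf);
    }
  }
  return out;
}
```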
Fixes #4277
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Dragonfly responds to ASCII-based requests on the TLS port with:
`-ERR Bad TLS header, double check if you enabled TLS for your client.`
Therefore, it is now possible to test both TLS and non-TLS ports with a plain-text PING.
Fixes #4171
Also, blacklist the bloom-filter test that Dragonfly does not support yet.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Update dragonfly.logrotate
If multiple logs are being rotated and one of them fails (due to exit 1), the other logs that follow won't be rotated either, unless logrotate is run again.
To prevent the rotation of a specific log file without affecting the rest of the logs, the prerotate script should handle its conditions carefully and exit 0 cleanly, so that logrotate doesn't abort due to a prerotate failure.
Signed-off-by: s13k <s13k@pm.me>
The problem: we used file writes in non-direct mode when writing snapshots in epoll mode.
As a result, lots of data was cached in OS memory. Then, during the rename operation,
when we rename "xxx.dfs.tmp" into "xxx.dfs", the OS flushes the file caches and the thread
is stuck in the rename system call for a long time.
The fix: use direct I/O mode and avoid caching the data in OS caches at all.
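A minimal sketch of the idea using the Linux open(2) flag directly (the actual change goes through Dragonfly's file I/O layer, not this exact code):
```
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for O_DIRECT on Linux
#endif
#include <fcntl.h>

// Illustrative: opening the snapshot file with O_DIRECT bypasses the page
// cache, so the later rename("xxx.dfs.tmp", "xxx.dfs") no longer stalls on
// flushing gigabytes of dirty cached data. Note that O_DIRECT requires
// aligned buffers and write sizes (typically 4KB-aligned).
int OpenSnapshotFile(const char* path, bool direct_io) {
  int flags = O_WRONLY | O_CREAT | O_TRUNC;
  if (direct_io)
    flags |= O_DIRECT;  // before the fix: plain buffered writes in epoll mode
  return open(path, flags, 0644);
}
```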
Fixes #3895
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Fixes #2917
The problem is described in this "working as intended" issue https://github.com/moby/moby/issues/3124
So the advised approach of using the "USER dfly" directive does not really work, because it requires
that the host also defines a 'dfly' user with the same id, which is an unrealistic expectation.
Therefore, we revert the fix done in #1775 and follow valkey approach:
https://github.com/valkey-io/valkey-container/blob/mainline/docker-entrypoint.sh#L12
1. We run the entrypoint in the container as root, which later spawns the dragonfly process.
2. If we run as root:
a. we chown files under /data to dfly.
b. we use setpriv to exec ourselves as dfly.
3. If we do not run as root, we execute the docker command directly.
So even though the process starts as root, the server runs as dfly, and only the bootstrap
part that fixes the volume access runs with elevated permissions.
While we are at it, we also switched to setpriv following the change of https://github.com/valkey-io/valkey-container/pull/24/files
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: reduce pipelining latency by reusing existing shard fibers
To prove the benefits, run `./dfly_bench --pipeline=50 -n 20000 --ratio 0:1 --qps=0 --key_maximum=1`
Before: the average pipelining latency was 10ms
After: the average pipelining latency is 5ms.
Avg latency: pipelined_latency_usec / total_pipelined_squashed_commands
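For example, with purely illustrative numbers: 500,000,000 usec of accumulated pipelined latency over 100,000 squashed commands gives an average of 5,000 usec, i.e. 5 ms.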
Also, improved counting of squashed commands, so that only the actually squashed ones are counted.
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
Also refactor ReceiveFB into multiple functions.
Finally, fix the memcached command in the local monitoring stack.
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
* chore: export pipeline related metrics
Export in /metrics
1. Total pipeline queue length
2. Total pipeline commands
3. Total pipelined duration
Signed-off-by: Roman Gershman <roman@dragonflydb.io>
---------
Signed-off-by: Roman Gershman <roman@dragonflydb.io>