RE: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices

From: King, Colin
Date: Mon Mar 17 2025 - 06:06:36 EST


Hi Christian,

Follow-up below:

> -----Original Message-----
> From: Christian Loehle <christian.loehle@xxxxxxx>
> Sent: 03 March 2025 22:25
> To: King, Colin <colin.king@xxxxxxxxx>; Jens Axboe <axboe@xxxxxxxxx>; Rafael
> J. Wysocki <rafael@xxxxxxxxxx>; Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>;
> linux-block@xxxxxxxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for
> fast I/O devices
>
> On 3/3/25 16:43, Colin Ian King wrote:
> > Modern processors can drop into deep sleep states relatively quickly
> > to save power. However, coming out of deep sleep states takes a small
> > amount of time and this is detrimental to performance for I/O devices
> > such as fast PCIe NVME drives when servicing completed I/O
> > transactions.
> >
> > Testing with fio with read/write RAID0 PCIe NVME devices on various
> > modern SMP based systems (such as a 96 thread Granite Rapids Xeon 6741P)
> > has shown that 85-90% of read/write transactions issued on a CPU
> > are completed by the same CPU, so it makes some sense to prevent the
> > CPU from dropping into a deep sleep state to help reduce I/O handling
> > latency.
>
> For the platform you tested on that may be true, but even if we constrain
> ourselves to pci-nvme there's a variety of queue/irq mappings where this
> doesn't hold I'm afraid.

This code is optional: it can be enabled or disabled via the config option. Also,
even when it is built in, it can be disabled at runtime by writing 0 to the sysfs file
/sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
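
For reference, a minimal userspace sketch of using that runtime knob (purely
illustrative; the sysfs path comes from the patch, everything else is just an
example and not part of the patch itself):

  /* disable psd at runtime by writing "0" to the sysfs knob above */
  #include <stdio.h>

  int main(void)
  {
          const char *path =
                  "/sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms";
          FILE *f = fopen(path, "w");     /* needs root */

          if (!f) {
                  perror(path);
                  return 1;
          }
          fputs("0\n", f);        /* 0 disables, N > 0 sets the timeout in ms */
          return fclose(f) ? 1 : 0;
  }

The same effect can of course be had with a plain shell redirect as root.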

>
> >
> > This commit introduces a simple, lightweight and fast power sleep
> > demotion mechanism that provides the block layer a way to inform the
> > menu governor to prevent a CPU from going into a deep sleep when an
> > I/O operation is requested. While it is true that some I/Os may not
>
> s/requested/completed is the full truth, isn't it?
>
> > be serviced on the same CPU that issued the I/O request, and hence the
> > mechanism is not 100% perfect, it does work well for the vast majority
> > of I/O operations, and the sleep demotion prevention adds very little
> > overhead.
> >
> > Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME md
> > array using fio 3.35 performing random read and read-write tests on a
> > 512GB file with 8 concurrent I/O jobs. Tested with the
> > NHM_C1_AUTO_DEMOTE bit in MSR_PKG_CST_CONFIG_CONTROL set in
> > the BIOS.
> >
> > Test case: random reads, results based on geometric mean of results
> > from 5 test runs:
> >
> >            Bandwidth     IO-ops    Latency   Bandwidth
> > read       (bytes/sec)   per sec   (ns)      % Std.Deviation
> > Baseline:  21365755610   20377     390105    1.86%
> > Patched:   25950107558   24748     322905    0.16%
>
> What is the baseline?
> Do you mind trying with Rafael's recently posted series?
> Given the IOPS I'd expect good results from that alone already.
> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@xxxxxxxxxxxxx/
>
> (Happy to see teo as comparison too, which you don't modify).

OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is 6.14-rc5 without
Rafael's patch. I also tested the Baseline with C6 and C6P disabled, as this stops deeper
C-state sleeps and in theory should provide the "best performance". I also benchmarked
Rafael's patch alone, just my patch, and finally Rafael's patch AND my patch:

Reads
                     Bandwidth     Bandwidth   Latency   Latency
                     (bytes/sec)   %Std.Dev.   (ns)      %Std.Dev.
Baseline             25689182477   0.15        326177    0.15
C6, C6P disabled     25839580014   0.19        324349    0.19
Rafael's patch       25695523747   0.06        326150    0.06
My patch             25782011833   0.07        324999    0.07
Rafael's + my patch  25792551514   0.10        324924    0.10

Writes
                     Bandwidth     Bandwidth   Latency   Latency
                     (bytes/sec)   %Std.Dev.   (ns)      %Std.Dev.
Baseline             15220468898   3.33        550290    3.36
C6, C6P disabled     13405624707   0.66        625424    0.66
Rafael's patch       14017625200   1.15        598049    1.16
My patch             15444417488   3.73        467818    29.10
Rafael's + my patch  14037711032   1.13        597143    1.13

Combined Read+Writes, Reads

                     Bandwidth     Bandwidth   Latency   Latency
                     (bytes/sec)   %Std.Dev.   (ns)      %Std.Dev.
Baseline             10132229433   0.41        484919    0.25
C6, C6P disabled     10567536346   0.60        515260    1.16
Rafael's patch       10171044817   0.37        486937    0.20
My patch             10468953527   0.07        504797    0.07
Rafael's + my patch  10174707546   1.26        488263    1.13

Combined Read+Writes, Writes

                     Bandwidth     Bandwidth   Latency   Latency
                     (bytes/sec)   %Std.Dev.   (ns)      %Std.Dev.
Baseline             10139393169   0.44        342132    1.23
C6, C6P disabled     10583264662   0.63        277052    3.87
Rafael's patch       10178275035   0.39        336989    0.94
My patch             10482766569   1.28        294803    6.87
Rafael's + my patch  10183837235   0.38        330657    3.39
>
> >
> > Read rate improvement of ~21%.
> >
> > Test case: random read+writes, results based on geometric mean of
> > results from 5 test runs:
> >
> >            Bandwidth     IO-ops    Latency   Bandwidth
> > read       (bytes/sec)   per sec   (ns)      % Std.Deviation
> > Baseline:  9937848224    9477      550094    1.04%
> > Patched:   10502592508   10016     509315    1.85%
> >
> > Read rate improvement of ~5.7%
> >
> >            Bandwidth     IO-ops    Latency   Bandwidth
> > write      (bytes/sec)   per sec   (ns)      % Std.Deviation
> > Baseline:  9945197656    9484      288933    1.02%
> > Patched:   10517268400   10030     287026    1.85%
> >
> > Write rate improvement of ~5.7%
> >
> > For kernel builds, where all CPUs are fully loaded, no performance
> > improvement or regressions were observed based on the results of
> > 5 kernel build test runs.
> >
> > By default, CPU power sleep demotion blocking is set to run for 3 ms
> > on I/O requests, but this can be modified using the new sysfs
> > interface:
> >
> > /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
>
> rounding up a jiffie sure is a heavy price to pay then.

Jiffies were used because the code is designed to be as lightweight as possible. Rounding up
isn't problematic: higher precision timing is more expensive on the CPU and doesn't really gain
us anything at this kind of time resolution. We need the code on the I/O path to be lightweight
and "good enough" rather than overly expensive and "100% perfect".
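
To make the trade-off concrete, here is a rough sketch of the jiffies-based idea
(this is not the code from the patch; the per-CPU variable and helper names are
made up for illustration). The block layer stamps the current CPU with a deadline
in jiffies and the menu governor just compares against jiffies, so the timeout is
naturally rounded up to jiffy resolution and the hot path is little more than a
per-CPU read or write:

  /* Illustrative only -- not the actual patch code. */
  #include <linux/jiffies.h>
  #include <linux/percpu.h>
  #include <linux/types.h>

  static DEFINE_PER_CPU(unsigned long, psd_deadline);    /* in jiffies */
  static unsigned int psd_cpu_lat_timeout_ms = 3;         /* 0 = disabled */

  /* block layer: called when an I/O is issued/completed on this CPU */
  static inline void psd_block_deep_sleep(void)
  {
          if (psd_cpu_lat_timeout_ms)
                  this_cpu_write(psd_deadline, jiffies +
                                 msecs_to_jiffies(psd_cpu_lat_timeout_ms));
  }

  /* menu governor: true while deep C-states should be avoided on this CPU */
  static inline bool psd_deep_sleep_blocked(void)
  {
          return psd_cpu_lat_timeout_ms &&
                 time_before(jiffies, this_cpu_read(psd_deadline));
  }

Since both sides only touch a per-CPU unsigned long there is no locking, which is
the whole point of keeping it jiffies-based.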

>
> >
> > Setting this to zero will disable the mechanism.
> >
> > Signed-off-by: Colin Ian King <colin.king@xxxxxxxxx>
> > ---
> > block/blk-mq.c | 2 +
> > drivers/cpuidle/Kconfig | 10 +++
> > drivers/cpuidle/Makefile | 1 +
> > drivers/cpuidle/governors/menu.c | 4 +
> > drivers/cpuidle/psd.c | 123 +++++++++++++++++++++++++++++++
> > include/linux/cpuidle_psd.h | 32 ++++++++
> > 6 files changed, 172 insertions(+)
> > create mode 100644 drivers/cpuidle/psd.c create mode 100644
> > include/linux/cpuidle_psd.h
> >