Re: [PATCH] cpuidle: psd: add power sleep demotion prevention for fast I/O devices

From: Christian Loehle
Date: Mon Mar 03 2025 - 17:25:16 EST


On 3/3/25 16:43, Colin Ian King wrote:
> Modern processors can drop into deep sleep states relatively quickly
> to save power. However, coming out of deep sleep states takes a small
> amount of time and this is detrimental to performance for I/O devices
> such as fast PCIe NVMe drives when servicing completed I/O
> transactions.
>
> Testing with fio on read/write RAID0 PCIe NVMe devices on various
> modern SMP-based systems (such as a 96-thread Granite Rapids Xeon
> 6741P) has shown that 85-90% of read/write transactions issued on a
> CPU are completed by the same CPU, so it makes some sense to prevent
> the CPU from dropping into a deep sleep state to help reduce I/O
> handling latency.

For the platform you tested on that may be true, but even if we constrain
ourselves to pci-nvme there's a variety of queue/irq mappings where
this doesn't hold, I'm afraid.

>
> This commit introduces a simple, lightweight and fast power sleep
> demotion mechanism that provides the block layer a way to inform the
> menu governor to prevent a CPU from going into a deep sleep when an
> I/O operation is requested. While it is true that some I/Os may not

s/requested/completed/ is the full truth, isn't it?

> be serviced on the same CPU that issued the I/O request, and hence
> the mechanism is not 100% perfect, it works well for the vast
> majority of I/O operations and the sleep demotion prevention adds
> very little overhead.
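
For anyone following along: a mechanism like this presumably boils down
to a per-CPU deadline that the block layer stamps on I/O activity and
the governor consults before picking a state. A minimal sketch with
invented names (the patch's real code is in drivers/cpuidle/psd.c and
include/linux/cpuidle_psd.h per the diffstat below):

#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/types.h>

/* Sketch only: these names are illustrative, not the patch's API. */
static DEFINE_PER_CPU(unsigned long, psd_block_until);	/* in jiffies */
static unsigned int psd_timeout_ms = 3;			/* sysfs tunable */

/* Block layer side: stamp a fresh deadline on I/O activity. */
void psd_io_activity(void)
{
	if (psd_timeout_ms)
		this_cpu_write(psd_block_until,
			       jiffies + msecs_to_jiffies(psd_timeout_ms));
}

/* Governor side: is this CPU still inside its blocking window? */
bool psd_sleep_demotion_blocked(void)
{
	return psd_timeout_ms &&
	       time_before(jiffies, this_cpu_read(psd_block_until));
}

If it really is just a per-CPU word write on the I/O path, that would
at least back up the "very small overhead" claim.
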
>
> Test results on a 96-thread Xeon 6741P with a 6-way RAID0 PCIe NVMe
> md array using fio 3.35 performing random read and read-write tests
> on a 512GB file with 8 concurrent I/O jobs. Tested with the
> NHM_C1_AUTO_DEMOTE bit of MSR_PKG_CST_CONFIG_CONTROL set in the BIOS.
>
> Test case: random reads, results based on geometric mean of results
> from 5 test runs:
>
>                 Bandwidth     IO-ops   Latency   Bandwidth
>           read (bytes/sec)   per sec      (ns)   % Std.Deviation
> Baseline:      21365755610     20377    390105   1.86%
> Patched:       25950107558     24748    322905   0.16%

What is the baseline?
Do you mind trying with Rafael's recently posted series?
Given the IOPS I'd expect good results from that alone already.
https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@xxxxxxxxxxxxx/

(Happy to see teo as comparison too, which you don't modify).

>
> Read rate improvement of ~21%.
>
> Test case: random read+writes, results based on geometric mean of
> results from 5 test runs:
>
>                 Bandwidth     IO-ops   Latency   Bandwidth
>           read (bytes/sec)   per sec      (ns)   % Std.Deviation
> Baseline:       9937848224      9477    550094   1.04%
> Patched:       10502592508     10016    509315   1.85%
>
> Read rate improvement of ~5.7%.
>
>                 Bandwidth     IO-ops   Latency   Bandwidth
>          write (bytes/sec)   per sec      (ns)   % Std.Deviation
> Baseline:       9945197656      9484    288933   1.02%
> Patched:       10517268400     10030    287026   1.85%
>
> Write rate improvement of ~5.7%.
>
> For kernel builds, where all CPUs are fully loaded, no performance
> improvements or regressions were observed, based on the results of
> 5 kernel build test runs.
>
> By default, CPU power sleep demotion blocking is applied for 3 ms
> after each I/O request, but this can be modified using the new
> sysfs interface:
>
> /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms

Rounding up to a jiffie sure is a heavy price to pay then.
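
To make that concrete: assuming the timeout is converted with
msecs_to_jiffies() (which rounds up), the effective blocking window
depends heavily on HZ:

  HZ=1000: msecs_to_jiffies(3) == 3 jiffies -> ~3 ms, as intended
  HZ=250:  msecs_to_jiffies(3) == 1 jiffy   -> up to 4 ms
  HZ=100:  msecs_to_jiffies(3) == 1 jiffy   -> up to 10 ms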

>
> Setting this to zero will disable the mechanism.
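
For example, disabling it at runtime:

  echo 0 > /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
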
>
> Signed-off-by: Colin Ian King <colin.king@xxxxxxxxx>
> ---
> block/blk-mq.c                   |   2 +
> drivers/cpuidle/Kconfig          |  10 +++
> drivers/cpuidle/Makefile         |   1 +
> drivers/cpuidle/governors/menu.c |   4 +
> drivers/cpuidle/psd.c            | 123 +++++++++++++++++++++++++++++++
> include/linux/cpuidle_psd.h      |  32 ++++++++
> 6 files changed, 172 insertions(+)
> create mode 100644 drivers/cpuidle/psd.c
> create mode 100644 include/linux/cpuidle_psd.h
>
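
Since menu.c only gains four lines in the diffstat above, I assume the
governor-side hook is little more than a clamp on the selected state,
something like this sketch (reusing the invented
psd_sleep_demotion_blocked() helper from above, not the patch's actual
code):

/* Sketch: cap the chosen state while the post-I/O window is open. */
static int psd_clamp_state(int idx)
{
	/* index 0 is the shallowest state; refuse anything deeper */
	return psd_sleep_demotion_blocked() ? 0 : idx;
}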