Re: [External] Re: How to Avoid Starving the Kernel When Using SSE

From: Atish Patra

Date: Thu Nov 20 2025 - 04:09:29 EST

On 11/18/25 12:51 AM, Clément Léger wrote:

On 11/18/25 09:46, Zhanpeng Zhang wrote:

Hi Clément,

It seems that your patch is based on the PMU-ctr-delegation version. However, I'm using PMU-SBI and the SSE extension of v8.
So I made the corresponding modifications in `pmu_sbi_ovf_handler`, mirroring yours in `rvpmu_ovf_handler`.

Hi Zhanpeng,

Indeed, my modifications were based on Atish's rvpmu series.

This indeed prevents the kernel from hanging during perf sampling, and the sampling results look good. But according to the debug results, I think the problem with this approach is quite obvious. After this modification, most `pmu_start` operations are triggered by `event_sched_in` rather than do_sse. That is to say, `pmu_start` is delayed until `event_sched_in`.

That's the expected result, the point being that the IRQ rate should be
throttled. Depending on how slow your platform is, overflows might kick
in sooner, leading to more throttling and thus a delayed counter start.
I do think this solution is the one that should be implemented, since the
return value of perf_event_overflow() is exactly meant for
that (i.e. throttling IRQs).
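
A minimal userspace sketch of the throttling decision described above. It is not kernel code: `model_event_overflow()` is a hypothetical stand-in that only mimics the "nonzero means throttle" convention of perf_event_overflow(), the per-tick budget is an invented simplification, and `model_sched_in()` merely stands for the later point (such as `event_sched_in`) at which the counter is started again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-event state; the real perf core keeps far more. */
struct model_event {
	uint32_t samples_this_tick;    /* overflows seen since the last reset */
	uint32_t max_samples_per_tick; /* assumed budget before throttling */
	bool counter_running;
};

/* Stand-in for perf_event_overflow(): nonzero return means "throttle". */
static int model_event_overflow(struct model_event *ev)
{
	ev->samples_this_tick++;
	return ev->samples_this_tick > ev->max_samples_per_tick;
}

/* Overflow handler: skip the restart only when throttled under SSE. */
static void model_ovf_handler(struct model_event *ev, bool from_sse)
{
	int throttled;

	assert(ev->max_samples_per_tick > 0);
	ev->counter_running = false;	/* counter is stopped on overflow */
	throttled = model_event_overflow(ev);
	if (!throttled || !from_sse)
		ev->counter_running = true;
}

/* Stand-in for the later restart point: budget reset, counter started. */
static void model_sched_in(struct model_event *ev)
{
	ev->samples_this_tick = 0;
	ev->counter_running = true;
}
```

With a budget of two samples, the third SSE-delivered overflow leaves the counter stopped until `model_sched_in()` runs, which matches the delayed `pmu_start` observed in the debug results.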

While that works fine for the events that already overflowed, what happens to a counter that was about to overflow but was stopped because the SSE event handler was called?

In that case, you may miss those events until the process is scheduled again. Depending on the frequency of the event that may be okay or not.

We also need to think about whether we need to handle such scenarios for other SSE events. For example, continuously triggered correctable RAS errors can end up in the same situation.

Should we provide a generic sysfs mechanism for users to disable this SSE event? By default, the SSE event would not be throttled, but a user could opt in to throttling by writing to sysfs.

Clément

Actually, I'm also planning to try Atish's PMU-ctr-delegation extension; such a delegation optimization seems helpful for alleviating the hang. But I think this is no longer the same issue, as it requires comprehensive modifications to the hardware/qemu, kernel, SBI, and perf itself.

Regards,
Zhanpeng

From: "Clément Léger"<cleger@xxxxxxxxxxxx>
Date: Fri, Nov 14, 2025, 23:41
Subject: [External] Re: How to Avoid Starving the Kernel When Using SSE
To: "张展鹏"<zhangzhanpeng.jasper@xxxxxxxxxxxxx>
Cc: "Paul Walmsley"<paul.walmsley@xxxxxxxxxx>, "Palmer Dabbelt"<palmer@xxxxxxxxxxx>, "linux-riscv@xxxxxxxxxxxxxxxxxxx"<linux-riscv@xxxxxxxxxxxxxxxxxxx>, "linux-kernel@xxxxxxxxxxxxxxx"<linux-kernel@xxxxxxxxxxxxxxx>, "linux-arm-kernel@xxxxxxxxxxxxxxxxxxx"<linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>, "Himanshu Chauhan"<hchauhan@xxxxxxxxxxxxxxxx>, "Anup Patel"<apatel@xxxxxxxxxxxxxxxx>, "路旭"<luxu.kernel@xxxxxxxxxxxxx>, "Atish Patra"<atishp@xxxxxxxxxxxxxx>, "Björn Töpel"<bjorn@xxxxxxxxxxxx>, "崔运辉"<cuiyunhui@xxxxxxxxxxxxx>, "元竹"<yuanzhu@xxxxxxxxxxxxx>
On 11/14/25 14:36, Clément Léger wrote:

Hi Zhanpeng,

On 11/14/25 11:24, 张展鹏 wrote:

Hi Clément,

Lately, I've been thinking about how to avoid starving the kernel when
using SSE:

SSE is powered by M-mode irqs, such as the M-mode PMU irq for perf sampling
and the M-mode IPI for inter-hart injection. Meanwhile, the kernel is
powered by S-mode irqs, so the kernel may experience starvation when there
is a flood of M-mode irqs, and the kernel may cause such a flood of M-mode
irqs when using SSE, either deliberately or inadvertently:

1. Malicious SSE handler: The kernel may deliberately register a bad SSE
handler, which triggers a new inter-hart SSE request via ecall. This will
cause an endless loop of SSE, rendering the kernel unresponsive. In this
case, the only thing SBI can do is to prevent the nesting of SSE in
`sbi_trap_handler` and ensure that SSE events are executed in priority
order.

That seems quite convoluted. Anyone that can load a module can do worse
than crashing the kernel :)

2. Perf sampling: The kernel may inadvertently choose a bad parameter for
perf, which causes PMU irqs to occur too frequently. Continuous PMU irqs
will leave the system with no time to respond to S-mode irqs.

This concern, however, is valid!

Hence, I think we should improve the SSE framework to avoid
starving the kernel so easily.

Here is a case study of perf sampling:

When using PMU-SSE for Perf sampling, the kernel may hang and become
unresponsive due to the PMU-SSE loop. Once we start to process a Perf
sampling using PMU-SSE, the kernel may fail to respond to `Ctrl+C` or fail
to exit after the timing of `sleep 1` completes (these are the two most
commonly used time-based sampling methods in perf).

By default, perf uses a relatively high sampling frequency, namely
`perf_event_max_sample_rate`, and will adjust it on demand if sampling
takes too much time. If this frequency/period goes beyond what the system
can handle, SSE events will run back-to-back, and the system will get
stuck in an endless loop of "SSE → PMU interrupt → SSE". The kernel is then
starved (at this point, if you print the `sepc` of SSE completion, you will
find that the `sepc` remains unchanged each time, indicating that the
kernel is stuck), and the kernel can never escape from this loop of
PMU-SSE, because it can neither respond to Ctrl+C interrupts nor adjust the
sampling frequency.

Current solution: The key to this problem is that every time we finish
sse_complete, there is already a new PMU irq pending. Then we resume the
kernel execution via mret, and the system will immediately trap back into
SSE.

The PMU-SSE-Perf processing flow includes the following steps: `sse_inject`
(mret to SSE handler), `pmu_stop` (clear PMU pending bit), `pmu_start` (set
a new value for PMU counter), and `sse_complete` (resume execution at the
point where the kernel was interrupted). The reason why the kernel traps
right after `sse_complete` is that a new PMU irq is generated between
`pmu_start` and `sse_complete`.
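
A back-of-the-envelope model of the flow above, as a hedged sketch rather than a measurement. It assumes one sampling counter that counts S-mode cycles with a fixed period, and a fixed number of S-mode cycles spent between `pmu_start` and `sse_complete`; the struct, field, and function names are all hypothetical. When that tail cost is at least one period, the interrupted code makes no progress, which is the unchanged `sepc` symptom.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model state, all quantities in counted S-mode cycles. */
struct sse_model {
	uint64_t period;	/* cycles until overflow after pmu_start */
	uint64_t tail_cycles;	/* cycles from pmu_start to sse_complete */
	uint64_t sepc;		/* progress of the interrupted kernel code */
};

/*
 * One round: sse_inject -> pmu_stop -> pmu_start -> sse_complete -> resume.
 * Returns how many cycles the interrupted code ran before the next overflow.
 */
static uint64_t sse_round(struct sse_model *m)
{
	uint64_t progress;

	assert(m->period > 0);
	/* The counter already runs while the handler is finishing. */
	progress = m->tail_cycles >= m->period ? 0 : m->period - m->tail_cycles;
	m->sepc += progress;
	return progress;
}

/* Run several rounds and report the total progress of the interrupted code. */
static uint64_t sse_run(struct sse_model *m, unsigned int rounds)
{
	uint64_t start = m->sepc;

	for (unsigned int i = 0; i < rounds; i++)
		sse_round(m);
	return m->sepc - start;
}
```

For example, with an assumed tail cost of 1500 cycles, a period of 1000 cycles yields zero progress no matter how many rounds run, while a period of 10000 cycles leaves 8500 cycles of progress per sample.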

In order to address this issue, we propose to delay the procedure of
re-starting the overflowed PMU counter during PMU-SSE. When the kernel
triggers an ecall to restart the overflowed PMU counters, SBI can check
whether it is SSE-powered PMU handling. If so, we temporarily modify the
mhpmevent CSR to stop counting kernel events. In this process, M-mode
events are always inhibited, and U-mode code will not be executed during
the `pmu_sbi_ovf_handler`, so we only need to inhibit the counting of
kernel events.

I'd rather let the kernel control the PMU SSE event delivery by masking
it at the end of the SSE handler and reenabling it later. Additionally,
with that solution living in the SBI itself, there is no guarantee that
all SBI implementations will actually do that correctly.

I agree. The solution has to be implemented within the Linux kernel rather than in the SBI specification, as it is a problem of the SSE event handler and its user (in this case, an incorrect sampling rate is used).

What seems odd is that the perf_sample_event_took() call at the end of
each PMU event handler should actually allow the perf subsystem to
throttle the rate. I'll take another look at that part to make sure it
works as expected and that we aren't missing any bits.

Hey Zhanpeng,

Could you try applying this quick'n'dirty patch on top of the SSE series
and check whether it still hangs?

diff --git a/drivers/perf/riscv_pmu_dev.c b/drivers/perf/riscv_pmu_dev.c
index 7dec9c2afa9b..0fb8749c476f 100644
--- a/drivers/perf/riscv_pmu_dev.c
+++ b/drivers/perf/riscv_pmu_dev.c
@@ -1326,6 +1326,7 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
 	int lidx, hidx, fidx;
 	struct riscv_pmu *pmu;
 	struct perf_event *event;
+	int ev_overflow = 0;
 	u64 overflow;
 	u64 overflowed_ctrs = 0;
 	u64 start_clock = sched_clock();
@@ -1423,13 +1424,15 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
 			 * TODO: We will need to stop the guest counters once
 			 * virtualization support is added.
 			 */
-			perf_event_overflow(event, &data, regs);
+			ev_overflow |= perf_event_overflow(event, &data, regs);
 		}
 		/* Reset the state as we are going to start the counter after the loop */
 		hw_evt->state = 0;
 	}

-	rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
+	if (!ev_overflow || !from_sse)
+		rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
+
 	perf_sample_event_took(sched_clock() - start_clock);

 	return IRQ_HANDLED;

Thanks,
Clément

In this way, we can ensure that `pmu_sbi_ovf_handler` will not be
re-entered by a new PMU-SSE, and minimize the modification of perf logic.
The price is that we give up sampling a small portion of kernel code (from
`pmu_ctr_start` to the end of `pmu_sbi_ovf_handler`), and we probably need
a new parameter in `pmu_ctr_start`.

Looking forward to your suggestions. Thanks!

Best regards,
Zhanpeng Zhang
