Re: [External] Re: How to Avoid Starving the Kernel When Using SSE
From: Atish Patra
Date: Thu Nov 20 2025 - 04:09:29 EST
On 11/18/25 12:51 AM, Clément Léger wrote:
On 11/18/25 09:46, Zhanpeng Zhang wrote:
Hi Clément,
It seems that your patch is based on the PMU-ctr-delegation version. However, I'm using PMU-SBI and the SSE extension of v8.
So I made the same modifications in `pmu_sbi_ovf_handler` that you made in `rvpmu_ovf_handler`.
Hi Zhanpeng,
Indeed, my modifications were based on Atish's rvpmu series.
This indeed prevents the kernel from hanging during perf sampling, and the sampling results look good. But according to the debug results, I think the problem with this approach is quite obvious. After this modification, most `pmu_start` operations are triggered by `event_sched_in` rather than do_sse. That is to say, `pmu_start` is delayed until `event_sched_in`.
That's the expected result, the point being that the IRQ rate should be
throttled. Depending on how slow your platform is, overflows might kick
in faster, leading to more throttling and thus a delayed start of the counter.
I do think this solution is the one that should be implemented since the
overflow return value from perf_event_overflow() is exactly meant for
that (i.e., throttling IRQs).
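As a minimal sketch of that decision, assuming a simplified model rather than the actual driver code: `should_restart_counters` is a hypothetical helper, `throttled` stands for the OR of the perf_event_overflow() return values collected in the handler, and `from_sse` says whether the handler was entered through an SSE event.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the kernel: decide whether the
 * overflow handler should restart the overflowed counters itself.
 *
 * throttled: non-zero if any perf_event_overflow() call reported
 *            that the event should be throttled (modelled as an int).
 * from_sse:  true if the handler was entered through an SSE event
 *            rather than a regular S-mode interrupt.
 */
static bool should_restart_counters(int throttled, bool from_sse)
{
	/* Only the SSE path with a throttle request defers the restart. */
	return !throttled || !from_sse;
}
```

In this model the restart is skipped only when both conditions hold, so the regular interrupt path behaves as before and the deferred counters are left for a later scheduling point to start again.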
While that works fine for the events that have already overflowed, what happens to a counter that was about to overflow but was stopped because the SSE event handler was called?
In that case, you may miss those events until the process is scheduled again. Depending on the frequency of the event, that may or may not be acceptable.
We also need to think about whether we need to handle such scenarios for other SSE events. For example, continuously triggered correctable RAS errors can end up in the same situation.
Should we provide a generic sysfs mechanism for users to disable this SSE event? By default, SSE events would not be throttled, but a user could opt in to throttling by writing to sysfs.
Clément
Actually, I'm also planning to try Atish's PMU-ctr-delegation extension; such a delegation optimization seems likely to help alleviate the hanging problem. But I think this is no longer the same issue, as it requires comprehensive modifications to the hardware/QEMU, kernel, SBI, and perf itself.
Regards,
Zhanpeng
From: "Clément Léger"<cleger@xxxxxxxxxxxx>
Date: Fri, Nov 14, 2025, 23:41
Subject: [External] Re: How to Avoid Starving the Kernel When Using SSE
To: "张展鹏"<zhangzhanpeng.jasper@xxxxxxxxxxxxx>
Cc: "Paul Walmsley"<paul.walmsley@xxxxxxxxxx>, "Palmer Dabbelt"<palmer@xxxxxxxxxxx>, "linux-riscv@xxxxxxxxxxxxxxxxxxx"<linux-riscv@xxxxxxxxxxxxxxxxxxx>, "linux-kernel@xxxxxxxxxxxxxxx"<linux-kernel@xxxxxxxxxxxxxxx>, "linux-arm-kernel@xxxxxxxxxxxxxxxxxxx"<linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>, "Himanshu Chauhan"<hchauhan@xxxxxxxxxxxxxxxx>, "Anup Patel"<apatel@xxxxxxxxxxxxxxxx>, "路旭"<luxu.kernel@xxxxxxxxxxxxx>, "Atish Patra"<atishp@xxxxxxxxxxxxxx>, "Björn Töpel"<bjorn@xxxxxxxxxxxx>, "崔运辉"<cuiyunhui@xxxxxxxxxxxxx>, "元竹"<yuanzhu@xxxxxxxxxxxxx>
On 11/14/25 14:36, Clément Léger wrote:
Hi Zhanpeng,
On 11/14/25 11:24, 张展鹏 wrote:
Hi Clément,
Lately, I've been thinking about how to avoid starving the kernel when
using SSE:
SSE is powered by M-mode irqs such as M-mode PMU irq for perf sampling and
M-mode IPI for inter-hart injection. Meanwhile, kernel is powered by S-mode
irqs, so the kernel may experience starvation when there is a flood of
M-mode irqs, and kernel may cause such flooding of M-mode irqs when using
SSE, either deliberately or inadvertently:
1. Malicious SSE handler: Kernel may deliberately register a bad SSE
handler, which triggers a new inter-hart SSE request via ecall. This will
cause an endless loop of SSE, rendering the kernel unresponsive. In this
case, the only thing SBI can do is to prevent the nesting of SSE in
`sbi_trap_handler` and ensure that SSE events are executed in priority
order.
That seems quite convoluted. Anyone that can load a module can do worse
than crashing the kernel :)
2. Perf sampling: Kernel may inadvertently choose a bad parameter for
Perf, which causes PMU irqs to occur too frequently. Continuous PMU irqs
will leave the system with no time to respond to S-mode irqs.
This concern, however, is valid!
Hence, I think we are supposed to improve the SSE framework to avoid
starving the kernel so easily.
Here is a case study of perf sampling:
When using PMU-SSE for Perf sampling, the kernel may hang and become
unresponsive due to the PMU-SSE loop. Once we start to process a Perf
sampling using PMU-SSE, the kernel may fail to respond to `Ctrl+C` or fail
to exit after the timing of `sleep 1` completes (these are the two most
commonly used time-based sampling methods in perf).
By default, perf uses a relatively high sampling frequency, namely
`perf_event_max_sample_rate`, and will adjust it on demand if sampling
takes too much time. If this frequency/period goes beyond what the system can
handle, SSE events will run back-to-back, and the system will get
stuck in an endless loop of "SSE → PMU interrupt → SSE". The kernel is then
starved (at this point, if you print the `sepc` at SSE completion, you will
find that the `sepc` remains unchanged each time, indicating that the
kernel is stuck), and the kernel can never escape from this loop of
PMU-SSE, because it can neither respond to Ctrl+C interrupts nor adjust the
sampling frequency.
Current solution: The key to this problem is that every time we finish
sse_complete, there is already a new PMU irq pending. Then we resume the
kernel execution via mret, and the system will immediately trap back into
SSE.
The PMU-SSE-Perf processing flow includes the following steps: `sse_inject`
(mret to SSE handler), `pmu_stop` (clear PMU pending bit), `pmu_start` (set
a new value for PMU counter), and `sse_complete` (resume execution to the
point where the kernel was interrupted). The reason why kernel traps right
after `sse_complete` is that there is a new PMU irq generated between
`pmu_start` and `sse_complete`.
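The timing argument above can be sketched with a toy model. This is an illustration under stated assumptions, not kernel or SBI code: `kernel_progress_cycles`, `period`, and `tail` are hypothetical names, the counter is assumed to keep counting during the handler tail, and time spent in M-mode is ignored.

```c
#include <stdint.h>

/*
 * Toy model (not kernel code): counted cycles left for the interrupted
 * kernel context between two PMU-SSE events.
 *
 * period: counted cycles until the restarted counter overflows again.
 * tail:   counted cycles spent between pmu_start and sse_complete.
 */
static uint64_t kernel_progress_cycles(uint64_t period, uint64_t tail)
{
	/* Overflow is already pending at sse_complete: immediate re-trap. */
	if (period <= tail)
		return 0;

	/* Otherwise the kernel runs until the counter overflows again. */
	return period - tail;
}
```

Once the sampling period is no longer than the handler tail, the model returns zero on every iteration, which corresponds to the unchanged `sepc` described above.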
In order to address this issue, we propose to delay restarting the
overflowed PMU counter during PMU-SSE. When the kernel triggers
an ecall to restart the overflowed PMU counters, SBI can check whether the
request comes from SSE-powered PMU handling. If so, we temporarily modify the
mhpmevent CSR to stop counting kernel events. In this process, M-mode events
are always inhibited, and U-mode code will not be executed during the
`pmu_sbi_ovf_handler`, so we only need to inhibit the counting of kernel
events.
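A minimal sketch of the bit manipulation this would involve, assuming the Sscofpmf layout in which bit 61 of mhpmevent (SINH) inhibits counting in S-mode. The helper names are hypothetical, and the code operates on a plain 64-bit value instead of accessing the CSR.

```c
#include <stdint.h>

/* Sscofpmf filter bit: inhibit counting while the hart is in S-mode. */
#define MHPMEVENT_SINH (UINT64_C(1) << 61)

/* Hypothetical helper: return the event value with S-mode counting inhibited. */
static uint64_t mhpmevent_inhibit_smode(uint64_t evt)
{
	return evt | MHPMEVENT_SINH;
}

/* Hypothetical helper: return the event value with S-mode counting re-enabled. */
static uint64_t mhpmevent_allow_smode(uint64_t evt)
{
	return evt & ~MHPMEVENT_SINH;
}
```

An SBI implementation following the proposal would apply the first helper when restarting a counter from the PMU-SSE path and the second once the handler has completed; when exactly to re-enable is the open design question in this thread.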
I'd rather let the kernel control the PMU SSE event delivery by masking
it at the end of the SSE handler and re-enabling it later. Additionally,
since that solution lives in the SBI itself, there is no guarantee that all
SBI implementations will actually do that correctly.
I agree. The solution has to be implemented within the Linux kernel rather than in the SBI specification, as it is a problem of the SSE event handler and its user (in this case, one where an incorrect sampling rate is used).
What seems odd is that the perf_sample_event_took() call at the end of
the PMU event handler should actually allow the perf subsystem to throttle
the rate. I'll take another look at that part to make sure it works as
expected and that we aren't missing any bits.
Hey Zhanpeng,
Could you try to apply this quick'n'dirty patch on top of the SSE series
and check whether it still hangs?
diff --git a/drivers/perf/riscv_pmu_dev.c b/drivers/perf/riscv_pmu_dev.c
index 7dec9c2afa9b..0fb8749c476f 100644
--- a/drivers/perf/riscv_pmu_dev.c
+++ b/drivers/perf/riscv_pmu_dev.c
@@ -1326,6 +1326,7 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
 	int lidx, hidx, fidx;
 	struct riscv_pmu *pmu;
 	struct perf_event *event;
+	int ev_overflow = 0;
 	u64 overflow;
 	u64 overflowed_ctrs = 0;
 	u64 start_clock = sched_clock();
@@ -1423,13 +1424,15 @@ static irqreturn_t rvpmu_ovf_handler(struct cpu_hw_events *cpu_hw_evt,
 			 * TODO: We will need to stop the guest counters once
 			 * virtualization support is added.
 			 */
-			perf_event_overflow(event, &data, regs);
+			ev_overflow |= perf_event_overflow(event, &data, regs);
 		}
 		/* Reset the state as we are going to start the counter after the loop */
 		hw_evt->state = 0;
 	}
-	rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
+	if (!ev_overflow || !from_sse)
+		rvpmu_start_overflow_mask(pmu, overflowed_ctrs);
+
 	perf_sample_event_took(sched_clock() - start_clock);
 	return IRQ_HANDLED;
Thanks,
Clément
In this way, we can ensure that `pmu_sbi_ovf_handler` will not be
re-entered by a new PMU-SSE, and minimize the modification of perf logic.
The price is that we give up sampling a small portion of kernel code (from
`pmu_ctr_start` to the end of `pmu_sbi_ovf_handler`), and we probably need
a new parameter in `pmu_ctr_start`.
Looking forward to your suggestions. Thanks!
Best regards,
Zhanpeng Zhang