Re: [PATCH] perf/x86/intel: Restrict period on Haswell
From: Thomas Gleixner
Date: Wed Jul 31 2024 - 15:20:59 EST
On Tue, Jul 30 2024 at 06:33, Li Huafei wrote:
> On my Haswell machine, running the ltp test cve-2015-3290 concurrently
> reports the following warnings:
>
> perfevents: irq loop stuck!
> WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174 intel_pmu_handle_irq+0x285/0x370
> CPU: 31 UID: 0 PID: 32438 Comm: cve-2015-3290 Kdump: loaded Tainted: G S W 6.11.0-rc1+ #3
> ...
> Call Trace:
> <NMI>
> ? __warn+0xa4/0x220
> ? intel_pmu_handle_irq+0x285/0x370
> ? __report_bug+0x123/0x130
> ? intel_pmu_handle_irq+0x285/0x370
> ? __report_bug+0x123/0x130
> ? intel_pmu_handle_irq+0x285/0x370
> ? report_bug+0x3e/0xa0
> ? handle_bug+0x3c/0x70
> ? exc_invalid_op+0x18/0x50
> ? asm_exc_invalid_op+0x1a/0x20
> ? irq_work_claim+0x1e/0x40
> ? intel_pmu_handle_irq+0x285/0x370
> perf_event_nmi_handler+0x3d/0x60
> nmi_handle+0x104/0x330
> ? ___ratelimit+0xe4/0x1b0
> default_do_nmi+0x40/0x100
> exc_nmi+0x104/0x180
> end_repeat_nmi+0xf/0x53
> ...
> ? intel_pmu_lbr_enable_all+0x2a/0x90
> ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
> ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
> perf_ctx_enable+0x8e/0xc0
> __perf_install_in_context+0x146/0x3e0
> ? __pfx___perf_install_in_context+0x10/0x10
> remote_function+0x7c/0xa0
> ? __pfx_remote_function+0x10/0x10
> generic_exec_single+0xf8/0x150
> smp_call_function_single+0x1dc/0x230
> ? __pfx_remote_function+0x10/0x10
> ? __pfx_smp_call_function_single+0x10/0x10
> ? __pfx_remote_function+0x10/0x10
> ? lock_is_held_type+0x9e/0x120
> ? exclusive_event_installable+0x4f/0x140
> perf_install_in_context+0x197/0x330
> ? __pfx_perf_install_in_context+0x10/0x10
> ? __pfx___perf_install_in_context+0x10/0x10
> __do_sys_perf_event_open+0xb80/0x1100
> ? __pfx___do_sys_perf_event_open+0x10/0x10
> ? __pfx___lock_release+0x10/0x10
> ? lockdep_hardirqs_on_prepare+0x135/0x200
> ? ktime_get_coarse_real_ts64+0xee/0x100
> ? ktime_get_coarse_real_ts64+0x92/0x100
> do_syscall_64+0x70/0x180
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> ...
Please trim the backtrace to something useful:
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces
> My machine has 32 physical cores, each with two logical cores. During
> testing, it executes the CVE-2015-3290 test case 100 times concurrently.
>
> This warning was already reported in [1], and a patch was posted there to
> limit the period to 128 on Haswell, but that patch was not merged into
> mainline. In [2] the period on Nehalem was limited to 32. I tested limits
> of 16 and 32 on my machine: the problem still reproduced with a limit of
> 16, but not with a limit of 32. It looks like we can limit the period to
> 32 on Haswell as well.
It looks like? Either it works or not.
>
> +static void hsw_limit_period(struct perf_event *event, s64 *left)
> +{
> +	*left = max(*left, 32LL);
> +}
And why do we need a copy of nhm_limit_period()?
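To illustrate the point, here is a minimal user-space sketch, not kernel code. It assumes only what the quoted hunk shows, namely that the helper clamps the remaining period to at least 32. The names model_pmu, model_limit_period, model_nhm, model_hsw and model_apply are hypothetical stand-ins, and the event argument of the real callback is left out for brevity. It shows two models sharing one callback instead of each carrying an identical copy:

```c
#include <stdint.h>

typedef int64_t s64;

/* Hypothetical stand-in for a per-model PMU description. */
struct model_pmu {
	/* Simplified callback: the real one also takes the event. */
	void (*limit_period)(s64 *left);
};

/* Same clamp as in the quoted hunk: never go below a period of 32. */
static void model_limit_period(s64 *left)
{
	if (*left < 32)
		*left = 32;
}

/* Both models point at the one helper; no second copy is needed. */
static const struct model_pmu model_nhm = { .limit_period = model_limit_period };
static const struct model_pmu model_hsw = { .limit_period = model_limit_period };

/* Apply a model's limit (if any) to a requested period. */
static s64 model_apply(const struct model_pmu *pmu, s64 left)
{
	if (pmu->limit_period)
		pmu->limit_period(&left);
	return left;
}
```

With this model, model_apply(&model_hsw, 16) yields 32 and model_apply(&model_hsw, 1000) is left at 1000, exactly as for model_nhm, because both use the same function.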
Thanks,
tglx