Re: [perf] perf_fuzzer causes crash in intel_pmu_drain_pebs_nhm()

From: Liang, Kan
Date: Mon Mar 01 2021 - 08:22:54 EST




On 2/11/2021 9:53 AM, Peter Zijlstra wrote:

Kan, do you have time to look at this?

On Thu, Jan 28, 2021 at 02:49:47PM -0500, Vince Weaver wrote:
On Thu, 28 Jan 2021, Vince Weaver wrote:

the perf_fuzzer has turned up a repeatable crash on my haswell system.

addr2line is not being very helpful, it points to DECLARE_PER_CPU_FIRST.
I'll investigate more when I have the chance.

so I poked around some more.

This seems to be caused in

__intel_pmu_pebs_event()
    get_next_pebs_record_by_bit()    ds.c line 1639
        get_pebs_status(at)          ds.c line 1317
            return ((struct pebs_record_nhm *)n)->status;

where "n" has the value of 0xc0 rather than a proper pointer.


I think I've found the suspicious patch.
The commit ID is 01330d7288e00 ("perf/x86: Allow zero PEBS status with only single active event")
https://lore.kernel.org/lkml/tip-01330d7288e0050c5aaabc558059ff91589e67cd@xxxxxxxxxxxxxx/
The patch is a SW workaround for some old CPUs (HSW and earlier), which may set the PEBS status to 0. It adds a check in intel_pmu_drain_pebs_nhm() that tries to minimize the impact of the defect by not dropping PEBS records that have PEBS status 0.
However, it doesn't correct the PEBS status, which may cause problems,
especially for large PEBS.
It's possible that all the PEBS records in a large PEBS have PEBS status 0. If so, the first get_next_pebs_record_by_bit() in __intel_pmu_pebs_event() returns NULL, so at = NULL. Since it's a large PEBS, the 'count' parameter must be > 1, so get_next_pebs_record_by_bit() is called a second time, starting from the NULL result, and crashes.
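
To illustrate the failure mode, here is a minimal user-space sketch in C. It is not the kernel code: struct fake_pebs_record, next_record_by_bit() and emit_records() are hypothetical stand-ins I made up for this mail, and the record layout is reduced to just a status field. It only models the case where every record has status 0, so the first lookup returns NULL even though the caller expects count > 1 records. The sketch stops as soon as the lookup returns NULL instead of trusting count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a PEBS record; only status matters. */
struct fake_pebs_record {
	uint64_t status;
};

/*
 * Simplified model of get_next_pebs_record_by_bit(): return the first
 * record in [at, top) whose status has 'bit' set, or NULL if none does.
 */
static struct fake_pebs_record *
next_record_by_bit(struct fake_pebs_record *at,
		   struct fake_pebs_record *top, int bit)
{
	if (at == NULL)
		return NULL;	/* guard added by this sketch */

	for (; at < top; at++) {
		if (at->status & (1ULL << bit))
			return at;
	}
	return NULL;
}

/*
 * Walk the buffer for 'bit'. 'count' is how many records the caller
 * believes exist; the return value is how many were actually found.
 * The loop ends on NULL rather than relying on 'count' alone.
 */
static int emit_records(struct fake_pebs_record *base, size_t n,
			int bit, int count)
{
	struct fake_pebs_record *top = base + n;
	struct fake_pebs_record *at;
	int emitted = 0;

	assert(base != NULL);

	at = next_record_by_bit(base, top, bit);
	while (at != NULL && emitted < count) {
		emitted++;	/* a real handler would output the sample here */
		at = next_record_by_bit(at + 1, top, bit);
	}
	return emitted;
}
```

With three records that all have status 0 and count = 3, emit_records() finds nothing and returns 0. A loop driven purely by count would instead keep going with the NULL result, which is the situation described above.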

Could you please revert the patch and check whether it fixes your issue?

Thanks,
Kan