On 24-Jun-24 9:48 PM, George Kennedy wrote:
No worries. Thank you for the update, Ravi,
I was able to reproduce it with passthrough pmu[1] as well on a Zen4 machine
On 6/10/2024 6:51 AM, Ravi Bangoria wrote:
On 6/8/2024 12:43 AM, George Kennedy wrote:
Hi Ravi,
Hi Ravi,
I'm able to reproduce within the KVM guest. Will try to investigate further.
On 6/4/2024 9:40 AM, Ravi Bangoria wrote:
Were you able to reproduce the crash on the AMD machine?
On 6/4/2024 9:16 AM, Ravi Bangoria wrote:
Sure, that would help in future. But for current splat, can you please
We could add a similar WARN_ON_ONCE() to the proposed patch.
There are subtle differences between Intel and AMD pmu implementation.
Also, a similar fix is done in __intel_pmu_enable_all() in arch/x86/events/intel/core.c, except that a WARN_ON_ONCE() is done as well.
It looks like x86_pmu_stop() is clearing the bit in active_mask and setting the events entry to NULL (and doing it in the correct order) for the same events index that amd_pmu_enable_all() is trying to enable.
Events can be deleted and the entry can be NULL.
Can you please also explain "how".
The Syzkaller reproducer can be found in this link:
https://lore.kernel.org/netdev/CAMt6jhyec7-TSFpr3F+_ikjpu39WV3jnCBBGwpzpBrPx55w20g@xxxxxxxxxxxxxx/T/#u
Check event for NULL in amd_pmu_enable_all() before enable to avoid a GPF.
Can you please provide a bug report link? Also, any reproducer?
This appears to be an AMD-only issue.
Syzkaller reported a GPF in amd_pmu_enable_all.
@@ -760,7 +760,8 @@ static void amd_pmu_enable_all(int added)
 		if (!test_bit(idx, cpuc->active_mask))
 			continue;
-		amd_pmu_enable_event(cpuc->events[idx]);
+		if (cpuc->events[idx])
+			amd_pmu_enable_event(cpuc->events[idx]);
What if cpuc->events[idx] becomes NULL after if (cpuc->events[idx]) but
before amd_pmu_enable_event(cpuc->events[idx])?
Good question, but the crash has not reproduced with the proposed fix in hours of testing. It usually reproduces within minutes without the fix.
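
The check-then-use window raised in that question can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct event, struct cpu_events_model, NUM_COUNTERS and enable_all_snapshot() are hypothetical names, not kernel code, and the C11 atomic_load() stands in for what kernel code would typically express with READ_ONCE(). The pointer is loaded once into a local, so the value that is tested is the value that is used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_COUNTERS 6	/* hypothetical counter count for this model */

struct event {
	int enabled;
};

/* Hypothetical, simplified stand-in for the per-CPU event state. */
struct cpu_events_model {
	unsigned long active_mask;			/* bit idx set => slot idx is active */
	_Atomic(struct event *) events[NUM_COUNTERS];	/* may be cleared concurrently */
};

static void enable_event(struct event *ev)
{
	ev->enabled = 1;
}

/* Enable every active slot; returns how many events were enabled. */
static int enable_all_snapshot(struct cpu_events_model *cpuc)
{
	int enabled = 0;

	assert(cpuc != NULL);

	for (int idx = 0; idx < NUM_COUNTERS; idx++) {
		if (!(cpuc->active_mask & (1UL << idx)))
			continue;

		/* Single load: a later store of NULL cannot change 'ev'. */
		struct event *ev = atomic_load(&cpuc->events[idx]);

		if (!ev)
			continue;

		enable_event(ev);
		enabled++;
	}

	return enabled;
}
```

With three active slots of which one holds NULL, enable_all_snapshot() enables the two populated events and skips the third. A single load only closes the gap between the test and the call. It says nothing about whether the event is still valid afterwards, so it would not replace root-causing the race.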
See: https://elixir.bootlin.com/linux/v6.10-rc1/source/arch/x86/events/intel/core.c#L2256
__intel_pmu_enable_all() enables all events with a single WRMSR, whereas
amd_pmu_enable_all() loops over each PMC and enables it individually.
The WARN_ON_ONCE() is important because it will warn about a potential
software bug somewhere else.
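
The contrast between the two enable styles, and the role of a warn-once check, can be modelled in userspace C. This is a sketch only, with loudly stated assumptions: struct pmu_model, warn_on_once(), enable_all_global() and enable_all_per_counter() are hypothetical stand-ins, global_ctrl merely represents a register written in one go, and the warned flag lives in the struct for testability rather than per call site as with the kernel macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_COUNTERS 6	/* hypothetical counter count for this model */

struct event {
	int enabled;
};

/* Hypothetical model of the PMU state; not a kernel structure. */
struct pmu_model {
	unsigned long active_mask;		/* bit idx set => slot idx is active */
	struct event *events[NUM_COUNTERS];
	unsigned long global_ctrl;		/* stands in for a global enable register */
	bool warned;				/* has the warning already been printed? */
	int skipped;				/* active slots skipped because event was NULL */
};

/* Report the first hit only, but return the condition on every call. */
static bool warn_on_once(struct pmu_model *pmu, bool cond, const char *what)
{
	if (cond && !pmu->warned) {
		pmu->warned = true;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Global style: one write covers all active counters, no event is dereferenced. */
static void enable_all_global(struct pmu_model *pmu)
{
	assert(pmu != NULL);
	pmu->global_ctrl = pmu->active_mask;
}

/* Per-counter style: every active slot's event pointer is dereferenced. */
static int enable_all_per_counter(struct pmu_model *pmu)
{
	int enabled = 0;

	assert(pmu != NULL);

	for (int idx = 0; idx < NUM_COUNTERS; idx++) {
		if (!(pmu->active_mask & (1UL << idx)))
			continue;

		struct event *ev = pmu->events[idx];

		/* An active slot without an event points at a bug elsewhere. */
		if (warn_on_once(pmu, ev == NULL, "active counter has no event")) {
			pmu->skipped++;
			continue;
		}

		ev->enabled = 1;
		enabled++;
	}

	return enabled;
}
```

In this model the per-counter loop has one dereference per active slot that can go wrong, while the global write has none. The warning keeps an inconsistent slot visible instead of silently skipping it.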
try to rootcause the underlying race condition?
Any new status?
where the host has PerfMonV2 support (GlobalCtrl etc.) but the guest does not. I've
debugged it to some extent and am seeing some race conditions, but I'm not working
on this as a top priority since it requires root/CAP_PERFMON privileges to
cause a crash. I'll resume the investigation once I get some time. Sorry about
the delay.
[1] https://lore.kernel.org/all/20240506053020.3911940-1-mizhang@xxxxxxxxxx
Thanks,
Ravi