Re: [BUG] resctrl: using smp_processor_id() in preemptible code in __l3_mon_event_count() via mbm_handle_overflow() during CPU hotplug

In reply to: Luck, Tony: "RE: [BUG] resctrl: using smp_processor_id() in preemptible code in __l3_mon_event_count() via mbm_handle_overflow() during CPU hotplug"

From: Reinette Chatre

Date: Thu Jun 11 2026 - 11:22:37 EST

Thanks to Qinyun Tan for doing this stress testing and creating this detailed report.

On 6/11/26 8:02 AM, Luck, Tony wrote:
>> I do not have a good fix in mind. The read in __l3_mon_event_count()
>> fundamentally assumes it runs on a CPU of the domain, but during hotplug
>> the overflow work can be migrated off that CPU; neither the cpus_read_lock()
>> held here nor the existing cpumask_test_cpu() guard addresses the
>> preemptible-context use of smp_processor_id() itself.
>>
>> I would appreciate your guidance on how this should best be addressed.
>>
>> I can provide the full log and a reproducer on request.
>>
>
> Qinyun Tan,
>
> I think this is addressed by this pending patch:
>
> https://lore.kernel.org/all/b5178a191a8a660e1f4aed356484d4eebfbd30fc.1781029125.git.reinette.chatre@xxxxxxxxx/
>
> [At least the scenario seems similar with CPU offline and subsequent unbound run of a worker]

Indeed. That patch modifies the resctrl CPU offline handler to wait out any existing
work. Since the resctrl offline handler runs before the workqueue offline handler, I
expect this change to ensure that the work completes on the CPU going offline, leaving
no work for the workqueue offline handler to migrate to another CPU.

The same patch also adds protection within the worker itself against this scenario:
it ensures that when the worker runs it is still a "per CPU thread", so that once it
does start running, smp_processor_id() can be used safely.
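
To make the ordering argument concrete, here is a rough userspace model of the two
protections described above. It is only a sketch and not the code from the patch:
every name in it (struct model_work, model_resctrl_offline(), and so on) is
hypothetical, and the "per CPU thread" state is reduced to a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model only; none of these names are resctrl or workqueue APIs. */
struct model_work {
	bool pending;        /* overflow work is queued but has not run yet */
	bool percpu_thread;  /* worker is still pinned to the CPU it was queued on */
	int  reads_done;     /* runs that read the counters */
	int  reads_skipped;  /* runs that bailed out without reading */
};

/* Worker body: only read counters while still pinned to the original CPU. */
static void model_overflow_worker(struct model_work *w)
{
	w->pending = false;
	if (!w->percpu_thread) {
		/* Second protection: the CPU id cannot be trusted, so skip the read. */
		w->reads_skipped++;
		return;
	}
	w->reads_done++;
}

/* First protection: the earlier offline step waits out pending work. */
static void model_resctrl_offline(struct model_work *w)
{
	if (w->pending)
		model_overflow_worker(w);
}

/* Later offline step: whatever is still pending runs unbound. */
static void model_workqueue_offline(struct model_work *w)
{
	w->percpu_thread = false;
	if (w->pending)
		model_overflow_worker(w);
}

/* Run one CPU offline sequence, with or without the wait in the first step. */
static struct model_work model_cpu_offline(bool wait_in_resctrl_handler)
{
	struct model_work w = { .pending = true, .percpu_thread = true };

	if (wait_in_resctrl_handler)
		model_resctrl_offline(&w);
	model_workqueue_offline(&w);
	assert(!w.pending);
	return w;
}
```

In this model, model_cpu_offline(true) leaves nothing for the later step to migrate,
while model_cpu_offline(false) shows the worker guard turning the unbound run into a
skipped read instead of a read on the wrong CPU.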

Reinette