Re: [PATCH] mmc: core: add WQ_PERCPU to alloc_workqueue users

From: Adrian Hunter
Date: Mon Nov 17 2025 - 05:48:14 EST


On 12/11/2025 13:45, Ulf Hansson wrote:
> On Wed, 12 Nov 2025 at 07:49, Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> On 11/11/2025 19:12, Ulf Hansson wrote:
>>> + Adrian
>>>
>>> On Fri, 7 Nov 2025 at 15:17, Marco Crivellari <marco.crivellari@xxxxxxxx> wrote:
>>>>
>>>> Currently, if a user enqueues a work item using schedule_delayed_work(),
>>>> the workqueue used is "system_wq" (a per-CPU wq), while queue_delayed_work()
>>>> uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies
>>>> to schedule_work(), which uses system_wq, and queue_work(), which again
>>>> makes use of WORK_CPU_UNBOUND.
>>>> This lack of consistency cannot be addressed without refactoring the API.
>>>>
>>>> alloc_workqueue() treats all queues as per-CPU by default, while unbound
>>>> workqueues must opt-in via WQ_UNBOUND.
>>>>
>>>> This default is suboptimal: most workloads benefit from unbound queues,
>>>> allowing the scheduler to place worker threads where they’re needed and
>>>> reducing noise when CPUs are isolated.
>>>>
>>>> This continues the effort to refactor workqueue APIs, which began with
>>>> the introduction of new workqueues and a new alloc_workqueue flag in:
>>>>
>>>> commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq")
>>>> commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag")
>>>>
>>>> This change adds a new WQ_PERCPU flag to explicitly request
>>>> alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
>>>>
>>>> With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
>>>> any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
>>>> must now use WQ_PERCPU.
>>>>
>>>> Once migration is complete, WQ_UNBOUND can be removed and unbound will
>>>> become the implicit default.
>>>>
>>>> Suggested-by: Tejun Heo <tj@xxxxxxxxxx>
>>>> Signed-off-by: Marco Crivellari <marco.crivellari@xxxxxxxx>
>>>> ---
>>>> drivers/mmc/core/block.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
>>>> index c0ffe0817fd4..6a651ddccf28 100644
>>>> --- a/drivers/mmc/core/block.c
>>>> +++ b/drivers/mmc/core/block.c
>>>> @@ -3275,7 +3275,8 @@ static int mmc_blk_probe(struct mmc_card *card)
>>>> mmc_fixup_device(card, mmc_blk_fixups);
>>>>
>>>> card->complete_wq = alloc_workqueue("mmc_complete",
>>>> - WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
>>>> + WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_PERCPU,
>>>> + 0);
>>>
>>> I guess we prefer to keep the existing behaviour to avoid breaking
>>> anything, before continuing with the refactoring. Although I think it
>>> should be fine to use WQ_UNBOUND here.
>>>
>>> Looping in Adrian to get his opinion around this.
>>
>> Typically the work is being queued from the CPU that received the
>> interrupt. I'd assume running the work on that CPU, as we do now,
>> has some merit.
>>
>
> Thanks, I get your point!
>
> So, to me it seems like if we want to explore other options, it would
> require us to do more analysis to avoid introducing performance
> regressions.
>
> BTW, do we know how other block device drivers are dealing with this?

AFAIK, they call blk_mq_complete_request() from the interrupt handler.
mmc_block does that in the case of CQE or HSQ.
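
For reference, the pattern being discussed looks roughly like the sketch
below. This is not the actual mmc_block code (names like my_irq_handler
and my_probe are illustrative only); it just shows how a driver that
queues completion work from its interrupt handler keeps the existing
local-CPU behaviour by passing WQ_PERCPU to alloc_workqueue(), as the
patch does:

	#include <linux/workqueue.h>
	#include <linux/interrupt.h>

	static struct workqueue_struct *complete_wq;

	static void my_complete_work(struct work_struct *work)
	{
		/* finish the request here, outside hard-irq context */
	}

	static DECLARE_WORK(complete_work, my_complete_work);

	static irqreturn_t my_irq_handler(int irq, void *dev_id)
	{
		/*
		 * On a per-CPU (WQ_PERCPU) workqueue, queue_work() runs
		 * the work on the local CPU, i.e. the CPU that received
		 * the interrupt - the behaviour Adrian refers to above.
		 */
		queue_work(complete_wq, &complete_work);
		return IRQ_HANDLED;
	}

	static int my_probe(void)
	{
		/* explicit WQ_PERCPU instead of relying on the default */
		complete_wq = alloc_workqueue("my_complete",
					      WQ_MEM_RECLAIM | WQ_HIGHPRI |
					      WQ_PERCPU, 0);
		if (!complete_wq)
			return -ENOMEM;
		return 0;
	}

With WQ_UNBOUND instead, the scheduler would be free to run the work on
any CPU, which is the behaviour change the thread agrees needs more
analysis before switching.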