Re: [PATCH v2 2/4] blk-cgroup: fix race between policy activation and blkg destruction

From: yu kuai

Date: Mon Jun 29 2026 - 05:08:41 EST

Hi,

在 2026/6/29 13:33, Nilay Shroff 写道:
> On 6/27/26 9:43 AM, yu kuai wrote:
>> Hi,
>>
>> 在 2026/6/26 14:12, Nilay Shroff 写道:
>>> On 6/26/26 7:22 AM, yu kuai wrote:
>>>> Hi,
>>>>
>>>> 在 2026/6/26 9:50, yu kuai 写道:
>>>>> Hi,
>>>>>
>>>>> 在 2026/6/25 23:08, Nilay Shroff 写道:
>>>>>> On 6/24/26 12:16 PM, Yu Kuai wrote:
>>>>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>>>>>> index 7baccfb690fe..f7e788a7fe95 100644
>>>>>>> --- a/block/blk-cgroup.c
>>>>>>> +++ b/block/blk-cgroup.c
>>>>>>> @@ -1563,10 +1563,12 @@ int blkcg_activate_policy(struct gendisk
>>>>>>> *disk, const struct blkcg_policy *pol)
>>>>>>>          if (WARN_ON_ONCE(!pol->pd_alloc_fn || !pol->pd_free_fn))
>>>>>>>              return -EINVAL;
>>>>>>>          if (queue_is_mq(q))
>>>>>>>              memflags = blk_mq_freeze_queue(q);
>>>>>>> +
>>>>>>> +    mutex_lock(&q->blkcg_mutex);
>>>>>>>    retry:
>>>>>>>          spin_lock_irq(&q->queue_lock);
>>>>>>>          /* blkg_list is pushed at the head, reverse walk to
>>>>>>> initialize parents first */
>>>>>>>          list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
>>>>>>> @@ -1625,10 +1627,11 @@ int blkcg_activate_policy(struct gendisk
>>>>>>> *disk, const struct blkcg_policy *pol)
>>>>>>>          __set_bit(pol->plid, q->blkcg_pols);
>>>>>>>          ret = 0;
>>>>>>>          spin_unlock_irq(&q->queue_lock);
>>>>>>>    out:
>>>>>>> +    mutex_unlock(&q->blkcg_mutex);
>>>>>>>          if (queue_is_mq(q))
>>>>>>>              blk_mq_unfreeze_queue(q, memflags);
>>>>>>>          if (pinned_blkg)
>>>>>>>              blkg_put(pinned_blkg);
>>>>>>>          if (pd_prealloc)
>>>>>> If the policy allocation fails, we jump to the lable enomem: and
>>>>>> teardown pds.
>>>>>> But I see this path still only acquires ->queue_lock. Don't we also
>>>>>> need
>>>>>> to protect it with ->blkcg_mutex?
>>>>> Yes, I agree we should protect it as well.
>>>>
>>>> Just take a closer look at the code, the enomem is already
>>>> protected by
>>>> blkcg_mutex :)
>>>>
>>>
>>> Oh yes, but the ->blkcg_mutex is never released if we jump to enomem.
>>> So that may potentially cause deadlock. We need to release
>>> ->blkcg_mutex
>>> once blkcg_policy_teardown_pds() returns. Or may be refactor code
>>> (or add
>>> comment) so that it's easy to realize or spot the ->blkcg_mutex is
>>> acquired
>>> and then released around blkcg_policy_teardown_pds().
>>
>> the enomem will goto out at last, and blkcg_mutex do released. The
>> code is
>> a bit hacky.
>>
>>>
>>>>>
>>>>>> Moreover I still see race between blkg insertion in blkg_create()
>>>>>> which
>>>>>> still doesn't use ->blkcg_mutex and so list traversal in
>>>>>> bfq_end_wr_async()
>>>>>> may still race with blkg_create(), isn't it? I remember you once
>>>>>> told
>>>>>> this will be handled in another series but I couldn't find that yet.
>>>>> This is the set:
>>>>>
>>>>> [PATCH 0/8] blk-cgroup: remove queue_lock nesting from blkcg paths
>>>>> - Yu
>>>>> Kuai
>>>>> <https://lore.kernel.org/all/cover.1780621988.git.yukuai@xxxxxxx/>
>>>>>
>>>>> Noted that set just make sure queue_lock is not nested under other
>>>>> atomic
>>>>> context, and that set do not acquire blkcg_mutex for blkg_create()
>>>>> yet. Howerver,
>>>>> with that set it'll be easy to convert all queue_lock to blkcg_mtuex
>>>>> for blkg
>>>>> protection, and together with lots of blk-cgroup code cleanups.
>>>>>
>>>
>>> Okay, so are you planning to send a follow-up patchset that replaces
>>> ->queue_lock
>>> with ->blkcg_mutex for protecting the blkg_list? If so, I'd still
>>> prefer acquiring
>>> ->blkcg_mutex around blkg_create() in this patchset. That would
>>> address the race
>>> between blkg_create() and the blkg_list traversal in
>>> bfq_end_wr_async(), while the
>>> subsequent series can focus on cleaning up and removing the remaining
>>> ->queue_lock
>>> usage for blkg protection.
>>
>> Yes, there is a follow-up patchset. When this set was posted,
>> blkg_create is still
>> called with queue_lock held, so I can't do that. However, not that
>> the other set
>> is already applied, I can hold blkcg_mutex for blkg_create() now.
>>
>
> If you already have a follow-up patchset that replaces ->queue_lock with
> ->blkcg_mutex for protecting blkg_list, then I think it might make sense
> to send that out first (if you haven't already) and hold off this
> series for
> now. That way, the blkg_create() race can be addressed first, and this
> series can build on top of those changes.

Yes, I already have a follow-up patchset that is build after this set.
I agree send that first will make sense, I'll rebase and send soon.

>
> BTW, if that follow-up series is merged first, I suspect the first two
> patches in
> this series may no longer be necessary, leaving only patches 3/4 and 4/4.

Yes, the first two patches will be folded into the patch to replace
queue_lock with blkcg_mutex.

>
> Thanks,
> --Nilay
>
>
>
--
Thanks,
Kuai