Re: [PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
From: David Hildenbrand
Date: Thu Aug 13 2020 - 05:32:45 EST
On 12.08.20 12:11, Charan Teja Kalla wrote:
>
>
> On 8/12/2020 3:30 PM, David Hildenbrand wrote:
>> On 12.08.20 11:46, Charan Teja Kalla wrote:
>>>
>>> Thanks David for the inputs.
>>>
>>> On 8/12/2020 2:35 AM, David Hildenbrand wrote:
>>>> On 11.08.20 14:58, Charan Teja Reddy wrote:
>>>>> The following race is observed with the repeated online, offline and a
>>>>> delay between two successive online of memory blocks of movable zone.
>>>>>
>>>>> P1 P2
>>>>>
>>>>> Online the first memory block in
>>>>> the movable zone. The pcp struct
>>>>> values are initialized to default
>>>>> values,i.e., pcp->high = 0 &
>>>>> pcp->batch = 1.
>>>>>
>>>>> Allocate the pages from the
>>>>> movable zone.
>>>>>
>>>>> Try to Online the second memory
>>>>> block in the movable zone thus it
>>>>> entered the online_pages() but yet
>>>>> to call zone_pcp_update().
>>>>> This process is entered into
>>>>> the exit path thus it tries
>>>>> to release the order-0 pages
>>>>> to pcp lists through
>>>>> free_unref_page_commit().
>>>>> As pcp->high = 0, pcp->count = 1
>>>>> proceed to call the function
>>>>> free_pcppages_bulk().
>>>>> Update the pcp values thus the
>>>>> new pcp values are like, say,
>>>>> pcp->high = 378, pcp->batch = 63.
>>>>> Read the pcp's batch value using
>>>>> READ_ONCE() and pass the same to
>>>>> free_pcppages_bulk(), pcp values
>>>>> passed here are, batch = 63,
>>>>> count = 1.
>>>>>
>>>>> Since num of pages in the pcp
>>>>> lists are less than ->batch,
>>>>> then it will stuck in
>>>>> while(list_empty(list)) loop
>>>>> with interrupts disabled thus
>>>>> a core hung.
>>>>>
>>>>> Avoid this by ensuring free_pcppages_bulk() is called with proper count
>>>>> of pcp list pages.
>>>>>
>>>>> The mentioned race is some what easily reproducible without [1] because
>>>>> pcp's are not updated for the first memory block online and thus there
>>>>> is a enough race window for P2 between alloc+free and pcp struct values
>>>>> update through onlining of second memory block.
>>>>>
>>>>> With [1], the race is still exists but it is very much narrow as we
>>>>> update the pcp struct values for the first memory block online itself.
>>>>>
>>>>> [1]: https://patchwork.kernel.org/patch/11696389/
>>>>>
>>>>
>>>> IIUC, this is not limited to the movable zone, it could also happen in
>>>> corner cases with the normal zone (e.g., hotplug to a node that only has
>>>> DMA memory, or no other memory yet).
>>>
>>> Yes, this is my understanding too. I explained the above race in terms
>>> of just movable zone for which it is observed. We can add the below line
>>> in the end in patch commit message:
>>> "This is not limited to the movable zone, it could also happen in cases
>>> with the normal zone (e.g., hotplug to a node that only has DMA memory,
>>> or no other memory yet)."
>>
>> Yeah, that makes sense!
>>
>>>
>>> Just curious, there exists such systems where just a dma zone present
>>> and we hot add the normal zone? I am not aware such thing in the
>>> embedded world.
>>
>> You can easily create such setups using QEMU.
>>
>> IIRC, just specify a QEMU guest with 2G initial memory and a single NUMA
>> node, or 4G initial memory and two NUMA nodes. Then hotplug memory.
>>
>> (IIRC kata containers always start a VM with 2G and then hotplug memory)
>>
> I see. Thanks for letting me know this.
>
>>>>
>>>>> Signed-off-by: Charan Teja Reddy <charante@xxxxxxxxxxxxxx>
>>>>> ---
>>>>>
>>>>> v1: https://patchwork.kernel.org/patch/11707637/
>>>>>
>>>>> mm/page_alloc.c | 5 +++++
>>>>> 1 file changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index e4896e6..839039f 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>>>> struct page *page, *tmp;
>>>>> LIST_HEAD(head);
>>>>>
>>>>> + /*
>>>>> + * Ensure proper count is passed which otherwise would stuck in the
>>>>> + * below while (list_empty(list)) loop.
>>>>> + */
>>>>> + count = min(pcp->count, count);
>>>>> while (count) {
>>>>> struct list_head *list;
>>>>>
>>>>>
>>>>
>>>> Fixes: and Cc: stable... tags?
>>>
>>> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
>>> one-list-per-migrate-type")
>>> Cc: <stable@xxxxxxxxxxxxxxx> [2.6+]
>>
>> Did we have memory hotplug support then already?
>
> Yes, it exist.
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/mm/memory_hotplug.c?h=v2.6.39
Okay, so I guess these tags make sense.
>
>>
>>>
>>> I am not sure If I should have to raise V3 including these?
>>
>>
>> Maybe Andrew can fixup when applying.
>
> Okay, let Andrew decide on this. Meanwhile If you find that this patch
> looks correct, ACK from you helps here.
Sure, I think this is good enough as a simple fix.
Acked-by: David Hildenbrand <david@xxxxxxxxxx>
--
Thanks,
David / dhildenb