Re: [PATCH] mm: fix use-after free of page_ext after race with memory-offline

From: Michal Hocko
Date: Wed Jul 20 2022 - 11:23:17 EST


On Wed 20-07-22 20:38:58, Charan Teja Kalla wrote:
> Thanks Michal here!!
>
> On 7/19/2022 9:13 PM, Michal Hocko wrote:
> >>>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >>>>>> index 3dc715d..5ccd3ee 100644
> >>>>>> --- a/mm/page_ext.c
> >>>>>> +++ b/mm/page_ext.c
> >>>>>> @@ -299,8 +299,9 @@ static void __free_page_ext(unsigned long pfn)
> >>>>>> if (!ms || !ms->page_ext)
> >>>>>> return;
> >>>>>> base = get_entry(ms->page_ext, pfn);
> >>>>>> - free_page_ext(base);
> >>>>>> ms->page_ext = NULL;
> >>>>>> + synchronize_rcu();
> >>>>>> + free_page_ext(base);
> >>>>>> }
> >>>>> So you are imposing the RCU grace period for each page_ext! This can get
> >>>>> really expensive. Have you tried to measure the effect?
> >>> I was wrong here! This is for each memory section which is not as
> >>> terrible as every single page_ext. This can be still quite a lot memory
> >>> sections in a single memory block (e.g. on ppc memory sections are
> >>> ridiculously small).
> >>>
> >> On the ARM64, I see that the minimum a section size will go is 128MB. I
> >> think 16MB is the section size on ppc. Any inputs on how frequently
> >> offline/online operation is being done on this ppc arch?
> > I have seen several reports where 16MB sections were used on PPC LPARs
> > with a non trivial size. My usual answer to that is tha this is mostly a
> > self inflicted injury but I am told that for some reasons I cannot
> > udnerstand this is not easy to change. So reasonable or not this is not
> > all that uncommon in PPC land.
> >
> > We definitely shouldn't optimize for those setups but we shouldn't make
> > them suffer even more as well. Besides that it seems that a single
> > rcu_synchronize per offline operation should be doable.
>
> I too feel it is doable but the code might look to need to traverse the
> sections of a memory block twice.

yes, this is to be expected unless you really want to optimize that
further by some global flag which is probably not worth it. Traversing
the sequence of section is not really an expensive operation comparing
to do an expenensive operation to each of the iteration.

> 1) with synchronize_rcu() calling for each memory section of a memblock:
> ---------------------------------------------------------------------
> for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
> __free_page_ext(pfn):
> ms->page_ext = NULL
> synchronize_rcu();// Called on every section.
> free_page_ext();//free the page_ext.
>
> 2) With a single synchronize_rcu() for each offlined block:
> -------------------------------------------------------
> for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
> __free_page_ext(pfn):
> ms->page_ext = NULL;

you want WRITE_ONCE here

> }
> synchronize_rcu(); // call synchronize_rcu for just once
> for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
> free_page_ext(); // Free the page_ext.

Yes this is what I really meant but the thing is that you need the
pointer to know what to free. One way to handle that would be to not
clear the pointer but make it visibly invalid for later checks. E.g. set
the lowest bit and check for it rather for NULL.

> Any better code you have in mind here please?
>
> But since there are few sections the overhead of traversing them will be
> definitely less compared to synchronize_rcu() per memsection.
>
> But I really doubt if there will be a real impact making sync_rcu per
> section because,(as david also mentioned and you also corrected it I
> think), the concern here is for ppc where the min memblock size is 256M
> with section size of 16M and there is a single offline operation on that
> block but can end up in calling 16 sync_rcu() calls. Should we really
> optimize this case here? If yes, I can go with the approach 2) mentioned
> above. Sorry if I am really undermining the problem here.

I would prefer a single rcu barrier solution. Unless this turns into
monster complicated stuff. Which I hope we can avoid completely by
simply not doing the separate lifetime as well...

Making any assumptions on the number of sections that are handled here
is just a ticket to get issues later. Let's not put land mines ahead of
us please.
--
Michal Hocko
SUSE Labs