Re: CVE-2024-36000: mm/hugetlb: fix missing hugetlb_lock for resv uncharge

From: Oscar Salvador
Date: Thu May 23 2024 - 06:33:50 EST


On Thu, May 23, 2024 at 09:30:59AM +0200, Michal Hocko wrote:
> Let me add Oscar,

Thanks

>
> On Tue 21-05-24 15:38:45, Peter Xu wrote:
> > That commit mentioned that we rely on the lock to make sure all hugetlb
> > folios on the active list will have a valid memcg. However I'm not sure
> > whether it's still required now (after all that's 2012..), e.g., I'm
> > looking at hugetlb_cgroup_css_offline(), and hugetlb_cgroup_move_parent()
> > looks all safe to even take empty memcg folios with the latest code at
> > least:
> >
> > /*
> > * We can have pages in active list without any cgroup
> > * ie, hugepage with less than 3 pages. We can safely
> > * ignore those pages.
> > */
> > if (!page_hcg || page_hcg != h_cg)
> > goto out;

Ok, I had a look at the hugetlb_cgroup implementation.
It is the first time I am looking at that code, so bear with me.

I looked back at commit

commit 94ae8ba7176666d1e7d8bbb9f93670a27540b6a8 (HEAD)
Author: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Date: Tue Jul 31 16:42:35 2012 -0700

hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.

to understand why the lock was needed.

On the theoretical side, we could have:

CPU0                                            CPU1
dequeue_huge_page_vma
 dequeue_huge_page_node
  move_page_to_active_list
 release_lock
                                                hugetlb_cgroup_pre_destroy
                                                 for_each_page_in_active_list
                                                  <-- pages with empty cgroups are skipped -->
                                                  hugetlb_cgroup_move_parent
                                                   move_page_to_parent
hugetlb_cgroup_commit_charge   <-- too late
 page[2].lru.next = (void *)h_cg;

So, the above page should have been moved to the parent, but since, by the time
we were walking the active list, this page did not have any cgroup attached to it
yet, it was skipped.
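
If I understand commit 94ae8ba71766 correctly, it closes that window by
committing the charge while hugetlb_lock is still held, right after the page
has been put on the active list. Roughly (a sketch from my reading of the code
of that era, not the exact hunk):

	spin_lock(&hugetlb_lock);
	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve);
	if (page) {
		/* page is already on the active list; assign its cgroup
		 * before the lock is dropped */
		hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h),
					     h_cg, page);
	}
	spin_unlock(&hugetlb_lock);

Since the pre_destroy path walks the active list under the same lock, it can no
longer find a page that is on the list but not yet charged.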

Notice I said theoretical, because
cgroup_call_pre_destroy()->hugetlb_cgroup_pre_destroy() is called from
cgroup_rmdir(), and I am not sure under which circumstances cgroup_rmdir()
can succeed (does the cgroup refcount have to have dropped to 0?).


Now on to the current code.

After Peter's fix, we have:

CPU0                                            CPU1
dequeue_huge_page_vma
 dequeue_hugetlb_folio_node_exact
  move_page_to_active_list
 hugetlb_cgroup_commit_charge
  : folio->_hugetlb_cgroup = cgroup
 hugetlb_cgroup_commit_charge_rsvd
  : folio->_hugetlb_cgroup_rsvd = cgroup
 release lock
                                                hugetlb_cgroup_css_offline
                                                 take lock
                                                 for_each_page_in_active_list
                                                  hugetlb_cgroup_move_parent
                                                   : set folio->_hugetlb_cgroup
                                                   : _hugetlb_cgroup_rsvd is not set,
                                                     still contains the old value
                                                 release lock
__hugetlb_cgroup_uncharge_folio_rsvd
 hugetlb_cgroup_from_folio_rsvd
 ...
 folio->_hugetlb_cgroup_rsvd = NULL
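
For context, the tail of hugetlb_cgroup_move_parent() looks roughly like this
in my reading of the current code (a sketch, so details may be slightly off):
it moves the fault charge to the parent and rewrites folio->_hugetlb_cgroup,
but never looks at the rsvd side:

	if (!parent) {
		parent = root_h_cgroup;
		/* root has no limit */
		page_counter_charge(&parent->hugepage[idx], nr_pages);
	}
	counter = &h_cg->hugepage[idx];
	/* take the pages off the local (fault) counter */
	page_counter_cancel(counter, nr_pages);

	set_hugetlb_cgroup(folio, parent);
	/* nothing here touches folio->_hugetlb_cgroup_rsvd or the rsvd counters */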


So, as you can see, we charged a folio to the parent, yet this folio still
contains the previous _hugetlb_cgroup_rsvd, when it should have been NULLed.
All of which means that, since no page_counter_uncharge() was called for
_hugetlb_cgroup_rsvd, the parent still has the old page_counter charges for cgroup_rsvd.

Then, if someone creates a new cgroup under this parent before
hugetlb_cgroup_from_folio_rsvd() gets called, it will get both page_counters,
the one for _hugetlb_cgroup and the one for _hugetlb_cgroup_rsvd.
And then page_counter_set_max() gets called for both counters.
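
For reference, this is roughly what happens per hstate index idx in
hugetlb_cgroup_init() when the new child is created (again a sketch from my
reading of the code):

	if (parent_h_cgroup) {
		fault_parent = hugetlb_cgroup_counter_from_cgroup(parent_h_cgroup, idx);
		rsvd_parent = hugetlb_cgroup_counter_from_cgroup_rsvd(parent_h_cgroup, idx);
	}
	page_counter_init(hugetlb_cgroup_counter_from_cgroup(h_cgroup, idx),
			  fault_parent);
	page_counter_init(hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx),
			  rsvd_parent);

	limit = round_down(PAGE_COUNTER_MAX, pages_per_huge_page(&hstates[idx]));
	page_counter_set_max(hugetlb_cgroup_counter_from_cgroup(h_cgroup, idx), limit);
	page_counter_set_max(hugetlb_cgroup_counter_from_cgroup_rsvd(h_cgroup, idx), limit);

So both the fault and the rsvd counter of the new child hang off the parent's
counters, and both get their max set.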

What does it imply for that new cgroup to have those page_counters?
I do not really know.

Maybe it does not get to allocate a new page when it should? Or does it look like
it has more memory reserved than it actually does?
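
For completeness, the fix this CVE points at, as I read it, just makes sure the
rsvd uncharge in alloc_hugetlb_folio() runs under hugetlb_lock, roughly:

	if (deferred_reserve) {
		spin_lock_irq(&hugetlb_lock);
		hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
				pages_per_huge_page(h), folio);
		spin_unlock_irq(&hugetlb_lock);
	}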


--
Oscar Salvador
SUSE Labs