Re: CVE-2024-36000: mm/hugetlb: fix missing hugetlb_lock for resv uncharge

From: Michal Hocko
Date: Thu May 23 2024 - 09:08:44 EST


On Thu 23-05-24 12:33:37, Oscar Salvador wrote:
> On Thu, May 23, 2024 at 09:30:59AM +0200, Michal Hocko wrote:
> > Let me add Oscar,
>
> Thanks
>
> >
> > On Tue 21-05-24 15:38:45, Peter Xu wrote:
> > > That commit mentioned that we rely on the lock to make sure all hugetlb
> > > folios on the active list will have a valid memcg. However I'm not sure
> > > whether it's still required now (after all that's 2012..), e.g., I'm
> > > looking at hugetlb_cgroup_css_offline(), and hugetlb_cgroup_move_parent()
> > > looks all safe to even take empty memcg folios with the latest code at
> > > least:
> > >
> > > /*
> > > * We can have pages in active list without any cgroup
> > > * ie, hugepage with less than 3 pages. We can safely
> > > * ignore those pages.
> > > */
> > > if (!page_hcg || page_hcg != h_cg)
> > > goto out;
>
> Ok, I had a look at hugetlb_cgroup implementation.
> First time looking at that code, so bear with me.
>
> I looked back at commit
>
> commit 94ae8ba7176666d1e7d8bbb9f93670a27540b6a8 (HEAD)
> Author: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
> Date: Tue Jul 31 16:42:35 2012 -0700
>
> hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to active list.
>
> to understand why the lock was needed.
>
> On the theoretical part:
>
> And we could have
>
> CPU0 CPU1
> dequeue_huge_page_vma
> dequeue_huge_page_node
> move_page_to_active_list
> release_lock
> hugetlb_cgroup_pre_destroy
> for_each_page_in_active_list
> <-- pages empty cgroups are skipped -->
> hugetlb_cgroup_move_parent
> move_page_to_parent
> hugetlb_cgroup_commit_charge <-- too late
> page[2].lru.next = (void *)h_cg;
>
> So, the above page should have been moved to the parent, but since by the time
> we were checking the activelist this page did not have any cgroup attach ot it,
> it was skipped.
>
> Notice I said theoretical because I noticed that
> cgroup_call_pre_destroy()->hugetlb_cgroup_pre_destroy() is called from
> cgroup_rmdir(). I am not sure under which circumstances cgroup_rmdir()
> can succeed (does the cgroup refcount have dropped to 0?)

Now, it just cannot have any tasks attached nor any subgroups. So is the
race actually possible?
--
Michal Hocko
SUSE Labs