Re: [PATCH] mm: khugepaged: fix NR_FILE_PAGES accounting in collapse_file()
From: Usama Arif
Date: Thu Jan 29 2026 - 17:51:25 EST
On 29/01/2026 18:40, Shakeel Butt wrote:
> In META's fleet, we are seeing high level cgroups with zero file memcg
> stat but their descendants have non-zero file stat. This should not be
> possible. On further inspection by looking at kernel data structures
> through drgn, it was revealed that the high level cgroups have negative
> file stat which was aggregated from their children.
>
> Another interesting point was that this specific issue started happening
> more often as we started deploying thp-always more widely which
> indicates some correlation between file memory and THPs and indeed it
> was found that file memcg stat accounting is buggy in the collapse code
> path from the start.
>
> When collapse_file() replaces small folios with a large THP, it fails to
> properly update the NR_FILE_PAGES memcg stat for both the old folios
> being freed and the new THP being added. It assumes the old and new
> folios belong to the same cgroup. However, this assumption breaks in a
> couple of scenarios:
>
> 1. Binary (executable) package downloader running in a different cgroup
> than the actual job executing the downloaded package.
>
> 2. File shared and mapped by processes running in different cgroups. One
> process reads in the file and the second process collapses it, either
> directly through MADV_COLLAPSE or via khugepaged acting on behalf of
> the second process.
>
> So, the current code has two bugs:
>
> 1. For non-shmem files, NR_FILE_PAGES is never incremented for the new
> THP because nr_none is always 0 for non-shmem, and the stat update is
> inside the "if (nr_none)" block.
>
> 2. When freeing old folios, NR_FILE_PAGES is never decremented because
> folio->mapping is set to NULL directly without calling
> filemap_unaccount_folio().
>
> These bugs cause incorrect per-memcg accounting when the process
> triggering the collapse (MADV_COLLAPSE or khugepaged) belongs to a
> different memcg than the process that originally faulted in the pages:
>
> - Process A (memcg X) reads file, creating 512 small page cache folios
> charged to memcg X (NR_FILE_PAGES += 512 for memcg X)
>
> - Process B (memcg Y) triggers collapse via MADV_COLLAPSE, or khugepaged
> scans B's mm. The new THP is charged to memcg Y.
>
> - Old folios freed: NR_FILE_PAGES not decremented (bug)
> New THP added: NR_FILE_PAGES not incremented (bug)
>
> - Later, THP removed from page cache: NR_FILE_PAGES -= 512 for memcg Y
>
> Result: memcg X has +512 inflated pages, memcg Y has -512 (negative!)
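
The sequence above can be replayed in a small userspace model. This is
only a sketch to check the arithmetic, not kernel code: struct toy_memcg
and the toy_*() helpers are made-up names, and HPAGE_PMD_NR is assumed
to be 512 (4K base pages, 2M THP).

```c
#include <assert.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512	/* assumed: 4K base pages per 2M THP */

/* Toy stand-in for a memcg's NR_FILE_PAGES counter; not a kernel type. */
struct toy_memcg {
	long nr_file_pages;
};

/* Process A (memcg X) reads the file: 512 small folios are accounted. */
static void toy_read_file(struct toy_memcg *x)
{
	x->nr_file_pages += HPAGE_PMD_NR;
}

/*
 * Collapse on behalf of memcg Y. With fixed == 0 neither counter moves,
 * mirroring the two bugs. With fixed != 0 the old folios are unaccounted
 * from X and the new THP is accounted to Y.
 */
static void toy_collapse(struct toy_memcg *x, struct toy_memcg *y, int fixed)
{
	if (!fixed)
		return;
	x->nr_file_pages -= HPAGE_PMD_NR;
	y->nr_file_pages += HPAGE_PMD_NR;
}

/* The THP later leaves the page cache and is unaccounted from memcg Y. */
static void toy_remove_thp(struct toy_memcg *y)
{
	y->nr_file_pages -= HPAGE_PMD_NR;
}

/* Run read -> collapse -> removal and report the final counters. */
static void toy_scenario(int fixed, long *x_out, long *y_out)
{
	struct toy_memcg x = { 0 }, y = { 0 };

	assert(x_out && y_out);
	toy_read_file(&x);
	toy_collapse(&x, &y, fixed);
	toy_remove_thp(&y);
	*x_out = x.nr_file_pages;
	*y_out = y.nr_file_pages;
}

/* Print the final counters for the buggy (0) or fixed (1) variant. */
static void toy_print(int fixed)
{
	long x, y;

	toy_scenario(fixed, &x, &y);
	printf("fixed=%d: memcg X = %ld, memcg Y = %ld\n", fixed, x, y);
}
```

In this model toy_scenario(0, ...) leaves X at +512 and Y at -512, which
matches the result quoted above, while toy_scenario(1, ...) brings both
counters back to 0.
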
>
> Fix this by:
> 1. Always incrementing NR_FILE_PAGES by HPAGE_PMD_NR for the new THP
> 2. Decrementing NR_FILE_PAGES for each old folio before clearing its
> mapping pointer
>
> For shmem with holes (nr_none > 0), the net change is still +nr_none
> since we decrement (HPAGE_PMD_NR - nr_none) old pages and increment
> HPAGE_PMD_NR new pages.
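
The shmem arithmetic can be checked the same way. Again a sketch only:
toy_net_change() is a hypothetical helper, HPAGE_PMD_NR is assumed to be
512, and the old and new folios are assumed to belong to the same memcg
so that both updates land on one counter.

```c
#include <assert.h>

#define HPAGE_PMD_NR 512	/* assumed: 4K base pages per 2M THP */

/*
 * Net NR_FILE_PAGES change for one collapse under the fixed scheme:
 * decrement once per old page actually present, then increment by
 * HPAGE_PMD_NR for the new THP.
 */
static long toy_net_change(long nr_none)
{
	long old_pages;

	assert(nr_none >= 0 && nr_none <= HPAGE_PMD_NR);
	old_pages = HPAGE_PMD_NR - nr_none;	/* pages that were not holes */
	return HPAGE_PMD_NR - old_pages;	/* +new THP, -old pages */
}
```

For non-shmem files nr_none is 0, so the net change is 0 within a single
memcg; for shmem with, say, 12 holes it is +12, the same as before the
patch. When the old and new folios are in different memcgs, the decrement
and the increment hit different counters, which is the case the patch
corrects.
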
>
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
> Signed-off-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
Acked-by: Usama Arif <usamaarif642@xxxxxxxxx>