Re: [PATCH] mm/filemap: Adding missing mem_cgroup_uncharge() to __add_to_page_cache_locked()

From: Waiman Long
Date: Tue Jan 26 2021 - 04:08:52 EST


On 1/25/21 1:29 PM, Matthew Wilcox wrote:
On Mon, Jan 25, 2021 at 01:23:58PM -0500, Waiman Long wrote:
On 1/25/21 1:14 PM, Michal Hocko wrote:
On Mon 25-01-21 17:41:19, Michal Hocko wrote:
On Mon 25-01-21 16:25:06, Matthew Wilcox wrote:
On Mon, Jan 25, 2021 at 05:03:28PM +0100, Michal Hocko wrote:
On Mon 25-01-21 10:57:54, Waiman Long wrote:
On 1/25/21 4:28 AM, Michal Hocko wrote:
On Sun 24-01-21 23:24:41, Waiman Long wrote:
The commit 3fea5a499d57 ("mm: memcontrol: convert page
cache to a new mem_cgroup_charge() API") introduced a bug in
__add_to_page_cache_locked() causing the following splat:

[ 1570.068330] page dumped because: VM_BUG_ON_PAGE(page_memcg(page))
[ 1570.068333] pages's memcg:ffff8889a4116000
[ 1570.068343] ------------[ cut here ]------------
[ 1570.068346] kernel BUG at mm/memcontrol.c:2924!
[ 1570.068355] invalid opcode: 0000 [#1] SMP KASAN PTI
[ 1570.068359] CPU: 35 PID: 12345 Comm: cat Tainted: G S W I 5.11.0-rc4-debug+ #1
[ 1570.068363] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.25 12/06/2017
[ 1570.068365] RIP: 0010:commit_charge+0xf4/0x130
:
[ 1570.068375] RSP: 0018:ffff8881b38d70e8 EFLAGS: 00010286
[ 1570.068379] RAX: 0000000000000000 RBX: ffffea00260ddd00 RCX: 0000000000000027
[ 1570.068382] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88907ebe05a8
[ 1570.068384] RBP: ffffea00260ddd00 R08: ffffed120fd7c0b6 R09: ffffed120fd7c0b6
[ 1570.068386] R10: ffff88907ebe05ab R11: ffffed120fd7c0b5 R12: ffffea00260ddd38
[ 1570.068389] R13: ffff8889a4116000 R14: ffff8889a4116000 R15: 0000000000000001
[ 1570.068391] FS: 00007ff039638680(0000) GS:ffff88907ea00000(0000) knlGS:0000000000000000
[ 1570.068394] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1570.068396] CR2: 00007f36f354cc20 CR3: 00000008a0126006 CR4: 00000000007706e0
[ 1570.068398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1570.068400] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1570.068402] PKRU: 55555554
[ 1570.068404] Call Trace:
[ 1570.068407] mem_cgroup_charge+0x175/0x770
[ 1570.068413] __add_to_page_cache_locked+0x712/0xad0
[ 1570.068439] add_to_page_cache_lru+0xc5/0x1f0
[ 1570.068461] cachefiles_read_or_alloc_pages+0x895/0x2e10 [cachefiles]
[ 1570.068524] __fscache_read_or_alloc_pages+0x6c0/0xa00 [fscache]
[ 1570.068540] __nfs_readpages_from_fscache+0x16d/0x630 [nfs]
[ 1570.068585] nfs_readpages+0x24e/0x540 [nfs]
[ 1570.068693] read_pages+0x5b1/0xc40
[ 1570.068711] page_cache_ra_unbounded+0x460/0x750
[ 1570.068729] generic_file_buffered_read_get_pages+0x290/0x1710
[ 1570.068756] generic_file_buffered_read+0x2a9/0xc30
[ 1570.068832] nfs_file_read+0x13f/0x230 [nfs]
[ 1570.068872] new_sync_read+0x3af/0x610
[ 1570.068901] vfs_read+0x339/0x4b0
[ 1570.068909] ksys_read+0xf1/0x1c0
[ 1570.068920] do_syscall_64+0x33/0x40
[ 1570.068926] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1570.068930] RIP: 0033:0x7ff039135595

Before that commit, there were a try_charge() and a commit_charge()
in __add_to_page_cache_locked(). These two separate charge functions
were replaced by a single mem_cgroup_charge(). However, the conversion
did not add a matching mem_cgroup_uncharge() for the case where the
xarray insertion fails and the page is released back to the pool. Fix
this by adding a mem_cgroup_uncharge() call when an insertion error
happens.
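
The error-path change is roughly the following (a simplified sketch of
__add_to_page_cache_locked() with surrounding code elided, not the
literal diff):

	bool charged = false;

	if (!huge) {
		error = mem_cgroup_charge(page, current->mm, gfp);
		if (error)
			goto error;
		charged = true;
	}

	/* ... xarray insertion; failure jumps to the error label ... */

error:
	/* New: undo the charge taken above when the insertion fails. */
	if (charged)
		mem_cgroup_uncharge(page);
	page->mapping = NULL;
	/* Leave page->index set: truncation relies upon it */
	put_page(page);
	return error;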

Fixes: 3fea5a499d57 ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API")
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
OK, this is indeed a subtle bug. The patch aimed to simplify the charge
lifetime so that users do not really have to think about when to
uncharge, as that happens when the page is freed. fscache somehow breaks
that assumption because it doesn't free up pages but keeps some of them
in the cache.

I have tried to wrap my head around the cached object lifetime in
fscache but failed and got lost in the maze. Is this the only instance
of the problem? Would it make more sense to explicitly handle charges in
the fscache code, or are there other potential users that could fall
into this trap?
There may be other places that have a similar problem. I focused on the
filemap.c case because I have a test case that can reliably reproduce
the bug splat. This patch does fix it for my test case.
I believe this needs a more general fix than catching the random places
you happen to be able to trigger. Would it make more sense to address
this at the fscache level and make sure that a page returned to the pool
is always uncharged instead?
I believe you mean "page cache" -- there is a separate thing called
'fscache' which is used to cache network filesystems.
Yes, I really had fscache in mind because it does have "unusual" page
lifetime rules.

I don't understand the memcg code at all, so I have no useful feedback
on what you're saying other than this.
Well, the memcg accounting rules after the rework should have simplified
the API usage for most users. Memory gets charged when it is used and
the charge goes away when the page is freed. If a page is not really
freed in some cases and can be reused, then it doesn't automagically fit
into this scheme. I do understand that this puts some additional burden
on those special cases. I am not really sure what the right way is here
myself, but considering there might be other similar cases I would lean
towards special-casing where the pool is implemented. I would expect
there is already some state maintained for that purpose.
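
For illustration only, special casing at the pool level would mean
uncharging at the point where a page is retained rather than freed,
something like this (hypothetical retain_page_in_pool() helper, not
existing code):

	/*
	 * Hypothetical helper: called when a pool decides to keep a page
	 * around instead of freeing it back to the page allocator.
	 * Dropping the memcg charge here keeps the "charge goes away when
	 * the page is freed" rule intact for everyone else.
	 */
	static void retain_page_in_pool(struct page *page)
	{
		mem_cgroup_uncharge(page);
		/* ... add the page to the pool's private bookkeeping ... */
	}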
After some more thinking I've come to the conclusion that the patch as
proposed is the proper way forward. It is easier to follow if the
unwinding of state changes is local to the function.
I think so. It is easier to understand if the charge and uncharge calls
are grouped together in the same function.
With the proposed simplification by Willy
Acked-by: Michal Hocko <mhocko@xxxxxxxx>
Thanks for the ack. However, I am a bit confused about what you mean by
simplification. There is another linux-next patch that changes the
condition for mem_cgroup_charge() to:

-       if (!huge) {
+       if (!huge && !page_is_secretmem(page)) {
                error = mem_cgroup_charge(page, current->mm, gfp);

That is the main reason why I introduced the boolean variable, as I
don't want to call the external page_is_secretmem() function twice.
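
With that linux-next change the charge condition is no longer a cheap
test, so the unwind path checks a local flag instead of re-evaluating
it, along these lines (sketch only, not the actual diff):

	bool charged = false;

	if (!huge && !page_is_secretmem(page)) {
		error = mem_cgroup_charge(page, current->mm, gfp);
		if (error)
			goto error;
		charged = true;
	}
	...
error:
	if (charged)	/* no second page_is_secretmem() call needed */
		mem_cgroup_uncharge(page);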
See earlier emails to Mike suggesting that the accounting of secretmem
here is wrong.

I am not involved in the secretmem discussion, so I am not aware of the issue with that patchset. Anyway, my goal is to minimize potential conflicts with other patches.

Cheers,
Longman