Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

From: Gao Xiang
Date: Thu Nov 21 2024 - 08:18:20 EST


Hi Max!

On 2024/11/21 16:43, Max Kellermann wrote:
> On Thu, Nov 21, 2024 at 5:51 AM Christoph Hellwig <hch@xxxxxx> wrote:
>> Something seems to be going wrong here, though, but the trace below
>> doesn't really tell me anything about the workload or file system
>> used, and if this is even calling into readahead.
>
> In case you were asking :-) these are web servers (shared webhosting),
> running PHP most of the time. The host itself runs on an ext4, but I
> don't think the ext4 system partition has anything to do with this.
> PHP runs in containers that are erofs, the PHP sources plus
> memory-mapped opcache files are in btrfs (read-only snapshot) and the
> runtime data is on NFS or Ceph (there have been stalls on both server
> types).
> My limited experience with Linux MM suggests that this happens during
> the page fault of a memory mapped file. PHP processes usually mmap
> only files from erofs and btrfs.
> The servers are always somewhat under memory pressure; our container
> manager keeps as many containers alive as possible and only shuts them
> down when the server reaches the memory limit. At any given time,
> there are thousands of containers.

Just saw this. I guess your _recent_ 6.11.9 bug is actually
related to EROFS, since EROFS uses readahead_expand(). I think
the issue in your recent report was introduced by a recently
backported fix,
commit 9e2f9d34dd12 ("erofs: handle overlapped pclusters out of crafted images properly")
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.11.9&id=9cfa199bcbbbba31cbf97b2786f44f4464f3f29a

After this patch, bio can be NULL at this point, which causes
unbalanced psi_memstall_{enter,leave}(). It can be fixed as follows
(the diff below may be damaged by my email client):

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 01f147505487..19ef4ff2a134 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1792,9 +1792,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
 			erofs_fscache_submit_bio(bio);
 		else
 			submit_bio(bio);
-		if (memstall)
-			psi_memstall_leave(&pflags);
 	}
+	if (memstall)
+		psi_memstall_leave(&pflags);
 
 	/*
 	 * although background is preferred, no one is pending for submission.
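
For readers who want to see the imbalance outside the kernel, here is a
minimal userspace sketch of the two control-flow shapes in the diff
above. It is not kernel code: nr_memstall, fake_memstall_enter(),
fake_memstall_leave() and both submit_* helpers are hypothetical names
made up for illustration, and the "enter" call only models an enter that
happened earlier in the submission loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the tasks[NR_MEMSTALL] counter. */
static int nr_memstall;

static void fake_memstall_enter(void) { nr_memstall++; }
static void fake_memstall_leave(void) { nr_memstall--; }

/*
 * Old shape: leave sits inside "if (bio)", so it is skipped when bio
 * is NULL. Returns the counter change of one call (0 means balanced).
 */
static int submit_leave_inside(const void *bio, bool memstall)
{
	int before = nr_memstall;

	if (memstall)
		fake_memstall_enter();	/* models the earlier enter */
	if (bio) {
		/* the real code would submit the bio here */
		if (memstall)
			fake_memstall_leave();
	}
	return nr_memstall - before;
}

/* Fixed shape: leave no longer depends on bio being non-NULL. */
static int submit_leave_outside(const void *bio, bool memstall)
{
	int before = nr_memstall;

	if (memstall)
		fake_memstall_enter();	/* models the earlier enter */
	if (bio) {
		/* the real code would submit the bio here */
	}
	if (memstall)
		fake_memstall_leave();
	return nr_memstall - before;
}
```

Calling submit_leave_inside(NULL, true) leaves the counter one higher
than before, while submit_leave_outside() returns 0 for every
combination of bio and memstall.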

However, your original report predates the very recent
commit 9e2f9d34dd12. Before this commit, bio could not
be NULL here, so I don't think they are the same issue.

I will submit a formal fix for the recent bug later,
thanks!

Thanks,
Gao Xiang