Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

From: Peter Zijlstra
Date: Wed Jun 12 2024 - 05:49:19 EST


On Wed, Jun 12, 2024 at 08:49:02AM +0200, Max Kellermann wrote:
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
>
> How is it possible to terminate a task that's in memstall?
> This must be between psi_memstall_enter() and psi_memstall_leave(),
> but I had already checked all the callers and found nothing
> suspicious; no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Yeah, this. I can't find anything obvious either. The trickiest one is
read-ahead though, I couldn't immediately track all the
readahead_expand() callers, any such caller must then end up calling
read_pages() in order to land on the psi_memstall_leave(). This is
typically through page_cache_ra_*().

The erofs one is also not entirely obvious, but irrelevant if you're not using
it... the below should make it a little more obvious, but what do I know.

(whitespace mangled)

--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1698,9 +1698,9 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
submit_bio(bio);
else
erofs_fscache_submit_bio(bio);
- if (memstall)
- psi_memstall_leave(&pflags);
}
+ if (memstall)
+ psi_memstall_leave(&pflags);

/*
* although background is preferred, no one is pending for submission.


> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
>
> I was thinking of something slightly different; something about the
> cgroup being deleted or a task being terminated and the bookkeeping of
> the PSI flags getting wrong, maybe some data race. I found the whole
> PSI code with per-task flags, per-cpu per-cgroup counters and flags
> somewhat obscure (but somebody else's code is always obscure, of
> course);

It really is a bit tricky, performance and all that.

> I thought there was a lot of potential for mistakes with the
> bookkeeping, but I found nothing specific.
>
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; the servers that go into this state will always
> appear overloaded and that would lead to killing all the workload
> processes forever.

Best case would be if you could somehow find a reproducer, but
I realize this might be tricky.