Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

From: Max Kellermann
Date: Wed Jun 12 2024 - 02:49:28 EST


On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> Instead I think what might be happening is that the task is terminated
> while it's in memstall.

How is it possible to terminate a task that's in memstall?
That would have to happen between psi_memstall_enter() and
psi_memstall_leave(), but I had already checked all the callers and
found nothing suspicious: there is no obvious way to escape the section
without psi_memstall_leave(). In my understanding, it's impossible to
terminate a task that's currently stuck in the kernel. First, it needs
to leave the kernel and go back to userspace, doesn't it?

> I think if your theory was
> correct and psi_task_change() was called while task's cgroup is
> destroyed then task_psi_group() would have returned an invalid pointer
> and we would crash once that value is dereferenced.

I was thinking of something slightly different; something about the
cgroup being deleted or a task being terminated and the bookkeeping of
the PSI flags going wrong, maybe due to a data race. I found the whole
PSI code with per-task flags, per-cpu per-cgroup counters and flags
somewhat obscure (but somebody else's code is always obscure, of
course); I thought there was a lot of potential for mistakes with the
bookkeeping, but I found nothing specific.
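
To illustrate the kind of bookkeeping mistake meant here, a toy
userspace model follows. It is only a sketch and not the kernel's PSI
code: every toy_* name is made up, and the "buggy exit" path is a
hypothetical that has not been found in the real code. It only shows
that a per-task flag paired with a per-group counter leaves the counter
stuck above zero if the flag is ever dropped without the matching
decrement.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; all names are hypothetical, not kernel symbols. */
enum { TOY_MEMSTALL = 1 };

struct toy_group {
	unsigned int nr_memstall;	/* tasks currently flagged as stalled */
};

struct toy_task {
	unsigned int flags;		/* per-task state bits */
	struct toy_group *group;	/* group the task is accounted to */
};

/* Set the per-task flag and bump the group counter exactly once. */
static void toy_memstall_enter(struct toy_task *t)
{
	if (t->flags & TOY_MEMSTALL)
		return;
	t->flags |= TOY_MEMSTALL;
	t->group->nr_memstall++;
}

/* Clear the per-task flag and drop the group counter exactly once. */
static void toy_memstall_leave(struct toy_task *t)
{
	if (!(t->flags & TOY_MEMSTALL))
		return;
	t->flags &= ~TOY_MEMSTALL;
	t->group->nr_memstall--;
}

/*
 * Hypothetical buggy teardown: the task's state is wiped while the
 * flag is still set, so nobody ever decrements the group counter.
 */
static void toy_task_exit_buggy(struct toy_task *t)
{
	t->flags = 0;
	t->group = NULL;
}

/* Returns the group counter after one balanced and one leaked stall. */
static unsigned int toy_leak_demo(void)
{
	struct toy_group g = { 0 };
	struct toy_task a = { 0, &g };
	struct toy_task b = { 0, &g };

	toy_memstall_enter(&a);
	toy_memstall_leave(&a);
	assert(g.nr_memstall == 0);	/* balanced pair returns to zero */

	toy_memstall_enter(&b);
	toy_task_exit_buggy(&b);	/* flag lost without a decrement */

	return g.nr_memstall;		/* stays nonzero with no task stalled */
}
```

In this model, the group would look stalled forever even though no
task is, which matches the symptom of a counter that never returns to
zero.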

Anyway, thanks for looking into this - I hope we can get a grip on
this issue, as it's preventing me from using PSI values for actual
process management; the servers that go into this state will always
appear overloaded and that would lead to killing all the workload
processes forever.

Max