Re: Bad psi_group_cpu.tasks[NR_MEMSTALL] counter

From: Suren Baghdasaryan
Date: Wed Jun 12 2024 - 11:17:54 EST


On Tue, Jun 11, 2024 at 11:49 PM Max Kellermann
<max.kellermann@xxxxxxxxx> wrote:
>
> On Wed, Jun 12, 2024 at 7:01 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > Instead I think what might be happening is that the task is terminated
> > while it's in memstall.
>
> How is it possible to terminate a task that's in memstall?
> This must happen between psi_memstall_enter() and psi_memstall_leave(),
> but I have already checked all the callers and found nothing
> suspicious; there is no obvious way to escape the section without
> psi_memstall_leave(). In my understanding, it's impossible to
> terminate a task that's currently stuck in the kernel. First, it needs
> to leave the kernel and go back to userspace, doesn't it?

Doh! I assumed this could happen, but indeed it should not unless
psi_memstall_enter()/psi_memstall_leave() are unbalanced. My bad.

Since the issue is hard to reproduce, maybe you could add debugging
code that stores _RET_IP_ inside the task_struct at the end of
psi_memstall_enter() and clears it inside psi_memstall_leave(). Then
in do_exit() you check whether it is still set and, if so, generate a
warning reporting the recorded _RET_IP_. That should point us at which
psi_memstall_enter() is missing its psi_memstall_leave().
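
Roughly something like the below (untested sketch; the task_struct
field name is made up, and the WARN() placement in do_exit() is just
one option):

/* include/linux/sched.h: extra debug field in struct task_struct */
#ifdef CONFIG_PSI
	unsigned long	memstall_enter_ip; /* _RET_IP_ of last psi_memstall_enter() */
#endif

/* kernel/sched/psi.c */
void psi_memstall_enter(unsigned long *flags)
{
	...
	/* record which caller entered memstall */
	current->memstall_enter_ip = _RET_IP_;
}

void psi_memstall_leave(unsigned long *flags)
{
	...
	/* balanced leave clears the record again */
	current->memstall_enter_ip = 0;
}

/* kernel/exit.c: somewhere in do_exit() */
	WARN(current->memstall_enter_ip,
	     "exiting while in memstall, psi_memstall_enter() called from %pS\n",
	     (void *)current->memstall_enter_ip);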

>
> > I think if your theory was
> > correct and psi_task_change() was called while task's cgroup is
> > destroyed then task_psi_group() would have returned an invalid pointer
> > and we would crash once that value is dereferenced.
>
> I was thinking of something slightly different: the cgroup being
> deleted or a task being terminated and the bookkeeping of the PSI
> flags going wrong, maybe through some data race. I find the whole PSI
> code, with its per-task flags and per-cpu per-cgroup counters and
> flags, somewhat obscure (but somebody else's code is always obscure,
> of course); I thought there was a lot of potential for bookkeeping
> mistakes, but I found nothing specific.
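
The invariant the bookkeeping has to maintain is actually pretty
simple: the per-task flag and the per-cgroup counter must always move
together. Here is a standalone toy model (not the real kernel code,
just an illustration) of how a single unbalanced enter/leave leaves
the counter stuck, which is exactly what a bad tasks[NR_MEMSTALL]
looks like:

#include <assert.h>
#include <stdio.h>

/* toy stand-ins for the per-cgroup counter and the per-task flag */
struct cgroup { int nr_memstall; };
struct task { int in_memstall; struct cgroup *cg; };

static void memstall_enter(struct task *t)
{
	assert(!t->in_memstall);	/* flag and counter move together */
	t->in_memstall = 1;
	t->cg->nr_memstall++;
}

static void memstall_leave(struct task *t)
{
	assert(t->in_memstall);
	t->in_memstall = 0;
	t->cg->nr_memstall--;
}

int main(void)
{
	struct cgroup cg = { 0 };
	struct task t = { 0, &cg };

	memstall_enter(&t);
	/* if the task went away here without memstall_leave()... */
	printf("nr_memstall = %d\n", cg.nr_memstall);	/* ...this stays 1 forever */
	return 0;
}
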
>
> Anyway, thanks for looking into this - I hope we can get a grip on
> this issue, as it's preventing me from using PSI values for actual
> process management; servers that get into this state always appear
> overloaded, which would lead to killing all the workload processes
> forever.
>
> Max