Re: [PATCH] psi: fix psi state corruption when schedule() races with cgroup move
From: Peter Zijlstra
Date: Tue May 04 2021 - 07:25:16 EST
On Mon, May 03, 2021 at 01:49:17PM -0400, Johannes Weiner wrote:
> 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> introduced a race condition that corrupts internal psi state. This
> manifests as kernel warnings, sometimes followed by bogusly high IO
> pressure:
>
> psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
> (schedule() decreasing RUNNING and ONCPU, both of which are 0)
>
> psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
> (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)
>
> What the offending commit does is batch the two psi callbacks in
> schedule() to reduce the number of cgroup tree updates. When prev is
> deactivated and removed from the runqueue, nothing is done in psi at
> first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
> updated along with TSK_ONCPU.
>
> However, the deactivation and the task switch inside schedule() aren't
> atomic: pick_next_task() may drop the rq lock for load balancing. When
> this happens, cgroup_move_task() can run after the task has been
> physically dequeued, but the psi updates are still pending. Since it
> looks at the task's scheduler state, it doesn't move everything to the
> new cgroup that the task switch that follows is about to clear from
> it. cgroup_move_task() will leak the TSK_RUNNING count in the old
> cgroup, and psi_sched_switch() will underflow it in the new cgroup.
>
> A similar thing can happen for iowait. TSK_IOWAIT is usually set when
> a p->in_iowait task is dequeued, but again this update is deferred to
> the switch. cgroup_move_task() can see an unqueued p->in_iowait task
> and move a non-existent TSK_IOWAIT. This results in the inconsistent
> task state warning, as well as a counter underflow that will result in
> permanent IO ghost pressure being reported.
>
> Fix this bug by making cgroup_move_task() use task->psi_flags instead
> of looking at the potentially mismatching scheduler state.
>
> [ We used the scheduler state historically in order to not rely on
> task->psi_flags for anything but debugging. But that ship has sailed
> anyway, and this is simpler and more robust.
>
> We previously already batched TSK_ONCPU clearing with the
> TSK_RUNNING update inside the deactivation call from schedule(). But
> that ordering was safe and didn't result in TSK_ONCPU corruption:
> unlike most places in the scheduler, cgroup_move_task() only checked
> task_current() and handled TSK_ONCPU if the task was still queued. ]
>
> Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Thanks! queued for sched/urgent.