Re: [[RFC]PATCH] psi: fix race between psi_trigger_create and psimon

From: Zhaoyang Huang
Date: Mon May 17 2021 - 20:41:47 EST


On Tue, May 18, 2021 at 5:30 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Mon, May 17, 2021 at 12:33 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Mon, May 17, 2021 at 11:36 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > CC Suren
> >
> > Thanks!
> >
> > >
> > > On Mon, May 17, 2021 at 05:04:09PM +0800, Huangzhaoyang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> > > >
> > > > A race was detected between psimon_new and psimon_old, as shown below,
> > > > which causes a panic by accessing an invalid
> > > > psi_system->poll_wait->wait_queue_entry and
> > > > psi_system->poll_timer->entry->next. It is not necessary to reinit the
> > > > resources of psi_system in psi_trigger_create.
> >
> > resource of psi_system will not be reinitialized because
> > init_waitqueue_head(&group->poll_wait) and friends are initialized
> > only during the creation of the first trigger for that group (see this
> > condition: https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1119).
> >
> > > >
> > > > psi_trigger_create            psimon_new            psimon_old
> > > > init_waitqueue_head                                 finish_wait
> > > >                                                     spin_lock(lock_old)
> > > > spin_lock_init(lock_new)
> > > > wake_up_process(psimon_new)
> > > >
> > > >                               finish_wait
> > > >                               spin_lock(lock_new)
> > > >                               list_del              list_del
> >
> > Could you please clarify this race a bit? I'm having trouble
> > deciphering this diagram. I'm guessing psimon_new/psimon_old refer to
> > a new trigger being created while an old one is being deleted, so it
> > seems like a race between psi_trigger_create/psi_trigger_destroy. The
> > combination of trigger_lock and RCU should be protecting us from that
> > but maybe I missed something?
> > I'm excluding a possibility of a race between psi_trigger_create with
> > another existing trigger on the same group because the codepath
> > calling init_waitqueue_head(&group->poll_wait) happens only when the
> > first trigger for that group is created. Therefore if there is an
> > existing trigger in that group that codepath will not be taken.
>
> Ok, looking at the current code I think you can hit the following race
> when psi_trigger_destroy is destroying the last trigger in a psi group
> while racing with psi_trigger_create:
>
> psi_trigger_destroy                    psi_trigger_create
> mutex_lock(trigger_lock);
> rcu_assign_pointer(poll_task, NULL);
> mutex_unlock(trigger_lock);
>                                        mutex_lock(trigger_lock);
>                                        if (!rcu_access_pointer(group->poll_task)) {
>                                            timer_setup(poll_timer, poll_timer_fn, 0);
>                                            rcu_assign_pointer(poll_task, task);
>                                        }
>                                        mutex_unlock(trigger_lock);
>
> synchronize_rcu();
> del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
>                                 psi_trigger_create
>
> So, trigger_lock/RCU correctly protects destruction of
> group->poll_task but misses this race affecting poll_timer and
> poll_wait.
> Let me think if we can handle this without moving initialization into
> group_init().
Right, this is exactly what we hit during a monkey test on an Android
system, where psimon is repeatedly destroyed and recreated as the
psi_trigger is unreferenced and recreated. IMHO, poll_timer and poll_wait
should exist for the whole lifetime of the psi_group.
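
To illustrate the poll_wait side of the problem, here is a minimal userspace
sketch. It is not kernel code: the list helpers are simplified stand-ins for
the kernel's list_head API (no locking, no pointer poisoning), and
new_waiter_visible() is a hypothetical name used only for this example. It
models what happens to the wait queue when the head is re-initialized while
the old psimon is still linked on it.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert n right after head, like the kernel's list_add(). */
static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink e from whatever its neighbour pointers still refer to. */
static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/*
 * Hypothetical model of the sequence in the diagram. Returns true if the
 * wait queue head still sees the new waiter after the old waiter has
 * removed itself.
 */
bool new_waiter_visible(bool reinit_on_create)
{
	struct list_head poll_wait, old_waiter, new_waiter;

	init_list_head(&poll_wait);         /* first initialization */
	list_add(&old_waiter, &poll_wait);  /* psimon_old is waiting */
	assert(poll_wait.next == &old_waiter);

	if (reinit_on_create)
		init_list_head(&poll_wait); /* head reset, old entry still linked */

	list_add(&new_waiter, &poll_wait);  /* psimon_new starts waiting */
	list_del(&old_waiter);              /* psimon_old unlinks via stale pointers */

	return poll_wait.next == &new_waiter;
}
```

With reinit_on_create set to false the new waiter stays on the queue. With it
set to true the old waiter's stale pointers overwrite the head and the new
waiter is silently dropped. In the kernel the same pattern goes through a
re-initialized spinlock and freed memory, so the outcome can be a panic
rather than just a lost entry.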
>
> >
> > > >
> > > > Signed-off-by: ziwei.dai <ziwei.dai@xxxxxxxxxx>
> > > > Signed-off-by: ke.wang <ke.wang@xxxxxxxxxx>
> > > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@xxxxxxxxxx>
> > > > ---
> > > > kernel/sched/psi.c | 6 ++++--
> > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> > > > index cc25a3c..d00e585 100644
> > > > --- a/kernel/sched/psi.c
> > > > +++ b/kernel/sched/psi.c
> > > > @@ -182,6 +182,8 @@ struct psi_group psi_system = {
> > > >
> > > > static void psi_avgs_work(struct work_struct *work);
> > > >
> > > > +static void poll_timer_fn(struct timer_list *t);
> > > > +
> > > > static void group_init(struct psi_group *group)
> > > > {
> > > > int cpu;
> > > > @@ -201,6 +203,8 @@ static void group_init(struct psi_group *group)
> > > > memset(group->polling_total, 0, sizeof(group->polling_total));
> > > > group->polling_next_update = ULLONG_MAX;
> > > > group->polling_until = 0;
> > > > + init_waitqueue_head(&group->poll_wait);
> > > > + timer_setup(&group->poll_timer, poll_timer_fn, 0);
> > >
> > > This makes sense.
> >
> > Well, this means we initialize resources for triggers in each psi
> > group even if the user never creates any triggers. Current logic
> > initializes them when the first trigger in the group gets created.
> >
> > >
> > > > rcu_assign_pointer(group->poll_task, NULL);
> > > > }
> > > >
> > > > @@ -1157,7 +1161,6 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
> > > > return ERR_CAST(task);
> > > > }
> > > > atomic_set(&group->poll_wakeup, 0);
> > > > - init_waitqueue_head(&group->poll_wait);
> > > > wake_up_process(task);
> > > > timer_setup(&group->poll_timer, poll_timer_fn, 0);
> > >
> > > This now looks unnecessary?
> > >
> > > > rcu_assign_pointer(group->poll_task, task);
> > > > @@ -1233,7 +1236,6 @@ static void psi_trigger_destroy(struct kref *ref)
> > > > * But it might have been already scheduled before
> > > > * that - deschedule it cleanly before destroying it.
> > > > */
> > > > - del_timer_sync(&group->poll_timer);
> > >
> > > And this looks wrong. Did you mean to delete the timer_setup() line
> > > instead?
> >
> > I would like to get more details about this race before trying to fix
> > it. Please clarify.
> > Thanks!