Re: [PATCH v2] kernel/kthread.c: kthread_worker: add work status check in timer_fn
From: Petr Mladek
Date: Fri Sep 25 2020 - 09:33:51 EST
On Fri 2020-09-25 20:38:21, Hillf Danton wrote:
>
> On Fri, 25 Sep 2020 11:30:46 +0200 Petr Mladek wrote:
> >
> > On Fri 2020-09-25 13:07:59, qiang.zhang@xxxxxxxxxxxxx wrote:
> > > From: Zqiang <qiang.zhang@xxxxxxxxxxxxx>
> > >
> > > When queue delayed work to worker, at some point after that the timer_fn
> > > will be call, add work to worker's work_list, at this time, the work may
> > > be cancel, so add "work->canceling" check current work status.
> >
> > Great catch!
> >
> > I was able to understand the problem from the description. Though I
> > would still try to improve it a bit. I suggest:
> >
> > <new_text>
> > Subject: kthread_worker: Prevent queuing delayed work from timer_fn when it is being canceled
> >
> > There is a small race window when a delayed work is being canceled and
> > the work still might be queued from the timer_fn:
> >
> > CPU0 CPU1
> >
> > kthread_cancel_delayed_work_sync()
> > __kthread_cancel_work_sync()
> > __kthread_cancel_work()
> > work->canceling++;
> >
> > kthread_delayed_work_timer_fn()
> > kthread_insert_work();
> >
> > BUG: kthread_insert_work() should not get called when work->canceling
> > is set.
>
> Seems like the diagram above can't cover the case that the timer fired
> and started acquiring the lock half a tick before here came the cancel
> that took the lock then set the canceling mark.
> Nor is it a bug given that cancel is always flushing the current work.
This is the same as:
CPU0 CPU1
kthread_delayed_work_timer_fn()
kthread_insert_work();
kthread_cancel_delayed_work_sync()
It just shows that kthread_cancel_delayed_work_sync() can't stop work
that is already being proceed. In this case, it has to wait until it
finishes.
By other words, this patch can't fix any real problem with timing.
kthread_cancel_delayed_work_sync() only guarantees that the work
in neither queued nor running when it finishes.
But I still think that the patch makes sense. work->lock synchronizes
manipulation of the work state. And it is wrong to queue the work
when canceling flag is set.
By other words. The bug was harmless. But it still was a bug.
Best Regards,
Petr