Re: workqueue: race in mod_delayed_work_on?

From: Konstantin Khlebnikov
Date: Thu May 12 2016 - 09:06:59 EST


On 10.05.2016 20:20, Konstantin Khlebnikov wrote:
On 10.05.2016 19:36, Tejun Heo wrote:
Hello,

On Tue, May 10, 2016 at 07:28:08PM +0300, Konstantin Khlebnikov wrote:
On 10.05.2016 11:21, Konstantin Khlebnikov wrote:
I've got plenty of warnings, BUGs and oopses around a trivial use of mod_delayed_work() in drivers/infiniband/core/addr.c

Looks like the problem in mod_delayed_work_on() was hidden because add_timer() is equal to mod_timer()

The timer usages are gated behind the PENDING bit, so whether add_timer()
is equal to mod_timer() shouldn't matter.

Hmm... this looks a little bit more complicated than one bit.

Yep, the problem was here: both the timer and the work item can be active at the same time.

So try_to_grab_pending() can return success for two concurrent callers:
the first wins via del_timer(), the second removes the work from the workqueue.
After that both try to add the timer, and one of them either hits the BUG_ON
or corrupts the timer list.
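
To make the interleaving concrete, here is a rough userspace toy model of it.
This is only a sketch, not the kernel code: the toy_* names and the two-flag
state are made up for illustration, and the real try_to_grab_pending() also
handles the PENDING bit, pool locking and retries, all of which are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state of one delayed work item; not the kernel's struct delayed_work. */
struct toy_dwork {
	bool timer_pending;  /* timer is armed */
	bool work_queued;    /* work item sits on a worklist */
};

/* Models only the two "steal" paths described above (hypothetical helper). */
static bool toy_try_to_grab_pending(struct toy_dwork *dw)
{
	if (dw->timer_pending) {        /* like del_timer() returning 1 */
		dw->timer_pending = false;
		return true;
	}
	if (dw->work_queued) {          /* like taking the work off the worklist */
		dw->work_queued = false;
		return true;
	}
	return false;
}

/* Models add_timer(): returns false where the kernel would hit BUG_ON. */
static bool toy_add_timer(struct toy_dwork *dw)
{
	if (dw->timer_pending)
		return false;
	dw->timer_pending = true;
	return true;
}

/* Replays the interleaving: A grabs, B grabs, A adds, B adds. */
static bool toy_second_add_hits_bug(void)
{
	/* The problematic state: timer and work are both active. */
	struct toy_dwork dw = { .timer_pending = true, .work_queued = true };
	bool a = toy_try_to_grab_pending(&dw);  /* A wins via the timer */
	bool b = toy_try_to_grab_pending(&dw);  /* B wins via the worklist */

	assert(a && b);                 /* both callers believe they own it */
	bool a_ok = toy_add_timer(&dw); /* A arms the timer */
	bool b_ok = toy_add_timer(&dw); /* B finds it already pending */
	return a_ok && !b_ok;
}
```

In this model toy_second_add_hits_bug() returns true: both grabs succeed, and
the second add finds the timer already pending.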

I see two possible fixes. One is to always remove both the timer and the work
in try_to_grab_pending(), but this must be carefully synchronized and will
certainly make it slower. The other is to always use mod_timer() in
__queue_delayed_work(): both callers would then modify the timer,
but there is no mod_timer_on().
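
The difference the second option relies on can be sketched with another small
userspace model. Again this is an assumption-laden toy, not kernel code: the
toy_* names are invented, and toy_add_timer() takes the expiry as a parameter
only to keep the sketch short. It mirrors just the fact that add_timer()
refuses an already pending timer (BUG_ON in the kernel) while mod_timer()
re-arms it and reports whether it was pending.

```c
#include <stdbool.h>

/* Toy timer; not the kernel's struct timer_list. */
struct toy_timer {
	bool pending;
	unsigned long expires;
};

/* add_timer()-like: refuses a pending timer (the kernel has BUG_ON here). */
static bool toy_add_timer(struct toy_timer *t, unsigned long expires)
{
	if (t->pending)
		return false;           /* would be BUG_ON(timer_pending(timer)) */
	t->expires = expires;
	t->pending = true;
	return true;
}

/* mod_timer()-like: re-arms unconditionally, returns 1 if it was pending. */
static int toy_mod_timer(struct toy_timer *t, unsigned long expires)
{
	int was_pending = t->pending;

	t->expires = expires;
	t->pending = true;
	return was_pending;
}
```

With two callers going through toy_mod_timer() the second call just moves the
expiry, whereas a second toy_add_timer() is rejected.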



but Sasha accidentally backported 874bbfe600a660cba9c776b3957b1ce393151b76
(workqueue: make sure delayed work run in local cpu) into 3.18.25

I don't see a reason why that commit could break delayed work;
most likely it highlighted some other problem.

What are you running? Can you reproduce the issue on an upstream kernel?


This is a slightly patched 3.18.y. It looks like this started when we upgraded the kernel to 3.18.25 and
somebody loaded the ib_addr module (IP in InfiniBand or something), which is actually unused
because these machines have no InfiniBand at all. But this code is sometimes poked from Ethernet ARP,
so it crashes somewhere from time to time. I'll try to stress-test this piece.





--
Konstantin