Re: [PATCH] soundwire: cadence: fix race condition between suspend and Slave device alerts
From: Pierre-Louis Bossart
Date: Fri Aug 21 2020 - 11:17:32 EST
cancel_work_sync() will either
a) wait until the current work completes, or
b) prevent a new one from starting.
there's no way to really 'abort' a workqueue, 'cancel' means either complete
or don't start.
Quite right, as that is how everyone deals with it. Stop the irq from
firing first and then wait until the work is cancelled or completed, hence
cancel_work_sync().
No, this cannot work... The work queue in progress will initiate
transactions which would never complete because the interrupts are disabled.
If you disable the interrupts and then cancel the work, you risk not
letting the work complete if it has already started (case a).
The race is:
a) the interrupt thread (this function) starts
b) the work is scheduled and starts
c) the suspend handler starts and disables interrupts in [1] below.
d) the work initiates transactions which will never complete since Cadence
interrupts have been disabled.
Would it not be better to let the work handle the case of interrupts
being disabled and not initiate transactions which won't complete here?
That sounds more reasonable than completing the work, which anyway
doesn't matter since you are suspending.
An in-progress workqueue has no notion that interrupts are disabled, nor
that the device is in the process of suspending. It writes into a FIFO
and blocks with wait_for_completion(). The complete() is done in an
interrupt thread, triggered when the RX FIFO reaches a watermark.
So if you disable interrupts, the complete() never happens.
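To make that concrete, here is a minimal sketch of the pattern described
above (the names are invented, this is not the cadence driver code): the
work function queues a command and blocks in wait_for_completion_timeout(),
and the interrupt thread calls complete() when the RX FIFO watermark fires.
With interrupts masked, nothing ever calls complete() and the wait can only
end in a timeout.

#include <linux/completion.h>
#include <linux/interrupt.h>
#include <linux/jiffies.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

/* hypothetical names, for illustration only */
static DECLARE_COMPLETION(example_tx_done);

static void example_xfer_work(struct work_struct *work)
{
	/* ... write the command into the TX FIFO (not shown) ... */

	/*
	 * Block until the interrupt thread signals that the response
	 * arrived. If the interrupts were masked after the command was
	 * queued, this can only end in a timeout.
	 */
	if (!wait_for_completion_timeout(&example_tx_done,
					 msecs_to_jiffies(500)))
		pr_err("transfer timed out\n");
}

static irqreturn_t example_irq_thread(int irq, void *dev_id)
{
	/* RX FIFO reached its watermark: wake up the waiter */
	complete(&example_tx_done);
	return IRQ_HANDLED;
}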
The safe way to do things with the current code is to let the workqueue
complete, then disable interrupts.
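As a rough sketch of that ordering in a suspend path (the function and
field names here are assumptions, not the actual cadence code), the idea
is simply to wait for any in-flight work before masking the interrupts:

#include <linux/workqueue.h>

/* hypothetical device structure, for illustration only */
struct example_dev {
	struct work_struct slave_work;
};

static void example_disable_interrupts(struct example_dev *dev)
{
	/* placeholder: mask the interrupt sources here */
}

static int example_suspend(struct example_dev *dev)
{
	/*
	 * Let an in-flight work item run to completion (or drop it if it
	 * has not started yet) *before* masking the interrupts, so that a
	 * transaction it initiated can still get its completion from the
	 * interrupt thread.
	 */
	cancel_work_sync(&dev->slave_work);

	/* only now is it safe to mask the interrupts */
	example_disable_interrupts(dev);

	return 0;
}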
We only disable interrupts on suspend; we don't need to test if
interrupts are enabled for every single byte that's transmitted on the
bus. Not to mention that this additional test would be racy as hell and
would require yet another synchronization primitive, making the code more
complicated.
So yes, the current transactions will complete and will be ignored, but
it's a lot better than trying to prevent these transactions from
happening with extra layers of complexity that will impact *every*
transaction.
BTW I looked at another alternative based on the msg lock, so that
interrupts cannot be disabled while a message is being sent. However,
because a workqueue may send multiple messages, e.g. to read registers
at non-contiguous locations, there is no way to guarantee how messages
and interrupt disabling are interleaved, so there would still be a case
of a workqueue not completing and being stuck on a mutex_lock(), which
is not so good either.
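For what it's worth, a rough sketch of why the per-message lock does not
close the race (names invented): a work item that reads several
non-contiguous registers takes and releases the lock once per message, so
the suspend path can grab the lock between two messages, disable the
interrupts, and leave the remaining messages either stuck on mutex_lock()
or doomed to time out.

#include <linux/mutex.h>

static DEFINE_MUTEX(example_msg_lock);	/* hypothetical per-message lock */

static void example_read_regs_work(void)
{
	/* first message, protected by the lock */
	mutex_lock(&example_msg_lock);
	/* send the first read message here */
	mutex_unlock(&example_msg_lock);

	/*
	 * Window: the suspend path can take example_msg_lock here,
	 * disable the interrupts, and release it again.
	 */

	/*
	 * Second message: blocks on mutex_lock() while suspend holds the
	 * lock, or times out because the interrupts are now masked.
	 */
	mutex_lock(&example_msg_lock);
	/* send the second read message here */
	mutex_unlock(&example_msg_lock);
}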
In short, this is the simplest way to fix the timeout on resume.