Re: [Letux-kernel] [PATCH 5/5] input: twl6040-vibra: remove mutex
From: H. Nikolaus Schaller
Date: Wed Apr 20 2016 - 14:10:48 EST
Hi,
> Am 20.04.2016 um 19:49 schrieb Dmitry Torokhov <dmitry.torokhov@xxxxxxxxx>:
>
> On Wed, Apr 20, 2016 at 11:22:53AM +0200, H. Nikolaus Schaller wrote:
>>
>>> Am 19.04.2016 um 10:08 schrieb H. Nikolaus Schaller <hns@xxxxxxxxxxxxx>:
>>>
>>>
>>>> Am 19.04.2016 um 10:01 schrieb Dmitry Torokhov <dmitry.torokhov@xxxxxxxxx>:
>>>>
>>>> On Tue, Apr 19, 2016 at 09:49:01AM +0200, H. Nikolaus Schaller wrote:
>>>>>
>>>>>> Am 18.04.2016 um 23:20 schrieb Dmitry Torokhov <dmitry.torokhov@xxxxxxxxx>:
>>>>>>
>>>>>> On Mon, Apr 18, 2016 at 09:55:41PM +0200, H. Nikolaus Schaller wrote:
>>>>>>> The mutex does not seem to be needed.
>>>>>>
>>>>>> twl6040_vibra_suspend() and vibra_play_work() may run concurrently, no?
>>>>>
>>>>> Hm. I don't know about the rule that would give an answer to this question...
>>>>
>>>> Sorry, that was actually a statement, not really a question.
>>>
>>> Indeed. In doubt about the answer we should take measures for the worst case.
>>>
>>>> It is
>>>> possible (although very unlikely) that userspace posts play request and
>>>> workqueue will not run until after suspend callback.
>>>>
>>>> Thinking about it some more I wonder if we better do what
>>>> twl6040_vibra_close() does and cancel the work before shutting off the
>>>> device, so that there is no chance of work executing after suspend
>>>> callback and reenabling the device. This way we can indeed remove the
>>>> mutex.
>>>
>>> Ok, I am fine with this.
>>>
>>> Will post an update ASAP.
>>
>> While doing testing I did run again into another locking related issue which I
>> had not yet tried to address with my patch set. It is a:
>>
>> BUG: scheduling while atomic
>>
>> report which sometimes ends in a kernel panic.
>>
>> I have attached such a log. vibra.py is here:
>>
>> http://git.goldelico.com/?p=gta04-kernel.git;a=blob;f=Letux/root/vibra.py;hb=refs/heads/letux-4.6-rc4
>>
>> Basically it does an ioctl(EVIOCSFF) which triggers vibra_play.
>>
>> Maybe, can you decipher from the log what the reason could be?
>>
>> I only conjecture that it happens when vibra_play tries to read the regmap
>> of the twl6040 to get twl6040_get_vibralr_status (which has no pendant
>> in the twl4030 driver).
>>
>> Do we have to configure the twl6040 regmap differently?
>
> Right, vibra_play is called with interrupts disabled (that's why we have
> workqueue to enable/disable regulators, etc, when we start or stop
> vibration), so twl6040_get_vibralr_status() should not sleep, but
> apparently it does.
Yes, regmap using i2c communication may sleep.
> Maybe the check for audio configuration should be
> moved into vibra_play_work().
Hm. It is there to disable while in audio routing mode, but
a workqueue can't report error values back to the scheduling thread...
So we can either silently make vibra not work (or just report
in the kernel log) if "Vibra is configured for audio".
Or we need to get a different mechanism to know the vibra status.
Hm.
Research shows that it regmap was introduced long ago but between 3.11 and
3.12 a private cache for these control registers was removed.
http://lxr.free-electrons.com/source/drivers/mfd/twl6040.c?v=3.11#L66
http://lxr.free-electrons.com/source/drivers/mfd/twl6040.c?v=3.11#L85
http://lxr.free-electrons.com/source/drivers/mfd/twl6040.c?v=3.11#L462
This had been non-blocking and was probably there exactly to protect
against this locking issue!
After removing the private cache the code must rely on twl6040_get_vibralr_status
not fetching from the chip.
Ah, this correlates to that I see this issue only once and then everything
works.
This means we have to fetch the current vibra control registers once
outside of vibra_play(). Probably during probing by a single call to
twl6040_get_vibralr_status() and ignoring the result.
After it has been fetched once (to know any status from the last reboot)
the regmap should track all changes arriving through the sound subsystem
(audio vibra enable) and the call to twl6040_get_vibralr_status in interrupt
context should no longer block.
Does this sound reasonable?
BR and thanks,
Nikolaus