Re: [PATCH] firmware: add firmware to new device's devres list for second time cache

From: Kai-Heng Feng
Date: Fri Oct 06 2017 - 00:55:32 EST


On Thu, Oct 5, 2017 at 5:17 AM, Luis R. Rodriguez <mcgrof@xxxxxxxxxx> wrote:
> On Tue, Aug 22, 2017 at 03:52:46PM +0800, Kai-Heng Feng wrote:
>> Currently, firmware will only be chached if assign_firmware_buf() gets
>> called.
>
> True, but also more importantly we peg the fw cache to the device via devres
> *iff* the firmware actually was found. We do this so that we don't try to look
> for bogus firmwares or firmwares which we currently do not have on our
> filesystem (consider a driver that uses a series of revisions of firmwares).
>
>> When a device loses its power or a USB device gets plugged to another
>> port under suspend, request_firmware() can still find cached firmware,
>> but firmware name no longer associates with the new device's devres.
>> So next time the system suspend, those firmware won't be cached.
>
> Gah, its a bit more complicated than that. During suspend we call out to
> request firmware proactively for all firmwares in our fw cache. The
> fw cache is used simply to fetch the caches for firwmares during suspend
> so that on resume they exist to avoid races against the filesystem. It
> however uses the same functionality as the batched firwmare feature, which
> in turn is used to share one buf over different requests.
>
> If a device is unplugged its not clear to me why the old cache would
> not work for the new one as its all shared, so the only thing that I
> can think of is if the old device being disconnected is processed first,
> and therefore releases the old cache, so when the new device is processed
> it does not use the new cache. This should still not be an issue though,
> unless of course real races happen with the filesystem.

Because the new one's devres doesn't have the fw, i.e.
fw_add_devm_name() not gets called in this case.

>
> As of recently though we have had new findings which indicate that the
> old UMH lock was causing issues on resume on BT devices which *only*
> needed firmware on resume, but not on boot, so their first fw cache
> was not generated. That issue can resemble this one, in that no cache
> can be present, and a race happens on resume.
>
> The old UMH lock then was causing a failure on resume, and one of the
> solutions which could have worked was a proactive "hey set this cache
> up for me" even though the device didn't need the firmware. This
> is no longer needed given that the UMH lock stuff is gone from
> direct FS lookups and the issue should no longer be present.
>
> That is, Linus' revert of commit f007cad159e99fa2acd3b2e9364fbb32ad28b971
> ("Revert "firmware: add sanity check on shutdown/suspend") I believe
> should fix this issue.
>
> I'm actually inclined to remove the fw cache stuff as I no longer see
> the advantage of it given we are doing FS lookups and I see no races
> possible anymore.
>
>> Hence, we should add the firmware name to the devres when the firmware
>> is found in cache, to make the firmware cacheable next time.
>>
>> Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
>> ---
>> drivers/base/firmware_class.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
>> index bfbe1e154128..a99de34e3fdc 100644
>> --- a/drivers/base/firmware_class.c
>> +++ b/drivers/base/firmware_class.c
>> @@ -1177,6 +1177,10 @@ _request_firmware_prepare(struct firmware **firmware_p, const char *name,
>>
>> ret = fw_lookup_and_allocate_buf(name, &fw_cache, &buf, dbuf, size);
>>
>> + /* device might be a new one, add it to devres list */
>> + if (ret == 0 || ret == 1)
>> + fw_add_devm_name(device, name);
>> +
>
> Even if this was correct notice we have requests for which a FW cache is not
> desired -- see FW_OPT_NOCACHE, and the above does not respect it.

I think this is needed when it's not FW_OPT_NOCACHE.

>
> Given all the above, can you test with a kernel which has
> commit f007cad159e99fa2acd3b2e9364fbb32ad28b971 and tell me if you
> still see the issue?

It's fixed after f007cad159e99fa2acd3b2e9364fbb32ad28b971.
Again, thanks for your detailed explanation.


Kai-Heng,

>
> Luis