Re: [PATCH] firmware: stratix10-rsu: avoid RSU mutex deadlock on update

From: NG, TZE YEE

Date: Fri May 08 2026 - 02:57:38 EST

On 6/5/2026 7:40 am, Dinh Nguyen wrote:
> From sashiko claude-sonnet-4-6,
>
> On 5/4/26 21:51, tze.yee.ng@xxxxxxxxxx wrote:
>> From: Tze Yee Ng <tze.yee.ng@xxxxxxxxxx>
>>
>> Use mutex_trylock() in rsu_send_msg() instead of mutex_lock() so callers
>> that already hold priv->lock or race with an in-flight RSU operation get
>> -EAGAIN instead of blocking forever.
>>
>> In reboot_image_store(), treat -EAGAIN as a no-op success so sysfs write
>> does not hang when the RSU path is busy.
>>
>> Signed-off-by: Tze Yee Ng <tze.yee.ng@xxxxxxxxxx>
>> ---
>> drivers/firmware/stratix10-rsu.c | 15 ++++++++++++---
>> 1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/firmware/stratix10-rsu.c b/drivers/firmware/
>> stratix10-rsu.c
>> index e1912108a0fe..f3ee0b4e49a8 100644
>> --- a/drivers/firmware/stratix10-rsu.c
>> +++ b/drivers/firmware/stratix10-rsu.c
>> @@ -254,7 +254,13 @@ static void
>> rsu_async_get_spt_table_callback(struct device *dev,
>>    * is necessary to get RSU boot log or set the address of bitstream to
>>    * boot after reboot.
>>    *
>> - * Returns 0 on success or -ETIMEDOUT on error.
>> + * Return:
>> + * * 0 on success
>> + * * -EAGAIN if the driver mutex could not be acquired
>> + * * a negative errno from stratix10_svc_send() on failure to send
>> the request
>> + * * -ETIMEDOUT if waiting for the SMC callback times out
>> + * * a negative errno from
>> wait_for_completion_interruptible_timeout() if
>> + *   interrupted or otherwise signaled while waiting (for example -
>> ERESTARTSYS)
>>    */
>> static int rsu_send_msg(struct stratix10_rsu_priv *priv,
>>               enum stratix10_svc_command_code command,
>> @@ -264,7 +270,8 @@ static int rsu_send_msg(struct stratix10_rsu_priv
>> *priv,
>>       struct stratix10_svc_client_msg msg;
>>       int ret;
>> -    mutex_lock(&priv->lock);
>> +    if (!mutex_trylock(&priv->lock))
>> +        return -EAGAIN;
>
> The commit message says this prevents deadlocks from "callers that
> already hold priv->lock." Does such a caller actually exist?
>
> Looking at the driver, priv->lock is initialized in
> stratix10_rsu_probe() and is acquired only inside rsu_send_msg()
> itself. The completion callbacks (rsu_command_callback and friends)
> call complete() without taking priv->lock, so there is no reentrant
> acquisition path through callbacks either.
>
> Two concurrent sysfs writes would serialize correctly under the
> original mutex_lock() — the second writer would block until the first
> completed. That is the expected behavior for a one-shot firmware
> configuration command, not a deadlock.
>
>
You are right, there is no recursive deadlock path from a caller already
holding priv->lock, and that wording in my commit message is inaccurate.
The issue I am trying to address is a potential indefinite or blocking
wait in practice when an RSU transaction does not complete as expected,
which is observed as a sysfs write hang. In that situation, subsequent
sysfs writes queue behind mutex_lock() and appear hung from user space.
So the intent is to make reboot_image_store() non-blocking when RSU is
already busy by returning early on -EAGAIN, not to fix lock recursion.
I will correct the commit wording in v2 patch.>>
reinit_completion(&priv->completion);
>>       priv->client.receive_cb = callback;
>> @@ -597,7 +604,9 @@ static ssize_t reboot_image_store(struct device *dev,
>>       ret = rsu_send_msg(priv, COMMAND_RSU_UPDATE,
>>                  address, rsu_command_callback);
>> -    if (ret) {
>> +    if (ret == -EAGAIN) {
>> +        return 0;
>
> Two concerns here.
>
> First, does returning 0 violate the sysfs store callback contract?
> Sysfs store handlers are expected to return the number of bytes
> consumed on success, or a negative errno on error. Returning 0 tells
> kernfs that zero bytes were consumed, which is a short write. A
> POSIX-compliant write loop seeing 0 bytes returned on a non-zero
> write will retry indefinitely. The intended silent no-op should
> return count, not 0.
>
> Second, when mutex_trylock fails, COMMAND_RSU_UPDATE is never sent
> to the firmware, yet reboot_image_store() returns without an error.
> The operator's request to change the persistent boot image is
> silently dropped with no log message and no error to userspace.
>
> On the next reboot the device will boot the old image rather than
> the one that was written to the attribute.
>
> For a one-shot persistent firmware state change, is silently
> discarding the request the right tradeoff? Blocking in a sysfs
> store callback is permitted, so keeping the original mutex_lock()
> would let concurrent writers serialize correctly without any data
> loss.
>
> Dinh
I agree with both points.
Returning 0 from a sysfs store callback is incorrect here.
A successful sysfs store should return count; returning 0 can be treated
as a short write and may cause retry loops in userspace.
I also agree that silently dropping COMMAND_RSU_UPDATE when
mutex_trylock() fails is not acceptable for a one-shot persistent
boot-image update.
I will drop the reboot_image_store() special-case no-op success on
-EAGAIN in v2.

Thanks,
Regards,
Tze Yee