Re: [PATCH v2 2/2] of: overlay: Synchronize of_overlay_remove() with the devlink removals

From: Nuno Sá
Date: Thu Feb 29 2024 - 04:44:24 EST


On Thu, 2024-02-29 at 09:39 +0100, Herve Codina wrote:
> In the following sequence:
>   1) of_platform_depopulate()
>   2) of_overlay_remove()
>
> During the step 1, devices are destroyed and devlinks are removed.
> During the step 2, OF nodes are destroyed but
> __of_changeset_entry_destroy() can raise warnings related to missing
> of_node_put():
>   ERROR: memory leak, expected refcount 1 instead of 2 ...
>
> Indeed, during the devlink removals performed at step 1, the removal
> itself releasing the device (and the attached of_node) is done by a job
> queued in a workqueue and so, it is done asynchronously with respect to
> function calls.
> When the warning is present, of_node_put() will be called but wrongly
> too late from the workqueue job.
>
> In order to be sure that any ongoing devlink removals are done before
> the of_node destruction, synchronize the of_overlay_remove() with the
> devlink removals.
>
> Fixes: 80dd33cf72d1 ("drivers: base: Fix device link removal")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Herve Codina <herve.codina@xxxxxxxxxxx>
> ---
>  drivers/of/overlay.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
> index 2ae7e9d24a64..99659ae9fb28 100644
> --- a/drivers/of/overlay.c
> +++ b/drivers/of/overlay.c
> @@ -853,6 +853,14 @@ static void free_overlay_changeset(struct
> overlay_changeset *ovcs)
>  {
>   int i;
>  
> + /*
> + * Wait for any ongoing device link removals before removing some of
> + * nodes. Drop the global lock while waiting
> + */
> + mutex_unlock(&of_mutex);
> + device_link_wait_removal();
> + mutex_lock(&of_mutex);

I'm still not convinced we need to drop the lock. What happens if someone else
grabs the lock while we are in device_link_wait_removal()? Can we guarantee that
we can't screw things badly?

The question is, do you have a system/use case where you can really see the
deadlock happening? Until I see one, I'm very skeptical about this. And if we
have one, I'm not really sure this is also the right solution for it.

- Nuno Sá