Re: [PATCH] driver: of: overlay: demote message to warning

From: Frank Rowand
Date: Fri Sep 16 2022 - 18:48:04 EST


On 9/12/22 19:51, Daniel Walker wrote:
> On Mon, Sep 12, 2022 at 03:32:31PM -0500, Frank Rowand wrote:
>> On 9/12/22 12:05, Daniel Walker wrote:
>>> On Mon, Sep 12, 2022 at 01:45:40AM -0500, Frank Rowand wrote:
>>>> On 9/8/22 12:55, Frank Rowand wrote:
>>>>> On 9/7/22 19:35, Daniel Walker wrote:
>>>>>> On Wed, Sep 07, 2022 at 06:54:02PM -0500, Frank Rowand wrote:
>>>>>>> On 9/7/22 18:07, Daniel Walker wrote:
>>>>>>>> This warning message shows by default on the vast majority of overlays
>>>>>>>> applied. Despite the text identifying this as a warning it is marked
>>>>>>>> with the loglevel for error. At Cisco we filter the loglevels to only
>>>>>>>> show error messages. We end up seeing this message but it's not really
>>>>>>>> an error.
>>>>>>>>
>>>>>>>> For this reason it makes sense to demote the message to the warning
>>>>>>>> loglevel.
>>>>>>>>
>>>>>>>> Cc: xe-linux-external@xxxxxxxxx
>>>>>>>> Signed-off-by: Daniel Walker <danielwa@xxxxxxxxx>
>>>>>>>> ---
>>>>>>>> drivers/of/overlay.c | 2 +-
>>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
>>>>>>>> index bd8ff4df723d..4ae276ed9a65 100644
>>>>>>>> --- a/drivers/of/overlay.c
>>>>>>>> +++ b/drivers/of/overlay.c
>>>>>>>> @@ -358,7 +358,7 @@ static int add_changeset_property(struct overlay_changeset *ovcs,
>>>>>>>> }
>>>>>>>>
>>>>>>>> if (!of_node_check_flag(target->np, OF_OVERLAY))
>>>>>>>> - pr_err("WARNING: memory leak will occur if overlay removed, property: %pOF/%s\n",
>>>>>>>> + pr_warn("WARNING: memory leak will occur if overlay removed, property: %pOF/%s\n",
>>>>>>>> target->np, new_prop->name);
>>>>>>>>
>>>>>>>> if (ret) {
>>>>>>>
>>>>>>> NACK
>>>>>>>
>>>>>>> This is showing a real problem with the overlay.
>>>>>>
>>>>>> What's the real problem ?
>>>>>>
>>>>>> Daniel
>>>>>
>>>>> A memory leak when the overlay is removed.
>>>>>
>>>>> I'll send a patch to update the overlay file in Documumentation/devicetree/ to provide
>>>>> more information about this. If you don't see a patch by tomorrow, feel free to
>>>>> ping me.
>>>>>
>>>>> -Frank
>>>>
>>>> The good news is that your question prodded me to start improving the in kernel documentation
>>>> of overlays. The promised patch is a rough start at:
>>>>
>>>> https://lore.kernel.org/all/20220912062615.3727029-1-frowand.list@xxxxxxxxx/
>>>>
>>>> The bad news is that what I wrote doesn't explain the memory leak in any more detail.
>>>> If an overlay adds a property to a node in the base device tree then the memory
>>>> allocated to do the add will not be freed when the overlay is removed. Since it is
>>>> possible to add and remove overlays multiple times, the ensuing size of the memory
>>>> leak is potentially unbounded.
>>>
>>> Isn't this only a problem if you remove the overlay?
>>
>> Yes, but we don't know if the overlay will be removed. And I will not accept a
>> change that suppresses the message if there is no expectation to remove the
>> overlay.
>
> I haven't researched the whole overlay system but there was one removal function
> that I noted, I think in the link you provided above, called
> of_overlay_remove(). It appears to call free_overlay_changeset() which calls kfree().

free_overlay_changeset() frees data structures related to the changeset. It does not
free memory allocated to the devicetree. The memory allocated to the devicetree is
freed as a result of the decreasing reference counts on the devicetree data.

>
> so your API seems to deal with freeing the memory. I would think the expectation is that
> people using the API would free the overlay thru your API.

Yes, the way to remove an overlay is of_overlay_remove() or of_overlay_remove_all().

>
> The only in tree usage of your API (besides the unit test) was drm/rcar-du which
> had no ability to remove the overlay that I can see. That component of the driver was
> removed several months ago.

rcar was an early overlay use that was grandfathered in for a long time until
we were finally able to remove it. It was not a model for how to use overlays.

>
>>>
>>> if the dt fixup driver does have the ability to remove the overlay doesn't it
>>> have responsibility to free the memory? Or is it impossible to free the memory?
>>
>> It is difficult due to architectural issues. Reference counting occurs at the node
>> level, and not at the property level. So memory related to properties is freed
>> when the corresponding overlay node reference count leads to the node being freed.
>
> How does of_overlay_remove() work then? It seems like it might not be possible
> to do overlay removal, but your code has removal functions. I also see this one
> of_overlay_remove_all() ..
>
> It seems like your API supports removal. Is there an issue where your API is
> maybe not complete or maybe doesn't currentily work ?

The overlay implementation is definitely not complete. For some details please see:

https://elinux.org/Device_Tree_Reference#Overlays

and:

https://elinux.org/Frank%27s_Evolving_Overlay_Thoughts

If you have not read those two links recently, _please_ read them for a better
understanding of the state of overlay implementation!


>
> Maybe you could add a flag or other indicator which would indicate the overlay will never be
> removed. Then your code could rely on this property to inform on if the author
> has consider the removal issues related to overlays.

No. I guess I wasn't clear enough above, where I said:

"And I will not accept a
change that suppresses the message if there is no expectation to remove the
overlay."

There are multiple reasons for this, but the most fundamental is that if a
new overlay is not removable, then any overlay already applied can not be
removed (because overlays must be removed in the reverse order that they
are applied). It would be incredibly bad architecture to allow an overlay
to block another overlay from being removed.

-Frank

>
> Daniel