Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac resume back

From: Jon Hunter
Date: Thu Mar 25 2021 - 04:01:46 EST



On 25/03/2021 07:53, Joakim Zhang wrote:
>
>> -----Original Message-----
>> From: Jon Hunter <jonathanh@xxxxxxxxxx>
>> Sent: March 24, 2021 20:39
>> To: Joakim Zhang <qiangqing.zhang@xxxxxxx>
>> Cc: netdev@xxxxxxxxxxxxxxx; Linux Kernel Mailing List
>> <linux-kernel@xxxxxxxxxxxxxxx>; linux-tegra <linux-tegra@xxxxxxxxxxxxxxx>;
>> Jakub Kicinski <kuba@xxxxxxxxxx>
>> Subject: Re: Regression v5.12-rc3: net: stmmac: re-init rx buffers when mac
>> resume back
>>
>>
>>
>> On 24/03/2021 12:20, Joakim Zhang wrote:
>>
>> ...
>>
>>> Sorry for this breakage at your side.
>>>
>>> You mean one of your boards? Do other boards with STMMAC work fine?
>>
>> We have two devices with the STMMAC; one works OK and the other fails.
>> They are different generations of device, so there could be some
>> architectural differences that cause this to be seen on only one device.
> It's really strange, but I also don't know what architectural differences could affect this. Sorry.


Maybe caching somewhere? In other words, could there be any cache
flushing that we are missing here?

>>> We do daily tests with NFS mounting the rootfs, and no issue was found. I added this
>>> patch in the resume path, with no error check, so this should not break suspend.
>>> I even did an overnight stress test, and no issue was found.
>>>
>>> Could you please do more testing to see where the issue happens?
>>
>> The issue occurs 100% of the time on the failing board and always on the first
>> resume from suspend. Is there any more debug I can enable to track down
>> what the problem is?
>>
>
> As the commit message describes, the patch aims to re-initialize the rx buffer addresses. Since the addresses are not fixed, I can only
> recycle and then re-allocate all of them. The page pool is allocated once, when the net device is opened.
>
> Could you please debug whether it fails in some function, such as page_pool_dev_alloc_pages()?


Yes, that was the first thing I tried, but there were no obvious failures
from allocating the pools.
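
To illustrate the kind of check being discussed, here is a minimal
userspace sketch in C. It is not the stmmac code: rx_entry,
rx_ring_reinit(), rx_ring_free() and test_alloc() are hypothetical names,
malloc()/free() stand in for the page pool, and the allocator hook only
plays the role of page_pool_dev_alloc_pages() so that a failure can be
injected. It shows the recycle-then-reallocate loop with the first failing
entry reported to the caller.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define RX_RING_SIZE 8 /* hypothetical ring size, for illustration only */

/* Hypothetical stand-in for one RX ring entry. */
struct rx_entry {
    void *buf;       /* models the page backing this descriptor */
    void *desc_addr; /* models the buffer address written to the descriptor */
};

/* Allocator hook (assumption): stands in for page_pool_dev_alloc_pages(). */
typedef void *(*alloc_fn)(size_t len);

/* Fault injection for the sketch: -1 means never fail, N >= 0 means
 * the allocator succeeds N more times and then returns NULL. */
int allocs_before_failure = -1;

void *test_alloc(size_t len)
{
    if (allocs_before_failure == 0)
        return NULL;
    if (allocs_before_failure > 0)
        allocs_before_failure--;
    return malloc(len);
}

/* Recycle every buffer and allocate a fresh one. Returns 0 on success,
 * or -(index + 1) for the first entry whose allocation failed. */
int rx_ring_reinit(struct rx_entry *ring, size_t n, size_t buf_len,
                   alloc_fn alloc)
{
    for (size_t i = 0; i < n; i++) {
        free(ring[i].buf); /* "recycle" the old buffer */
        ring[i].buf = NULL;
        ring[i].desc_addr = NULL;

        ring[i].buf = alloc(buf_len);
        if (!ring[i].buf) {
            fprintf(stderr, "rx reinit: alloc failed at entry %zu\n", i);
            return -(int)(i + 1);
        }
        ring[i].desc_addr = ring[i].buf; /* re-program the descriptor */
    }
    return 0;
}

/* Release every buffer still held by the ring. */
void rx_ring_free(struct rx_entry *ring, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        free(ring[i].buf);
        ring[i].buf = NULL;
        ring[i].desc_addr = NULL;
    }
}
```

With this shape, a failed allocation would show up as a non-zero return and
a log line naming the entry, which is the sort of "obvious failure" I was
looking for and did not see.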

Are you certain that the problem you are seeing, which is being fixed by
this change, is generic to all devices? The commit message states that
'descriptor write back by DMA could exhibit unusual behavior'. Is this a
known issue in the STMMAC controller? If so, does this impact all
versions, and what is the actual problem?

Jon

--
nvpublic