Re: Regression: memory corruption on Atmel SAMA5D31
From: Peter Rosin
Date: Fri Mar 04 2022 - 07:38:42 EST
Hi!
On 2022-03-04 12:12, Tudor.Ambarus@xxxxxxxxxxxxx wrote:
> Hi, Peter!
>
> On 3/4/22 12:57, Peter Rosin wrote:
>> EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
>>
>> On 2022-03-04 07:57, Peter Rosin wrote:
>>> On 2022-03-04 04:55, Saravana Kannan wrote:
>>>> On Thu, Mar 3, 2022 at 1:17 AM Peter Rosin <peda@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 2022-03-03 04:02, Saravana Kannan wrote:
>>>>>> On Wed, Mar 2, 2022 at 4:29 PM Peter Rosin <peda@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> I'm seeing a weird problem, and I'd like some help with further
>>>>>>> things to try in order to track down what's going on. I have
>>>>>>> bisected the issue to
>>>>>>>
>>>>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>>>
>>>>>> I skimmed through your email and I'll read it more closely tomorrow,
>>>>>> but it wasn't clear if you see this on Linus's tip of the tree too.
>>>>>> Asking because of:
>>>>>> https://lore.kernel.org/lkml/20210930085714.2057460-1-yangyingliang@xxxxxxxxxx/
>>>>>>
>>>>>> Also, a couple of other data points that _might_ help. Try kernel
>>>>>> command line option fw_devlink=permissive vs fw_devlink=on (I forget
>>>>>> if this was the default by 5.10) vs fw_devlink=off.
>>>>>>
>>>>>> I'm expecting "off" to fix the issue for you. But if permissive vs on
>>>>>> shows a difference driver issues would start becoming a real
>>>>>> possibility.
>>>>>>
>>>>>> -Saravana
>>>>>
>>>>> Thanks for the quick reply! I don't think I tested the very tip of
>>>>> Linus tree before, only latest rc or something like that, but now I
>>>>> have. I.e.
>>>>>
>>>>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>>>>
>>>>> It would have been typical if an issue that existed for a couple of
>>>>> years had been fixed the last few weeks, but alas, no.
>>>>>
>>>>> On that kernel, and with whatever the default fw_devlink value is, the
>>>>
>>>> It's fw_devlink=on by default from at least 5.12-rc4 or so.
>>>>
>>>>> issue is there. It's a bit hard to tell if the incident probability
>>>>> is the same when trying fw_devlink arguments, but roughly so, and I
>>>>> do not have to wait for long to get a bad hash with the first
>>>>> reproducer
>>>>>
>>>>> while :; do cat testfile | sha256sum; done
>>>>>
>>>>> The output is typical:
>>>>> 78464c59faa203413aceb5f75de85bbf4cde64f21b2d0449a2d72cd2aadac2a3 -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> e03c5524ac6d16622b6c43f917aae730bc0793643f461253c4646b860c1a7215 -
>>>>> 1b8db6218f481cb8e4316c26118918359e764cc2c29393fd9ef4f2730274bb00 -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 7d60bf848911d3b919d26941be33c928c666e9e5666f392d905af2d62d400570 -
>>>>> 212e1fe02c24134857ffb098f1834a2d87c655e0e5b9e08d4929f49a070be97c -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 7e33e751eb99a0f63b4f7d64b0a24f3306ffaf7c4bc4b27b82e5886c8ea31bc3 -
>>>>> d7a1f08aa9d0374d46d828fc3582f5927e076ff229b38c28089007cd0599c645 -
>>>>> 4fc963b7c7b14df9d669500f7c062bf378ff2751f705bb91eecd20d2f896f6fe -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>> 9360d886046c12d983b8bc73dd22302c57b0aafe58215700604fa977b4715fbe -
>>>>> 4f9173f63cb2e13d1470e59e1b5c657f3b0f4f2e9a55ab6facffbb03f34ce04d -
>>>>>
>>>>> Setting fw_devlink=off makes no difference, AFAICT.
>>>>
>>>> By this, I'm assuming you set fw_devlink=off in the kernel command
>>>> line and you still saw the corruption.
>>>
>>> Yes. On a bad kernel it's the same with all of the following kernel
>>> command lines.
>>>
>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=on ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>
>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=off ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>
>>> console=ttyS0,115200 rw oops=panic panic=30 fw_devlink=permissive ip=none root=ubi0:rootfs ubi.mtd=6 rootfstype=ubifs noinitrd mtdparts=atmel_nand:256k(at91bootstrap),384k(barebox),256k@768k(bareboxenv),256k(bareboxenv2),128k@1536k(oftree),5M@2M(kernel),248M@8M(rootfs),-@256M(ovlfs)
>>>
>>>> If that's the case, I can't see how this could possibly have anything
>>>> to do with:
>>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>>>
>>>> If you look at fw_devlink_link_device(), you'll see that the function
>>>> is NOP if fw_devlink=off (the !fw_devlink_flags check). And from
>>>> there, the rest of the code in the series doesn't run because more
>>>> fields wouldn't get set, etc. That pretty much disables ALL the code
>>>> in the entire series. The only remaining diff would be header file
>>>> changes where I add/remove fields. But that's unlikely to cause any
>>>> issues here because I'm either deleting fields that aren't used or
>>>> adding fields that won't be used (with fw_devlink=off). I think the
>>>> patch was just causing enough timing changes that it's masking the
>>>> real issue.
>>>
>>> When I compare fw_devlink_link_device() from before and after
>>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>> I notice that you also removed an unconditional call to
>>> device_link_add_missing_supplier_links() that was live before,
>>> regardless of any fw_devlink parameter.
>>>
>>> I don't know if that's relevant. Is it?
>>>
>>> Not knowing this code at all, and without any serious attempt
>>> at reading it, from here the comment of that removed function
>>> sure looks like it might cause a different ordering before and
>>> after the patch that is not restored with any fw_devlink
>>> argument.
>>
>> It appears that the device_link_add_missing_supplier_links() difference
>> is not relevant after all. What actually happened in the header file in
>> the "bad" commit was that two fields were removed (none added). Like so:
>>
>> struct dev_links_info {
>> struct list_head suppliers;
>> struct list_head consumers;
>> - struct list_head needs_suppliers;
>> struct list_head defer_sync;
>> - bool need_for_probe;
>> enum dl_dev_state status;
>> };
>>
>> If I restore those fields on a bad kernel, the issue is no longer
>> visible. That is true for the first bad kernel, i.e.
>>
>> f9aa460672c9 ("driver core: Refactor fw_devlink feature")
>>
>> and for tip of Linus as of recently, i.e.
>>
>> 5859a2b19911 ("Merge branch 'ucount-rlimit-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
>>
>> Which is of course insane and a whole different level of bad. WTF!?!
>>
>> I wonder if I can dig out the old SAMA5D31 evaluation kit and reproduce
>> there? I think that's next on the list...
>>
>
> I have a sama5d3_xplained that uses a SAMA5D36 and has a 256MBytes DDR2 and a
> 256MBytes NAND Flash. I tried a test with a 200MB file, rootfs on sdcard and
> I couldn't reproduce the bug. I'm using Linus's latest kernel:
> 38f80f42147f (HEAD, origin/master, origin/HEAD) MAINTAINERS: Remove dead patchwork link
>
> root@sama5d3-xplained-sd:~# dd if=/dev/urandom of=testfile bs=1024 count=200000
> 200000+0 records in
> 200000+0 records out
> 204800000 bytes (205 MB, 195 MiB) copied, 37.6424 s, 5.4 MB/s
> root@sama5d3-xplained-sd:~# for i in 1 2 3 4 5 6 7 8; do cat testfile | sha256sum; done
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> 2a4f1534aec6ace9d68f2f42fa28c1f1fe7bd281f960f2218797557aa41fe8de -
> root@sama5d3-xplained-sd:~#
>
> I'll put the rootfs on NAND and try to retest. Maybe to do some other tests
> in parallel to have more interrupts on the system. Will let you know if I can
> reproduce the bug on sama5d3_xplained.
Thanks for testing!
Since you (probably) don't have the interrupt source from the USB
serial chip that I have, that is not completely unexpected.
$ lsusb
Bus 001 Device 002: ID 0403:6011 Future Technology Devices International, Ltd FT4232H Quad HS USB-UART/FIFO IC
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
$ cat /sys/bus/usb-serial/devices/ttyUSB?/latency_timer
1
1
1
1
Also, your file is perhaps too small? You leave approx 50MB for the
system, so it might be the case that the page cache can hold the whole
file?
So, can you please try that again with a slightly bigger file or if you
restrict how much RAM you allow the kernel to see?
And if you don't have the FTDI usb-serial chip, you should probably go
with the other reproducer, namely to simply copy the random file to a
different host using scp.
Thanks again!
Cheers,
Peter