Kernel crash due to memory corruption with v5.4.26-rt17 and PowerPC e500

From: Mark Marshall
Date: Mon May 04 2020 - 05:40:23 EST

Hi RT experts,

We are using the RT kernel with the PowerPC e500. Until recently we
were on the 4.19 kernel series, and are in the process of upgrading.
Since switching to the v5.4 version, we get a reproducible kernel
crash. The crashes all contain the "BUG: Bad rss-counter state" line,
and then after that it appears that a structure of type mm_struct or
vm_area_struct is corrupted.
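
For context: as far as we can tell, this message comes from a sanity
check that runs when an mm is finally freed, and it fires when one of
the per-type page counters of that mm is not zero at that point. The
sketch below is only a simplified userspace model of that kind of check,
not kernel code; NR_COUNTERS, struct mm_model and check_mm_model() are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define NR_COUNTERS 4   /* model value only, not taken from the kernel */

/* Hypothetical stand-in for the per-mm page counters. */
struct mm_model {
    long rss[NR_COUNTERS];
};

/*
 * Model of a teardown check: every counter must be back at zero when
 * the address space is freed. Returns the number of non-zero counters.
 */
static int check_mm_model(const struct mm_model *mm)
{
    int bad = 0;

    for (int i = 0; i < NR_COUNTERS; i++) {
        if (mm->rss[i] != 0) {
            fprintf(stderr, "bad counter (model) idx:%d val:%ld\n",
                    i, mm->rss[i]);
            bad++;
        }
    }
    return bad;
}
```

In this model a non-zero counter can mean either that the accounting
itself went wrong or that the memory holding the counters was
overwritten, which is why the message alone does not tell us which
structure is the victim.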

The easiest way we have found to reproduce the crash is to repeatedly
insert and then remove a module. The crash then appears to be related
to either paging in the module or exiting the mdev process. (The
crash also happens at other times, but then it is hard to reproduce
reliably.) This simple script will almost always crash:

for i in $(seq 1000) ; do echo $i ; modprobe crc7 ; rmmod crc7 ; done

(The crc7 module was chosen because it is small and simple. Any module
will trigger the crash.)

We have tried kernels v5.0, v5.2 and v5.6. The v5.0 and v5.2 kernels
do not show the problem. The v5.6 kernel does show the problem.
Switching off RT fixes the problem.

I have reduced the functionality in the kernel to a bare minimum
(removing networking, USB and PCI, as we have some out-of-tree patches
in those areas) and we still get the crash.

Here are a couple of example stack traces:

000: NIP [c003f8e0] __mmdrop+0x2c8/0x3dc
000: LR [c003f8e0] __mmdrop+0x2c8/0x3dc
000: Call Trace:
000: [e953fd48] [c003f8e0] __mmdrop+0x2c8/0x3dc (unreliable)
000: [e953fd88] [c00c6d28] rcu_core+0x324/0x78c
000: [e953fe58] [c00c79e0] rcu_cpu_kthread+0x1f4/0x42c
000: [e953fe98] [c00838fc] smpboot_thread_fn+0x2e8/0x488
000: [e953fef8] [c007d514] kthread+0x1b0/0x1b8
000: [e953ff38] [c001a26c] ret_from_kernel_thread+0x14/0x1c

000: NIP [c010cdd4] acct_collect+0x3a8/0x3e0
000: LR [c010cdd4] acct_collect+0x3a8/0x3e0
000: Call Trace:
000: [c6f2bbe0] [c010cdd4] acct_collect+0x3a8/0x3e0 (unreliable)
000: [c6f2bc10] [c0049354] do_exit+0x294/0xf9c
000: [c6f2bcf0] [c0013030] die+0x220/0x2c4
000: [c6f2bd30] [c00132cc] exception_common+0x1f8/0x238
000: [c6f2bd30] [c00132cc] exception_common+0x1f8/0x238
000: [c6f2bd70] [c0013404] _exception+0x34/0x80
000: [c6f2bd90] [c001a4a8] ret_from_except_full+0x0/0x4

I have added some debugging code where the mm_struct and
vm_area_struct have "poison" values at the start and the end, and
this seems to show that the vm_area_struct is getting corrupted, but
I'm not able to see where.
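
The guard-value idea can be sketched roughly as follows. This is a
userspace illustration only, not the actual debugging patch: struct
guarded_vma, the GUARD_* patterns and both helper functions are
hypothetical names, and the real vm_area_struct has many more fields.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Arbitrary guard patterns, chosen for this sketch only. */
#define GUARD_HEAD 0xA5A5A5A5u
#define GUARD_TAIL 0x5A5A5A5Au

/* Hypothetical stand-in for a structure with guards at both ends. */
struct guarded_vma {
    uint32_t guard_head;        /* must always read GUARD_HEAD */
    unsigned long vm_start;
    unsigned long vm_end;
    uint32_t guard_tail;        /* must always read GUARD_TAIL */
};

static void guarded_vma_init(struct guarded_vma *v,
                             unsigned long start, unsigned long end)
{
    v->guard_head = GUARD_HEAD;
    v->vm_start = start;
    v->vm_end = end;
    v->guard_tail = GUARD_TAIL;
}

/*
 * Returns 0 if both guards are intact. Otherwise returns a bitmask:
 * bit 0 set means the head guard is damaged, bit 1 the tail guard.
 * "where" names the call site so the first failing check can be found.
 */
static int guarded_vma_check(const struct guarded_vma *v, const char *where)
{
    int bad = 0;

    if (v->guard_head != GUARD_HEAD)
        bad |= 1;
    if (v->guard_tail != GUARD_TAIL)
        bad |= 2;
    if (bad)
        fprintf(stderr, "guard corrupted at %s: head=%#x tail=%#x\n",
                where, (unsigned)v->guard_head, (unsigned)v->guard_tail);
    return bad;
}
```

Calling the check at many points only narrows the window in which the
damage happened. It does not identify the writer, which matches what we
see: the guards tell us that the structure was overwritten, but not by
whom.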

We have switched on all of the debugging that we can, including
KASAN, and this shows nothing.

Can anyone help us? What can we try next? Is anyone using the e500
with the RT kernel? Does anyone have any idea how to debug problems
related to the error message "Bad rss-counter state"?

Any help or advice would be most gratefully received.

Many thanks,
Mark Marshall and Thomas Graziadei

PS. Thomas Graziadei (my colleague) did find a bug in the start_32.S
file for the e500, and we have the fix for that included. We have
also tried removing the LAZY_PREEMPTION patch completely, and this
doesn't help.
