Re: arm: TI BeagleBoard X15 : Unable to handle kernel NULL pointer dereference at virtual address 00000369 - Internal error: Oops: 5 [#1] SMP ARM

From: Arnd Bergmann
Date: Fri Nov 11 2022 - 04:21:54 EST


On Fri, Nov 11, 2022, at 01:48, Dmitry Torokhov wrote:
> On Wed, Nov 9, 2022 at 2:20 PM Arnd Bergmann <arnd@xxxxxxxx> wrote:
>>
>> On Wed, Nov 9, 2022, at 13:57, Arnd Bergmann wrote:
>> >
>> > One thing that sticks out is the print_constraints_debug() function
>> > in the regulator framework, which uses a larger-than-average stack
>> > to hold a string buffer, and then calls into the low-level
>> > driver to get the actual data (regulator_get_voltage_rdev,
>> > _regulator_is_enabled). Splitting the device access out into a
>> > different function from the string handling might reduce the
>> > stack usage enough to stay just under the 8KB limit, though it's
>> > probably not a complete fix. I added the regulator maintainers
>> > to Cc for thoughts on this.
>>
>> I checked the stack usage for each of the 147 functions in the
>> backtrace, and as I was guessing print_constraints_debug() is
>> the largest, but it's still only 168 bytes, and everything else
>> is smaller, so no point hacking this.
>
> You mentioned that we are doing probing of a device 6 levels deep.
> Could one of the parent devices be marked for an asynchronous probe
> thus breaking the chain?

Ah right, I forgot that we already have a per-driver flag for this,
thanks a lot for the suggestion!

This means it might be as easy as this one-liner, picking
one of the drivers in the middle of the call chain that is
not shared across too many other systems:

diff --git a/drivers/mfd/palmas.c b/drivers/mfd/palmas.c
index 8b7429bd2e3e..f4a96eb98eea 100644
--- a/drivers/mfd/palmas.c
+++ b/drivers/mfd/palmas.c
@@ -731,6 +731,7 @@ static struct i2c_driver palmas_i2c_driver = {
 	.driver = {
 		.name = "palmas",
 		.of_match_table = of_palmas_match_tbl,
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
 	},
 	.probe = palmas_i2c_probe,
 	.remove = palmas_i2c_remove,

There is still a small regression risk for other OMAP platforms
that may rely on probe ordering, but it should reliably fix
the issue.
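
To illustrate why marking one driver in the middle of the chain
should help, here is a rough userspace sketch, not kernel code.
It only counts how deeply probe calls nest when each device probes
its child directly, versus when one device in the chain is deferred
and probed later from a fresh call stack. All names (sim_chain,
sim_probe, sim_max_probe_depth) are made up for this illustration,
and the model assumes a simple linear parent/child chain:

```c
#include <assert.h>

#define MAX_DEVS 16

/* Hypothetical model: device i is the parent of device i + 1. */
struct sim_chain {
	int ndevs;              /* number of devices in the chain */
	int async[MAX_DEVS];    /* nonzero: probing device i is deferred */
	int pending[MAX_DEVS];  /* deferred devices waiting to be probed */
	int npending;
	int max_depth;          /* deepest probe nesting seen so far */
};

/* Probe device idx at the given nesting depth, then handle its child. */
static void sim_probe(struct sim_chain *c, int idx, int depth)
{
	int child = idx + 1;

	if (depth > c->max_depth)
		c->max_depth = depth;

	if (child >= c->ndevs)
		return;

	if (c->async[child]) {
		/* Deferred: the child is probed later, not nested here. */
		c->pending[c->npending++] = child;
		return;
	}

	/* Synchronous: the child probe nests inside the parent probe. */
	sim_probe(c, child, depth + 1);
}

/*
 * Return the deepest probe nesting for a chain of ndevs devices.
 * async_idx selects the one device probed asynchronously, or -1
 * for a fully synchronous chain.
 */
int sim_max_probe_depth(int ndevs, int async_idx)
{
	struct sim_chain c = { 0 };
	int i;

	assert(ndevs > 0 && ndevs <= MAX_DEVS);
	c.ndevs = ndevs;
	if (async_idx >= 0 && async_idx < ndevs)
		c.async[async_idx] = 1;

	sim_probe(&c, 0, 1);

	/* Deferred probes start again from depth 1, like a new context. */
	for (i = 0; i < c.npending; i++)
		sim_probe(&c, c.pending[i], 1);

	return c.max_depth;
}
```

With six levels and no async device, sim_max_probe_depth(6, -1)
gives a nesting depth of 6; deferring the device at index 3 with
sim_max_probe_depth(6, 3) splits that into two chains of depth 3.
The real stack saving depends on the frame sizes in each half, which
this sketch does not model.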

There is a related idea that I'll try to take another look
at: since the bug only happens sometimes, and not at all on
mainline kernels with IRQ_STACK, I had the idea of making the
kernel stack size runtime configurable on mainline kernels, by
reserving a fixed amount of the 8KB total. This should make it
possible to narrow down the actual maximum stack usage before
a guaranteed crash, and then validate that a fix correctly
addresses it.
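
As a sketch of how such a knob could be used to narrow things down
(userspace C, purely illustrative): if a boot either survives or
crashes for a given number of reserved bytes, and survival is
monotonic in the reserve, a bisection finds the largest reserve that
still works, and the peak usage is roughly 8KB minus that value. The
callback type, find_max_reserve() and sim_survives() are hypothetical
names, and sim_survives() stands in for a real boot test. Since the
bug is intermittent, a real test would presumably need several runs
per step:

```c
#include <assert.h>
#include <stddef.h>

#define THREAD_STACK_BYTES 8192u	/* the 8KB total discussed above */

/* Hypothetical test hook: nonzero if the run survives this reserve. */
typedef int (*survives_fn)(unsigned int reserve, void *ctx);

/*
 * Find the largest reserve in [0, THREAD_STACK_BYTES] that survives,
 * assuming that survival at some reserve implies survival at every
 * smaller one. Returns -1 if even a reserve of 0 fails.
 */
long find_max_reserve(survives_fn survives, void *ctx)
{
	unsigned int lo = 0, hi = THREAD_STACK_BYTES;

	assert(survives != NULL);

	if (!survives(0, ctx))
		return -1;

	/* Invariant: lo survives, every reserve above hi fails. */
	while (lo < hi) {
		unsigned int mid = lo + (hi - lo + 1) / 2;

		if (survives(mid, ctx))
			lo = mid;
		else
			hi = mid - 1;
	}

	return (long)lo;
}

/* Stand-in for a boot test: ctx points to an assumed peak usage. */
int sim_survives(unsigned int reserve, void *ctx)
{
	unsigned int peak = *(unsigned int *)ctx;

	return reserve <= THREAD_STACK_BYTES &&
	       peak <= THREAD_STACK_BYTES - reserve;
}
```

For an assumed peak of 7000 bytes, find_max_reserve(sim_survives,
&peak) returns 1192, so the estimated maximum usage is 8192 - 1192 =
7000 bytes. A candidate fix could then be validated by checking that
the largest surviving reserve goes up.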

Arnd