[regression] Bug 216856 - [ptdma] NULL pointer dereference in pt_cmd_callback during server shutdown
From: Thorsten Leemhuis
Date: Fri Dec 30 2022 - 03:27:15 EST
Hi, this is your Linux kernel regression tracker speaking.
I noticed a bug report in bugzilla.kernel.org that looks a lot like a
regression to my untrained eyes (it's not entirely clear). As many
(most?) kernel developer don't keep an eye on it, I decided to forward
it by mail. Quoting from
https://bugzilla.kernel.org/show_bug.cgi?id=216856 :
> Eric Pilmore 2022-12-27 22:23:50 UTC
>
> Observed kernel panic during host shutdown on a AMD (Milan CPU) based
> server. The issue ended up being a NULL pointer dereference in
> pt_cmd_callback() when
> called from pt_issue_pending(). If you follow the flow in
> pt_issue_pending() you will note that if pt_next_dma_desc() returns
> NULL, then engine_is_idle will remain as TRUE, including if
> pt_next_dma_desc() is still returning NULL in the 2nd call just prior to
> doing the call to pt_cmd_callback().
>
> The stack flow leading up to the panic was:
> dma_sync_wait() -> dma_async_issue_pending() -> pt_issue_pending() ->
> pt_cmd_callback()
>
> Temporarily I worked around the issue by simply changing the IF
> condition for the call to pt_cmd_callback() to also check for a non-NULL
> desc, i.e.
>
> if (engine_is_idle && desc)
> pt_cmd_callback(desc, 0);
>
> This resolved the issue for me, however I don't know enough about the
> driver or the context here to know if this is really the desirable fix,
> and so I'm submitting this bug rather than attempting to patch myself. I
> wasn't sure if the secondary pt_next_dma_desc() call was mistakenly
> leftover from the change that introduced the engine_is_idle variable or
> not. Note that vchan_issue_pending() will return a boolean as to whether
> there are any descriptors on the Issue list, i.e. active descriptors.
> So, maybe that could be used to qualify the need to take some action?
> Also, if pt_cmd_callback() is really going to start processing on the
> next descriptor, I wonder if it should be called under the chan->vc.lock
> lock. I'm not sure of the safety of this, but if you are peeking at
> descriptors on the Issue list that you might want to ensure they're
> protected from being accessed/removed by some other thread.
See the ticket for more details.
BTW, let me use this mail to also add the report to the list of tracked
regressions to ensure it's doesn't fall through the cracks:
#regzbot introduced: v6.1..v6.2-rc1
https://bugzilla.kernel.org/show_bug.cgi?id=216856
#regzbot title: ptdma: kernel panic during host shutdown
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.