Re: NMI hardlock stacktrace deadlock [was Re: Linux 5.2-rc5]

From: Linus Torvalds
Date: Wed Jun 19 2019 - 16:56:17 EST


On Wed, Jun 19, 2019 at 12:19 PM Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
>
> > Do you have the oops itself at all?
>
> An example at
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/dmesg0.log
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6310/fi-kbl-x1275/boot0.log
>
> The bug causing the oops is clearly a driver problem. The rc5 fallout
> just seems to be because of some shrinker changes affecting some object
> reaping that were unfortunately still active. What perturbed the CI
> team was the machine failed to panic & reboot.

Hmm. It's hard to guess at the cause of that. The oopses themselves
don't look like they are happening in any particularly bad context, so
all the normal reboot-on-oops etc stuff _should_ work.

So it would help a lot if you could bisect the problem at least a
bit, if it is at all reproducible. Because with no other clues, it's
hard to even guess at what might be up.

The fact that you say "NMI watchdog firing as we dumped the ftrace"
suggests it might be some ftrace / stacktrace issue where the
dumping itself leads to some endless loop, but who knows.

For example, one thing that has happened during this development cycle
is the stacktrace common infrastructure changes (arch_stack_walk() and
friends). I'm not seeing why that would cause your issues, but I'm
adding a few random people for ftrace / stacktrace changes.

Linus