Re: drm/mgag200: doesn't work in panic context

From: Daniel Vetter
Date: Wed Jul 01 2015 - 05:59:46 EST


On Wed, Jul 1, 2015 at 9:26 AM, Rui Wang <rui.y.wang@xxxxxxxxx> wrote:
> On Tuesday, June 30, 2015 11:24 PM, Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
>> On Tue, Jun 30, 2015 at 9:23 AM, Rui Wang <rui.y.wang@xxxxxxxxx> wrote:
>> > But einj does something more than what an IPI can do, it injects hardware
>> > errors which trigger exceptions in NMI context... and the exception handler
>> > usually panics on fatal errors. And the display may be the only way to catch
>> > what has happened. I'm just hoping that the future version may work in
>> > NMI context.
>>
>> NMI sounds ... ambiguous ;-) But yeah if we can somehow inject
>> something as an NMI too then that would be even better. What I want to
>> avoid is forcing reboots, since that means you can't run a basic
>> modeset test afterwards to make sure nothing was trampled too badly.
>> Of course we'd have to replace the screen contents, but the important
>> part is that the panic handler doesn't touch anything if the driver is
>> in modeset code right now (because it'll massively increase the risk
>> of dying completely), and an easy way to check that it didn't step all
>> over modeset state unduly is to do a modeset afterwards. If that works
>> we'll be fine.
>>
>> Also with that approach we can make sure that no real errors get into
>> dmesg (as opposed to a real panic), which means we can capture dmesg
>> afterwards and if there is a serious log message (or even backtrace)
>> then drm panic handling has a bug.
>>
>> All that isn't possible when we force a real panic to happen.
>>
>> Actually thinking more about NMI that shouldn't be a problem. The
>> important thing with nmi vs. hardirq is that you can't even reliably
>> grab an irqsave spinlock, it's trylocks all the way down. But that
>> also holds for the panic handler, it's trylocks only. Could we somehow
>> just check that using lockdep - is there an NMI lockdep context
>> somewhere we could fake-grab? That's another upside of using an IPI
>> btw: Real panics kill lockdep ;-)
>
> Einj is supported by ACPI in combination with the hardware. The injected
> errors result in true MCEs, truly non-maskable. Lockdep might not be useful
> in this case. Corrected Errors (CEs) don't result in panic but I guess it
> might be possible to let it invoke your future mode-setting code for testing
> purposes, without rebooting. (Notice that MCEs can happen right from inside
> your mode-setting code while accessing any memory address)

Yeah NMI can happen anywhere and that's about the worst-case panic
context we have. The problem is that NMI bugs are a giant pain to
debug, so for testing I think it'd be better to just have a hardirq
context + the help of lockdep (if possible) to make sure we only do
try_lock and lockless stuff.
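
To make the "only try_lock" rule concrete, here is a minimal userspace
sketch. It is not kernel code and not taken from any driver: struct
fake_display, fake_display_init() and fake_panic_draw() are hypothetical
names, and a pthread mutex merely stands in for whatever lock the modeset
code would hold. The point is only the shape: the panic-style path never
blocks, and if the lock is busy it leaves the state alone.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for display state guarded by a modeset lock. */
struct fake_display {
	pthread_mutex_t modeset_lock;
	char screen[64];
};

static void fake_display_init(struct fake_display *d)
{
	int rc = pthread_mutex_init(&d->modeset_lock, NULL);

	assert(rc == 0);
	(void)rc; /* keeps NDEBUG builds free of unused-variable warnings */
	d->screen[0] = '\0';
}

/*
 * Panic-style path: must never sleep or spin waiting for a lock.
 * Returns true if the message was written, false if the lock was busy,
 * in which case the screen contents are left untouched.
 */
static bool fake_panic_draw(struct fake_display *d, const char *msg)
{
	if (pthread_mutex_trylock(&d->modeset_lock) != 0)
		return false; /* modeset in progress: back off */

	strncpy(d->screen, msg, sizeof(d->screen) - 1);
	d->screen[sizeof(d->screen) - 1] = '\0';
	pthread_mutex_unlock(&d->modeset_lock);
	return true;
}
```

A test would take modeset_lock, call fake_panic_draw() and check that it
returns false without changing screen, then drop the lock and check that
the next call succeeds.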

> But anyway we're not looking for a 100% working solution so if it could only
> work in normal irq or ipi context, it'd already be a big plus compared to
> what we have now.

NMI vs ipi vs other stuff is just about what's the best debug/testing
strategy. Most of the work there will really be in writing tons of
testcases to race the drm panic handler against drm modeset ioctls.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch