Re: [RFC] firmware coredump: add new firmware coredump class

From: Johannes Berg
Date: Wed Sep 03 2014 - 15:06:13 EST


On Wed, 2014-09-03 at 16:19 +0200, Daniel Vetter wrote:
> [super-embarrassing resend, the previous one contained html gunk.]
>
> If the idea is to also convert gpu crash dumps to this we should add
> dri-devel. And there the crashes are usually not due to firmware, but
> because the shaders and command batches userspace submitted have
> issues, so this should also be renamed to dev_coredump I think.

I don't know if the idea is to convert gpu crash dumps - I was just
wondering if you could and would want to use such a generic framework.
If the answer turns out to be no, that's perfectly reasonable I think.

However, renaming seems easy to do anyway :)

> On the overall design I wonder whether this shouldn't work more like a
> real core dump and dump to a real file. At least currently the dumps
> i915 creates are only useful as a general guide to where things went
> wrong, but if we actually want to submit them as traces to the
> hardware people we need to dump a _lot_ more. Otoh with the future of
> shared virtual address spaces between gpu/cpu we might just do a real
> core dump, so maybe this use case should be out of scope for your
> patch here.

I'm not really sure I'd want to actually sys_write() to a file here -
sounds like a big can of worms. If you have direct access (like shared
memory space) it seems we could still use the same mechanisms with the
coredumpm() method, no?

> On the logic itself I'm not sure whether the timeout is all that
> useful - at least in i915 our crash recovery works well enough that
> reporters often don't realize right away when it happened, but only
> later on when looking through logs to explain the tiny corruptions. If
> the crashdump has evaporated meanwhile that's not that useful.

Right. We might want to make it configurable, maybe even in Kconfig. I
was thinking that there would be userspace that would (automatically)
pick it up, and if such userspace doesn't exist or isn't running then
we'd want to free the memory eventually.
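
To make the timeout idea a bit more concrete, here is a rough userspace
model of it (not kernel code, and not what the patch does). The struct
and function names are made up for illustration, and I'm assuming a
timeout of zero or less would mean "keep forever", which is what a
Kconfig or runtime knob could select:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical model of one pending dump; all names are illustrative. */
struct dump_slot {
	void *data;        /* dump contents, NULL when the slot is empty */
	size_t len;
	time_t stored_at;  /* when the dump was recorded */
	double timeout_s;  /* assumption: <= 0 means "never expire" */
};

/*
 * Free the dump if nobody picked it up within the timeout.
 * Returns true if the dump was freed by this call.
 */
bool dump_expire(struct dump_slot *s, time_t now)
{
	assert(s != NULL);

	if (!s->data || s->timeout_s <= 0)
		return false;
	if (difftime(now, s->stored_at) < s->timeout_s)
		return false;

	free(s->data);
	s->data = NULL;
	s->len = 0;
	return true;
}
```

With the "never expire" setting the memory is only released when
something explicitly frees the dump, which would match the i915
behaviour described above.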

> Also, at least for gpus it's usually not interesting to grab
> subsequent dumps: Often the gpu is in a bad mood due to the first
> crash, or it's just a massive row of duplicated dumps. So in i915 we
> only record the first crash and keep it around forever. And tooling
> can still free it by writing to the file. This also ensures that we
> don't waste excessive amounts of memory with crash dumps.

Right, we discussed this but then I completely forgot. I think keeping
the first one is reasonable. If userspace has already picked it up
you'll still get multiple dumps, and may want a policy there as well.
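
As a sketch of what "keep the first one" could look like, again as a
plain userspace model with made-up names rather than anything from the
patch: a new dump is dropped while an earlier one is still pending, and
clearing the slot (standing in for userspace writing to the file)
allows the next one to be recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-slot dump storage; all names are illustrative. */
struct dump_slot {
	void *data;   /* NULL when no dump is pending */
	size_t len;
};

/* Record a dump only if no earlier one is pending. True if stored. */
bool dump_store_first(struct dump_slot *s, const void *buf, size_t len)
{
	assert(s != NULL && buf != NULL);

	if (s->data)
		return false;   /* first dump still around: drop this one */

	s->data = malloc(len);
	if (!s->data)
		return false;
	memcpy(s->data, buf, len);
	s->len = len;
	return true;
}

/* Models userspace freeing the dump, e.g. by writing to the file. */
void dump_clear(struct dump_slot *s)
{
	assert(s != NULL);

	free(s->data);
	s->data = NULL;
	s->len = 0;
}
```

Note that this still lets a device produce one dump after another as
long as userspace keeps collecting them, so any rate limiting would
have to be a separate policy on top.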

> And if we want to use this for i915 we need some way for tools to go
> from the i915 drm class device node to the error state, not just from
> the error state back to the device.

Interesting. That's probably not all that difficult to do (maybe even
set up a child/parent relationship?) but I actually wanted to avoid a
hard dependency since there may be cases where the failing device
disappears, e.g. in the case of USB. I have to think about this case
more, I guess.

johannes
