Re: [RFC 0/5] kernel: backtrace unwind support

From: Ingo Molnar
Date: Fri Feb 10 2012 - 14:44:49 EST



* Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:

> On Fri, Feb 10, 2012 at 10:59:51AM -0800, Linus Torvalds wrote:
> > On Fri, Feb 10, 2012 at 9:43 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> > >
> > > So I CC'ed Linus who has a strong opinion here, jejb since he's the
> > > one that told me several times there's a number of literate dwarfs
> > > already in the kernel and Jan because I think it was him that tried
> > > last on x86.
> >
> > I never *ever* want to see this code ever again.
> >
> > Sorry, but last time was too f*cking painful. The whole (and *only*)
> > point of unwinders is to make debugging easy when a bug occurs. But
> > the f*cking dwarf unwinder had bugs itself, or our dwarf information
> > had bugs, and in either case it actually turned several "trivial" bugs
> > into a total undebuggable hell.
> >
> > It was made doubly painful by the developers involved then several
> > times ignoring the problem, and claiming the code was bug-free when it
> > clearly wasn't, or trying to claim that the problem was that we set up
> > some random dwarf information wrong, when THAT GOES WITHOUT SAYING
> > (since dwarf is a complex mess that never gets any actual testing
> > except when things go wrong - at which point the code had better work
> > regardless of whether the dwarf info was correct or not).
> >
> > So no. An unwinder that is several hundred lines long is simply not
> > even *remotely* interesting to me.
> >
> > If you can mathematically prove that the unwinder is correct - even in
> > the presence of bogus and actively incorrect unwinding information -
> > and never ever follows a bad pointer, I'll reconsider.
> >
> > In the absence of that, just follow the damn chain on the stack
> > *without* the "smarts" of an inevitably buggy piece of crap.
>
> "Vote for --fno-omit-frame-pointer! One register is a cheap
> price to pay for not going insane!"
>
> /me goes back to non political things.

Well, instead of dropping it we could try to meet Linus's
challenge, at least to a fair degree.
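
To make the "never ever follows a bad pointer" requirement
concrete, here is a minimal sketch of a frame-pointer walk over a
simulated stack. Everything in it is an assumption for
illustration: walk_frames(), MAX_FRAMES and the two-word frame
layout are hypothetical and are not taken from the patch series
or from the kernel's stack walker. The point is only that every
link is range-checked and must move strictly toward the stack
base before it is dereferenced, so a corrupt chain ends the walk
instead of being followed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAMES 64	/* hypothetical hard cap on walk depth */

/*
 * Walk a frame-pointer chain inside a simulated stack of words.
 * Assumed (hypothetical) layout for a frame at index fp:
 *   stack[fp]     = index of the caller's frame
 *   stack[fp + 1] = return address
 * Returns the number of return addresses stored in ret[].
 */
static size_t walk_frames(const uintptr_t *stack, size_t nwords,
			  size_t fp, uintptr_t *ret, size_t max_ret)
{
	size_t n = 0;

	/* The starting frame must fit entirely inside the stack. */
	if (nwords < 2 || fp > nwords - 2)
		return 0;

	while (n < max_ret && n < MAX_FRAMES) {
		uintptr_t next = stack[fp];

		ret[n++] = stack[fp + 1];

		/*
		 * Validate the link before following it: it must move
		 * strictly toward the stack base (so loops are
		 * impossible) and the next frame must still fit.
		 */
		if (next <= fp || next > nwords - 2)
			break;
		fp = (size_t)next;
	}
	return n;
}
```

With a chain 0 -> 2 -> 6 the walk yields three return addresses;
if a link is overwritten with a wild value or points backwards,
the walk stops at the last frame that was validated.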

Also let's fundamentally treat GCC-provided data as untrusted,
hostile data and let's put lockdep-like redundancy and resilience
around it.
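
As a rough sketch of what "untrusted" could mean at the lowest
level: every byte of compiler-generated unwind data goes through
a bounds-checked cursor, and decoders fail instead of guessing.
The names below (struct ucursor, ucursor_read_u8(),
ucursor_read_uleb128()) are hypothetical and not from the posted
patches. ULEB128 is used as the example because DWARF encodes
many of its integers that way.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cursor over an untrusted byte range. */
struct ucursor {
	const uint8_t *pos;	/* next byte to read */
	const uint8_t *end;	/* one past the last valid byte */
};

/* Read one byte; fail rather than read past the end. */
static bool ucursor_read_u8(struct ucursor *c, uint8_t *out)
{
	if (c->pos >= c->end)
		return false;
	*out = *c->pos++;
	return true;
}

/*
 * Decode an unsigned LEB128 value into 64 bits.
 * Fails on truncated input and on encodings that do not fit in
 * 64 bits. On failure the cursor position is unspecified and the
 * caller is expected to abandon the whole record.
 */
static bool ucursor_read_uleb128(struct ucursor *c, uint64_t *out)
{
	uint64_t result = 0;
	unsigned int shift = 0;
	uint8_t byte;

	do {
		if (!ucursor_read_u8(c, &byte))
			return false;	/* truncated */
		if (shift >= 64)
			return false;	/* too many bytes */
		if (shift == 63 && (byte & 0x7e))
			return false;	/* payload bits beyond bit 63 */
		result |= (uint64_t)(byte & 0x7f) << shift;
		shift += 7;
	} while (byte & 0x80);

	*out = result;
	return true;
}
```

Keeping all raw access behind two or three helpers like these
also gives a single place to audit and to aim randomized tests
at.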

As a first step let's try input randomization unit tests. A lot
of the broken unwind code was really just sloppy about boundary
conditions.
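
A minimal sketch of such a test, under loud assumptions: the
record format ([length byte][payload]) and all function names
here are made up for illustration and stand in for the real
unwind-table parser. The harness feeds random buffers from a
seeded PRNG (so failures are reproducible) into the parser and
checks one boundary invariant: the parser never claims to have
consumed more than it was given. Each buffer is allocated at its
exact size so that a memory checker can flag any overread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical parser under test. Input is a sequence of records,
 * each one length byte followed by that many payload bytes.
 * Returns true if the whole buffer is well formed; *consumed is
 * the number of bytes accepted before stopping.
 */
static bool parse_records(const uint8_t *buf, size_t len, size_t *consumed)
{
	size_t off = 0;

	while (off < len) {
		size_t rec = buf[off];

		/* Payload must fit in what is left after the length byte. */
		if (rec > len - off - 1) {
			*consumed = off;
			return false;
		}
		off += 1 + rec;
	}
	*consumed = off;
	return true;
}

/* xorshift32: small deterministic PRNG, state must be nonzero. */
static uint32_t prng_next(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/* Run 'rounds' random inputs; return the number of invariant violations. */
static unsigned int fuzz_parse_records(uint32_t seed, unsigned int rounds)
{
	unsigned int failures = 0;
	uint32_t state = seed ? seed : 1;

	for (unsigned int i = 0; i < rounds; i++) {
		size_t len = prng_next(&state) % 65;	/* 0..64 bytes */
		uint8_t *buf = malloc(len ? len : 1);
		size_t consumed = (size_t)-1;
		bool ok;

		if (!buf)
			break;
		for (size_t j = 0; j < len; j++)
			buf[j] = (uint8_t)prng_next(&state);

		ok = parse_records(buf, len, &consumed);
		if (consumed > len || (ok && consumed != len))
			failures++;
		free(buf);
	}
	return failures;
}
```

A call such as fuzz_parse_records(12345, 10000) is expected to
return 0; any other result, or a report from a memory checker,
points at a boundary bug.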

I had a quick peek and I don't think it's constructed in a
resilient enough form right now. For example there's no clear
separation and checking of what comes from GCC and what does not.

It *can* be done: lockdep is not hundreds but thousands of lines
of highly complex code (with non-trivial algorithms such as
graph walks), and still it has a very good track record - so
it's possible.

Once that is done I'd like to try it myself in practice, without
offering it as a pull to Linus. I see a *lot* of weird oopses
day in and day out, often in impossible contexts, and the old
dwarf unwinder was crap.

I'd also love to see perf callchains work on all kernels and
extend into user-space as well, if that's possible in a sane
fashion. 90% of the interesting apps out there are built with
frame pointers off, and the context of overhead is often rather
obscure. Looking at good callchains is a good learning
experience all around.

So it's not *entirely* crazy IMO, let's iterate on this, please.
Jiri, are you still interested in it?

Thanks,

Ingo