Re: [RFC 0/5] kernel: backtrace unwind support

From: Jiri Olsa
Date: Fri Feb 10 2012 - 15:19:10 EST


On Fri, Feb 10, 2012 at 08:44:26PM +0100, Ingo Molnar wrote:
>
> * Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
>
> > Em Fri, Feb 10, 2012 at 10:59:51AM -0800, Linus Torvalds escreveu:
> > > On Fri, Feb 10, 2012 at 9:43 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> > > >
> > > > So I CC'ed Linus, who has a strong opinion here, jejb since he's the one that
> > > > told me several times there are a number of literate dwarfs already in the
> > > > kernel, and Jan because I think it was him that tried last on x86.
> > >
> > > I never *ever* want to see this code ever again.
> > >
> > > Sorry, but last time was too f*cking painful. The whole (and *only*)
> > > point of unwinders is to make debugging easy when a bug occurs. But
> > > the f*cking dwarf unwinder had bugs itself, or our dwarf information
> > > had bugs, and in either case it actually turned several "trivial" bugs
> > > into a total undebuggable hell.
> > >
> > > It was made doubly painful by the developers involved then several
> > > times ignoring the problem, and claiming the code was bug-free when it
> > > clearly wasn't, or trying to claim that the problem was that we set up
> > > some random dwarf information wrong, when THAT GOES WITHOUT SAYING
> > > (since dwarf is a complex mess that never gets any actual testing
> > > except when things go wrong - at which point the code had better work
> > > regardless of whether the dwarf info was correct or not).
> > >
> > > So no. An unwinder that is several hundred lines long is simply not
> > > even *remotely* interesting to me.
> > >
> > > If you can mathematically prove that the unwinder is correct - even in
> > > the presence of bogus and actively incorrect unwinding information -
> > > and never ever follows a bad pointer, I'll reconsider.
> > >
> > > In the absence of that, just follow the damn chain on the stack
> > > *without* the "smarts" of an inevitably buggy piece of crap.
> >
> > "Vote for --fno-omit-frame-pointer! One register is a cheap
> > price to pay for not going insane!"
> >
> > /me goes back to non political things.
>
> Well, instead of dropping it we could try to meet Linus's
> challenge, at least to a fair degree.
>
> Also, let's fundamentally treat GCC-provided data as untrusted,
> hostile data and put lockdep-like redundancy and resilience
> around it.
>
> As a first step, let's try input randomization unit tests. A lot
> of the broken unwind code was really just sloppy about boundary
> conditions.

right, that looks like the crucial part.. :)

>
> I had a quick peek and I don't think it's constructed in a
> resilient enough form right now. For example, there's no clear
> separation and checking of what comes from GCC and what not.

yes, there's nothing like this in there now,
I'll see what can be done about that..

>
> It *can* be done: lockdep is not hundreds but thousands of lines
> of highly complex code (with non-trivial algorithms such as
> graph walks), and still it has a very good track record - so
> it's possible.
>
> Once that is done I'd like to try it myself in practice, without
> offering it as a pull to Linus. I see a *lot* of weird oopses
> all day in and out, often in impossible contexts, and the old
> dwarf unwinder was crap.
>
> I'd also love to see perf callchains work on all kernels and
> extend into user-space as well, if that's possible in a sane
> fashion. 90% of the interesting apps out there are built with
> frame pointers off, and the context of overhead is often rather
> obscure. Looking at good callchains is a good learning
> experience all around.
>
> So it's not *entirely* crazy IMO, let's iterate on this please.
> Jiri, are you still interested in it?

yep, looks interesting.. not sure about the mathematical proof though ;)

jirka
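Linus's fallback, "just follow the damn chain on the stack", together with his requirement that an unwinder "never ever follows a bad pointer", can be illustrated with a small sketch. This is a hypothetical user-space model, not the kernel's actual unwinder: `struct frame` and `walk_fp_chain()` are illustrative names, and the frame layout (saved frame pointer followed by the return address, stack growing down) is assumed from the usual x86-64 convention with frame pointers enabled. Every word read from the "stack" is treated as untrusted, in the spirit of Ingo's point about hostile input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame record, modelled on the common x86-64 layout in which
 * the saved frame pointer is immediately followed by the return address. */
struct frame {
	uintptr_t next_fp;	/* caller's saved frame pointer */
	uintptr_t ret_addr;	/* return address into the caller */
};

/*
 * Walk a frame-pointer chain, treating every value read from the stack as
 * untrusted. [stack_lo, stack_hi) bounds the stack. At most max return
 * addresses are stored in out[]; the number stored is returned.
 */
size_t walk_fp_chain(uintptr_t fp, uintptr_t stack_lo, uintptr_t stack_hi,
		     uintptr_t *out, size_t max)
{
	size_t n = 0;

	while (n < max) {
		/* The whole frame record must lie inside [stack_lo, stack_hi). */
		if (fp < stack_lo || stack_hi < sizeof(struct frame) ||
		    fp > stack_hi - sizeof(struct frame))
			break;
		/* Frame records must be word aligned. */
		if (fp % sizeof(uintptr_t) != 0)
			break;

		const struct frame *f = (const struct frame *)fp;
		out[n++] = f->ret_addr;

		/* The stack grows down, so the caller's frame must sit at a
		 * strictly higher address. This ends the walk at a zero link
		 * and rules out cycles. */
		if (f->next_fp <= fp)
			break;
		fp = f->next_fp;
	}
	return n;
}
```

Because each step is checked for bounds, alignment and strictly increasing addresses, and the depth is capped, a corrupted or randomized stack can only end the walk early; it cannot make it loop or dereference memory outside the given range. That is the kind of property the randomized unit tests Ingo suggests would exercise. In a real kernel the bounds would have to come from the task and interrupt stacks, which this sketch simply assumes are known.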