RE: [PATCH] [LBR] Dump LBRs on Oops

From: Berthier, Emmanuel
Date: Wed Nov 26 2014 - 10:43:34 EST


> -----Original Message-----
> From: Thomas Gleixner [mailto:tglx@xxxxxxxxxxxxx]
> Sent: Wednesday, November 26, 2014 3:47 PM
> To: Berthier, Emmanuel
> Cc: mingo@xxxxxxxxxx; hpa@xxxxxxxxx; x86@xxxxxxxxxx; Jarzmik, Robert;
> linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: [PATCH] [LBR] Dump LBRs on Oops
>
> On Wed, 26 Nov 2014, Berthier, Emmanuel wrote:
> > > We add printks not for people who work on the support of unreleased
> > > hardware. They should better know what they are doing. If they can't
> > > figure that out they should not touch the kernel in the first place.
> >
> > LoL
> > I'm part of those people, I've touched the kernel and I've figured out what
> was wrong.
> > And I would like to be helped next year for the next Core: I'm an old
> > man and I need to leave a white stone trail ;-) Could we agree on
> > that one?
>
> We already have a printk in init_intel_pmu() where we tell about the
> 'unidentified cpu', so we better extend that instead of having something
> dependent on a OOPS.

That's right, I have a patch for that also.

> > > > > Aside of that if we want to debug with the LBR then we better
> > > > > freeze that whole thing across a dump and be done with it.
> > > >
> > > > I met that case but did no dig deeply into it...
> > >
> > > Hmm, a corrupted stack might trigger this together with some of the
> > > other debug options enabled. So we really might to put it in front.
> >
> > Didn't catch you. Could you elaborate on that?
>
> Assume a stack corruption, so the stack dumper follows it w/o noticing and
> hits an unmapped page. So that would be an argument to move the LBR print
> out ahead of the stack dump.

Ok, so no need to change anything here?

> > > 1) We want to enable/disable this at boot time.
> > >
> > > In the disabled case we might also stub out the test/jz and replace
> > > it by an unconditional jump, but that needs more thought.
> >
> > I can add a cmdline option to disable it at boot time.
>
> Enable. Should be disabled by default I think.

ok

> > Do you propose to use code instruction patching (same as ftrace) also?
> > Is-it really worth to bypass test/jz as page fault handling is much
> > more than few instructions?
>
> That's why I said: but that needs more thought.
>
> Though OTOH we keep adding stuff there and if we want to enable that LBR
> feature more widely we should think about keeping the overhead low if it is
> disabled.
>
> We can discuss this after we have a agreed on patch for that feature.

Ok

> > > 2) Right now you stop the trace on every exception no matter whether
> > > it comes from user or kernel space.
> > >
> > > Stopping the trace when we handle a user space fault does not make
> > > any sense and inflicts just pointless overhead.
> > >
> > > Aside of that if the fault handler then crashes we do not have the
> > > LBR information because we froze it when entering from user space
> > > in the first place.
> >
> > Agree, but the LBR buffer contains only 8 records: we have to stop it
> > as soon as possible. If we add some test/jump/call before stopping
> > it, relevant info will be flushed out.
>
> Well, you can certainly test that w/o a jump. Hint:
>
> if (enabled && is_kernel)
> goto x;
>
> can be written in ASM with a single branch as well :)
>
> That adds more instructions before the jz, which might in fact make the code
> patching for the disabled case more interesting.

I will keep it small, trust me! ;-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/