Re: [RFC PATCH] x86 NMI-safe INT3 and Page Fault (v3)

From: Mathieu Desnoyers
Date: Sat Apr 19 2008 - 17:18:49 EST


* Andi Kleen (andi@xxxxxxxxxxxxxx) wrote:
>
> > arch/x86/oprofile/nmi_timer_int.c: profile_timer_exceptions_notify()
> > calls
> > drivers/oprofile/oprofile_add_sample()
> > which calls oprofile_add_ext_sample()
> > where
> > if (log_sample(cpu_buf, pc, is_kernel, event))
> > oprofile_ops.backtrace(regs, backtrace_depth);
>
> A red herring: the notifier setup calls vmalloc_sync_all() and oprofile
> allocates its buffers before registering the notifier.
>

Ah, yes, you are right on this one, it was well hidden. :)

> > First, log_sample writes into the vmalloc'd cpu buffer. That's for one
> > possible page fault.
>
>
> > Then, if a kernel backtrace happens, I am not sure whether printk_address
> > won't try to read any of the module data, which is vmalloc'd.
>
> Yes, admittedly the backtrace mode was always somewhat flakey. It probably
> has more problems too.
>
> The right fix for that is to call vmalloc_sync_all() after module load
> when any nmi notifiers are registered.

I guess it would work, but it certainly looks like a patchy workaround.

>
> >
> >
> >> NMI are maybe 5-6 functions all over the kernel.
> >>
> >> I just don't think it makes any sense to put markers in there.
> >> It is a really small part of the kernel that is unlikely
> >> to be really useful for anybody. You should rather first solve the
> >> problem of tracing the other 99.999999% of the kernel properly.
> >>
> >
> > The fact is that NMIs are very useful and powerful when it comes to trying
> > to understand where code disabling interrupts is stuck, to get
> > performance counter reads periodically
>
> First there are no truly periodic (as in time) NMIs. The NMI watchdog
> is not really periodic but is delayed arbitrarily all the time when the CPU
> is in sleep states.
>

There is no such thing as "perfect" periodicity. There is just better and
worse periodicity. NMIs just tend to have much less jitter.

> Then oprofile already does what you describe. Why do we need
> another questionable infrastructure to reimplement what is
> already there?
>

I don't like to duplicate work. I would just like to dump performance
counters in LTTng trace buffers at specific points. I guess building on
top of oprofile would be a good way to do it.

> > without suffering from IRQ
> > latency
>
> Just from all kinds of other latency caused by non ticking performance
> counters.
>
> > . Also, when trying to figure out what is actually happening in
> > the kernel timekeeping, having a stable periodic time source can be
> > pretty useful.
>
> Haha. You seem to be so deep into nonsense land, it is hard to comprehend.
>
> > That would be one way to do it, except that it would not deal with int3.
> > Also, it would have to be taken into account at module load time. To me,
> > that looks like an error-prone design. If the problem is at the lower
> > end of the architecture, in the interrupt return path, why don't we
> > simply fix it there for good ?
>
> There are all kinds of problems with NMIs, this is only one of them.
> And NMIs are a really really obscure case
>

Which other problems? I am listening.

> Frankly, if you spend all your time on fringe cases like this instead
> of getting it to work on the 99.99999999999999% case it doesn't
> surprise me that the markers don't make any progress for years now.
>

One thing is that I really don't want to add fragility to a traced
kernel. A tracer that would make the kernel more fragile is the last
thing I want. Therefore, I make sure the tracer provides good reentrancy
so it can be called from virtually any kernel context and I also make
sure it stays in its own sandbox as much as possible, using atomic
operations to update its data structures/buffers.
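
As a minimal sketch of the atomic-update idea (not the actual LTTng code:
trace_buf, trace_reserve and trace_write are hypothetical names, the buffer
does not wrap, and C11 atomics stand in for the kernel's own primitives), a
slot in the buffer can be reserved with a compare-and-swap loop instead of a
lock:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096u	/* hypothetical buffer size */

struct trace_buf {
	_Atomic size_t write_offset;	/* next free byte, only ever grows */
	unsigned char data[BUF_SIZE];
};

/*
 * Reserve len bytes. Returns the start offset of the reserved slot, or
 * (size_t)-1 if the buffer cannot hold len more bytes. No lock is taken:
 * a writer that loses the race reloads the offset and retries.
 */
static size_t trace_reserve(struct trace_buf *buf, size_t len)
{
	size_t old = atomic_load_explicit(&buf->write_offset,
					  memory_order_relaxed);
	size_t new;

	do {
		if (len > BUF_SIZE - old)
			return (size_t)-1;
		new = old + len;
		/* On failure, old is refreshed with the current offset. */
	} while (!atomic_compare_exchange_weak_explicit(&buf->write_offset,
			&old, new,
			memory_order_relaxed, memory_order_relaxed));

	return old;
}

/* Copy one event into its own reserved slot. Returns 0, or -1 if full. */
static int trace_write(struct trace_buf *buf, const void *ev, size_t len)
{
	size_t off = trace_reserve(buf, len);

	if (off == (size_t)-1)
		return -1;
	memcpy(&buf->data[off], ev, len);
	return 0;
}
```

If an NMI interrupts a writer between the reservation and the copy, the
nested writer simply reserves the next slot; the two never write to the
same bytes, and neither can deadlock on the other.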

> > And yes, boot code is one of the first things embedded system
> > developers want to instrument.
>
> Crap. That code runs once. The only interest is correctness and
> if it's not correct you just step it through with a JTAG debugger.
>

Looking at these links tells us that some embedded developers are
interested in speeding up Linux boot time, and that's a task a kernel
tracer is very good at.

http://www.linuxdevices.com/news/NS5907201615.html
http://elinux.org/Boot_Time

> > I wonder if they are used so rarely because the underlying kernel is
> > buggy with respect to NMIs or because they are useless.
>
> lockless programming is just really hard and not doing it is in most
> cases the sanest option.
>

I think kernel tracing would be an exception; that a kernel tracer
should be designed not to use any sort of lock, to have as little
dependency on the rest of the kernel as possible and to update its own
data structures atomically. That ensures the tracer can be called from
virtually anywhere without having to worry about side-effects. Given
that LTTng users have been happy with it for the past 2.5 years, I tend
to think I was right.

Mathieu

> Anyways I give up. Do what you want.
>
> -Andi
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68