Re: [Perfctr-devel] 2.6.16-rc5 perfmon2 new code base + libpfm with Montecito support

From: Stephane Eranian
Date: Mon Mar 13 2006 - 19:00:40 EST


John,

On Mon, Mar 13, 2006 at 11:20:57PM +0000, John Levon wrote:
> On Mon, Mar 13, 2006 at 01:03:54PM -0800, Stephane Eranian wrote:
>
> > > 1) the event names are synchronised so we don't need massive duplication
> >
> > The kernel perfmon2 interface does not know anything about event names.
>
> Well, it sounded like Will was proposing extra event directories.

I think Will was trying to solve the register naming differences.
I do not know how you deal with this in ophelp.

> > > 2) that "start a thread on each CPU" API is fixed to be sane
> >
> > Please develop on this point some more.
>
> The kernel interface should just let me say "I want this setup on all
> CPUs" and do the IPIs for me.
>
That is because you are assuming a model where you always want to monitor
all CPUs each time and measure the same thing everywhere. This does not
always make sense in large configurations. You may want to monitor only
a subset of CPUs, or you may want to monitor different things on different
CPUs at the same time, for instance because they handle different workloads
or interrupts.
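
To make that concrete, here is a minimal sketch of a setup where only some
CPUs are monitored and each one counts something different. The event names
and the plan_sessions() helper are invented for illustration; they are not
perfmon2 or libpfm identifiers.

```python
# Hypothetical per-CPU monitoring plan: CPU number -> events to count.
# The event names are placeholders, not real PMU event names.
per_cpu_events = {
    0: ["EVENT_A", "EVENT_B"],   # e.g. a CPU that handles interrupts
    2: ["EVENT_C"],              # e.g. a CPU running a different workload
}

def plan_sessions(per_cpu_events, online_cpus):
    """Return (cpu, events) pairs for CPUs that are both requested and online."""
    return [(cpu, list(events))
            for cpu, events in sorted(per_cpu_events.items())
            if cpu in online_cpus]

# CPUs 1 and 3 are online but deliberately left unmonitored.
sessions = plan_sessions(per_cpu_events, online_cpus={0, 1, 2, 3})
for cpu, events in sessions:
    print(f"CPU{cpu}: {', '.join(events)}")
```

An "all CPUs, same events" interface cannot express a plan like this one
without additional parameters.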

When it comes to sampling, I think you will agree with me that the kernel-level
sampling buffer must be per-cpu. I think this is also how you manage
it in OProfile. I think it also makes sense to process the buffer locally,
for cache affinity reasons for instance: exploiting locality keeps monitoring
overhead to a minimum. I think (correct me if I am wrong) that in OProfile you
somehow merge the per-cpu buffers into a single buffer, which is then read via
read() by user-level applications. For some measurements, merging is not
necessarily what is needed.

In your model, I would have to pass a bitmap of CPUs to monitor, then internally
the kernel would have to propagate the setup via IPI and maintain a context per CPU.
Then upon return, it would have to pass information as to how to mmap the per-cpu
buffers. You have a choice of doing one mmap() per buffer or doing
a single large mmap() covering the possibly discontiguous physical pages
backing each per-cpu buffer. In either case, it would make sense to ensure
that the thread processing each buffer runs on the CPU where the samples
have been collected, to minimize cache traffic, which is very important on NUMA
machines. Typically on those machines, every effort is made to keep all memory
accesses local; I do not see why this would not also apply to profiling.
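
The idea of processing each buffer on the CPU that filled it can be sketched
in user space roughly as follows. This is only an illustration on Linux: the
buffers are ordinary in-memory lists standing in for mmap()ed per-cpu kernel
buffers, and process_buffer_on() is a hypothetical helper, not part of any
perfmon2 or OProfile interface.

```python
import os
import threading

def process_buffer_on(cpu, samples, results):
    """Pin the calling thread to `cpu`, then process that CPU's samples."""
    # On Linux, pid 0 means "the calling thread" for sched_setaffinity().
    os.sched_setaffinity(0, {cpu})
    results[cpu] = len(samples)   # stand-in for real sample processing

# Only use CPUs this process is actually allowed to run on.
allowed = sorted(os.sched_getaffinity(0))

# Stand-in per-cpu buffers; a real tool would mmap() kernel buffers instead.
buffers = {cpu: [("ip", i) for i in range(4)] for cpu in allowed}

results = {}
threads = [threading.Thread(target=process_buffer_on,
                            args=(cpu, buffers[cpu], results))
           for cpu in allowed]
for t in threads:
    t.start()
for t in threads:
    t.join()

for cpu in allowed:
    print(f"CPU{cpu}: processed {results[cpu]} samples locally")
```

Each worker touches only the buffer that belongs to the CPU it is pinned to,
which is the locality property argued for above.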

Note that in the new perfmon code base, you do not have to create one thread
per monitored CPU. All you need to do is to ensure that the thread runs on
the CPU it needs to access when issuing perfmon2 calls causing actual PMU
HW accesses. A single thread can very well control many contexts bound to
different threads or CPUs.
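
A single controlling thread could follow this pattern by moving itself to
each CPU in turn before touching that CPU's context. The sketch below assumes
Linux and uses a placeholder program_context() where the real perfmon2 calls
would go; both helper names are made up for the example.

```python
import os

def on_cpu(cpu, action):
    """Run `action(cpu)` with the calling thread temporarily bound to `cpu`."""
    saved = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu})
    try:
        return action(cpu)
    finally:
        os.sched_setaffinity(0, saved)   # restore the original CPU mask

def program_context(cpu):
    # Placeholder for the perfmon2 calls that access the PMU of `cpu`.
    return f"context programmed on CPU{cpu}"

cpus = sorted(os.sched_getaffinity(0))
log = [on_cpu(cpu, program_context) for cpu in cpus]
for line in log:
    print(line)
```

No per-CPU helper thread is created; the one thread simply changes its own
affinity for the duration of each PMU access.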

But again, I am always open to discussions/proposals on this.

--
-Stephane