Re: bts & perf_counters

From: Ingo Molnar
Date: Tue Jun 30 2009 - 15:32:47 EST

Next message: Yinghai Lu: "Re: [BUG 2.6.31-rc1] HIGHMEM64G causes hang in PCI init on 32-bitx86"
Previous message: Roland McGrath: "Re: [rfc] do not place sub-threads on task_struct->children list"
In reply to: Metzger, Markus T: "bts & perf_counters"

* Metzger, Markus T <markus.t.metzger@xxxxxxxxx> wrote:

> > How does 'interval' get mixed with BTS?
>
> We could view BTS as event-based sampling with interval=1. The
> sample we collect is the <from, to> address pair of an executed
> branch and the sampling interval is 1, i.e. we store a sample for
> every branch. Wouldn't this be how BTS integrates into
> perf_counters?

Yeah, this is how i view it too.
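
To make the "interval=1" view concrete, here is a minimal user-space sketch of an event-based branch sampler. It is only an illustration: `branch_sampler`, `sampler_init()` and `sampler_on_branch()` are hypothetical names, not kernel or perf_counter interfaces. With interval 1 every branch is stored, which is the BTS behaviour described above; a larger interval gives ordinary sampling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One sample: the <from, to> address pair of an executed branch. */
struct branch_sample {
    uint64_t from;
    uint64_t to;
};

/* Hypothetical sampler state, for illustration only. */
struct branch_sampler {
    uint64_t interval;          /* store every interval-th branch */
    uint64_t countdown;         /* branches left until the next sample */
    struct branch_sample *out;  /* caller-provided sample buffer */
    size_t capacity;            /* number of slots in out */
    size_t count;               /* samples stored so far */
};

static void sampler_init(struct branch_sampler *s, uint64_t interval,
                         struct branch_sample *out, size_t capacity)
{
    assert(interval >= 1);
    s->interval = interval;
    s->countdown = interval;
    s->out = out;
    s->capacity = capacity;
    s->count = 0;
}

/* Called once per executed branch; stores a sample when the interval expires. */
static void sampler_on_branch(struct branch_sampler *s,
                              uint64_t from, uint64_t to)
{
    if (--s->countdown != 0)
        return;
    s->countdown = s->interval;
    if (s->count < s->capacity) {
        s->out[s->count].from = from;
        s->out[s->count].to = to;
        s->count++;
    }
}
```

Calling `sampler_init()` with an interval of 1 makes `sampler_on_branch()` store every branch it sees, until the buffer is full.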

> One of the big advantages that comes with using the perf_counter
> framework is that you could mix branch tracing with other forms of
> profiling and sampling.

Correct.

> >> Would it be possible for a user to profile the same task twice?
> >> He could then use different buffers for different sampling
> >> intervals.
> >
> > It's possible to open multiple counters to the same task, yes.
>
> That's good. And users could mmap every counter they open in order
> to get multiple perf event streams?

Yes.

> OK. The existing implementation reconfigured DS area to have the
> h/w already collect the trace into the correct buffer. The only
> copying that is ever needed is to copy it into user-space while
> translating the arch-specific format into an arch-independent
> format.
>
> This is obviously only possible for a single user. Copying the
> data is definitely more flexible if we expect multiple users of
> that data with different-sized buffers.

Yeah. [ That decoupling is nice as it also allows multiplexing -
nothing prevents two independent monitor tasks from sampling the
same task (beyond the inevitable runtime overhead that is inherent
in BTS anyway). ]
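
The copy-based decoupling can be sketched roughly as follows. This is a simplified user-space model, not the kernel code: `bts_record`, `consumer_buf` and `drain_to_consumers()` are hypothetical names, and the policy of dropping records once a consumer buffer is full is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arch-independent record, assumed for illustration. */
struct bts_record {
    uint64_t from;
    uint64_t to;
};

/* One monitoring user with its own, independently sized buffer. */
struct consumer_buf {
    struct bts_record *slots;
    size_t capacity;
    size_t count;   /* records stored */
    size_t lost;    /* records dropped because the buffer was full */
};

/*
 * Copy every record collected by the hardware into each consumer's
 * buffer. Because the data is copied rather than collected in place,
 * any number of consumers with different buffer sizes can be served.
 */
static void drain_to_consumers(const struct bts_record *hw, size_t n_hw,
                               struct consumer_buf *consumers,
                               size_t n_consumers)
{
    assert(hw != NULL || n_hw == 0);
    for (size_t c = 0; c < n_consumers; c++) {
        struct consumer_buf *cb = &consumers[c];
        for (size_t i = 0; i < n_hw; i++) {
            if (cb->count < cb->capacity)
                cb->slots[cb->count++] = hw[i];
            else
                cb->lost++;
        }
    }
}
```

A small consumer simply loses the records that do not fit, while a larger consumer attached to the same task still sees all of them.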

> > If a task schedules out then it will have its DS area drained
> > already to the mmap buffer - i.e. it's all properly
> > synchronized.
>
> When is that draining done? Somewhere in schedule()? Wouldn't that
> be quite expensive for a few pages of BTS buffer?

Well, it is an open question how frequently we want to move
information from the DS area into the mmap pages.

The most direct approach would be to 'flush' the DS from two places:
the threshold IRQ handler plus from the context switch code if the
BTS counter gets deactivated. In the latter case BTS activities have
to stop anyway, so the DS can be flushed to the mmap pages.
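
A minimal sketch of this "flush from two places" scheme, as a user-space model: `ds_area`, `out_buf`, `ds_flush()`, `on_branch()` and `on_sched_out()` are hypothetical names, the capacity and threshold values are arbitrary, and the threshold check inside `on_branch()` merely stands in for the threshold IRQ handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DS_CAPACITY  8   /* slots in the modelled DS area (arbitrary) */
#define DS_THRESHOLD 6   /* fill level that triggers the "IRQ" (arbitrary) */

struct bts_record {
    uint64_t from;
    uint64_t to;
};

/* Model of the DS area the hardware writes into. */
struct ds_area {
    struct bts_record recs[DS_CAPACITY];
    size_t index;            /* next free slot */
};

/* Model of the mmap pages; a plain append-only buffer for brevity. */
struct out_buf {
    struct bts_record *slots;
    size_t capacity;
    size_t count;
};

/* Single flush routine shared by both call sites. */
static size_t ds_flush(struct ds_area *ds, struct out_buf *out)
{
    size_t copied = 0;

    for (size_t i = 0; i < ds->index; i++) {
        if (out->count < out->capacity) {
            out->slots[out->count++] = ds->recs[i];
            copied++;
        }
    }
    ds->index = 0;           /* DS area is empty again */
    return copied;
}

/* Call site 1: reaching the threshold stands in for the threshold IRQ. */
static void on_branch(struct ds_area *ds, struct out_buf *out,
                      uint64_t from, uint64_t to)
{
    assert(ds->index < DS_CAPACITY);
    ds->recs[ds->index].from = from;
    ds->recs[ds->index].to = to;
    ds->index++;
    if (ds->index >= DS_THRESHOLD)
        ds_flush(ds, out);
}

/* Call site 2: the counter is deactivated at context switch. */
static void on_sched_out(struct ds_area *ds, struct out_buf *out)
{
    ds_flush(ds, out);
}
```

Both paths go through the same `ds_flush()`, so the output buffer always holds the records in execution order and nothing is left behind in the DS area once the task is scheduled out.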

Or is your mental model for getting the BTS records from the DS to
the mmap pages significantly different?

I think we should shoot for the simplest approach initially - we can
do other, more sophisticated streaming modes later as well - they
will not differ in functionality, only in performance.

> Hmmm, I'll see what I can do. Please don't expect a minimally
> working prototype to be bug-free from the beginning.

Sure, i don't.

> I see identifying the beginning of the stream as well as random
> accesses into the stream as bigger open points.
>
> Maybe we could add a mode where records are zero-extended to a
> fixed size. This would leave the choice to the user: compact
> format or random access.

I agree that streaming is a problem because the debugger does not
really want to poll() - such an output mode and an 'ignore data_tail
and overwrite old entries' ring-buffer modus operandi should be
added.

The latter would be useful for tracepoints too for example, so such
a 'flight recorder' or 'history buffer' mode is not limited to BTS.

So feel free to add something that meets your need for
constant-size records - and we'll make sure it fits well into the
rest of perfcounters.
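
As an illustration of why constant-size records give random access, here is a small sketch. The record size of 32 bytes and the helper names `put_fixed_record()` and `get_fixed_record()` are assumptions made for the example and are not part of any existing perfcounters format. Each payload is zero-extended to the fixed size, so record i always starts at byte offset i * FIXED_RECORD_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIXED_RECORD_SIZE 32   /* assumed slot size, in bytes */

/* Store a payload in slot 'index', zero-extending it to the fixed size. */
static void put_fixed_record(uint8_t *stream, size_t index,
                             const void *payload, size_t len)
{
    uint8_t *slot = stream + index * FIXED_RECORD_SIZE;

    assert(len <= FIXED_RECORD_SIZE);
    memset(slot, 0, FIXED_RECORD_SIZE);
    memcpy(slot, payload, len);
}

/* Random access: the offset of a record is a simple multiplication. */
static const uint8_t *get_fixed_record(const uint8_t *stream, size_t index)
{
    return stream + index * FIXED_RECORD_SIZE;
}
```

The cost is the padding: a 16-byte <from, to> pair occupies 32 bytes here, which is the compactness versus random access trade-off mentioned above.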

So based on your suggestion we'd have two streaming models:

 - 'no information loss' output model where user-space poll()s and
   tries hard not to lose events (this is what profilers and
   reliable tracers do)

 - 'history ring-buffer' model - this is useful for debuggers and is
   useful for certain modes of tracing as well. (crash-tracing for
   example)
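
The two models can be contrasted with a small ring-buffer sketch. This is a user-space illustration only: `ring`, `ring_write()`, `ring_read()` and `ring_history()` are hypothetical names, and dropping (and counting) new records when the buffer is full is just one assumed way of handling overflow in the 'no information loss' model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ring_mode {
    RING_NO_LOSS,    /* respect the reader's tail, never overwrite */
    RING_OVERWRITE   /* ignore the tail, keep only the newest records */
};

struct ring {
    uint64_t *slots;
    size_t capacity;
    uint64_t head;   /* total records written */
    uint64_t tail;   /* total records consumed (unused when overwriting) */
    uint64_t lost;   /* records refused in no-loss mode */
    enum ring_mode mode;
};

/* Returns 1 if the record was stored, 0 if it had to be dropped. */
static int ring_write(struct ring *r, uint64_t value)
{
    assert(r->capacity > 0);
    if (r->mode == RING_NO_LOSS && r->head - r->tail == r->capacity) {
        r->lost++;   /* reader fell behind; unread data stays intact */
        return 0;
    }
    r->slots[r->head % r->capacity] = value;
    r->head++;
    return 1;
}

/* No-loss reader: consume the oldest unread record, if any. */
static int ring_read(struct ring *r, uint64_t *value)
{
    if (r->tail == r->head)
        return 0;
    *value = r->slots[r->tail % r->capacity];
    r->tail++;
    return 1;
}

/* History reader: copy up to 'max' of the newest records, oldest first. */
static size_t ring_history(const struct ring *r, uint64_t *out, size_t max)
{
    size_t avail = r->head < r->capacity ? (size_t)r->head : r->capacity;
    uint64_t start;

    if (avail > max)
        avail = max;
    start = r->head - avail;
    for (size_t i = 0; i < avail; i++)
        out[i] = r->slots[(start + i) % r->capacity];
    return avail;
}
```

In the no-loss mode the reader has to keep up by calling `ring_read()` regularly; in the overwrite mode nobody needs to poll, and a debugger can call `ring_history()` only when it actually wants the most recent records.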

Ingo