Re: I.5 - Mmaped count

From: stephane eranian
Date: Mon Jun 22 2009 - 08:54:33 EST

Next message: David Woodhouse: "Re:"
Previous message: Andre Noll: "Re: [PATCH] MD: md, fix lock imbalance"
In reply to: Peter Zijlstra: "Re: I.5 - Mmaped count"
Next in thread: Peter Zijlstra: "Re: I.5 - Mmaped count"

On Mon, Jun 22, 2009 at 2:35 PM, Peter Zijlstra<a.p.zijlstra@xxxxxxxxx> wrote:
> On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
>> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<mingo@xxxxxxx> wrote:
>> >> 5/ Mmaped count
>> >>
>> >> It is possible to read counts directly from user space for
>> >> self-monitoring threads. This leverages a HW capability present on
>> >> some processors. On X86, this is possible via RDPMC.
>> >>
>> >> The full 64-bit count is constructed by combining the hardware
>> >> value extracted with an assembly instruction and a base value made
>> >> available through the mmap. There is an atomic generation count
>> >> available to deal with the race condition.
>> >>
>> >> I believe there is a problem with this approach given that the PMU
>> >> is shared and that events can be multiplexed. That means that even
>> >> though you are self-monitoring, events get replaced on the PMU.
>> >> The assembly instruction is unaware of that, it reads a register
>> >> not an event.
>> >>
>> >> On x86, assume event A is hosted in counter 0, thus you need
>> >> RDPMC(0) to extract the count. But then, the event is replaced by
>> >> another one which reuses counter 0. At the user level, you will
>> >> still use RDPMC(0) but it will read the HW value from a different
>> >> event and combine it with a base count from another one.
>> >>
>> >> To avoid this, you need to pin the event so it stays in the PMU at
>> >> all times. Now, here is something unclear to me. Pinning does not
>> >> mean stay in the SAME register, it means the event stays on the
>> >> PMU but it can possibly change register. To prevent that, I
>> >> believe you need to also set exclusive so that no other group can
>> >> be scheduled, and thus possibly use the same counter.
>> >>
>> >> Looks like this is the only way you can make this actually work.
>> >> Not setting pinned+exclusive is another pitfall that many
>> >> people will fall into.
>> >
>> >   do {
>> >     seq = pc->lock;
>> >
>> >     barrier();
>> >     if (pc->index) {
>> >       count = pmc_read(pc->index - 1);
>> >       count += pc->offset;
>> >     } else
>> >       goto regular_read;
>> >
>> >     barrier();
>> >   } while (pc->lock != seq);
>> >
>> > We don't see the hole you are referring to. The sequence lock
>> > ensures you get a consistent view.
>> >
>> Let's take an example, with two groups, one event in each group.
>> Both events are scheduled on counter0, i.e., rdpmc(0). The 2 groups
>> are multiplexed, one each tick. The user gets 2 file descriptors
>> and thus two mmap'ed pages.
>>
>> Suppose the user wants to read, using the above loop, the value of the
>> event in the first group BUT it's the 2nd group that is currently active
>> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
>>
>> Unless you tell me that pc->index is marked invalid (0) when the
>> event is not scheduled, I don't see how you can avoid reading
>> the wrong value. I am assuming that if the event is not scheduled,
>> lock remains constant.
>
> Indeed, pc->index == 0 means it's not currently available.

I don't see where you clear that field on x86.
Looks like it comes from hwc->idx. I suspect you need
to do something in x86_pmu_disable() to be symmetrical
with x86_pmu_enable().

I suspect something similar needs to be done on Power.
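
To make the point concrete, here is a minimal user-space sketch of the
symmetry I mean. It is not the kernel code: struct page_sketch,
publish(), sketch_enable(), sketch_disable() and fake_pmc_read() are
hypothetical names made up for illustration, and fake_pmc_read() merely
stands in for RDPMC so the example runs anywhere. The only assumption
taken from this thread is that index holds the counter number plus one
and that 0 means the event is not on the PMU.

```c
#include <assert.h>
#include <stdint.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical user-visible page: only the fields discussed in this thread. */
struct page_sketch {
	volatile uint32_t lock;		/* sequence count */
	volatile uint32_t index;	/* counter number + 1; 0 = not on the PMU */
	volatile int64_t offset;	/* base value to add to the hardware count */
};

/* Writer side: publish a new (index, offset) pair under the sequence count. */
static void publish(struct page_sketch *pg, uint32_t index, int64_t offset)
{
	++pg->lock;
	barrier();
	pg->index = index;
	pg->offset = offset;
	barrier();
	++pg->lock;
}

/* What an enable path would publish: counter idx now hosts the event. */
static void sketch_enable(struct page_sketch *pg, uint32_t idx, int64_t offset)
{
	publish(pg, idx + 1, offset);
}

/* What a symmetrical disable path would publish: no counter hosts the event. */
static void sketch_disable(struct page_sketch *pg, int64_t total)
{
	publish(pg, 0, total);
}

/* Stand-in for RDPMC so the sketch runs anywhere; counter 0 reads 42. */
static uint64_t fake_pmc_read(uint32_t counter)
{
	return counter == 0 ? 42 : 7;
}

/* Reader: returns 1 with *out set, or 0 if the caller must use read(2). */
static int try_user_read(const struct page_sketch *pc, uint64_t *out)
{
	uint32_t seq, idx;
	uint64_t count;

	assert(pc && out);
	do {
		seq = pc->lock;
		barrier();
		idx = pc->index;
		if (idx == 0)
			return 0;	/* the counter may host another event */
		count = fake_pmc_read(idx - 1) + (uint64_t)pc->offset;
		barrier();
	} while (pc->lock != seq);

	*out = count;
	return 1;
}
```

Without the sketch_disable() step, index would keep pointing at
counter 0 after the event is switched out, and the reader would combine
another event's hardware value with this event's offset.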

>
>> Assume the event is active when you enter the loop and you
>> read a value. How do you get the timing information to scale the
>> count?
>
> I think we would have to add that to the data page, something like the
> below?
>
Yes.

> ---
> Index: linux-2.6/include/linux/perf_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
>         __u32   lock;                   /* seqlock for synchronization */
>         __u32   index;                  /* hardware counter identifier */
>         __s64   offset;                 /* add to hardware counter value */
> +       __u64   total_time;             /* total time counter active */
> +       __u64   running_time;           /* time counter on cpu */
> +
> +       __u64   __reserved[123];        /* align at 1k */
>
>         /*
>          * Control data for the mmap() data buffer.
> Index: linux-2.6/kernel/perf_counter.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_counter.c
> +++ linux-2.6/kernel/perf_counter.c
> @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
>         if (counter->state == PERF_COUNTER_STATE_ACTIVE)
>                 userpg->offset -= atomic64_read(&counter->hw.prev_count);
>
> +       userpg->total_time = counter->total_time_enabled +
> +                       atomic64_read(&counter->child_total_time_enabled);
> +
> +       userpg->running_time = counter->total_time_running +
> +                       atomic64_read(&counter->child_total_time_running);
> +
>         barrier();
>         ++userpg->lock;
>         preempt_enable();
>
>
>
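
With total_time and running_time exported as in the patch above, user
space could estimate the full count of a multiplexed event roughly as
follows. This is only a sketch: scale_count() is a hypothetical helper,
not an existing API, and it assumes both times are in the same unit and
were sampled in the same sequence-count window as the count. It goes
through double to avoid 64-bit overflow, so the result is an estimate,
which scaling is anyway.

```c
#include <stdint.h>

/*
 * Hypothetical helper: extrapolate a count collected while the event
 * was on the PMU for running_time out of total_time.
 */
static uint64_t scale_count(uint64_t count, uint64_t total_time,
			    uint64_t running_time)
{
	if (running_time == 0)
		return 0;		/* event never ran: nothing to scale */
	if (running_time >= total_time)
		return count;		/* never multiplexed out: count is exact */

	/* count * total / running, rounded to the nearest integer */
	return (uint64_t)((double)count * (double)total_time /
			  (double)running_time + 0.5);
}
```

For example, a count of 1000 collected while the event was on the PMU
for half of the time scales to 2000.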
