Re: [RFC PATCH 2/3 v3] perf: Implement Nehalem uncore pmu

From: Stephane Eranian
Date: Sat Dec 11 2010 - 00:49:36 EST


Hi,

Ok, so I have an explanation for what we are seeing. In fact, what
bothered me in all
of this is that I did not recall ever running into this problem of
double-interrupt with HT
when I implemented the perfmon support for uncore sampling. The reason
is in fact
real simple.

You are getting interrupts on both threads because you have enabled uncore PMU
on all CPUs, in uncore_cpu_starting():
+ rdmsrl(MSR_IA32_DEBUGCTLMSR, val);
+ wrmsrl(MSR_IA32_DEBUGCTLMSR, val | DEBUGCTLMSR_ENABLE_UNCORE_PMI);

You need to do that only on ONE of the two siblings. In fact, you want
that only ONCE
per socket. You can do this on the first CPU to use the uncore to add
something a bit
more dynamic and to make sure you have some control over where the
overhead is applied.
Once you do that, only one CPU/socket will get the interrupt and all
will be fine.



On Fri, Dec 10, 2010 at 11:47 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Fri, 2010-12-10 at 00:46 +0100, Stephane Eranian wrote:
>> Hi,
>>
>> So I have tested this patch a bit on WSM and as I expected there
>> are issues with sampling.
>>
>> When HT is on, both siblings CPUs get the interrupt. The HW does not
>> allow you to only point interrupts to a single HT thread (CPU).
>
> Egads, how ugly :/
>
>> I did verify that indeed both threads get the interrupt and that you have a
>> race condition. Both sibling CPUs stop uncore, get the status. They may get
>> the same overflow status. Both will pass the uncore->active_mask because
>> it's shared among siblings cores. Thus, Âyou have a race for the whole
>> interrupt handler execution.
>>
>> You need some serialization in there. But the patch does not address this.
>> The problem is different from the back-to-back interrupt issue that
>> Don worked on.
>> The per-cpu marked/handled trick cannot work to avoid this problem.
>>
>> You cannot simply say "the lowest indexed" CPU of a sibling pair
>> handles the interrupt
>> because you don't know if this in an uncore intr, core interrupt or
>> something else. You
>> need to check. That means each HT thread needs to check uncore
>> ovfl_status. IF the
>> status is zero, then return. Otherwise, you need to do a 2nd level
>> check before you can
>> execute the handler. You need to know if the sibling CPU has already
>> "consumed" that
>> interrupt.
>>
>> I think you need some sort of generation counter per physical core and
>> per HT thread.
>> On interrupt, you could do something along the line of:
>> Â Â Â if (mycpu->intr_count == mysibling->intr_count) {
>> Â Â Â Â Â then mycpu->intr_count++
>> Â Â Â Â Â execute intr_handler()
>> Â Â Â } else {
>> Â Â Â Â Â mycpu->intr_count++
>> Â Â Â Â Â return;
>> Â Â Â }
>> Of course, the above needs some atomicity and ad locking
>
> Does that guarantee that the same sibling handles all interrupts? Since
> a lot of the infrastructure uses local*_t we're not good with cross-cpu
> stuff.
>
> Damn what a mess.. we need to serialize enough for both cpus to at least
> see the overflow bit.. maybe something like:
>
>
> struct intel_percore {
> Â ...
> Â atomic_t uncore_barrier;
> };
>
> void uncore_barrier(void)
> {
> Â Â Â Âstruct intel_percore *percore = this_cpu_ptr(cpu_hw_events)->percore;
> Â Â Â Âint armed;
>
> Â Â Â Âarmed = atomic_cmpxchg(&percore->uncore_barrier, 0, 1) == 0;
> Â Â Â Âif (armed) {
> Â Â Â Â Â Â Â Â/* we armed, it, now wait for completion */
> Â Â Â Â Â Â Â Âwhile (atomic_read(&percore->uncore_barrier))
> Â Â Â Â Â Â Â Â Â Â Â Âcpu_relax();
> Â Â Â Â} else {
> Â Â Â Â Â Â Â Â/* our sibling must have, decrement it */
> Â Â Â Â Â Â Â Âif (atomic_cmpxchg(&percore->uncore_barrier, 1, 0) != 1)
> Â Â Â Â Â Â Â Â Â Â Â ÂBUG();
> Â Â Â Â}
> }
>
> Then have something like:
>
> handle_uncore_interrupt()
> {
> Â Â Â Âu64 overflow = rdmsrl(MSR_UNCORE_PERF_GLOBAL_OVF_STATUS);
> Â Â Â Âint cpu;
>
> Â Â Â Âif (!overflow)
> Â Â Â Â Â Â Â Âreturn 0; /* not our interrupt to handle */
>
> Â Â Â Âuncore_barrier(); /* wait so our sibling will also observe the overflow */
>
> Â Â Â Âcpu = smp_processor_id();
> Â Â Â Âif (cpu != cpumask_first(topology_thread_cpumask(cpu)))
> Â Â Â Â Â Â Â Âreturn 1; /* our sibling will handle it, eat the NMI */
>
> Â Â Â Â/* OK, we've got an overflow and we're the first CPU in the thread mask */
>
> Â Â Â Â... do fancy stuff ...
>
> Â Â Â Âreturn 1; /* we handled it, eat the NMI */
> }
>
>
>> (but I don't think you can use locks in NMI context).
>
> You can, as long as they're never used from !NMI, its icky, but it
> works.
>
>> This makes me wonder if vectoring uncore to NMI is really needed,
>> given you cannot
>> correlated to an IP, incl. a kernel IP. If we were to vector to a
>> dedicated (lower prio)
>> vector, then we could use the trick of saying the lowest indexed CPU in a pair
>> handles the interrupt (because we would already know this is an uncore
>> interrupt).
>> This would be much simpler. Price: not samples in kernel's critical
>> section. But those
>> are useless anyway with uncore events.
>
> But the uncore uses the same PMI line, right? You cannot point the
> uncore to another vector. /me goes find the docs -- ok, its too early in
> the morning to clearly parse that...
>
> Besides, people asked for the sampling thing didn't they (also we need
> it to fold the count to avoid overflow on 48bit). Also the PAPI people
> even want per-task uncore counters because that's the only thing PAPI
> can do.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/