Re: [patch 21/24] perfmon: Intel architectural PMU support (x86)

From: stephane eranian
Date: Wed Nov 26 2008 - 10:45:39 EST


Thomas,


On Wed, Nov 26, 2008 at 3:55 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> On Wed, 26 Nov 2008, eranian@xxxxxxxxxxxxxx wrote:
>
>> +static u64 enable_mask[PFM_MAX_PMCS];
>
> Why do we need enable_mask twice for AMD and Intel ?
>
Ok, some explanations are needed here.

Both perfmon_intel_arch.c and perfmon_amd64.c are supposed to be kernel modules.
They are hardcoded right now to make the patch simpler. Have PMU description be
modules is a key features because it allows adding new processor support without
necessarily patch the core kernel or rebooting. This has been working
nicely on Itanium.
With the introduction of Intel architectural Perfmon (IA32 SDM chapter
18), this becomes
possible on Intel X86 as well. on AMD64 processor, this is not really
an issue as they
all use the same architected PMU, except for family 10h which has nice
extensions.

In anycase, the idea is to encapsulate as much as possible code
related into a PMU model
into each module. That is why you are seing some redundancy.

There is a difference between enable_mask and used_pmcs. The used_pmcs
bitmasks shows
all the config registers in use. Whereas enable_mask shows the all
config registers which have
start/stop capabilities. For the basic AMD64 PMU (4 counters)
used_pmcs and enable_mask
are equivalent, but that is not the case on Barcelona once we support
IBS and sampling. So
for now, I could clean this up and drop enable_mask to use plain used_pmcs.


>> +static u16 max_enable;
>> +static int pfm_intel_arch_version;
>> +
>> +DEFINE_PER_CPU(u64, saved_global_ctrl);
>
> static
>
>> +/*
>> + * layout of EAX for CPUID.0xa leaf function
>> + */
>> +struct pmu_eax {
>> + unsigned int version:8; /* architectural perfmon version */
>> + unsigned int num_cnt:8; /* number of generic counters */
>> + unsigned int cnt_width:8; /* width of generic counters */
>> + unsigned int ebx_length:8; /* number of architected events */
>> +};
>
> in arch/x86/include/asm/intel_arch_perfmon.h we have already:
>
> union cpuid10_eax {
> struct {
> unsigned int version_id:8;
> unsigned int num_counters:8;
> unsigned int bit_width:8;
> unsigned int mask_length:8;
> } split;
> unsigned int full;
> };
>
> Can we either use this or remove it ?
>
>> +/*
>> + * layout of EDX for CPUID.0xa leaf function when perfmon v2 is detected
>> + */
>> +struct pmu_edx {
>> + unsigned int num_cnt:5; /* number of fixed counters */
>> + unsigned int cnt_width:8; /* width of fixed counters */
>> + unsigned int reserved:19;
>> +};
>
>> +static void pfm_intel_arch_acquire_pmu_percpu(void);
>> +static void pfm_intel_arch_release_pmu_percpu(void);
>> +static int pfm_intel_arch_stop_save(struct pfm_context *ctx,
>> + struct pfm_event_set *set);
>> +static int pfm_intel_arch_has_ovfls(struct pfm_context *ctx);
>> +static void __kprobes pfm_intel_arch_quiesce(void);
>> +
>> +/*
>> + * physical addresses of MSR controlling the perfevtsel and counter registers
>> + */
>> +struct pfm_arch_pmu_info pfm_intel_arch_pmu_info = {
>> + .stop_save = pfm_intel_arch_stop_save,
>> + .has_ovfls = pfm_intel_arch_has_ovfls,
>> + .quiesce = pfm_intel_arch_quiesce,
>> + .acquire_pmu_percpu = pfm_intel_arch_acquire_pmu_percpu,
>> + .release_pmu_percpu = pfm_intel_arch_release_pmu_percpu
>> +};
>
> A) static
> B) Please move it to the bottom to avoid all the forward declarations.
>
>> +static void pfm_intel_arch_check_errata(void)
>
> __init ?
>
>> +static void pfm_intel_arch_setup_generic(unsigned int version,
>
> Ditto.
>
>> +static void pfm_intel_arch_setup_fixed(unsigned int version,
>
> Ditto.
>
>> +static int pfm_intel_arch_probe_pmu(void)
>
> Ditto.
>
>> + /*
>> + * handle version new anythread bit (bit 2)
>> + */
>
> -ENOPARSE
>
>> + if (version == 3)
>> + rsvd = 1ULL << 3;
>
> This sets bit 3
>
>> + else
>> + rsvd = 3ULL << 2;
>
> And this sets bit 2 and 3.
>
>> +static int pfm_intel_arch_stop_save(struct pfm_context *ctx,
>> + struct pfm_event_set *set)
>> +{
>> + u64 used_mask[PFM_PMC_BV];
>> + u64 val, wmask, ovfl_mask;
>> + u32 i, count;
>> +
>> + wmask = 1ULL << pfm_pmu_conf->counter_width;
>> +
>> + pfm_arch_bv_and(used_mask,
>> + set->used_pmcs,
>> + enable_mask,
>> + max_enable);
>> +
>> + count = pfm_arch_bv_weight(used_mask, max_enable);
>
> So we have:
>
> set->used_pmcs and enable_mask and max_enable.
>
> Why can set->used_pmcs contain bits which are not in the enable_mask
> in the first place ? Why does the arch code not tell the generic code
> which pmcs are available so we can avoid all this mask, weight
> whatever magic ?
>

Because used_pmcs is part of generic code and enable_mask is a x86 construct.
As I said above, for now, I could drop enable_mask.
The arch code already export the list of available pmcs and pmds in
impl_pmcs and impl_pmds.

There is a difference between impl_pmcs and used_pmcs. The former list
everything that is
available. The latter shows what we are currently using. We may be
using fewer registers than
what is available, and we use this information to avoid
saving/restoring MSR on context switch
for instance.

> We store the same information in slightly different incarnations in
> various places and then we need to mingle them all together to get to
> the real data. That makes no sense at all.
>
>> + /*
>> + * stop monitoring
>> + * Unfortunately, this is very expensive!
>> + * wrmsrl() is serializing.
>> + */
>> + for (i = 0; count; i++) {
>> + if (pfm_arch_bv_test_bit(i, used_mask)) {
>> + wrmsrl(pfm_pmu_conf->pmc_desc[i].hw_addr, 0);
>> + count--;
>> + }
>> + }
>> +
>> + /*
>> + * if we already having a pending overflow condition, we simply
>> + * return to take care of this first.
>> + */
>> + if (set->npend_ovfls)
>> + return 1;
>
> Why are the counters enabled at all when an overflow is pending, which
> stopped the counters anyway ?
>
Because on Intel and AMD64, counters are not automatically frozen on interrupt.
On Intel X86, they can be configured to do so, but it is an all or
nothing setting.
I am not using this option because we would then have a problem with the NMI
watchdog given that it is also using a counter.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/