Re: [PATCH] x86/perf: Add hardware performance events support for Zhaoxin CPU.

From: CodyYao-oc
Date: Wed Apr 08 2020 - 03:20:45 EST


On 2020/3/31 äå6:18, Peter Zijlstra wrote:
On Tue, Mar 31, 2020 at 05:39:59PM +0800, CodyYao-oc wrote:
Zhaoxin CPU has provided facilities for monitoring performance
via PMU(Performance Monitor Unit), but the functionality is unused so far.
Therefore, add support for zhaoxin pmu to make performance related
hardware events available.

This looks like an Intel Architectural PMU v2 or so, is that correct? Do
you have a link to documentation for your CPU?

I'm sorry for such a late reply. Yes, the usage method is the same as Intel PMU v2, but there is no online document at present. For your convenience, provide some event descriptions as follows:

-----------------------------------------------------------------------------------------------------------------------------------
Event | Event | Umask | Description
| Select | |
-----------------------------------------------------------------------------------------------------------------------------------
cpu-cycles | 82h | 00h | unhalt core clock
instructions | 00h | 00h | number of instructions at retirement.
cache-references | 15h | 05h | number of fillq pushs at the current cycle.
cache-misses | 1ah | 05h | number of l2 miss pushed by fillq.
branch-instructions | 28h | 00h | counts the number of branch instructions retired.
branch-misses | 29h | 00h | mispredicted branch instructions at retirement.
bus-cycles | 83h | 00h | unhalt bus clock
stalled-cycles-frontend | 01h | 01h | Increments each cycle the # of Uops issued by the RAT to RS.
stalled-cycles-backend | 0fh | 04h | RS0/1/2/3/45 empty
L1-dcache-loads | 68h | 05h | number of retire/commit load.
L1-dcache-load-misses | 4bh | 05h | retired load uops whose data source followed an L1 miss.
L1-dcache-stores | 69h | 06h | number of retire/commit Store,no LEA
L1-dcache-store-misses | 62h | 05h | cache lines in M state evicted out of L1D due to Snoop HitM or dirty line replacement.
L1-icache-loads | 00h | 03h | number of l1i cache access for valid normal fetch,including un-cacheable access.
L1-icache-load-misses | 01h | 03h | number of l1i cache miss for valid normal fetch,including un-cacheable miss.
L1-icache-prefetches | 0ah | 03h | number of prefetch.
L1-icache-prefetch-misses | 0bh | 03h | number of prefetch miss.
dTLB-loads | 68h | 05h | number of retire/commit load
dTLB-load-misses | 2ch | 05h | number of load operations miss all level tlbs and cause a tablewalk.
dTLB-stores | 69h | 06h | number of retire/commit Store,no LEA
dTLB-store-misses | 30h | 05h | number of store operations miss all level tlbs and cause a tablewalk.
dTLB-prefetches | 64h | 05h | number of hardware pte prefetch requests dispatched out of the prefetch FIFO.
dTLB-prefetch-misses | 65h | 05h | number of hardware pte prefetch requests miss the l1d data cache.
iTLB-load | 00h | 00h | actually counter instructions.
iTLB-load-misses | 34h | 05h | number of code operations miss all level tlbs and cause a tablewalk.
-----------------------------------------------------------------------------------------------------------------------------------


+static void zhaoxin_pmu_disable_all(void)
+{
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+}
+
+static void zhaoxin_pmu_enable_all(int added)
+{
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl);
+}
+
+static inline u64 zhaoxin_pmu_get_status(void)
+{
+ u64 status;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+
+ return status;
+}
+
+static inline void zhaoxin_pmu_ack_status(u64 ack)
+{
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
+}

+static int zhaoxin_pmu_handle_irq(struct pt_regs *regs)
+{
+ struct perf_sample_data data;
+ struct cpu_hw_events *cpuc;
+ int bit;
+ u64 status;
+ bool is_zxc;
+ int handled = 0;
+
+ cpuc = this_cpu_ptr(&cpu_hw_events);
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ zhaoxin_pmu_disable_all();
+ status = zhaoxin_pmu_get_status();
+ if (!status)
+ goto done;
+
+ if (boot_cpu_data.x86 == 0x06 &&
+ (boot_cpu_data.x86_model == 0x0f ||
+ boot_cpu_data.x86_model == 0x19))
+ is_zxc = true;
+again:
+
+ /*Clearing status works only if the global control is enable on zxc.*/
+ if (is_zxc)
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl);
+
+ zhaoxin_pmu_ack_status(status);
+
+ if (is_zxc)
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

That's an unfortunate errata; perhaps write it like so:


Thank you very much for your advice, I have changed the code.

static inline void zxc_pmu_ack_status(u64 status)
{
/*
* ZXC needs global control enabled in order to clear status bits.
*/
zhaoxin_pmu_enable_all(0);
zhaoxin_pmu_ack_status(status);
zhaoxin_pmu_disable_all();
}

if (is_zxc)
zxc_pmu_ack_status(status);
else
zhaoxin_pmu_ack_status(status);

Alternatively; you can do a whole zxc specific handle_irq() and move the
get/ack status before disable_all(). If you do that, then factor this:

+ /*
+ * CondChgd bit 63 doesn't mean any overflow status. Ignore
+ * and clear the bit.
+ */
+ if (__test_and_clear_bit(63, (unsigned long *)&status)) {
+ if (!status)
+ goto done;
+ }
+
+ for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
+ struct perf_event *event = cpuc->events[bit];
+
+ handled++;
+
+ if (!test_bit(bit, cpuc->active_mask))
+ continue;
+
+ x86_perf_event_update(event);
+ perf_sample_data_init(&data, 0, event->hw.last_period);
+
+ if (!x86_perf_event_set_period(event))
+ continue;
+
+ if (perf_event_overflow(event, &data, regs))
+ x86_pmu_stop(event, 0);
+ }

bit into it's own function so you don't have to duplicate it. Then the
two handle_irq() functions only differ in the status handling.

+
+ /*
+ * Repeat if there is more work to be done:
+ */
+ status = zhaoxin_pmu_get_status();
+ if (status)
+ goto again;
+
+done:
+ zhaoxin_pmu_enable_all(0);
+ return handled;
+}
.