Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf

From: Peter Zijlstra
Date: Wed Aug 08 2018 - 03:52:01 EST

On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
> > FWIW, how long is that IRQ disabled section? It looks like something
> > that could be taking a bit of time. We have these people that care about
> > IRQ latency.
> We work closely with customers needing low latency as well as customers
> needing deterministic behavior.
> This measurement is triggered by the user as a validation mechanism of
> the pseudo-locked memory region after it has been created as part of
> system setup as well as during runtime if there are any concerns with
> the performance of an application that uses it.
> This measurement would thus be triggered before the sensitive workloads
> start - during system setup, or if an issue is already present. In
> either case the measurement is triggered by the administrator via debugfs.

That does not in fact include the answer to the question. Also, it
assumes a competent operator (something I've found is not always true).

> > - I don't much fancy people accessing the guts of events like that;
> > would not an inline function like:
> >
> > static inline u64 x86_perf_rdpmc(struct perf_event *event)
> > {
> > 	u64 val;
> >
> > 	lockdep_assert_irqs_disabled();
> >
> > 	rdpmcl(event->hw.event_base_rdpmc, val);
> > 	return val;
> > }
> >
> > Work for you?
> No. This does not provide accurate results. Implementing the above produces:
> pseudo_lock_mea-366 [002] .... 34.950740: pseudo_lock_l2: hits=4096 miss=4

But it being an inline function should allow the compiler to optimize
and lift the event->hw.event_base_rdpmc load like you now do manually.
Also, like Tony already suggested, you can prime that load just fine by
doing an extra invocation.

(and note that the above function is _much_ simpler than
perf_event_read_local())
> > - native_read_pmc(); are you 100% sure this code only ever runs on
> > native and not in some dodgy virt environment?
> My understanding is that a virtual environment would be a customer of a
> RDT allocation (cache or memory bandwidth). I do not see if/where this
> is restricted though - I'll move to rdpmcl() but the usage of a cache
> allocation feature like this from a virtual machine needs more
> investigation.

I can imagine that hypervisors that allow physical partitioning could
allow delegating the rdt crud to their guests when they 'own' a full
socket or whatever the domain is for this.

> Will do. I created the following helper function that can be used after
> interrupts are disabled:
> static inline int perf_event_error_state(struct perf_event *event)
> {
> 	int ret = 0;
> 	u64 tmp;
>
> 	ret = perf_event_read_local(event, &tmp, NULL, NULL);
> 	if (ret < 0)
> 		return ret;
>
> 	if (event->attr.pinned && event->oncpu != smp_processor_id())
> 		return -EBUSY;
>
> 	return ret;
> }

Nah, stick the test in perf_event_read_local(), that actually needs it.

> > Also, while you disable IRQs, your fancy pants loop is still subject to
> > NMIs that can/will perturb your measurements, how do you deal with
> > those?

> Customers interested in this feature are familiar with dealing with them
> (and also SMIs). The user space counterpart is able to detect such an
> occurrence.

You're very optimistic about your customers capabilities. And this might
be true for the current people you're talking to, but once this is
available and public, joe monkey will have access and he _will_ screw it
up.

> Please note that if an NMI arrives it would be handled with the
> currently active cache capacity bitmask so none of the pseudo-locked
> memory will be evicted since no capacity bitmask overlaps with the
> pseudo-locked region.

So exceptions change / have their own bitmask?