Re: [PATCH 0/2] x86/intel_rdt and perf/x86: Fix lack of coordination with perf

From: Reinette Chatre
Date: Wed Aug 08 2018 - 13:33:05 EST


Hi Peter and Tony,

On 8/8/2018 12:51 AM, Peter Zijlstra wrote:
> On Tue, Aug 07, 2018 at 03:47:15PM -0700, Reinette Chatre wrote:
>>> FWIW, how long is that IRQ disabled section? It looks like something
>>> that could be taking a bit of time. We have these people that care about
>>> IRQ latency.
>>
>> We work closely with customers needing low latency as well as customers
>> needing deterministic behavior.
>>
>> This measurement is triggered by the user as a validation mechanism of
>> the pseudo-locked memory region after it has been created as part of
>> system setup as well as during runtime if there are any concerns with
>> the performance of an application that uses it.
>>
>> This measurement would thus be triggered before the sensitive workloads
>> start - during system setup, or if an issue is already present. In
>> either case the measurement is triggered by the administrator via debugfs.
>
> That does not in fact include the answer to the question. Also, it
> assumes a competent operator (something I've found is not always true).

My apologies, I did not intend to avoid your question. The main goal of
this measurement is for an operator to test whether a pseudo-locked
region has been set up correctly. It is thus targeted at system setup
time, before the IRQ-sensitive workloads - some of which would use these
memory regions - start.
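
For reference, the measurement is triggered from debugfs; with the
interface introduced by this series that looks something like:

echo 1 > /sys/kernel/debug/resctrl/<region>/pseudo_lock_measure

with the results reported via the pseudo_lock_l2 trace event, as in the
traces below.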

>>> - I don't much fancy people accessing the guts of events like that;
>>> would not an inline function like:
>>>
>>> static inline u64 x86_perf_rdpmc(struct perf_event *event)
>>> {
>>> 	u64 val;
>>>
>>> 	lockdep_assert_irqs_disabled();
>>>
>>> 	rdpmcl(event->hw.event_base_rdpmc, val);
>>> 	return val;
>>> }
>>>
>>> Work for you?
>>
>> No. This does not provide accurate results. Implementing the above produces:
>> pseudo_lock_mea-366 [002] .... 34.950740: pseudo_lock_l2: hits=4096 miss=4
>
> But it being an inline function should allow the compiler to optimize
> and lift the event->hw.event_base_rdpmc load like you now do manually.
> Also, like Tony already suggested, you can prime that load just fine by
> doing an extra invocation.
>
> (and note that the above function is _much_ simpler than
> perf_event_read_local())

Unfortunately I do not find this to be the case. When I implement
x86_perf_rdpmc() _exactly_ as you suggest above and perform the
measurement like this:

l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
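/* the reads above only prime the loads; repeat them for the real values */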
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
l2_hits_after = x86_perf_rdpmc(l2_hit_event);
l2_miss_after = x86_perf_rdpmc(l2_miss_event);

Then the results are not accurate, nor are they consistently inaccurate
enough to treat the error as a constant adjustment:

pseudo_lock_mea-409 [002] .... 194.322611: pseudo_lock_l2: hits=4100 miss=0
pseudo_lock_mea-412 [002] .... 195.520203: pseudo_lock_l2: hits=4096 miss=3
pseudo_lock_mea-415 [002] .... 196.571114: pseudo_lock_l2: hits=4097 miss=3
pseudo_lock_mea-422 [002] .... 197.629118: pseudo_lock_l2: hits=4097 miss=3
pseudo_lock_mea-425 [002] .... 198.687160: pseudo_lock_l2: hits=4096 miss=3
pseudo_lock_mea-428 [002] .... 199.744156: pseudo_lock_l2: hits=4096 miss=2
pseudo_lock_mea-431 [002] .... 200.801131: pseudo_lock_l2: hits=4097 miss=2
pseudo_lock_mea-434 [002] .... 201.858141: pseudo_lock_l2: hits=4097 miss=2
pseudo_lock_mea-437 [002] .... 202.917168: pseudo_lock_l2: hits=4096 miss=2

I was able to test Tony's theory: replacing the reading of the "after"
counts with a direct rdpmcl() improves the results. What I mean is this:

l2_hit_pmcnum = x86_perf_rdpmc_ctr_get(l2_hit_event);
l2_miss_pmcnum = x86_perf_rdpmc_ctr_get(l2_miss_event);
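/* indices looked up first; the "after" reads below touch no perf data */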
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
l2_hits_before = x86_perf_rdpmc(l2_hit_event);
l2_miss_before = x86_perf_rdpmc(l2_miss_event);
/* read memory */
rdpmcl(l2_hit_pmcnum, l2_hits_after);
rdpmcl(l2_miss_pmcnum, l2_miss_after);
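
For reference, x86_perf_rdpmc_ctr_get() is not an existing kernel API
but a small local helper to look up the raw counter index once, up
front; a minimal sketch of it:

static inline int x86_perf_rdpmc_ctr_get(struct perf_event *event)
{
	lockdep_assert_irqs_disabled();

	/* raw counter index that can be passed directly to rdpmcl() */
	return event->hw.event_base_rdpmc;
}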

I did not run my full tests with the above but a simple read of 256KB
of pseudo-locked memory (4096 cache lines) gives:
pseudo_lock_mea-492 [002] .... 372.001385: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-495 [002] .... 373.059748: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-498 [002] .... 374.117027: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-501 [002] .... 375.182864: pseudo_lock_l2: hits=4096 miss=0
pseudo_lock_mea-504 [002] .... 376.243958: pseudo_lock_l2: hits=4096 miss=0

We thus seem to be encountering the issue Tony predicted where the
memory being tested is evicting the earlier measurement code and data.

>>> - native_read_pmc(); are you 100% sure this code only ever runs on
>>> native and not in some dodgy virt environment?
>>
>> My understanding is that a virtual environment would be a customer of a
>> RDT allocation (cache or memory bandwidth). I do not see if/where this
>> is restricted though - I'll move to rdpmcl() but the usage of a cache
>> allocation feature like this from a virtual machine needs more
>> investigation.
>
> I can imagine that hypervisors that allow physical partitioning could
> allow delegating the rdt crud to their guests when they 'own' a full
> socket or whatever the domain is for this.

I am using rdpmcl() now.
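From my reading of arch/x86/include/asm/msr.h and asm/paravirt.h this
also addresses the virt concern, since rdpmcl() is roughly:

#ifdef CONFIG_PARAVIRT
#define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
#else
#define rdpmcl(counter, val) ((val) = native_read_pmc(counter))
#endif

so a paravirt kernel no longer ends up in native_read_pmc()
unconditionally.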

>
>> Will do. I created the following helper function that can be used after
>> interrupts are disabled:
>>
>> static inline int perf_event_error_state(struct perf_event *event)
>> {
>> 	int ret = 0;
>> 	u64 tmp;
>>
>> 	ret = perf_event_read_local(event, &tmp, NULL, NULL);
>> 	if (ret < 0)
>> 		return ret;
>>
>> 	if (event->attr.pinned && event->oncpu != smp_processor_id())
>> 		return -EBUSY;
>>
>> 	return ret;
>> }
>
> Nah, stick the test in perf_event_read_local(), that actually needs it.

I will keep this outside for now since it does not seem as though
perf_event_read_local() works well for this use case.

>>> Also, while you disable IRQs, your fancy pants loop is still subject to
>>> NMIs that can/will perturb your measurements, how do you deal with
>>> those?
>
>> Customers interested in this feature are familiar with dealing with them
>> (and also SMIs). The user space counterpart is able to detect such an
>> occurrence.
>
> You're very optimistic about your customers capabilities. And this might
> be true for the current people you're talking to, but once this is
> available and public, joe monkey will have access and he _will_ screw it
> up.

I think this ventures into an area that is an existing issue for all
workloads in this category; it is not unique to this feature.

>
>> Please note that if an NMI arrives it would be handled with the
>> currently active cache capacity bitmask so none of the pseudo-locked
>> memory will be evicted since no capacity bitmask overlaps with the
>> pseudo-locked region.
>
> So exceptions change / have their own bitmask?

My understanding is that interrupts run with the currently active
capacity bitmask. The capacity bitmasks specify which portions of cache
a cpu and/or task may allocate into. No capacity bitmask overlaps with
the pseudo-locked region, so no cpu or task would be able to evict the
data from the pseudo-locked region. Even if interrupts do later obtain
their own class of service with its own capacity bitmasks, those
bitmasks would still not be allowed to overlap with the pseudo-locked
region.
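
As a hypothetical four-bit illustration: if the pseudo-locked region
was created from CBM 0x3 on L2, then after locking no class of service
may include those bits, e.g.:

pseudo-locked region : 0x3 (carved out, not allocatable by anyone)
default group        : L2:0=0xc
other groups         : L2:0=0xc, or any subset of 0xc

Since no remaining bitmask shares bits with 0x3, nothing - whether cpu,
task, or interrupt - can allocate into, and thus evict from, the locked
portion of the cache.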

Reinette