Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation
From: David Carrillo-Cisneros
Date: Tue Dec 27 2016 - 16:35:01 EST
On Tue, Dec 27, 2016 at 12:00 PM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> Shivappa Vikas <vikas.shivappa@xxxxxxxxx> writes:
>> Ok , looks like the interface is the problem. Will try to fix
>> this. We are just trying to have a light weight monitoring
>> option so that its reasonable to monitor for a
>> very long time (like lifetime of process etc). Mainly to not have all
>> the perf scheduling overhead.
> That seems like an odd reason to define a completely new user interface.
> This is to avoid one MSR write for a RMID change per context switch
> in/out cgroup or is it other code too?
> Is there some number you can put to the overhead?
I obtained some timing by manually instrumenting the kernel on a Haswell EP.
When using one intel_cmt/llc_occupancy/ cgroup perf_event on one CPU, the
avg time to do __perf_event_task_sched_out + __perf_event_task_sched_in is
dominated by the cgroup ctx switch (~1120 ns).
When using continuous monitoring in the CQM driver, the avg time to find
the RMID to write inside the PQR context switch is ~16 ns.
Note that this excludes the MSR write; it's only the overhead of finding
the RMID to write into PQR_ASSOC. Both paths call the same routine to find
the RMID, so there are about 1100 ns of overhead in perf_cgroup_switch.
By inspection I assume most of it comes from iterating over the pmu list.
> Or is there some other overhead other than the MSR write
> you're concerned about?
No, that problem is solved with the PQR software cache introduced in the series.
> Perhaps some optimization could be done in the code to make it faster,
> then the new interface wouldn't be needed.
There are some. One on my list is to create a list of pmus with at least
one cgroup event and iterate over that in perf_cgroup_switch, instead of
the full pmus list, which has grown a lot recently with the addition of
all the uncore pmus. Even with this optimization, it's unlikely that the
whole sched_out + sched_in path gets anywhere close to the ~16 ns of the
non-perf_event approach.
Please note that context switch time for llc_occupancy events matters more
than for other events because, in order to obtain reliable measurements,
the RMID switch must be active _all_ the time, not only while the event
is read.
> FWIW there are some pending changes to context switch that will
> eliminate at least one common MSR write . If that was fixed
> you could do the RMID MSR write "for free"
That may remove the need for the PQR software cache in this series, but
it won't speed up the context switch.