Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

From: Luck, Tony
Date: Tue Feb 07 2017 - 14:04:22 EST


On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote:
> Hi,
>
> I wanted to take a few steps back and look at the overall goals for
> cache monitoring.
> From the various threads and discussion, my understanding is as follows.
>
> I think the design must ensure that the following usage models can be monitored:
> - the allocations in your CAT partitions
> - the allocations from a task (inclusive of children tasks)
> - the allocations from a group of tasks (inclusive of children tasks)
> - the allocations from a CPU
> - the allocations from a group of CPUs
>
> All cases but the first one (CAT) are natural usage. So I want to describe
> the CAT case in more detail.
> The goal, as I understand it, is to monitor what is going on inside
> the CAT partition to detect whether it saturates or if it has room to
> "breathe". Let's take a simple example.

By "natural usage" you mean "like perf(1) provides for other events"?

But we are trying to figure out requirements here ... what data do people
need to manage caches and memory bandwidth? So from this perspective,
monitoring a CAT group is a natural first choice ... did we provision
this group with too much, or too little, cache?

From that starting point I can see that a possible next step when
finding that a CAT group has too small a cache is to drill down to
find out how the tasks in the group are using cache. Armed with that
information you could move tasks that hog too much cache (and are believed
to be streaming through memory) into a different CAT group.
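
As a rough illustration of that drill-down, here is a sketch only. It
assumes per-task occupancy numbers (in bytes) are already available from
whatever monitoring interface we end up with; the function names, the
50% threshold and the in-memory group model are all made up for the
example and are not kernel interfaces.

```python
# Hypothetical sketch: names and the threshold are illustrative only.

def find_cache_hogs(task_occupancy, hog_fraction=0.5):
    """Return task ids whose share of the group's occupancy exceeds hog_fraction.

    task_occupancy maps a task id to its measured cache occupancy in bytes.
    The result is ordered from largest to smallest occupancy.
    """
    total = sum(task_occupancy.values())
    if total == 0:
        return []
    hogs = [tid for tid, occ in task_occupancy.items()
            if occ / total > hog_fraction]
    return sorted(hogs, key=lambda tid: task_occupancy[tid], reverse=True)


def move_hogs(groups, src, dst, hogs):
    """Move the given task ids from CAT group src to CAT group dst.

    groups is a plain dict of group name -> list of task ids; this only
    models the bookkeeping, it does not touch any real CAT configuration.
    """
    for tid in hogs:
        groups[src].remove(tid)
        groups[dst].append(tid)
    return groups
```

With occupancies of 6MB, 1MB and 1MB for three tasks, only the first
task holds more than half of the group's occupancy, so it is the one
that would be moved.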

What I'm not seeing is how drilling down to CPUs helps you.

Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that
shows that 75% of the cache occupancy is attributed to CPU0, and only
25% to CPU1. What can you do with this information to improve things?
If it is deemed too complex (from a kernel code perspective) to
implement per-CPU reporting, how bad a loss would that be?

-Tony