x86 performance monitor counters save/restore on context switch

From: Will Hawkins
Date: Fri Mar 09 2018 - 02:30:03 EST


Mr. Rostedt and others interested reading on the LKML,

I hope that this is the proper venue to ask this (longwinded)
question. If it is not, I apologize for the SPAM and wasting
everyone's time and bits. I am emailing to ask for clarification about
the "policy" of saving and restoring x86 performance monitor counters
(and other PMU-related registers) on context switch in the Kernel.

Having plumbed through the code for scheduling, I get the sense that
code in the perf subsystem is the only code that would, if conditions
are right, save/restore performance registers on a context switch.

In my investigation, I started from the top where
prepare_task_switch() calls perf_event_task_sched_out() and where
finish_task_switch() calls perf_event_task_sched_in(). Having traced
the implementation of each of those functions to (what I think is)
their lowest levels, the Kernel will only save and restore performance
monitor counters if:

1. The task, process of task's CPU is actively monitoring performance.
That monitoring would have been initiated by a user by calling
perf_event_open() (or using a high level library that eventually calls
that function).
2. The performance aspects being monitored are hardware counters/events.

I am sure that there are other conditions, but those are the two that
stuck out to me the most.

All that is a long (perhaps incorrect) preface to a very simple question:

Is it only the performance counting registers that are actively in use
(again, as told to the perf subsystem by a call to perf_event_open())
that are saved/restored on context switch?

I ask because I have written code (mostly out of curiosity and not
necessarily for production) that accesses those registers directly by
writing/reading their values through the msr kernel module. If what I
said above is correct, then I have to be wary of the fact that the
values read from those counters reflect statistics from all the
processes/threads running on the same CPU at the same time. At first
blush, this was the way I expected the performance monitoring
registers and counters to work, but I wanted to confirm and you seemed
like the right person to ask.

If I was wrong about asking for your help, I apologize and hope that I
didn't waste your valuable time.

Thanks for all the work that you do on the performance monitoring
systems for Linux -- they are invaluable for debugging those
hard-to-find bottlenecks that inevitably pop up when you really need
something to "just work."

Will



.