Re: [RFC] perf: need to expose sched_clock to correlate user samples with kernel samples

From: John Stultz
Date: Mon Nov 12 2012 - 17:40:11 EST


On 11/12/2012 12:54 PM, Stephane Eranian wrote:
On Mon, Nov 12, 2012 at 7:53 PM, John Stultz <john.stultz@xxxxxxxxxx> wrote:
On 11/11/2012 12:32 PM, Stephane Eranian wrote:
On Sat, Nov 10, 2012 at 3:04 AM, John Stultz <john.stultz@xxxxxxxxxx> wrote:
Also I worry that it will be abused in the same way that direct TSC access
is, where the seemingly better performance over the more careful/correct
CLOCK_MONOTONIC would cause developers to write fragile userland code that
will break when moved from one machine to the next.

The only goal for this new time source is for correlating user-level
samples with kernel-level samples, i.e., application-level events with a
PMU counter overflow, for instance. Anybody trying anything else would be
on their own.

clock_gettime(CLOCK_PERF): guaranteed to return a timestamp from the same
time source as that used by the perf_event subsystem to timestamp samples
when PERF_SAMPLE_TIME is requested in attr->sample_type.

I'm not familiar enough with perf's interfaces, but if you are going to make
this clockid bound so tightly with perf, could you maybe export a perf
timestamp from one of perf's interfaces rather than using the more generic
clock_gettime() interface?

Yeah, I considered that as well. But it is more complicated. The only syscall
we could extend for perf_events is ioctl(). But that one requires that an
event be created so we obtain a file descriptor for the ioctl() call.
So we'd have to pretend to program a dummy event just for the purpose of
obtaining a timestamp. We could do that but that's not so nice. But more
amenable to the

Sorry, you trailed off. Did you want to finish that thought? (I do that all the time. :)

Keep in mind that the clock_gettime() would be used by programs which are not
self-monitoring but may be monitored externally by a tool such as perf. We just
need them to emit their events with a timestamp that can be correlated offline
with those of perf_events.
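
For illustration, a minimal sketch of what such a monitored program might do.
CLOCK_PERF is only a proposal in this thread, so the clock id is passed in as
a parameter and an existing clock such as CLOCK_MONOTONIC can stand in for it.
timestamp_ns() and format_event() are hypothetical helper names, not part of
any existing interface, and the "<timestamp> <name>" record layout is an
assumption made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Read the given clock and return it as nanoseconds, or 0 on failure. */
static uint64_t timestamp_ns(clockid_t clk)
{
    struct timespec ts;

    if (clock_gettime(clk, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: format one application-level event as
 * "<timestamp_ns> <name>\n" so an offline tool could merge it with
 * samples recorded in the same time domain.
 * Returns the number of characters written, as snprintf() does.
 */
static int format_event(char *buf, size_t len, clockid_t clk, const char *name)
{
    return snprintf(buf, len, "%llu %s\n",
                    (unsigned long long)timestamp_ns(clk), name);
}
```

A caller would do something like format_event(buf, sizeof buf,
CLOCK_MONOTONIC, "frame_start") and append buf to its own log. The
correlation only works if the clock id passed here names the same time
source the profiler uses for its samples.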

Again, forgive me for not really knowing much about perf here, but could you have perf log an event when clock_gettime() was called, possibly recording the returned value, so you could correlate that data yourself?


I'd probably rather perf output timestamps to userland using sane clocks
(CLOCK_MONOTONIC), rather than trying to introduce a new time domain to
userland. But I probably could be convinced I'm wrong.

Can you get CLOCK_MONOTONIC efficiently and in ALL circumstances without
grabbing any locks? That would need to run from NMI context.

No, of course, which is why we have sched_clock. But I'm suggesting we consider
changing what perf exports (via maybe interpolation/translation) to be
CLOCK_MONOTONIC-ish.

Explain to me the key difference between monotonic and what sched_clock()
is returning today? Does this have to do with the global monotonic vs.
the cpu-wide monotonic?

So CLOCK_MONOTONIC is the number of NTP-corrected (for accuracy) seconds + nsecs that the machine has been up for (so that doesn't include time in suspend). It's promised to be globally monotonic across cpus.

In my understanding, sched_clock's definition has changed over time. It used to be a fast but possibly incorrect count of nanoseconds since boot, but with suspend and other events it could reset/overflow, and users (then only the scheduler) would be able to deal with it. It also wasn't guaranteed to be consistent across cpus. So it was limited to calculating approximate time intervals on a single cpu.

However, with cfs (and Peter or Ingo could probably hop in and clarify further) I believe it started to require some cross-cpu consistency, and reset events would cause problems with the scheduler, so additional layers have been added to try to enforce these additional requirements.

I suspect they aren't that far off, except that calibration frequency errors go uncorrected with sched_clock. But I was thinking you could get periodic timestamps in perf that correlated CLOCK_MONOTONIC with sched_clock and then allow the kernel to interpolate the sched_clock times out to something pretty close to CLOCK_MONOTONIC. That way perf wouldn't leak the sched_clock time domain to userland.
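
The interpolation idea can be sketched roughly as follows. This is only an
illustration under assumptions, not existing perf or kernel code: struct
clock_sync and sched_to_mono() are hypothetical names, the two correlation
records are assumed to have been captured close enough together that the
drift between them is roughly linear, and a double is used for brevity where
kernel code would use fixed-point math.

```c
#include <stdint.h>

/* One periodic correlation record: both clocks read at (nearly) the same instant. */
struct clock_sync {
    uint64_t sched_ns;  /* timestamp in the sched_clock domain */
    uint64_t mono_ns;   /* matching CLOCK_MONOTONIC timestamp */
};

/*
 * Translate a sched_clock-domain timestamp t (t >= a->sched_ns) into the
 * CLOCK_MONOTONIC domain by linear interpolation between records a and b.
 * For t beyond b the same slope is used, i.e. it extrapolates.
 */
static uint64_t sched_to_mono(const struct clock_sync *a,
                              const struct clock_sync *b,
                              uint64_t t)
{
    int64_t sched_span = (int64_t)(b->sched_ns - a->sched_ns);
    int64_t mono_span  = (int64_t)(b->mono_ns - a->mono_ns);
    int64_t off        = (int64_t)(t - a->sched_ns);
    int64_t drift;

    /* Degenerate records: fall back to assuming both clocks run at the same rate. */
    if (sched_span <= 0)
        return a->mono_ns + (uint64_t)off;

    /* Scale only the (small) rate difference, to avoid 64-bit overflow on large spans. */
    drift = (int64_t)((double)off * (double)(mono_span - sched_span) /
                      (double)sched_span);
    return a->mono_ns + (uint64_t)(off + drift);
}
```

For example, with records {1000, 5000} and {2000, 6010}, the monotonic clock
gained 10 ns over 1000 ns of sched_clock time, so a sample stamped 1500 maps
to 5505.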

Again, sorry for being a pain here. The CLOCK_PERF would be an easy solution, but I just want to make sure it's really the best one long term.

thanks
-john


