Re: Posix process cpu timer inaccuracies

From: Delyan Kratunov
Date: Mon Feb 26 2024 - 19:45:32 EST


Thanks for your detailed response, Thomas, I appreciate you taking the time
with my random side quest!

> [...]
>
> That's wishful thinking and there is no way to ensure that.
> Just for the record: setitimer() has been marked obsolescent in the
> POSIX standard issue 7 in 2018. The replacement is timer_settime() which
> has a few interesting properties vs. the overrun handling.

This is a great point, and I think it supersedes anything I have to say about
setitimer. I have nothing further to rehash on the process signal delivery
point; I understand the situation now, thanks to your thorough explanation!

> [...]
> I don't know and those assumptions have been clearly wrong at the point
> where the tool was written.

That was my impression as well; thanks for confirming. (I've found at least
three tools that share this same incorrect belief.)

> [...]
> > they still have the same distribution issues.
>
> CLOCK_THREAD_CPUTIME_ID exists for a reason and user space can correlate
> the thread data nicely.
>
> Aside of that there are PMUs and perf which solve all the problems you
> are trying to solve in one go.

Absolutely, the ability to write a profiler with perf_event_open is not in
question at all. However, not every environment allows PMU or
perf_event_open access. Timers could form a nice middle ground, which is
exactly how people have tried to use them.

I'd like to push back a little on the "CLOCK_THREAD_CPUTIME_ID fixes things"
point, though. From an application and library point of view, the per-thread
clocks are harder to use: you either need to orchestrate every thread to
participate voluntarily, or poll the thread ids and create timers from
another thread. In perf_event_open, this is solved via the
.inherit/.inherit_thread bits.
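
To make the orchestration burden concrete, here is a rough sketch of the
"every thread participates" variant; the 10ms period and the sival_ptr
tagging convention are mine, not anything the API prescribes:

  #include <pthread.h>
  #include <signal.h>
  #include <time.h>

  /* Every thread must run this at startup; there is no inherit bit. */
  static void arm_thread_cpu_timer(void)
  {
          clockid_t clk;
          timer_t t;
          struct sigevent sev = {
                  .sigev_notify = SIGEV_SIGNAL,
                  .sigev_signo  = SIGPROF,
                  /* tag the timer so a handler can tell whose clock
                   * expired; purely an application convention */
                  .sigev_value.sival_ptr = (void *)pthread_self(),
          };
          struct itimerspec its = {
                  .it_value    = { .tv_nsec = 10 * 1000 * 1000 },
                  .it_interval = { .tv_nsec = 10 * 1000 * 1000 },
          };

          /* Only this thread can trivially name its own CPU clock. */
          pthread_getcpuclockid(pthread_self(), &clk);
          timer_create(clk, &sev, &t);
          timer_settime(t, 0, &its, NULL);
  }

The "create timers from another thread" variant is no friendlier, since it
means discovering thread ids and racing against threads coming and going.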

More importantly, they don't work for all workloads. If I have 10 threads
that each run for 5ms, a 10ms process-wide timer would fire 5 times, while
per-thread 10ms timers would never fire. You can easily imagine an
application that accrues all of its CPU time in a way that never generates a
single signal (in the extreme, threads each living only a single tick).
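
As a toy reproducer of that degenerate case (thresholds illustrative; build
with -pthread):

  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  #define NTHREADS 10
  #define BURN_NS  (5 * 1000 * 1000L)   /* ~5ms of CPU per thread */

  /* Spin until this thread's own CPU clock shows ~5ms. */
  static void *burn(void *arg)
  {
          struct timespec ts;

          (void)arg;
          do {
                  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
          } while (ts.tv_sec == 0 && ts.tv_nsec < BURN_NS);
          return NULL;
  }

  int main(void)
  {
          pthread_t tids[NTHREADS];
          struct timespec ts;
          int i;

          for (i = 0; i < NTHREADS; i++)
                  pthread_create(&tids[i], NULL, burn, NULL);
          for (i = 0; i < NTHREADS; i++)
                  pthread_join(tids[i], NULL);

          /* ~50ms total: a 10ms process CPU timer would have fired
           * ~5 times, yet no single thread's clock ever reached 10ms,
           * so per-thread timers stay silent. */
          clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
          printf("process cpu time: %ld.%09ld s\n",
                 (long)ts.tv_sec, ts.tv_nsec);
          return 0;
  }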

Overall, what I want to establish is whether there's a path to achieve the
_assumed_ interface that these tools expect - process-wide CPU signals that
correlate with where CPU time is spent - through any existing or extended
timer API. This interface would be eminently useful, as people have clearly,
albeit misguidedly, demonstrated.

If the answer is definitely "no," I'd like to at least add some notes to the
man pages.

-- Delyan