timerfd waking up before timer really expires

From: Lucas De Marchi
Date: Tue Mar 03 2015 - 14:59:48 EST


Hi,

I was debugging my application and noticed that a timerfd event was being
triggered *before* the timer expires. I'd like to know if this behavior is
expected. More details below.


I reduced the scope of the program to test a single timerfd and measure the
difference in the result of clock_gettime() between two reads. Helper functions
are elsewhere, but the loop is very simple:

loop_time_fd = setup_timerfd(interval, 0);
iter = 0;
old = curr = now_usec();
do {
if (iter > iterations)
break;

r = read(loop_time_fd, &events, sizeof(events));
if (r < 0) {
if (errno == EINTR)
continue;
return log_error_errno(errno, "Error reading timerfd: %m");
}

curr = now_usec();

/* mark as USEC_INFINITY those events we missed, the real wait
* time is nonetheless in the next position */
for (; events > 1; events--)
elapsed[iter++] = USEC_INFINITY;

elapsed[iter++] = curr - old;
old = curr;
} while (1);


Here we have:
- now_usec() simply calls clock_gettime() with CLOCK_MONOTONIC as clockid and
converts to usec.
- setup_timerfd() creates the timerfd with CLOCK_MONOTONIC and let it
in blocking mode (there's another version of the test with poll()
and non-block fd,
but it shows the same behavior)

First I thought it could be because of the timerslack. I disabled it with
prctl(). I'm also running the process with RT priority so AFAIK there shouldn't
be a problem with timerslack. In order to reduce the overhead for the process I
also tried to isolate it alone on a single cpu by booting with isolcpus and
then setting the affinity accordingly.

For whatever interval I configure and 10000 iterations, what I'm seeing in the
elapsed vector are values like

interval +- 70usec

Due to the wakeup overhead, I understand the +70usec, but I'm not getting why
there are elapsed values shorter than the configured interval.

Tracing for wakeup events with ftrace[1] I also get the same results... the time
difference between 2 wakeup events for ~30% of events are shorter the
configured interval.

I'm currently running with kernel version 3.18.6. Is there anything I'm
missing?

thanks

--
Lucas De Marchi

[1] trace-cmd record -F -e sched:sched_wakeup ./test-timer 1000 10000
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/