posix timer freeze after some random time, under pthread create/destroy load
From: Anthony Mallet
Date: Wed Nov 06 2024 - 16:34:29 EST
Hi,
I'm facing an issue with POSIX timers configured to send a SIGALRM
signal upon expiry. The symptom is that the timer randomly freezes
(the signal handler is no longer invoked). After analysis, this happens
in combination with pthread creation and destruction.
I have attached a test case that can reliably reproduce my issue on
affected kernels. It involves creating a timer that increments a
global counter at each tick, while the main thread is spawning and
destroying other threads. At some point, the counter stalls. In
the context of this test case, I do heavy thread creation and
destruction, so that the issue triggers almost immediately. Regarding
the real-world issue, it happens in the context of aio(7) work, which
also involves thread creation and destruction but presumably at a much
lower rate, and the issue consequently triggers much less often.
I could reproduce the issue reliably with mainline kernels from 6.4
to 6.11 (inclusive), and on several distributions, different hardware
and glibc versions. Kernels up to and including 6.3 do not exhibit
the problem at all.
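
For readers who want to check whether a given machine falls in the range reported above, here is a minimal sketch. The helper names are hypothetical, and it only compares the major.minor prefix of the release string from uname(2) against 6.4 .. 6.11. Distribution kernels carry backports, so a match or a miss is only a hint, and kernels newer than 6.11 are simply outside what was tested here.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: returns 1 if a kernel release string such as
 * "6.8.0-45-generic" falls in the 6.4 .. 6.11 range reported above,
 * 0 if it does not, and -1 if the string cannot be parsed. */
int
release_in_reported_range(const char *release)
{
  int major, minor;

  assert(release != NULL);
  if (sscanf(release, "%d.%d", &major, &minor) != 2)
    return -1;
  return major == 6 && minor >= 4 && minor <= 11;
}

/* Same check against the running kernel, as reported by uname(2). */
int
running_kernel_in_reported_range(void)
{
  struct utsname u;

  if (uname(&u) == -1)
    return -1;
  return release_in_reported_range(u.release);
}
```

Calling running_kernel_in_reported_range() before running the test case gives a quick indication of whether a stall is expected on that machine.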
Once the issue triggers, simply resetting the timer (with
timer_settime(2)) makes it work again, until the next
stall. timer_gettime(2) does not show garbage and the values are still
as expected. Only the signal handler is not called. Manually sending
SIGALRM with raise(SIGALRM) also works and invokes the signal handler
as expected.
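
The re-arming observation above could be turned into a stopgap watchdog, sketched below. This is only an illustration of the workaround, not a fix: the struct and function names are hypothetical, and the caller is assumed to pass in the current value of its own tick counter at a period much longer than the timer interval.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>

/* Hypothetical watchdog state: the last tick count seen by the caller. */
struct tick_watch {
  int last;
};

/* Returns 0 when `now` has advanced since the previous call (ticks are
 * still arriving), 1 when it has not and `timer` was re-armed with `spec`,
 * and -1 if timer_settime(2) fails (errno is set by that call). */
int
rearm_if_stalled(struct tick_watch *w, int now, timer_t timer,
                 const struct itimerspec *spec)
{
  assert(w != NULL && spec != NULL);

  if (now != w->last) {
    w->last = now;
    return 0;
  }
  /* counter did not move: reset the timer, as described above */
  if (timer_settime(timer, 0, spec, NULL) == -1)
    return -1;
  return 1;
}
```

In the attached test case, this would be called with the `ticks` counter after each 100-period sleep. Since the stall comes back after some time, the check has to keep running for the lifetime of the timer.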
Also note that using setitimer(2) instead of a posix timer does not
show any problem with the same test program.
Before filing a proper bug report, I wanted to have your opinion
about this. This e-mail is already probably too long for an
introduction, but I can of course provide you with any missing detail
that you would deem necessary.
Thanks for your attention,
Anthony Mallet
/* Public domain - Anthony Mallet on Mon Nov 4 2024 */
#include <err.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
static volatile int ticks;
/* SIGALRM handler */
void
tick(int arg)
{
  (void)arg; /* unused */

  /* global counter - even if access is not atomic, we don't care here as the
   * exact value is not used, only the fact that the value changes is relevant
   */
  ticks++;
}
/* thread forking thread */
void *
thr(void *arg)
{
  pthread_attr_t attr;
  pthread_t t;

  (void)arg; /* unused */

  /* spawn a new thread in detached state so that we don't grow too much */
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
  /* pthread_create returns an error number and does not set errno */
  if ((errno = pthread_create(&t, &attr, thr, NULL)) != 0)
    err(2, "pthread_create");
  pthread_attr_destroy(&attr);

  return NULL;
}
int
main(void)
{
  int hz = 1000; /* 1kHz timer - the higher, the faster the issue happens */
  struct sigaction act;
  struct itimerspec tv;
  struct timespec pts, ts, rem;
  sigset_t sigset;
  timer_t timer;
  int i, c1, c2;

  /* SIGALRM handler */
  act.sa_handler = tick;
  sigemptyset(&act.sa_mask);
  act.sa_flags = 0;
  if (sigaction(SIGALRM, &act, NULL) == -1)
    err(2, "sigaction");

  sigemptyset(&sigset);
  sigaddset(&sigset, SIGALRM);
  /* pthread_sigmask returns an error number and does not set errno */
  if ((errno = pthread_sigmask(SIG_UNBLOCK, &sigset, NULL)) != 0)
    err(2, "pthread_sigmask");

  /* SIGALRM timer at 'hz' frequency */
  if (timer_create(CLOCK_REALTIME, NULL, &timer) == -1)
    err(2, "timer_create");
  tv.it_interval.tv_nsec = 1000000000/hz;
  tv.it_interval.tv_sec = 0;
  tv.it_value = tv.it_interval;

  /* thread forking threads - this is an issue spotted on ubuntu-22.04 and
   * 24.04, as well as other distributions, that affects timer signal
   * delivery. This seems to affect kernels from 6.4 to 6.11 inclusive. */
  thr(NULL);

  /* start timer */
  if (timer_settime(timer, 0, &tv, NULL) == -1)
    err(2, "timer_settime");

  /* 100 periods delay */
  pts.tv_sec = 0;
  pts.tv_nsec = tv.it_interval.tv_nsec * 100; /* 100ms */
  while (pts.tv_nsec >= 1000000000) {
    pts.tv_nsec -= 1000000000;
    pts.tv_sec++;
  }

  /* for 1s */
  for (i = 0; i < 10; i++) {
    ts = pts;
    c1 = ticks;
    while (nanosleep(&ts, &rem) != 0 && errno == EINTR)
      ts = rem;
    c2 = ticks;

    if (c1 == c2) {
      /* the counter is stuck, SIGALRM not firing anymore */
      fprintf(stderr, "SIGALRM issue after %d ticks\n", c1);

      /* just resetting the timer at this point makes it work again: */
      /* timer_settime(timer, 0, &tv, NULL); */
      /* (the issue will trigger again after some time) */

      /* also note that timer_gettime(timer, &tv) will show both correct
       * tv.it_interval and tv.it_value changing normally */

      /* manually sending SIGALRM also still works: */
      /* raise(SIGALRM); */
      return 2;
    }
  }

  printf("OK, no issue\n");
  return 0;
}