bug in RLIMIT_SIGPENDING

From: Miguel Freitas
Date: Sat Apr 28 2007 - 18:00:47 EST


summary: there seems to be a bug in RLIMIT_SIGPENDING accounting that
can cause the value to go negative. in connection with this, the affected
process may get stuck forever trying to enter a 'clone' syscall.

long version:

- several people have experienced this problem of Xorg hanging forever
(100% cpu usage) trying to enter the 'clone' syscall to execute
xkbcomp.

- the syscall is aborted with ERESTARTNOINTR because there is a
SIGALRM signal pending. /proc/<pid>/status shows:

SigQ: 1/18446744073709551615
SigPnd: 0000000000000000
ShdPnd: 0000000000002000
SigBlk: 0000000000000000
SigIgn: 0000000000301000
SigCgt: 0000000061c06ecb

note the weird SigQ value: the second field, which is the
RLIMIT_SIGPENDING limit, reads as -1 printed as an unsigned 64-bit number.

- the signal handler is executed (as confirmed under gdb).

- the kernel then forces the syscall to be re-entered by means of the
following code in handle_signal():

    case -ERESTARTNOINTR:
            regs->rax = regs->orig_rax;
            regs->rip -= 2;
            break;

- this effectively puts user space in a kind of busy loop (much like a
spinlock) that never ends.

- the code that sets the signal handler is quoted here from Xorg gitweb:

    #define SMART_SCHEDULE_SIGNAL SIGALRM
    (...)
        bzero ((char *) &act, sizeof(struct sigaction));

        /* Set up the timer signal function */
        act.sa_handler = SmartScheduleTimer;
        sigemptyset (&act.sa_mask);
        sigaddset (&act.sa_mask, SMART_SCHEDULE_SIGNAL);
        if (sigaction (SMART_SCHEDULE_SIGNAL, &act, 0) < 0)
        {
            perror ("sigaction for smart scheduler");
            return FALSE;
        }

- the code that sets the timer is quoted here from Xorg gitweb:

    Bool
    SmartScheduleStartTimer (void)
    {
    #ifdef SMART_SCHEDULE_POSSIBLE
        struct itimerval timer;

        SmartScheduleTimerStopped = FALSE;
        timer.it_interval.tv_sec = 0;
        timer.it_interval.tv_usec = SmartScheduleInterval * 1000;
        timer.it_value.tv_sec = 0;
        timer.it_value.tv_usec = SmartScheduleInterval * 1000;
        return setitimer (ITIMER_REAL, &timer, 0) >= 0;
    #endif
        return FALSE;
    }

- having this negative rlimit may cause problems for the
__sigqueue_alloc() kernel function. however, as far as i can see, this
would at most prevent new signals from being enqueued; it would not stop
existing ones from being dequeued, cleared or otherwise handled.

- the bugzilla entry with the complete investigation is here:

https://bugs.freedesktop.org/show_bug.cgi?id=10525

thanks,

Miguel