Re: Why do processes on Linux lose signals?

From: Ray Lee
Date: Sun Nov 22 2009 - 21:04:45 EST


[ Adding potentially interested parties to the Cc:. Michael, please respond
with the latest kernel version you've tried that exhibits the problem, as well
as whether or not you've been able to create a test case that demonstrates the
signal loss. ]

On Sun, Nov 22, 2009 at 1:14 PM, Michael Tokarev <mjt@xxxxxxxxxx> wrote:
> It's a very old issue, but I still don't have an answer.
>
> In short, processes on Linux lose signals.  It happens rarely,
> but it does happen, and it happens often enough to be annoying.
>
> For example, I have a program that used alarm(2) to periodically
> check for something.  Nothing fancy, nothing interesting done in
> the signal handler, no long operations or anything: plain
> signal(2), with the handler just setting a global variable.  Under
> heavy usage (it's a DNS nameserver), in about a week (sometimes a
> few hours, sometimes a month) it stops checking for updates,
> because apparently some SIGALRM got lost.
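
[ For reference, the alarm(2) pattern described above is presumably
along these lines -- a minimal sketch, not Michael's actual code;
the handler/flag names and the interval are made up:

    #include <signal.h>
    #include <unistd.h>

    #define CHECK_INTERVAL 60          /* seconds; illustrative */

    static volatile sig_atomic_t check_pending;

    static void on_alarm(int sig)
    {
        (void)sig;
        check_pending = 1;             /* just set a global flag */
    }

    int main(void)
    {
        signal(SIGALRM, on_alarm);
        alarm(CHECK_INTERVAL);
        for (;;) {
            pause();                   /* stands in for the real work */
            if (check_pending) {
                check_pending = 0;
                /* do the periodic check, then re-arm: alarm(2) is
                   one-shot, so if a single SIGALRM is ever lost,
                   this line is never reached again and the checks
                   stop for good */
                alarm(CHECK_INTERVAL);
            }
        }
    }

Note how keeping the timer running depends on every single SIGALRM
actually being delivered. ]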
>
> For this program I had to replace alarm() with setitimer(), but
> only on Linux.  On all the other operating systems where it is
> used (Solaris, FreeBSD, HP/UX, AIX), everything works as expected.
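
[ The setitimer(2) workaround would then look something like this
(again a sketch; the function name is made up).  With it_interval
set, the kernel re-arms the timer itself, so one lost SIGALRM only
delays a check by one period instead of stopping the checks
permanently:

    #include <sys/time.h>

    static void arm_periodic_timer(unsigned seconds)
    {
        struct itimerval it = {
            .it_value    = { .tv_sec = seconds },
            .it_interval = { .tv_sec = seconds },  /* kernel re-arms */
        };
        setitimer(ITIMER_REAL, &it, NULL);
    }
]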
>
> Another common issue is a SIGIO-based event loop, in its classical
> form, in a not-heavily-loaded process.  Quite often the server
> loses a SIGIO, so even though I/O is possible, the process does
> not know about it.  The pending (or stuck) I/O gets processed on
> receipt of the next SIGIO, indicating readiness of another file
> descriptor -- since the process does a poll() after each SIGIO,
> it notices both.
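
[ The classical SIGIO loop being described is presumably something
like the following condensed sketch, with a single descriptor
(stdin) standing in for the real sockets:

    #include <fcntl.h>
    #include <poll.h>
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t io_ready;

    static void on_sigio(int sig) { (void)sig; io_ready = 1; }

    int main(void)
    {
        /* ask the kernel to send us SIGIO when the fd becomes ready */
        signal(SIGIO, on_sigio);
        fcntl(STDIN_FILENO, F_SETOWN, getpid());
        fcntl(STDIN_FILENO, F_SETFL,
              fcntl(STDIN_FILENO, F_GETFL) | O_ASYNC | O_NONBLOCK);

        /* keep SIGIO blocked except inside sigsuspend(), so the
           flag test and the wait cannot race with the handler */
        sigset_t mask, waitmask;
        sigemptyset(&mask);
        sigaddset(&mask, SIGIO);
        sigprocmask(SIG_BLOCK, &mask, &waitmask);

        struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };
        for (;;) {
            while (!io_ready)
                sigsuspend(&waitmask);  /* sleep until some SIGIO */
            io_ready = 0;
            /* one poll() after any SIGIO picks up *all* ready
               descriptors -- which is why a lost SIGIO gets
               "unstuck" by the next SIGIO for another descriptor */
            if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN)) {
                char buf[256];
                read(STDIN_FILENO, buf, sizeof buf);
            }
        }
    }
]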
>
> A "classical" (for me) example of this is an Oracle database
> version 8 (we've many of these in production still; in later
> versions they rewrote the event loop to use different techniques).
> There, there's a dispatcher process that does nothing but listens
> on the network, receives requests and sends them to a set of
> worker processes. ÂEverything is non-blocking and the process
> mostly does nothing. ÂIt is very annoying when trivial actions
> in a user application causes loooong delays - when an app sent
> some request to oracle db and that request stuck in the event
> queue because the corresponding SIGIO was never delivered. ÂIt
> helps immediately to make another connection to the same DB to
> "unstuck" that request. ÂIt is done transparently when there are
> many users are working with the database at the same time, each
> making requests --- this way any stuck/lost I/O unstucks immediately
> because new requests are coming from other users; but at evenings
> or over periods of small activity it becomes real problem.
>
> I have looked at the server's behaviour numerous times -- the
> server (Oracle) acts quite reasonably, and its strace output is
> sane enough.  That is to say, one can't blame "stupid
> closed-source programmers" for this.
>
> There are other examples like this, all involving lost signals.
> The two above are just the most "famous" for me.
>
> The problem becomes much, much worse when a system has multiple
> cores.  On a single-CPU system the situation is rare enough to be
> almost unnoticeable.  But with even a second core the issue shows
> up almost immediately -- often enough for many users to start
> calling tech support because their apps are very slow.
>
> The last time I asked a similar question here, I was told that
> signals are unreliable and should not be used.  But what is the
> reason for that unreliability, and why should signals be
> unreliable only on Linux?
>
> Thanks!
>
> /mjt
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/