Re: BUG (non-kernel), can hurt developers.

From: Richard B. Johnson
Date: Wed Nov 26 2003 - 15:15:07 EST


On Wed, 26 Nov 2003, Jamie Lokier wrote:

> Richard B. Johnson wrote:
> > The actual problem in the production machine involves two absolutely
> > independent tasks that end up using the same shared 'C' runtime
> > library. There should be no interaction between them, none
> > whatsover. However, when they both execute rand(), they interact in
> > bad ways. This interraction occurs on random days at monthly
> > intervals.
>
> On Linux (unlike Windows), there is _no_ interaction between the
> libraries of different tasks. Neither of them sees changes to the
> other's memory space.
>
> If you are seeing a fault, then there might well be a bug, even a
> kernel bug, but your test program does not illustrate the same problem.
>
> What is the "bad interaction" that you observed at monthly intervals?
> Also a SIGSEGV?
>

Yes. When the call to rand() was replaced with a static-linked
clone it went away.

> > This is likely caused by the failure to use "-s" in the compilation
> > of a shared library function, fixed in subsequent releases.
>
> No, this has nothing to do with it. Unlike Windows and some embedded
> environments, Linux shared libraries do not have "shared writable data"
> sections.

Well the libc rand() does something that looks like that.

>
> > So, I allowed rand() to be "interrupted" just as it would be in a
> > context-switch. I simply used a signal handler, knowing quite well
> > that the "interrupt" could occur at any time. [...] What I brought
> > to light was a SIGSEGV that can occur when the shared-library rand()
> > function is "interrupted".
>
> You have made a mistake. You program shows a different problem to the
> one which you noticed every month or so.
>

The calling rand() from a handler in a newer libc doesn't seg-fault.

> Calling a function from a signal handler while it is being interrupted
> by that handler is _very_ different from tasks context switching.
> They are not similar at all! (Yes, signals can be used to simulate
> context switches, but not like this!)
>

Not with the emulation. The problem is that rand() uses a thread-
specific pointer to find the seed (history variable), just like
'errno' which isn't really a static variable, but a function
that returns a pointer to a thread-specific integer. If this
is interrupted in a critical section, and that same pointer
is used, that pointer is left pointing to a variable in somebody
else's address space. That same problem is observed to happen when
the same shared runtime library was used by entirely different tasks.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/