Re: Jiffy wraparound problems.

Ville Herva (vherva@niksula.hut.fi)
Thu, 29 Apr 1999 08:52:03 +0300


On Wed, 28 Apr 1999, David Benson <daveb@ffem.org> wrote:
>
> We have quite a few machines with long uptimes here.
>
> We had noticed a collection of problems with them, in particular
> select(2) breaks, once the "jiffies" global variable reaches
> 2**31, and starts looking signed. The bad behaviour is that
> the "timeval" structure (the fifth argument to select)
> gets set with garbage values, much larger
> than those passed in. In the 2.0.* series of kernels this appears
> to be a bug in the sys_select function in fs/select.c
> and in 2.2.* it is in schedule_timeout.

Interesting, I just came across the same problem. I actually reported it
yesterday on this list. I haven't heard any comments about the proposed
fix.

> The 2.2 problems are somewhat conjectural, but I wrote a test
> kernel-module which sets `jiffies' to 0xf000000; this machine then
> exhibits the select bug.

Well, I was not that careful with 2.2. I just quickly checked whether the
exact same error was being made, and it wasn't.

> Other visible problems:
> times() returns -1; perhaps this is inevitable/correct...

According to the man page, times() should return the number of clock ticks
that have elapsed since the system has been up, so I think that's not correct.

I have not experienced that problem with times() - clock() on the other hand
does just that. (Are you using glibc or libc5?) The problem with clock()
is/was that the kernel passes the (unsigned long) jiffies to glibc clock() as
it is, but glibc expects a signed int (time_t). If the value glibc gets
from the kernel looks negative, glibc interprets that as an error and
returns -1. The kernel even uses some negative error codes when it can't
allocate memory. This has been fixed in the CVS glibc according to the glibc
maintainers I was in touch with. (The fixed glibc doesn't return an error in
any case, even if the kernel returned an error code (-501...-1). OTOH, the
kernel can return -501...-1 also, when jiffies is in that range.)
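
To make the sign problem concrete, here is a minimal user-space sketch. It
assumes a 32-bit tick counter (as unsigned long is on i386), and the helper
names are made up for illustration; they are not kernel or glibc functions.
The -501...-1 range is simply the one mentioned above.

```c
#include <stdint.h>

/* Hypothetical helper: would this 32-bit tick count look negative to a
 * caller that reads the return value as a signed 32-bit integer? */
static int looks_negative(uint32_t ticks)
{
    /* Anything at or above 2**31 has the top bit set. */
    return ticks > (uint32_t)INT32_MAX;
}

/* Hypothetical helper: would this tick count be mistaken for one of the
 * error codes -501...-1 when read as a signed 32-bit integer? */
static int looks_like_error_code(uint32_t ticks)
{
    /* (uint32_t)-501 is 2**32 - 501; the conversion is well defined. */
    return ticks >= (uint32_t)-501;
}
```

With these, looks_negative(0x80000000u) is true while
looks_negative(0x7fffffffu) is not, and only the last 501 values before the
counter wraps to zero satisfy looks_like_error_code().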

> The way we discovered the select problem, btw, was that `zsh's
> interface depends on the select() returned time.

Same here. I'm not sure about zsh-3.0.5, but I did investigate 3.1.5.
There the problem lay in Zle/zle_main.c. The zsh team has already fixed
this:

=================================================================
Date: Wed, 28 Apr 1999 09:51:16 +0200
From: Peter Stephenson <pws@ibmth.df.unipi.it>
Subject: Re: [Solved] Re: Terminal problem with linux-2.0.34
To: Ville Herva <vherva@babbage.tky.hut.fi>,
zsh-workers@sunsite.auc.dk (Zsh hackers list)

Ville Herva wrote:
>
> This proved to be the select call in line 517 in Zle/zle_main.c (zsh-3.1.5
> vanilla):
>
> if (!kungetct && select(SHTTY+1, (SELECT_ARG_2_T) & foofd,
> NULL, NULL, &tv) <= 0)
>
> If I add the following line before the above call, zsh works well:
>
> tv.tv_sec = 0;

Thanks, this is the diff for 3.1.5-pws-16 (Bart's already put something
into 3.0.6).

--- Src/Zle/zle_main.c.tv Mon Apr 19 11:40:09 1999
+++ Src/Zle/zle_main.c Wed Apr 28 09:48:30 1999
@@ -413,7 +413,6 @@

baud = getiparam("BAUD");
costmult = (baud) ? 3840000L / baud : 0;
- tv.tv_sec = 0;
#endif

/* ZLE doesn't currently work recursively. This is needed in case a
*
@@ -523,6 +522,7 @@
#ifdef HAVE_SELECT
if (baud && !(lastcmd & ZLE_MENUCMP)) {
FD_SET(SHTTY, &foofd);
+ tv.tv_sec = 0;
if ((tv.tv_usec = cost * costmult) > 500000)
tv.tv_usec = 500000;
if (!kungetct && select(SHTTY+1, (SELECT_ARG_2_T) & foofd,
=================================================================
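
For user-space code in general, the same defensive pattern can be sketched as
below: fill in both fields of the timeval immediately before every select()
call, since Linux select() writes the remaining time back into it. This is
only an illustration and not zsh code; read_byte_timeout is a hypothetical
helper name.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to usec microseconds (usec < 1000000) for
 * one byte on fd. Returns 1 if a byte was read into *out, 0 on timeout,
 * -1 on error. */
static int read_byte_timeout(int fd, long usec, char *out)
{
    fd_set rfds;
    struct timeval tv;
    int r;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    /* Set both fields on every call; never reuse whatever an earlier
     * select() left in the structure. */
    tv.tv_sec = 0;
    tv.tv_usec = usec;

    r = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (r <= 0)
        return r;            /* 0: timed out, -1: select() failed */

    return read(fd, out, 1) == 1 ? 1 : -1;
}
```

Because tv is a local that is rebuilt on each call, a garbage value written
back by a misbehaving kernel cannot leak into the next wait.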

> Are there any plans to fix these issues? Are they known problems?

I proposed yesterday the following:

=================================================================
If the timeout is reached during do_select, current->timeout becomes zero.
The line

if ((long) timeout < 0)
    timeout = 0;

is probably for that. However, if jiffies > 2**31, that won't work (since
timeout wraps around and actually becomes positive), and select returns
false values. I would suggest something like

timeout = current->timeout ? current->timeout - jiffies - 1 : 0;

instead of

timeout = current->timeout - jiffies - 1;
=================================================================
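
The arithmetic can be checked outside the kernel with a small model. This is
a sketch under the assumption that unsigned long is 32 bits, so uint32_t
stands in for it; the function names are hypothetical and cur_timeout plays
the role of current->timeout.

```c
#include <stdint.h>

/* Model of the existing computation followed by the (long) < 0 clamp. */
static uint32_t remaining_original(uint32_t cur_timeout, uint32_t jiffies)
{
    uint32_t timeout = cur_timeout - jiffies - 1;   /* wraps modulo 2**32 */

    if (timeout > (uint32_t)INT32_MAX)   /* same as (long) timeout < 0 */
        timeout = 0;
    return timeout;
}

/* Model of the suggested computation: an expired timer (cur_timeout == 0)
 * yields 0 directly instead of going through the subtraction. */
static uint32_t remaining_proposed(uint32_t cur_timeout, uint32_t jiffies)
{
    uint32_t timeout = cur_timeout ? cur_timeout - jiffies - 1 : 0;

    if (timeout > (uint32_t)INT32_MAX)
        timeout = 0;
    return timeout;
}
```

In this model remaining_original(0, 1000) is clamped to 0 as intended, but
remaining_original(0, 0x80000000u) comes out as 0x7fffffff because the
wrapped result no longer looks negative. remaining_proposed(0, 0x80000000u)
gives 0, and both versions agree while the timer is still running.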

But I have no clue whether Alan Cox is planning to include the fix in
2.0.37.

-- v --

v@iki.fi
