Re: CFS review

From: Willy Tarreau
Date: Thu Aug 02 2007 - 00:59:30 EST


On Wed, Aug 01, 2007 at 07:17:51PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 1 Aug 2007, Roman Zippel wrote:
> >
> > I'm not so sure about that. sched_clock() has to be fast, so many archs
> > may want to continue to use jiffies. As soon as one does that one can also
> > save a lot of computational overhead by using 32bit instead of 64bit.
> > The question is then how easy that is possible.
>
> I have to say, it would be interesting to try to use 32-bit arithmetic.
>
> I also think it's likely a mistake to do a nanosecond resolution. That's
> part of what forces us to 64 bits, and it's just not even an *interesting*
> resolution.

I would add that I have been bothered by the 64-bit arithmetics when
trying to see what could be improved in the code. In fact, it's very
hard to optimize anything when you have arithmetics on integers larger
than the CPU's, and gcc is known not to emit very good code in this
situation (I remember it could not play with registers renaming, etc...).

However, I undertand why Ingo chose to use 64 bits. It has the advantage
that the numbers never wrap within 584 years. I'm well aware that it's
very difficult to keep tasks ordered according to a key which can wrap.

But if we consider that we don't need to be more precise than the return
value from gettimeofday() that all applications use, we see that a bunch
of microseconds is enough. 32 bits at the microsecond level wraps around
every hour. We may accept to recompute all keys every hour. It's not that
dramatic. The problem is how to detect that we will need to.

I remember a trick used by Tim Schmielau in his jiffies64 patch for 2.4.
He kept a copy of the highest bit of the lower word in the lowest bit of
the higher word, and considered that the lower one could not wrap before
we could check it. I liked this approach, which could be translated here
in something like the following :

Have all keys use 32-bit resolution, and monitor the 32nd bit. All tasks
must have the same value in this bit, otherwise we consider that their
keys have wrapped. The "current" value of this bit is copied somewhere.
When we walk the tree and find a task with a key which does not have its
32nd bit equal to the current value, it means that this key has wrapped,
so we have to use this information in our arithmetics.

When all keys have their 32nd bit different from the "current" value,
then we switch this value to reflect the new 32nd bit, and everything is
in sync again. The only requirement is that no key wraps around before
the "current" value is switched. This implies that no couple of tasks
could have their keys distant by more than 31 bits (35 minutes), which
seems reasonable. If we can recompute all tasks' keys when all of them
have wrapped, then we do not have to store the "current" bit value
anymore, and consider that it is always zero instead (I don't know if
the code permits this).

It is possible that using the 32nd bit to detect the wrapping may impose
us to perform some computations on 33 bits. If this is the case, then it
would be fine if we reduced the range to 31 bits, with all tasks distant
from at most 30 bits (17 minutes).

Also, I remember that the key is signed. I've never experimented with the
tricks above on signed values, but we might be able to define something
like this for the higher bits :

00 = positive, no wrap
01 = positive, wrapped
10 = negative, wrapped
11 = negative, no wrap

I have no code to show, I just wanted to expose this idea. I know that if
Ingo likes it, he will beat everyone at implementing it ;-)

> Linus

Willy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/