*Much* worse fork failure prob than expected!!! NFS death!

Harvey J. Stein (hjstein@bfr.co.il)
Thu, 23 Jul 1998 13:50:54 +0300


The probability of not having 2 free consecutive aligned pages is much
worse than was originally calculated (sometimes by more than a factor
of 25)! For example, here's a table where I solved for how much memory
must be free to have a <= 3% chance of failure (1 failure in every 33
tries).

% of free memory needed to have a <= 3% chance of double page
allocation failure:

Mem size (MB)   % free
      4          8.0%
      8          5.7%
     16          4.1%
     32          2.9%
     64          2.1%
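
Here's a minimal Python sketch (the names log_comb, p_fail, and
min_free_pct are mine) showing how such a table can be generated from
the closed form derived later in this message:

    import math

    def log_comb(n, k):
        # log of C(n, k), computed via lgamma to avoid huge integers
        return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

    def p_fail(x, y):
        # Probability that y free pages among x leave no aligned free
        # pair: C(x/2, y) * 2^y / C(x, y), per the derivation below.
        if y > x // 2:
            return 0.0  # pigeonhole: some bucket must hold 2 free pages
        return math.exp(log_comb(x // 2, y) + y * math.log(2) - log_comb(x, y))

    def min_free_pct(mb, target=0.03):
        # Smallest % free giving at most a `target` chance of failure.
        x = mb * 1024 // 4               # number of 4KB pages
        for y in range(x + 1):           # p_fail is decreasing in y
            if p_fail(x, y) <= target:
                return 100.0 * y / x
        return 100.0

    for mb in (4, 8, 16, 32, 64):
        print("%3d MB: %.1f%% free" % (mb, min_free_pct(mb)))

Going through lgamma keeps the intermediate numbers manageable even
for large page counts, where the binomial coefficients themselves get
astronomically big.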

The situation for 4 page allocations (which I understand NFS needs for
8k packets) is probably going to be much worse, but I have to think
about it more to work it out.

The only good thing I can say is that the function drops off quickly
for any given memory size.

% free & % failures for 64MB:

% free % failures
1.0 44.311
1.1 37.002
1.2 30.713
1.3 25.082
1.4 19.870
1.5 15.692
1.6 11.994
1.7 9.157
1.8 6.877
1.9 4.983
2.0 3.617
2.1 2.527
2.2 1.772
2.3 1.222
2.4 0.809
2.5 0.539
2.6 0.353
2.7 0.221
2.8 0.140
2.9 0.084
3.0 0.052
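
For reference, the table above can be regenerated the same way,
assuming the hypothetical p_fail function from the earlier sketch:

    x = 64 * 1024 // 4               # 64MB worth of 4KB pages
    for tenths in range(10, 31):     # 1.0% through 3.0% free
        y = x * tenths // 1000       # free pages, rounded down
        print("%.1f %10.3f" % (tenths / 10.0, 100 * p_fail(x, y)))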

This seems to say that:

a) for any given memory size, a fairly large pool of free memory is
   needed for 2 page allocations to mostly succeed, and
b) a small drop in free memory can suddenly cause a large % of 2
   page allocation failures.

Given that page usage is constantly rising and falling, it seems
fairly likely that one would bump into regions of serious allocation
failures. For example, with 64MB of RAM, if the % free is ranging
from 2.5 to 1.5, you'll see a 2 page allocation failure rate running
from 1 in 200 to 3 in 20!

And, actually, given that the kernel and certain unswappable regions
allocate low memory, the numbers are even worse than the above
indicates. The free pool isn't the full memory size - allocations are
made from a smaller contiguous block of RAM.

The solutions are:

1) Guarantee a certain % free to guarantee a certain max % of 2 page
   allocation failures. The % free would be a complicated function of
   total RAM, probably not computable in the kernel, except maybe by
   using Stirling's approximation. I can work out a table if someone
   tells me what failure % is desired.
2) Estimate the probability dynamically. I.e. - start freeing
   wildly if the % of failed 2 page allocations is greater than y.
   Note that the function grows *fast* as memory depletes. You'll
   have to be careful to get aggressive early enough. (A toy sketch
   of this sort of accounting appears after this list.)
3) Even the odds - do something to cause clumping of free pages.
4) Reduce the # of required 2 page allocations - don't require a 2
   page allocation for each process.
5) Make allocation coarser. Allocate in 2 page chunks instead of 1
   page chunks. This'd waste space but make allocations work. Tune
   the kernel to use the bigger chunks.
6) Something not listed above.
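
To make option 2 concrete, here's a toy sketch in Python (mine, purely
illustrative - nothing like this exists in the kernel, and the window
size and threshold are made-up placeholders):

    from collections import deque

    class TwoPageAllocMonitor:
        # Toy failure-rate monitor for 2 page allocations.

        def __init__(self, window=200, threshold=0.03):
            # window and threshold are placeholders, not tuned values
            self.results = deque(maxlen=window)
            self.threshold = threshold

        def record(self, succeeded):
            # Call after every 2 page allocation attempt.
            self.results.append(succeeded)

        def should_free_aggressively(self):
            # True once the recent failure rate crosses the threshold.
            # Per the note in option 2, the trigger has to fire early:
            # the failure probability climbs steeply as free memory drops.
            if not self.results:
                return False
            fails = sum(1 for ok in self.results if not ok)
            return fails / len(self.results) > self.threshold

In a real kernel this would be a pair of counters, not a Python deque;
the point is just that the trigger has to lead the curve rather than
follow it.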

The computations are as follows:

hjstein@bfr.co.il (Harvey J. Stein) writes:

> Linus Torvalds <torvalds@transmeta.com> writes:
>
> > On 15 Jul 1998, Harvey J. Stein wrote:
> > > The math isn't quite right. Consider:
> > >
> > > 1st hole placed on 1st page.
> > >
> > > 2nd hole placed on 4th page.
> > >
> > > There are now x-5 places to place the 3rd hole, not x-4.
> >
> > No, the thing is that you also have to worry about the alignment: we get
> > a 8kB area only if we have two consecutive 4kB holes that are also
> > aligned.
> >
> > As such 0+1 would be a 8kB area, and 4+5 would be one, but 3+4 and 1+2
> > would not.
>
> Ok. In that case I'm not counting the right thing. I'll redo my
> computations and see what happens.

Ok. I worked it out (with the help of my office mate). Memory looks
like this:

-- -- | -- -- | -- -- | ...

The -- represent pages. The | just divide all pages into pairs (which
I'll call buckets).

Suppose there are x pages (x/2 buckets). Suppose that y pages are
free.

We can make a 2 page allocation iff there are 2 consecutive aligned
free pages. The latter holds iff both pages of one of the above
buckets are free.

Thus, the number of memory configurations in which we can't make a 2
page allocation is the number of ways to place the free pages such
that there's at most 1 free page in each bucket.

The number of ways to do this, not counting the position of the free
page within each bucket, is just the number of ways to choose y of the
x/2 buckets, which is x/2 choose y.

The number of ways to do this, accounting for the free page internal
bucket position is (x/2 choose y) * 2^y. I.e. - multiply by the
number of ways to arrange the free pages in the chosen buckets.

The total number of ways to allocate the free pages is just (x choose
y).

Thus, the probability of having to swap is:

(x/2 choose y) * 2^y
--------------------
(x choose y)

(where x choose y = x!/(y!(x-y)!) = the number of ways to select y
items from a set of x items).
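
As a sanity check on this combinatorial argument, here's a small Monte
Carlo sketch (again mine, just for verification) that scatters y free
pages at random and counts the configurations with no aligned free
pair:

    import random

    def mc_fail(x, y, trials=100000):
        # Empirical probability that y random free pages among x leave
        # no aligned free pair.  Page p's partner within its bucket is
        # p ^ 1, i.e. the pairs are (0,1), (2,3), (4,5), ...
        bad = 0
        for _ in range(trials):
            free = set(random.sample(range(x), y))
            if not any(p ^ 1 in free for p in free):
                bad += 1
        return bad / trials

    # mc_fail(1024, 81) should land near 0.032, matching the 3.212%
    # entry for 4MB at 8% free in the table below.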

I've appended Linus' original table along with the percentage of "bad"
configurations as computed by the above formula. BTW, the numbers
start out higher & drop off slower (as X increases with Y/X = %
free memory/100 held constant). Actually, the old numbers for X total
pages are almost identical to the new numbers for 2X total pages, so
any arguments about machines with n MB will basically carry over to
machines with 2*n MB. The "Orig %" & "By above" columns are the
% chance that there is no pair of free pages that is adjacent &
aligned.

total pgs free pgs tot MB % free Orig % By above
X = 1024, Y = 20 ( 4MB, 2% free) = 68.6673% 82.765%
X = 1024, Y = 51 ( 4MB, 5% free) = 7.6047% 26.974%
X = 1024, Y = 81 ( 4MB, 8% free) = 0.1245% 3.212%

X = 2048, Y = 40 ( 8MB, 2% free) = 46.2224% 67.816%
X = 2048, Y = 102 ( 8MB, 5% free) = 0.5488% 7.083%
X = 2048, Y = 163 ( 8MB, 8% free) = 0.0001% 0.090%

X = 4096, Y = 81 ( 16MB, 2% free) = 20.1256% 44.623%
X = 4096, Y = 204 ( 16MB, 5% free) = 0.0029% 0.488%
X = 4096, Y = 327 ( 16MB, 8% free) = 0.0000% 0.00007101%

X = 8192, Y = 163 ( 32MB, 2% free) = 3.8125% 19.312%
X = 8192, Y = 409 ( 32MB, 5% free) = 0.0000% 0.0021999%
X = 8192, Y = 655 ( 32MB, 8% free) = 0.0000% 0.4401e-10%

X = 16384, Y = 327 ( 64MB, 2% free) = 0.1368% 3.6167%
X = 16384, Y = 819 ( 64MB, 5% free) = 0.0000% 0.4463e-7%
X = 16384, Y = 1310 ( 64MB, 8% free) = 0.0000% 0.1851e-22%

X = 32768, Y = 655 (128MB, 2% free) = 0.0002% 0.1268%
X = 32768, Y = 1638 (128MB, 5% free) = 0.0000% 0.1939e-16%
X = 32768, Y = 2621 (128MB, 8% free) = 0.0000% 0.2989e-47%

X = 65536, Y = 1310 (256MB, 2% free) = 0.0000% 0.0001592%
X = 65536, Y = 3276 (256MB, 5% free) = 0.0000% 0.3659e-35%
X = 65536, Y = 5242 (256MB, 8% free) = 0.0000% 0.8538e-97%

--
Harvey J. Stein
BFM Financial Research
hjstein@bfr.co.il
