Re: ZONE_PADDING wastes 4 bytes of the new cacheline

From: Andrea Arcangeli
Date: Thu Oct 21 2004 - 20:25:38 EST


On Fri, Oct 22, 2004 at 10:34:13AM +1000, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >
> >looks reasonable. The only con is that it rejects against my tree ;);
> >pages_* and protection are gone in my tree, replaced by watermarks[]
> >using the very same optimal and proven algo from 2.4 (enabled by
> >default of course). I'll reevaluate the false sharing later on.
> >
>
> May I again ask what you think is wrong with ->protection[] apart from
> it being turned off by default? (I don't think our previous conversation
> ever reached a conclusion...)

the API is flawed; how can we enable it by default? If somebody tweaks
pages_min, the protection tuning gets screwed.

Plus it's worthless to have pages_min/low/high and protection[] when you
can combine the two things together. Keeping pages_min/low/high and
protection side by side, when protection itself is calculated as a
function of pages_min/low/high, just creates confusion. I believe this
comment explains it well enough:

/*
* setup_per_zone_protection - called whenver min_free_kbytes or
* sysctl_lower_zone_protection changes. Ensures that each zone
* has a correct pages_protected value, so an adequate number of
* pages are left in the zone after a successful __alloc_pages().
*
* This algorithm is way confusing. I tries to keep the same behavior
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* as we had with the incremental min iterative algorithm.
*/

I have to agree with the comment: if I have to start doing math on top
of this algorithm, it's almost faster to replace it with the 2.4 one,
which has clear semantics.
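
To make the contrast concrete, here is a rough sketch (illustrative
struct and names only, not the real 2.4 or 2.6 code) of the shape the
per-zone allocation check takes in the two approaches:

/* illustrative sketch -- the names are made up for the comparison */
struct zone_sketch {
        unsigned long free_pages;
        unsigned long pages_low;         /* watermark derived from pages_min */
        unsigned long protection[3];     /* 2.6: recomputed from pages_min/low/high */
        unsigned long lowmem_reserve[3]; /* 2.4: derived from the zone sizes only */
};

/* 2.6 protection[] style: the reserve sits on top of a watermark that
 * is itself a function of pages_min, so tweaking pages_min skews both
 * terms and invalidates any userspace tuning of protection[]. */
static int zone_ok_protection(struct zone_sketch *z, int class)
{
        return z->free_pages > z->pages_low + z->protection[class];
}

/* combined 2.4 style: one reserve per (zone, allocation class), a
 * function of the zone sizes alone, so pages_min tuning can't skew it */
static int zone_ok_combined(struct zone_sketch *z, unsigned long watermark,
                            int class)
{
        return z->free_pages > watermark + z->lowmem_reserve[class];
}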

An example of the runtime behaviour of the very well understood
lowmem_reserve from 2.4 (it's easier to make an example than to explain
it with words):

1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
           =  (16M dma, 784M normal, 224M high)

sysctl defaults are:

int sysctl_lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };

the results will be:

1) NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA

2) HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL

3) HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
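
Since that arithmetic is the whole algorithm, here is a minimal
userspace sketch of it in plain C (the zone sizes and names are
hard-coded from the 1G example above; the loop mirrors the 2.4 logic as
described here, not the actual kernel source):

#include <stdio.h>

#define MAX_NR_ZONES 3

/* stand-ins for the 1G example: zone sizes in megabytes,
 * zone 0 = DMA, zone 1 = NORMAL, zone 2 = HIGHMEM */
static unsigned long present[MAX_NR_ZONES] = { 16, 784, 224 };
static unsigned long ratio[MAX_NR_ZONES - 1] = { 256, 32 };
static const char *name[MAX_NR_ZONES] = { "DMA", "NORMAL", "HIGHMEM" };

int main(void)
{
        int alloc, lower;

        for (alloc = 1; alloc < MAX_NR_ZONES; alloc++) {
                /* ram sitting above the lower zone, accumulated as we
                 * walk down from the allocation class */
                unsigned long above = present[alloc];

                for (lower = alloc - 1; lower >= 0; lower--) {
                        printf("%s allocations leave %luM reserved in ZONE_%s\n",
                               name[alloc], above / ratio[lower], name[lower]);
                        above += present[lower];
                }
        }
        return 0;
}

With the default { 256, 32 } ratios this prints roughly 3M reserved in
ZONE_DMA and 7M reserved in ZONE_NORMAL: small enough to go unnoticed,
big enough to keep the lower zones usable.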

I invented this algorithm to scale on all machines and to make the
memory reservation unnoticeable and only beneficial. With this API it
is trivial to tune and to understand what will happen; the default
values look good and they're proven by years of production use in 2.4.
Most important: it's purely a function of the sizes of the zones; the
pages_min levels have nothing to do with it.

I'm still unsure whether the 2.6 lower_zone_protection completely
mimics the 2.4 lowmem_reserve algorithm if tuned by reversing the
pages_min settings accordingly, but I believe it's easier to drop it
and replace it with a clear, understandable API that also drops the
pages_min levels (which have no reason to exist anymore) than to leave
it in its current state and start reversing the pages_min algorithm to
tune it from userspace, in the hope that nobody ever tweaks the
pages_min calculation in the kernel and breaks a userspace that would
need kernel-internal knowledge to have a chance of tuning
lowmem_protection from an rc.d script.