Re: [Announce] BKL shifting into drivers and filesystems - beware

From: Andrea Arcangeli (andrea@suse.de)
Date: Sat Jul 15 2000 - 17:59:13 EST


On Sat, 15 Jul 2000, Rik van Riel wrote:

>> However the zone_t is anyway necessary to optimize the memory
>> balancing, thanks to it when somebody asks for a __GFP_DMA
>> allocation and he has to block freeing memory himself,
>
>This is absolutely WRONG!
>
>The zone_t argument would make sense when the majority of
>the allocations would require memory from a specific zone,

In my current 2.4.0-test4-pre6 tree there are:

andrea@inspiron:~/kernel/vm > gid GFP_KERNEL | wc -l
   1808

allocations that require memory from a specific zone (they don't accept
memory from all the zones in the system).

>but in reality we observe that almost all allocations just
>don't care from which zone the page is allocated.

In my tree there are around:

andrea@inspiron:~/kernel/vm > gid -r '__GFP_HIGHMEM|GFP_HIGHUSER' | wc -l
     14

allocations that don't care which zone the page is allocated from (of
course I understand that in some workloads page cache and anonymous
memory allocations are more frequent than allocations of inodes, dcache,
buffer cache or network skbs).

And anyway I hate things that don't scale correctly when you use them
to do something different than usual, so I don't buy your "we observe that
almost all allocations" argument.

>(most allocations seem to be for userspace data, which can
>live everywhere)

Lots of things can't live everywhere (see above).

>Not giving a zone_t argument to the page freeing routines
>gives those routines the freedom to free the oldest pages
>from the system first (up to a certain limit, of course),
>usually clearing zone->zone_wake_kswapd for one zone only.

Trying to free a page (or even worse, doing write throttling on it) is
_wasted_ time if we are not asking for memory from that zone. You can say
whatever you want, but it will still be wasted time on work that should be
done asynchronously. What I mean is that if I were in the kernel's place I
would understand that I don't need to do write throttling on a highmem page
when somebody is asking for GFP_KERNEL memory, and that I'd better spend my
time freeing memory from the NORMAL classzone (aka DMA+NORMAL zone).
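
A minimal userspace sketch of that decision, for illustration only. It is
not code from the classzone patch: the SK_ names, the flag values and the
zone ordering are assumptions made for the sketch. A classzone is treated
as "this zone plus every lower zone", so NORMAL means DMA+NORMAL.

```c
/* Illustrative flag bits for this sketch only; not taken from a kernel header. */
#define SK_GFP_DMA      0x1u
#define SK_GFP_HIGHMEM  0x2u

/* Zones ordered from most to least restrictive (assumed ordering). */
enum sk_zone { SK_ZONE_DMA, SK_ZONE_NORMAL, SK_ZONE_HIGHMEM };

/* Highest zone the allocation can use; the classzone is that zone
 * together with all the zones below it. */
static enum sk_zone sk_classzone(unsigned int gfp_mask)
{
        if (gfp_mask & SK_GFP_DMA)
                return SK_ZONE_DMA;
        if (gfp_mask & SK_GFP_HIGHMEM)
                return SK_ZONE_HIGHMEM;
        return SK_ZONE_NORMAL;          /* the GFP_KERNEL-like case */
}

/* Nonzero if freeing (or throttling on) a page from page_zone can help
 * the allocation described by gfp_mask; zero means the work is wasted
 * for this caller. */
static int sk_page_helps(enum sk_zone page_zone, unsigned int gfp_mask)
{
        return page_zone <= sk_classzone(gfp_mask);
}
```

With a mask of 0 standing in for a GFP_KERNEL-like request,
sk_page_helps(SK_ZONE_HIGHMEM, 0) is false, which is the highmem case
described above.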

Also the missing zone_t information means that you strictly need the
check below:

                /*
                 * Page is from a zone we don't care about.
                 * Don't drop page cache entries in vain.
                 */
                if (page->zone->free_pages > page->zone->pages_high)
                        goto cache_unlock_continue;

and the above breaks the LRU ordering.

Take this example:

o somebody asks for a page using GFP_HIGHMEM (that means he asks for a
        page between 0 giga and 64giga on IA32)
o the LRU has this layout:

      |-> 3giga -> 3giga+1page -> 3giga+2pages -> 3giga+3pages -> 0giga --|
      --------------------------------------------------------------------|

Then you start shrink_mmap and you drop the 3giga, 3giga+1page and
3giga+2pages pages, and incidentally freeing these three pages caused the:

       "page->zone->free_pages > page->zone->pages_high"

check to become true. So then you skip the 3giga+3pages page and you
instead free the younger 0giga page. This is _wrong_: the user asked for
memory between 0giga and 64giga, so there is not a single reason to
prefer freeing the younger 0giga page over the older 3giga+3pages page.

classzone frees the 3giga+3pages page before the 0giga in the same
scenario because it knows about the overlaps between the zones and what
the user is asking for.
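
The scenario can be replayed with a small userspace simulation. This is
only a sketch under stated assumptions: the types, the thresholds
(pages_high = 10, highmem starting with 8 free pages) and the helper names
are invented for the example, and the loop is a stand-in for shrink_mmap,
not a copy of it. It returns the LRU index of the last page freed when 4
pages are wanted by a highmem-capable allocation.

```c
#include <stddef.h>

enum zone_id { Z_DMA, Z_NORMAL, Z_HIGHMEM, NR_Z };

struct zone_sim { long free_pages, pages_high; };
struct page_sim { enum zone_id zone; };

#define LRU_LEN 5

/* The example LRU, oldest first: 3giga, 3giga+1page, 3giga+2pages,
 * 3giga+3pages (all highmem), then the younger 0giga page. */
static const struct page_sim lru[LRU_LEN] = {
        { Z_HIGHMEM }, { Z_HIGHMEM }, { Z_HIGHMEM }, { Z_HIGHMEM }, { Z_DMA },
};

/* Invented numbers: three frees push highmem above pages_high. */
static void init_zones(struct zone_sim *z)
{
        z[Z_DMA]     = (struct zone_sim){ 0, 10 };
        z[Z_NORMAL]  = (struct zone_sim){ 0, 10 };
        z[Z_HIGHMEM] = (struct zone_sim){ 8, 10 };
}

/* Walk the LRU from the oldest page and free `want` pages.
 * use_classzone == 0: skip a page when its own zone is above pages_high.
 * use_classzone != 0: skip a page only when it lies outside the classzone.
 * Returns the LRU index of the last page freed, or -1 if none was freed. */
static int scan(int use_classzone, enum zone_id classzone, int want)
{
        struct zone_sim z[NR_Z];
        int last = -1;

        init_zones(z);
        for (int i = 0; i < LRU_LEN && want > 0; i++) {
                struct zone_sim *pz = &z[lru[i].zone];

                if (use_classzone) {
                        if (lru[i].zone > classzone)
                                continue;       /* cannot satisfy the caller */
                } else if (pz->free_pages > pz->pages_high) {
                        continue;               /* the per-zone check */
                }
                pz->free_pages++;
                want--;
                last = i;
        }
        return last;
}

int last_freed_per_zone_check(void) { return scan(0, Z_HIGHMEM, 4); }
int last_freed_classzone(void)      { return scan(1, Z_HIGHMEM, 4); }
```

With the per-zone check the fourth page freed is index 4 (the 0giga page,
after skipping 3giga+3pages); with the classzone test it is index 3, so
the LRU order is respected.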

You are missing information that you could get for the mere cost of a push
on the stack (or of writing the zone_t to %ecx using FASTCALL in
shrink_mmap). And then you tell me that you can also get that information
by redesigning the whole memory management?

And whatever you do to get that information without passing it from the
allocator to the memory balancing code, you will always get the
information about the _past_ workloads and the past means nothing at the
present (so you definitely can still be wrong at least _once_ when the
different workload starts).

Andrea
