Re: [PATCH] proposed scheduler enhancements and fixes

From: Andrea Arcangeli (andrea@suse.de)
Date: Wed Mar 01 2000 - 22:30:57 EST


On Mon, 28 Feb 2000, Andrea Arcangeli wrote:

>On Sat, 26 Feb 2000, Dimitris Michailidis wrote:
>
>>This was MP (4 Xeon @500MHz). As another data point, on the same machine,
>>dbench 32 has a throughput of about 42MB/s with 2.3.40. With stock 2.3.47
>>it's just under 7MB/s and the CPUs have a much lower utilization (27% out of
>>the 4 CPUs versus 100+% with 2.3.40).
>
>How much memory for cache do you have? Please try again after:
>
> echo 95 > /proc/sys/vm/bdflush
>
>In the latest kernels I fixed a dirty memory balance issue that wasn't
>harming in real life but that was potentially harmful.
>
>Increasing the percentage of memory that can be dirty should emulate the
>old behaviour.

Woops, increasing bdflush dirty percentage limit couldn't emulate the old
behaviour completly (thinko: that would be true in 2.2.x where all dirty
memory is accounted as buffermem and not as page_cache, excluding the
dirty shared mapping pages that needs msync at close at least) and such
change I did was the real issue that was harming dbench numbers as I
suspected at first. The reason changing bdflush didn't helped made me to
think it was something else but there wasn't much more that could
influence the dbench numbers (apologies for having suspected of softnet!).

This backouts my change and it fix dbench numbers, please Linus apply to
2.3.49pre2:

--- 2.3.49pre2/mm/page_alloc.c Sun Feb 27 06:19:45 2000
+++ /tmp/page_alloc.c Thu Mar 2 03:57:21 2000
@@ -337,7 +337,7 @@
         zone_t *zone;
         int i;
 
- sum = nr_lru_pages - atomic_read(&page_cache_size);
+ sum = nr_lru_pages;
         for (i = 0; i < NUMNODES; i++)
                 for (zone = NODE_DATA(i)->node_zones; zone <= NODE_DATA(i)->node_zones+ZONE_NORMAL; zone++)
                         sum += zone->free_pages;

Now it's interesting to know the numbers between 2.3.45 + above patch and
2.3.46 + above patch, to see how much the new elevator harmed the dbench
numbers.

(side note about the new elevator: the new IO scheduler will certainly
harm the dbench number but that's a feature. The feature is that we _must_
add the seeks to avoid one dbench task to get starved indefinitely by the
other other tasks. In dbench the feature is not visible because dbench is
a global performance benchmark, dbench will show only the cons, but in
real life each dbench task is an user that is reading and writing emails
(just to make an common real life example 8) and the users (unlike dbench
childs) will definitely care about latency and proper I/O scheduling.

Certainly I'll keep doing my best to make the elevator better to improve
global performances too.)

With the above patch applyed there's a memory balance potential problem
though (so theorically the old code was safe and the above patch is
re-inserting a bug ;). But that's more theorical than pratical and we have
to live with it for now due the new page-cache design in 2.3.x.

Basically the reason I did such change is that if some part of VM are
locked in memory we are accounting them as freeable by not doing
'nr_lru_pages - page_cache_size' and we could run out of memory due dirty
buffers eating all available memory.

But the thoerically correct code is too much slow as proven by dbench. And
such dirty memory balancing issue will be _automatically_ fixed (no need
to change such function anymore) when in the page-LRU there will be queued
only the non mapped pages. So it's the correct code at least in the long
term ;).

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Mar 07 2000 - 21:00:11 EST