Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

From: Corrado Zoccolo
Date: Tue Mar 23 2010 - 17:35:30 EST


Hi Mel,
On Tue, Mar 23, 2010 at 12:50 AM, Mel Gorman <mel@xxxxxxxxx> wrote:
> On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
>> On Mon, 15 Mar 2010 13:34:50 +0100
>> Christian Ehrhardt <ehrhardt@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> > c) If direct reclaim did reasonable progress in try_to_free but did not
>> > get a page, AND there is no write in flight at all then let it try again
>> > to free up something.
>> > This could be extended by some kind of max retry to avoid some weird
>> > looping cases as well.
>> >
>> > d) Another way might be as easy as letting congestion_wait return
>> > immediately if there are no outstanding writes - this would keep the
>> > behavior for cases with write and avoid the "running always in full
>> > timeout" issue without writes.
>>
>> They're pretty much equivalent and would work.  But there are two
>> things I still don't understand:
>>
>> 1: Why is direct reclaim calling congestion_wait() at all?  If no
>> writes are going on there's lots of clean pagecache around so reclaim
>> should trivially succeed.  What's preventing it from doing so?
>>
>> 2: This is, I think, new behaviour.  A regression.  What caused it?
>>
>
> 120+ kernels and a lot of hurt later;
>
> Short summary - The number of times kswapd and the page allocator have been
>        calling congestion_wait and the length of time it spends in there
>        has been increasing since 2.6.29. Oddly, it has little to do
>        with the page allocator itself.
>
> Test scenario
> =============
> x86-64 machine, 1 socket, 4 cores
> 4 consumer-grade disks connected as RAID-0 - software raid. The on-board
>        RAID controller is a piece of crap, and a decent RAID card would
>        have blown the budget.
> Booted mem=256 to ensure it is fully IO-bound and to match closer to what
>        Christian was doing
>
> At each test, the disks are partitioned, the raid arrays created and an
> ext2 filesystem created. iozone sequential read/write tests are run with
> an increasing number of processes up to 64. Each test creates 8G of files in
> total, i.e. 1 process = 1x8G, 2 processes = 2x4G, etc.
>
>        iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
>        iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
>        etc.
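The per-process sizes follow from dividing the 8G total evenly. As a sketch (assuming, as the "etc." suggests, that the process count doubles each run up to 64), the series could be generated with:

```shell
# Generate the iozone command series: 8G total split across 1..64 processes.
# iozone's -s takes the per-process file size in KB.
total_kb=$((8 * 1024 * 1024))
for procs in 1 2 4 8 16 32 64; do
    echo "iozone -s $((total_kb / procs)) -t $procs -r 64 -i 0 -i 1"
done
```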
>
> Metrics
> =======
>
> Each kernel was instrumented to collect the following stats
>
>        pg-Stall        Times the page allocator stalled calling congestion_wait
>        pg-Wait         The amount of time spent in congestion_wait
>        pg-Rclm         Pages reclaimed by direct reclaim
>        ksd-stall       Times balance_pgdat() (i.e. kswapd) stalled on congestion_wait
>        ksd-wait        Time spent by balance_pgdat() in congestion_wait
>
> Large differences in this do not necessarily show up in iozone because the
> disks are so slow that the stalls are a tiny percentage overall. However, in
> the event that there are many disks, it might be a greater problem. I believe
> Christian is hitting a corner case where small delays trigger a much larger
> stall.
>
> Why The Increases
> =================
>
> The big problem here is that there was no one change. Instead, it has been
> a steady build-up of a number of problems. The ones I identified are in the
> block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
> but need backporting and others I expect are a major surprise. Whether they
> are worth backporting or not heavily depends on whether Christian's problem
> is resolved.
>
> Some of the "fixes" below are obviously not fixes at all. Gathering this data
> took a significant amount of time. It'd be nice if people more familiar with
> the relevant problem patches could spring a theory or patch.
>
> The Problems
> ============
>
> 1. Block layer congestion queue async/sync difficulty
>        fix title: asyncconfusion
>        fixed in mainline? yes, in 2.6.31
>        affects: 2.6.30
>
>        2.6.30 replaced congestion queues based on read/write with sync/async
>        in commit 1faa16d2. Problems were identified with this and fixed in
>        2.6.31 but not backported. Backporting 8aa7e847 and 373c0a7e brings
>        2.6.30 in line with 2.6.29 performance. It's not an issue for 2.6.31.
>
> 2. TTY using high order allocations more frequently
>        fix title: ttyfix
>        fixed in mainline? yes, in 2.6.34-rc2
>        affects: 2.6.31 to 2.6.34-rc1
>
>        2.6.31 made ptys use the same buffering logic as ttys. Unfortunately,
>        it was also allowed to make high-order GFP_ATOMIC allocations. This
>        triggers some high-order reclaim and introduces some stalls. It's
>        fixed in 2.6.34-rc2 but needs back-porting.
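A back-of-the-envelope check of why the fix avoids high-order allocations (shell arithmetic; the 4KiB page size and the use of 256 bytes as a stand-in for the tty_buffer overhead are my assumptions, not from the patch):

```shell
# Illustrative only: the capped request size stays well under one page,
# so tty buffer allocations no longer need high-order (multi-page) pages.
PAGE_SIZE=4096
CAP=$(( (PAGE_SIZE - 256) / 2 ))     # the TTY_BUFFER_PAGE cap: 1920 bytes
ALIGNED=$(( CAP & ~0xFF ))           # after 256-byte rounding down: 1792
echo "cap=$CAP aligned=$ALIGNED"
```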
>
> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>        fix title: revertevict
>        fixed in mainline? no
>        affects: 2.6.31 to now
>
>        For reasons that are not immediately obvious, the evict-once patches
>        *really* hurt the time spent on congestion and the number of pages
>        reclaimed. Rik, I'm afraid I'm punting this to you for explanation
>        because clearly you tested this for AIM7 and might have some
>        theories. For the purposes of testing, I just reverted the changes.
>
> 4. CFQ scheduler fairness commit 718eee057 causes some hurt
>        fix title: none available
>        fixed in mainline? no
>        affects: 2.6.33 to now
>
>        A bisection fingerprinted this patch as a problem introduced
>        between 2.6.32 and 2.6.33. It slightly increases the number of
>        times the page allocator stalls but drastically increases the number
>        of pages reclaimed. It's not clear why the commit is such a problem.
>
>        Unfortunately, I could not test a revert of this patch. The CFQ and
>        block IO changes made in this window were extremely convoluted and
>        overlapped heavily with a large number of patches altering the same
>        code as touched by commit 718eee057. I tried reverting everything
>        made on and after this commit but the results were unsatisfactory.
>
>        Hence, there is no fix in the results below.
>
> Results
> =======
>
> Here are the highlights of kernels tested. I'm omitting the bisection
> results for obvious reasons. The metrics were gathered at two points;
> after filesystem creation and after IOZone completed.
>
> The lower the number for each metric, the better.
>
>                                          After Filesystem Setup                          After IOZone
>                                pg-Stall  pg-Wait  pg-Rclm ksd-stall ksd-wait    pg-Stall  pg-Wait  pg-Rclm ksd-stall ksd-wait
> 2.6.29                                0        0        0        2        1           4        3      183      152        0
> 2.6.30                                1        5       34        1       25         783     3752    31939       76        0
> 2.6.30-asyncconfusion                 0        0        0        3        1          44       60     2656      893        0
> 2.6.30.10                             0        0        0        2       43         777     3699    32661       74        0
> 2.6.30.10-asyncconfusion              0        0        0        2        1          36       88     1699     1114        0
>
> asyncconfusion can be back-ported easily to 2.6.30.10. Performance is not
> perfectly in line with 2.6.29 but it's better.
>
> 2.6.31                                0        0        0        3        1       49175   245727  2730626   176344        0
> 2.6.31-revertevict                    0        0        0        3        2          31      147     1887      114        0
> 2.6.31-ttyfix                         0        0        0        2        2       46238   231000  2549462   170912        0
> 2.6.31-ttyfix-revertevict             0        0        0        3        0           7       35      448      121        0
> 2.6.31.12                             0        0        0        2        0       68897   344268  4050646   183523        0
> 2.6.31.12-revertevict                 0        0        0        3        1          18       87     1009      147        0
> 2.6.31.12-ttyfix                      0        0        0        2        0       62797   313805  3786539   173398        0
> 2.6.31.12-ttyfix-revertevict          0        0        0        3        2           7       35      448      199        0
>
> Applying the tty fixes from 2.6.34-rc2 and getting rid of the evict-once
> patches bring things back in line with 2.6.29 again.
>
> Rik, any theory on evict-once?
>
> 2.6.32                                0        0        0        3        2       44437   221753  2760857   132517        0
> 2.6.32-revertevict                    0        0        0        3        2          35       14     1570      460        0
> 2.6.32-ttyfix                         0        0        0        2        0       60770   303206  3659254   166293        0
> 2.6.32-ttyfix-revertevict             0        0        0        3        0          55       62     2496      494        0
> 2.6.32.10                             0        0        0        2        1       90769   447702  4251448   234868        0
> 2.6.32.10-revertevict                 0        0        0        3        2         148      597     8642      478        0
> 2.6.32.10-ttyfix                      0        0        0        3        0       91729   453337  4374070   238593        0
> 2.6.32.10-ttyfix-revertevict          0        0        0        3        1          65      146     3408      347        0
>
> Again, fixing tty and reverting evict-once helps bring figures more in line
> with 2.6.29.
>
> 2.6.33                                0        0        0        3        0      152248   754226  4940952   267214        0
> 2.6.33-revertevict                    0        0        0        3        0         883     4306    28918      507        0
> 2.6.33-ttyfix                         0        0        0        3        0      157831   782473  5129011   237116        0
> 2.6.33-ttyfix-revertevict             0        0        0        2        0        1056     5235    34796      519        0
> 2.6.33.1                              0        0        0        3        1      156422   776724  5078145   234938        0
> 2.6.33.1-revertevict                  0        0        0        2        0        1095     5405    36058      477        0
> 2.6.33.1-ttyfix                       0        0        0        3        1      136324   673148  4434461   236597        0
> 2.6.33.1-ttyfix-revertevict           0        0        0        1        1        1339     6624    43583      466        0
>
> At this point, the CFQ commit "cfq-iosched: fairness for sync no-idle
> queues" has lodged itself deep within CFQ and I couldn't tear it out or
> see how to fix it. Fixing tty and reverting evict-once helps but the number
> of stalls is significantly increased and a much larger number of pages get
> reclaimed overall.
>
> Corrado?

The major changes in I/O scheduling behaviour are:
* buffered writes:
  * before, we could schedule a few writes, then interrupt them to do
some reads, and then go back to writes; now we guarantee an
uninterruptible time slice for writes, but the delay between two
slices is increased. The total write throughput averaged over a time
window larger than 300ms should be comparable, or even better with
2.6.33. Note that the commit you cite introduced a bug regarding
write throughput on NCQ disks that was later fixed by 1efe8fe1, merged
before 2.6.33 (this may lead to confusing bisection results).
* reads (and sync writes):
  * before, we serviced a single process for 100ms, then switched to
another, and so on.
  * after, we go round robin for random requests (they get a unified
time slice, like buffered writes do), and we have consecutive time
slices for sequential requests, but the length of each slice is reduced
when the number of concurrent processes doing I/O increases.

This means that with 16 processes doing sequential I/O on the same
disk, you previously switched between processes every 100ms, and now
every 32ms. The old behaviour can be brought back by setting
/sys/block/sd*/queue/iosched/low_latency to 0.
For random I/O, the new behaviour (going round robin, which translates
to switching every 8ms on average) cannot be reverted via a tunable.
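For anyone wanting to script that revert, a sketch (device names will vary and the tunable only exists while CFQ is the active scheduler, so this skips anything it cannot write):

```shell
# Disable CFQ low_latency on every sd* disk that exposes the tunable.
for f in /sys/block/sd*/queue/iosched/low_latency; do
    if [ -w "$f" ]; then
        echo 0 > "$f"
        echo "low_latency disabled via $f"
    else
        echo "skipping $f (not present or not writable)"
    fi
done
```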

>
> 2.6.34-rc1                            0        0        0        1        1      150629   746901  4895328   239233        0
> 2.6.34-rc1-revertevict                0        0        0        1        0        2595    12901    84988      622        0
> 2.6.34-rc1-ttyfix                     0        0        0        1        1      159603   791056  5186082   223458        0
> 2.6.34-rc1-ttyfix-revertevict         0        0        0        0        0        1549     7641    50484      679        0
>
> Again, ttyfix and revertevict help a lot but CFQ needs to be fixed to get
> back to 2.6.29 performance.
>
> Next Steps
> ==========
>
> Jens, any problems with me backporting the async/sync fixes from 2.6.31 to
> 2.6.30.x (assuming that is still maintained, Greg?)?
>
> Rik, any suggestions on what can be done with evict-once?
>
> Corrado, any suggestions on what can be done with CFQ?

If my intuition is correct that switching between processes too often
is detrimental under memory pressure (there is a higher probability of
needing to page back in some of the pages that were just discarded), I
suggest trying setting low_latency to 0, and maybe increasing
slice_sync (to give a single process a longer slice before switching
to another), slice_async (to give more uninterruptible time to
buffered writes) and slice_async_rq (to raise the limit on how many
consecutive write requests can be sent to disk).
While this would normally lead to a bad user experience on a system
with plenty of memory, it should keep things acceptable when paging in
/ swapping / dirty page writeback is overwhelming.
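Those suggestions might look like the sketch below. The values are illustrative guesses on my part, not tested recommendations; the stock defaults are low_latency=1, slice_sync=100, slice_async=40 and slice_async_rq=2.

```shell
# Sketch: bias CFQ toward long, uninterrupted slices under memory pressure.
tune() {    # write value $1 to tunable $2, or report why it cannot
    if [ -w "$2" ]; then echo "$1" > "$2"; else echo "cannot write $2"; fi
}
for q in /sys/block/sd*/queue/iosched; do
    tune 0   "$q/low_latency"     # favour throughput over latency
    tune 200 "$q/slice_sync"      # longer slice per reader (default 100)
    tune 80  "$q/slice_async"     # longer buffered-write slice (default 40)
    tune 4   "$q/slice_async_rq"  # more consecutive writes (default 2)
done
```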

Corrado

>
> Christian, can you test the following amalgamated patch on 2.6.32.10 and
> 2.6.33 please? Note it's 2.6.32.10 because the patches below will not apply
> cleanly to 2.6.32 but it will against 2.6.33. It's a combination of ttyfix
> and revertevict. If your problem goes away, it implies that the stalls I
> can measure are roughly correlated to the more significant problem you have.
>
> ===== CUT HERE =====
>
> From d9661adfb8e53a7647360140af3b92284cbe52d4 Mon Sep 17 00:00:00 2001
> From: Alan Cox <alan@xxxxxxxxxxxxxxx>
> Date: Thu, 18 Feb 2010 16:43:47 +0000
> Subject: [PATCH] tty: Keep the default buffering to sub-page units
>
> We allocate during interrupts so while our buffering is normally diced up
> small anyway on some hardware at speed we can pressure the VM excessively
> for page pairs. We don't really need big buffers to be linear so don't try
> so hard.
>
> In order to make this work well we will tidy up excess callers to request_room,
> which cannot itself enforce this break up.
>
> Signed-off-by: Alan Cox <alan@xxxxxxxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxx>
>
> diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
> index 66fa4e1..f27c4d6 100644
> --- a/drivers/char/tty_buffer.c
> +++ b/drivers/char/tty_buffer.c
> @@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
>  {
>         int copied = 0;
>         do {
> -               int space = tty_buffer_request_room(tty, size - copied);
> +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> +               int space = tty_buffer_request_room(tty, goal);
>                 struct tty_buffer *tb = tty->buf.tail;
>                 /* If there is no space then tb may be NULL */
>                 if (unlikely(space == 0))
> @@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
>  {
>         int copied = 0;
>         do {
> -               int space = tty_buffer_request_room(tty, size - copied);
> +               int goal = min(size - copied, TTY_BUFFER_PAGE);
> +               int space = tty_buffer_request_room(tty, goal);
>                 struct tty_buffer *tb = tty->buf.tail;
>                 /* If there is no space then tb may be NULL */
>                 if (unlikely(space == 0))
> diff --git a/include/linux/tty.h b/include/linux/tty.h
> index 6abfcf5..d96e588 100644
> --- a/include/linux/tty.h
> +++ b/include/linux/tty.h
> @@ -68,6 +68,16 @@ struct tty_buffer {
>         unsigned long data[0];
>  };
>
> +/*
> + * We default to dicing tty buffer allocations to this many characters
> + * in order to avoid multiple page allocations. We assume tty_buffer itself
> + * is under 256 bytes. See tty_buffer_find for the allocation logic this
> + * must match
> + */
> +
> +#define TTY_BUFFER_PAGE                ((PAGE_SIZE - 256) / 2)
> +
> +
>  struct tty_bufhead {
>         struct delayed_work work;
>         spinlock_t lock;
> From 352fa6ad16b89f8ffd1a93b4419b1a8f2259feab Mon Sep 17 00:00:00 2001
> From: Mel Gorman <mel@xxxxxxxxx>
> Date: Tue, 2 Mar 2010 22:24:19 +0000
> Subject: [PATCH] tty: Take a 256 byte padding into account when buffering below sub-page units
>
> The TTY layer takes some care to ensure that only sub-page allocations
> are made with interrupts disabled. It does this by setting a goal of
> "TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
> size of tty_buffer into account, it fails to account that tty_buffer_find()
> rounds the buffer size out to the next 256 byte boundary before adding on
> the size of the tty_buffer.
>
> This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
> size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
> should not require high-order allocations.
>
> Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
> Cc: stable <stable@xxxxxxxxxx>
> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxx>
>
> diff --git a/include/linux/tty.h b/include/linux/tty.h
> index 568369a..593228a 100644
> --- a/include/linux/tty.h
> +++ b/include/linux/tty.h
> @@ -70,12 +70,13 @@ struct tty_buffer {
>
>  /*
>   * We default to dicing tty buffer allocations to this many characters
> - * in order to avoid multiple page allocations. We assume tty_buffer itself
> - * is under 256 bytes. See tty_buffer_find for the allocation logic this
> - * must match
> + * in order to avoid multiple page allocations. We know the size of
> + * tty_buffer itself but it must also be taken into account that the
> + * buffer is 256 byte aligned. See tty_buffer_find for the allocation
> + * logic this must match
>   */
>
> -#define TTY_BUFFER_PAGE                ((PAGE_SIZE - 256) / 2)
> +#define TTY_BUFFER_PAGE        (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
>
>
>  struct tty_bufhead {
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bf9213b..5ba0d9a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -94,7 +94,6 @@ extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
>  extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
>                                                 int priority);
>  int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg);
> -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg);
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                         struct zone *zone,
>                                         enum lru_list lru);
> @@ -243,12 +242,6 @@ mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>         return 1;
>  }
>
> -static inline int
> -mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> -{
> -       return 1;
> -}
> -
>  static inline unsigned long
>  mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, struct zone *zone,
>                          enum lru_list lru)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 66035bf..bbb0eda 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -843,17 +843,6 @@ int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>         return 0;
>  }
>
> -int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
> -{
> -       unsigned long active;
> -       unsigned long inactive;
> -
> -       inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
> -       active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
> -
> -       return (active > inactive);
> -}
> -
>  unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
>                                        struct zone *zone,
>                                        enum lru_list lru)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 692807f..5512301 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1428,59 +1428,13 @@ static int inactive_anon_is_low(struct zone *zone, struct scan_control *sc)
>         return low;
>  }
>
> -static int inactive_file_is_low_global(struct zone *zone)
> -{
> -       unsigned long active, inactive;
> -
> -       active = zone_page_state(zone, NR_ACTIVE_FILE);
> -       inactive = zone_page_state(zone, NR_INACTIVE_FILE);
> -
> -       return (active > inactive);
> -}
> -
> -/**
> - * inactive_file_is_low - check if file pages need to be deactivated
> - * @zone: zone to check
> - * @sc:   scan control of this context
> - *
> - * When the system is doing streaming IO, memory pressure here
> - * ensures that active file pages get deactivated, until more
> - * than half of the file pages are on the inactive list.
> - *
> - * Once we get to that situation, protect the system's working
> - * set from being evicted by disabling active file page aging.
> - *
> - * This uses a different ratio than the anonymous pages, because
> - * the page cache uses a use-once replacement algorithm.
> - */
> -static int inactive_file_is_low(struct zone *zone, struct scan_control *sc)
> -{
> -       int low;
> -
> -       if (scanning_global_lru(sc))
> -               low = inactive_file_is_low_global(zone);
> -       else
> -               low = mem_cgroup_inactive_file_is_low(sc->mem_cgroup);
> -       return low;
> -}
> -
> -static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
> -                               int file)
> -{
> -       if (file)
> -               return inactive_file_is_low(zone, sc);
> -       else
> -               return inactive_anon_is_low(zone, sc);
> -}
> -
>  static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>         struct zone *zone, struct scan_control *sc, int priority)
>  {
>         int file = is_file_lru(lru);
>
> -       if (is_active_lru(lru)) {
> -               if (inactive_list_is_low(zone, sc, file))
> -                       shrink_active_list(nr_to_scan, zone, sc, priority, file);
> +       if (lru == LRU_ACTIVE_FILE) {
> +               shrink_active_list(nr_to_scan, zone, sc, priority, file);
>                 return 0;
>         }
>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>