Re: [RFC][PATCH v3 0/10] memcg async reclaim

From: Ying Han
Date: Tue May 31 2011 - 13:04:14 EST


Some testing result on the patchset based on mmotm-2011-05-12-15-52.

Test: create a 4g memcg on 32g host, and change the default dirty
ratio to "10% dirty_raio. 5% dirty_background_ratio" due to the lack
for per-memcg dirty limit.

I did a simple streaming IO test ( read/write 20g file) and used the
pagefault histogram to capture the page fault latency under reclaim.
On both tests, there is no pgfault latency introduced from the reclaim
path by enabling the per-memcg async reclaim . I will post the V2 of
the patch right after.

$ cat /export/hdc3/dd_A/tf0 > /dev/zero

w/o the async reclaim:
page reclaim latency histogram (us):
< 150 542
< 200 15402
< 250 481
< 300 23
< 350 1
< 400 0
< 450 0
< rest 0
real>---4m26.604s
user>---0m0.294s
sys>----0m26.632s

with async reclaim:
page reclaim latency histogram (us):
< 150 0
< 200 0
< 250 0
< 300 0
< 350 1
< 400 1
< 450 0
< rest 1
real>---4m26.605s
user>---0m0.246s
sys>----0m24.341s

$ dd if=/dev/zero of=/export/hdc3/dd/tf0 bs=1024 count=20971520

w/o async reclaim:
page reclaim latency histogram (us):
< 150 4
< 200 17382
< 250 3238
< 300 276
< 350 18
< 400 1
< 450 1
< rest 103
real>---4m26.984s
user>---0m4.964s
sys>----1m8.718s

with async reclaim:
page reclaim latency histogram (us):
< 150 0
< 200 0
< 250 0
< 300 0
< 350 0
< 400 0
< 450 0
< rest 0
real>---4m16.355s
user>---0m4.593s
sys>----1m5.896s

Thanks

--Ying

On Fri, May 27, 2011 at 12:20 AM, Ying Han <yinghan@xxxxxxxxxx> wrote:
> On Thu, May 26, 2011 at 9:49 PM, Ying Han <yinghan@xxxxxxxxxx> wrote:
>> On Thu, May 26, 2011 at 9:34 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>>> On Thu, 26 May 2011 21:33:32 -0700
>>> Ying Han <yinghan@xxxxxxxxxx> wrote:
>>>
>>>> On Thu, May 26, 2011 at 7:16 PM, KAMEZAWA Hiroyuki
>>>> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>>>> > On Thu, 26 May 2011 18:49:26 -0700
>>>> > Ying Han <yinghan@xxxxxxxxxx> wrote:
>>>> >
>>>> >> On Wed, May 25, 2011 at 10:10 PM, KAMEZAWA Hiroyuki
>>>> >> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>>>> >> >
>>>> >> > It's now merge window...I just dump my patch queue to hear other's idea.
>>>> >> > I wonder I should wait until dirty_ratio for memcg is queued to mmotm...
>>>> >> > I'll be busy with LinuxCon Japan etc...in the next week.
>>>> >> >
>>>> >> > This patch is onto mmotm-May-11 + some patches queued in mmotm, as numa_stat.
>>>> >> >
>>>> >> > This is a patch for memcg to keep margin to the limit in background.
>>>> >> > By keeping some margin to the limit in background, application can
>>>> >> > avoid foreground memory reclaim at charge() and this will help latency.
>>>> >> >
>>>> >> > Main changes from v2 is.
>>>> >> >  - use SCHED_IDLE.
>>>> >> >  - removed most of heuristic codes. Now, code is very simple.
>>>> >> >
>>>> >> > By using SCHED_IDLE, async memory reclaim can only consume 0.3%? of cpu
>>>> >> > if the system is truely busy but can use much CPU if the cpu is idle.
>>>> >> > Because my purpose is for reducing latency without affecting other running
>>>> >> > applications, SCHED_IDLE fits this work.
>>>> >> >
>>>> >> > If application need to stop by some I/O or event, background memory reclaim
>>>> >> > will cull memory while the system is idle.
>>>> >> >
>>>> >> > Perforemce:
>>>> >> >  Running an httpd (apache) under 300M limit. And access 600MB working set
>>>> >> >  with normalized distribution access by apatch-bench.
>>>> >> >  apatch bench's concurrency was 4 and did 40960 accesses.
>>>> >> >
>>>> >> > Without async reclaim:
>>>> >> > Connection Times (ms)
>>>> >> >              min  mean[+/-sd] median   max
>>>> >> > Connect:        0    0   0.0      0       2
>>>> >> > Processing:    30   37  28.3     32    1793
>>>> >> > Waiting:       28   35  25.5     31    1792
>>>> >> > Total:         30   37  28.4     32    1793
>>>> >> >
>>>> >> > Percentage of the requests served within a certain time (ms)
>>>> >> >  50%     32
>>>> >> >  66%     32
>>>> >> >  75%     33
>>>> >> >  80%     34
>>>> >> >  90%     39
>>>> >> >  95%     60
>>>> >> >  98%    100
>>>> >> >  99%    133
>>>> >> >  100%   1793 (longest request)
>>>> >> >
>>>> >> > With async reclaim:
>>>> >> > Connection Times (ms)
>>>> >> >              min  mean[+/-sd] median   max
>>>> >> > Connect:        0    0   0.0      0       2
>>>> >> > Processing:    30   35  12.3     32     678
>>>> >> > Waiting:       28   34  12.0     31     658
>>>> >> > Total:         30   35  12.3     32     678
>>>> >> >
>>>> >> > Percentage of the requests served within a certain time (ms)
>>>> >> >  50%     32
>>>> >> >  66%     32
>>>> >> >  75%     33
>>>> >> >  80%     34
>>>> >> >  90%     39
>>>> >> >  95%     49
>>>> >> >  98%     71
>>>> >> >  99%     86
>>>> >> >  100%    678 (longest request)
>>>> >> >
>>>> >> >
>>>> >> > It seems latency is stabilized by hiding memory reclaim.
>>>> >> >
>>>> >> > The score for memory reclaim was following.
>>>> >> > See patch 10 for meaning of each member.
>>>> >> >
>>>> >> > == without async reclaim ==
>>>> >> > recent_scan_success_ratio 44
>>>> >> > limit_scan_pages 388463
>>>> >> > limit_freed_pages 162238
>>>> >> > limit_elapsed_ns 13852159231
>>>> >> > soft_scan_pages 0
>>>> >> > soft_freed_pages 0
>>>> >> > soft_elapsed_ns 0
>>>> >> > margin_scan_pages 0
>>>> >> > margin_freed_pages 0
>>>> >> > margin_elapsed_ns 0
>>>> >> >
>>>> >> > == with async reclaim ==
>>>> >> > recent_scan_success_ratio 6
>>>> >> > limit_scan_pages 0
>>>> >> > limit_freed_pages 0
>>>> >> > limit_elapsed_ns 0
>>>> >> > soft_scan_pages 0
>>>> >> > soft_freed_pages 0
>>>> >> > soft_elapsed_ns 0
>>>> >> > margin_scan_pages 1295556
>>>> >> > margin_freed_pages 122450
>>>> >> > margin_elapsed_ns 644881521
>>>> >> >
>>>> >> >
>>>> >> > For this case, SCHED_IDLE workqueue can reclaim enough memory to the httpd.
>>>> >> >
>>>> >> > I may need to dig why scan_success_ratio is far different in the both case.
>>>> >> > I guess the difference of epalsed_ns is because several threads enter
>>>> >> > memory reclaim when async reclaim doesn't run. But may not...
>>>> >> >
>>>> >>
>>>> >>
>>>> >> Hmm.. I noticed a very strange behavior on a simple test w/ the patch set.
>>>> >>
>>>> >> Test:
>>>> >> I created a 4g memcg and start doing cat. Then the memcg being OOM
>>>> >> killed as soon as it reaches its hard_limit. We shouldn't hit OOM even
>>>> >> w/o async-reclaim.
>>>> >>
>>>> >> Again, I will read through the patch. But like to post the test result first.
>>>> >>
>>>> >> $ echo $$ >/dev/cgroup/memory/A/tasks
>>>> >> $ cat /dev/cgroup/memory/A/memory.limit_in_bytes
>>>> >> 4294967296
>>>> >>
>>>> >> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>>>> >> Killed
>>>> >>
>>>> >
>>>> > I did the same kind of test without any problem...but ok, I'll do more test
>>>> > later.
>>>> >
>>>> >
>>>> >
>>>> >> real  0m53.565s
>>>> >> user  0m0.061s
>>>> >> sys   0m4.814s
>>>> >>
>>>> >> Here is the OOM log:
>>>> >>
>>>> >> May 26 18:43:00  kernel: [  963.489112] cat invoked oom-killer:
>>>> >> gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
>>>> >> May 26 18:43:00  kernel: [  963.489121] Pid: 9425, comm: cat Tainted:
>>>> >> G        W   2.6.39-mcg-DEV #131
>>>> >> May 26 18:43:00  kernel: [  963.489123] Call Trace:
>>>> >> May 26 18:43:00  kernel: [  963.489134]  [<ffffffff810e3512>]
>>>> >> dump_header+0x82/0x1af
>>>> >> May 26 18:43:00  kernel: [  963.489137]  [<ffffffff810e33ca>] ?
>>>> >> spin_lock+0xe/0x10
>>>> >> May 26 18:43:00  kernel: [  963.489140]  [<ffffffff810e33f9>] ?
>>>> >> find_lock_task_mm+0x2d/0x67
>>>> >> May 26 18:43:00  kernel: [  963.489143]  [<ffffffff810e38dd>]
>>>> >> oom_kill_process+0x50/0x27b
>>>> >> May 26 18:43:00  kernel: [  963.489155]  [<ffffffff810e3dc6>]
>>>> >> mem_cgroup_out_of_memory+0x9a/0xe4
>>>> >> May 26 18:43:00  kernel: [  963.489160]  [<ffffffff811153aa>]
>>>> >> mem_cgroup_handle_oom+0x134/0x1fe
>>>> >> May 26 18:43:00  kernel: [  963.489163]  [<ffffffff81114a72>] ?
>>>> >> __mem_cgroup_insert_exceeded+0x83/0x83
>>>> >> May 26 18:43:00  kernel: [  963.489176]  [<ffffffff811166e9>]
>>>> >> __mem_cgroup_try_charge.clone.3+0x368/0x43a
>>>> >> May 26 18:43:00  kernel: [  963.489179]  [<ffffffff81117586>]
>>>> >> mem_cgroup_cache_charge+0x95/0x123
>>>> >> May 26 18:43:00  kernel: [  963.489183]  [<ffffffff810e16d8>]
>>>> >> add_to_page_cache_locked+0x42/0x114
>>>> >> May 26 18:43:00  kernel: [  963.489185]  [<ffffffff810e17db>]
>>>> >> add_to_page_cache_lru+0x31/0x5f
>>>> >> May 26 18:43:00  kernel: [  963.489189]  [<ffffffff81145636>]
>>>> >> mpage_readpages+0xb6/0x132
>>>> >> May 26 18:43:00  kernel: [  963.489194]  [<ffffffff8119992f>] ?
>>>> >> noalloc_get_block_write+0x24/0x24
>>>> >> May 26 18:43:00  kernel: [  963.489197]  [<ffffffff8119992f>] ?
>>>> >> noalloc_get_block_write+0x24/0x24
>>>> >> May 26 18:43:00  kernel: [  963.489201]  [<ffffffff81036742>] ?
>>>> >> __switch_to+0x160/0x212
>>>> >> May 26 18:43:00  kernel: [  963.489205]  [<ffffffff811978b2>]
>>>> >> ext4_readpages+0x1d/0x1f
>>>> >> May 26 18:43:00  kernel: [  963.489209]  [<ffffffff810e8d4b>]
>>>> >> __do_page_cache_readahead+0x144/0x1e3
>>>> >> May 26 18:43:00  kernel: [  963.489212]  [<ffffffff810e8e0b>]
>>>> >> ra_submit+0x21/0x25
>>>> >> May 26 18:43:00  kernel: [  963.489215]  [<ffffffff810e9075>]
>>>> >> ondemand_readahead+0x18c/0x19f
>>>> >> May 26 18:43:00  kernel: [  963.489218]  [<ffffffff810e9105>]
>>>> >> page_cache_async_readahead+0x7d/0x86
>>>> >> May 26 18:43:00  kernel: [  963.489221]  [<ffffffff810e2b7e>]
>>>> >> generic_file_aio_read+0x2d8/0x5fe
>>>> >> May 26 18:43:00  kernel: [  963.489225]  [<ffffffff81119626>]
>>>> >> do_sync_read+0xcb/0x108
>>>> >> May 26 18:43:00  kernel: [  963.489230]  [<ffffffff811f168a>] ?
>>>> >> fsnotify_perm+0x66/0x72
>>>> >> May 26 18:43:00  kernel: [  963.489233]  [<ffffffff811f16f7>] ?
>>>> >> security_file_permission+0x2e/0x33
>>>> >> May 26 18:43:00  kernel: [  963.489236]  [<ffffffff8111a0c8>]
>>>> >> vfs_read+0xab/0x107
>>>> >> May 26 18:43:00  kernel: [  963.489239]  [<ffffffff8111a1e4>] sys_read+0x4a/0x6e
>>>> >> May 26 18:43:00  kernel: [  963.489244]  [<ffffffff8140f469>]
>>>> >> sysenter_dispatch+0x7/0x27
>>>> >> May 26 18:43:00  kernel: [  963.489248] Task in /A killed as a result
>>>> >> of limit of /A
>>>> >> May 26 18:43:00  kernel: [  963.489251] memory: usage 4194304kB, limit
>>>> >> 4194304kB, failcnt 26
>>>> >> May 26 18:43:00  kernel: [  963.489253] memory+swap: usage 0kB, limit
>>>> >> 9007199254740991kB, failcnt 0
>>>> >>
>>>> >
>>>> > Hmm, why memory+swap usage 0kb here...
>>>> >
>>>> > In this set, I used mem_cgroup_margin() rather than res_counter_margin().
>>>> > Hmm, do you disable swap accounting ? If so, I may miss some.
>>>>
>>>> Yes, I disabled the swap accounting in .config:
>>>> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>>>>
>>>>
>>>> Here is how i reproduce it:
>>>>
>>>> $ mkdir /dev/cgroup/memory/D
>>>> $ echo 4g >/dev/cgroup/memory/D/memory.limit_in_bytes
>>>>
>>>> $ cat /dev/cgroup/memory/D/memory.limit_in_bytes
>>>> 4294967296
>>>>
>>>> $ cat /dev/cgroup/memory/D/memory.
>>>> memory.async_control             memory.max_usage_in_bytes
>>>> memory.soft_limit_in_bytes       memory.use_hierarchy
>>>> memory.failcnt                   memory.move_charge_at_immigrate
>>>> memory.stat
>>>> memory.force_empty               memory.oom_control
>>>> memory.swappiness
>>>> memory.limit_in_bytes            memory.reclaim_stat
>>>> memory.usage_in_bytes
>>>>
>>>> $ cat /dev/cgroup/memory/D/memory.async_control
>>>> 0
>>>> $ echo 1 >/dev/cgroup/memory/D/memory.async_control
>>>> $ cat /dev/cgroup/memory/D/memory.async_control
>>>> 1
>>>>
>>>> $ echo $$ >/dev/cgroup/memory/D/tasks
>>>> $ cat /proc/4358/cgroup
>>>> 3:memory:/D
>>>>
>>>> $ time cat /export/hdc3/dd_A/tf0 > /dev/zero
>>>> Killed
>>>>
>>>
>>> If you applied my patches collectly, async_control can be seen if
>>> swap controller is configured because of BUG in patch.
>>
>> I noticed the BUG at the very beginning, so all my tests are having the fix.
>>
>>>
>>> I could cat 20G file under 4G limit without any problem with boot option
>>> swapaccount=0. no problem if async_control == 0 ?
>>
>> $ cat /dev/cgroup/memory/D/memory.async_control
>> 1
>>
>> I have the .config
>> # CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not set
>>
>> Not sure if that makes difference. I will test next to turn that on.
>
> I know what's the problem and also verified. Our configuration might
> differs on the "#if MAX_NUMNODES > 1"
>
> Please apply the following patch:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6a52699..0b88d71 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1217,7 +1217,7 @@ unsigned long
> mem_cgroup_zone_reclaimable_pages(struct mem_cgroup *memcg,
>        struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
>
>        nr = MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE) +
> -               MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_FILE);
> +               MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_FILE);
>        if (nr_swap_pages > 0)
>                nr += MEM_CGROUP_ZSTAT(mz, NR_ACTIVE_ANON) +
>                        MEM_CGROUP_ZSTAT(mz, NR_INACTIVE_ANON);
>
> --Ying
>
>>
>> --Ying
>>
>>
>>>
>>>
>>>
>>> Thanks,
>>> -Kame
>>>
>>>
>>>
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/