Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaimand use a_ops->writepages() where possible

From: KAMEZAWA Hiroyuki
Date: Wed Jun 09 2010 - 20:43:25 EST

Next message: Dave Chinner: "Re: xfs, 2.6.27=>.32 sync write 10 times slowdown [was: xfs,aacraid 2.6.27 => 2.6.32 results in 6 times slowdown]"
Previous message: David Miller: "Re: [PATCH 2/2 -next] niu: always include of_device.h"
In reply to: Mel Gorman: "Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaimand use a_ops->writepages() where possible"
Next in thread: Mel Gorman: "Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaimand use a_ops->writepages() where possible"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 9 Jun 2010 10:52:00 +0100
Mel Gorman <mel@xxxxxxxxx> wrote:

> On Wed, Jun 09, 2010 at 11:52:11AM +0900, KAMEZAWA Hiroyuki wrote:

> > > <SNIP>
> >
> > My concern is how memcg should work. IOW, what changes will be necessary for
> > memcg to work with the new vmscan logic as no-direct-writeback.
> >
>
> At worst, memcg waits on background flushers to clean their pages but
> obviously this could lead to stalls in containers if it happened to be full
> of dirty pages.
>
yes.

> Do you have test scenarios already setup for functional and performance
> regression testing of containers? If so, can you run tests with this series
> and see what sort of impact you find? I haven't done performance testing
> with containers to date so I don't know what the expected values are.
>
Maybe kernbench is enough. I think it does enough write and malloc.
'Limit' size for test depends on your host. I sometimes does this on
8cpu SMP box.

# mount -t cgroup none /cgroups -o memory
# mkdir /cgroups/A
# echo $$ > /cgroups/A
# echo 300M > /cgroups/memory.limit_in_bytes
# make -j 8 or make -j 16

Comparing size of swap and speed will be interesting.
(Above 300M is enough small because my test machine has 24G memory.)

Or
# mount -t cgroup none /cgroups -o memory
# mkdir /cgroups/A
# echo $$ > /cgroups/A
# echo 50M > /cgroups/memory.limit_in_bytes
# dd if=/dev/zero of=./tmpfile bs=65536 count=100000

or some. When I tested the original patch for "avoiding writeback" by
Dave Chinner, I saw 2 ooms in 10 tests.
If not patched, I never see OOM.

> > Maybe an ideal solution will be
> > - support buffered I/O tracking in I/O cgroup.
> > - flusher threads should work with I/O cgroup.
> > - memcg itself should support dirty ratio. and add a trigger to kick flusher
> > threads for dirty pages in a memcg.
> > But I know it's a long way.
> >
>
> I'm not very familiar with memcg I'm afraid or its requirements so I am
> having trouble guessing which of these would behave the best. You could take
> a gamble on having memcg doing writeback in direct reclaim but you may run
> into the same problem of overflowing stacks.
>
maybe.

> I'm not sure how a flusher thread would work just within a cgroup. It
> would have to do a lot of searching to find the pages it needs
> considering that it's looking at inodes rather than pages.
>
yes. So, I(we) need some way for coloring inode for selectable writeback.
But people in this area are very nervous about performance (me too ;), I've
not found the answer yet.

> One possibility I guess would be to create a flusher-like thread if a direct
> reclaimer finds that the dirty pages in the container are above the dirty
> ratio. It would scan and clean all dirty pages in the container LRU on behalf
> of dirty reclaimers.
>
Yes, that's possible. But Andrew recommends not to do add more threads. So,
I'll use workqueue if necessary.

> Another possibility would be to have kswapd work in containers.
> Specifically, if wakeup_kswapd() is called with a cgroup that it's added
> to a list. kswapd gives priority to global reclaim but would
> occasionally check if there is a container that needs kswapd on a
> pending list and if so, work within the container. Is there a good
> reason why kswapd does not work within container groups?
>
One reason is node v.s. memcg.
Because memcg doesn't limit memory placement, a container can contain pages
from the all nodes. So,it's a bit problem which node's kswapd we should run .
(but yes, maybe small problem.)
Another is memory-reclaim-prioirty between memcg.
(I don't want to add such a knob...)

Maybe it's time to consider about that.
Now, we're using kswapd for softlimit. I think similar hints for kswapd
should work. yes.

> Finally, you could just allow reclaim within a memcg do writeback. Right
> now, the check is based on current_is_kswapd() but I could create a helper
> function that also checked for sc->mem_cgroup. Direct reclaim from the
> page allocator never appears to work within a container group (which
> raises questions in itself such as why a process in a container would
> reclaim pages outside the container?) so it would remain safe.
>
isolate_lru_pages() for memcg finds only pages in a memcg ;)

> > How the new logic works with memcg ? Because memcg doesn't trigger kswapd,
> > memcg has to wait for a flusher thread make pages clean ?
>
> Right now, memcg has to wait for a flusher thread to make pages clean.
>
ok.

> > Or memcg should have kswapd-for-memcg ?
> >
> > Is it okay to call writeback directly when !scanning_global_lru() ?
> > memcg's reclaim routine is only called from specific positions, so, I guess
> > no stack problem.
>
> It's a judgement call from you really. I see that direct reclaimers do
> not set mem_cgroup so it's down to - are you reasonably sure that all
> the paths that reclaim based on a container are not deep?

One concerns is add_to_page_cache(). If it's called in deep stack, my assumption
is wrong.

> I looked
> around for a while and the bulk appeared to be in the fault path so I
> would guess "yes" but as I'm not familiar with the memcg implementation
> I'll have missed a lot.
>
> > But we just have I/O pattern problem.
>
> True.
>

Okay, I'll consider about how to kick kswapd via memcg or flusher-for-memcg.
Please go ahead as you want. I love good I/O pattern, too.

Thanks,
-Kame

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Dave Chinner: "Re: xfs, 2.6.27=>.32 sync write 10 times slowdown [was: xfs,aacraid 2.6.27 => 2.6.32 results in 6 times slowdown]"
Previous message: David Miller: "Re: [PATCH 2/2 -next] niu: always include of_device.h"
In reply to: Mel Gorman: "Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaimand use a_ops->writepages() where possible"
Next in thread: Mel Gorman: "Re: [RFC PATCH 0/6] Do not call ->writepage[s] from direct reclaimand use a_ops->writepages() where possible"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]