Re: [RFC][PATCH 5/6] memcg: mem+swap controller

From: Daisuke Nishimura
Date: Fri Nov 07 2008 - 04:08:56 EST


On Wed, 5 Nov 2008 17:23:16 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> Mem+Swap controller core.
>
> This patch implements per cgroup limit for usage of memory+swap.
> However there are SwapCache, double counting of swap-cache and
> swap-entry is avoided.
>
> Mem+Swap controller works as following.
> - memory usage is limited by memory.limit_in_bytes.
> - memory + swap usage is limited by memory.memsw_limit_in_bytes.
>
>
> This has following benefits.
> - A user can limit total resource usage of mem+swap.
>
> Without this, because memory resource controller doesn't take care of
> usage of swap, a process can exhaust all the swap (by memory leak.)
> We can avoid this case.
>
> And Swap is shared resource but it cannot be reclaimed (goes back to memory)
> until it's used. This characteristic can be trouble when the memory
> is divided into some parts by cpuset or memcg.
> Assume group A and group B.
> After some application executes, the system can be..
>
> Group A -- very large free memory space but occupy 99% of swap.
> Group B -- under memory shortage but cannot use swap...it's nearly full.
>
> Ability to set appropriate swap limit for each group is required.
>
> Maybe someone wonder "why not swap but mem+swap ?"
>
> - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
> to move account from memory to swap...there is no change in usage of
> mem+swap.
>
> In other words, when we want to limit the usage of swap without affecting
> global LRU, mem+swap limit is better than just limiting swap.
>
>
> Accounting target information is stored in swap_cgroup which is
> per swap entry record.
>
> Charge is done as following.
> map
> - charge page and memsw.
>
> unmap
> - uncharge page/memsw if not SwapCache.
>
> swap-out (__delete_from_swap_cache)
> - uncharge page
> - record mem_cgroup information to swap_cgroup.
>
> swap-in (do_swap_page)
> - charged as page and memsw.
> record in swap_cgroup is cleared.
> memsw accounting is decremented.
>
> swap-free (swap_free())
> - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
>
>
> After this, usual memory resource controller handles SwapCache.
> (It was lacked(ignored) feature in current memcg but must be handled.)
>
SwapCache has been handled in [2/6] already :)

(snip)
> @@ -514,12 +534,25 @@ static int __mem_cgroup_try_charge(struc
> css_get(&mem->css);
> }
>
> + while (1) {
> + int ret;
> + bool noswap = false;
>
> - while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
> + ret = res_counter_charge(&mem->res, PAGE_SIZE);
> + if (likely(!ret)) {
> + if (!do_swap_account)
> + break;
> + ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
> + if (likely(!ret))
> + break;
> + /* mem+swap counter fails */
> + res_counter_uncharge(&mem->res, PAGE_SIZE);
> + noswap = true;
> + }
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> - if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
> + if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
> continue;
>
> /*
I have two comment about try_charge.

1. It would be better if possible to avoid charging memsw at swapin (and uncharging
it again at mem_cgroup_cache_charge_swapin/mem_cgroup_commit_charge_swapin).
How about adding a new argument "charge_memsw" ? (it has many args already now...)
2. Should we use swap when exceeding mem.limit but mem.limit == memsw.limit ?

(snip)
> void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> @@ -838,6 +947,7 @@ void mem_cgroup_cancel_charge_swapin(str
> if (!mem)
> return;
> res_counter_uncharge(&mem->res, PAGE_SIZE);
> + res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> css_put(&mem->css);
> }
>
"if (do_swap_account)" is needed before uncharging memsw.

(snip)
> static struct cftype mem_cgroup_files[] = {
> {
> .name = "usage_in_bytes",
> - .private = RES_USAGE,
> + .private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "max_usage_in_bytes",
> - .private = RES_MAX_USAGE,
> + .private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> .trigger = mem_cgroup_reset,
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "limit_in_bytes",
> - .private = RES_LIMIT,
> + .private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> .write_string = mem_cgroup_write,
> .read_u64 = mem_cgroup_read,
> },
> {
> .name = "failcnt",
> - .private = RES_FAILCNT,
> + .private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> .trigger = mem_cgroup_reset,
> .read_u64 = mem_cgroup_read,
> },
> @@ -1317,6 +1541,31 @@ static struct cftype mem_cgroup_files[]
> .name = "stat",
> .read_map = mem_control_stat_show,
> },
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> + {
> + .name = "memsw.usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.max_usage_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> + .trigger = mem_cgroup_reset,
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.limit_in_bytes",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> + .write_string = mem_cgroup_write,
> + .read_u64 = mem_cgroup_read,
> + },
> + {
> + .name = "memsw.failcnt",
> + .private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> + .trigger = mem_cgroup_reset,
> + .read_u64 = mem_cgroup_read,
> + },
> +#endif
> };
>
IMHO, it would be better to define those "memsw.*" files as memsw_cgroup_files[],
and change mem_cgroup_populate() like:

static int mem_cgroup_populate(struct cgroup_subsys *ss,
struct cgroup *cont)
{
int ret;

ret = cgroup_add_files(cont, ss, mem_cgroup_files,
ARRAY_SIZE(mem_cgroup_files));
if (!ret && do_swap_account)
ret = cgroup_add_files(cont, ss, memsw_cgroup_files,
ARRAY_SIZE(memsw_cgroup_files));

return ret;
}

so that those files appear only when swap accounting is enabled.


Thanks,
Daisuke Nishimura.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/