Re: [RFC] Shared page accounting for memory cgroup

From: Balbir Singh
Date: Mon Jan 18 2010 - 03:26:55 EST

On Monday 18 January 2010 06:19 AM, Daisuke Nishimura wrote:
> On Mon, 18 Jan 2010 01:00:44 +0530, Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
>> On Fri, Jan 8, 2010 at 5:17 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>>> On Thu, 7 Jan 2010 14:57:36 +0530
>>> Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
>>>> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> [2010-01-07 18:08:00]:
>>>>> On Thu, 7 Jan 2010 17:48:14 +0900
>>>>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>>>>>>>> "How pages are shared" doesn't show good hints. I don't hear such parameter
>>>>>>>> is used in production's resource monitoring software.
>>>>>>> You mean "How many pages are shared" are not good hints, please see my
>>>>>>> justification above. With Virtualization (look at KSM for example),
>>>>>>> shared pages are going to be increasingly important part of the
>>>>>>> accounting.
>>>>>> Considering KSM, your counting style is too bad.
>>>>>> You should add
>>>> No.. I am just talking about shared memory being important and shared
>>>> accounting being useful, no counters for KSM in particular (in the
>>>> memcg context).
>>> Think so ? The number of memcg-private pages is in interest in my point of view.
>>> Anyway, I don't change my opinion as "sum of rss" is not necessary to be calculated
>>> in the kernel.
>>> If you want to provide that in memcg, please add it to global VM as /proc/meminfo.
>>> IIUC, KSM/SHMEM has some official method in global VM.
>> Kamezawa-San,
>> I implemented the same in user space and I get really bad results, here is why
>> 1. I need to hold and walk the tasks list in cgroups and extract RSS
>> through /proc (results in worse hold times for the fork() scenario you
>> mentioned)
>> 2. The data is highly inconsistent due to the higher margin of error
>> in accumulating data which is changing as we run. By the time we total
>> and look at the memcg data, the data is stale
>> Would you be OK with the patch, if I renamed "shared_usage_in_bytes"
>> to "non_private_usage_in_bytes"?
> I think the name is still ambiguous.
> For example, if process A belongs to /cgroup/memory/01 and process B to /cgroup/memory/02,
> both processes have 10MB anonymous pages and 10MB file caches of the same pages, and all of the
> file caches are charged to 01.
> In this case, the value in 01 is 0MB (= 20MB - 20MB) and the value in 02 is 10MB (= 20MB - 10MB), right?

Correct, file cache is almost always considered shared, so it has

1. non-private or shared usage of 10MB
2. 10 MB of file cache
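
To make the arithmetic in the example concrete, here is a small sketch.
It assumes the parenthesised figures above mean "sum of per-process RSS
minus the bytes charged to the cgroup"; that reading, and the helper
name non_private_usage(), are illustrative assumptions and not taken
from the patch itself.

```python
MB = 1 << 20

def non_private_usage(sum_of_rss, charged):
    """Hypothetical helper: sum of per-process RSS minus bytes charged to the cgroup."""
    return sum_of_rss - charged

# cgroup 01: process A maps 10MB anon + 10MB file cache, and both are charged to 01.
cg01 = non_private_usage(20 * MB, 20 * MB)

# cgroup 02: process B maps the same amounts, but only its 10MB anon is charged to 02.
cg02 = non_private_usage(20 * MB, 10 * MB)

print(cg01 // MB, cg02 // MB)
```

Under that assumption the two cgroups report different values even
though both processes map exactly the same file pages, which is the
naming ambiguity being discussed.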

> I don't think "non private usage" is appropriate to this value.
> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
> to understand for users.

Here is my concern

1. The time gap between reading the memcg stat and summing all RSS is
much larger in user space
2. Summing up all RSS without walking the tasks atomically can and
will lead to consistency issues. Stale data is acceptable as long as it
represents a consistent snapshot

We need to differentiate between

1. Data snapshot (taken at one point in time, and valid at that point)
2. Data taken from different sources that does not form a uniform
snapshot, because each of the collected data items has a different
timestamp
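
For illustration, a rough sketch of what the user-space approach looks
like. It is not the tool I used; sum_rss_bytes() and its parameters are
made up for this example. It reads the resident page count (second
field) from /proc/<pid>/statm for each pid it is given, and records when
each sample was taken.

```python
import os
import time

def sum_rss_bytes(pids, proc_root="/proc", page_size=None):
    """Sum resident set sizes by reading <proc_root>/<pid>/statm one task at a time.

    Returns (total_bytes, samples); each sample is (pid, rss_bytes, timestamp).
    """
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    total = 0
    samples = []
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "statm")) as f:
                # statm fields are in pages; the second field is the resident set.
                resident_pages = int(f.read().split()[1])
        except (FileNotFoundError, ProcessLookupError):
            continue  # the task exited while we were walking the list
        rss = resident_pages * page_size
        samples.append((pid, rss, time.monotonic()))
        total += rss
    return total, samples
```

Every sample carries its own timestamp, and tasks can fork, exit or
fault in pages between two reads, so the total belongs to case 2 above
and not to case 1.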

> But, hmm, I don't see any strong reason to do this in kernel, then :(

Please see my reason above for doing it in the kernel.
