Re: [RFC] Shared page accounting for memory cgroup

From: Balbir Singh
Date: Mon Jan 18 2010 - 23:02:09 EST

On Tuesday 19 January 2010 08:04 AM, Daisuke Nishimura wrote:
> On Tue, 19 Jan 2010 07:19:42 +0530, Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx> wrote:
>> On Tue, Jan 19, 2010 at 6:52 AM, Daisuke Nishimura
>> <nishimura@xxxxxxxxxxxxxxxxx> wrote:
>> [snip]
>>>> Correct, file cache is almost always considered shared, so it has
>>>> 1. non-private or shared usage of 10MB
>>>> 2. 10 MB of file cache
>>>>> I don't think "non private usage" is appropriate to this value.
>>>>> Why don't you just show "sum_of_each_process_rss" ? I think it would be easier
>>>>> to understand for users.
>>>> Here is my concern
>>>> 1. The gap between looking at memcg stat and sum of all RSS is way
>>>> higher in user space
>>>> 2. Summing up all rss without walking the tasks atomically can and
>>>> will lead to consistency issues. Data can be stale as long as it
>>>> represents a consistent snapshot of data
>>>> We need to differentiate between
>>>> 1. Data snapshot (taken at a time, but valid at that point)
>>>> 2. Data taken from different sources that does not form a uniform
>>>> snapshot, because the timestamping of the each of the collected data
>>>> items is different
>>> Hmm, I'm sorry I can't understand why you need "difference".
>>> IOW, what can users or middleware learn from the value in the above case
>>> (0MB in 01 and 10MB in 02)? I've read this thread, but I can't understand
>>> this point... Why can this value mean some of the groups are "heavy" ?
>> Consider a default cgroup that is not root and assume all applications
>> move there initially. Now with a lot of shared memory,
>> the default cgroup will be the first one to page in a lot of the
>> memory and its usage will be very high. Without the concept of
>> showing how much is non-private, how does one decide if the default
>> cgroup is using a lot of memory or sharing it? How
>> do we decide on limits of a cgroup without knowing its actual usage -
>> PSS equivalent for a region of memory for a task.
> As for limit, I think we should decide it based on the actual usage because
> we account and limit the actual usage. Why should we take account of the sum of rss ?

I am talking about non-private, or potentially shared, pages. The value
is derived as follows:

sum_of_all_rss - (rss + file_mapped) (from .stat file)

File cache is always considered to be shared.
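
To make the derivation above concrete, here is a rough user-space sketch
of the arithmetic. It is only an illustration, not the proposed kernel
code. The helper names are hypothetical, the "mapped_file" key is my
assumption for how file_mapped is spelled in the .stat file, and the
sample numbers are made up. The caller supplies sum_of_all_rss, and a
sum gathered by walking tasks one at a time is not an atomic snapshot.

```python
MB = 1024 * 1024


def parse_memcg_stat(text):
    """Parse 'key value' lines from a memcg .stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a simple "name number" pair.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def non_private_bytes(sum_of_all_rss, rss, file_mapped):
    """Return sum_of_all_rss - (rss + file_mapped), all values in bytes."""
    return sum_of_all_rss - (rss + file_mapped)


# Made-up sample: 10MB of rss and 10MB of mapped file pages in the group.
sample_stat = "cache 10485760\nrss 10485760\nmapped_file 10485760\n"
stats = parse_memcg_stat(sample_stat)

# Assume the tasks in the group report 30MB of RSS in total.
shared = non_private_bytes(30 * MB, stats["rss"], stats["mapped_file"])
print(shared // MB, "MB potentially shared")
```

With these inputs, 10MB of what the tasks report as RSS is not covered
by the group's own rss and file_mapped charges, so it counts as
potentially shared.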

> I agree that we'd better not ignore the sum of rss completely, but could you show me
> how the value 0MB/10MB can be used to calculate the limit in 01/02 in detail ?

In your example, usage shows that the real usage of the cgroup is 20 MB
for 01 and 10 MB for 02. Today we show that we are using 40MB instead of
30MB (when summed). If an administrator has to decide where to, say,
add more resources, the one with 20MB would be the right place w.r.t.

> I wouldn't argue against you if I could understand the value would be useful,
> but I can't understand how the value can be used, so I'm asking :)

I understand. I am not closed to suggestions from you and Kamezawa-San;
I am just trying to find a way to get useful information about shared
memory usage back to user space. Remember that walking the LRU or even
the VMAs to find shared pages is expensive. We could do it lazily at
rmap time. That works well for charging, but not so well for uncharging,
since we would need to keep track of the mms so that the mm that charged
can be properly marked as private or shared in the correct memcg. It
will require more invasive work.
