Re: [PATCH v2 0/3] Implement /proc/<pid>/totmaps

From: Sonny Rao
Date: Mon Aug 22 2016 - 18:45:21 EST


On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> On Fri 19-08-16 10:57:48, Sonny Rao wrote:
>> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> > On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> >> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> >> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> >> > [...]
>> >> >> >> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>> >> >> >> than let the kernel's OOM killer activate and need to gather this
>> >> >> >> information and we'd like to be able to get this information to make
>> >> >> >> the decision much faster than 400ms
>> >> >> >
>> >> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> >> > understand you want something better than SIGKILL and in fact this is
>> >> >> > already possible with memory cgroup controller (btw. memcg will give
>> >> >> > you a cheap access to rss, amount of shared, swapped out memory as
>> >> >> > well). Anyway if you are getting close to the OOM your system will most
>> >> >> > probably be really busy and chances are that also reading your new file
>> >> >> > will take much more time. I am also not quite sure how is pss useful for
>> >> >> > oom decisions.
>> >> >>
>> >> >> I mentioned it before, but based on experience RSS just isn't good
>> >> >> enough -- there's too much sharing going on in our use case to make
>> >> >> the correct decision based on RSS. If RSS were good enough, simply
>> >> >> put, this patch wouldn't exist.
>> >> >
>> >> > But that doesn't answer my question, I am afraid. So how exactly do you
>> >> > use pss for oom decisions?
>> >>
>> >> We use PSS to calculate the memory used by a process among all the
>> >> processes in the system, in the case of Chrome this tells us how much
>> >> each renderer process (which is roughly tied to a particular "tab" in
>> >> Chrome) is using and how much it has swapped out, so we know what the
>> >> worst offenders are -- I'm not sure what's unclear about that?
>> >
>> > So let me ask more specifically. How can you make any decision based on
>> > the pss when you do not know _what_ is the shared resource. In other
>> > words if you select a task to terminate based on the pss then you have to
>> > kill others who share the same resource otherwise you do not release
>> > that shared resource. Not to mention that such a shared resource might
>> > be on tmpfs/shmem and it won't get released even after all processes
>> > which map it are gone.
>>
>> Ok I see why you're confused now, sorry.
>>
>> In our case we do know what is being shared in general because
>> the sharing is mostly between those processes that we're looking at
>> and not other random processes or tmpfs, so PSS gives us useful data
>> in the context of these processes which are sharing the data
>> especially for monitoring between the set of these renderer processes.
>
> OK, I see and agree that pss might be useful when you _know_ what is
> shared. But this sounds quite specific to a particular workload. How
> many users are in a similar situation? In other words, if we present
> a single number without the context, how useful will it be in
> general? Is it possible that presenting such a number could be even
> misleading for somebody who doesn't have an idea which resources are
> shared? These are all questions which should be answered before we
> actually add this number (be it a new/existing proc file or a syscall).
> I still believe that the number without wider context is just not all
> that useful.


I see the specific point about PSS: you need to know what is being
shared, or otherwise use it in a whole-system context. I still think
the whole-system context is valid and generally useful. But what
about private_clean and private_dirty? Surely those are more
generally useful for calculating a lower bound on process memory
usage without additional knowledge?

At the end of the day all of these metrics are approximations, and it
comes down to how far off the various approximations are and what
trade-offs we are willing to make.
RSS is the cheapest but the most coarse.

PSS (with the correct context) and private data plus swap are much
better but also more expensive due to the page table walk.
As far as I know, to get anything but RSS we have to go through smaps
or use memcg. Swap seems to be available in /proc/<pid>/status.
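
For reference, this is roughly the work userspace has to do today to
get the same totals from smaps. It is only a sketch: the function
names (sum_smaps, read_totals) and the field list are mine for
illustration, and the field list simply mirrors what the proposed
totmaps file prints. The parsing assumes the usual "Name: <n> kB"
smaps line layout.

```python
import re

# Fields to total; this list mirrors the proposed totmaps output and is
# an assumption of this sketch, not something smaps itself defines.
FIELDS = ("Rss", "Pss", "Shared_Clean", "Shared_Dirty",
          "Private_Clean", "Private_Dirty", "Referenced",
          "Anonymous", "AnonHugePages", "Swap")

# Matches per-mapping counter lines such as "Rss:       40 kB".
# Mapping header lines and "VmFlags:" lines do not match and are skipped.
_COUNTER_LINE = re.compile(r"^([A-Za-z_]+):\s+(\d+)\s+kB\s*$")


def sum_smaps(text):
    """Sum the selected per-mapping counters (in kB) over smaps-style text."""
    totals = {name: 0 for name in FIELDS}
    for line in text.splitlines():
        match = _COUNTER_LINE.match(line)
        if match and match.group(1) in totals:
            totals[match.group(1)] += int(match.group(2))
    return totals


def read_totals(pid):
    """Read /proc/<pid>/smaps and return the summed counters in kB."""
    with open(f"/proc/{pid}/smaps") as smaps:
        return sum_smaps(smaps.read())
```

Every call still pays for the page table walk in the kernel, plus the
cost of formatting and parsing one block of text per mapping.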

I looked at the "shared" value in /proc/<pid>/statm but it doesn't
seem to correlate well with the shared value in smaps -- not sure why?

It might be useful to show the magnitude of difference of using RSS vs
PSS/Private in the case of the Chrome renderer processes. On the
system I was looking at there were about 40 of these processes, but I
picked a few to give an idea:

localhost ~ # cat /proc/21550/totmaps
Rss: 98972 kB
Pss: 54717 kB
Shared_Clean: 19020 kB
Shared_Dirty: 26352 kB
Private_Clean: 0 kB
Private_Dirty: 53600 kB
Referenced: 92184 kB
Anonymous: 46524 kB
AnonHugePages: 24576 kB
Swap: 13148 kB


RSS is 80% higher than PSS and 84% higher than private data.

localhost ~ # cat /proc/21470/totmaps
Rss: 118420 kB
Pss: 70938 kB
Shared_Clean: 22212 kB
Shared_Dirty: 26520 kB
Private_Clean: 0 kB
Private_Dirty: 69688 kB
Referenced: 111500 kB
Anonymous: 79928 kB
AnonHugePages: 24576 kB
Swap: 12964 kB

RSS is 66% higher than PSS and 69% higher than private data.

localhost ~ # cat /proc/21435/totmaps
Rss: 97156 kB
Pss: 50044 kB
Shared_Clean: 21920 kB
Shared_Dirty: 26400 kB
Private_Clean: 0 kB
Private_Dirty: 48836 kB
Referenced: 90012 kB
Anonymous: 75228 kB
AnonHugePages: 24576 kB
Swap: 13064 kB

RSS is 94% higher than PSS and 98% higher than private data.

It looks like there's a set of about 40MB of shared pages which cause
the difference in this case.
Swap was roughly even on these but I don't think it's always going to be true.
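
In case anyone wants to check the arithmetic, here is a small sketch
that derives the comparisons from the three outputs above. The helper
names (pct_higher, summarize) are just for illustration. "Private
data" here means Private_Clean + Private_Dirty, and the percentages
are truncated to whole numbers rather than rounded.

```python
# Values in kB, copied from the three totmaps outputs above.
SAMPLES = {
    21550: {"Rss": 98972, "Pss": 54717,
            "Shared_Clean": 19020, "Shared_Dirty": 26352,
            "Private_Clean": 0, "Private_Dirty": 53600},
    21470: {"Rss": 118420, "Pss": 70938,
            "Shared_Clean": 22212, "Shared_Dirty": 26520,
            "Private_Clean": 0, "Private_Dirty": 69688},
    21435: {"Rss": 97156, "Pss": 50044,
            "Shared_Clean": 21920, "Shared_Dirty": 26400,
            "Private_Clean": 0, "Private_Dirty": 48836},
}


def pct_higher(a, b):
    """Whole-percent amount by which a exceeds b (truncated, not rounded)."""
    return (a - b) * 100 // b


def summarize(fields):
    """Derive private/shared totals and the RSS comparisons for one process."""
    private = fields["Private_Clean"] + fields["Private_Dirty"]
    shared = fields["Shared_Clean"] + fields["Shared_Dirty"]
    return {
        "private_kb": private,
        "shared_kb": shared,
        "rss_over_pss_pct": pct_higher(fields["Rss"], fields["Pss"]),
        "rss_over_private_pct": pct_higher(fields["Rss"], private),
    }


if __name__ == "__main__":
    for pid, fields in SAMPLES.items():
        print(pid, summarize(fields))
```

The shared_kb value is Shared_Clean + Shared_Dirty, which is the set
of shared pages that accounts for the gap between RSS and private
data in each case.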


>
>> We also use the private clean and private dirty and swap fields to
>> make a few metrics for the processes and charge each process for its
>> private, shared, and swap data. Private clean and dirty are used for
>> estimating a lower bound on how much memory would be freed.
>
> I can imagine that this kind of information might be useful and
> presented in /proc/<pid>/statm. The question is whether some of the
> existing consumers would see the performance impact due to the page table
> walk. Anyway even these counters might get quite tricky because even
> shareable resources are considered private if the process is the only
> one to map them (so again this might be a file on tmpfs...).
>
>> Swap and
>> PSS also give us some indication of additional memory which might get
>> freed up.
> --
> Michal Hocko
> SUSE Labs