Re: [PATCH 0/2] mm: memcontrol: cgroup2 memory statistics

From: Johannes Weiner
Date: Fri Jan 15 2016 - 15:31:48 EST

On Fri, Jan 15, 2016 at 12:58:34PM +0300, Vladimir Davydov wrote:
> With the follow-up it looks good to me. All exported counters look
> justified enough and the format follows that of other cgroup2
> controllers (cpu, blkio). Thanks!
> Acked-by: Vladimir Davydov <vdavydov@xxxxxxxxxxxxx>

Thanks Vladimir.

> One addition though. May be, we could add 'total' field which would show
> memory.current? Yeah, this would result in a little redundancy, but I
> think that from userspace pov it's much more convenient to read the
> only file and get all stat counters than having them spread throughout
> several files.

I am not fully convinced that a total value or even memory.current
will be looked at that often in practice, because in all but a few
cornercases that value will be pegged to the configured limit. In
those instances I think it should be okay to check another file.

> Come to think of it, do we really need separate file?
> Can't these counters live in memory.stat either?

I think it sits at a different level of the interface. The events file
indicates cgroup-specific dynamics between configuration and memory
footprint, and so it sits on the same level as low, high, max, and
current. These are the parts involved in the most basic control loop
between the kernel and the job scheduler--monitor and adjust or notify
the admin. It's for the entity that allocates and manages the system.

The memory.stat file on the other hand is geared toward analyzing and
understanding workload-specific performance (whether by humans or with
some automated heuristics) and if necessary correcting the config file
that describes the application's requirements to the job scheduler.

I think it makes sense to not conflate these two interfaces.

> Yeah, this file
> generates events, but IMHO it's not very useful the way it is currently
> implemented:
> Suppose, a user wants to receive notifications about OOM or LOW events,
> which are rather rare normally and might require immediate action. The
> only way to do that is to listen to, but this file can
> generate tons of MAX/HIGH when the cgroup is performing normally. The
> userspace app will have to wake up every time the cgroup performs
> reclaim and check just to ensure no OOM happened and this
> all will result in wasting cpu time.

Under optimal system load there is no limit reclaim, and memory
pressure comes exclusively from a shortage of physical pages that
global reclaim balances based on memory.low. If groups run into their
own limits, it means that there are idle resources left on the table.

So events only happen when the machine is over or under utilized, and
as per above, the events file is mainly meant for something like a job
scheduler tasked with allocating the machine's resources. It's hard to
imagine a job scheduler scenario where the long-term goal is anything
other than optimal utilization.

There are reasonable cases in which memory could be temporarily left
idle, say to keep startup latency of new jobs low. In those it's true
that the max and high notifications might become annoying. But do you
really think that could become problematic in practice? In that case
it should be enough if we ratelimit the file-changed notifications.

> May be, we could generate LOW/HIGH/MAX events on memory.low/high/max?
> This would look natural IMO. Don't know where OOM events should go in
> this case though.

Without a natural place for OOM notifications, it probably makes sense
to stick with