Re: [PATCH 3/3] vmevent: Implement special low-memory attribute

From: Pekka Enberg
Date: Tue May 08 2012 - 03:36:28 EST

On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro
<kosaki.motohiro@xxxxxxxxx> wrote:
> Ok, sane. Then I will take a little time and briefly review the current vmevent code.
> (I read the vmevent/core branch in Pekka's tree. Please let me know if
> there is a newer repository.)

It's the latest one.

On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro
<kosaki.motohiro@xxxxxxxxx> wrote:
> 1) sample_period is a brain-damaged idea. If people ONLY need to
> sample statistics, they
>  only need to read /proc/vmstat periodically. Just remove it and
> implement push notification.
>  _IF_ someone needs an infrequent level trigger, just use
> "usleep(timeout); read(vmevent_fd)"
>  in userland code.

That comes from a real-world requirement. See Leonid's email on the topic:

> 2) VMEVENT_ATTR_STATE_ONE_SHOT is a misleading name. In effect it is an
> edge trigger, not a fire-only-once event.


> 3) vmevent_fd() seems a sane interface, but it is namespace unaware.
> Maybe we should discuss how to harmonize it with the namespace feature. No hurry, but we have
> to think about that issue from the beginning.

You mean VFS namespaces? Yeah, we need to take care of that.

> 4) Currently, vmstat uses per-cpu batching, and vmstat updates can lag by up to 3
> seconds.
>  This is fine for the usual case because almost all userland watchers only
> read /proc/vmstat once per second.
>  But for the vmevent_fd() case, 3 seconds may be an unacceptable delay. At
> worst, 128 batch x 4096
>  x 4k pagesize = 2G bytes of inaccuracy.

That's pretty awful. Anton, Leonid, comments?
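
For reference, the worst-case figure quoted above can be checked with a few lines of C. This is only a sketch under my reading of the numbers: a per-cpu batch of 128 pages, the 4096 taken as a CPU count (an assumption, the quote does not say), and 4 KiB pages. The function names are made up for illustration and are not part of vmevent or vmstat.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on the bytes that can sit in
 * per-cpu deltas before they are folded into the global counter.
 */
static uint64_t vmstat_worst_case_drift(uint64_t batch_pages,
                                        uint64_t nr_cpus,
                                        uint64_t page_size)
{
    return batch_pages * nr_cpus * page_size;
}

/* Sanity check of the quoted figure: 2^7 * 2^12 * 2^12 = 2^31 bytes. */
static void check_quoted_figure(void)
{
    assert(vmstat_worst_case_drift(128, 4096, 4096) == (UINT64_C(1) << 31));
}
```

So the quoted 2G is 2 GiB exactly under those assumptions; on a small UMA box with a handful of CPUs the same bound is only a few megabytes.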

> 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from the files
> exported to userland.
>  When kernel internals are exported, silly guys always use them and end up unhappy.

Agreed. Anton, care to cook up a patch to do that?

> 6) Also, vmevent_event must be hidden from userland.

Why? That's part of the ABI.

> 7) vmevent_config::size must be removed. In the 20th century, M$ APIs
> preferred this technique. But
>  they dropped it because a lot of applications don't initialize the
> size member, so it can't be used to keep upward compatibility.

It's there to support forward/backward ABI compatibility like perf
does. I'm going to keep it for now but I'm open to dropping it when
the ABI is more mature.
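
To make the trade-off concrete, here is a rough userspace sketch of the size-field technique. It is not the vmevent code and not perf's actual implementation; struct demo_config, DEMO_CONFIG_SIZE_VER0 and demo_copy_config() are all hypothetical names. The idea is that an older, shorter struct gets zero-filled, while a newer, longer one is accepted only if the bytes we don't understand are zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration, NOT the real vmevent_config. */
struct demo_config {
    uint32_t size;       /* sizeof(struct) as the caller knows it */
    uint32_t counter;    /* present since the first revision */
    uint64_t threshold;  /* added in a later revision */
};

#define DEMO_CONFIG_SIZE_VER0 8u  /* size + counter */

/* Copy a config of unknown vintage into the current layout. */
static int demo_copy_config(struct demo_config *dst,
                            const void *src, size_t src_len)
{
    const unsigned char *p = (const unsigned char *)src;
    uint32_t size;
    size_t i;

    assert(dst != NULL && src != NULL);
    if (src_len < sizeof(size))
        return -EINVAL;
    memcpy(&size, src, sizeof(size));

    /* An uninitialized (zero or garbage) size is rejected here. */
    if (size < DEMO_CONFIG_SIZE_VER0 || size > src_len)
        return -EINVAL;

    if (size > sizeof(*dst)) {
        /* Newer caller: fields we do not know about must be zero. */
        for (i = sizeof(*dst); i < size; i++)
            if (p[i] != 0)
                return -E2BIG;
        size = (uint32_t)sizeof(*dst);
    }

    /* Older caller: fields it does not know about default to zero. */
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, size);
    dst->size = (uint32_t)sizeof(*dst);
    return 0;
}
```

Note that this sketch fails hard on a zero size, which is exactly the case KOSAKI is worried about: an application that forgets to initialize the field gets an error now rather than silent misbehaviour after the struct grows.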

> 8) memcg unaware
> 9) numa unaware
> 10) zone unaware


> And we may need VM-internal changes if we really need lowmem
> notification. The current kernel doesn't have such info. _And_ there is one more
> big problem. Currently the kernel maintains memory per
> zone. But almost all userland applications are aware of neither zones nor nodes.
> Thus raw notifications aren't useful for userland. On the other hand, are total
> memory and total free memory useful? Definitely no!
> Even when there is lots of total free memory, the system may start swapping out and
> invoking the OOM killer. If we can't oom invocation, this feature has a serious raison
> d'etre issue. (i.e. (4), (8), (9) and (10) are not ignorable issues, I think.)

I'm guessing most of the existing solutions get away with
approximations and soft limits because they're mostly used on UMA
embedded machines.

But yes, we need to do better here.