So some urgent questions are: how are we going to do mem hotunplug and
per-container RSS?
Our basic unit of memory management is the zone. Right now, a zone maps
onto some hardware-imposed thing. But the zone-based MM works *well*. I
suspect that a good way to solve both per-container RSS and mem hotunplug
is to split the zone concept away from its hardware limitations: create a
"software zone" and a "hardware zone". All the existing page allocator and
reclaim code remains basically unchanged, and it operates on "software
zones". Each software zone always lies within a single hardware zone. The
software zones are resizeable. For per-container RSS we give each
container one (or perhaps multiple) resizeable software zones.
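To make the split concrete, here is a rough userspace sketch of the bookkeeping, not real kernel code. The names struct hw_zone, struct sw_zone, sw_zone_resize() and sw_zone_charge() are all made up for illustration, and the sketch only tracks page counts: it assumes resizing is pure accounting and ignores which physical pages a software zone actually owns.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: a hardware-imposed range of page frames. */
struct hw_zone {
	unsigned long nr_pages;		/* total pages in the hardware zone */
	unsigned long nr_assigned;	/* pages handed out to software zones */
};

/* Hypothetical: the thing the allocator and reclaim would operate on. */
struct sw_zone {
	struct hw_zone *hw;		/* always exactly one parent */
	unsigned long nr_pages;		/* current size, changeable at runtime */
	unsigned long nr_used;		/* pages currently allocated from it */
};

static void sw_zone_init(struct sw_zone *sz, struct hw_zone *hw)
{
	sz->hw = hw;
	sz->nr_pages = 0;
	sz->nr_used = 0;
}

/*
 * Resize a software zone. Fails if the new size is below what is already
 * in use (reclaim would have to run first) or if the parent hardware zone
 * does not have enough unassigned pages.
 */
static bool sw_zone_resize(struct sw_zone *sz, unsigned long new_pages)
{
	struct hw_zone *hw = sz->hw;
	unsigned long others;

	assert(hw->nr_assigned >= sz->nr_pages);
	others = hw->nr_assigned - sz->nr_pages;

	if (new_pages < sz->nr_used)
		return false;
	if (new_pages > hw->nr_pages - others)
		return false;

	hw->nr_assigned = others + new_pages;
	sz->nr_pages = new_pages;
	return true;
}

/* Charge one page to a container's zone; this is the RSS limit check. */
static bool sw_zone_charge(struct sw_zone *sz)
{
	if (sz->nr_used >= sz->nr_pages)
		return false;	/* container is at its limit */
	sz->nr_used++;
	return true;
}
```

In this model a container's RSS limit is just the size of its software zone, and raising or lowering the limit is a call to sw_zone_resize().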
For memory hotunplug, some of the hardware zone's software zones are marked
reclaimable and some are not; DIMMs which are wholly within reclaimable
zones can be depopulated and powered off or removed.
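The "wholly within reclaimable zones" test could look something like the sketch below. Again this is illustrative userspace C with invented names (struct pfn_range, struct sw_zone_span, dimm_can_unplug()). It assumes the software zones of a hardware zone are kept sorted by start pfn and do not overlap, and it conservatively treats pages that belong to no software zone as not removable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Half-open range of page frame numbers: [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical: the physical span of one software zone. */
struct sw_zone_span {
	struct pfn_range span;
	bool reclaimable;
};

/*
 * Return true if every page of the DIMM is covered by a reclaimable
 * software zone. Assumes zones[] is sorted by span.start and that the
 * spans do not overlap.
 */
static bool dimm_can_unplug(const struct pfn_range *dimm,
			    const struct sw_zone_span *zones, size_t nr)
{
	unsigned long covered = dimm->start;	/* first pfn not yet checked */
	size_t i;

	assert(dimm->start < dimm->end);

	for (i = 0; i < nr && covered < dimm->end; i++) {
		const struct sw_zone_span *z = &zones[i];

		if (z->span.end <= covered)
			continue;	/* zone lies entirely before the DIMM */
		if (z->span.start > covered)
			return false;	/* gap: pages in no software zone */
		if (!z->reclaimable)
			return false;	/* DIMM touches an unreclaimable zone */
		covered = z->span.end;
	}
	return covered >= dimm->end;
}
```

A DIMM that straddles two reclaimable software zones passes the check; one that overlaps a non-reclaimable zone by even a single page does not.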
NUMA and cpusets screwed up: they've gone and used nodes as their basic
unit of memory management whereas they should have used zones. This will
need to be untangled.
Anyway, that's just a shot in the dark. Could be that we implement unplug
and RSS control by totally different means. But I do wish that we'd sort
out what those means will be before we potentially complicate the story a
lot by adding antifragmentation.