Re: Ways to deprecate /sys/devices/system/memory/memoryX/phys_device ?

From: Michal Hocko
Date: Mon Sep 14 2020 - 07:25:53 EST


On Fri 11-09-20 12:09:52, David Hildenbrand wrote:
> On 11.09.20 11:12, Michal Hocko wrote:
> > On Fri 11-09-20 10:09:07, David Hildenbrand wrote:
> > [...]
> >> Consider two cases:
> >>
> >> 1. Hot(un)plugging huge DIMMs: many (not all!) use cases want to
> >> online/offline the whole thing. HW can effectively only plug/unplug the
> >> whole thing. It makes sense in some (most?) setups to represent one DIMM
> >> as one memory block device.
> >
> > Yes, for the physical hotplug it doesn't really make much sense to me to
> > offline portions that the HW cannot hotremove.
>
> I've seen people offline parts of memory to simulate systems with less
> RAM and people offline parts of memory on demand to save energy
> (poweroff banks). People won't stop being creative with what we provided
> to them :D

Heh, I have seen people shooting themselves in the foot for fun. But
more seriously, I do understand the different usecases and we shouldn't
cut them off from their toys.

> >> 2. Hot(un)plugging small memory increments. This is mostly the case in
> >> virtualized environments - especially hyper-v balloon, xen balloon,
> >> virtio-mem and (drumroll) ppc dlpar and s390x standby memory. On PPC,
> >> you want at least all (16MB!) memory block devices that can get
> >> unplugged again individually ("LMBs") as separate memory blocks. Same on
> >> s390x on memory increment size (currently effectively the memory block
> >> size).
> >
> > Yes I do recognize those usecases even though I will not pretend I
> > don't consider them questionable. E.g. any hotplug with a smaller
> > granularity than the memory model in Linux allows is just dubious. We
> > simply cannot implement that without a lot of waste and then the
> > question is what is the real point.
>
> Having the section size as small as possible in these environments is
> most certainly preferable, to clean up metadata where possible.

There is a certain line that is hard to maintain. I consider a section
to be the smallest granularity that makes sense to support. Current
section sizing makes sense from the VMEMMAP point of view. If there are
strong reasons to allow smaller ones then I believe this should be a
compile time option.
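
To make the VMEMMAP argument concrete, here is a rough back-of-the-envelope
sketch. It assumes x86-64 style defaults (4 KiB pages, 128 MiB sections, a
64 byte struct page, 2 MiB PMD mappings); none of these numbers are stated
above and other architectures or configs differ. The helper names are
made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 style defaults; not taken from this thread. */
#define PAGE_SHIFT        12 /* 4 KiB base pages */
#define SECTION_SIZE_BITS 27 /* 128 MiB sparsemem section */
#define STRUCT_PAGE_SIZE  64 /* typical sizeof(struct page), assumed */
#define PMD_SHIFT         21 /* 2 MiB mapping used for the vmemmap */

/* Bytes of struct page metadata needed to describe one section. */
static uint64_t vmemmap_bytes_per_section(unsigned int section_bits)
{
	uint64_t pages;

	assert(section_bits >= PAGE_SHIFT && section_bits < 64);
	pages = UINT64_C(1) << (section_bits - PAGE_SHIFT);
	return pages * STRUCT_PAGE_SIZE;
}

/* Nonzero if one section's metadata fills whole PMD-sized mappings. */
static int vmemmap_is_pmd_aligned(unsigned int section_bits)
{
	uint64_t pmd_mask = (UINT64_C(1) << PMD_SHIFT) - 1;

	return (vmemmap_bytes_per_section(section_bits) & pmd_mask) == 0;
}
```

Under these assumptions a 128 MiB section needs exactly 2 MiB of struct
pages, i.e. one PMD mapping, while a 16 MiB granularity would need only
256 KiB and could not be backed or freed as a whole PMD on its own.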

> Otherwise, hot(un)plugging smaller granularity behaves more like memory
> ballooning (and I think I don't have to tell you that ballooning is used
> excessively even though it wastes memory on metadata ;) ). Anyhow,
> that's another discussion.

Yeah, I am aware of that. And honestly subsection offlining makes very
little sense to me. It was hard to argue against that for nvdimm
usecases where we simply had to work around the reality where devices
couldn't have been aligned properly. I do not think we want to claim
support for general hotplug though.

[...]

> > There is only one certainty. Providing a long term interface with ever
> > growing (ab)users is a hard target. And shinyN might be needed in the
> > end. Who knows. My main point is that the existing interface is hitting
> > a wall on usecases which _do_not_care_ about memory hotplug. And that is
> > something we should be looking at.
>
> Agreed. I can see 3 scenarios
>
> a) no memory hotplug support, no sysfs.
> b) memory hotplug support, no sysfs
> c) memory hotplug support, sysfs
>
> Starting with a) and c) is the easiest way to go.

Yes, the first and the simplest way would be to provide
memory_hotplug=[disabled|v1]

where disabled would be no sysfs interface, v1 would be the existing
infrastructure. I would hope to land on v2 in the future, which would
provide a new interface.
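
As a purely illustrative sketch of how such a switch could be parsed (this
is plain userspace C, not kernel code; the enum, the function name and the
choice of v1 as the default when nothing is given are all assumptions, and
no such parameter exists as of this mail):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical modes for the proposed memory_hotplug= option. */
enum memhp_mode {
	MEMHP_DISABLED, /* no sysfs interface */
	MEMHP_V1,       /* existing infrastructure */
	MEMHP_INVALID,  /* unknown value, caller decides what to do */
};

/* Map the option value to a mode; NULL means the option was not given. */
static enum memhp_mode parse_memory_hotplug(const char *arg)
{
	if (arg == NULL)
		return MEMHP_V1; /* assumed default: keep current behavior */
	if (strcmp(arg, "disabled") == 0)
		return MEMHP_DISABLED;
	if (strcmp(arg, "v1") == 0)
		return MEMHP_V1;
	return MEMHP_INVALID;
}
```

A later "v2" would simply become one more accepted value, which is the
reason for treating unknown strings as invalid instead of silently falling
back to v1.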

--
Michal Hocko
SUSE Labs