On Tue 27-06-23 15:14:11, David Hildenbrand wrote:
> On 27.06.23 14:40, Michal Hocko wrote:
>> On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
>>> John Hubbard writes [1]:
>>>
>>>     Some device drivers add memory to the system via memory hotplug.
>>>     When the driver is unloaded, that memory is hot-unplugged.
>>>
>>>     However, memory hot unplug can fail. And these days, it fails a
>>>     little too easily, with respect to the above case. Specifically, if
>>>     a signal is pending on the process, hot unplug fails.
>>>
>>>     [...]
>>>
>>>     So in this case, other things (unmovable pages, un-splittable huge
>>>     pages) can also cause the above problem. However, those are
>>>     demonstrably less common than simply having a pending signal. I've
>>>     got bug reports from users who can trivially reproduce this by
>>>     killing their process with a "kill -9", for example.
>>
>> This looks like a bug of the said driver, no? If the tear down process is
>> killed, it could very well happen right before offlining, so you end up in
>> the very same state. Or what am I missing?
>
> IIUC (John can correct me if I am wrong):
>
> 1) The process holds the device node open
> 2) The process gets killed or quits
> 3) As the process gets torn down, it closes the device node
> 4) Closing the device node results in the driver removing the device and
>    calling offline_and_remove_memory()
>
> So it's not a "tear down process" that triggers that offlining+removal
> somehow explicitly; it's just a side product of it letting go of the device
> node as the process gets torn down.
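For illustration, the flow described above might look roughly like this in a
driver's file-release path. Everything named demo_* below is made up for the
sketch; only offline_and_remove_memory() is a real kernel interface, and the
two-argument form shown matches recent kernels:

#include <linux/fs.h>
#include <linux/printk.h>
#include <linux/memory_hotplug.h>

struct demo_dev {
	u64 mem_start;	/* physical start of the range hotplugged earlier */
	u64 mem_size;	/* size of that range */
};

/* Last close of the device node: tear the device down again. */
static int demo_release(struct inode *inode, struct file *file)
{
	struct demo_dev *dev = file->private_data;
	int rc;

	/*
	 * This runs from the context of the (possibly killed) process while
	 * it is being torn down, so a fatal signal is typically pending
	 * here -- which is what made the offlining fail reliably.
	 */
	rc = offline_and_remove_memory(dev->mem_start, dev->mem_size);
	if (rc)
		pr_warn("demo: offline_and_remove_memory() failed: %d\n", rc);
	return 0;
}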
Isn't that just fragile? The operation might fail for other reasons. Why
can't there be a hold on the resource to control the tear down explicitly?
>>>
>>> Especially with ZONE_MOVABLE, offlining is supposed to work in most
>>> cases when offlining actually hotplugged (not boot) memory, and only fail
>>> in rare corner cases (e.g., some driver holds a reference to a page in
>>> ZONE_MOVABLE, turning it unmovable).
>>>
>>> In these corner cases we really don't want to be stuck forever in
>>> offline_and_remove_memory(). But in the general cases, we really want to
>>> do our best to make memory offlining succeed -- in a reasonable
>>> timeframe.
>>>
>>> Reliably failing in the described case when there is a fatal signal pending
>>> is sub-optimal. The pending signal check is mostly only relevant when user
>>> space explicitly triggers offlining of memory using sysfs device attributes
>>> ("state" or "online" attribute), but not when coming via
>>> offline_and_remove_memory().
>>>
>>> So let's use a timer instead and ignore fatal signals, because they are
>>> not really expressive for offline_and_remove_memory() users. Let's default
>>> to 30 seconds if no timeout was specified, and limit the timeout to 120
>>> seconds.
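To make the proposal concrete, here is a rough sketch of the deadline handling
it implies. This is illustrative only and does not mirror the actual patch
series; all demo_* names are made up:

#include <linux/jiffies.h>
#include <linux/minmax.h>
#include <linux/time64.h>

#define DEMO_OFFLINE_TIMEOUT_DEFAULT_MS	(30 * MSEC_PER_SEC)	/* 30s default */
#define DEMO_OFFLINE_TIMEOUT_MAX_MS	(120 * MSEC_PER_SEC)	/* 120s cap */

/* Compute the jiffies deadline for one offlining attempt. */
static unsigned long demo_offline_deadline(unsigned int timeout_ms)
{
	if (!timeout_ms)
		timeout_ms = DEMO_OFFLINE_TIMEOUT_DEFAULT_MS;
	timeout_ms = min_t(unsigned int, timeout_ms,
			   DEMO_OFFLINE_TIMEOUT_MAX_MS);
	return jiffies + msecs_to_jiffies(timeout_ms);
}

/*
 * Called from the offlining retry loop: instead of aborting because a
 * (fatal) signal is pending, only give up once the deadline has passed.
 */
static int demo_offline_should_abort(unsigned long deadline)
{
	return time_after(jiffies, deadline) ? -EBUSY : 0;
}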
>>
>> I really hate having timeouts back. They have just proven to be hard to get
>> right and it is essentially a policy implemented in the kernel. They
>> simply do not belong to the kernel space IMHO.
>
> As much as I agree with you in terms of offlining triggered from user space
> (e.g., writing the "state" or "online" attribute), where user space is
> actually in charge and can do something reasonable (timeout, retry,
> whatever), in the offline_and_remove_memory() case it's the driver that
> wants a best-effort memory offlining+removal.
>
> If it times out, virtio-mem will simply try another block or retry later.
> Right now, it could get stuck forever in offline_and_remove_memory(), which
> is obviously "not great". Fortunately, for virtio-mem it's configurable, and
> we use the alloc_contig_range() method as the default for now.
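The best-effort pattern described above could look roughly like this (purely
illustrative, it does not reproduce the real virtio-mem code; only
offline_and_remove_memory() and pr_debug() are real kernel interfaces):

/*
 * Try to unplug a set of memory blocks; skip blocks that cannot be
 * offlined right now and leave them for a later retry.
 */
static void demo_try_unplug_blocks(const u64 *block_starts,
				   unsigned int nr_blocks, u64 block_size)
{
	unsigned int i;

	for (i = 0; i < nr_blocks; i++) {
		int rc = offline_and_remove_memory(block_starts[i],
						   block_size);

		if (rc)
			pr_debug("unplug of block at %#llx deferred: %d\n",
				 block_starts[i], rc);
		/* On failure, simply move on; the block can be retried later. */
	}
}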
It seems that offline_and_remove_memory() is using the wrong operation then,
if it wants an opportunistic offlining with some sort of policy. A timeout
might be just one policy to use, but a failure mode or a retry count might
be a better fit for some users. So rather than (ab)using offline_pages(),
would it make more sense to extract the basic offlining steps and allow
drivers like virtio-mem to reuse them and define their own policy?
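One way to read that suggestion, as a purely hypothetical API sketch (none of
these names exist in the kernel today):

/* Caller-supplied policy deciding when to give up on an offlining attempt. */
struct offline_policy {
	/* Return true to keep retrying after a failed attempt. */
	bool (*should_retry)(void *arg, unsigned int attempt);
	void *arg;
};

int offline_and_remove_memory_policy(u64 start, u64 size,
				     const struct offline_policy *policy);

/*
 * Example policy a driver could define: give up after a fixed number of
 * attempts instead of relying on signals or an in-kernel timeout.
 */
struct demo_retry_count {
	unsigned int max_attempts;
};

static bool demo_should_retry(void *arg, unsigned int attempt)
{
	struct demo_retry_count *p = arg;

	return attempt < p->max_attempts;
}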
>
> If it were to time out for John's driver, we most certainly don't want to be
> stuck in offline_and_remove_memory(), blocking device/driver unloading (and
> even a reboot IIRC) possibly forever.

Now I am confused. John's driver wants to tear down the device now? There
is no way to release that memory otherwise AFAIU from the initial
problem description.