Avi Kivity wrote:
Nick Piggin wrote:
What's stopping the NFS server from ooming the machine then? Every time some bit of memory becomes free, the server will consume it instantly. Eventually ext3 will not be able to write anything out because it is out of memory.The NFS server should do the writeout a page at a time.
The NFS server writes not only in response to page reclaim (as a local NFS client), but also in response to pressure from non-local clients. If both ext3 and NFS have the same allocation limits, NFS may starve out ext3.
What do you mean starve out ext3? ext3 gets written to *by the NFS server*
which is PF_MEMALLOC.
(In my case the NFS server actually writes data asynchronously, so it doesn't really know it is responding to page reclaim, but the problem occurs even in a synchrounous NFS server.)
I can't see this being the responsibility of the kernel. The NFS server
could probably find out if it is servicing a loopback request or not.
Remote requests don't help to free memory... unless maybe you want a
filesystem on a remote nbd to be exported back to server via NFS or
something crazy.
An even more complex case is when ext3 depends on some other process, say it is mounted on a loopback nbd.The memory allocators will block when memory reaches the reserved
dirty NFS data -> NFS server -> ext3 -> nbd -> nbd server on localhost -> ext3/raw device
You can't have both the NFS server and the nbd server PF_MEMALLOC, since the NFS server may consume all memory, then wait for the nbd server to reclaim.
mark. Page reclaim will ask NFS to free one page, so the server
will write something out to the filesystem, this will cause the nbd
server (also PF_MEMALLOC) to write out to its backing filesystem.
If NFS and nbd have the same limit, then NFS may cause nbd to stall. We've already established that NFS must be PF_MEMALLOC, so nbd must be PF_MEMALLOC_HARDER or something like that.
No, your NFS server has to be coded differently. You can't allow it
to use up all PF_MEMALLOC memory just because it can.
The solution I have in mind is to replace the sync allocation logic from
if (free_mem() < some_global_limit && !current->PF_MEMALLOC)
wait_for_kswapd()
to
if (free_mem() < current->limit)
wait_for_kswapd()
kswapd would have the lowest ->limit, other processes as their place in the food chain dictates.
I think this is barking up the wrong tree. It really doesn't matter
what process is freeing memory. There isn't really anything special
about the way kswapd frees memory.
To free memory you need (a) to allocate memory (b) possibly wait for some freeing process to make some progress. That means all processes in the freeing chain must be able to allocate at least some memory. If two processes in the chain share the same blocking logic, they may deadlock on each other.
The PF_MEMALLOC path isn't to be used like that. If a *single*
PF_MEMALLOC task were to allocate all its memory then that would
be a bug too.