Re: [PATCH 0/2] scop GFP_NOFS api
From: NeilBrown
Date: Thu May 05 2016 - 23:21:11 EST
On Wed, May 04 2016, Dave Chinner wrote:
> FWIW, I don't think making evict() non-blocking is going to be worth
> the effort here. Making memory reclaim wait on a priority ordered
> queue while asynchronous reclaim threads run reclaim as efficiently
> as possible and wakes waiters as it frees the memory the waiters
> require is a model that has been proven to work in the past, and
> this appears to me to be the model you are advocating for. I agree
> that direct reclaim needs to die and be replaced with something
> entirely more predictable, controllable and less prone to deadlock
> contexts - you just need to convince the mm developers that it will
> perform and scale better than what we have now.
>
> In the mean time, having a slightly more fine grained GFP_NOFS
> equivalent context will allow us to avoid the worst of the current
> GFP_NOFS problems with very little extra code.
You have painted two pictures here. The first is an ideal which does
look a lot like the sort of outcome I was aiming for, but is more than a
small step away.
The second is a band-aid which would take us in exactly the wrong
direction. It makes an interface which people apparently find hard to
use (or easy to misused) - the setting of __GFP_FS - and makes it more
complex. Certainly it would be more powerful, but I think it would also
be more misused.
So I ask myself: can we take some small steps towards 'A' and thereby
enable at least the functionality enabled by 'B'?
A core design principle for me is to enable filesystems to take control
of their own destiny. They should have the information available to
make the decisions they need to make, and the opportunity to carry them
out.
All the places where direct reclaim currently calls into filesystems
carry the 'gfp' flags so the file system can decide what to do, with one
exception: evict_inode. So my first proposal would be to rectify that.
- redefine .nr_cached_objects and .free_cached_objects so that, if they
are defined, they are responsible for s_dentry_lru and s_inode_lru.
e.g. super_cache_count *either* calls ->nr_cached_objects *or* makes
two calls to list_lru_shrink_count. This would require exporting
prune_dcache_sb and prune_icache_sb but otherwise should be a fairly
straight forward change.
If nr_cached_objects were defined, super_cache_scan would no longer
abort without __GFP_FS - that test would be left to the filesystem.
- Now any filesystem that wants to can stash it's super_block pointer
in current->journal_info while doing memory allocations, and abort
any reclaim attempts (release_page, shrinker, nr_cached_objects) if
and only if current->journal_info == "my superblock". This can be
done without the core mm code knowing any more than it already does.
- A more sophisticated filesystem might import much of the code for
prune_icache_sb() - either by copy/paste or by exporting some vfs
internals - and then store an inode pointer in current->journal_info
and only abort reclaim which touches that inode.
- if a filesystem happens to know that it will never block in any of
these reclaim calls, it can always allow prune_dcache_sb to run, and
never needs to use GFP_NOFS. I think NFS might be close to being
able to do this as it flushes everything on last-close. But that is
something that NFS developers can care about (or not) quite
independently from mm people.
- Maybe some fs developer will try to enable free_cached_objects to do
as much work as possible for every inode, but never deadlock. It
could do its own fs-specfic deadlock detection, or could queue work
to a work queue and wait a limited time for it. Or something.
If some filesystem developer comes up with something that works
really well, developers of other filesystems might copy it - or not
as they choose.
Maybe ->journal_info isn't perfect for this. It is currently only safe
for reclaim code to compare it against a known value. It is not safe to
dereference it to see if it points to a known value. That could possibly be
cleaned up, or another task_struct field could be provided for
filesystems to track their state. Or do you find a task_struct field
unacceptable and there is some reason and that an explicitly passed cookie
is superior?
My key point is that we shouldn't try to plumb some new abstraction
through the MM code so there is a new pattern for all filesystems to
follow. Rather the mm/vfs should get out of the filesystems' way as much
as possible and let them innovate independently.
Thanks for your time,
NeilBrown
Attachment:
signature.asc
Description: PGP signature