Re: [PATCH 2/3] sysctl: add support for drop_caches for individual filesystem

From: Jan Kara
Date: Mon Oct 14 2024 - 07:24:47 EST


On Fri 11-10-24 13:44:57, Amir Goldstein wrote:
> On Thu, Oct 10, 2024 at 7:04 PM Jan Kara <jack@xxxxxxx> wrote:
> >
> > On Thu 10-10-24 09:35:46, Benjamin Coddington wrote:
> > > On 10 Oct 2024, at 8:16, Jan Kara wrote:
> > >
> > > > On Thu 10-10-24 19:25:42, Ye Bin wrote:
> > > >> From: Ye Bin <yebin10@xxxxxxxxxx>
> > > >>
> > > >> In order to better analyze the issue of file system uninstallation caused
> > > >> by kernel module opening files, it is necessary to perform dentry recycling
> > > >
> > > > I don't quite understand the use case you mention here. Can you explain it
> > > > a bit more (that being said I've needed dropping caches for a particular sb
> > > > myself a few times for debugging purposes so I generally agree it is a
> > > > useful feature).
> > > >
> > > >> on a single file system. But now, apart from global dentry recycling, it is
> > > >> not supported to do dentry recycling on a single file system separately.
> > > >> This feature has usage scenarios in problem localization scenarios.At the
> > > >> same time, it also provides users with a slightly fine-grained
> > > >> pagecache/entry recycling mechanism.
> > > >> This patch supports the recycling of pagecache/entry for individual file
> > > >> systems.
> > > >>
> > > >> Signed-off-by: Ye Bin <yebin10@xxxxxxxxxx>
> > > >> ---
> > > >> fs/drop_caches.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> > > >> include/linux/mm.h | 2 ++
> > > >> kernel/sysctl.c | 9 +++++++++
> > > >> 3 files changed, 54 insertions(+)
> > > >>
> > > >> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> > > >> index d45ef541d848..99d412cf3e52 100644
> > > >> --- a/fs/drop_caches.c
> > > >> +++ b/fs/drop_caches.c
> > > >> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
> > > >> }
> > > >> return 0;
> > > >> }
> > > >> +
> > > >> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
> > > >> + void *buffer, size_t *length, loff_t *ppos)
> > > >> +{
> > > >> + unsigned int major, minor;
> > > >> + unsigned int ctl;
> > > >> + struct super_block *sb;
> > > >> + static int stfu;
> > > >> +
> > > >> + if (!write)
> > > >> + return 0;
> > > >> +
> > > >> + if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
> > > >> + return -EINVAL;
> > > >
> > > > I think specifying bdev major & minor number is not a great interface these
> > > > days. In particular for filesystems which are not bdev based such as NFS. I
> > > > think specifying path to some file/dir in the filesystem is nicer and you
> > > > can easily resolve that to sb here as well.
> > >
> > > Slight disagreement here since NFS uses set_anon_super() and major:minor
> > > will work fine with it.
> >
> > OK, fair point, anon bdev numbers can be used. But filesystems using
> > get_tree_nodev() would still be problematic.
> >
> > > I'd prefer it actually since it avoids this
> > > interface having to do a pathwalk and make decisions about what's mounted
> > > where and in what namespace.
> >
> > I don't understand the problem here. We'd do user_path_at(AT_FDCWD, ...,
> > &path) and then take path.mnt->mnt_sb. That doesn't look terribly
> > complicated to me. Plus it naturally deals with issues like namespacing
> > etc. although they are not a huge issue here because the functionality
> > should be restricted to CAP_SYS_ADMIN anyway.
> >
>
> Both looking up bdev and looking up path from write() can make syzbot
> and lockdep very upset:
> https://lore.kernel.org/linux-fsdevel/00000000000098f75506153551a1@xxxxxxxxxx/

OK, thanks for the reference.

> I thought Christian had a proposal for dropping cache per-sb API via
> fadvise() or something?
>
> Why use sysfs API for this and not fd to reference an sb?

I guess because the original drop_caches is in the sysfs. But yes, in
principle we could use fd pointing to the filesystem for this. I'm just not
sure fadvise(2) is really the right syscall for this because it is
currently all about page cache of a file and this call should shrink also
the dcache / icache. But ioctl() (not sure if this debug-mostly
functionality is worth a syscall) implemented in VFS would certainly be
possible and perhaps nicer than sysfs interface.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR