Re: [PATCH/RFC] core/nfsd: allow kernel threads to use task_work.
From: Jeff Layton
Date: Thu Nov 30 2023 - 12:48:05 EST
On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote:
> On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote:
> > On Wed, 29 Nov 2023, Christian Brauner wrote:
> > > [Reusing the trimmed Cc]
> > >
> > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > > >
> > > > > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > > > > delayed_fput_lists nearly twice as fast they are retired by a single
> > > > > > work-queue thread running delayed_fput(). As you might imagine this
> > > > > > does not end well (20 million files in the queue at the time a snapshot
> > > > > > was taken for analysis).
> > > > > >
> > > > > > While this might point to a problem with the filesystem not handling the
> > > > > > final close efficiently, such problems should only hurt throughput, not
> > > > > > lead to memory exhaustion.
> > > > >
> > > > > I have this patch queued for v6.8:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > > >
> > > >
> > > > Thanks....
> > > > I think that change is good, but I don't think it addresses the problem
> > > > mentioned in the description, and it is not directly relevant to the
> > > > problem I saw ... though it is complicated.
> > > >
> > > > The problem "workqueue ... hogged cpu..." probably means that
> > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > > > That will stop it from hogging the CPU whether it is tied to one CPU or
> > > > free to roam.
> > > >
> > > > Also that work is calling filp_close() which primarily calls
> > > > filp_flush().
> > > > It also calls fput() but that does minimal work. If there is much work
> > > > to do then that is offloaded to another work-item. *That* is the
> > > > workitem that I had problems with.
> > > >
> > > > The problem I saw was with an older kernel which didn't have the nfsd
> > > > file cache and so probably is calling filp_close more often. So maybe
> > > > my patch isn't so important now. Particularly as nfsd now isn't closing
> > > > most files in-task but instead offloads that to another task. So the
> > > > final fput will not be handled by the nfsd task either.
> > > >
> > > > But I think there is room for improvement. Gathering lots of files
> > > > together into a list and closing them sequentially is not going to be as
> > > > efficient as closing them in parallel.
> > > >
> > > > >
> > > > > > For normal threads, the thread that closes the file also calls the
> > > > > > final fput so there is natural rate limiting preventing excessive growth
> > > > > > in the list of delayed fputs. For kernel threads, and particularly for
> > > > > > nfsd, delayed in the final fput do not impose any throttling to prevent
> > > > > > the thread from closing more files.
> > > > >
> > > > > I don't think we want to block nfsd threads waiting for files to
> > > > > close. Won't that be a potential denial of service?
> > > >
> > > > Not as much as the denial of service caused by memory exhaustion due to
> > > > an indefinitely growing list of files waiting to be closed by a single
> > > > thread of workqueue.
> > >
> > > It seems less likely that you run into memory exhausting than a DOS
> > > because nfsd() is busy closing fds. Especially because you default to
> > > single nfsd thread afaict.
> >
> > An nfsd thread would not end up being busy closing fds any more than it
> > can already be busy reading data or busy syncing out changes or busying
> > renaming a file.
> > Which it is say: of course it can be busy doing this, but doing this sort
> > of thing is its whole purpose in life.
> >
> > If an nfsd thread only completes the close that it initiated the close
> > on (which is what I am currently proposing) then there would be at most
> > one, or maybe 2, fds to close after handling each request.
>
> Closing files more aggressively would seem to entirely defeat the
> purpose of the file cache, which is to avoid the overhead of opens
> and closes on frequently-used files.
>
> And usually Linux prefers to let the workload consume as many free
> resources as possible before it applies back pressure or cache
> eviction.
>
> IMO the first step should be removing head-of-queue blocking from
> the file cache's background closing mechanism. That might be enough
> to avoid forming a backlog in most cases.
>
>
That's not quite what task_work does. Neil's patch wouldn't result in
closes happening more aggressively. It would just make it so that we
don't queue the delayed part of the fput process to a workqueue like we
do today.
Instead, the nfsd threads would have to clean that part up themselves,
like syscalls do before returning to userland. I think that idea makes
sense overall since that mirrors what we already do in userland.
In the event that all of the nfsd threads are tied up in slow task_work
jobs...tough luck. That at least makes it more of a self-limiting
problem since RPCs will start being queueing, rather than allowing dead
files to just pile onto the list.
> > > > For NFSv3 it is more complex. On the kernel where I saw a problem the
> > > > filp_close happen after each READ or WRITE (though I think the customer
> > > > was using NFSv4...). With the file cache there is no thread that is
> > > > obviously responsible for the close.
> > > > To get the sort of throttling that I think is need, we could possibly
> > > > have each "nfsd_open" check if there are pending closes, and to wait for
> > > > some small amount of progress.
> > > >
> > > > But don't think it is reasonable for the nfsd threads to take none of
> > > > the burden of closing files as that can result in imbalance.
> > >
> > > It feels that this really needs to be tested under a similar workload in
> > > question to see whether this is a viable solution.
> >
> > Creating that workload might be a challenge. I know it involved
> > accessing 10s of millions of files with a server that was somewhat
> > memory constrained. I don't know anything about the access pattern.
> >
> > Certainly I'll try to reproduce something similar by inserting delays in
> > suitable places. This will help exercise the code, but won't really
> > replicate the actual workload.
>
> It's likely that the fundamental bottleneck is writeback during
> close.
>
--
Jeff Layton <jlayton@xxxxxxxxxx>