Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement
From: Ian Kent
Date: Sat Jun 20 2020 - 23:22:17 EST
On Fri, 2020-06-19 at 11:38 -0400, Tejun Heo wrote:
> Hello, Ian.
>
> On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote:
> > The series here tries to reduce the locking needed during path
> > walks
> > based on the assumption that there are many path walks with a
> > fairly
> > large portion of those for non-existent paths, as described above.
> >
> > That was done by adding kernfs negative dentry caching (non-
> > existent
> > paths) to avoid continual alloc/free cycle of dentries and a
> > read/write
> > semaphore introduced to increase kernfs concurrency during path
> > walks.
> >
> > With these changes we still need kernel parameters of
> > udev.children-max=2048
> > and systemd.default_timeout_start_sec=300 for the fastest boot
> > times of
> > under 5 minutes.
>
> I don't have strong objections to the series but the rationales don't
> seem
> particularly strong. It's solving a suspected problem but only half
> way. It
> isn't clear whether this can be the long term solution for the
> problem
> machine and whether it will benefit anyone else in a meaningful way
> either.
>
> I think Greg already asked this but how are the 100,000+ memory
> objects
> used? Is that justified in the first place?
The problem is real enough, however, whether improvements can be made
in other areas flowing on from the arch specific device creation
notifications is not clear cut.
There's no question that there is very high contention between the
VFS and sysfs and that's all the series is trying to improve, nothing
more.
What both you and Greg have raised are good questions but are
unfortunately very difficult to answer.
I tried to add some discussion about it, to the extent that I could,
in the cover letter.
Basically the division of memory into 256M chunks is something that's
needed to provide flexibility for arbitrary partition creation (a set
of hardware allocated that's used for, essentially, a bare metal OS
install). Whether that's many small partitions for load balanced server
farms (or whatever) or much larger partitions for for demanding
applications, such as Oracle systems, is not something that can be
known in advance.
So the division into small memory chunks can't change.
The question of sysfs node creation, what uses them and when they
are used is much harder.
I'm not able to find that out and, I doubt even IBM would know, if
their customers use applications that need to consult the sysfs
file system for this information or when it's needed if it is need
at all. So I'm stuck on this question.
One thing is for sure though, it would be (at the very least) risky
for a vendor to assume they either aren't needed or aren't needed early
during system start up.
OTOH I've looked at what gets invoked on udev notifications (which
is the source of the heavy path walk activity, I admit I need to
dig deeper) and that doesn't appear to be doing anything obviously
wrong so that far seems ok.
For my part, as long as the series proves to be sound, why not,
it does substantially reduce contention between the VFS and sysfs
is the face of heavy sysfs path walk activity so I think that
stands alone as sufficient to consider the change worthwhile.
Ian