Re: [PATCH] lib/genalloc: use try_cmpxchg in {set,clear}_bits_ll

From: Mateusz Guzik
Date: Mon Jan 23 2023 - 19:11:34 EST


On 1/23/23, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Jan 23, 2023 at 7:59 AM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>> Basic idea would be to have competing CPUs aggregate the total change
>> on the lock and then have one of them make it happen across the
>> interconnect -- so instead of add/sub 1 you would add/sub n, where n
>> can very well be quite big.
>
> No. The use is literally serialized, and done one at a time, and needs
> to be synchronous.
>
> We're talking about looking up a <i>single</i> path component, while
> at the same time making sure that that path component isn't being
> removed by another CPU - and there is by definition no other locking
> going on, because the lockref *is* the locking.
>
[snip]
>
> Thanks to RCU pathname lookup, the lockref thing *should* come into
> play only when you actually fall out of RCU mode, ie for the last
> component. That's a huge win, because that's what avoids the whole
> "everybody hammers on the root/pwd dentry counts".
>
> Your benchmark that looked up the *same* last component in parallell
> and all the time is not really a real load.
>
[snip2]
>
> So I'm not claiming lockref is perfect, but I do suspect you're
> barking up the wrong tree here. The optimizations you are talking
> about are either not realistic in the first place (because
> serialization and locking) or have already mostly been done (ie
> avoiding the op entirely except for the last component).
>

That was a minor remark in the spirit of attempting to reduce pingpong,
only made because of the paper. It was not a serious proposal, but
perhaps I failed to make it clear enough.

A more serious remark was in the last paragraph.

> HOWEVER. There is one special exception that might be interesting and
> that has never been done: 'fstat()' and friends could possibly avoid
> the "try_to_unlazy()" even for the last component.
>
> IOW, we might be able to do fstatat() without ever even finalizing the
> RCU state of the pathname, and actually looking up the stat
> information while still in RCU mode, and instead of doing the actual
> final lockref_get_not_dead() to turn an RCU path into a real
> ref-counted path, we could just do the "get the stat info, then do
> read_seqcount_retry() to verify that the RCU walk _and_ the stat info
> is still valid".
>

Ignoring mentioning specific routines by name is precisely what
I proposed in that e-mail. To quote myself:

> Personally, for lockref, I think the way to go forward is to use it less
> to begin with and to look for ways to convert it into a lock xadd-able
> primitive instead. The "doing it less thing" could be done by
> implementing a RCU-only mode for select syscalls, which defaults to
> opportunistically avoid refing squat. If the caller manages to do what
> it needed to do, that is a massive win; otherwise refs get taken. This
> could work well in the common case for syscalls like statx and access,
> but would not for open. Frankly I'm surprised something like this is
> not already implemented (maybe I missed a major showstopper?).

Now back to your response:

> But I'm not convinced that final complexity would be worth it. It
> sounds like a potentially fun and interesting exercise (Al Viro added
> to particupants so that he can say "No! That's not 'fun and exciting',
> that's just crazy!") if somebody really wants to, but see above: the
> last path component is very seldom something that sees contention. You
> look up the same root/pwd over and over again, but nobody sane looks
> up the same full pathname over and over again.
>
[snip]
> But since you clearly were looking at fstatat() performance, I can
> only point you at this case and say "There's _potentially_ upside
> there".
>

So if you strace something like gcc compiling stuff you will find:
- some access calls on shared dirs, for example:
78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
78533 access("/usr/lib/gcc/x86_64-linux-gnu/11/", X_OK) = 0
- same with newfstatat:
87428 newfstatat(AT_FDCWD, "./arch/x86/include",
{st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
87428 newfstatat(AT_FDCWD, "./arch/x86/include/generated",
{st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
87428 newfstatat(AT_FDCWD, "./include", {st_mode=S_IFDIR|0755,
st_size=4096, ...}, 0) = 0
87428 newfstatat(AT_FDCWD, "./arch/x86/include/uapi",
{st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
- there is also quite a bit of readlink:
87502 readlink("/tmp", 0x7ffe28847ac0, 1023) = -1 EINVAL (Invalid argument)
87502 readlink("/tmp/ccTh37oI.s", 0x7ffe28847ac0, 1023) = -1 EINVAL
(Invalid argument)

that last bit is glibc doing realpath(). A case can be made for making
realpath into a syscall instead, but I'm not going to flame over for
the time being. :)

Anyhow, my point is that 2 gcc compiling different files do invoke path
lookups with the same terminal path components and lockref stuff gets
bounced around when it happens.

On another point how to end up dealing with lockref less, I have to
note glibc switched fstat(fd, buf) to use newfstatat(fd, "", buf,
AT_EMPTY_PATH) internally. This adds a lot of work single-threaded as
it forces a path lookup with associated stac/clac trip, but more
importantly the kernel no longer relies on liveness of the dentry
provided by the file object, but uses lockref instead. Something which
allows to get data produced by newfstatat without forcing a lookup
sounds like a welcome addition. Same goes for statx which seems to be
the recommended syscall. I'll be writing about this to -fsdevel some
time later.

--
Mateusz Guzik <mjguzik gmail.com>