Re: [PATCH] vfs: elide smp_mb in iversion handling in the common case

From: Mateusz Guzik
Date: Tue Aug 27 2024 - 06:22:08 EST


On Tue, Aug 27, 2024 at 12:00 PM Jan Kara <jack@xxxxxxx> wrote:
>
> On Thu 15-08-24 10:33:10, Mateusz Guzik wrote:
> > According to bpftrace on these routines most calls result in cmpxchg,
> > which already provides the same guarantee.
> >
> > In inode_maybe_inc_iversion elision is possible because even if the
> > wrong value was read due to now missing smp_mb fence, the issue is going
> > to correct itself after cmpxchg. If it appears cmpxchg wont be issued,
> > the fence + reload are there bringing back previous behavior.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>
> > ---
> >
> > chances are this entire barrier guarantee is of no significance, but i'm
> > not signing up to review it
>
> Jeff might have a ready answer here - added to CC. I think the barrier is
> needed in principle so that you can guarantee that after a data change you
> will be able to observe an i_version change.
>

That is the description, I am saying it is unclear if anyone needs it
and that I am not interested in spending time on finding out.

> > I verified the force flag is not *always* set (but it is set in the most
> > common case).
>
> Well, I'm not convinced the more complicated code is really worth it.
> 'force' will be set when we update timestamps which happens once per tick
> (usually 1-4 ms). So that is common case on lightly / moderately loaded
> system. On heavily write(2)-loaded system, 'force' should be mostly false
> and unless you also heavily stat(2) the modified files, the common path is
> exactly the "if (!force && !(cur & I_VERSION_QUERIED))" branch. So saving
> one smp_mb() on moderately loaded system per couple of ms (per inode)
> doesn't seem like a noticeable win...
>

inode_maybe_inc_iversion is used a lot and most commonly *with* the force flag.

You can try it out yourself: bpftrace -e
'kprobe:inode_maybe_inc_iversion { @[kstack(), arg1] = count(); }'

and run your favourite fs-using workload, for example this is a top
backtrace form few seconds of building the kernel:

@[
inode_maybe_inc_iversion+5
inode_update_timestamps+238
generic_update_time+19
file_update_time+125
shmem_file_write_iter+118
vfs_write+599
ksys_write+103
do_syscall_64+82
entry_SYSCALL_64_after_hwframe+118
, 1]: 1670

the '1' at the end indicates 'force' flag set to 1.

This also shows up on unlink et al.

The context here is that vfs is dog slow single-threaded in part due
to spurious barriers sneaked in all over the place, here is another
example:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=30eb7cc03875b508ce6683c8b3cf6e88442a2f87

--
Mateusz Guzik <mjguzik gmail.com>