Re: [PATCH] fs: prevent data-race due to missing inode_lock when calling vfs_getattr

From: Mateusz Guzik
Date: Tue Nov 19 2024 - 20:44:51 EST


On Mon, Nov 18, 2024 at 07:03:30AM +0000, Al Viro wrote:
> On Mon, Nov 18, 2024 at 03:00:39PM +0900, Jeongjun Park wrote:
> > All the functions that added lock in this patch are called only via syscall,
> > so in most cases there will be no noticeable performance issue.
>
> Pardon me, but I am unable to follow your reasoning.
>

I suspect the argument is that the overhead of issuing a syscall is big
enough that the extra cost of taking the lock trip wont be visible, but
that's not accurate -- atomics are measurable when added to syscalls,
even on modern CPUs.

> > And
> > this data-race is not a problem that only occurs in theory. It is
> > a bug that syzbot has been reporting for years. Many file systems that
> > exist in the kernel lock inode_lock before calling vfs_getattr, so
> > data-race does not occur, but only fs/stat.c has had a data-race
> > for years. This alone shows that adding inode_lock to some
> > functions is a good way to solve the problem without much
> > performance degradation.
>
> Explain. First of all, these are, by far, the most frequent callers
> of vfs_getattr(); what "many filesystems" are doing around their calls
> of the same is irrelevant. Which filesystems, BTW? And which call
> chains are you talking about? Most of the filesystems never call it
> at all.
>
> Furthermore, on a lot of userland loads stat(2) is a very hot path -
> it is called a lot. And the rwsem in question has a plenty of takers -
> both shared and exclusive. The effect of piling a lot of threads
> that grab it shared on top of the existing mix is not something
> I am ready to predict without experiments - not beyond "likely to be
> unpleasant, possibly very much so".
>
> Finally, you have not offered any explanations of the reasons why
> that data race matters - and "syzbot reporting" is not one. It is
> possible that actual observable bugs exist, but it would be useful
> to have at least one of those described in details.
>
[snip]

On the stock kernel it is at least theoretically possible to transiently
observe a state which is mid-update (as in not valid), but I was under
the impression this was known and considered not a problem.

Nonetheless, as an example say an inode is owned by 0:0 and is being
chowned to 1:1 and this is handled by setattr_copy.

The ids are updated one after another:
[snip]
i_uid_update(idmap, attr, inode);
i_gid_update(idmap, attr, inode);
[/snip]

So at least in principle it may be someone issuing getattr in parallel
will happen to spot 1:0 (as opposed to 0:0 or 1:1), which was never set
on the inode and is merely an artifact of hitting the timing.

This would be a bug, but I don't believe this is serious enough to
justify taking the inode lock to get out of.

Worst case, if someone is to handle this, I guess the obvious approach
introduces a sequence counter to be modified around setattr. Then
getattr could retry few times in case of seqc comparison failure and
only give up and take the lock afterwards. This would probably avoid
taking the lock in getattr for all real cases, even in face of racing
setattr.