Re: [GIT PULL] execve updates for v6.8-rc1

From: Linus Torvalds
Date: Tue Jan 09 2024 - 22:55:01 EST


On Tue, 9 Jan 2024 at 18:21, Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:
>
> Instead, here are some numbers from Linus's suggested benchmark
> (modified to use execvpe, and to count down rather than up so it doesn't
> need two arguments; modified version and benchmark driver script
> attached; compiled with `musl-gcc -Wall -O3 -s -static`):

Yeah, that's better. I had actually only benchmarked the success case.

And I see what the problem is: the "sane" way that only opens the
pathname once does so using

file = do_filp_open(fd, name, &open_exec_flags);

and the "open path twice" does the first open with

retval = filename_lookup(AT_FDCWD, filename, 0, &path, NULL);

and guess what the difference is?

The difference is that path_openat() starts out with

file = alloc_empty_file(op->open_flag, current_cred());

and when the open fails, it will free the file with

fput(file);

So if there are a lot of failures (because "." is at the end of the
path), it will have done a fair amount of those useless file
allocations and frees.

And - not surprisingly - the "open once" is then faster if there are
*not* a lot of failures, when the executable is found early in the
PATH.

Now, there's no fundamental *reason* to do that alloc_empty_file()
early, except for how the code is structured.

It partly makes the error handling simpler and since all the cases
want the filp in the end, doing it at the top means that it's only
done once.

And we occasionally do use the file pointer early (ie lookup_open()
will clear/set FMODE_CREATED in it even if it doesn't otherwise touch
the file pointer) even before the final lookup - and at creation time,
atomic_open() will actually want it for the final lookup.

Anyway, the real fix is definitely to just fix do_filp_open() to at
least not be *too* eager to allocate a file pointer.

In fact, that code is oddly non-optimal for another reason: it does
that "allocate and free file" not just when the path lookup fails, but
it does it for things like RCU lookup failures too.

So what happens is that if RCU lookup fails, do_filp_open() will call
path_openat() twice: first with LOOKUP_RCU, and then without it. And
path_openat() will allocate that "struct file *" twice.

On NFS - or other filesystems that can return ESTALE - it will in fact
do it three times.

That's pretty disgusting.

Al, comments? We *could* just special-case the execve() code not to
use do_filp_open() and avoid this issue that way, but it does feel
like even the regular open() case is pessimal with that whole RCU
situation.

Linus