Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd

From: Christian Brauner

Date: Mon May 11 2026 - 08:06:03 EST

On Mon, May 04, 2026 at 07:41:15PM +0200, Jori Koolstra wrote:
>
> > On 27-04-2026 17:48 CEST, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > So definitely a patchset worth doing but this will be hairy. And
> > Mateusz is right. As written this doesn't work. The canonical pattern
> > how e.g., dentry_open() does it is to preallocate the file.
> >
>
> Is this because of Mateusz's point that we should fail as soon as possible
> to prevent any fs changes from taking effect?
>
> But like Mateusz points out, this is not really happening for open() with
> O_CREAT either. So is there any policy for what we do and do not tolerate?
> (although I agree we should definitely preallocate the file; thanks for
> pointing that pattern out).

Your version can fail in a lot more cases than O_CREAT because the file
is allocated last, which is just not acceptable.

And a misbehaving LSM that ends up preventing opening a created file is
really not our concern. The system can behave in all non-standard ways
with mandatory access control.

The other concern that was brought up in some version is truncate but I
really don't understand what that is supposed to be about. O_TRUNC and
O_CREAT really don't get in the way of each other in the way people
think they would.

Just look at the FMODE_CREATED case. If that's raised on do_open() then
O_TRUNC is ignored for very obvious reasons.

The only case where O_TRUNC with O_CREAT matters is if the file did
already exist which also implies O_EXCL isn't raised. In that case this
ends up as a regular truncate request and then it is possible to hit the
handle_truncate() codepath. And there it really doesn't matter. A
concurrent exec or truncate that prevents you from O_CREAT | O_TRUNC
seems perfectly benign if you didn't actually create the file in the
first place. The O_TRUNC would only be honored if we did not end up
creating the file. If someone else raced us in doing their own truncate
or is attempting an exec then we should most certainly not get to
truncate over them. Failing is the right thing to do here.
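
A minimal userspace sketch of the cases above, using only open(2),
write(2), stat(2) and fstat(2). The helper names (path_size,
append_bytes, open_creat_trunc) are made up for illustration and are not
kernel functions. It cannot observe FMODE_CREATED directly; it only shows
that truncation is visible solely when the file already existed and
O_EXCL was not set.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: size of `path`, or -errno. */
static long path_size(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -errno;
	return (long)st.st_size;
}

/* Hypothetical helper: append `len` bytes to an existing file; 0 or -errno. */
static int append_bytes(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_APPEND | O_CLOEXEC);
	ssize_t n;
	int err = 0;

	if (fd < 0)
		return -errno;
	n = write(fd, buf, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;
	close(fd);
	return err;
}

/*
 * Hypothetical helper: open with O_CREAT | O_TRUNC (plus O_EXCL if `excl`
 * is non-zero) and return the file size seen through the new fd, or -errno.
 */
static long open_creat_trunc(const char *path, int excl)
{
	int flags = O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC |
		    (excl ? O_EXCL : 0);
	struct stat st;
	long ret;
	int fd = open(path, flags, 0600);

	if (fd < 0)
		return -errno;
	ret = fstat(fd, &st) < 0 ? -errno : (long)st.st_size;
	close(fd);
	return ret;
}
```

On a path that does not exist yet, open_creat_trunc(path, 0) creates an
empty file, so there is nothing for O_TRUNC to do. After appending data,
open_creat_trunc(path, 1) fails with -EEXIST and leaves the data alone,
while open_creat_trunc(path, 0) is a plain truncate of the existing file.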

>
> > I do wonder though whether we shouldn't just make O_CREAT | O_DIRECTORY
> > work. I remember that I had a vague comment about this a few
> > years ago (cf. [1]). It might even be less hairy to get that one right
> > as all the thinking for O_CREAT is already there.
> >
> > What was the rationale for mkdirat2() instead of threading this through
> > openat()/openat2() with O_CREAT?
> >
>
> Because of Mateusz' objection, but I agree with Aleksa (and you in 2023)
> that this is intuitive and you mentioned POSIX allows for it.
>
> But a more general issue, that also applies to this mkdirat2 patch,
> is Linus' objection in that same thread.[1] However, the use-case of

mkdirat2() is objectively the worse API. It forces userspace to use a
separate system call without any reason whatsoever. If you can do
O_CREAT you should also be able to do O_DIRECTORY in the same system
call. If we support O_DIRECTORY | O_CREAT we get all the lookup
restriction niceties (RESOLVE_*) for free. Plus, it is supportable both in
openat() and openat2() because I made that combo return an errno.
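
For context, a sketch of what userspace has to do today to get a
directory fd for a directory it just created: mkdirat(2) followed by a
separate openat(2) with O_DIRECTORY. The function name mkdir_then_open is
hypothetical. The point is the window between the two calls, which a
single create-and-open operation would close.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `name` under `dirfd` and return an
 * O_DIRECTORY fd for it, or -errno on failure.
 */
static int mkdir_then_open(int dirfd, const char *name, mode_t mode)
{
	int fd;

	if (mkdirat(dirfd, name, mode) < 0)
		return -errno;

	/*
	 * Window: between mkdirat() and openat() another process may
	 * rename, remove, or replace `name`. O_NOFOLLOW and O_DIRECTORY
	 * reject a symlink or non-directory, but they cannot prove that
	 * what we open is the directory we just created.
	 */
	fd = openat(dirfd, name,
		    O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -errno;	/* the directory may exist, but we hold no fd */

	return fd;
}
```

Note the second failure path: the directory has already been created when
openat() fails, so the caller is left with a filesystem change and no fd
for it.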

UAPI design often is a nasty mix of performance (context switches),
separation of concerns and privileges, tastefulness, and compromises you
never thought or wanted to make.

I think here it is pretty clear that O_DIRECTORY | O_CREAT is the right
thing to do. Instead of restructuring a bunch of codepaths so it can be
plumbed through to the filesystems we just reuse the existing codepaths
that give us the right context for free.

And during LSFMM the VFS maintainers all agreed to proceed with
O_DIRECTORY | O_CREAT.