Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd

From: Mateusz Guzik

Date: Fri Apr 24 2026 - 06:13:00 EST

On Sun, Apr 12, 2026 at 03:54:33PM +0200, Jori Koolstra wrote:
> Currently there is no way to race-freely create and open a directory.
> For regular files we have open(O_CREAT) for creating a new file inode,
> and returning a pinning fd to it. The lack of such functionality for
> directories means that when populating a directory tree there's always
> a race involved: the inodes first need to be created, and then opened
> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
> but in the time window between the creation and the opening they might
> be replaced by something else.
>
> Addressing this race without proper APIs is possible (by immediately
> fstat()ing what was opened, to verify that it has the right inode type),
> but difficult to get right. Hence, mkdirat2() that creates a directory
> and returns an O_DIRECTORY fd is useful.
>
> This feature idea (and description) is taken from the UAPI group:
> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>
> @@ -5276,6 +5294,10 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
>  		if (IS_ERR(dentry))
>  			error = PTR_ERR(dentry);
>  	}
> +	if (open && !error && !is_delegated(&delegated_inode)) {
> +		const struct path new_path = { .mnt = path.mnt, .dentry = dentry };
> +		filp = dentry_open(&new_path, O_DIRECTORY, current_cred());
> +	}
>  	end_creating_path(&path, dentry);
>  	if (is_delegated(&delegated_inode)) {
>  		error = break_deleg_wait(&delegated_inode);
> 2.53.0
>

Last time around I pointed out fd allocation being an issue.

The general problem is the introduction of a failure point after mkdir
itself succeeds, as there is no way to backpedal from it.

With the patch as proposed this remains a factor -- dentry_open itself
can fail due to inability to allocate a file obj, and even if that
succeeds there are several ways for do_dentry_open to error out.
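
To make the failure window concrete, here is a rough userspace analogue
(not the kernel patch itself) with the same shape: the directory is
created first and the fd is obtained afterwards. Both function names are
made up for this illustration. The second helper forces the open step to
fail by dropping the soft RLIMIT_NOFILE to 0, standing in for any of the
ways the open can fail after the mkdir.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Userspace analogue only: create the directory, then try to get an fd
 * for it. Returns the fd, or -errno on failure. If the open step fails,
 * the directory has already been created and is left behind.
 */
static int mkdir_then_open(int dfd, const char *name, mode_t mode)
{
	int fd;

	if (mkdirat(dfd, name, mode) < 0)
		return -errno;

	/* Failure point after the point of no return. */
	fd = openat(dfd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -errno;
	return fd;
}

/*
 * Hypothetical helper for the demonstration: run mkdir_then_open() with
 * the soft RLIMIT_NOFILE at 0, so fd allocation fails with EMFILE, then
 * restore the previous limit.
 */
static int mkdir_then_open_no_fds(int dfd, const char *name, mode_t mode)
{
	struct rlimit old, tmp;
	int ret;

	if (getrlimit(RLIMIT_NOFILE, &old) < 0)
		return -errno;
	tmp = old;
	tmp.rlim_cur = 0;
	if (setrlimit(RLIMIT_NOFILE, &tmp) < 0)
		return -errno;
	ret = mkdir_then_open(dfd, name, mode);
	setrlimit(RLIMIT_NOFILE, &old);
	return ret;
}
```

With the limit at 0 the call returns -EMFILE, yet the directory exists
afterwards. That is the state the caller cannot back out of.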

For the patch to be viable some rototilling is needed to make it so that
all the prep is done before issuing the mkdir. The only thing which can
legally happen after is installation of the file obj in the fd table.

Now that I've said it, the open handling is already buggy in that way.
do_open has the following:

	error = may_open(idmap, &nd->path, acc_mode, open_flag);
	if (!error && !(file->f_mode & FMODE_OPENED))
		error = vfs_open(&nd->path, file);
	if (!error)
		error = security_file_post_open(file, op->acc_mode);
	if (!error && do_truncate)
		error = handle_truncate(idmap, file);
	if (unlikely(error > 0)) {
		WARN_ON(1);
		error = -EINVAL;
	}

Suppose O_CREAT was passed.

There is no attempt to recover from the LSM returning an error, in which
case the file is left on the fs. The only LSM even using the hook is
ima. Even if the user being able to create the file implies the LSM
check will pass anyway, the inode itself is not locked so root can sneak
in to chmod it and trigger a failure. Suppose that's not important.

Things proceed to handle_truncate:
	int error = get_write_access(inode);
	if (error)
		return error;

	error = security_file_truncate(filp);
	if (!error) {
		error = do_truncate(idmap, path->dentry, 0,
				    ATTR_MTIME|ATTR_CTIME|ATTR_OPEN,
				    filp);
	}

I'm going to ignore the LSM situation and do_truncate failure modes in this one.

AFAICS nothing prevents the same user from racing against file creation to
execve it, which starts with exe_file_deny_write_access. Should the
other thread win the race, get_write_access will fail and the WARN_ON
splat will be generated. That is definitely a problem.