Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd

From: Stefan Metzmacher

Date: Tue Apr 28 2026 - 10:28:15 EST


On 28.04.26 at 15:39, Stefan Metzmacher wrote:
On 27.04.26 at 17:48, Christian Brauner wrote:
On Sun, Apr 12, 2026 at 03:54:33PM +0200, Jori Koolstra wrote:
Currently there is no way to race-freely create and open a directory.
For regular files we have open(O_CREAT) for creating a new file inode,
and returning a pinning fd to it. The lack of such functionality for
directories means that when populating a directory tree there's always
a race involved: the inodes first need to be created, and then opened
to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
but in the time window between the creation and the opening they might
be replaced by something else.

Addressing this race without proper APIs is possible (by immediately
fstat()ing what was opened, to verify that it has the right inode type),
but difficult to get right. Hence, mkdirat2() that creates a directory
and returns an O_DIRECTORY fd is useful.
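
For illustration, the two-step pattern and the fstat() check described
above could be sketched in userspace roughly as follows. This is a minimal
sketch and not code from the patch; mkdir_then_open() is a hypothetical
helper name. Note that the check only confirms the inode type, it cannot
prove that the opened directory is the one that was just created.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): create "name" below dirfd
 * and return an O_DIRECTORY fd for it, or -1 with errno set.
 */
static int mkdir_then_open(int dirfd, const char *name, mode_t mode)
{
	struct stat st;
	int fd, saved;

	if (mkdirat(dirfd, name, mode) < 0)
		return -1;

	/* Race window: "name" may be replaced before openat() runs. */
	fd = openat(dirfd, name,
		    O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Verify the type of what was actually opened. */
	if (fstat(fd, &st) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	if (!S_ISDIR(st.st_mode)) {
		close(fd);
		errno = ENOTDIR;
		return -1;
	}

	/* Still not proof that this is the inode mkdirat() created. */
	return fd;
}
```

A combined create-and-open operation would close the window between the
mkdirat() and openat() calls, which is what the patch is after.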

This feature idea (and description) is taken from the UAPI group:
https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes

Signed-off-by: Jori Koolstra <jkoolstra@xxxxxxxxx>
---
  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
  fs/internal.h                          |  2 ++
  fs/namei.c                             | 44 +++++++++++++++++++++++---
  include/linux/syscalls.h               |  2 ++
  include/uapi/asm-generic/unistd.h      |  5 ++-
  scripts/syscall.tbl                    |  1 +
  6 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da..e200ca2067a4 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
  469    common    file_setattr        sys_file_setattr
  470    common    listns            sys_listns
  471    common    rseq_slice_yield    sys_rseq_slice_yield
+472    common    mkdirat2        sys_mkdirat2
  #
  # Due to a historical design error, certain syscalls are numbered differently
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..c6a79afadacf 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -59,6 +59,8 @@ int may_linkat(struct mnt_idmap *idmap, const struct path *link);
  int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
           struct filename *newname, unsigned int flags);
  int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
+struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
+        unsigned int flags, bool open);
  int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
  int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
  int filename_linkat(int olddfd, struct filename *old, int newdfd,
diff --git a/fs/namei.c b/fs/namei.c
index a880454a6415..6451e96dc225 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5255,18 +5255,36 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
  }
  EXPORT_SYMBOL(vfs_mkdir);
-int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
+static int mkdirat_lookup_flags(unsigned int flags)
+{
+    int lookup_flags = LOOKUP_DIRECTORY;
+
+    if (!(flags & AT_SYMLINK_NOFOLLOW))
+        lookup_flags |= LOOKUP_FOLLOW;
+    if (!(flags & AT_NO_AUTOMOUNT))
+        lookup_flags |= LOOKUP_AUTOMOUNT;
+
+    return lookup_flags;
+}
+
+int filename_mkdirat(int dfd, struct filename *name, umode_t mode) {
+    return PTR_ERR_OR_ZERO(do_file_mkdirat(dfd, name, mode, 0, false));
+}
+
+struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
+        unsigned int flags, bool open)
  {
      struct dentry *dentry;
      struct path path;
      int error;
-    unsigned int lookup_flags = LOOKUP_DIRECTORY;
+    struct file *filp = NULL;
+    unsigned int lookup_flags = mkdirat_lookup_flags(flags);
      struct delegated_inode delegated_inode = { };
  retry:
      dentry = filename_create(dfd, name, &path, lookup_flags);
      if (IS_ERR(dentry))
-        return PTR_ERR(dentry);
+        return ERR_CAST(dentry);
      error = security_path_mkdir(&path, dentry,
              mode_strip_umask(path.dentry->d_inode, mode));
@@ -5276,6 +5294,10 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
          if (IS_ERR(dentry))
              error = PTR_ERR(dentry);
      }
+    if (open && !error && !is_delegated(&delegated_inode)) {
+        const struct path new_path = { .mnt = path.mnt, .dentry = dentry };
+        filp = dentry_open(&new_path, O_DIRECTORY, current_cred());
+    }

So definitely a patchset worth doing, but this will be hairy. And
Mateusz is right: as written this doesn't work. The canonical pattern,
as used by e.g. dentry_open(), is to preallocate the file.

I do wonder though whether we shouldn't just make O_CREAT | O_DIRECTORY
work. I remember making a vague comment about this a few years ago
(cf. [1]). It might even be less hairy to get that one right, as all
the thinking for O_CREAT is already there.

What was the rationale for mkdirat2() instead of threading this through
openat()/openat2() with O_CREAT?

And side-question: @Jeff, can nfs atomic open deal with O_CREAT |
O_DIRECTORY?

If it helps: the SMB2/3 protocol has only a single SMB2 Create operation,
which uses either FILE_CREATE+FILE_NON_DIRECTORY_FILE or
FILE_CREATE+FILE_DIRECTORY_FILE.
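
As a rough sketch of that single-operation model: the same request
carries FILE_CREATE as the disposition, and only one option bit selects
file versus directory. The constant values below are the ones from the
MS-SMB2 specification; struct smb2_create_sketch and
smb2_exclusive_create() are hypothetical names for illustration and are
not taken from Samba or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* CreateDisposition / CreateOptions values as defined in MS-SMB2. */
#define FILE_CREATE             0x00000002u
#define FILE_DIRECTORY_FILE     0x00000001u
#define FILE_NON_DIRECTORY_FILE 0x00000040u

/* Hypothetical, reduced view of an SMB2 Create request. */
struct smb2_create_sketch {
	uint32_t create_disposition;
	uint32_t create_options;
};

/* Build an "exclusive create" request for a file or a directory. */
static struct smb2_create_sketch smb2_exclusive_create(bool directory)
{
	struct smb2_create_sketch req = {
		.create_disposition = FILE_CREATE,
		.create_options = directory ? FILE_DIRECTORY_FILE
					    : FILE_NON_DIRECTORY_FILE,
	};

	return req;
}
```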

Given that openat() ignores unknown flags and flag combinations, maybe
this should be openat2() only, and even a new flag (at least for the
userspace interface). Alternatively, do_sys_open() would reject it for
open() and openat().
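
To illustrate why openat2() is the safer home for a new flag: it
validates how.flags and is expected to fail with EINVAL on bits it does
not know, so userspace can detect missing support. A minimal sketch
follows, under these assumptions: struct open_how_sketch is a local
mirror of struct open_how from <linux/openat2.h>, bit 62 is not an
assigned open flag, and 437 is used as the syscall number only if the
libc headers do not define SYS_openat2.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437	/* assumption: x86-64 / asm-generic number */
#endif

/* Local mirror of struct open_how; the name is ours. */
struct open_how_sketch {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Assumption: this bit is not an assigned open flag. */
#define SKETCH_UNKNOWN_FLAG (UINT64_C(1) << 62)

/* Thin wrapper around the raw syscall; returns an fd or -1 with errno. */
static int openat2_sketch(int dirfd, const char *path, uint64_t flags)
{
	struct open_how_sketch how = { .flags = flags };

	return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

Calling openat2_sketch(AT_FDCWD, ".", O_RDONLY | O_DIRECTORY |
SKETCH_UNKNOWN_FLAG) should fail rather than silently open the
directory, whereas open() and openat() mask off flag bits they do not
recognise.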

I just found the interaction of __O_TMPFILE and O_DIRECTORY; given that,
there should be an O_MKDIR or something similar that's openat2() only.

While we're at it, an O_TMPDIR would also be wonderful to have.
Currently Samba works around its absence by using a hidden directory
name that is invisible to SMB clients, but NFS and local users still
see it.

That should also be openat2 only if added.

metze