Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd

From: Paulo Alcantara

Date: Tue Apr 28 2026 - 10:31:31 EST


Stefan Metzmacher <metze@xxxxxxxxx> writes:

> On 27.04.26 at 17:48, Christian Brauner wrote:
>> On Sun, Apr 12, 2026 at 03:54:33PM +0200, Jori Koolstra wrote:
>>> Currently there is no way to race-freely create and open a directory.
>>> For regular files we have open(O_CREAT) for creating a new file inode,
>>> and returning a pinning fd to it. The lack of such functionality for
>>> directories means that when populating a directory tree there's always
>>> a race involved: the inodes first need to be created, and then opened
>>> to adjust their permissions/ownership/labels/timestamps/acls/xattrs/...,
>>> but in the time window between the creation and the opening they might
>>> be replaced by something else.
>>>
>>> Addressing this race without proper APIs is possible (by immediately
>>> fstat()ing what was opened, to verify that it has the right inode type),
>>> but difficult to get right. Hence, mkdirat2() that creates a directory
>>> and returns an O_DIRECTORY fd is useful.
>>>
>>> This feature idea (and description) is taken from the UAPI group:
>>> https://github.com/uapi-group/kernel-features?tab=readme-ov-file#race-free-creation-and-opening-of-non-file-inodes
>>>
>>> Signed-off-by: Jori Koolstra <jkoolstra@xxxxxxxxx>
>>> ---
>>> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>>> fs/internal.h | 2 ++
>>> fs/namei.c | 44 +++++++++++++++++++++++---
>>> include/linux/syscalls.h | 2 ++
>>> include/uapi/asm-generic/unistd.h | 5 ++-
>>> scripts/syscall.tbl | 1 +
>>> 6 files changed, 50 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>>> index 524155d655da..e200ca2067a4 100644
>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>>> @@ -396,6 +396,7 @@
>>> 469 common file_setattr sys_file_setattr
>>> 470 common listns sys_listns
>>> 471 common rseq_slice_yield sys_rseq_slice_yield
>>> +472 common mkdirat2 sys_mkdirat2
>>>
>>> #
>>> # Due to a historical design error, certain syscalls are numbered differently
>>> diff --git a/fs/internal.h b/fs/internal.h
>>> index cbc384a1aa09..c6a79afadacf 100644
>>> --- a/fs/internal.h
>>> +++ b/fs/internal.h
>>> @@ -59,6 +59,8 @@ int may_linkat(struct mnt_idmap *idmap, const struct path *link);
>>> int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
>>> struct filename *newname, unsigned int flags);
>>> int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
>>> +struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
>>> + unsigned int flags, bool open);
>>> int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
>>> int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
>>> int filename_linkat(int olddfd, struct filename *old, int newdfd,
>>> diff --git a/fs/namei.c b/fs/namei.c
>>> index a880454a6415..6451e96dc225 100644
>>> --- a/fs/namei.c
>>> +++ b/fs/namei.c
>>> @@ -5255,18 +5255,36 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>>> }
>>> EXPORT_SYMBOL(vfs_mkdir);
>>>
>>> -int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
>>> +static int mkdirat_lookup_flags(unsigned int flags)
>>> +{
>>> + int lookup_flags = LOOKUP_DIRECTORY;
>>> +
>>> + if (!(flags & AT_SYMLINK_NOFOLLOW))
>>> + lookup_flags |= LOOKUP_FOLLOW;
>>> + if (!(flags & AT_NO_AUTOMOUNT))
>>> + lookup_flags |= LOOKUP_AUTOMOUNT;
>>> +
>>> + return lookup_flags;
>>> +}
>>> +
>>> +int filename_mkdirat(int dfd, struct filename *name, umode_t mode) {
>>> + return PTR_ERR_OR_ZERO(do_file_mkdirat(dfd, name, mode, 0, false));
>>> +}
>>> +
>>> +struct file *do_file_mkdirat(int dfd, struct filename *name, umode_t mode,
>>> + unsigned int flags, bool open)
>>> {
>>> struct dentry *dentry;
>>> struct path path;
>>> int error;
>>> - unsigned int lookup_flags = LOOKUP_DIRECTORY;
>>> + struct file *filp = NULL;
>>> + unsigned int lookup_flags = mkdirat_lookup_flags(flags);
>>> struct delegated_inode delegated_inode = { };
>>>
>>> retry:
>>> dentry = filename_create(dfd, name, &path, lookup_flags);
>>> if (IS_ERR(dentry))
>>> - return PTR_ERR(dentry);
>>> + return ERR_CAST(dentry);
>>>
>>> error = security_path_mkdir(&path, dentry,
>>> mode_strip_umask(path.dentry->d_inode, mode));
>>> @@ -5276,6 +5294,10 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
>>> if (IS_ERR(dentry))
>>> error = PTR_ERR(dentry);
>>> }
>>> + if (open && !error && !is_delegated(&delegated_inode)) {
>>> + const struct path new_path = { .mnt = path.mnt, .dentry = dentry };
>>> + filp = dentry_open(&new_path, O_DIRECTORY, current_cred());
>>> + }
>>
>> So definitely a patchset worth doing but this will be hairy. And
>> Mateusz is right. As written this doesn't work. The canonical pattern
>> how e.g., dentry_open() does it is to preallocate the file.
>>
>> I do wonder though whether we shouldn't just make O_CREAT | O_DIRECTORY
>> work. I remember that I made a vague comment about this a few years
>> ago (cf. [1]). It might even be less hairy to get that one right
>> as all the thinking for O_CREAT is already there.
>>
>> What was the rationale for mkdirat2() instead of threading this through
>> openat()/openat2() with O_CREAT?
>>
>> And side-question: @Jeff, can nfs atomic open deal with O_CREAT |
>> O_DIRECTORY?
>
> If it helps, the SMB2/3 protocol only has a single SMB2 Create operation
> that uses FILE_CREATE+FILE_NON_DIRECTORY_FILE or FILE_CREATE+FILE_DIRECTORY_FILE.

Yes. However, cifs.ko handles atomic open of regular files only.

IIRC, NFS doesn't handle atomic opens of directories either. Jeff
could confirm that.
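
For reference, the userspace workaround described in the patch description
(mkdir, then open, then fstat() to check the inode type) looks roughly like
the sketch below. mkdir_and_open() is a hypothetical helper name and is not
part of the patch. It uses only existing syscalls, so the race window between
mkdirat() and openat() is still there: the fstat() check confirms that a
directory was opened, not that it is the directory that was just created.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: create `name` relative to
 * `dfd` and return an O_DIRECTORY fd for it, or a negative errno value.
 */
static int mkdir_and_open(int dfd, const char *name, mode_t mode)
{
	struct stat st;
	int fd, err;

	if (mkdirat(dfd, name, mode) < 0)
		return -errno;

	/*
	 * Race window: between mkdirat() and openat() the name can be
	 * replaced. O_NOFOLLOW refuses a symlink in the last component and
	 * O_DIRECTORY refuses anything that is not a directory.
	 */
	fd = openat(dfd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* Verify the inode type of what was actually opened. */
	if (fstat(fd, &st) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	if (!S_ISDIR(st.st_mode)) {
		close(fd);
		return -ENOTDIR;
	}

	return fd;
}
```

A caller would then apply ownership, xattrs and so on through the returned fd
(fchown(), fsetxattr(), ...) rather than by path. A syscall that creates and
opens in one step would make the second lookup, and with it the window,
unnecessary.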