Re: [PATCH v22 07/12] landlock: Support filesystem access-control

From: Jann Horn
Date: Wed Oct 28 2020 - 21:07:17 EST

Next message: Jann Horn: "Re: [PATCH v22 03/12] landlock: Set up the security framework and manage credentials"
Previous message: Jann Horn: "Re: For review: seccomp_user_notif(2) manual page"
In reply to: Mickaël Salaün: "[PATCH v22 07/12] landlock: Support filesystem access-control"
Next in thread: Mickaël Salaün: "Re: [PATCH v22 07/12] landlock: Support filesystem access-control"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

(On Tue, Oct 27, 2020 at 9:04 PM Mickaël Salaün <mic@xxxxxxxxxxx> wrote:
> Thanks to the Landlock objects and ruleset, it is possible to identify
> inodes according to a process's domain. To enable an unprivileged
> process to express a file hierarchy, it first needs to open a directory
> (or a file) and pass this file descriptor to the kernel through
> landlock_add_rule(2). When checking if a file access request is
> allowed, we walk from the requested dentry to the real root, following
> the different mount layers. The access to each "tagged" inodes are
> collected according to their rule layer level, and ANDed to create
> access to the requested file hierarchy. This makes possible to identify
> a lot of files without tagging every inodes nor modifying the
> filesystem, while still following the view and understanding the user
> has from the filesystem.
>
> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> keep the same struct inodes for the same inodes whereas these inodes are
> in use.
>
> This commit adds a minimal set of supported filesystem access-control
> which doesn't enable to restrict all file-related actions. This is the
> result of multiple discussions to minimize the code of Landlock to ease
> review. Thanks to the Landlock design, extending this access-control
> without breaking user space will not be a problem. Moreover, seccomp
> filters can be used to restrict the use of syscall families which may
> not be currently handled by Landlock.
[...]
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
[...]
> +/**
> + * DOC: fs_access
> + *
> + * A set of actions on kernel objects may be defined by an attribute (e.g.
> + * &struct landlock_path_beneath_attr) including a bitmask of access.
> + *
> + * Filesystem flags
> + * ~~~~~~~~~~~~~~~~
> + *
> + * These flags enable to restrict a sandbox process to a set of actions on

s/sandbox/sandboxed/

[...]
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
[...]
> +static const struct landlock_object_underops landlock_fs_underops = {
> + .release = release_inode
> +};
[...]
> +/* Access-control management */
> +
> +static bool check_access_path_continue(
> + const struct landlock_ruleset *const domain,
> + const struct path *const path, const u32 access_request,
> + bool *const allow, u64 *const layer_mask)
> +{
> + const struct landlock_rule *rule;
> + const struct inode *inode;
> + bool next = true;
> +
> + prefetch(path->dentry->d_parent);

IIRC software prefetch() turned out to only rarely actually have a
performance benefit, and they often actually make things worse; see
e.g. <https://lwn.net/Articles/444336/>. Unless you have strong
evidence that this actually brings a performance benefit, I'd probably
get rid of this.

> + if (d_is_negative(path->dentry))
> + /* Continues to walk while there is no mapped inode. */
> + return true;
> + inode = d_backing_inode(path->dentry);
> + rcu_read_lock();
> + rule = landlock_find_rule(domain,
> + rcu_dereference(landlock_inode(inode)->object));
> + rcu_read_unlock();
> +
> + /* Checks for matching layers. */
> + if (rule && (rule->layers | *layer_mask)) {
> + *allow = (rule->access & access_request) == access_request;
> + if (*allow) {
> + *layer_mask &= ~rule->layers;
> + /* Stops when a rule from each layer granted access. */
> + next = !!*layer_mask;
> + } else {
> + next = false;
> + }
> + }
> + return next;
> +}
> +
> +static int check_access_path(const struct landlock_ruleset *const domain,
> + const struct path *const path, u32 access_request)
> +{
> + bool allow = false;
> + struct path walker_path;
> + u64 layer_mask;
> +
> + if (WARN_ON_ONCE(!domain || !path))
> + return 0;
> + /*
> + * Allows access to pseudo filesystems that will never be mountable
> + * (e.g. sockfs, pipefs), but can still be reachable through
> + * /proc/self/fd .
> + */
> + if ((path->dentry->d_sb->s_flags & SB_NOUSER) ||
> + (d_is_positive(path->dentry) &&
> + unlikely(IS_PRIVATE(d_backing_inode(path->dentry)))))
> + return 0;
> + if (WARN_ON_ONCE(domain->nb_layers < 1))
> + return -EACCES;
> +
> + layer_mask = GENMASK_ULL(domain->nb_layers - 1, 0);
> + /*
> + * An access request which is not handled by the domain should be
> + * allowed.
> + */
> + access_request &= domain->fs_access_mask;
> + if (access_request == 0)
> + return 0;
> + walker_path = *path;
> + path_get(&walker_path);
> + /*
> + * We need to walk through all the hierarchy to not miss any relevant
> + * restriction.
> + */
> + while (check_access_path_continue(domain, &walker_path, access_request,
> + &allow, &layer_mask)) {

The logic in this code might be clearer if
check_access_path_continue() just returns whether the rule permitted
the access. Then it'd look like:

bool allow = false;
[...]
while (check_access_path_continue(domain, &walker_path,
access_request, &layer_mask)) {
if (layer_mask == 0) {
allow = true;
break;
}
[...]
}

I think that would make it clearer under which conditions we can end
up returning "true" from check_access_path().

(The current code also looks correct to me, I just think it'd be
clearer this way. If you disagree, you can keep it as-is.)

> + struct dentry *parent_dentry;
> +
> +jump_up:
> + /*
> + * Does not work with orphaned/private mounts like overlayfs
> + * layers for now (cf. ovl_path_real() and ovl_path_open()).
> + */
> + if (walker_path.dentry == walker_path.mnt->mnt_root) {
> + if (follow_up(&walker_path)) {
> + /* Ignores hidden mount points. */
> + goto jump_up;
> + } else {
> + /*
> + * Stops at the real root. Denies access
> + * because not all layers have granted access.
> + */
> + allow = false;
> + break;
> + }
> + }
> + if (unlikely(IS_ROOT(walker_path.dentry))) {
> + /*
> + * Stops at disconnected root directories. Only allows
> + * access to internal filesystems (e.g. nsfs which is
> + * reachable through /proc/self/ns).
> + */
> + allow = !!(walker_path.mnt->mnt_flags & MNT_INTERNAL);
> + break;
> + }
> + parent_dentry = dget_parent(walker_path.dentry);
> + dput(walker_path.dentry);
> + walker_path.dentry = parent_dentry;
> + }
> + path_put(&walker_path);
> + return allow ? 0 : -EACCES;
> +}
[...]
> +static inline u32 get_file_access(const struct file *const file)
> +{
> + u32 access = 0;
> +
> + if (file->f_mode & FMODE_READ) {
> + /* A directory can only be opened in read mode. */
> + if (S_ISDIR(file_inode(file)->i_mode))
> + return LANDLOCK_ACCESS_FS_READ_DIR;
> + access = LANDLOCK_ACCESS_FS_READ_FILE;
> + }
> + /*
> + * A LANDLOCK_ACCESS_FS_APPEND could be added but we also need to check
> + * fcntl(2).
> + */

Once https://lore.kernel.org/linux-api/20200831153207.GO3265@xxxxxxxxxxxxxxxxxxxxx/
lands, pwritev2() with RWF_NOAPPEND will also be problematic for
classifying "write" vs "append"; you may want to include that in the
comment. (Or delete the comment.)

> + if (file->f_mode & FMODE_WRITE)
> + access |= LANDLOCK_ACCESS_FS_WRITE_FILE;
> + /* __FMODE_EXEC is indeed part of f_flags, not f_mode. */
> + if (file->f_flags & __FMODE_EXEC)
> + access |= LANDLOCK_ACCESS_FS_EXECUTE;
> + return access;
> +}
[...]

Next message: Jann Horn: "Re: [PATCH v22 03/12] landlock: Set up the security framework and manage credentials"
Previous message: Jann Horn: "Re: For review: seccomp_user_notif(2) manual page"
In reply to: Mickaël Salaün: "[PATCH v22 07/12] landlock: Support filesystem access-control"
Next in thread: Mickaël Salaün: "Re: [PATCH v22 07/12] landlock: Support filesystem access-control"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]