Re: [PATCH v17 08/15] seccomp: add system call filtering using BPF
From: Kees Cook
Date: Fri Apr 06 2012 - 16:44:52 EST
On Fri, Apr 6, 2012 at 1:23 PM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Thu, 29 Mar 2012 15:01:53 -0500
> Will Drewry <wad@xxxxxxxxxxxx> wrote:
>
>> [This patch depends on luto@xxxxxxx's no_new_privs patch:
>> https://lkml.org/lkml/2012/1/30/264
>> included in this series for ease of consumption.
>> ]
>>
>> This patch adds support for seccomp mode 2. Mode 2 introduces the
>> ability for unprivileged processes to install system call filtering
>> policy expressed in terms of a Berkeley Packet Filter (BPF) program.
>> This program will be evaluated in the kernel for each system call
>> the task makes and computes a result based on data in the format
>> of struct seccomp_data.
>> ...
>> +static void seccomp_filter_log_failure(int syscall)
>> +{
>> + int compat = 0;
>> +#ifdef CONFIG_COMPAT
>> + compat = is_compat_task();
>> +#endif
>
> hm, I'm surprised that we don't have a zero-returning implementation of
> is_compat_task() when CONFIG_COMPAT=n. Seems silly. Blames Arnd.
There is, but this chunk of the patch is removed later on in the
series. We could just drop seccomp_filter_log_failure and it's single
call entirely and depend on the later enhancement to add the logging
that is desired.
>> +static long seccomp_attach_filter(struct sock_fprog *fprog)
>> +{
>> + struct seccomp_filter *filter;
>> + unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>> + unsigned long total_insns = fprog->len;
>> + long ret;
>> +
>> + if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>> + return -EINVAL;
>> +
>> + for (filter = current->seccomp.filter; filter; filter = filter->prev)
>> + total_insns += filter->len + 4; /* include a 4 instr penalty */
>
> So tasks don't share filters? We copy them by value at fork? Do we do
> this at vfork() too?
The filter chain is shared (and refcounted).
>> + if (total_insns > MAX_INSNS_PER_PATH)
>> + return -ENOMEM;
>> +
>> + /*
>> + * Installing a seccomp filter requires that the task have
>> + * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
>> + * This avoids scenarios where unprivileged tasks can affect the
>> + * behavior of privileged children.
>> + */
>> + if (!current->no_new_privs &&
>> + security_capable_noaudit(current_cred(), current_user_ns(),
>> + CAP_SYS_ADMIN) != 0)
>> + return -EACCES;
>> +
>> + /* Allocate a new seccomp_filter */
>> + filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
>
> I think this gives userspace an easy way of causing page allocation
> failure warnings, by permitting large kmalloc() attempts. Add
> __GFP_NOWARN?
This is likely mitigated by the MAX_INSNS_PER_PATH check above, but,
yeah, there's no harm in adding __GFP_NOWARN.
>> +/* put_seccomp_filter - decrements the ref count of tsk->seccomp.filter */
>> +void put_seccomp_filter(struct task_struct *tsk)
>> +{
>> + struct seccomp_filter *orig = tsk->seccomp.filter;
>> + /* Clean up single-reference branches iteratively. */
>> + while (orig && atomic_dec_and_test(&orig->usage)) {
>> + struct seccomp_filter *freeme = orig;
>> + orig = orig->prev;
>> + kfree(freeme);
>> + }
>> +}
>
> So if one of the filters in the list has an elevated refcount, we bail
> out on the remainder of the list. Seems odd.
This so that every filter in the list doesn't need to have their
refcount raised. As long as the counting up matching the counting
down, it's fine. This allows for process trees branching the filter
list at different times still being safe. IIUC, this code was based on
how namespace refcounting is handled. I spent some time proving to
myself that it was correctly refcounted a while back. More eyes is
better, of course. :)
-Kees
--
Kees Cook
ChromeOS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/