Re: RFC: fsyscall

From: Eric W. Biederman
Date: Wed Sep 09 2015 - 15:40:34 EST


David Drysdale <drysdale@xxxxxxxxxx> writes:

> On Wed, Sep 9, 2015 at 1:25 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>> > On Tue, Sep 8, 2015 at 4:07 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>
>> >> Perhaps I had missed it but I don't recall capsicum being able to wrap
>> >> things like reboot(2).
>> >>
>> >
>> > Ah, so you want to be able to grant BPF-defined capabilities :)
>>
>> Pretty much.
>>
>> Where I am focusing is turning Posix capabilities into real
>> capabilities. I would not mind if the functionality was a bit more
>> general. Say to be able to handle things like security labels, or
>> anywhere else you might reasonably be asked can you do X?
>>
>> But I would be happy if we just managed to wrap the Posix capabilities
>> and turned them into real capabilities.
>
> Interesting idea! So kind of like the "object" in question is the root
> role, and the different rights for the corresponding object-capability
> (the file descriptor) are the POSIX capabilities (in the simple case
> at least).
>
> And yes, Capsicum doesn't generally interact with things like reboot(2);
> its checks are on top of any DAC policies rather than instead of them,
> as it's a hybrid rather than a pure object-capability system.
>
>> > Off the top of my head, I think that doing this using a nice IPC
>> > mechanism (which barely exists in Linux, but which seL4 and binder (!)
>> > can do very cleanly) would be simpler and more general, if less
>> > self-contained.
>>
>> Less self-contained becomes a problem when you want to pass them between
>> processes written at different times by different people. If there
>> is something conceptually simple we can implement in the kernel it
>> becomes worth it because that becomes the standard which everyone knows
>> to code to.
>>
>> > (Aside: how on earth does anyone think that replacing binder with
>> > kdbus makes any sense? Binder can pass capabilities, and kdbus can't.
>> > OTOH, maybe Android doesn't use the capability-passing ability.)
>>
>> kdbus has file descriptor passing. Beyond that no comment.
>>
>> >> Which really describes what I am trying to tackle. How do we create an
>> >> object that we can pass between processes that limits what we can do in
>> >> the case of the oddball syscalls that require special privileges.
>> >>
>> >> At the same time I still want the caller to be able to pass in data to
>> >> the system calls being called such as REBOOT_CMD_POWER_OFF versus
>> >> REBOOT_CMD_HALT, while being able to filter it and say you may not pass
>> >> REBOOT_CMD_CAD_OFF.
>> >>
>> >
>> > We could have a conservative whitelist of syscalls for which we allow
>> > this usage. I'm a bit worried that there will be very limited use
>> > cases, given that a lot of use cases will want to follow pointers,
>> > which has TOCTOU problems.
>>
>> Time of check to time of use problems. Interesting point.
>>
>> TOCTOU seems to make filtering of system calls in general much less
>> viable than I had hoped or imagined, and seems to be one of the better
>> arguments I have heard against ioctls.
>
> By the way, Robert Watson (one of the progenitors of Capsicum, as it
> happens) has a nice paper about TOCTOU attacks on syscall interposition
> layers that's a good read:
> http://www.watson.org/~robert/2007woot/
>
> (From this perspective, the limitation that seccomp-bpf programs only
> have access to syscall arguments by-value is actually a help -- the filter
> can't look into user memory, so can't be fooled by having memory
> contents changed underneath it. Of course, if the eBPF stuff ever
> changes that we should watch out...)
>
>> I think the cases I care about are much less likely to have TOCTOU
>> problems than system calls in general, so I still may be ok.
>>
>> However it does seem like past a certain point for good filtering the
>> entire syscall ABI needs to be turned into well defined IPC. Ick!
>
> That's roughly one of Robert's suggestions (section 8.2).
>
>> Sigh. I guess it is about time I dig up the places we call capable.
>> Ugh 1696 places in the kernel.. Even filtering out CAP_SYS_ADMIN and
>> CAP_NET_ADMIN the list is longer than I can easily look at.
>>
>> Still reboot isn't a problem ;)
>>
>> Thinking about the TOCTOU problems with system call filtering the only
>> general solution I can see is to handle it like the compat syscalls
>> but instead of copying things into a temporary buffer in userspace
>> we copy the data into a temporary in-kernel buffer (filter the system call)
>> fs = get_fs();
>> set_fs(get_ds());
>> /* Call the system call */
>> set_fs(fs);
>>
>> I don't like the whole set_fs() thing (especially if there is any data
>> we did not manage to copy). But it seems like a good conceptual start.
>
> Doing the copies sounds like it would involve understanding & reproducing
> the memory layouts for every syscall pointer argument, which would be a
> lot of code. Or am I misunderstanding something?

Which is what we have for ioctls and some of the system calls in the
compat case. So it is something that has been done before. However I
am going to leave the TOCTOU mess to another time.
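
As a rough userspace illustration of the copy-then-filter idea (not
kernel code, and not something anyone in this thread has written): the
sketch below snapshots a caller-supplied argument block once, filters
the private copy, and hands that same copy to the handler, so a later
change to the caller's memory cannot alter what was checked. struct
demo_args, demo_filtered_call() and the two callbacks are hypothetical
names invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical argument block that a caller passes by pointer. */
struct demo_args {
	unsigned int cmd;
	char name[16];
};

/* A filter returns nonzero to allow the call, zero to reject it. */
typedef int (*demo_filter_fn)(const struct demo_args *snapshot);
typedef long (*demo_handler_fn)(const struct demo_args *snapshot);

/*
 * Copy the caller's arguments exactly once, run the filter on the
 * private copy, then pass the same copy to the handler. The handler
 * never looks at caller_args again, so the data that was checked is
 * the data that gets used.
 */
long demo_filtered_call(const struct demo_args *caller_args,
			demo_filter_fn filter, demo_handler_fn handler)
{
	struct demo_args snapshot;

	assert(caller_args != NULL && filter != NULL && handler != NULL);
	memcpy(&snapshot, caller_args, sizeof(snapshot));

	if (!filter(&snapshot))
		return -EPERM;	/* rejected by the filter */

	return handler(&snapshot);
}

/* Example filter: only small command numbers are allowed. */
int demo_allow_small_cmd(const struct demo_args *snapshot)
{
	return snapshot->cmd < 10;
}

/* Example handler: report which command it was asked to run. */
long demo_echo_cmd(const struct demo_args *snapshot)
{
	return (long)snapshot->cmd;
}
```

The cost David points out is visible even here: the wrapper has to know
the full layout of struct demo_args to copy it, and a real version would
need that knowledge for every pointer argument it wants to filter.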

Suppose I assume that anything file descriptor based needs its own
mechanism to filter what is allowed on a file descriptor (capsicum
perhaps?). That handily reduces the problem space, and removes almost
all cases where reading data from userspace is interesting, as I am
then talking about pure system calls.

The system calls which are not file descriptor based are listed
below. Most of those don't take weird parameter structures that would
be interesting to filter. So I think my fsyscall idea is conceptually
reasonable. It is not a complete solution for passing someone a well
defined subset of what you are allowed to do, but it is interesting.
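
To make the reboot case concrete, here is a minimal sketch of a
by-value filter, assuming the restriction is simply an allow-list of
command values. Nothing here is an existing kernel interface: the
demo_* names are made up for illustration, and only the
LINUX_REBOOT_CMD_* constants (the <linux/reboot.h> spellings of the
REBOOT_CMD_* names used earlier in the thread) are real. Because the
command is passed by value, there is no user memory to re-read and so
no TOCTOU window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/reboot.h>

/* Hypothetical allow-list attached to a "reboot" capability object. */
static const unsigned int demo_reboot_allowed[] = {
	LINUX_REBOOT_CMD_POWER_OFF,
	LINUX_REBOOT_CMD_HALT,
};

/* Return true only if cmd appears in the allow-list (default deny). */
bool demo_cmd_allowed(const unsigned int *allowed, size_t count,
		      unsigned int cmd)
{
	size_t i;

	assert(allowed != NULL || count == 0);
	for (i = 0; i < count; i++) {
		if (allowed[i] == cmd)
			return true;
	}
	return false;
}

/* Convenience wrapper that checks against the reboot allow-list above. */
bool demo_reboot_cmd_allowed(unsigned int cmd)
{
	return demo_cmd_allowed(demo_reboot_allowed,
				sizeof(demo_reboot_allowed) /
				sizeof(demo_reboot_allowed[0]),
				cmd);
}
```

With this list, demo_reboot_cmd_allowed(LINUX_REBOOT_CMD_HALT) is true
while LINUX_REBOOT_CMD_CAD_OFF and LINUX_REBOOT_CMD_RESTART are refused.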

Eric

open
stat
lstat
mprotect
munmap
brk
rt_sigaction
rt_sigprocmask
rt_sigreturn
access
pipe
sched_yield
mremap
msync
mincore
madvise
shmget
shmat
shmctl
pause
nanosleep
getitimer
alarm
setitimer
getpid
socket
socketpair
clone
fork
vfork
execve
exit
wait4
kill
uname
semget
semop
semctl
shmdt
msgget
msgsnd
msgrcv
msgctl
truncate
getcwd
chdir
rename
mkdir
rmdir
creat
link
unlink
symlink
readlink
chmod
chown
lchown
umask
gettimeofday
getrlimit
getrusage
sysinfo
times
ptrace
getuid
syslog
getgid
setuid
setgid
geteuid
getegid
setpgid
getppid
getpgrp
setsid
setreuid
setregid
getgroups
setgroups
setresuid
getresuid
setresgid
getresgid
getpgid
setfsuid
setfsgid
getsid
capget
capset
rt_sigpending
rt_sigtimedwait
rt_sigqueueinfo
rt_sigsuspend
sigaltstack
utime
mknod
uselib
personality
ustat
statfs
sysfs
getpriority
setpriority
sched_setparam
sched_getparam
sched_setscheduler
sched_getscheduler
sched_get_priority_max
sched_get_priority_min
sched_rr_get_interval
mlock
munlock
mlockall
munlockall
vhangup
modify_ldt
pivot_root
_sysctl
prctl
arch_prctl
adjtimex
setrlimit
chroot
sync
acct
settimeofday
mount
umount2
swapon
swapoff
reboot
sethostname
setdomainname
iopl
ioperm
create_module
init_module
delete_module
get_kernel_syms
query_module
quotactl
nfsservctl
gettid
setxattr
lsetxattr
getxattr
lgetxattr
listxattr
llistxattr
removexattr
lremovexattr
tkill
time
futex
sched_setaffinity
sched_getaffinity
set_thread_area
io_setup
io_destroy
io_getevents
io_submit
io_cancel
get_thread_area
lookup_dcookie
epoll_create
epoll_ctl_old
epoll_wait_old
remap_file_pages
set_tid_address
restart_syscall
semtimedop
timer_create
timer_settime
timer_gettime
timer_getoverrun
timer_delete
clock_settime
clock_gettime
clock_getres
clock_nanosleep
exit_group
epoll_wait
epoll_ctl
tgkill
utimes
vserver
mbind
set_mempolicy
get_mempolicy
mq_open
mq_unlink
mq_timedsend
mq_timedreceive
mq_notify
mq_getsetattr
kexec_load
waitid
add_key
request_key
keyctl
ioprio_set
ioprio_get
inotify_init
inotify_add_watch
inotify_rm_watch
migrate_pages
unshare
set_robust_list
get_robust_list
splice
tee
sync_file_range
vmsplice
move_pages
utimensat
epoll_pwait
signalfd
timerfd_create
eventfd
fallocate
signalfd4
eventfd2
epoll_create1
pipe2
inotify_init1
rt_tgsigqueueinfo
perf_event_open
fanotify_init
prlimit64
clock_adjtime
getcpu
process_vm_readv
process_vm_writev
kcmp
sched_setattr
sched_getattr
seccomp
getrandom
memfd_create
bpf