Re: [PATCH] x86,seccomp,prctl: Remove PR_TSC_SIGSEGV and seccomp TSC filtering
From: Andy Lutomirski
Date: Fri Oct 03 2014 - 16:22:54 EST
On Fri, Oct 3, 2014 at 1:14 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Fri, Oct 03, 2014 at 10:27:47AM -0700, Andy Lutomirski wrote:
>> [adding linux-api. whoops.]
>>
>> On Fri, Oct 3, 2014 at 10:18 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> > PR_SET_TSC / PR_TSC_SIGSEGV is a security feature to prevent heavily
>> > sandboxed programs from learning the time, presumably to avoid
>> > disclosing the wall clock and to make timing attacks much harder to
>> > exploit.
>> >
>> > Unfortunately, this feature is very insecure, for multiple reasons,
>> > and has probably been insecure since before it was written.
>> >
>> > Weakness 1: Before Linux 3.16, the vvar page and the HPET (!) were
>> > part of the kernel's fixmap, so any user process could read them.
>> > The vvar page contains low-resolution timing information (with real
>> > wall clock and frequency data), and the HPET can be used for high
>> > precision timing. Even in Linux 3.16, there is no clean way to disable
>> > access to these pages.
>> >
>> > Weakness 2: On most configurations, most or all userspace processes
>> > have unrestricted access to RDPMC, which is even better than RDTSC
>> > for exploiting timing attacks.
>> >
>> > I would like to fix both of these issues. I want to deny access to
>> > RDPMC to processes that haven't asked for access via
>> > perf_event_open. I also want to implement real TSC blocking, which
>> > will require some vdso enhancements
>
> So the problem with the default deny is that it's:
> 1) pointless -- the attacker can do sys_perf_event_open() just fine;
Not if the attacker is in a seccomp sandbox.
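For illustration, here is a minimal sketch of such a sandbox: a seccomp filter that makes perf_event_open fail with EPERM. The helper names deny_perf_event_open() and perf_event_open_is_blocked() are made up for this example. The filter deliberately skips the seccomp_data.arch check that a real filter should perform, so treat it as a sketch and not as a hardened sandbox.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: install a seccomp filter that rejects
 * perf_event_open with EPERM and allows every other syscall.
 * Simplification: no seccomp_data.arch check is done here.
 * Returns 0 on success, -1 if the filter could not be installed.
 */
static int deny_perf_event_open(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* perf_event_open: fall through to the ERRNO return.
         * Anything else: skip one instruction and allow. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_perf_event_open, 0, 1),
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Needed so an unprivileged task may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}

/*
 * Hypothetical helper: returns 1 if perf_event_open fails with EPERM.
 * A NULL attr is passed on purpose, so the call cannot succeed even
 * when no filter is installed.
 */
static int perf_event_open_is_blocked(void)
{
    errno = 0;
    long ret = syscall(__NR_perf_event_open, NULL, 0, -1, -1, 0UL);
    if (ret >= 0) {
        close((int)ret);
        return 0;
    }
    return errno == EPERM;
}
```

Once deny_perf_event_open() has returned 0, the task cannot obtain a perf_event fd, so a default-deny RDPMC policy would leave it with no counter access.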
> 2) and expensive -- the people trying to measure performance get the
> penalty of the CR4 write.
Does this matter for performance measuring? I'm not 100% clear on how all
the perf_event stuff gets used in practice, but, by my very vague
understanding, there are two main workflows:
a) perf record, etc: one process creates a ringbuffer and wakes up
rarely to record the contents. The process being recorded doesn't
have a perf_event mapped, so the cr4 switch will only happen when
waking up the perf process.
perf record prints stuff like "[ perf record: Woken up 1 times to
write data ]", which seems to confirm my understanding.
b) self-monitoring. A task mmaps a perf_event, does rdpmc, does
something, and does rdpmc again. In that case, there's no context
switch.
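To make workflow (b) concrete, here is a rough sketch of a self-monitoring task. It follows the seqlock-style read loop described in the comments of linux/perf_event.h; the function names self_monitor_open() and self_monitor_read() are invented for this example, and the choice of a user-only cycle counter is an assumption.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__) || defined(__i386__)
#define HAVE_RDPMC 1
/* Read hardware counter number `counter` with the RDPMC instruction. */
static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}
#else
#define HAVE_RDPMC 0
static inline uint64_t rdpmc(uint32_t counter) { (void)counter; return 0; }
#endif

/*
 * Hypothetical helper: open a user-space cycle counter for the calling
 * thread and map its control page. Returns NULL on any failure.
 */
static volatile struct perf_event_mmap_page *self_monitor_open(int *fd_out)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return NULL;

    void *page = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_READ,
                      MAP_SHARED, fd, 0);
    if (page == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return page;
}

/*
 * Hypothetical helper: read the event count without a syscall.
 * Returns 0 on success, -1 if user-space RDPMC is not usable right now.
 */
static int self_monitor_read(volatile struct perf_event_mmap_page *pc,
                             uint64_t *count)
{
    uint32_t seq, idx, width;
    int64_t offset;
    uint64_t raw;

    do {
        seq = pc->lock;
        __sync_synchronize();
        idx = pc->index;
        width = pc->pmc_width;
        if (!HAVE_RDPMC || !pc->cap_user_rdpmc || idx == 0 ||
            width == 0 || width > 64)
            return -1;
        offset = pc->offset;
        raw = rdpmc(idx - 1);
        __sync_synchronize();
    } while (pc->lock != seq); /* retry if the kernel updated the page */

    /* Sign-extend the raw counter value from `width` bits to 64 bits. */
    int64_t pmc = (int64_t)(raw << (64 - width)) >> (64 - width);
    *count = (uint64_t)(offset + pmc);
    return 0;
}
```

A task would call self_monitor_read() before and after the region of interest and subtract the two counts. Neither read enters the kernel, which is why no CR4 switch is involved in this workflow.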
>
> So I would suggest a default on, but allow a disable for the seccomp
> users, which might have also disabled the syscall. Note that it is
> possible to disable RDPMC while still allowing the syscall.
Disabling RDPMC per-process while still allowing the syscall will need
a bunch of work, right? What happens if the same perf_event is mapped
by two different users?
We could make the rule be that RDPMC is enabled if a perf event is
mmapped or TIF_SECCOMP is clear, but I'd prefer to be convinced that
there's an actual performance issue first. Ideally we can get this
all working with no API or ABI change at all.
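Spelled out as code, that rule is a one-line predicate. The struct and field names below are purely illustrative stand-ins and do not correspond to real kernel data structures.

```c
#include <stdbool.h>

/* Illustrative per-task state, not a kernel structure. */
struct task_sketch {
    bool seccomp_active;  /* stands in for TIF_SECCOMP being set */
    int  perf_mmap_count; /* perf events this task currently has mmapped */
};

/*
 * RDPMC is enabled if a perf event is mmapped or TIF_SECCOMP is clear.
 * Only a seccomp task with no mapped perf event loses access.
 */
static bool rdpmc_allowed(const struct task_sketch *t)
{
    return t->perf_mmap_count > 0 || !t->seccomp_active;
}
```

Under this rule an ordinary task never pays for a CR4 write, and a sandboxed task only gets RDPMC if it already held a mapped perf event.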
P.S. Hey, Intel, let us context switch RDPMC accessibility of the
individual counters, please :)
--Andy