Re: [RFC PATCH for 4.18 1/2] rseq: validate rseq_cs fields are < TASK_SIZE

From: Mathieu Desnoyers
Date: Thu Jun 28 2018 - 18:29:27 EST


----- On Jun 28, 2018, at 5:22 PM, Linus Torvalds torvalds@xxxxxxxxxxxxxxxxxxxx wrote:

> On Thu, Jun 28, 2018 at 1:23 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> This is okay with me for a fix outside the merge window. Can you do a
>> followup for the next merge window that fixes it better, though? In
>> particular, TASK_SIZE is generally garbage. I think a better fix
>> would be something like adding a new arch-overridable helper like:
>>
>> static inline unsigned long current_max_user_addr(void) { return TASK_SIZE; }
>
> We already have that. It's called "user_addr_max()".
>
> It's the limit we use for user accesses.

Good point. I will simply replace TASK_SIZE with user_addr_max() then.

> That said, I don't see why we should even check the IP. It's not like
> that's done by signal handling either.

The goal stated by Andy and Thomas is to design rseq so it
does not need any compat syscall whatsoever. Considering
that this is a new system call, we should be able to do that.
One thing we want is to provide a consistent behavior when
a 32-bit binary is executed on native 32-bit kernel or on
64-bit kernel in compat mode, even if userspace chooses to
put garbage in the upper 32-bit of padding within 64-bit
fields.

Now let's look at the comparison with signals. If we look
at ia32_setup_frame() for instance, it sets:

regs->ip = (unsigned long) ksig->ka.sa.sa_handler;

where the sa_handler pointer has been received as input
as a 32-bit pointer by the compat syscall rt_sigaction
through struct compat_sigaction. I agree with you that
it does not appear to validate that it's below
user_addr_max(), but at least it's guaranteed to never be
over 32-bit.

The first big difference between rseq and signals: rseq
does not have a compat structure. The layout is the same
for both 32-bit and 64-bit userspace (see include/uapi/linux/rseq.h).

Another difference between rseq and signals is that rseq
only registers the TLS struct rseq for each thread. Then,
it's up to user-space to update the rseq_cs field of
struct rseq to indicate that it enters a critical section.

So the actual content of (struct rseq *)->rseq_cs is updated
with single-copy atomicity after registration of rseq.
Therefore, the current critical section pointed to by the current
rseq_cs user-space pointer also changes after registration. So we
cannot validate the content of the rseq_cs field, nor of the fields
contained within every possible struct rseq_cs descriptor when
registering rseq through sys_rseq: those need to be read when
returning to user-space. Not just on return from system call, but
also on return from interrupt/trap after a preemption.

This is very much different from registering a sigaction,
where the kernel can validate or truncate the content of
sa_handler at will.

Without validation of the content of e.g. rseq_cs->abort_ip
(read as a 64-bit integer by a 64-bit kernel), we end up setting
the return IP to that address on abort, even though user-space may
have put garbage in the high bits:

instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);

without any validation or truncation whatsoever.

I'm concerned that some architecture code may not deal so well without
prior validation or truncation of the IP register content upper 32 bits
when returning to a 32-bit compat task.

This is why I'm considering comparison of abort_ip against user_addr_max()
to ensure we're not provided with an incorrect user input.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com