Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set

From: Andy Lutomirski
Date: Thu Apr 02 2015 - 10:26:48 EST

On Thu, Apr 2, 2015 at 5:31 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> * Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>> On 04/02/2015 01:14 PM, Brian Gerst wrote:
>> >>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>> >>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>> >>>> that the chance that %rcx matches return-%rip is astronomical - this
>> >>>> is why this bug wasn't noticed live so far.
>> >>>>
>> >>>> So should we really be doing this?
>> >>>
>> >>> Andy does this not for the off-chance that userspace's RCX is equal
>> >>> to return address and R11 == RFLAGS. The chances of that are
>> >>> astronomically small.
>> >>>
>> >>> This code path triggers when ptrace/audit/seccomp is active. Instead
>> >>> of torturing ourselves trying to not divert into IRET return, now
>> >>> code is steered that way. But then immediately before actual IRET,
>> >>> we check again: "do we really need IRET?" IOW "did ptrace really
>> >>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>> >>> cases will not be true.
>> >>
>> >> I keep forgetting about that, my test systems have the audit muck
>> >> turned off ;-)
>> >>
>> >> Fair enough - and it's sensible to share the IRET path between
>> >> interrupts and complex-return system calls, even though the check
>> >> is unnecessary overhead for the pure interrupt return path...
>> >
>> >
>> > Maybe we could reintroduce TIF_IRET for this purpose instead of
>> > (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic
>> > check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
>> > it for interrupts.
>>
>> The very first check in the existing code, pt_regs->cx ==
>> pt_regs->ip, will fail for interrupt returns.
>>
>> You hardly can save anything by placing a (ti->flags &
>> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.
>
> Well, what I was thinking of was to have a pure irq (well, async
> context) return path, not shared with the weird-syscall-IRET return
> path at all ...
>
> It would be open coded, not obfuscated via macros.
>
> That way AFAICS the upsides are:
>
>  - it's easier to read (and maintain) what goes on in which case.
>    '*intr*' labels would truly identify interrupt return related
>    processing, for a change!
>
>  - we can optimize in a more directed fashion - like here
>
> ... while the downsides are:
>
>  - more code
>
>  - a (small) chance of a fix going to one path while not the other.
>
> How much extra code would it be?

Negative if we did it right, perhaps.

I think the best approach is a complete rewrite, not an attempt to
incrementally improve it. The current code is held together by gotos
and baling wire, and I'm surprised it works at all. Some of it seems
to work by accident AFAICT. For example, the sysret audit "fast path"
that I deleted as part of the opportunistic sysret work was quite
buggy AFAICT, but no one who looks at this code cares about audit, and
it's nearly impossible to even tell what the code is supposed to do.

Linus tried to rewrite some of it last year, but it was incomplete.
Here's my vague inventory of what the exit paths need to do on return
to userspace:

- syscall_trace_leave [1] (syscall only)

- context tracking if TIF_NOHZ (all exits)

- one-shot work. These are things that must be done if the flags are
set, and doing it clears the flags. These flags need to be checked
with IRQs off, and IRQs cannot be re-enabled afterwards. This
includes signal delivery, scheduling, and user return notifiers. (all
exits)

- uprobes. Maybe this counts as one-shot work. (all exits, I assume)

- Check for sysret applicability (only important for syscalls)

- espfix, iret, unless we're using sysret (all exits)

There's also the special case of interrupt returns to kernel mode.
That should schedule in preemptable kernels.
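
The one-shot work item in the inventory above can be sketched in C
roughly as follows. This is a user-space model, not kernel code:
struct task, the WORK_* bits, and the irq_disable()/irq_enable()
helpers are all hypothetical stand-ins, and the handlers are reduced
to a counter. It only shows the shape of the loop: check the flags
with IRQs off, do the work with IRQs on, and re-check until nothing is
pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits. The names follow the work items listed above;
 * the values and the layout are illustrative only. */
enum {
    WORK_SIGPENDING   = 1u << 0,  /* signal delivery */
    WORK_NEED_RESCHED = 1u << 1,  /* scheduling */
    WORK_USER_RETURN  = 1u << 2,  /* user return notifiers */
};

struct task {
    unsigned int work;  /* pending one-shot work flags */
    bool irqs_on;       /* stand-in for the CPU interrupt flag */
    int handled;        /* number of work items that ran */
};

static void irq_disable(struct task *t) { t->irqs_on = false; }
static void irq_enable(struct task *t)  { t->irqs_on = true; }

/* Run one-shot work until none is pending. Returns with IRQs off;
 * the caller must not re-enable them before returning to userspace. */
static void exit_to_user_work(struct task *t)
{
    for (;;) {
        irq_disable(t);
        unsigned int work = t->work;  /* flags are sampled with IRQs off */
        if (!work)
            break;                    /* nothing pending: leave IRQs off */

        t->work = 0;                  /* doing the work clears the flags */
        irq_enable(t);                /* handlers run with IRQs on */

        if (work & WORK_NEED_RESCHED) t->handled++;
        if (work & WORK_SIGPENDING)   t->handled++;
        if (work & WORK_USER_RETURN)  t->handled++;
    }
    assert(!t->irqs_on);              /* invariant on exit */
}
```

In this model a handler that sets a new flag while running is simply
picked up by the next pass of the loop, which is the reason for
re-checking with IRQs off rather than checking once.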

To me, this suggests that we should have a total of four asm exit paths:

- syscall return. IMO this should call a single C function, check
the sysret conditions, and jump to the espfix code.

- paranoid return to kernel. This should just return after a
possible swapgs. (We do this now.)

- interrupt/exception return. IMO this should call a single C
function and jump to the espfix code. [2]

- NMI -- this is its own crazy thing.

All of the check/careful/very_careful crap can just go. Good riddance.
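
For the "check the sysret conditions" step in the syscall return path,
a rough C model might look like the following. struct frame is a
hypothetical, trimmed-down register frame, and the selector values and
the canonical-address test are assumptions of this sketch rather than
anything stated in this thread; the cx/ip, r11/flags, ss, and TF checks
follow the discussion quoted above and the subject of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down register frame. */
struct frame {
    uint64_t ip, cx, r11, flags;
    uint16_t cs, ss;
};

#define FLAG_TF (1ull << 8)  /* x86 trap flag */

/* Selector values are assumptions for this sketch. */
#define USER_CS 0x33
#define USER_SS 0x2b

/* True if bits 63..47 of ip are all zero or all one
 * (canonical for 48-bit virtual addresses). */
static bool canonical(uint64_t ip)
{
    uint64_t top = ip >> 47;
    return top == 0 || top == 0x1ffff;
}

/* SYSRET reloads RIP from RCX and RFLAGS from R11 and uses fixed
 * selectors, so it can only be used if the saved frame already
 * matches what SYSRET would restore. Otherwise fall back to IRET. */
static bool can_sysret(const struct frame *f)
{
    assert(f != NULL);
    return f->cx == f->ip &&
           f->r11 == f->flags &&
           !(f->flags & FLAG_TF) &&
           f->cs == USER_CS &&
           f->ss == USER_SS &&
           canonical(f->ip);
}
```

Any frame that ptrace or a signal handler has modified would fail one
of these comparisons and take the IRET path in this model.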

[1] syscall_trace_leave contains context tracking hooks that are
AFAICT completely unnecessary. That's almost okay, though -- other
similarly silly context tracking hooks fix up the mess. This might
explain part of why context tracking is so incredibly slow.

[2] I don't even think we need the retint_careful vs retint_kernel
distinction in asm. If we ditch that, we should be able to replace it
with something like:

    call prepare_intr_exception_return
    testl %ebx,%ebx
    jz 1f

Of course, prepare_intr_exception_return would need to DTRT if we're
returning to kernel mode, but that's easy.
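
As a hedged illustration of what "DTRT" could mean there, here is a
user-space C model. Only the function name comes from the text above;
its signature, its return value (nonzero when returning to user mode),
struct frame, and struct cpu_state are all assumptions made up for
this sketch, and the real work is replaced by counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical saved frame: only the code segment selector matters here. */
struct frame {
    unsigned short cs;
};

/* Hypothetical per-CPU state standing in for kernel internals. */
struct cpu_state {
    bool preemptible;    /* preemptable kernel, preemption currently allowed */
    bool need_resched;   /* a reschedule has been requested */
    int schedules;       /* times the kernel-mode path rescheduled */
    int user_work_runs;  /* times the user-mode exit work ran */
};

/* The low two bits of CS hold the privilege level; 3 is user mode. */
static bool user_mode(const struct frame *f)
{
    return (f->cs & 3) == 3;
}

/* Returns nonzero if the interrupted context was user mode. */
static int prepare_intr_exception_return(const struct frame *f,
                                         struct cpu_state *s)
{
    if (user_mode(f)) {
        s->user_work_runs++;  /* stand-in for the one-shot exit work */
        return 1;
    }
    /* Return to kernel mode: reschedule only in a preemptable kernel. */
    if (s->preemptible && s->need_resched) {
        s->need_resched = false;
        s->schedules++;       /* stand-in for preempting the task */
    }
    return 0;
}
```

With the user/kernel decision made in C like this, the asm side only
has to test one result, which is the point of dropping the
retint_careful vs retint_kernel split.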
