Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

From: Andrew Cooper
Date: Fri Jul 05 2019 - 16:18:05 EST


On 05/07/2019 20:06, Andy Lutomirski wrote:
> On Fri, Jul 5, 2019 at 8:47 AM Andrew Cooper <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>> 2) The loop termination logic is interesting at best.
>>>
>>> If the machine has no TSC or cpu_khz is not known yet it tries 1
>>> million times to ack stale IRR/ISR bits. What?
>>>
>>> With TSC it uses the TSC to calculate the loop termination. It takes a
>>> timestamp at entry and terminates the loop when:
>>>
>>> (rdtsc() - start_timestamp) >= (cpu_hkz << 10)
>>>
>>> That's roughly one second.
>>>
>>> Both methods are problematic. The APIC has 256 vectors, which means
>>> that in theory max. 256 IRR/ISR bits can be set. In practice this is
>>> impossible as the first 32 vectors are reserved and not affected and
>>> the chance that more than a few bits are set is close to zero.
>> [Disclaimer. I talked to Thomas in private first, and he asked me to
>> post this publicly as the CVE is almost a decade old already.]
>>
>> I'm afraid that this isn't quite true.
>>
>> In terms of IDT vectors, the first 32 are reserved for exceptions, but
>> only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair
>> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
>>
>> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
>> I'm disappointed to see wasn't shared with other software vendors at the
>> time.
>>
>> Because TPR is 0, an incoming IPI can trigger #AC, #CP, #VC or #SX
>> without an error code on the stack, which results in a corrupt pt_regs
>> in the exception handler, and a stack underflow on the way back out,
>> most likely with a fault on IRET.
>>
>> These can be addressed by setting TPR to 0x10, which will inhibit
>> delivery of any errant IPIs in this range, but some extra sanity logic
>> may not go amiss. An error code on a 64bit stack can be spotted with
>> `testb $8, %spl` due to %rsp being aligned before pushing the exception
>> frame.
> Several years ago, I remember having a discussion with someone (Jan
> Beulich, maybe?) about how to efficiently make the entry code figure
> out the error code situation automatically. I suspect it was on IRC
> and I can't find the logs.

It was on IRC, but I don't remember exactly when, either.

> I'm thinking that maybe we should just
> make Linux's idtentry code do something like this.
>
> If nothing else, we could make idtentry do:
>
> testl $8, %esp /* shorter than testb IIRC */

Sadly not. test (unlike cmp and the basic mutative opcodes) doesn't
have a sign-extendable imm8 encoding. The two options are:

f7 c4 08 00 00 00ÂÂÂ ÂÂÂ testÂÂ $0x8,%esp
40 f6 c4 08ÂÂÂÂÂÂÂÂÂ ÂÂÂ testÂÂ $0x8,%spl

> jz 1f /* or jnz -- too lazy to figure it out */
> pushq $-1
> 1:

It is jz, and Xen does use this sequence for reserved/unimplemented
vectors, but we expect those codepaths never to be executed.

>
> instead of the current hardcoded push. The cost of a mispredicted
> branch here will be smallish compared to the absurdly large cost of
> the entry itself. But I thought I had something more clever than
> this. This sequence works, but it still feels like it should be
> possible to do better:
>
> .macro PUSH_ERROR_IF_NEEDED
> /*
> * Before the IRET frame is pushed, RSP is aligned to a 16-byte
> * boundary. After SS .. RIP and the error code are pushed, RSP is
> * once again aligned. Pushing -1 will put -1 in the error code slot
> * (regs->orig_ax) if there was no error code.
> */
>
> pushq $-1 /* orig_ax = -1, maybe */
> /* now RSP points to orig_ax (aligned) or di (misaligned) */
> pushq $0
> /* now RSP points to di (misaligned) or si (aligned) */
> orq $8, %rsp
> /* now RSP points to di */
> addq $8, %rsp
> /* now RSP points to orig_ax, and we're in good shape */
> .endm
>
> Is there a better sequence for this?

The only aspect I can think of is whether mixing the push/pops with
explicit updates updates to %rsp is better or worse than a very well
predicted branch, given that various frontends have special tracking to
reduce instruction dependencies on %rsp. I'll have to defer to the CPU
microachitects as to which of the two options is the lesser evil.

That said, both Intel and AMD's Optimisation guides have stack alignment
suggestions which mix push/sub/and on function prolog, so I expect this
is as optimised as it can reasonably be in the pipelines.

>> Another interesting problem is an IPI which its vector 0x80. A cunning
>> attacker can use this to simulate system calls from unsuspecting
>> positions in userspace, or for interrupting kernel context. At the very
>> least the int0x80 path does an unconditional swapgs, so will try to run
>> with the user gs, and I expect things will explode quickly from there.
> At least SMAP helps here on non-FSGSBASE systems. With FSGSBASE, I
> suppose we could harden this by adding a special check to int $0x80 to
> validate GSBASE.
>
>> One option here is to look at ISR and complain if it is found to be set.
> Barring some real hackery, we're toast long before we get far enough to do that.

Even if the path moves to be like a regular idtentry? How much more
expensive is that in reality? Ultimately, it is that which needs to be
weighed against any extra wanted robustness.

~Andrew