Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

From: Andrew Lutomirski
Date: Mon Apr 21 2014 - 20:38:06 EST


On Mon, Apr 21, 2014 at 4:29 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> On 04/21/2014 04:19 PM, Andrew Lutomirski wrote:
>>
>> Hahaha! :)
>>
>> Some comments:
>>
>> Does returning to 64-bit CS with 16-bit SS not need espfix?
>
> There is no such thing. With a 64-bit CS, the flags on SS are ignored
> (although you still have to have a non-null SS... the conditions are a
> bit complex.)
>
>> Conversely, does 16-bit CS and 32-bit SS need espfix?
>
> It does not, at least to the best of my knowledge (it is controlled by
> the SS size, not the CS size.)
>
> I'm going to double-check the corner cases just out of healthy paranoia,
> but I'm 98% sure this is correct (and if not, the 32-bit code needs to
> be fixed, too.)
>
>>> @@ -1058,6 +1095,7 @@ bad_iret:
>>> * So pretend we completed the iret and took the #GPF in user mode.
>>> *
>>> * We are now running with the kernel GS after exception recovery.
>>> + * Exception entry will have removed us from the espfix stack.
>>> * But error_entry expects us to have user GS to match the user %cs,
>>> * so swap back.
>>> */
>>
>> What is that referring to?
>
> It means that we have already switched back from the espfix stack to the
> real stack.
>
>>> + /*
>>> + * Switch from the espfix stack to the proper stack: tricky stuff.
>>> + * On the stack right now is 5 words of exception frame,
>>> + * error code/oldeax, RDI, and the return value, so no additional
>>> + * stack is available.
>>> + *
>>> + * We will always be using the user space GS on entry.
>>> + */
>>> +ENTRY(espfix_fix_stack)
>>> + SWAPGS
>>> + cld
>>> + movq PER_CPU_VAR(kernel_stack),%rdi
>>> + subq $8*8,%rdi
>>> + /* Use the real stack to hold these registers for now */
>>> + movq %rsi,-8(%rdi)
>>> + movq %rcx,-16(%rdi)
>>> + movq %rsp,%rsi
>>> + movl $8,%ecx
>>> + rep;movsq
>>> + leaq -(10*8)(%rdi),%rsp
>>> + popq %rcx
>>> + popq %rsi
>>> + SWAPGS
>>> + retq
>>>
>>
>> Is it guaranteed that the userspace thread that caused this is dead?
>> If not, do you need to change RIP so that espfix gets invoked again
>> when you return from the exception?
>
> It is not guaranteed to be dead at all. Why would you need to change
> RIP, though?

Oh. You're not changing the RSP that you return to. So this should be okay.
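
(To check my own understanding, here is roughly what that stub boils
down to in C -- an untested sketch, and the names are made up, not
actual kernel symbols:)

/*
 * Copy the 8 words sitting on the espfix stack -- the 5-word hardware
 * exception frame, the error code, the saved RDI, and the return
 * address pushed by the call -- onto the top of the real per-cpu
 * kernel stack, and continue from there.
 */
static unsigned long *espfix_switch_stack(const unsigned long *espfix_sp,
					  unsigned long *kernel_stack_top)
{
	unsigned long *new_sp = kernel_stack_top - 8;	/* room for 8 words */
	int i;

	for (i = 0; i < 8; i++)		/* the rep;movsq in the asm */
		new_sp[i] = espfix_sp[i];

	return new_sp;			/* this becomes the new %rsp */
}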

>
>>> +
>>> +void init_espfix_cpu(void)
>>> +{
>>> + int cpu = smp_processor_id();
>>> + unsigned long addr;
>>> + pgd_t pgd, *pgd_p;
>>> + pud_t pud, *pud_p;
>>> + pmd_t pmd, *pmd_p;
>>> + pte_t pte, *pte_p;
>>> + int n;
>>> + void *stack_page;
>>> +
>>> + cpu = smp_processor_id();
>>> + BUG_ON(cpu >= (8 << 20)/ESPFIX_STACK_SIZE);
>>> +
>>> + /* We only have to do this once... */
>>> + if (likely(this_cpu_read(espfix_stack)))
>>> + return; /* Already initialized */
>>> +
>>> + addr = espfix_base_addr(cpu);
>>> +
>>> + /* Did another CPU already set this up? */
>>> + if (likely(espfix_already_there(addr)))
>>> + goto done;
>>> +
>>> + mutex_lock(&espfix_init_mutex);
>>> +
>>> + if (unlikely(espfix_already_there(addr)))
>>> + goto unlock_done;
>>
>> Wouldn't it be simpler to just have a single static bool to indicate
>> whether espfix is initialized?
>
> No, you would have to allocate memory for every possible CPU, which I
> wanted to avoid in case NR_CPUS >> actual CPUs (I don't know if we have
> already done that for percpu, but we *should* if we haven't yet.)
>
>> Even better: why not separate the percpu init from the pagetable init
>> and just do the pagetable init once from main or even modify_ldt?
>
> It needs to be done once per CPU. I wanted to do it late enough that
> the page allocator is fully functional, so we don't have to do the ugly
> hacks to call one allocator or another as the percpu initialization code
> does (otherwise it would have made a lot of sense to co-locate with percpu.)

Hmm. I guess espfix_already_there isn't so bad. Given that, in the
worst case, I think there are 16 pages allocated, it might make sense
to just track which of those 16 pages have been allocated in some
small array. That whole array would probably be shorter than the
espfix_already_there test itself. Or am I still failing to understand
how this works?
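
Something like this is what I have in mind -- a completely untested
sketch, and the names are made up:

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/gfp.h>
#include <linux/mutex.h>

/* One bit per potential espfix stack page (16 in the worst case above),
 * so the whole bookkeeping fits in a couple of machine words. */
#define ESPFIX_MAX_PAGES	16

static DECLARE_BITMAP(espfix_page_allocated, ESPFIX_MAX_PAGES);
static void *espfix_pages[ESPFIX_MAX_PAGES];
static DEFINE_MUTEX(espfix_page_mutex);

/* Return the espfix page for @index, allocating it on first use. */
static void *espfix_get_page(unsigned int index)
{
	mutex_lock(&espfix_page_mutex);
	if (!test_bit(index, espfix_page_allocated)) {
		espfix_pages[index] = (void *)__get_free_page(GFP_KERNEL);
		if (espfix_pages[index])
			__set_bit(index, espfix_page_allocated);
	}
	mutex_unlock(&espfix_page_mutex);
	return espfix_pages[index];
}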

--Andy