Re: [RFC PATCH] arch/x86: Optionally flush L1D on context switch

From: Andy Lutomirski
Date: Sun Mar 22 2020 - 11:10:43 EST



> On Mar 21, 2020, at 10:08 PM, Herrenschmidt, Benjamin <benh@xxxxxxxxxx> wrote:
>
> On Fri, 2020-03-20 at 12:49 +0100, Thomas Gleixner wrote:
>> Balbir,
>>
>> "Singh, Balbir" <sblbir@xxxxxxxxxx> writes:
>>>> On Thu, 2020-03-19 at 01:38 +0100, Thomas Gleixner wrote:
>>>>> What's the point? The attack surface is the L1D content of the scheduled
>>>>> out task. If the malicious task schedules out, then why would you care?
>>>>>
>>>>> I might be missing something, but AFAICT this is beyond paranoia.
>>>>>
>>>
>>> I think there are two cases
>>>
>>> 1. Task with important data schedules out
>>> 2. Malicious task schedules in
>>>
>>> These patches address case #1, but call out case #2
>>
>> The point is if the victim task schedules out, then there is no reason
>> to flush L1D immediately in context switch. If that just schedules a
>> kernel thread and then goes back to the task, then there is no point
>> unless you do not even trust the kernel thread.
>
> A switch to a kernel thread will not call switch_mm, will it? At least it used not to...
>
>>>>> 3. There is a fallback software L1D load, similar to what L1TF does, but
>>>>> we don't prefetch the TLB, is that sufficient?
>>>>
>>>> If we go there, then the KVM L1D flush code wants to move into general
>>>> x86 code.
>>>
>>> OK, we can definitely consider reusing code, but I think the KVM bits require
>>> TLB prefetching, IIUC, before the cache flush to negate any bad translations
>>> associated with an L1TF fault, though the code/comments are not clear on the
>>> need to do so.
>>
>> I forgot the gory details by now, but having two entry points or a
>> conditional and sharing the rest (page allocation etc.) is definitely
>> better than two slightly different implementations which basically do the same thing.
>>
>>>>> +void enable_l1d_flush_for_task(struct task_struct *tsk)
>>>>> +{
>>>>> +	struct page *page;
>>>>> +
>>>>> +	if (static_cpu_has(X86_FEATURE_FLUSH_L1D))
>>>>> +		goto done;
>>>>> +
>>>>> +	mutex_lock(&l1d_flush_mutex);
>>>>> +	if (l1d_flush_pages)
>>>>> +		goto done;
>>>>> +	/*
>>>>> +	 * These pages are never freed, we use the same
>>>>> +	 * set of pages across multiple processes/contexts
>>>>> +	 */
>>>>> +	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, L1D_CACHE_ORDER);
>>>>> +	if (!page)
>>>>> +		return;
>>>>> +
>>>>> +	l1d_flush_pages = page_address(page);
>>>>> +	/* I don't think we need to worry about KSM */
>>>>
>>>> Why not? Even if it wouldn't be necessary, why would we care, as this is a
>>>> once-per-boot operation in fully preemptible code.
>>>
>>> Not sure I understand your question; I was stating that even if KSM was
>>> running, it would not impact us (with dedup), as we'd still be writing out 0s
>>> to the cache lines in the fallback case.
>>
>> I probably confused myself vs. the comment in the VMX code, but that
>> mentions nested virt. Needs at least some consideration when we reuse
>> that code.
>>
>>>>> void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>>>> 	       struct task_struct *tsk)
>>>>> {
>>>>> @@ -433,6 +519,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>>>>> 		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
>>>>> 	}
>>>>>
>>>>> +	l1d_flush(next, tsk);
>>>>
>>>> This is really the wrong place. You want to do that:
>>>>
>>>> 1) Just before return to user space
>>>> 2) When entering a guest
>>>>
>>>> and only when the previously running user space task was the one which
>>>> requested this massive protection.
>>>>
>>>
>>> Cases 1 and 2 are handled via
>>>
>>> 1. SWAPGS fixes/workarounds (unless I misunderstood your suggestion)
>>
>> How so? SWAPGS mitigation does not flush L1D. It merely serializes SWAPGS.
>
>>> 2. L1TF fault handling
>>>
>>> This mechanism allows for flushing not restricted to 1 or 2; the idea is to
>>> immediately flush L1D for paranoid processes on an mm switch.
>>
>> Why? To protect the victim task against the malicious kernel?
>
> Mostly malicious other tasks for us. As I said, I don't think switch_mm
> is called when switching to a kernel thread, and it is definitely a colder
> path than the return to userspace, so it felt like the right place to
> put this. But I don't mind if you prefer it elsewhere, as long as it
> does the job, which is to prevent task B from snooping task A's data.
>
>> The L1D content of the victim is endangered in the following case:
>>
>> victim out -> attacker in
>>
>> The attacker can either run in user space or in guest mode. So the flush
>> is only interesting when the attacker actually goes back to user space
>> or reenters the guest.
>>
>> The following is completely uninteresting:
>>
>> victim out -> kernel thread in/out -> victim in
>
> Sure, but will that cause switch_mm to be called?
>
>> Even this is uninteresting:
>>
>> victim in -> attacker in (stays in kernel, e.g. waits for data) ->
>> attacker out -> victim in
>
> I don't get this... how do you get attacker_in without victim_out
> first? In which case you have a victim_out -> attacker_in transition,
> which is what we are trying to protect against.
>
> I still think flushing the "high value" process L1D on switch_mm out is
> the way to go here...

Let me try to understand the issue. There is some high-value data, and that data is owned by a high-value process. At some point, the data ends up in L1D. Later on, evil code runs and may attempt to exfiltrate that data from L1D using a side channel. (The evil code is not necessarily in a malicious process context. It could be kernel code targeted by LVI or similar. It could be ordinary code that happens to contain a side channel gadget by accident.)

So the idea is to flush L1D after manipulating high-value data and before running evil code.
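(For concreteness, a rough sketch of what the flush itself amounts to, following the general approach the patch and the existing KVM L1TF code take: write IA32_FLUSH_CMD when the CPU has it, otherwise read through a buffer larger than the L1D. The helper name, the buffer order, and the plain-C fallback loops are illustrative only; the real fallback is done in asm so the compiler cannot elide the loads.)

#include <linux/mm.h>
#include <linux/compiler.h>
#include <asm/cpufeature.h>
#include <asm/msr.h>

/* Illustrative names, not the patch under discussion. */
#define L1D_FLUSH_BUFFER_ORDER	4	/* 64K, comfortably larger than L1D */

static void *l1d_flush_buffer;		/* allocated once, never freed */

static void do_l1d_flush(void)
{
	unsigned long size = PAGE_SIZE << L1D_FLUSH_BUFFER_ORDER;
	unsigned long i;

	/* Hardware assist: a single MSR write flushes the L1D. */
	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
		return;
	}

	/* Pass 1: touch each page so the translations are in the TLB. */
	for (i = 0; i < size; i += PAGE_SIZE)
		READ_ONCE(*(volatile char *)(l1d_flush_buffer + i));

	/* Pass 2: read every cache line to displace the old L1D contents. */
	for (i = 0; i < size; i += 64)
		READ_ONCE(*(volatile char *)(l1d_flush_buffer + i));
}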

The nasty part here is that we don't have a good handle on when L1D is filled and when the evil code runs. If the evil code is untrusted process userspace and the fill happens in an interrupt, then switch_mm is useless and we want to flush on kernel exit instead. If the fill and the evil code are both userspace, then switch_mm is probably the right choice, but prepare_exit_to_usermode works too. If SMT is on, we lose no matter what. If the evil code is in kernel context, then it's not clear what to do. If the fill and the evil code are both in kernel threads (hi, io_uring), then I'm not at all sure what to do.

In summary, kernel exit seems stronger, but the right answer isn't so clear.

We could do an optimized variant where we flush at kernel exit but we *decide* to flush in switch_mm.
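Something like the following rough sketch, where the TIF_SPEC_L1D_FLUSH flag, the percpu variable, the hook names, and do_l1d_flush() from the earlier sketch are all made-up names, and the hook points in switch_mm_irqs_off() and the exit path are only indicative:

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/thread_info.h>

/* All names below are assumptions for the sketch, not existing symbols. */
static DEFINE_PER_CPU(bool, l1d_flush_pending);

/*
 * Hook for switch_mm_irqs_off(): if the task being scheduled out asked
 * for L1D protection, arm a deferred flush instead of flushing here.
 */
static inline void l1d_flush_arm(struct task_struct *prev)
{
	if (prev && test_ti_thread_flag(task_thread_info(prev),
					TIF_SPEC_L1D_FLUSH))
		this_cpu_write(l1d_flush_pending, true);
}

/*
 * Hook for the exit-to-usermode (or VM entry) path: flush only if a
 * protected task actually ran on this CPU since the last exit, so the
 * common case stays a single percpu test.
 */
static inline void l1d_flush_if_armed(void)
{
	if (unlikely(this_cpu_read(l1d_flush_pending))) {
		this_cpu_write(l1d_flush_pending, false);
		do_l1d_flush();		/* MSR write or software fallback */
	}
}

A smarter variant could also clear the pending bit when we return to the same mm that armed it, so the victim -> kthread -> victim case costs nothing.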