Re: TLB flushes on fixmap changes

From: Nadav Amit
Date: Mon Aug 27 2018 - 13:34:46 EST


at 1:05 AM, Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:

> On Sun, 26 Aug 2018 20:26:09 -0700
> Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
>> at 8:03 PM, Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>>
>>> On Sun, 26 Aug 2018 11:09:58 +0200
>>> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>
>>>> On Sat, Aug 25, 2018 at 09:21:22PM -0700, Andy Lutomirski wrote:
>>>>> I just re-read text_poke(). It's, um, horrible. Not only is the
>>>>> implementation overcomplicated and probably buggy, but it's SLOOOOOW.
>>>>> It's totally the wrong API -- poking one instruction at a time
>>>>> basically can't be efficient on x86. The API should either poke lots
>>>>> of instructions at once or should be text_poke_begin(); ...;
>>>>> text_poke_end();.
>>>>
>>>> I don't think anybody ever cared about performance here. Only
>>>> correctness. That whole text_poke_bp() thing is entirely tricky.
>>>
>>> Agreed. Self modification is a special event.
>>>
>>>> FWIW, before text_poke_bp(), text_poke() would only be used from
>>>> stop_machine, so all the other CPUs would be stuck busy-waiting with
>>>> IRQs disabled. These days, yeah, that's lots more dodgy, but yes
>>>> text_mutex should be serializing all that.
>>>
>>> I'm still not sure that a speculative page-table walk can be done
>>> over the mutex. Also, if the fixmap area is for aliasing
>>> pages (which are always mapped to memory), what kind of
>>> security issue can happen?
>>
>> The PTE is accessible from other cores, so just as we assume for L1TF that
>> every addressable memory location might be cached in L1, we should assume
>> that any PTE might be cached in the TLB when it is present.
>
> Ok, so other cores can accidentally cache the PTE in TLB, (and no way
> to shoot down explicitly?)

There is a way (although the code currently does not do so). But it seems that
the consensus is that it is better to avoid having it mapped at all on remote
cores.

>> Although the mapping is for an alias, there are a couple of issues here.
>> First, this alias mapping is writable, so it might allow an attacker to
>> change the kernel code (following another initial attack).
>
> Combined with some buffer overflow, correct? If the attacker can already
> write kernel data directly, he is already in kernel mode.

Right.

>
>> Second, the alias mapping is
>> never explicitly flushed. We may assume that once the original mapping is
>> removed/changed, a full TLB flush would take place, but there is no
>> guarantee it actually takes place.
>
> Hmm, would this mean a full TLB flush will not flush the alias mapping?
> (or, the full TLB flush just doesn't work?)

It will flush the alias mapping, but currently there is no such explicit
flush.
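
As a rough user-space illustration of the writable-alias idea discussed
above (not the kernel code): the sketch below keeps a read-only view of a
page and patches it through a second, short-lived writable mapping of the
same backing pages. The names poke_via_alias() and alias_demo() are
hypothetical, and a temporary file stands in for kernel text. In user space
munmap() takes care of invalidating the alias; the concern in this thread is
that the kernel path has no such explicit flush.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the alias technique: write `len` bytes at
 * offset `off` through a writable mapping that exists only for the
 * duration of the write.
 */
static int poke_via_alias(int fd, size_t map_len, size_t off,
                          const void *src, size_t len)
{
    unsigned char *alias;

    assert(src != NULL);
    if (off > map_len || len > map_len - off)
        return -1;

    alias = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (alias == MAP_FAILED)
        return -1;

    memcpy(alias + off, src, len);

    /* Drop the writable alias as soon as the write is done. */
    return munmap(alias, map_len);
}

/* Returns 0 if a byte written via the alias is seen in the read-only view. */
static int alias_demo(void)
{
    const unsigned char patch = 0xcc; /* just a recognizable byte value */
    long page = sysconf(_SC_PAGESIZE);
    FILE *backing = tmpfile();        /* stands in for the text pages */
    unsigned char *text;
    int ret = -1;

    if (backing == NULL)
        return -1;
    if (page <= 0 || ftruncate(fileno(backing), page) != 0)
        goto out;

    /* Long-lived mapping: read-only, like the normal view of kernel text. */
    text = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED,
                fileno(backing), 0);
    if (text == MAP_FAILED)
        goto out;

    if (poke_via_alias(fileno(backing), (size_t)page, 0, &patch, 1) == 0 &&
        text[0] == patch)
        ret = 0;

    munmap(text, (size_t)page);
out:
    fclose(backing);
    return ret;
}
```

Calling alias_demo() returns 0 when the patched byte is visible through the
read-only mapping, i.e. both mappings refer to the same underlying page.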

>>> Anyway, from the viewpoint of kprobes, either per-cpu fixmap or
>>> changing CR3 sounds good to me. I think we don't even need per-cpu,
>>> it can call a thread/function on a dedicated core (like the first
>>> boot processor) and wait :) This may prevent leakage of pte change
>>> to other cores.
>>
>> I implemented per-cpu fixmap, but I think that it makes more sense to take
>> peterz's approach and set an entry at the PGD level. Per-CPU fixmap either
>> requires pre-populating various levels in the page-table hierarchy, or
>> conditionally synchronizing whenever module memory is allocated, since they
>> can share the same PGD, PUD & PMD. While usually the synchronization is not
>> needed, the possibility that synchronization is needed complicates locking.
>
> Could you point out which PeterZ approach you mean? I guess it will
> make a clone of the PGD and use it for local page mapping (as a new mm).
> If so, yes it sounds perfectly fine to me.

The thread is too long. What I think is best is having a mapping in the PGD
level. I'll try to give it a shot, and see what I get.
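
To make the point about shared page-table levels concrete, here is a small
sketch assuming x86-64 4-level paging (9 index bits per level, 48-bit
virtual addresses). The helper names pt_index() and shares_entry() are
hypothetical, and any addresses passed in are illustrative only, not a
statement about the actual fixmap or module layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shift of the lowest virtual-address bit that selects an entry per level. */
enum { PTE_SHIFT = 12, PMD_SHIFT = 21, PUD_SHIFT = 30, PGD_SHIFT = 39 };

/* Index (0..511) into the table at the level identified by `shift`. */
static unsigned int pt_index(uint64_t va, unsigned int shift)
{
    assert(shift >= PTE_SHIFT && shift <= PGD_SHIFT);
    return (unsigned int)((va >> shift) & 0x1ff);
}

/*
 * True when both addresses are translated through the same entry at the
 * given level, i.e. all of their address bits from `shift` up to bit 47
 * agree. Sharing a PGD entry means sharing the PUD table below it, and so
 * on down the hierarchy.
 */
static bool shares_entry(uint64_t a, uint64_t b, unsigned int shift)
{
    const uint64_t va_mask = (1ULL << 48) - 1;

    return ((a & va_mask) >> shift) == ((b & va_mask) >> shift);
}
```

For example, 0xffffffffa0000000 and 0xffffffffa0201000 share their PGD and
PUD entries but use different PMD entries, so populating one of them may
touch tables the other also depends on. A mapping that owns a whole PGD
entry shares no lower-level tables with anything else.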

>> Anyhow, having fixed addresses for the fixmap can be used to circumvent
>> KASLR.
>
> I think text_poke doesn't mind using random address :)
>
>> I don't think a dedicated core is needed. Anyhow there is a lock
>> (text_mutex), so use_mm() can be used after acquiring the mutex.
>
> Hmm, the comment on use_mm() says:
>
> /*
> * use_mm
> * Makes the calling kernel thread take on the specified
> * mm context.
> * (Note: this routine is intended to be called only
> * from a kernel thread context)
> */
>
> So maybe we need a dedicated kernel thread for safeness?

Yes, it says so. But I am not sure it cannot be changed, at least for this
specific use-case. Switching kernel threads just for patching seems like
overkill to me.

Let me see if I can get something half-reasonable doing so...