Re: Review of KPTI patchset

From: Mathieu Desnoyers
Date: Sat Dec 30 2017 - 15:40:12 EST


----- On Dec 30, 2017, at 2:58 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:

> On Sat, 30 Dec 2017, Mathieu Desnoyers wrote:
>
>> Hi Thomas,
>>
>> Here is some feedback on the KPTI patchset. Sorry for not replying to the
>> patch, I was not CC'd on the original email, and don't have it in my inbox.
>
> I can bounce you 196 versions if you want.

Oh no, don't worry about this. I'm happy reviewing the resulting patchset
as it is. :)

>
>> I notice that fill_ldt() sets the desc->type with "|= 1", whereas all
>> other operations on the desc type are done with a type enum based on
>> clearly defined bits. Is the hardcoded "1" on purpose ?
>
> I don't understand your question. That code does not have any enum involved
> at all:

I think I got mixed up with other "desc" fields within other structures
of desc_defs.h.

>
> desc->type = (info->read_exec_only ^ 1) << 1;
> desc->type |= info->contents << 2;
> /* Set the ACCESS bit so it can be mapped RO */
> desc->type |= 1;
>
> So the |= 1 is completely consistent with the rest of that code.

It indeed seems consistent with the rest of that code, which could use
more comments and documentation. For instance, x86 desc_defs.h
could benefit from extra comments describing the meaning of each bit
near the "type" field.

I guess a counter-argument is that anyone reading through that code
should look up the "segment descriptor" layout in an x86 manual. That
is not ideal though.
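
To illustrate the kind of documentation I have in mind, here is a minimal
sketch with the bits of a code/data segment descriptor "type" field named.
The macro names, the struct and the helper are hypothetical (they are not
taken from desc_defs.h); only the bit positions come from the x86 segment
descriptor layout, and the helper mirrors the three statements quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical names for the 4-bit "type" field of an x86 code/data
 * segment descriptor. Only the bit positions are architectural.
 */
#define SEG_TYPE_ACCESSED	(1u << 0)	/* set by the CPU on first use */
#define SEG_TYPE_RW		(1u << 1)	/* data: writable, code: readable */
#define SEG_TYPE_DC		(1u << 2)	/* data: expand-down, code: conforming */
#define SEG_TYPE_EXEC		(1u << 3)	/* 1 = code segment, 0 = data segment */

/* Hypothetical stand-in for the two user_desc fields used here. */
struct ldt_info_sketch {
	unsigned int read_exec_only;	/* 0 or 1 */
	unsigned int contents;		/* 2-bit value, lands in bits 2-3 */
};

/* Same computation as the quoted fill_ldt() lines, with named bits. */
static inline uint8_t ldt_type_sketch(const struct ldt_info_sketch *info)
{
	uint8_t type = 0;

	if (!info->read_exec_only)
		type |= SEG_TYPE_RW;
	type |= (info->contents & 0x3u) << 2;
	/* Set the ACCESS bit so it can be mapped RO */
	type |= SEG_TYPE_ACCESSED;
	return type;
}
```

With read_exec_only = 0 and contents = 0 this yields a writable, accessed
data segment; with read_exec_only = 1 and contents = 2 it yields an
execute-only, accessed code segment.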

>
>> arch/x86/include/asm/processor.h:
>>
>> "+ * With page table isolation enabled, we map the LDT in ... [stay tuned]"
>>
>> I look forward to publication of the next chapter containing the rest of
>> this sentence. When is it due ? ;)
>
> Don't know. Lost my crystal ball.

Me too :) It would be helpful to complete this comment though.

[...]

>> @@ -156,6 +271,12 @@ int ldt_dup_context(struct mm_struct *old_mm, struct
>> mm_struct *mm)
>> new_ldt->nr_entries * LDT_ENTRY_SIZE);
>> finalize_ldt_struct(new_ldt);
>>
>> + retval = map_ldt_struct(mm, new_ldt, 0);
>> + if (retval) {
>> + free_ldt_pgtables(mm);
>> + free_ldt_struct(new_ldt);
>> + goto out_unlock;
>> + }
>> mm->context.ldt = new_ldt;
>>
>> out_unlock:
>>
>> ^ I don't get why it does "free_ldt_pgtables(mm)" on the mm argument, but
>> it's not done in other error paths. Perhaps it's OK, but ownership seems
>> non-obvious.
>
> The pagetable for LDT is allocated and populated in the user space visible
> part of a process PGDIR, which obviously is connected to the mm struct....
>
> Which other error paths are you talking about?

Let's look at the entire function:

> /*
>  * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
>  * the new task is not running, so nothing can be installed.
>  */
> int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
> {
> 	struct ldt_struct *new_ldt;
> 	int retval = 0;
>
> 	if (!old_mm)
> 		return 0;

If old_mm is NULL, free_ldt_pgtables(mm) is not called.

>
> 	mutex_lock(&old_mm->context.lock);
> 	if (!old_mm->context.ldt)

If old_mm->context.ldt is NULL, free_ldt_pgtables(mm) is not called.

> 		goto out_unlock;
>
> 	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
> 	if (!new_ldt) {
> 		retval = -ENOMEM;

On allocation error, free_ldt_pgtables(mm) is not called.

> 		goto out_unlock;
> 	}
>
> 	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
> 	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
> 	finalize_ldt_struct(new_ldt);
>
> 	retval = map_ldt_struct(mm, new_ldt, 0);
> 	if (retval) {
> 		free_ldt_pgtables(mm);

Here, if map_ldt_struct() fails, then free_ldt_pgtables(mm) is called.

> 		free_ldt_struct(new_ldt);

That is in addition to calling free_ldt_struct(), even though
map_ldt_struct() failed... ?

This lack of symmetry makes me uncomfortable, and it may hint at something
fishy.

> 		goto out_unlock;
> 	}
> 	mm->context.ldt = new_ldt;
>
> out_unlock:
> 	mutex_unlock(&old_mm->context.lock);
> 	return retval;
> }

[...]

>
>> + /*
>> + * Force the population of PMDs for not yet allocated per cpu
>> + * memory like debug store buffers.
>> + */
>> + npages = sizeof(struct debug_store_buffers) / PAGE_SIZE;
>> + for (; npages; npages--, cea += PAGE_SIZE)
>> + cea_set_pte(cea, 0, PAGE_NONE);
>>
>> ^ the code above (in percpu_setup_debug_store()) depends on having
>> struct debug_store_buffers's size being a multiple of PAGE_SIZE. A
>> comment should be added near the structure declaration to document
>> this requirement.
>
> Hmm. There was a build_bug_on() somewhere which ensured that. That must
> have been lost in one of the gazillion iterations.

A BUILD_BUG_ON() would indeed work as documentation.
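
As a rough sketch of what such a check documents, here is a user-space
equivalent using C11 _Static_assert. The struct layout, the buffer sizes
and SKETCH_PAGE_SIZE are assumptions made up for illustration; in the
kernel this would be a BUILD_BUG_ON() on the real struct
debug_store_buffers and PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE	4096u

/* Hypothetical stand-in for struct debug_store_buffers. */
struct debug_store_buffers_sketch {
	char bts_buffer[16 * SKETCH_PAGE_SIZE];
	char pebs_buffer[16 * SKETCH_PAGE_SIZE];
};

/*
 * The "npages = sizeof(...) / PAGE_SIZE" loop silently skips a trailing
 * partial page, so refuse to build if the size is not page-aligned.
 */
_Static_assert(sizeof(struct debug_store_buffers_sketch) % SKETCH_PAGE_SIZE == 0,
	       "debug store buffers must be a multiple of the page size");

/* Number of pages the population loop would walk. */
static inline size_t sketch_npages(void)
{
	return sizeof(struct debug_store_buffers_sketch) / SKETCH_PAGE_SIZE;
}
```

If someone later adds a field that breaks the alignment, the build fails
at the assertion instead of leaving the last partial page unpopulated.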

[...]

>
>> +/*
>> + * We get here when we do something requiring a TLB invalidation
>> + * but could not go invalidate all of the contexts. We do the
>> + * necessary invalidation by clearing out the 'ctx_id' which
>> + * forces a TLB flush when the context is loaded.
>> + */
>> +void clear_asid_other(void)
>> +{
>> + u16 asid;
>> +
>> + /*
>> + * This is only expected to be set if we have disabled
>> + * kernel _PAGE_GLOBAL pages.
>> + */
>> + if (!static_cpu_has(X86_FEATURE_PTI)) {
>> + WARN_ON_ONCE(1);
>> + return;
>> + }
>> +
>> + for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
>> + /* Do not need to flush the current asid */
>> + if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
>> + continue;
>> + /*
>> + * Make sure the next time we go to switch to
>> + * this asid, we do a flush:
>> + */
>> + this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
>> + }
>> + this_cpu_write(cpu_tlbstate.invalidate_other, false);
>> +}
>>
>> Can this be called with preemption enabled ? If so, what happens
>> if migrated ?
>
> No, it can't and if it is then it's a bug and the smp_processor_id() debug
> code will yell at you.

I thought the whole point of this_cpu_*() was that it could be called
with preemption enabled, given that it figures out the per-cpu data offset
using a segment selector prefix. How would the smp_processor_id() debug code
be involved here ?

Thanks,

Mathieu


>
> Thanks,
>
> tglx

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com