Re: [tip:x86/urgent] x86/mm: Add barriers and document switch_mm() -vs-flush synchronization

From: Andy Lutomirski
Date: Mon Jan 11 2016 - 16:50:51 EST


On Mon, Jan 11, 2016 at 10:25 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Jan 11, 2016 at 03:42:40AM -0800, tip-bot for Andy Lutomirski wrote:
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>> #endif
>> cpumask_set_cpu(cpu, mm_cpumask(next));
>>
>> - /* Re-load page tables */
>> + /*
>> + * Re-load page tables.
>> + *
>> + * This logic has an ordering constraint:
>> + *
>> + * CPU 0: Write to a PTE for 'next'
>> + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
>> + * CPU 1: set bit 1 in next's mm_cpumask
>> + * CPU 1: load from the PTE that CPU 0 writes (implicit)
>> + *
>> + * We need to prevent an outcome in which CPU 1 observes
>> + * the new PTE value and CPU 0 observes bit 1 clear in
>> + * mm_cpumask. (If that occurs, then the IPI will never
>> + * be sent, and CPU 0's TLB will contain a stale entry.)
>> + *
>> + * The bad outcome can occur if either CPU's load is
>> + * reordered before that CPU's store, so both CPUs much
>
> s/much/must/ ?

Indeed. Is this worth a follow-up patch?

>
>> + * execute full barriers to prevent this from happening.
>> + *
>> + * Thus, switch_mm needs a full barrier between the
>> + * store to mm_cpumask and any operation that could load
>> + * from next->pgd. This barrier synchronizes with
>> + * remote TLB flushers. Fortunately, load_cr3 is
>> + * serializing and thus acts as a full barrier.
>> + *
>> + */
>> load_cr3(next->pgd);
>> +
>> trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>>
>> /* Stop flush ipis for the previous mm */
>> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>> * schedule, protecting us from simultaneous changes.
>> */
>> cpumask_set_cpu(cpu, mm_cpumask(next));
>> +
>> /*
>> * We were in lazy tlb mode and leave_mm disabled
>> * tlb flush IPI delivery. We must reload CR3
>> * to make sure to use no freed page tables.
>> + *
>> + * As above, this is a barrier that forces
>> + * TLB repopulation to be ordered after the
>> + * store to mm_cpumask.
>
> somewhat confused by this comment, cpumask_set_cpu() is a LOCK BTS, that
> is already fully ordered.

There are more than enough barriers here. v1 had cpumask_set_cpu;
smp_mb__after_atomic, which is more portable and generates identical
code. I don't have a real preference for which barrier we should
consider to the important one.

>
>> */
>> load_cr3(next->pgd);
>> trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 8ddb5d0..8f4cc3d 100644
>
>
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>
>> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>
>> if (!current->mm) {
>> leave_mm(smp_processor_id());
>> +
>> + /* Synchronize with switch_mm. */
>> + smp_mb();
>> +
>> goto out;
>> }
>
>> + } else {
>> leave_mm(smp_processor_id());
>> +
>> + /* Synchronize with switch_mm. */
>> + smp_mb();
>> + }
>> }
>
> The alternative is making leave_mm() unconditionally imply a full
> barrier. I've not looked at other sites using it though.

For a quick fix, I preferred the more self-contained change.

--Andy