Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

From: Mathieu Desnoyers
Date: Mon Aug 23 2021 - 11:20:26 EST


[ re-send to Darren Hart ]

----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers mathieu.desnoyers@xxxxxxxxxxxx wrote:

> ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@xxxxxxxxxx wrote:
>
>> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> migrated while the kernel is handling KVM_RUN. This is a regression test
>> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> without updating rseq, leading to a stale CPU ID and other badness.
>>
>
> [...]
>
> +#define RSEQ_SIG 0xdeadbeef
>
> Is there any reason for defining a custom signature rather than including
> tools/testing/selftests/rseq/rseq.h ? This should take care of including
> the proper architecture header which will define the appropriate signature.
>
> Arguably you don't define rseq critical sections in this test per se, but
> I'm wondering why the custom signature here.
>
> [...]
>
>> +
>> +static void *migration_worker(void *ign)
>> +{
>> + cpu_set_t allowed_mask;
>> + int r, i, nr_cpus, cpu;
>> +
>> + CPU_ZERO(&allowed_mask);
>> +
>> + nr_cpus = CPU_COUNT(&possible_mask);
>> +
>> + for (i = 0; i < 20000; i++) {
>> + cpu = i % nr_cpus;
>> + if (!CPU_ISSET(cpu, &possible_mask))
>> + continue;
>> +
>> + CPU_SET(cpu, &allowed_mask);
>> +
>> + /*
>> + * Bump the sequence count twice to allow the reader to detect
>> + * that a migration may have occurred in between rseq and sched
>> + * CPU ID reads. An odd sequence count indicates a migration
>> + * is in-progress, while a completely different count indicates
>> + * a migration occurred since the count was last read.
>> + */
>> + atomic_inc(&seq_cnt);
>
> So technically this atomic_inc contains the required barriers because the
> selftests
> implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's rather odd
> that
> the semantic differs from the kernel implementation in terms of memory barriers:
> the
> kernel implementation of atomic_inc guarantees no memory barriers, but this one
> happens to provide full barriers pretty much by accident (selftests
> futex/include/atomic.h documents no such guarantee).
>
> If this full barrier guarantee is indeed provided by the selftests atomic.h
> header,
> I would really like a comment stating that in the atomic.h header so the carpet
> is
> not pulled from under our feet by a future optimization.
>
>
>> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> + errno, strerror(errno));
>> + atomic_inc(&seq_cnt);
>> +
>> + CPU_CLR(cpu, &allowed_mask);
>> +
>> + /*
>> + * Let the read-side get back into KVM_RUN to improve the odds
>> + * of task migration coinciding with KVM's run loop.
>
> This comment should be about increasing the odds of letting the seqlock
> read-side
> complete. Otherwise, the delay between the two back-to-back atomic_inc is so
> small
> that the seqlock read-side may never have time to complete the reading the rseq
> cpu id and the sched_getcpu() call, and can retry forever.
>
> I'm wondering if 1 microsecond is sufficient on other architectures as well. One
> alternative way to make this depend less on the architecture's implementation of
> sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read
> the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the
> migration
> thread rather than use usleep, and throw away the value read. This would ensure
> the delay is appropriate on all architectures.
>
> Thanks!
>
> Mathieu
>
>> + */
>> + usleep(1);
>> + }
>> + done = true;
>> + return NULL;
>> +}
>> +
>> +int main(int argc, char *argv[])
>> +{
>> + struct kvm_vm *vm;
>> + u32 cpu, rseq_cpu;
>> + int r, snapshot;
>> +
>> + /* Tell stdout not to buffer its content */
>> + setbuf(stdout, NULL);
>> +
>> + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
>> + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
>> + strerror(errno));
>> +
>> + if (CPU_COUNT(&possible_mask) < 2) {
>> + print_skip("Only one CPU, task migration not possible\n");
>> + exit(KSFT_SKIP);
>> + }
>> +
>> + sys_rseq(0);
>> +
>> + /*
>> + * Create and run a dummy VM that immediately exits to userspace via
>> + * GUEST_SYNC, while concurrently migrating the process by setting its
>> + * CPU affinity.
>> + */
>> + vm = vm_create_default(VCPU_ID, 0, guest_code);
>> +
>> + pthread_create(&migration_thread, NULL, migration_worker, 0);
>> +
>> + while (!done) {
>> + vcpu_run(vm, VCPU_ID);
>> + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
>> + "Guest failed?");
>> +
>> + /*
>> + * Verify rseq's CPU matches sched's CPU. Ensure migration
>> + * doesn't occur between sched_getcpu() and reading the rseq
>> + * cpu_id by rereading both if the sequence count changes, or
>> + * if the count is odd (migration in-progress).
>> + */
>> + do {
>> + /*
>> + * Drop bit 0 to force a mismatch if the count is odd,
>> + * i.e. if a migration is in-progress.
>> + */
>> + snapshot = atomic_read(&seq_cnt) & ~1;
>> + smp_rmb();
>> + cpu = sched_getcpu();
>> + rseq_cpu = READ_ONCE(__rseq.cpu_id);
>> + smp_rmb();
>> + } while (snapshot != atomic_read(&seq_cnt));
>> +
>> + TEST_ASSERT(rseq_cpu == cpu,
>> + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
>> + }
>> +
>> + pthread_join(migration_thread, NULL);
>> +
>> + kvm_vm_free(vm);
>> +
>> + sys_rseq(RSEQ_FLAG_UNREGISTER);
>> +
>> + return 0;
>> +}
>> --
>> 2.33.0.rc2.250.ged5fa647cd-goog
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com