Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

From: Alexey Kardashevskiy
Date: Wed Aug 31 2016 - 21:49:13 EST

On 31/08/16 17:28, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>>> The origin of the issue I've seen seems to be related to
>>>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>>>> following state
>>> As Nick says (good to see you're back Nick!), this is unrelated to
>>> rwsems.
>>> This is true for pretty much every blocking wait loop out there, they
>>> all do:
>>> 	for (;;) {
>>> 		current->state = UNINTERRUPTIBLE;
>>> 		smp_mb();
>>> 		if (cond)
>>> 			break;
>>> 		schedule();
>>> 	}
>>> 	current->state = RUNNING;
>>> Which, if the wakeup is spurious, is just the pattern you need.
>> Yes, true! My bad. Alexey had seen the same basic pattern; I should have been
>> clearer in my commit log. Should I resend the patch?
> Yes please.
>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>> to PPC, is _not_ MB. It is however sufficient for this case.
>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a
>> different thread.
> Right, although even without that, there is sufficient ordering, as the
> rq unlock from the wakeup, coupled with the rq lock from the schedule
> already form a load-store barrier.
>>> Now, this has been present for a fair while, I suspect ever since we
>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>> hit it.
>> Yes, I just hit it a week or two back and I needed to collect data to
>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>> I needed a system with large threads and less memory running stress-ng.
>> Reproducing the problem takes an unpredictable amount of time.
> What hardware do you see this on, is it shiny new Power8 chips which
> have never before seen deep queues or something. Or is it 'regular' old
> Power7 like stuff?

I am seeing it on POWER8 with KVM and 2 guests, each having 3 virtio-net
devices with vhost enabled. All virtio-net devices are connected to the
same virtual bridge on the host (via /dev/tap*) and are doing lots of
traffic, just between these 2 guests.

I remember doing the same test on POWER7 more than 2 years ago and finding
missing barriers in virtio but nothing like this one. But POWER7 is
seriously slower than POWER8 so it seems that nobody bothered with loading
it that much.

I wonder how to reproduce the bug quicker, as sometimes it runs for days with
no fault but sometimes it fails within the first 30 minutes (backtraces from
the 2 stuck CPUs are the same though). Does anyone have an idea (kernel hacks,
taskset, type of traffic, etc.)? As Nick suggested, I changed
cpus_share_cache() to return "false" so ttwu_queue() would always go via the
ttwu_queue_remote() path, but this did not make any difference.
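
For anyone following along, here is a rough userspace model of the wait loop
Peter quoted above. It is only a sketch and not kernel code: model_schedule(),
model_wake_up() and run_once() are made-up names, a pthread mutex/condvar pair
stands in for rq->lock and the runqueue, and a seq_cst fence stands in for
smp_mb(). It does not reproduce the bug; it only shows why the state store has
to be ordered before the condition check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

static atomic_int task_state;   /* stands in for current->state */
static atomic_bool cond;        /* the condition the waiter sleeps on */
static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t rq_wake = PTHREAD_COND_INITIALIZER;

/* Hypothetical stand-in for schedule(): block only while still marked asleep. */
static void model_schedule(void)
{
	pthread_mutex_lock(&rq_lock);
	while (atomic_load(&task_state) != TASK_RUNNING)
		pthread_cond_wait(&rq_wake, &rq_lock);
	pthread_mutex_unlock(&rq_lock);
}

/* Hypothetical stand-in for the wakeup: mark the task runnable and kick it. */
static void model_wake_up(void)
{
	pthread_mutex_lock(&rq_lock);
	atomic_store(&task_state, TASK_RUNNING);
	pthread_cond_signal(&rq_wake);
	pthread_mutex_unlock(&rq_lock);
}

static void *waiter(void *unused)
{
	(void)unused;
	for (;;) {
		atomic_store(&task_state, TASK_UNINTERRUPTIBLE);
		atomic_thread_fence(memory_order_seq_cst); /* models smp_mb() */
		if (atomic_load(&cond))
			break;
		model_schedule();
	}
	atomic_store(&task_state, TASK_RUNNING);
	return NULL;
}

static void *waker(void *unused)
{
	(void)unused;
	atomic_store(&cond, true);  /* publish the condition first ... */
	model_wake_up();            /* ... then wake the sleeper */
	return NULL;
}

/* Run one waiter/waker pair to completion and return the final task state. */
int run_once(void)
{
	pthread_t w, k;
	int rc;

	atomic_store(&cond, false);
	atomic_store(&task_state, TASK_RUNNING);

	rc = pthread_create(&w, NULL, waiter, NULL);
	assert(rc == 0);
	rc = pthread_create(&k, NULL, waker, NULL);
	assert(rc == 0);
	(void)rc;

	pthread_join(w, NULL);
	pthread_join(k, NULL);
	return atomic_load(&task_state);
}
```

Calling run_once() in a loop (built with -pthread) should always return
TASK_RUNNING. A wakeup that lands between the condition check and
model_schedule() is harmless here, because the waiter just goes around the loop
once more.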