Re: [PATCH v3 3/4] kernel/smp: add more data to CSD lock debugging

From: Paul E. McKenney
Date: Fri Apr 02 2021 - 12:11:34 EST


On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
> On 30.03.21 19:33, Paul E. McKenney wrote:
> > On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > > On 02.03.21 07:28, Juergen Gross wrote:
> > > > In order to help identifying problems with IPI handling and remote
> > > > function execution add some more data to IPI debugging code.
> > > >
> > > > There have been multiple reports of cpus looping long times (many
> > > > seconds) in smp_call_function_many() waiting for another cpu executing
> > > > a function like tlb flushing. Most of these reports have been for
> > > > cases where the kernel was running as a guest on top of KVM or Xen
> > > > (there are rumours of that happening under VMWare, too, and even on
> > > > bare metal).
> > > >
> > > > Finding the root cause hasn't been successful yet, even after more than
> > > > 2 years of chasing this bug by different developers.
> > > >
> > > > Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout
> > > > diagnostics") tried to address this by adding some debug code and by
> > > > issuing another IPI when a hang was detected. This helped mitigating
> > > > the problem (the repeated IPI unlocks the hang), but the root cause is
> > > > still unknown.
> > > >
> > > > Current available data suggests that either an IPI wasn't sent when it
> > > > should have been, or that the IPI didn't result in the target cpu
> > > > executing the queued function (due to the IPI not reaching the cpu,
> > > > the IPI handler not being called, or the handler not seeing the queued
> > > > request).
> > > >
> > > > Try to add more diagnostic data by introducing a global atomic counter
> > > > which is being incremented when doing critical operations (before and
> > > > after queueing a new request, when sending an IPI, and when dequeueing
> > > > a request). The counter value is stored in percpu variables which can
> > > > be printed out when a hang is detected.
> > > >
> > > > The data of the last event (consisting of sequence counter, source
> > > > cpu, target cpu, and event type) is stored in a global variable. When
> > > > a new event is to be traced, the data of the last event is stored in
> > > > the event related percpu location and the global data is updated with
> > > > the new event's data. This allows to track two events in one data
> > > > location: one by the value of the event data (the event before the
> > > > current one), and one by the location itself (the current event).
> > > >
> > > > A typical printout with a detected hang will look like this:
> > > >
> > > > csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
> > > > csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
> > > > csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
> > > > csd: cnt(00008cd): ffff->0006 idle
> > > > csd: cnt(0003668): 0001->0006 queue
> > > > csd: cnt(0003669): 0001->0006 ipi
> > > > csd: cnt(0003e0f): 0007->000a queue
> > > > csd: cnt(0003e10): 0001->ffff ping
> > > > csd: cnt(0003e71): 0003->0000 ping
> > > > csd: cnt(0003e72): ffff->0006 gotipi
> > > > csd: cnt(0003e73): ffff->0006 handle
> > > > csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
> > > > csd: cnt(0003e7f): 0004->0006 ping
> > > > csd: cnt(0003e80): 0001->ffff pinged
> > > > csd: cnt(0003eb2): 0005->0001 noipi
> > > > csd: cnt(0003eb3): 0001->0006 queue
> > > > csd: cnt(0003eb4): 0001->0006 noipi
> > > > csd: cnt now: 0003f00
> > > >
> > > > This example (being an artificial one, produced with a previous version
> > > > of this patch without the "hdlend" event), shows that cpu#6 started to
> > > > handle an IPI (cnt 3e72-3e74), bit didn't start to handle another IPI
> > > > (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was
> > > > queued (3eb3), but no IPI was needed (cnt 3eb4, there was the event
> > > > from cpu#4 in the queue already).
> > > >
> > > > The idea is to print only relevant entries. Those are all events which
> > > > are associated with the hang (so sender side events for the source cpu
> > > > of the hanging request, and receiver side events for the target cpu),
> > > > and the related events just before those (for adding data needed to
> > > > identify a possible race). Printing all available data would be
> > > > possible, but this would add large amounts of data printed on larger
> > > > configurations.
> > > >
> > > > Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
> > > > Tested-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > >
> > > Just an update regarding current status with debugging the underlying
> > > issue:
> > >
> > > On a customer's machine with a backport of this patch applied we've
> > > seen another case of the hang. In the logs we've found:
> > >
> > > smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting
> > > 5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30( (null)).
> > > smp: csd: CSD lock (#1) unresponsive.
> > > smp: csd: cnt(0000000): 0000->0000 queue
> > > smp: csd: cnt(0000001): ffff->0006 idle
> > > smp: csd: cnt(0025dba): 0012->0006 queue
> > > smp: csd: cnt(0025dbb): 0012->0006 noipi
> > > smp: csd: cnt(01d1333): 001a->0006 pinged
> > > smp: csd: cnt(01d1334): ffff->0006 gotipi
> > > smp: csd: cnt(01d1335): ffff->0006 handle
> > > smp: csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
> > > smp: csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
> > > smp: csd: cnt(01d16cb): 0012->0005 ipi
> > > smp: csd: cnt(01d16cc): 0012->0006 queue
> > > smp: csd: cnt(01d16cd): 0012->0006 ipi
> > > smp: csd: cnt(01d16f3): 0012->001a ipi
> > > smp: csd: cnt(01d16f4): 0012->ffff ping
> > > smp: csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
> > > smp: csd: cnt(01d1751): 0012->ffff pinged
> > > smp: csd: cnt now: 01d1769
> > >
> > > So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of the
> > > data).
> > >
> > > The next 4 lines of the csd actions are not really interesting, as they are
> > > rather old.
> > >
> > > Then we see that cpu 0006 did handle a request rather recently (cnt 01d1333
> > > - 01d1337): cpu 001a pinged it via an IPI and it got the IPI, entered the
> > > handler, dequeued a request, and handled it.
> > >
> > > Nearly all of the rest shows the critical request: cpu 0012 did a loop over
> > > probably all other cpus and queued the requests and marked them to be IPI-ed
> > > (including cpu 0006, cnt 01d16cd). Then the cpus marked to receive an IPI
> > > were pinged (cnt 01d16f4 and cnt 01d1751).
> > >
> > > The entry cnt 01d1750 is not of interest here.
> > >
> > > This data confirms that on sending side everything seems to be okay at the
> > > level above the actual IPI sending. On receiver side there seems no IPI to
> > > be seen, but there is no visible reason for a race either.
> > >
> > > It seems as if we need more debugging in the deeper layers: is the IPI
> > > really sent out, and is something being received on the destination cpu?
> > > I'll have another try with even more debugging code, probably in private
> > > on the customer machine first.
> >
> > Apologies for the late reply, was out last week.
> >
> > Excellent news, and thank you!
> >
> > For my part, I have put together a rough prototype script that allows
> > me to run scftorture on larger groups of systems and started running it,
> > though I am hoping that 1,000 is far more than will be required.
> >
> > Your diagnosis of a lost IPI matches what we have been able to glean
> > from the occasional occurrences in the wild on our systems, for whatever
> > that might be worth. The hope is to get something that reproduces more
> > quickly, which would allow deeper debugging at this end as well.
>
> Sometimes one is lucky.
>
> I've found a reproducer while hunting another bug. The test on that
> machine will trigger the csd_lock timeout about once a day.

Nice!!! You are way ahead of me!

> I've used my new debug kernel and found that the IPI is really sent
> out (more precise: the hypervisor has been requested to do so, and
> it didn't report an error). On the target cpu there was no interrupt
> received after that, so the IPI has not been swallowed on the target
> cpu by the Linux kernel.
>
> Will now try to instrument the hypervisor to get more data.

I am increasing the number and types of systems and the test duration.
Ijust started running three different systems with IPI workloads in both
guests and within host over the weekend.

Thanx, Paul