Re: Continuous SD IO causes hung task messages

From: Russell King - ARM Linux admin
Date: Tue Aug 27 2019 - 11:06:22 EST


On Tue, Aug 27, 2019 at 03:52:17PM +0100, Russell King - ARM Linux admin wrote:
> On Tue, Aug 27, 2019 at 03:36:34PM +0100, Russell King - ARM Linux admin wrote:
> > On Tue, Aug 27, 2019 at 03:55:23PM +0200, Ulf Hansson wrote:
> > > On Tue, 27 Aug 2019 at 15:43, Russell King - ARM Linux admin
> > > <linux@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi,
> > > >
> > > > While dd'ing the contents of a SD card, I get hung task timeout
> > > > messages as per below. However, the dd is making progress. Any
> > > > ideas?
> > > >
> > > > Presumably, mmc_rescan doesn't get a look-in while IO is progressing
> > > > for the card?
> > >
> > > Is it a regression?
> > >
> > > There not much of recent mmc core and mmc block changes, that I can
> > > think of at this point.
> >
> > No idea - I just repaired the SD socket after the D2 line became
> > disconnected, and decided to run that command as a test.
> >
> > > > ARM64 host, Macchiatobin, uSD card.
> > >
> > > What mmc host driver is it? mmci?
> >
> > sdhci-xenon.
> >
> > I'm just trying with one CPU online, then I'll try with two. My
> > suspicion is that there's a problem in the ARM64 arch code where
> > unlocking a mutex doesn't get noticed on other CPUs.
> >
> > Hmm, I thought I'd try bringing another CPU online, but it seems
> > like the ARM64 CPU hotplug code is broken:
> >
> > [ 3552.029689] CPU1: shutdown
> > [ 3552.031099] psci: CPU1 killed.
> > [ 3949.835212] CPU1: failed to come online
> > [ 3949.837753] CPU1: failed in unknown state : 0x0
> >
> > which means I can only take CPUs down, I can't bring them back
> > online without rebooting.
>
> Okay, running on a single CPU shows no problems.
>
> Running on four CPUs (as originally) shows that the kworker thread
> _never_ gets scheduled, so the warning is not false.
>
> With three CPUs, same problem.
>
> root@arm-d06300000000:~# ps aux | grep ' D '
> root 34 0.0 0.0 0 0 ? D 15:38 0:00 [kworker/1:1+events_freezable]
> root@arm-d06300000000:~# cat /proc/34/sched
> kworker/1:1 (34, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start : 318689.992440
> se.vruntime : 37750.882357
> se.sum_exec_runtime : 9.421240
> se.nr_migrations : 0
> nr_switches : 1174
> nr_voluntary_switches : 1171
> nr_involuntary_switches : 3
> se.load.weight : 1048576
> se.runnable_weight : 1048576
> se.avg.load_sum : 6
> se.avg.runnable_load_sum : 6
> se.avg.util_sum : 5170
> se.avg.load_avg : 0
> se.avg.runnable_load_avg : 0
> se.avg.util_avg : 0
> se.avg.last_update_time : 318689991680
> se.avg.util_est.ewma : 10
> se.avg.util_est.enqueued : 0
> policy : 0
> prio : 120
> clock-delta : 0
>
> The only thing that changes there is "clock-delta". When I kill the
> dd, I get:
>
> root@arm-d06300000000:~# cat /proc/34/sched
> kworker/1:1 (34, #threads: 1)
> -------------------------------------------------------------------
> se.exec_start : 574025.791680
> se.vruntime : 79996.657300
> se.sum_exec_runtime : 10.916400
> se.nr_migrations : 0
> nr_switches : 1403
> nr_voluntary_switches : 1400
> nr_involuntary_switches : 3
> se.load.weight : 1048576
> se.runnable_weight : 1048576
> se.avg.load_sum : 15
> se.avg.runnable_load_sum : 15
> se.avg.util_sum : 15007
> se.avg.load_avg : 0
> se.avg.runnable_load_avg : 0
> se.avg.util_avg : 0
> se.avg.last_update_time : 574025791488
> se.avg.util_est.ewma : 10
> se.avg.util_est.enqueued : 0
> policy : 0
> prio : 120
> clock-delta : 40
>
> so the thread makes forward progress.
>
> Down to two CPUs:
>
> root@arm-d06300000000:~# ps aux | grep ' D '
> root 34 0.0 0.0 0 0 ? D 15:38 0:00 [kworker/1:1+events_freezable]
>
> Same symptoms. dd and md5sum switch between CPU 0 and CPU1.

Hmm.

static blk_status_t mmc_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)

mq->in_flight[issue_type] += 1;
get_card = (mmc_tot_in_flight(mq) == 1);

if (get_card)
mmc_get_card(card, &mq->ctx);

mmc_get_card() gets the host lock according to the card.

So, if we always have requests in flight (which is probably the case
here) we never drop the host lock, and mmc_rescan() never gets a look
in - hence blocking the kworker.

So this is a real issue with MMC, and not down to something in the
arch.

I suspect the reason that single-CPU doesn't show it is because it is
unable to keep multiple requests in flight.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up