Re: Linux 3.0 STILL dies on USB device hotplug - please merge fix ASAP

From: Alan Stern
Date: Fri Jul 22 2011 - 16:19:17 EST


On Fri, 22 Jul 2011, James Bottomley wrote:

> On Fri, 2011-07-22 at 19:02 +0200, Andi Kleen wrote:
> > Hi,
> >
> > 3.0 still oopses and dies immediately on USB device hot unplug.
> > The same problem also triggered with SAS device according to Dan.
> >
> > There was a lot of debugging on this a few weeks back and Alan Stern
> > posted a SCSI layer patch that fixed the problem (for both USB
> > and SAS):
> >
> > http://68.183.106.108/lists/linux-usb/msg49001.html
> >
> > But for some reason that patch didn't make it into 3.0 and 3.0 still
> > happily oopses, just as the RCs did.
> >
> > Can you please merge this patch ASAP? This should also go to stable.
> >
> > At least for me it makes pure 3.0 very risky to use, because these USB
> > hotunplug events are not uncommon and I end up with a dead machine.
>
> Like I said at the time, the patch is wrong because of the relocation of
> the queue teardown.

That argument doesn't seem right. The queue teardown (i.e., the call
to scsi_free_queue()) was moved by commit 86cbfb5607d4b81b ([SCSI] put
stricter guards on queue dead checks). Here's the changelog:

    SCSI uses request_queue->queuedata == NULL as a signal that the queue
    is dying. We set this state in the sdev release function. However,
    this allows a small window where we release the last reference but
    haven't quite got to this stage yet and so something will try to take
    a reference in scsi_request_fn and oops. It's very rare, but we had a
    report here, so we're pushing this as a bug fix

    The actual fix is to set request_queue->queuedata to NULL in
    scsi_remove_device() before we drop the reference. This causes
    correct automatic rejects from scsi_request_fn as people who hold
    additional references try to submit work and prevents anything from
    getting a new reference to the sdev that way.

It's quite evident that the point of the commit was to move the line
setting queue->queuedata to NULL; the scsi_free_queue() call merely
went along for the ride (by mistake perhaps?). I don't see any reason
why moving scsi_free_queue() back to where it was should cause a
problem.
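
To make the mechanism in that changelog concrete, here is a minimal
userspace sketch. It is not the kernel code: every toy_* name is
hypothetical, and the structures are stand-ins for request_queue and
the sdev. It assumes only what the changelog states, namely that a
NULL queuedata marks the queue as dying and that the request function
rejects work when it sees that.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the kernel's real structures. */
struct toy_sdev {
    int id;
};

struct toy_queue {
    void *queuedata;    /* NULL means "this queue is dying" */
};

/*
 * Models the guard the changelog attributes to scsi_request_fn:
 * look at queuedata first and reject the request if it is NULL.
 * Returns the device id on success, -1 on rejection.
 */
static int toy_request_fn(struct toy_queue *q)
{
    struct toy_sdev *sdev = q->queuedata;

    if (!sdev)
        return -1;      /* queue is dying: reject, do not touch sdev */
    return sdev->id;
}

/*
 * Models the ordering the changelog describes for removal:
 * clear queuedata first, and only then let go of the device.
 * Freeing the queue itself is deliberately not done here; in this
 * sketch it is a separate step that can happen later.
 */
static void toy_remove_device(struct toy_queue *q)
{
    struct toy_sdev *sdev = q->queuedata;

    q->queuedata = NULL;    /* mark dying before the device goes away */
    free(sdev);
}
```

In this sketch a caller that still holds the queue after
toy_remove_device() gets -1 from toy_request_fn() instead of
dereferencing a freed device. Nothing in that guard depends on when the
queue itself is freed, which is the distinction drawn above.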

Alan Stern
