Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()

From: Johannes Thumshirn
Date: Wed Oct 14 2015 - 13:36:34 EST

Next message: Linus Torvalds: "Re: [GIT PULL] workqueue fixes for v4.3-rc5"
Previous message: Paul E. McKenney: "Re: [tip:locking/urgent] compiler, atomics: Provide READ_ONCE_NOCHECK ()"
In reply to: James Bottomley: "Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()"
Next in thread: Christoph Hellwig: "Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Zitat von James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>:

On Wed, 2015-10-14 at 16:39 +0200, Johannes Thumshirn wrote:

On Wed, 2015-10-14 at 07:30 -0700, James Bottomley wrote:
> On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote:
> > Removing a SCSI target via scsi_remove_target() suspected to be
> > racy. When a
> > sibling get's removed from the list it can occassionly happen that
> > one CPU is
> > stuck endlessly looping around this code block
> >
> > list_for_each_entry(starget, &shost->__targets, siblings) {
> > if (starget->state == STARGET_DEL)
> > continue;
>
> How long is the __targets list? It seems a bit unlikely that this is
> the exact cause, because for a short list all in STARGET_DEL that
> loop
> should exit very quickly. Where in the code does scsi_remove_target
> +0x68/0x240 actually point to?
>
> Is it not a bit more likely that we're following a removed list
> element?
> Since that points back to itself, the list_for_each_entry() would
> then
> circulate forever. If that's the case the simple fix would be to use
> the safe version of the list traversal macro.

Yes it is traversing a removed element and yes the patches 2/3 and 3/3
are introducing the safe version of list_for_each_entry(), but they
also decouple the search for elements to be removed from the actual
removal. This is what my initial proposal did as well. Christoph wanted
me to decouple the whole process from the host_lock though and this is
what this patches do as well now.

OK, so I really need you to separate the problems. Fixing the bug
you're reporting does not require a complete rework of the locking
infrastructure; it just requires replacing the traversal macro with the
safe version, can you verify that and it can go into fixes?

Yes. I can sent a patch for it tomorrow.

Then we can discuss the merits of doing a locking rework in this area
separately from the idea that it's some sort of bug fix.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Linus Torvalds: "Re: [GIT PULL] workqueue fixes for v4.3-rc5"
Previous message: Paul E. McKenney: "Re: [tip:locking/urgent] compiler, atomics: Provide READ_ONCE_NOCHECK ()"
In reply to: James Bottomley: "Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()"
Next in thread: Christoph Hellwig: "Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]