Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect

From: Jarod Wilson
Date: Tue Feb 12 2008 - 00:07:59 EST


Stefan Richter wrote:
Jarod Wilson wrote:
Stefan Richter wrote:
+static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
+{
+ struct fw_card *card =
fw_device(lu->tgt->unit->device.parent)->card;
+
+ if (!atomic_read(&lu->tgt->dont_block) &&
+ lu->generation != card->generation &&
+ atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {
Just to be absolutely sure, we don't need any barriers here to ensure we
get the right generations, do we?

I didn't think so. But I will carefully look at it again later this
week. The function definitely must not block the device when the
generation is current. We look at two data fields here which makes this
even more problematic. Could be that we need locks after all.

I didn't see anything else earlier on that guaranteed we got the current generation in both cases, but I didn't look exhaustively, and you know this code a lot better than I do, so I definitely defer to your better judgment, just wanted to make sure it had been considered.


Also, this isn't expected to let I/O survive a disk being unplugged
briefly, then plugged back in, is it?

No, this only tells the SCSI core to not bother fw-sbp2's
.queuecommand() with new commands before reconnect. This will
mysteriously convince the SCSI core to not put the device offline too
quickly and will stabilize application client behavior thanks to
considerably fewer command retries.

Okay, that's what I thought and what I observed in operation.

To survive real or perceived temporary unplugs ("perceived" unplugs can
happen if a third node is slowly plugged in or out), we need to do
something in fw-device.c. We have to keep the fw_device around after
node removal event until a timeout, to check newly added devices whether
they are in fact one of the undead devices, and to revive that one
rather than creating a new one.

Gotcha. So basically, a temporary table of devices recently "removed", which will expire to full/actual removal after some relatively short interval, but which we'll also check for matches of "newly" connected devices to see if we should cancel removing the device and just pretend like it never actually went away. Right? I wonder if there's any sort of guidance on this sort of thing in the firewire specs... I'll make an effort to search for relevant info, unless you've already got it.

(I recall that being discussed, but I think it was as a 'would be
nice to do in the future' thing).

I realized now that it is a 'need it sooner than later' thing because of
these "perceived" unplugs. We need this feature at least with a minimal
timeout, otherwise people will sometimes lose connection to their
devices (the scsi_device will be destroyed and a new one created) when
they plug a 3rd or 4th or nth node. As mentioned in another post, this
is an actual regression for those who migrated from ieee1394 to fw-core.
But fear not, it looks like I will have a prolonged weekend. :-)

Heh, sounds good. I missed out on my entire past weekend (and half a week of work) due to family illnesses. Hoping to throw a bunch of time at further firewire work this week though.

--
Jarod Wilson
jwilson@xxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/