Re: [PATCH] [SCSI] libsas: remove expander from dev list on error

From: Luben Tuikov
Date: Wed Aug 31 2011 - 12:52:32 EST


----- Original Message -----

> From: Dan Williams <dan.j.williams@xxxxxxxxx>
> To: Luben Tuikov <ltuikov@xxxxxxxxx>
> Cc: "linux-scsi@xxxxxxxxxxxxxxx" <linux-scsi@xxxxxxxxxxxxxxx>; "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>; "JBottomley@xxxxxxxxxxxxx" <JBottomley@xxxxxxxxxxxxx>; Jack Wang <jack_wang@xxxxxxxxx>
> Sent: Wednesday, July 27, 2011 1:32 PM
> Subject: Re: [PATCH] [SCSI] libsas: remove expander from dev list on error
>
> On Tue, Jul 26, 2011 at 11:10 PM, Luben Tuikov <ltuikov@xxxxxxxxx> wrote:
>> If expander discovery fails (sas_discover_expander()),
>> remove the expander from the port device list
>> (sas_ex_discover_expander()), before freeing it. Else
>> the list is corrupted and, e.g., when we attempt to send
>> SMP commands to other devices, the kernel oopses.
>>
>> Signed-off-by: Luben Tuikov <ltuikov@xxxxxxxxx>
>> Reviewed-by: Jack Wang <jack_wang@xxxxxxxxx>
>> ---
>>  drivers/scsi/libsas/sas_expander.c |    3 +++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/scsi/libsas/sas_expander.c
> b/drivers/scsi/libsas/sas_expander.c
>> index 874e29d..f84084b 100644
>> --- a/drivers/scsi/libsas/sas_expander.c
>> +++ b/drivers/scsi/libsas/sas_expander.c
>> @@ -849,6 +849,9 @@ static struct domain_device *sas_ex_discover_expander(
>>
>>         res = sas_discover_expander(child);
>>         if (res) {
>> +               spin_lock_irq(&parent->port->dev_list_lock);
>> +               list_del(&child->dev_list_node);
>> +               spin_unlock_irq(&parent->port->dev_list_lock);
>>                 kfree(child);
>>                 return NULL;
>
> Acked-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>
> ...but it raises a couple questions for potential follow on patches:
>
> 1/ Do we need to add the device to port->dev_list prior to the
> discovery?  Seems cleaner to defer the list_add until after the
> discovery has succeeded.

In my original SAS Transport layer, devices were added to the
port device list in one and only one place, after discovery,
at sas_kobj_set() (which doesn't exist now because my code was
edited off-line before being submitted into the kernel by Bottomley).
Thus due to the architecture of the code, this bug doesn't exist in my original
implementation, but was introduced in the version submitted by Bottomley,
when devices are being added to the port list in more than one place.

My development git history was removed as the code was taken
off my git trees, edited off line, off git, and then submitted, thus erasing
its original state or its development history (in git). So it's now impossible
for a 3rd party to see what bugs were introduced due to those off-line
changes before submission.

And as you can see, there are bugs. The bugs are, as I can see, stemming
from the current _explicitness_ of libsas, a la, just-in-place fix, as if
someone has been studying the code by line-stepping it with a debugger,
and then going "A-ha! Here is where we can stick this in." SATA support
seems to have been added in a similar manner, "where can we stick this
to get it to work". I.e. there is no grand architecture, no vision.

The original code, was much more implicit: you had to pull back to see
what was happening.

It was designed as a transport layer. For example, for a new SAS
controller, you'd only need to write a PCI device driver which exposed
the ASIC's particular domain access implementation in a structure
of function pointers, that the PCI LLDD fills in, and the SAS Transport
Layer (SAS TL) fills in, and this is how both communicate. Then,
the SAS TL would register with the SCSI layer. It was a beautiful solution
of enterprise quality and design. SATA support was added in 2006 which
is _not_ libsata, but also a layer of abstraction with a structure of function
pointers, in effect allowing you to register a SATA device on _any_ transport
protocol, not only SAS, as a matter of filling in the stubs of a few function pointers.
(Thus also allowing you to easily emulate a SATA device if you so desire.)

The original SAS TL, also represented the whole SAS domain in sysfs,
exactly as it would look like in the physical world and was thus really
easy to do "tree -d" and then look at your work bench or rack and see
what is connected to what or what is missing or has been added.
See this http://marc.info/?l=linux-scsi&m=112629509826900&w=2,
for example.

The event infrastructure, removed by Bottomley, allowed for an infinite number
of events in a finite amount of memory. Removing it, introduced a few bugs, which
were I believe fixed back then, but it is impossible to know how many more there are.

> 2/ We have unlocked list manipulations in sas_ex_discover_end_dev(),
> sas_unregister_common_dev(), and sas_ex_discover_end_dev()

Yes, I can see that and that is very unfortunate. And as I mentioned, this
doesn't exist in the original design of the code. The point is that after
the changes offline people using Linux end up with substandard
and mediocre implementation of SAS.

Locking and coherency were achieved via the kobj infrastructure which was
a beautiful thing at the time, but has seen its number of changes beginning
in 06-07 (git can help there).

   Luben
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/