Re: mvsas errors in 2.6.36

From: Thomas Fjellstrom
Date: Sat Dec 04 2010 - 01:57:46 EST


On December 3, 2010, David Milburn wrote:
> Thomas Fjellstrom wrote:
> > On December 2, 2010, Thomas Fjellstrom wrote:
> >> On December 1, 2010, Thomas Fjellstrom wrote:
> >>> On November 17, 2010, you wrote:
> >>>> On 11/17/2010 08:53 AM, Thomas Fjellstrom wrote:
> >>>> [snip]
> >>>>
> >>>>> Still no fatal errors, but the problem is still happening regularly.
> >>>>> It causes a pause in disk io of a couple seconds at least. Really
> >>>>> quite annoying.
> >>>>>
> >>>>> One thing that's got me wondering: could this be a power issue?
> >>>>> It almost seems like (from the messages) that a single drive (any
> >>>>> drive) is freaking out, and returning an error that probably
> >>>>> shouldn't happen (no CHS 0?), which could mean the drive is
> >>>>> underpowered and the firmware is flipping out. I'm not entirely
> >>>>> sure. The system has a 750w decent quality Antec power supply. The
> >>>>> total power use of the system shouldn't come over half that (phenom
> >>>>> II x4 810 cpu, gigabyte ma790fxtud5p mb, low profile nvidia 9400GS
> >>>>> gpu, 8 sata hdds, 3 fans, etc). I'm mostly sure the 12v rails are
> >>>>> spread out evenly, but I have yet to make absolutely sure.
> >>>
> >>> Made absolutely sure. I had been worrying that I was overloading one of
> >>> the rails on the PSU, but it turns out that it isn't a multi 12v rail
> >>> PSU after all. The box and advertising say it is, but the electronics
> >>> inside all say it's a single 12v rail device.
> >>>
> >>>> [snip]
> >>>>
> >>>> After the mvsas update in 2.6.35 this started happening to me as well;
> >>>> at least it's better than the previous state - not working.. ;-)
> >>>> However, after rolling a new 2.6.35 with the following fix that is
> >>>> queued up for the upcoming 2.6.35 and 2.6.36 stable releases, they
> >>>> seem to have disappeared - 3 days and counting.
> >>>>
> >>>> http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git;a=blob_plain;f=queue-2.6.33/libsas-fix-ncq-mixing-with-non-ncq.patch;h=b6d7c92094d95ad67a3b23c2e09c25d4fbd0f46b;hb=HEAD
> >>>>
> >>>> The fix is queued up for the next 2.6.36 and 2.6.35 stable
> >>>> point-releases.
> >>>
> >>> Ahah. I wonder how I missed that when I first read it. I'll have to
> >>> give the stable .36 kernel a try. Thanks!
> >>
> >> No fix so far:
> >>
> >> [ 2539.040104] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880222f00000 task=ffff88018b3e2980 slot=ffff880222f265d0 slot_idx=x2
> >> [ 2539.040118] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
> >> [ 2539.040154] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x89800.
> >> [ 2539.040163] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001001
> >> [ 2539.040176] drivers/scsi/mvsas/mv_sas.c 2111:phy7 Unplug Notice
> >> [ 2539.050220] drivers/scsi/mvsas/mv_sas.c
>
> The controller is reporting a phy ready state change, which is why you see
> the unplug notice.
>
> Can you enable SCSI_SAS_LIBSAS_DEBUG and see if libsas reports anything
> before the abort?
>
> You should be able to turn on in your kernel config:
>
> Device Drivers
> SCSI device support
> SCSI Transports
> Compile the SAS Domain Transport Attributes in debug mode

Hi, I've done as you requested.
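
For anyone else reproducing this, here is a minimal sketch of how one might confirm that the debug option actually ended up in the built kernel's config. It assumes the option appears as CONFIG_SCSI_SAS_LIBSAS_DEBUG=y in the usual .config format; the function name option_enabled and the sample text are made up for illustration.

```python
def option_enabled(config_text, option="CONFIG_SCSI_SAS_LIBSAS_DEBUG"):
    """Return True if `option` is set to "y" in kernel .config-style text."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            # Lines like "# CONFIG_FOO is not set" mean the option is disabled.
            continue
        name, sep, value = line.partition("=")
        if sep and name == option:
            return value == "y"
    return False


# Hypothetical config excerpt, not copied from the machine in this thread.
sample_config = """\
CONFIG_SCSI_SAS_LIBSAS=m
# CONFIG_SCSI_SAS_HOST_SMP is not set
CONFIG_SCSI_SAS_LIBSAS_DEBUG=y
"""

print(option_enabled(sample_config))
```

Where the config text comes from depends on the setup: the .config in the build tree is the obvious source, and some systems also expose /proc/config.gz when CONFIG_IKCONFIG_PROC is enabled.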

Here's all of the output from the first (and currently only) event:

[ 1428.000080] sas: command 0xffff880184ed1680, task 0xffff88017a0f2680, timed out: BLK_EH_NOT_HANDLED
[ 1428.080051] sas: command 0xffff880224e03880, task 0xffff88017a0f24c0, timed out: BLK_EH_NOT_HANDLED
[ 1428.080077] sas: Enter sas_scsi_recover_host
[ 1428.080085] sas: trying to find task 0xffff88017a0f2680
[ 1428.080092] sas: sas_scsi_find_task: aborting task 0xffff88017a0f2680
[ 1428.080102] drivers/scsi/mvsas/mv_sas.c 1703:<7>mv_abort_task() mvi=ffff880224040000 task=ffff88017a0f2680 slot=ffff880224066680 slot_idx=x4
[ 1428.080113] sas: sas_scsi_find_task: querying task 0xffff88017a0f2680
[ 1428.080119] drivers/scsi/mvsas/mv_sas.c 1632:mvs_query_task:rc= 5
[ 1428.080125] sas: sas_scsi_find_task: task 0xffff88017a0f2680 failed to abort
[ 1428.080130] sas: task 0xffff88017a0f2680 is not at LU: I_T recover
[ 1428.080135] sas: I_T nexus reset for dev 0000000000000000
[ 1428.080172] drivers/scsi/mvsas/mv_sas.c 2083:port 0 ctrl sts=0x89800.
[ 1428.080180] drivers/scsi/mvsas/mv_sas.c 2085:Port 0 irq sts = 0x1001
[ 1428.080193] drivers/scsi/mvsas/mv_sas.c 2111:phy0 Unplug Notice
[ 1428.090228] drivers/scsi/mvsas/mv_sas.c 2083:port 0 ctrl sts=0x199800.
[ 1428.090236] drivers/scsi/mvsas/mv_sas.c 2085:Port 0 irq sts = 0x1081
[ 1428.111954] drivers/scsi/mvsas/mv_sas.c 2083:port 0 ctrl sts=0x199800.
[ 1428.111962] drivers/scsi/mvsas/mv_sas.c 2085:Port 0 irq sts = 0x10000
[ 1428.111969] drivers/scsi/mvsas/mv_sas.c 2138:notify plug in on phy[0]
[ 1428.146351] drivers/scsi/mvsas/mv_sas.c 1224:port 0 attach dev info is 20004
[ 1428.146351] drivers/scsi/mvsas/mv_sas.c 1226:port 0 attach sas addr is 0
[ 1428.222044] drivers/scsi/mvsas/mv_sas.c 378:phy 0 byte dmaded.
[ 1428.222109] sas: sas_form_port: phy0 belongs to port0 already(1)!
[ 1430.300028] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[0]:rc= 0
[ 1430.300040] sas: I_T 0000000000000000 recovered
[ 1430.300048] sas: sas_ata_task_done: SAS error 8d
[ 1430.300059] ata9: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[ 1430.300883] ata9.00: device reported invalid CHS sector 0
[ 1430.300888] ata9: status=0x01 { Error }
[ 1430.300894] ata9: error=0x04 { DriveStatusError }
[ 1430.300950] sas: trying to find task 0xffff88017a0f24c0
[ 1430.300956] sas: sas_scsi_find_task: aborting task 0xffff88017a0f24c0
[ 1430.300963] sas: sas_scsi_find_task: task 0xffff88017a0f24c0 is done
[ 1430.300968] sas: sas_eh_handle_sas_errors: task 0xffff88017a0f24c0 is done
[ 1430.300974] sas: sas_ata_task_done: SAS error 8d
[ 1430.300982] ata12: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[ 1430.301777] ata12.00: device reported invalid CHS sector 0
[ 1430.301782] ata12: status=0x01 { Error }
[ 1430.301788] ata12: error=0x04 { DriveStatusError }
[ 1430.301808] sas: --- Exit sas_scsi_recover_host
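
To put a number on the pause, here is a rough sketch that measures how long each libsas error-handler pass takes, using the "Enter sas_scsi_recover_host" and "Exit sas_scsi_recover_host" lines as markers. It assumes dmesg-style lines with the bracketed timestamp prefix shown above; the function name recovery_windows is made up, and the sample is just four lines lifted from the log above.

```python
import re

# Matches the "[ 1428.080077] " timestamp prefix printk puts on each line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def recovery_windows(lines):
    """Yield (start, end) timestamps for each Enter/Exit recover_host pair."""
    start = None
    for line in lines:
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue  # skip anything without a timestamp prefix
        stamp, message = float(match.group(1)), match.group(2)
        if "Enter sas_scsi_recover_host" in message:
            start = stamp
        elif "Exit sas_scsi_recover_host" in message and start is not None:
            yield start, stamp
            start = None


sample_log = """\
[ 1428.080077] sas: Enter sas_scsi_recover_host
[ 1428.080193] drivers/scsi/mvsas/mv_sas.c 2111:phy0 Unplug Notice
[ 1430.300028] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[0]:rc= 0
[ 1430.301808] sas: --- Exit sas_scsi_recover_host
""".splitlines()

for begin, end in recovery_windows(sample_log):
    print(f"recovery pass took {end - begin:.3f}s")
```

On the event above that works out to a little over two seconds inside the error handler, on top of the time spent waiting for the commands to time out in the first place.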

Thanks.

> Thanks,
> David
>
> >> 2083:port 7 ctrl sts=0x199800.
> >> [ 2539.050229] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x1001081
> >> [ 2539.071157] drivers/scsi/mvsas/mv_sas.c 2083:port 7 ctrl sts=0x199800.
> >> [ 2539.071165] drivers/scsi/mvsas/mv_sas.c 2085:Port 7 irq sts = 0x10000
> >> [ 2539.071173] drivers/scsi/mvsas/mv_sas.c 2138:notify plug in on phy[7]
> >> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1224:port 7 attach dev info is 5000002
> >> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 1226:port 7 attach sas addr is 7
> >> [ 2539.081142] drivers/scsi/mvsas/mv_sas.c 378:phy 7 byte dmaded.
> >> [ 2541.270047] drivers/scsi/mvsas/mv_sas.c 1586:mvs_I_T_nexus_reset for device[5]:rc= 0
> >> [ 2541.270066] ata14: translated ATA stat/err 0x01/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> >> [ 2541.270926] ata14: status=0x01 { Error }
> >> [ 2541.271747] ata14: error=0x04 { DriveStatusError }
> >>
> >> That appeared after about 42 minutes of uptime.
> >
> > So after about 32 hours of uptime there have been 36 separate events.
> > Each spits out messages similar to the above, and each comes with a
> > noticeable pause while the drive is reset.
> >
> > There are a number of possible reasons that I'm still having issues:
> > - I managed to mess up the git checkout
> > - My problem isn't related to the fix
> > - The fix doesn't cover all cases of the problem it was meant to fix
> >
> > I'm not certain which of them it is. I'd be more inclined to think I
> > messed up the checkout, as I did patch something in, but the patches
> > were completely unrelated and shouldn't have affected the scsi or ata
> > systems at all. At this point I'm just grasping at straws.
> >
> > In case my card is somehow different than expected, I'll paste the lspci
> > info for it: (AOC-SASLP-MV8)
> >
> > 04:00.0 SCSI storage controller: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B (rev 01)
> >     Subsystem: Super Micro Computer Inc Device 0500
> >     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> >     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >     Latency: 0, Cache Line Size: 64 bytes
> >     Interrupt: pin A routed to IRQ 19
> >     Region 2: I/O ports at df00 [size=128]
> >     Region 4: Memory at fdef0000 (64-bit, non-prefetchable) [size=64K]
> >     [virtual] Expansion ROM at fdd00000 [disabled] [size=256K]
> >     Capabilities: [48] Power Management version 2
> >         Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
> >         Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
> >     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> >         Address: 0000000000000000 Data: 0000
> >     Capabilities: [e0] Express (v1) Legacy Endpoint, MSI 00
> >         DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> >                 ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> >         DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> >                 RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >                 MaxPayload 128 bytes, MaxReadReq 2048 bytes
> >         DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> >         LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <256ns, L1 unlimited
> >                 ClockPM- Surprise- LLActRep- BwNot-
> >         LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> >                 ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> >         LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >     Capabilities: [100 v1] Advanced Error Reporting
> >         UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >         UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >         UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> >         CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> >         CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> >         AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
> >     Kernel driver in use: mvsas
> >
> > It's installed in a Phenom II X4 810 based system with a 790FX/SB750
> > chipset, 8G DDR3 1333 RAM, 6 1TB Seagate 7200.12 SATAII drives connected
> > to the card via sas->sata breakout cables, and a couple of 4-drive SATA
> > hot-swap bays. There are also two Seagate 7200.12 500G drives hooked up
> > to the motherboard SATA controller. The system is powered via an Antec
> > Neopower Blue 650W PSU which is probably only half loaded. The system
> > also has a discrete gfx card, but it's a low-end, low-profile, fanless
> > card that takes up next to no power.
> >
> > I'm still willing to help test any fixes for the mvsas driver on this
> > card.
> >
> > Thank you.


--
Thomas Fjellstrom
thomas@xxxxxxxxxxxxx