RE: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
From: Sagar.Biradar
Date: Tue Dec 06 2022 - 01:00:25 EST
Hi James,
We have been gathering the related information, and we have finally found some details.
I am reviewing them as I write this email.
I will get back to you once I have reviewed and sorted through that information in more detail.
Thanks
Sagar
-----Original Message-----
From: James Hilliard <james.hilliard1@xxxxxxxxx>
Sent: Sunday, December 4, 2022 5:26 AM
To: Sagar Biradar - C34249 <Sagar.Biradar@xxxxxxxxxxxxx>
Cc: martin.petersen@xxxxxxxxxx; khorenko@xxxxxxxxxxxxx; christian@xxxxxxxxxxxxxx; aacraid@xxxxxxxxxxxxx; Don Brace - C33706 <Don.Brace@xxxxxxxxxxxxx>; Tom White - C33503 <Tom.White@xxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
EXTERNAL EMAIL: Do not click links or open attachments unless you know the content is safe
On Thu, Nov 17, 2022 at 11:36 PM <Sagar.Biradar@xxxxxxxxxxxxx> wrote:
>
> Hi James,
> Thanks for your response.
> This issue seems to be slightly different and may have originated from the drive itself (I'm not too sure).
Yeah, the drive was having hardware issues, although it does sound like a potential error condition that's not being correctly handled by aacraid.
>
> The original issue I was talking about would still occur with the missing legacy interrupt on certain processors.
> We are still actively looking into the old "int-x missing" issue, which we suspect may originate from the patch.
Hmm, are there any details available on this "int-x missing" issue? I couldn't find any public details or reports relating to it.
Is there a list of CPUs known to be affected?
Does it occur in the vendor aacraid release that has this patch merged?
>
>
>
> -----Original Message-----
> From: James Hilliard <james.hilliard1@xxxxxxxxx>
> Sent: Thursday, November 17, 2022 3:26 AM
> To: Sagar Biradar - C34249 <Sagar.Biradar@xxxxxxxxxxxxx>
> Cc: martin.petersen@xxxxxxxxxx; khorenko@xxxxxxxxxxxxx;
> christian@xxxxxxxxxxxxxx; aacraid@xxxxxxxxxxxxx; Don Brace - C33706
> <Don.Brace@xxxxxxxxxxxxx>; Tom White - C33503
> <Tom.White@xxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
>
> On Tue, Nov 15, 2022 at 10:05 AM <Sagar.Biradar@xxxxxxxxxxxxx> wrote:
> >
> > Hi James,
> > I have looked into the patch thoroughly.
> > We suspect this change might expose an old legacy interrupt issue on some processors.
>
> I did see this error once with this patch when a drive was having issues:
> [ 4306.357531] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.030025] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.030111] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.030172] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 4335.189886] aacraid: Host bus reset request. SCSI hang ?
> [ 4335.189951] aacraid 0000:81:00.0: outstanding cmd: midlevel-0
> [ 4335.189989] aacraid 0000:81:00.0: outstanding cmd: lowlevel-0
> [ 4335.190101] aacraid 0000:81:00.0: outstanding cmd: error handler-3
> [ 4335.190141] aacraid 0000:81:00.0: outstanding cmd: firmware-0
> [ 4335.190177] aacraid 0000:81:00.0: outstanding cmd: kernel-0
> [ 4335.274070] aacraid 0000:81:00.0: Controller reset type is 3
> [ 4335.274142] aacraid 0000:81:00.0: Issuing IOP reset
> [ 4365.862127] aacraid 0000:81:00.0: IOP reset succeeded
> [ 4365.895079] aacraid: Comm Interface type2 enabled
> [ 4374.938119] aacraid 0000:81:00.0: Scheduling bus rescan
> [ 4387.022913] sd 0:1:41:0: [sdi] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
> [ 4387.022988] sd 0:1:41:0: [sdi] 4096-byte physical blocks
> [ 5643.714301] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.349423] BUG: kernel NULL pointer dereference, address: 0000000000000018
> [ 5672.351532] #PF: supervisor read access in kernel mode
> [ 5672.353262] #PF: error_code(0x0000) - not-present page
> [ 5672.354860] PGD 8000007ad6ac7067 P4D 8000007ad6ac7067 PUD 7af0892067 PMD 0
> [ 5672.356444] Oops: 0000 [#1] SMP PTI
> [ 5672.358075] CPU: 9 PID: 644201 Comm: cc1plus Tainted: P O 5.15.64-1-pve #1
> [ 5672.359749] Hardware name: Supermicro Super Server/X10DRC, BIOS 3.4 05/21/2021
> [ 5672.361465] RIP: 0010:dma_direct_unmap_sg+0x49/0x1a0
> [ 5672.363223] Code: ec 20 89 4d d4 4c 89 45 c8 85 d2 0f 8e bb 00 00 00 49 89 fe 49 89 f7 89 d3 45 31 ed 4c 8b 05 ae fd b0 01 49 8b be 60 02 00 00 <45> 8b 4f 18 49 8b 77 10 49 f7 d0 48 85 ff 0f 84 06 01 00 00 4c 8b
> [ 5672.367024] RSP: 0000:ffffa4ff58c7cde0 EFLAGS: 00010046
> [ 5672.369020] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000001
> [ 5672.371073] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
> [ 5672.373007] RBP: ffffa4ff58c7ce28 R08: 0000000000000000 R09: 0000000000000001
> [ 5672.374795] R10: 0000000000000000 R11: ffffa4ff58c7cff8 R12: 0000000000000000
> [ 5672.376418] R13: 0000000000000000 R14: ffff88968e1ec0d0 R15: 0000000000000000
> [ 5672.378136] FS: 00007ff103d25ac0(0000) GS:ffff89547fac0000(0000) knlGS:0000000000000000
> [ 5672.379760] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5672.381402] CR2: 0000000000000018 CR3: 0000007ae90cc004 CR4: 00000000001706e0
> [ 5672.383023] Call Trace:
> [ 5672.384673] <IRQ>
> [ 5672.386282] ? task_tick_fair+0x88/0x530
> [ 5672.386469] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.387921] dma_unmap_sg_attrs+0x32/0x50
> [ 5672.391431] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.393273] scsi_dma_unmap+0x3b/0x50
> [ 5672.397079] aacraid: Host adapter abort request.
> aacraid: Outstanding commands on (0,1,41,0):
> [ 5672.398180] aac_srb_callback+0x88/0x3c0 [aacraid]
>
> Does that look related?
>
> >
> > We are currently debugging and digging for further details so that we can explain it in more detail.
> > I will keep the thread posted as soon as we have something interesting.
> >
> > Sagar
> >
> > -----Original Message-----
> > From: James Hilliard <james.hilliard1@xxxxxxxxx>
> > Sent: Monday, November 14, 2022 12:13 AM
> > To: Sagar Biradar - C34249 <Sagar.Biradar@xxxxxxxxxxxxx>
> > Cc: martin.petersen@xxxxxxxxxx; khorenko@xxxxxxxxxxxxx;
> > christian@xxxxxxxxxxxxxx; aacraid@xxxxxxxxxxxxx; Don Brace - C33706
> > <Don.Brace@xxxxxxxxxxxxx>; Tom White - C33503
> > <Tom.White@xxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx; Linux Kernel
> > Mailing List <linux-kernel@xxxxxxxxxxxxxxx>
> > Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
> >
> > On Thu, Oct 27, 2022 at 1:17 PM <Sagar.Biradar@xxxxxxxxxxxxx> wrote:
> > >
> > > Hi James and Konstantin,
> > >
> > > *Limiting the audience to avoid spamming*
> > >
> > > Sorry for the delayed response; I was on vacation.
> > > This one somehow got missed because the person who was looking into it is no longer with the company.
> > >
> > > I will look into this. Meanwhile, I wanted to check whether you (or someone else you know) had a chance to test this thoroughly with the latest kernel.
> > > I will get back to you with some more questions or a confirmation within a day or two at most.
> >
> > Did this ever get looked at?
> >
> > As this exact patch was merged into the vendor aacraid driver a while ago, I'm not sure why it wouldn't be good to merge it into mainline as well.
> >
> > Vendor aacraid release with this patch merged:
> > https://download.adaptec.com/raid/aac/linux/aacraid-linux-src-1.2.1-60001.tgz
> >
> > >
> > >
> > > Thanks for your patience.
> > > Sagar
> > >
> > >
> > > -----Original Message-----
> > > From: James Hilliard <james.hilliard1@xxxxxxxxx>
> > > Sent: Thursday, October 27, 2022 1:40 AM
> > > To: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
> > > Cc: Konstantin Khorenko <khorenko@xxxxxxxxxxxxx>; Christian
> > > Großegger <christian@xxxxxxxxxxxxxx>; linux-scsi@xxxxxxxxxxxxxxx;
> > > Adaptec OEM Raid Solutions <aacraid@xxxxxxxxxxxxx>;
> > > Sagar Biradar - C34249 <Sagar.Biradar@xxxxxxxxxxxxx>; Linux Kernel Mailing List
> > > <linux-kernel@xxxxxxxxxxxxxxx>; Don Brace - C33706
> > > <Don.Brace@xxxxxxxxxxxxx>
> > > Subject: Re: [PATCH v3 0/1] aacraid: Host adapter Adaptec 6405 constantly resets under high io load
> > >
> > > On Wed, Oct 19, 2022 at 2:03 PM Konstantin Khorenko <khorenko@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On 10.10.2022 14:31, James Hilliard wrote:
> > > > > On Tue, Feb 22, 2022 at 10:41 PM Martin K. Petersen
> > > > > <martin.petersen@xxxxxxxxxx> wrote:
> > > > >>
> > > > >>
> > > > >> Christian,
> > > > >>
> > > > >>> The faulty patch (Commit: 395e5df79a9588abf) from 2017
> > > > >>> should be repaired with Konstantin Khorenko (1):
> > > > >>>
> > > > >>> scsi: aacraid: resurrect correct arc ctrl checks for Series-6
> > > > >>
> > > > >> It would be great to get this patch resubmitted by Konstantin
> > > > >> and acked by Microchip.
> > >
> > > Can we merge this as-is, since Microchip does not appear to be maintaining this driver any more or responding?
> > >
> > > > >
> > > > > Does the patch need to be rebased?
> > > >
> > > > James, I have just checked: the old patch (v3) applies cleanly onto the latest master branch.
> > > >
> > > > > Based on this, it looks like someone at Microchip may have already reviewed it:
> > > > > v3 changes:
> > > > > * introduced another wrapper to check for devices except for Series 6
> > > > > controllers upon request from Sagar Biradar (Microchip)
> > > >
> > > > Well, back in 2019 I created a bug in the Red Hat Bugzilla:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1724077
> > > > (the bug is private; this is the default for Red Hat bugs)
> > > >
> > > > In this bug, Sagar Biradar (with an @microchip.com email address)
> > > > suggested that I rework the patch; I did that and sent v3.
> > > >
> > > > Nothing happened after that, but about a year later (2020-06-19) the
> > > > bug was closed with the resolution NOTABUG and a comment that S6 users would find the patch useful.
> > > >
> > > > I suppose S6 is so old that Red Hat simply has no customers
> > > > using it, and Microchip itself is also not that interested in handling issues with such old hardware.
> > > >
> > > > Sorry, I was unable to get a final ack from Microchip. I
> > > > wrote directly to the addresses I found on the
> > > > internet and tried to connect via LinkedIn, with no luck.
> > > >
> > > > --
> > > > Konstantin Khorenko