Re: [PATCH] scsi: lpfc: Move work items to a stack list
From: Daniel Wagner
Date: Tue Nov 19 2019 - 13:14:55 EST
Hi James,
On Tue, Nov 19, 2019 at 09:25:54AM -0800, James Smart wrote:
> On 11/19/2019 5:28 AM, Daniel Wagner wrote:
> > On Tue, Nov 12, 2019 at 10:15:00PM -0500, Martin K. Petersen wrote:
> > > > While trying to understand what's going on in the Oops below I figured
> > > > that it could be the result of the invalid pointer access. The patch
> > > > still needs testing by our customer but indepent of this I think the
> > > > patch fixes a real bug.
> > I was able to reproduce the same stack trace with this patch
> > applied... That is obviously bad. The good news, I have access to this
> > machine, so maybe I able to figure out what's the root cause of this
> > crash.
>
> fyi - Dick and I were taking our time reviewing your patch as the major
> concern we had with it was that the splice and then the re-add of the
> work-list could have an abort complete before the IO, which could lead to
> DC.
I think there is potentially a bug. The crash I see might not be
related to my attempt here to fix it. After looking at it again, it
looks a bit ugly and over complex compared to what should do the trick:
@@ -3915,6 +3915,9 @@ lpfc_sli_handle_slow_ring_event_s4(struct lpfc_hba *phba,
cq_event, struct lpfc_cq_event, list);
spin_unlock_irqrestore(&phba->hbalock, iflag);
+ if (!cq_event)
+ break;
+
switch (bf_get(lpfc_wcqe_c_code, &cq_event->cqe.wcqe_cmpl)) {
case CQE_CODE_COMPL_WQE:
irspiocbq = container_of(cq_event, struct lpfc_iocbq,
> We'll look anew at your repro.
Thanks a lot.
Anyway, I tried to gather some more data on it.
(gdb) l *lpfc_sli4_io_xri_aborted+0x29c
0xa1f2c is in lpfc_sli4_io_xri_aborted (drivers/scsi/lpfc/lpfc_scsi.c:543).
538 }
539 #endif
540 qp->abts_scsi_io_bufs--;
541 spin_unlock(&qp->abts_io_buf_list_lock);
542
543 if (psb->rdata && psb->rdata->pnode)
544 ndlp = psb->rdata->pnode;
545 else
546 ndlp = NULL;
547
I looked at the assembler code and if I am not completely mistaken it
is 'psb->rdata' which triggers the KASAN dump.
So I though let's reset psb->rdata when the buf is freed in order to
avoid the report. So this was just to find out if my theory holds.
ff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c
index ba26df90a36a..8e3a05e44d76 100644
--- a/drivers/scsi/lpfc/lpfc_scsi.c
+++ b/drivers/scsi/lpfc/lpfc_scsi.c
@@ -791,10 +791,12 @@ lpfc_release_scsi_buf_s4(struct lpfc_hba *phba, struct lpfc_io_buf *psb)
if (psb->flags & LPFC_SBUF_XBUSY) {
spin_lock_irqsave(&qp->abts_io_buf_list_lock, iflag);
psb->pCmd = NULL;
+ psb->rdata = NULL;
list_add_tail(&psb->list, &qp->lpfc_abts_io_buf_list);
qp->abts_scsi_io_bufs++;
spin_unlock_irqrestore(&qp->abts_io_buf_list_lock, iflag);
} else {
+ psb->rdata = NULL;
lpfc_release_io_buf(phba, (struct lpfc_io_buf *)psb, qp);
}
}
Unfortunatly, this didn't help and instead I got a new KASAN report instead:
[ 167.006341] ==================================================================
[ 167.006428] BUG: KASAN: slab-out-of-bounds in lpfc_unreg_login+0x7c/0xc0 [lpfc]
[ 167.006434] Read of size 2 at addr ffff88a035f792e2 by task lpfc_worker_3/1130
[ 167.006435]
[ 167.006444] CPU: 4 PID: 1130 Comm: lpfc_worker_3 Tainted: G E 5.4.0-rc1-default+ #3
[ 167.006447] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 07/21/2019
[ 167.006449] Call Trace:
[ 167.006463] dump_stack+0x71/0xab
[ 167.006535] ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[ 167.006545] print_address_description.constprop.6+0x1b/0x2f0
[ 167.006615] ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[ 167.006684] ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[ 167.006688] sd 7:0:0:0: alua: port group 3e9 state N non-preferred supports TolUsNA
[ 167.006693] __kasan_report+0x14e/0x192
[ 167.006764] ? lpfc_unreg_login+0x7c/0xc0 [lpfc]
[ 167.006770] kasan_report+0xe/0x20
[ 167.006840] lpfc_unreg_login+0x7c/0xc0 [lpfc]
[ 167.006912] lpfc_sli_def_mbox_cmpl+0x329/0x560 [lpfc]
[ 167.006921] ? _raw_spin_lock_irqsave+0x8d/0xf0
[ 167.006991] ? __lpfc_sli_rpi_release.isra.59+0xc0/0xc0 [lpfc]
[ 167.007057] lpfc_sli_handle_mb_event+0x3f3/0x8d0 [lpfc]
[ 167.007064] ? _raw_spin_lock_irqsave+0x8d/0xf0
[ 167.007069] ? _raw_write_lock_bh+0xe0/0xe0
[ 167.007138] ? lpfc_mbox_get+0x1d/0xa0 [lpfc]
[ 167.007204] ? lpfc_sli_free_hbq+0x80/0x80 [lpfc]
[ 167.007277] lpfc_do_work+0x100b/0x26e0 [lpfc]
[ 167.007285] ? __schedule+0x6ca/0xc10
[ 167.007288] ? _raw_spin_lock_irqsave+0x8d/0xf0
[ 167.007329] ? lpfc_unregister_unused_fcf+0xc0/0xc0 [lpfc]
[ 167.007334] ? wait_woken+0x130/0x130
[ 167.007375] ? lpfc_unregister_unused_fcf+0xc0/0xc0 [lpfc]
[ 167.007380] kthread+0x1b3/0x1d0
[ 167.007383] ? kthread_create_worker_on_cpu+0xc0/0xc0
[ 167.007386] ret_from_fork+0x35/0x40
[ 167.007389]
[ 167.007391] Allocated by task 667:
[ 167.007395] save_stack+0x19/0x80
[ 167.007398] __kasan_kmalloc.constprop.9+0xa0/0xd0
[ 167.007400] __kmalloc+0xfb/0x5d0
[ 167.007439] lpfc_sli4_alloc_resource_identifiers+0x260/0x8a0 [lpfc]
[ 167.007478] lpfc_sli4_hba_setup+0xbbd/0x3480 [lpfc]
[ 167.007517] lpfc_pci_probe_one_s4.isra.46+0x100b/0x1d70 [lpfc]
[ 167.007556] lpfc_pci_probe_one+0x191/0x11b0 [lpfc]
[ 167.007563] local_pci_probe+0x74/0xd0
[ 167.007567] work_for_cpu_fn+0x29/0x40
[ 167.007570] process_one_work+0x46e/0x7f0
[ 167.007573] worker_thread+0x4cd/0x6b0
[ 167.007575] kthread+0x1b3/0x1d0
[ 167.007578] ret_from_fork+0x35/0x40
[ 167.007578]
[ 167.007580] Freed by task 0:
[ 167.007580] (stack is not available)
[ 167.007581]
[ 167.007584] The buggy address belongs to the object at ffff88a035f790c0
[ 167.007584] which belongs to the cache kmalloc-512 of size 512
[ 167.007587] The buggy address is located 34 bytes to the right of
[ 167.007587] 512-byte region [ffff88a035f790c0, ffff88a035f792c0)
[ 167.007588] The buggy address belongs to the page:
[ 167.007592] page:ffffea0080d7de40 refcount:1 mapcount:0 mapping:ffff888107c00c40 index:0x0
[ 167.007595] flags: 0xd7ffffc0000200(slab)
[ 167.007601] raw: 00d7ffffc0000200 ffffea0080ec5588 ffffea0080d3e988 ffff888107c00c40
[ 167.007604] raw: 0000000000000000 ffff88a035f790c0 0000000100000006 0000000000000000
[ 167.007606] page dumped because: kasan: bad access detected
[ 167.007606]
[ 167.007607] Memory state around the buggy address:
[ 167.007610] ffff88a035f79180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 167.007613] ffff88a035f79200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 167.007615] >ffff88a035f79280: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 167.007617] ^
[ 167.007619] ffff88a035f79300: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[ 167.007622] ffff88a035f79380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 167.007623] ==================================================================
(gdb) l *lpfc_unreg_login+0x7c
0x8bb1c is in lpfc_unreg_login (drivers/scsi/lpfc/lpfc_mbox.c:826).
821 memset(pmb, 0, sizeof (LPFC_MBOXQ_t));
822
823 mb->un.varUnregLogin.rpi = rpi;
824 mb->un.varUnregLogin.rsvd1 = 0;
825 if (phba->sli_rev >= LPFC_SLI_REV3)
826 mb->un.varUnregLogin.vpi = phba->vpi_ids[vpi];
827
828 mb->mbxCommand = MBX_UNREG_LOGIN;
829 mb->mbxOwner = OWN_HOST;
830
Not sure if this is not again an red herring.
Thanks,
Daniel