RE: scsi-mq V2

From: Elliott, Robert (Server Storage)
Date: Wed Jul 09 2014 - 20:56:16 EST




> -----Original Message-----
> From: Jens Axboe [mailto:axboe@xxxxxxxxx]
> Sent: Wednesday, 09 July, 2014 2:38 PM
> To: dgilbert@xxxxxxxxxxxx; Christoph Hellwig; James Bottomley; Bart Van
> Assche; Elliott, Robert (Server Storage); linux-scsi@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: scsi-mq V2
>
> On 2014-07-09 18:39, Douglas Gilbert wrote:
> > On 14-07-08 10:48 AM, Christoph Hellwig wrote:
> >> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
> >>> Changes from V1:
> >>> - rebased on top of the core-for-3.17 branch, most notable the
> >>> scsi logging changes
> >>> - fixed handling of cmd_list to prevent crashes for some heavy
> >>> workloads
> >>> - fixed incorrect handling of !target->can_queue
> >>> - avoid scheduling a workqueue on I/O completions when no queues
> >>> are congested
> >>>
> >>> In addition to the patches in this thread there also is a git
> >>> available at:
> >>>
> >>> git://git.infradead.org/users/hch/scsi.git scsi-mq.2
> >>
> >>
> >> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> >> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> >> from June 29th. Robert Elliot found a problem with not fully zeroed
> >> out UNMAP CDBs, which is fixed by the saner discard handling in that
> >> series.
> >>
> >> There is a new patch to factor the code from the above series for
> >> blk-mq use, which I've attached below. Besides that the only changes
> >> are minor merge fixups in the main blk-mq usage patch.
> >
> > Be warned: both Rob Elliott and I can easily break
> > the scsi-mq.3 branch. It seems as though a regression
> > has slipped in. I notice that Christoph has added a
> > new branch called "scsi-mq.3-no-rebase".
>
> Rob/Doug, those issues look very much like problems in the aio code. Can
> either/both of you try with:
>
> f8567a3845ac05bb28f3c1b478ef752762bd39ef
> edfbbf388f293d70bf4b7c0bc38774d05e6f711a
>
> reverted (in that order) and see if that changes anything.
>
>
> --
> Jens Axboe

scsi-mq.3-no-rebase, which has all the scsi updates from scsi-mq.3
but is based on 3.16.0-rc2 rather than 3.16.0-rc4, works fine:
* ^C exits fio cleanly with scsi_debug devices
* ^C exits fio cleanly with mpt3sas devices
* fio hits 1M IOPS with 16 hpsa devices
* fio hits 700K IOPS with 6 mpt3sas devices
* 38 device test to mpt3sas, hpsa, and scsi_debug devices runs OK


With:
* scsi-mq-3, which is based on 3.16.0-rc4
* [PATCH] x86-64: fix vDSO build from https://lkml.org/lkml/2014/7/3/738
* those two aio patches reverted

the problem still occurs - fio results in low or 0 IOPS, with perf top
reporting unusual amounts of time spent in do_io_submit and io_submit.

perf top:
14.38% [kernel] [k] do_io_submit
13.71% libaio.so.1.0.1 [.] io_submit
13.32% [kernel] [k] system_call
11.60% [kernel] [k] system_call_after_swapgs
8.88% [kernel] [k] lookup_ioctx
8.78% [kernel] [k] copy_user_generic_string
7.78% [kernel] [k] io_submit_one
5.97% [kernel] [k] blk_flush_plug_list
2.73% fio [.] fio_libaio_commit
2.70% [kernel] [k] sysret_check
2.68% [kernel] [k] blk_finish_plug
1.98% [kernel] [k] blk_start_plug
1.17% [kernel] [k] SyS_io_submit
1.17% [kernel] [k] __get_user_4
0.99% fio [.] io_submit@plt
0.85% [kernel] [k] _copy_from_user
0.79% [kernel] [k] system_call_fastpath

Repeating some of last night's investigation details for the lists:

ftrace of one of the CPUs for all functions shows these
are repeatedly being called:

<...>-34508 [004] .... 6360.790714: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790714: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790714: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790714: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790715: do_io_submit <-SyS_io_submit
<...>-34508 [004] .... 6360.790715: lookup_ioctx <-do_io_submit
<...>-34508 [004] .... 6360.790715: blk_start_plug <-do_io_submit
<...>-34508 [004] .... 6360.790715: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790715: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790715: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790715: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790715: do_io_submit <-SyS_io_submit
<...>-34508 [004] .... 6360.790715: lookup_ioctx <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_start_plug <-do_io_submit
<...>-34508 [004] .... 6360.790716: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790716: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790716: do_io_submit <-SyS_io_submit
<...>-34508 [004] .... 6360.790716: lookup_ioctx <-do_io_submit
<...>-34508 [004] .... 6360.790716: blk_start_plug <-do_io_submit
<...>-34508 [004] .... 6360.790717: io_submit_one <-do_io_submit
<...>-34508 [004] .... 6360.790717: blk_finish_plug <-do_io_submit
<...>-34508 [004] .... 6360.790717: blk_flush_plug_list <-blk_finish_plug
<...>-34508 [004] .... 6360.790717: SyS_io_submit <-system_call_fastpath
<...>-34508 [004] .... 6360.790717: do_io_submit <-SyS_io_submit

fs/aio.c do_io_submit is apparently completing (many times) - it's not
stuck in the for loop:
blk_start_plug(&plug);

/*
* AKPM: should this return a partial result if some of the IOs were
* successfully submitted?
*/
for (i=0; i<nr; i++) {
struct iocb __user *user_iocb;
struct iocb tmp;

if (unlikely(__get_user(user_iocb, iocbpp + i))) {
ret = -EFAULT;
break;
}

if (unlikely(copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {
ret = -EFAULT;
break;
}

ret = io_submit_one(ctx, user_iocb, &tmp, compat);
if (ret)
break;
}
blk_finish_plug(&plug);


fs/aio.c io_submit_one is not getting to fget, which is traceable:
/* enforce forwards compatibility on users */
if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}

/* prevent overflows */
if (unlikely(
(iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
(iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
((ssize_t)iocb->aio_nbytes < 0)
)) {
pr_debug("EINVAL: io_submit: overflow check\n");
return -EINVAL;
}

req = aio_get_req(ctx);
if (unlikely(!req))
return -EAGAIN;

req->ki_filp = fget(iocb->aio_fildes);

I don't have that file compiled with -DDEBUG so the pr_debug
prints are unavailable. The -EAGAIN seems most likely to lead
to a hang like this.

aio_get_req is not getting to kmem_cache_alloc, which is
traceable:
if (!get_reqs_available(ctx))
return NULL;

req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);

get_reqs_available is probably returning false because not
enough reqs are available compared to req_batch:

struct kioctx_cpu *kcpu;
bool ret = false;

preempt_disable();
kcpu = this_cpu_ptr(ctx->cpu);

if (!kcpu->reqs_available) {
int old, avail = atomic_read(&ctx->reqs_available);

do {
if (avail < ctx->req_batch)
goto out;

old = avail;
avail = atomic_cmpxchg(&ctx->reqs_available,
avail, avail - ctx->req_batch);
} while (avail != old);

kcpu->reqs_available += ctx->req_batch;
}

ret = true;
kcpu->reqs_available--;
out:
preempt_enable();
return ret;


---
Rob Elliott HP Server Storage



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/