Re: [PATCH RFC 0/2] percpu_ida: Take into account CPU topology when stealing tags

From: Ming Lei
Date: Wed Apr 30 2014 - 05:40:58 EST


On Wed, Apr 30, 2014 at 5:13 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
> On 04/29/2014 05:35 AM, Ming Lei wrote:
>> On Sat, Apr 26, 2014 at 10:03 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> On 2014-04-25 18:01, Ming Lei wrote:
>>>>
>>>> Hi Jens,
>>>>
>>>> On Sat, Apr 26, 2014 at 5:23 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>
>>>>> On 04/25/2014 03:10 AM, Ming Lei wrote:
>>>>>
>>>>> Sorry, I did run it the other day. It has little to no effect here, but
>>>>> that's mostly because there's so much other crap going on in there. The
>>>>> most effective way to currently make it work better, is just to ensure
>>>>> the caching pool is of a sane size.
>>>>
>>>>
>>>> Yes, that is just what the patch is doing, :-)
>>>
>>>
>>> But it's not enough.
>>
>> Yes, the patch is only for the case of multiple hw queues with
>> some CPUs offline.
>>
>>> For instance, my test case, it's 255 tags and 64 CPUs.
>>> We end up in cross-cpu spinlock nightmare mode.
>>
>> IMO, the scaling problem in the above case might be
>> caused by either the current percpu ida design or blk-mq's
>> usage of it.
>
> That is pretty much my claim, yes. Basically I don't think per-cpu tag
> caching is ever going to be the best solution for the combination of
> modern machines and the hardware that is out there (limited tags).
>
>> One problem in blk-mq is that the 'set->queue_depth'
>> parameter from the driver isn't scalable. Maybe it is reasonable to
>> introduce 'set->min_percpu_cache', so that 'tags->nr_max_cache'
>> can be computed as below:
>>
>> max(nr_tags / hctx->nr_ctx, set->min_percpu_cache)
>>
>> Another question is whether blk-mq can be improved by not simply
>> computing tags->nr_max_cache as 'nr_tags / hctx->nr_ctx'. The current
>> approach assumes there is parallel I/O activity on each CPU, but I
>> am wondering if that is the common case in reality. Suppose there
>> are N (N << online CPUs in a big machine) CPUs doing concurrent
>> I/O; the percpu cache could then be increased a lot, to
>> (nr_tags / N).
>
> That would certainly help the common case, but it'd still be slow for
> the cases where you DO have IO from lots of sources.

It may be difficult to figure out an efficient solution for the unusual case.

> If we consider 8-16
> tags the minimum for balanced performance, then that doesn't take a
> whole lot of CPUs to spread out the tag space. Just looking at a case
> today on SCSI with 62 tags. AHCI and friends have 31 tags. Even for the
> "bigger" case of the Micron card, you still only have 255 active tags.
> And we probably want to split that up into groups of 32, making the
> problem even worse.

Yes, there is a contradiction between limited hardware queue
tags and wanting a bigger local cpu cache. It is really a
challenge to figure out an efficient approach when lots of CPUs
have to contend for very limited resources.
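To put rough numbers on it (just a back-of-the-envelope sketch;
'min_percpu_cache' is only the hypothetical knob from my previous
mail, not an existing field):

	/* sketch only: a floor on the percpu cache size */
	static unsigned int percpu_cache_size(unsigned int nr_tags,
					      unsigned int nr_ctx,
					      unsigned int min_percpu_cache)
	{
		/*
		 * With the current 'nr_tags / nr_ctx' style split,
		 * 255 tags over 64 CPUs gives only 3 tags per CPU,
		 * far below the 8-16 mentioned above, so without a
		 * floor the percpu cache is nearly useless.
		 * (max() is the kernel macro from linux/kernel.h.)
		 */
		return max(nr_tags / nr_ctx, min_percpu_cache);
	}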

Maybe blk-mq can learn from network devices: move the hw
queue_depth constraint down into the low-level path (such as
blk_mq_run_hw_queue()), and keep adequate tags in the percpu pool,
which means the percpu pool's nr_tags can be much bigger than
queue_depth so that each CPU keeps enough cached tags. When the hw
queue is full, congestion control can be applied in
blk_mq_alloc_request() to avoid the cross-cpu spinlock nightmare in
percpu allocation/free. But if the device requires tag values to be
less than queue_depth, more work is needed for this approach.
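
Just to make the idea concrete, a very rough pseudo-C sketch (the
struct, fields and helper names below are made up for illustration,
they are not existing blk-mq interfaces):

	struct hw_queue {
		unsigned int		queue_depth;	/* device limit */
		atomic_t		inflight;	/* issued to hardware */
		wait_queue_head_t	free_wait;
		/* percpu tag pool sized well above queue_depth */
	};

	/*
	 * Allocation side: tags come from the local CPU cache, and the
	 * hw queue_depth limit is enforced by congestion control here
	 * (or on the dispatch side in blk_mq_run_hw_queue()) instead of
	 * by shrinking the tag space itself.
	 */
	static void throttle_on_full_hw_queue(struct hw_queue *hq)
	{
		/* reserve a slot below queue_depth, sleeping while full */
		wait_event(hq->free_wait,
			   atomic_add_unless(&hq->inflight, 1,
					     hq->queue_depth));
	}

	/* completion side frees the slot and wakes up waiters */
	static void hw_queue_complete(struct hw_queue *hq)
	{
		atomic_dec(&hq->inflight);
		wake_up(&hq->free_wait);
	}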


Thanks,
--
Ming Lei