Re: 3.0.101: "blk_rq_check_limits: over max size limit."

From: Ulrich Windl
Date: Wed Dec 07 2016 - 07:25:59 EST


Hi again!

Maybe someone can confirm this:
If you have a device (e.g. a multipath map) whose max_sectors_kb is limited to, say, 64, and you then define an LVM LV using that multipath map as its PV, the LV still reports a larger max_sectors_kb. If you send a big request (a read in my case), the kernel will complain:

kernel: [173116.098798] blk_rq_check_limits: over max size limit.

Note that this message gives no clue about the device involved, the size of the I/O attempted, or the limit that was exceeded.
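
As far as I can tell, the check in block/blk-core.c looks roughly like this in 3.0 (a sketch from memory, not a verbatim quote):

int blk_rq_check_limits(struct request_queue *q, struct request *rq)
{
	if (rq->cmd_flags & REQ_DISCARD)
		return 0;

	/* the stacked request is checked against the lower queue's limits */
	if (blk_rq_sectors(rq) > queue_max_sectors(q) ||
	    blk_rq_bytes(rq) > queue_max_hw_sectors(q) << 9) {
		printk(KERN_ERR "%s: over max size limit.\n", __func__);
		return -EIO;
	}
	/* ... the segment count check follows ... */
	return 0;
}

So a request built against the bigger upper-level limit is rejected when it hits the smaller lower-level queue, and neither the sizes nor the queue show up in the message.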

My expectation would be that either the higher layer reports an I/O error back, so that the user process receives an I/O error, or, alternatively, that the big request is split into acceptable chunks before being passed to the lower layers.

However, none of the above happens; instead the request seems to block the request queue, because subsequent TUR checks also fail:
kernel: [173116.105701] device-mapper: multipath: Failing path 66:384.
kernel: [173116.105714] device-mapper: multipath: Failing path 66:352.
multipathd: 66:384: mark as failed
multipathd: NAP_S11: remaining active paths: 1
multipathd: 66:352: mark as failed
multipathd: NAP_S11: Entering recovery mode: max_retries=6
multipathd: NAP_S11: remaining active paths: 0

(somewhat later)
multipathd: NAP_S11: sdkh - tur checker reports path is up
multipathd: 66:336: reinstated
multipathd: NAP_S11: Recovered to normal mode
kernel: [173117.286712] device-mapper: multipath: Could not failover device 66:368: Handler scsi_dh_alua error 8.
(I don't know the implications of this)

Of course this error does not appear as long as all devices use the same maximum request size, but tests have shown that different SAN disk systems prefer different request sizes (as they split large requests internally to handle them in chunks anyway).

Last seen with this kernel (SLES11 SP4 on x86_64): Linux version 3.0.101-88-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Fri Nov 4 22:07:35 UTC 2016 (b45f205)

Regards,
Ulrich

>>> Ulrich Windl wrote on 23.08.2016 at 17:03 in message <57BC65CD.D1A : 161 : 60728>:
> Hello!
>
> While performance-testing a 3PARdata StoreServ 8400 with SLES11 SP4, I noticed
> that I/O rates dropped until everything more or less stood still. Looking into
> the syslog I found that multipath's TUR checker considered the paths (FC,
> BTW) to be dead. Amazingly, I did not have this problem when running
> read-only tests.
>
> The start looks like this:
> Aug 23 14:44:58 h10 multipathd: 8:32: mark as failed
> Aug 23 14:44:58 h10 multipathd: FirstTest-32: remaining active paths: 3
> Aug 23 14:44:58 h10 kernel: [ 880.159425] blk_rq_check_limits: over max
> size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.159611] blk_rq_check_limits: over max
> size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.159615] blk_rq_check_limits: over max
> size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.159623] device-mapper: multipath: Failing
> path 8:32.
> Aug 23 14:44:58 h10 kernel: [ 880.186609] blk_rq_check_limits: over max
> size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.186626] blk_rq_check_limits: over max
> size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.186628] blk_rq_check_limits: over max
> size limit.
> Aug 23 14:44:58 h10 kernel: [ 880.186631] device-mapper: multipath: Failing
> path 129:112.
> [...]
> It seems the TUR checker plays some ping-pong-like game: paths go up and down.
>
> Now for the Linux part: I found the relevant message in blk-core.c
> (blk_rq_check_limits()).
> First, s/agaist/against/ in " * Such request stacking drivers should check
> those requests agaist". Then there's the problem that the message outputs
> neither blk_rq_sectors(), nor blk_queue_get_max_sectors(), nor the
> underlying device. That makes debugging rather difficult if you customize
> the block queue settings per device, as I did:
>
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of
> queue/rotational for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of
> queue/add_random for FirstTest-31 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of
> queue/scheduler for FirstTest-31 (noop)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of
> queue/max_sectors_kb for FirstTest-31 (128)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of
> queue/rotational for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of
> queue/add_random for FirstTest-32 (0)
> Aug 23 14:32:33 h10 blocktune: (notice) start: activated tuning of
> queue/scheduler for FirstTest-32 (noop)
> Aug 23 14:32:34 h10 blocktune: (notice) start: activated tuning of
> queue/max_sectors_kb for FirstTest-32 (128)
>
> I suspect the "queue/max_sectors_kb=128" is the culprit:
> # multipath -ll FirstTest-32
> FirstTest-32 (360002ac000000000000000040001b383) dm-7 3PARdata,VV
> size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
> `-+- policy='service-time 0' prio=50 status=active
> |- 2:0:0:1 sdet 129:80 failed ready running
> |- 2:0:2:1 sdev 129:112 failed ready running
> |- 1:0:0:1 sdb 8:16 failed ready running
> `- 1:0:1:1 sdc 8:32 failed ready running
> # cat /sys/block/{dm-7,sd{b,c},sde{t,v}}/queue/max_sectors_kb
> 128
> 128
> 128
> 128
> 128
>
> While writing this message, I noticed that I had created a primary partition
> on dm-7:
> # dmsetup ls |grep Fi
> FirstTest-32_part1 (253:8)
> FirstTest-32 (253:7)
> # cat /sys/block/dm-8/queue/max_sectors_kb
> 1024
>
> After "# echo 128 >/sys/block/dm-8/queue/max_sectors_kb" things still did not
> get better.
>
> Can't blk_rq_check_limits() do anything more clever than returning -EIO?
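>
> At a minimum the message could include the numbers and the device. An
> untested sketch of what I have in mind (the names are what I see in the
> 3.0 sources, so please double-check):
>
> 	if (blk_rq_sectors(rq) > queue_max_sectors(q) ||
> 	    blk_rq_bytes(rq) > queue_max_hw_sectors(q) << 9) {
> 		/* untested: report size, limit and (if known) the device */
> 		printk(KERN_ERR "%s: %s: %u sectors exceed max_sectors %u\n",
> 		       __func__,
> 		       rq->rq_disk ? rq->rq_disk->disk_name : "?",
> 		       blk_rq_sectors(rq), queue_max_sectors(q));
> 		return -EIO;
> 	}
>
> Splitting the request at that point is probably not easy, but a message
> like the above would at least tell which device and which limit are
> involved.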
>
> Regards,
> Ulrich
> P.S: Keep me in CC:, please!
>
>
>