Re: [RFC PATCH v2 2/2] blk-mq: Lockout tagset iter when freeing rqs

From: John Garry
Date: Mon Dec 21 2020 - 07:08:44 EST


On 18/12/2020 22:43, Bart Van Assche wrote:
On 12/17/20 3:07 AM, John Garry wrote:
References to old IO sched requests are currently cleared from the
tagset when freeing those requests; switching elevator or changing
request queue depth is such a scenario in which this occurs.

However, this does not stop the potentially racy behaviour of freeing
and clearing a request reference between a tagset iterator getting a
reference to a request and actually dereferencing that request.

Such a use-after-free can be triggered, as follows:

==================================================================
BUG: KASAN: use-after-free in bt_iter+0xa0/0x120
Read of size 8 at addr ffff00108d589300 by task fio/3052

CPU: 32 PID: 3052 Comm: fio Tainted: GW
5.10.0-rc4-64839-g2dcf1ee5054f #693
Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon
D05 IT21 Nemo 2.0 RC0 04/18/2018
Call trace:
dump_backtrace+0x0/0x2d0
show_stack+0x18/0x68
dump_stack+0x100/0x16c
print_address_description.constprop.12+0x6c/0x4e8
kasan_report+0x130/0x200
__asan_load8+0x9c/0xd8
bt_iter+0xa0/0x120
blk_mq_queue_tag_busy_iter+0x2d8/0x540
blk_mq_in_flight+0x80/0xb8
part_stat_show+0xd8/0x238
dev_attr_show+0x44/0x90
sysfs_kf_seq_show+0x128/0x1c8
kernfs_seq_show+0xa0/0xb8
seq_read_iter+0x1ec/0x6a0
seq_read+0x1d0/0x250
kernfs_fop_read+0x70/0x330
vfs_read+0xe4/0x250
ksys_read+0xc8/0x178
__arm64_sys_read+0x44/0x58
el0_svc_common.constprop.2+0xc4/0x1e8
do_el0_svc+0x90/0xa0
el0_sync_handler+0x128/0x178
el0_sync+0x158/0x180

This is found experimentally by running fio on 2x SCSI disks - 1x disk
holds the root partition. Userspace is constantly triggering the tagset
iter from reading the root (gen)disk partition info. And so if the IO
sched is constantly changed on the other disk, eventually the UAF occurs,
as described above.

Hi John,

Hi Bart,


Something is not clear to me. The above call stack includes
blk_mq_queue_tag_busy_iter(). That function starts with
percpu_ref_tryget(&q->q_usage_counter) and ends with calling
percpu_ref_put(&q->q_usage_counter). So it will only iterate over a tag set
if q->q_usage_counter is live. However, both blk_mq_update_nr_requests()
and elevator_switch() start with freezing the request queue.
blk_mq_freeze_queue() starts with killing q->q_usage_counter and waits
until that counter has dropped to zero. In other words,
blk_mq_queue_tag_busy_iter() should not iterate over a tag set while a tag
set is being freed or reallocated.

Right, this is what I thought, but Ming reminded me "blk_mq_queue_tag_busy_iter() can be run on another request queue just
between one driver tag is allocated and updating the request map, so one
extra request reference still can be grabbed."

Does this mean that we do not yet have
a full explanation about why the above call stack can be triggered?

We understand it, and I'll describe my experiment in detail:
a. fio runs on 2x disks, sda (root partition disk) and sdb.
b. for sda, userpace triggers blk_mq_queue_tag_busy_iter(), as in stackframe above. Since its request queue is not frozen, it will iter the busy tags.
c. on sdb, I continuously change the IO scheduler.

So sdb request queue gets frozen as we switch IO sched, but we could have this sequence of events:
- blk_mq_queue_tag_busy_iter() on sda takes reference to a sdb request
- Getting a tag and updating ->rqs[] in tagset is not atomic
- requests for sdb cleared in tagset and request memory is freed
- blk_mq_queue_tag_busy_iter() on sda still holds reference to sdb request and dereferences it -> UAF

Hope it's clear. It is a bit unlikely, I will admit, but it still can happen and UAF is never good. So please let me know if other idea to solve.

Thanks,
John