Re: [PATCH v4 3/5] blktrace: fix debugfs use after free

From: Greg KH
Date: Sun May 10 2020 - 02:26:44 EST


On Sat, May 09, 2020 at 03:10:56AM +0000, Luis Chamberlain wrote:
> In commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory"),
> merged in v4.12, Omar fixed the original blktrace code for request-based
> drivers (multiqueue). This however left in place a possible crash, if you
> happen to abuse blktrace while racing to remove / add a device.
>
> We used to use asynchronous removal of the request_queue, and with that
> the issue was easier to reproduce. Now that we have reverted to
> synchronous removal of the request_queue, the issue is still possible to
> reproduce; it is just a bit more difficult.
>
> We essentially run two instances of break-blktrace which add/remove
> a loop device, and set up a blktrace and just never tear the blktrace
> down. We do this twice in parallel. This is easily reproduced with the
> break-blktrace run_0004.sh script.
>
> We can end up with two types of panics each reflecting where we
> race, one a failed blktrace setup:
>
> [ 252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
> [ 252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [ 252.436592] #PF: supervisor write access in kernel mode
> [ 252.439822] #PF: error_code(0x0002) - not-present page
> [ 252.442967] PGD 0 P4D 0
> [ 252.444656] Oops: 0002 [#1] SMP NOPTI
> [ 252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164
> [ 252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [ 252.456343] RIP: 0010:down_write+0x15/0x40
> [ 252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
> 00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
> 45 08 5d
> [ 252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
> [ 252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
> [ 252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [ 252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
> [ 252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
> [ 252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
> [ 252.473346] FS: 00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
> [ 252.475225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
> [ 252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 252.479866] Call Trace:
> [ 252.480322] simple_recursive_removal+0x4e/0x2e0
> [ 252.481078] ? debugfs_remove+0x60/0x60
> [ 252.481725] ? relay_destroy_buf+0x77/0xb0
> [ 252.482662] debugfs_remove+0x40/0x60
> [ 252.483518] blk_remove_buf_file_callback+0x5/0x10
> [ 252.484328] relay_close_buf+0x2e/0x60
> [ 252.484930] relay_open+0x1ce/0x2c0
> [ 252.485520] do_blk_trace_setup+0x14f/0x2b0
> [ 252.486187] __blk_trace_setup+0x54/0xb0
> [ 252.486803] blk_trace_ioctl+0x90/0x140
> [ 252.487423] ? do_sys_openat2+0x1ab/0x2d0
> [ 252.488053] blkdev_ioctl+0x4d/0x260
> [ 252.488636] block_ioctl+0x39/0x40
> [ 252.489139] ksys_ioctl+0x87/0xc0
> [ 252.489675] __x64_sys_ioctl+0x16/0x20
> [ 252.490380] do_syscall_64+0x52/0x180
> [ 252.491032] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> And the other on the device removal:
>
> [ 128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
> [ 128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
> [ 128.619537] #PF: supervisor write access in kernel mode
> [ 128.622700] #PF: error_code(0x0002) - not-present page
> [ 128.625842] PGD 0 P4D 0
> [ 128.627585] Oops: 0002 [#1] SMP NOPTI
> [ 128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164
> [ 128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
> [ 128.640471] RIP: 0010:down_write+0x15/0x40
> [ 128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
> cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
> 00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
> 45 08 5d
> [ 128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
> [ 128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
> [ 128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
> [ 128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
> [ 128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
> [ 128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
> [ 128.660500] FS: 00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
> [ 128.662204] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
> [ 128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 128.667282] Call Trace:
> [ 128.667801] simple_recursive_removal+0x4e/0x2e0
> [ 128.668663] ? debugfs_remove+0x60/0x60
> [ 128.669368] debugfs_remove+0x40/0x60
> [ 128.669985] blk_trace_free+0xd/0x50
> [ 128.670593] __blk_trace_remove+0x27/0x40
> [ 128.671274] blk_trace_shutdown+0x30/0x40
> [ 128.671935] blk_release_queue+0x95/0xf0
> [ 128.672589] kobject_put+0xa5/0x1b0
> [ 128.673188] disk_release+0xa2/0xc0
> [ 128.673786] device_release+0x28/0x80
> [ 128.674376] kobject_put+0xa5/0x1b0
> [ 128.674915] loop_remove+0x39/0x50 [loop]
> [ 128.675511] loop_control_ioctl+0x113/0x130 [loop]
> [ 128.676199] ksys_ioctl+0x87/0xc0
> [ 128.676708] __x64_sys_ioctl+0x16/0x20
> [ 128.677274] do_syscall_64+0x52/0x180
> [ 128.677823] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The common theme here is:
>
> debugfs: Directory 'loop0' with parent 'block' already present
>
> This crash happens because of how blktrace uses the debugfs directory
> where it places its files. Upon init we always create the same directory
> which would be needed by blktrace, but we only do this for make_request
> (multiqueue) block drivers, and never for request-based block
> drivers. Furthermore, that directory is only created on init for the
> entire disk. This means that if you use blktrace on a partition, we'll
> always be creating a new directory regardless of whether you are
> doing blktrace on a make_request (multiqueue) driver or a
> request-based block driver.
>
> These directory creations are only associated with a path, and so
> when a debugfs_remove() is called it removes everything in its way.
> A device removal will remove all blktrace files, and so if a blktrace
> is still present a cleanup of blktrace files later will end up trying
> to remove dentries pointing to NULL.
>
> We can fix the UAF by using a debugfs directory which moving forward
> will always be accessible if debugfs is enabled for both make_request
> drivers (multiqueue) and request-based block drivers, *and* for all
> partitions upon creation. This ensures that the directories are only
> removed on device removal, which closes the race where files are removed
> underneath an active blktrace.
>
> For partitions we simply symlink to the whole disk's debugfs_dir, as the
> debugfs_dir is shared anyway and this limits us to only run one blktrace
> for the entire disk.
>
> We special-case a solution for scsi-generic, which got blktrace support
> added by Christof via commit 6da127ad0918 ("blktrace: Add blktrace
> ioctls to SCSI generic devices"), upstream since v2.6.25. scsi-generic
> drives use a character device, however behind the scenes we have a scsi
> device with a request_queue. How this is used varies by class of driver
> (TYPE_DISK, TYPE_TAPE, etc). Care must be taken because scsi drivers
> probe asynchronously, but the scsi-generic class_interface
> sg_add_device() will complete first. This means
> sd_probe() will use device_add_disk() for TYPE_DISK and have its
> debugfs_dir created *after* the scsi-generic device is created.
>
> For scsi-generic then we symlink to the real debugfs_dir only during a
> blktrace ioctl, but we do this only once. We also have to special-case
> yet another solution for drivers which use the bsg queue.
>
> This was tested with:
>
>   o nvme partitions
>   o ISCSI with tgt, and blktracing against scsi-generic with:
>     o block
>     o tape
>     o cdrom
>     o media changer
>
> Screenshots of what the debugfs for block looks like after running
> blktrace on a system with sg0 which has a raid controller and then sg1
> as the media changer:
>
> # ls -l /sys/kernel/debug/block
> total 0
> drwxr-xr-x 3 root root 0 May 9 02:31 bsg
> drwxr-xr-x 19 root root 0 May 9 02:31 nvme0n1
> drwxr-xr-x 19 root root 0 May 9 02:31 nvme1n1
> lrwxrwxrwx 1 root root 0 May 9 02:31 nvme1n1p1 -> nvme1n1
> lrwxrwxrwx 1 root root 0 May 9 02:31 nvme1n1p2 -> nvme1n1
> lrwxrwxrwx 1 root root 0 May 9 02:31 nvme1n1p3 -> nvme1n1
> lrwxrwxrwx 1 root root 0 May 9 02:31 nvme1n1p5 -> nvme1n1
> lrwxrwxrwx 1 root root 0 May 9 02:31 nvme1n1p6 -> nvme1n1
> drwxr-xr-x 2 root root 0 May 9 02:33 sch0
> lrwxrwxrwx 1 root root 0 May 9 02:33 sg0 -> bsg/2:0:0:0
> lrwxrwxrwx 1 root root 0 May 9 02:33 sg1 -> sch0
> drwxr-xr-x 5 root root 0 May 9 02:31 vda
> lrwxrwxrwx 1 root root 0 May 9 02:31 vda1 -> vda
>
> Code for handling the debugfs_dir did get more complicated for
> scsi-generic but this is technical debt. For the other types of devices,
> this simplifies the code considerably, with the only penalty now being
> that we're always creating the request queue debugfs directory for the
> request-based block device drivers.
>
> The symlink use also makes it clearer when the request_queue is shared.
>
> This patch is part of the work which disputes the severity of
> CVE-2019-19770 which shows this issue is not a core debugfs issue, but
> a misuse of debugfs within blktrace.
>
> Cc: Bart Van Assche <bvanassche@xxxxxxx>
> Cc: Omar Sandoval <osandov@xxxxxx>
> Cc: Hannes Reinecke <hare@xxxxxxxx>
> Cc: Nicolai Stange <nstange@xxxxxxx>
> Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: yu kuai <yukuai3@xxxxxxxxxx>
> Cc: Christof Schmitt <christof.schmitt@xxxxxxxxxx>
> Reported-by: syzbot+603294af2d01acfdd6da@xxxxxxxxxxxxxxxxxxxxxxxxx
> Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
> Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>
> ---
> block/blk-debugfs.c | 187 +++++++++++++++++++++++++++++++++++
> block/blk-mq-debugfs.c | 5 -
> block/blk-sysfs.c | 3 +
> block/blk.h | 16 +++
> block/bsg.c | 2 +
> block/partitions/core.c | 9 ++
> drivers/scsi/ch.c | 1 +
> drivers/scsi/sg.c | 75 ++++++++++++++
> drivers/scsi/st.c | 2 +
> include/linux/blkdev.h | 4 +-
> include/linux/blktrace_api.h | 1 -
> include/linux/genhd.h | 69 +++++++++++++
> kernel/trace/blktrace.c | 24 +++--
> 13 files changed, 385 insertions(+), 13 deletions(-)
>
> diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
> index 19091e1effc0..d40f12aecf8a 100644
> --- a/block/blk-debugfs.c
> +++ b/block/blk-debugfs.c
> @@ -8,8 +8,195 @@
> #include <linux/debugfs.h>
>
> struct dentry *blk_debugfs_root;
> +struct dentry *blk_debugfs_bsg = NULL;
> +
> +/**
> + * enum blk_debugfs_dir_type - block device debugfs directory type
> + * @BLK_DBG_DIR_BASE: the block device debugfs_dir exists on the base
> + * system <system-debugfs-dir>/block/ debugfs directory.
> + * @BLK_DBG_DIR_BSG: the block device debugfs_dir is under the directory
> + * <system-debugfs-dir>/block/bsg/
> + */
> +enum blk_debugfs_dir_type {
> + BLK_DBG_DIR_BASE = 1,
> + BLK_DBG_DIR_BSG,
> +};
>
> void blk_debugfs_register(void)
> {
> blk_debugfs_root = debugfs_create_dir("block", NULL);
> }
> +
> +static struct dentry *queue_get_base_dir(enum blk_debugfs_dir_type type)
> +{
> + switch (type) {
> + case BLK_DBG_DIR_BASE:
> + return blk_debugfs_root;
> + case BLK_DBG_DIR_BSG:
> + return blk_debugfs_bsg;
> + }
> + return NULL;
> +}
> +
> +static void queue_debugfs_register_type(struct request_queue *q,
> + const char *name,
> + enum blk_debugfs_dir_type type)
> +{
> + struct dentry *base_dir = queue_get_base_dir(type);
> +
> + q->debugfs_dir = debugfs_create_dir(name, base_dir);
> +}
> +
> +/**
> + * blk_queue_debugfs_register - register the debugfs_dir for the block device
> + * @q: the associated request_queue of the block device
> + * @name: the name of the block device exposed
> + *
> + * This is used to create the debugfs_dir used by the block layer and blktrace.
> + * Drivers which use any of the *add_disk*() calls or variants have this called
> + * automatically for them. This directory is removed automatically on
> + * blk_release_queue() once the request_queue reference count reaches 0.
> + */
> +void blk_queue_debugfs_register(struct request_queue *q, const char *name)
> +{
> + queue_debugfs_register_type(q, name, BLK_DBG_DIR_BASE);
> +}
> +EXPORT_SYMBOL_GPL(blk_queue_debugfs_register);
> +
> +/**
> + * blk_queue_debugfs_unregister - remove the debugfs_dir for the block device
> + * @q: the associated request_queue of the block device
> + *
> + * Removes the debugfs_dir for the request_queue on the associated block device.
> + * This is handled for you on blk_release_queue(), and that should only be
> + * called once.
> + *
> + * Since we don't care where the debugfs_dir was created this is used for all
> + * types of enum blk_debugfs_dir_type.
> + */
> +void blk_queue_debugfs_unregister(struct request_queue *q)
> +{
> + debugfs_remove_recursive(q->debugfs_dir);
> +}
> +
> +static struct dentry *queue_debugfs_symlink_type(struct request_queue *q,
> + const char *src,
> + const char *dst,
> + enum blk_debugfs_dir_type type)
> +{
> + struct dentry *dentry = ERR_PTR(-EINVAL);
> + char *dir_dst;
> +
> + dir_dst = kzalloc(PATH_MAX, GFP_KERNEL);
> + if (!dir_dst)
> + return dentry;
> +
> + switch (type) {
> + case BLK_DBG_DIR_BASE:
> + if (dst)
> + snprintf(dir_dst, PATH_MAX, "%s", dst);
> + else if (!IS_ERR_OR_NULL(q->debugfs_dir))
> + snprintf(dir_dst, PATH_MAX, "%s",
> + q->debugfs_dir->d_name.name);

How can debugfs_dir be NULL/error here?

And grabbing the name of a debugfs file is sketchy, just use the name
that you think you already have, from the device, don't rely on debugfs
working here.
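
A minimal userspace sketch of that suggestion, not taken from the patch:
the symlink target is built from a name the caller already holds (for
example disk->disk_name) instead of being read back from
q->debugfs_dir->d_name.name. The helper name blk_symlink_target() and
its return convention are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): copy a caller-supplied
 * device name into buf as the symlink target. It never looks at any
 * debugfs state, so it works even if debugfs directory creation failed.
 *
 * Returns 0 on success, -1 if an argument is missing or the name
 * does not fit in buf.
 */
static int blk_symlink_target(char *buf, size_t len, const char *disk_name)
{
	int n;

	if (!buf || !len || !disk_name || !*disk_name)
		return -1;

	n = snprintf(buf, len, "%s", disk_name);
	if (n < 0 || (size_t)n >= len)
		return -1;	/* name would have been truncated */

	return 0;
}
```

Because the name comes from the caller, the result does not depend on
whether the debugfs directory exists or what its dentry currently holds.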

And why a symlink anyway? That's a new addition, what is going to work
with that in userspace?

> +#ifdef CONFIG_DEBUG_FS
> + p->debugfs_sym = blk_queue_debugfs_symlink(disk->queue, dev_name(pdev),
> + disk->disk_name);
> +#endif

No need to #ifdef this, right?
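
One way to drop the #ifdef at the call site, sketched here in userspace C
under stated assumptions: the header provides a static inline stub when
CONFIG_DEBUG_FS is not set, which is the usual kernel pattern. The return
type of blk_queue_debugfs_symlink() is assumed to be struct dentry *,
struct part_sketch and part_add_debugfs() are hypothetical stand-ins, and
the debugfs_sym field is assumed to exist in both configurations.

```c
#include <assert.h>
#include <stddef.h>

struct dentry;			/* opaque in this sketch */
struct request_queue;		/* opaque in this sketch */

#ifdef CONFIG_DEBUG_FS
/* Real implementation lives elsewhere when debugfs is enabled. */
struct dentry *blk_queue_debugfs_symlink(struct request_queue *q,
					 const char *src, const char *dst);
#else
/* Stub: does nothing and returns NULL, so callers need no #ifdef. */
static inline struct dentry *
blk_queue_debugfs_symlink(struct request_queue *q,
			  const char *src, const char *dst)
{
	(void)q;
	(void)src;
	(void)dst;
	return NULL;
}
#endif

/* Hypothetical, cut-down stand-in for the partition structure. */
struct part_sketch {
	struct dentry *debugfs_sym;
};

/* The call site is unconditional in both configurations. */
static void part_add_debugfs(struct part_sketch *p, struct request_queue *q,
			     const char *part_name, const char *disk_name)
{
	assert(p != NULL);
	p->debugfs_sym = blk_queue_debugfs_symlink(q, part_name, disk_name);
}
```

With CONFIG_DEBUG_FS undefined the stub is selected and debugfs_sym is
simply set to NULL, so the conditional lives in one header instead of at
every caller.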

I feel like this patch series keeps getting more complex and messier
over time :(

greg k-h