Re: [PATCH v3 3/3] block: avoid to call .bi_end_io() recursively

From: Ming Lei
Date: Fri Apr 29 2016 - 05:33:24 EST

Next message: George Spelvin: "Re: random(4) changes"
Previous message: Linus Walleij: "Re: [PATCH] mfd: ab8500-debugfs: trivial fix of spelling mistake on "between""
In reply to: Mikulas Patocka: "Re: [PATCH v3 3/3] block: avoid to call .bi_end_io() recursively"
Next in thread: Mikulas Patocka: "Re: [PATCH v3 3/3] block: avoid to call .bi_end_io() recursively"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Apr 29, 2016 at 4:39 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
>
> On Fri, 29 Apr 2016, Ming Lei wrote:
>
>> On Fri, Apr 29, 2016 at 12:59 AM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>> >
>> >
>> > On Fri, 29 Apr 2016, Ming Lei wrote:
>> >
>> >> On Thu, Apr 28, 2016 at 11:58 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>> >> >
>> >> >
>> >> > On Thu, 28 Apr 2016, Ming Lei wrote:
>> >> >
>> >> >> Hi Mikulas,
>> >> >>
>> >> >> On Thu, Apr 28, 2016 at 11:29 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>> >> >> >
>> >> >> >
>> >> >> > On Thu, 28 Apr 2016, Ming Lei wrote:
>> >> >> >
>> >> >> >> There were reports about heavy stack use by recursive calling
>> >> >> >> .bi_end_io()([1][2][3]). For example, more than 16K stack is
>> >> >> >> consumed in a single bio complete path[3], and in [2] stack
>> >> >> >> overflow can be triggered if 20 nested dm-crypt is used.
>> >> >> >>
>> >> >> >> Also patches[1] [2] [3] were posted for addressing the issue,
>> >> >> >> but never be merged. And the idea in these patches is basically
>> >> >> >> similar, all serializes the recursive calling of .bi_end_io() by
>> >> >> >> percpu list.
>> >> >> >>
>> >> >> >> This patch still takes the same idea, but uses bio_list to
>> >> >> >> implement it, which turns out more simple and the code becomes
>> >> >> >> more readable meantime.
>> >> >> >>
>> >> >> >> One corner case which wasn't covered before is that
>> >> >> >> bi_endio() may be scheduled to run in process context(such
>> >> >> >> as btrfs), and this patch just bypasses the optimizing for
>> >> >> >> that case because one new context should have enough stack space,
>> >> >> >> and this approach isn't capable of optimizing it too because
>> >> >> >> there isn't easy way to get a per-task linked list head.
>> >> >> >
>> >> >> > Hi
>> >> >> >
>> >> >> > You could use preempt_disable() and then you could use per-cpu list even
>> >> >> > in the process context.
>> >> >>
>> >> >> Image why the .bi_end_io() is scheduled to process context, and the only
>> >> >> workable/simple way I thought of is to use per-task list because it may sleep.
>> >> >
>> >> > The bi_end_io callback should not sleep, even if it is called from the
>> >> > process context.
>> >>
>> >> If it shouldn't sleep, why is it scheduled to run in process context by paying
>> >> extra context switch cost?
>> >
>> > Some device mapper (and other) drivers use a worker thread to process
>> > bios. So the bio may be finished from the worker thread. It would be
>> > advantageous to prevent stack overflow even in this case.
>>
>> If the .bi_end_io wouldn't sleep, it can be put back into interrupt context
>> for the sake of performance when this patch is merged. The cost of context
>> switch in high IOPS case isn't cheap.
>
> If some block device driver in a process context finds out that it needs
> to terminate a bio, it calls bio_endio in a process context. Why would it
> need to trigger an interrupt just to call bio_endio? (and how could it
> trigger an interrupt if that driver perhaps doesn't use interrupts at
> all?) I don't know what are you trying to suggest.

That should be the failure path, and this patch just aims at the normal bio
complete path which is run at 99.999% time. And irq handler often
borrows current process's stack and cause huge stack uses.

>
>> It isn't easy to avoid the recursive calling in process context except you
>> can add something 'task_struct' or introduce 'alloca()' in kernel. Or do you
>> have better ideas?
>
> preempt_disable around the bi_endio callback should be sufficient.

How can a sleepable function be run in preempt disabled context?

Looks you still don't believe the bi_endio scheduled to process context
can sleep, do you?

>
>> >
>> >> And you can find that btrfs_subio_endio_read() does sleep for checksum stuff.
>> >
>> > I'm not an expert on btrfs. What happens if it is called from an
>> > interrupt? Do you have an actual stracktrace when this function is called
>>
>> What do you expect if sleepable function is called in softirq or
>> hardirq handler? :-)
>>
>> > from bio_endio and when it sleeps?
>>
>> The problem is observed in xfstests generic/323 by this patch v1. Sometimes the
>> test hangs, and sometimes kernel oops is triggered. and the issue is
>> fixed by introducing 'if (!in_interrupt())' block for handling running
>> .bi_end_io() from
>> process context.
>>
>> If the block of 'if (!in_interrupt())' is removed and
>> preempt_disable()/preempt_enable() is added around bio->bi_end_io(),
>> the following kernel warning can be seen easily:
>>
>> [ 51.086303] BUG: sleeping function called from invalid context at
>> mm/slab.h:388
>> [ 51.087442] in_atomic(): 1, irqs_disabled(): 0, pid: 633, name: kworker/u8:4
>> [ 51.088575] CPU: 3 PID: 633 Comm: kworker/u8:4 Not tainted 4.6.0-rc3+ #2017
>> [ 51.088578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>> BIOS rel-1.9.0-0-g01a84be-prebuilt.qemu-project.org 04/01/2014
>> [ 51.088637] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
>> [ 51.088640] 0000000000000000 ffff88007bbebc00 ffffffff813d92d3
>> ffff88007ba6ce00
>> [ 51.088643] 0000000000000184 ffff88007bbebc18 ffffffff810a38bb
>> ffffffff81a35310
>> [ 51.088645] ffff88007bbebc40 ffffffff810a3949 0000000002400040
>> 0000000002400040
>> [ 51.088648] Call Trace:
>> [ 51.088651] [<ffffffff813d92d3>] dump_stack+0x63/0x90
>> [ 51.088655] [<ffffffff810a38bb>] ___might_sleep+0xdb/0x120
>> [ 51.088657] [<ffffffff810a3949>] __might_sleep+0x49/0x80
>> [ 51.088659] [<ffffffff811e98e7>] kmem_cache_alloc+0x1a7/0x210
>> [ 51.088670] [<ffffffffc0102741>] ? alloc_extent_state+0x21/0xe0 [btrfs]
>> [ 51.088680] [<ffffffffc0102741>] alloc_extent_state+0x21/0xe0 [btrfs]
>> [ 51.088689] [<ffffffffc01046ce>] __clear_extent_bit+0x2ae/0x3d0 [btrfs]
>> [ 51.088698] [<ffffffffc0104e6a>] clear_extent_bit+0x2a/0x30 [btrfs]
>> [ 51.088708] [<ffffffffc00e96d0>] btrfs_endio_direct_read+0x70/0xf0 [btrfs]
>> [ 51.088711] [<ffffffff8139f1e7>] bio_endio+0xf7/0x140
>> [ 51.088718] [<ffffffffc00dbb9c>] end_workqueue_fn+0x3c/0x40 [btrfs]
>> [ 51.088728] [<ffffffffc01184d7>] normal_work_helper+0xc7/0x310 [btrfs]
>> [ 51.088737] [<ffffffffc01187f2>] btrfs_endio_helper+0x12/0x20 [btrfs]
>> [ 51.088740] [<ffffffff81097b77>] process_one_work+0x157/0x420
>> [ 51.088742] [<ffffffff810985ab>] worker_thread+0x12b/0x4d0
>> [ 51.088744] [<ffffffff817171a8>] ? __schedule+0x368/0x950
>> [ 51.088746] [<ffffffff81098480>] ? rescuer_thread+0x380/0x380
>> [ 51.088748] [<ffffffff8109de84>] kthread+0xd4/0xf0
>> [ 51.088750] [<ffffffff8171b7a2>] ret_from_fork+0x22/0x40
>> [ 51.088752] [<ffffffff8109ddb0>] ? kthread_park+0x60/0x60
>>
>>
>> Not mention wait_for_completion() in both __btrfs_correct_data_nocsum()
>> and __btrfs_subio_endio_read(), which are called by btrfs_subio_endio_read()
>> in btrfs_endio_direct_read().
>
> btrfs is calling bio_endio on bio that it created. So, bio_endio in btrfs
> can be replaced with:
> if (bio->bi_end_io == btrfs_endio_direct_read)
> btrfs_endio_direct_read(bio);
> else
> bio_endio(bio);
>
> ... or just maybe just with this:
> bio->bi_end_io(bio);

It is really ugly and should be avoided most of times.

>
> This could be fixed easily.

Suppose it might be done in this way, and there may be other .bi_end_io
which can sleep too, you need to make sure there isn't any such usage
first before running it with preempt disabled.

Anyway looks you worry about the seldom-run failure path, which isn't
addressed by this patch. And you can optimize that in future and that isn't
contradictory with this patch.

Thanks,
Ming

>
> Mikulas
>
>> Thanks,
>> Ming
>>
>> >
>> >> Thanks,
>> >> Ming
>> >
>> > Mikulas
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-block" in
>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Next message: George Spelvin: "Re: random(4) changes"
Previous message: Linus Walleij: "Re: [PATCH] mfd: ab8500-debugfs: trivial fix of spelling mistake on "between""
In reply to: Mikulas Patocka: "Re: [PATCH v3 3/3] block: avoid to call .bi_end_io() recursively"
Next in thread: Mikulas Patocka: "Re: [PATCH v3 3/3] block: avoid to call .bi_end_io() recursively"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]