[RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Nikolay Borisov
Date: Thu Oct 08 2015 - 11:32:16 EST


Currently, when bios are finished in ext4_finish_bio(), this is done by
first disabling local interrupts and then acquiring the BH_Uptodate_Lock
bit spinlock. However, those buffer heads might be under async write,
and as such the wait on the bit spinlock might leave the CPU spinning
with interrupts disabled for an arbitrary period of time. If in the
meantime there is demand for memory that cannot otherwise be satisfied,
the allocator might have to resort to draining the per-cpu page lists
(sketched after the trace below), like so:

PID: 31111 TASK: ffff881cbb2fb870 CPU: 2 COMMAND: "kworker/u96:0"
#0 [ffff881fffa46dc0] crash_nmi_callback at ffffffff8106f24e
#1 [ffff881fffa46de0] nmi_handle at ffffffff8104c152
#2 [ffff881fffa46e70] do_nmi at ffffffff8104c3b4
#3 [ffff881fffa46ef0] end_repeat_nmi at ffffffff81656e2e
[exception RIP: smp_call_function_many+577]
RIP: ffffffff810e7f81 RSP: ffff880d35b815c8 RFLAGS: 00000202
RAX: 0000000000000017 RBX: ffffffff81142690 RCX: 0000000000000017
RDX: ffff883fff375478 RSI: 0000000000000040 RDI: 0000000000000040
RBP: ffff880d35b81628 R8: ffff881fffa51ec8 R9: 0000000000000000
R10: 0000000000000000 R11: ffffffff812943f3 R12: 0000000000000000
R13: ffff881fffa51ec0 R14: ffff881fffa51ec8 R15: 0000000000011f00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff880d35b815c8] smp_call_function_many at ffffffff810e7f81
#5 [ffff880d35b81630] on_each_cpu_mask at ffffffff810e801c
#6 [ffff880d35b81660] drain_all_pages at ffffffff81140178
#7 [ffff880d35b81690] __alloc_pages_nodemask at ffffffff8114310b
#8 [ffff880d35b81810] alloc_pages_current at ffffffff81181c5e
#9 [ffff880d35b81860] new_slab at ffffffff81188305
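
The drain path seen above issues a synchronous IPI. A minimal sketch
(assuming the 3.12-era mm/page_alloc.c interfaces; the function and
mask names here are illustrative, not the exact kernel code):

/*
 * drain_all_pages() builds a mask of CPUs with populated per-cpu page
 * lists and asks each of them, via IPI, to drain those lists.
 */
static void drain_all_pages_sketch(const struct cpumask *cpus_with_pcps)
{
	/*
	 * wait == 1: do not return until drain_local_pages() has run
	 * on every CPU in the mask. A CPU spinning with interrupts
	 * disabled never services the IPI, so the call never returns.
	 */
	on_each_cpu_mask(cpus_with_pcps, drain_local_pages, NULL, 1);
}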

However, this call will never return, since on_each_cpu_mask() is
invoked with its last argument set to 1, i.e. it waits until the IPI
handler has run on every CPU in the mask. Additionally, if there is
another thread on which ext4_finish_bio() depends in order to
complete, e.g.:

PID: 34220 TASK: ffff883937660810 CPU: 44 COMMAND: "kworker/u98:39"
#0 [ffff88209d5b10b8] __schedule at ffffffff81653d5a
#1 [ffff88209d5b1150] schedule at ffffffff816542f9
#2 [ffff88209d5b1160] schedule_preempt_disabled at ffffffff81654686
#3 [ffff88209d5b1180] __mutex_lock_slowpath at ffffffff816521eb
#4 [ffff88209d5b1200] mutex_lock at ffffffff816522d1
#5 [ffff88209d5b1220] new_read at ffffffffa0152a7e [dm_bufio]
#6 [ffff88209d5b1280] dm_bufio_get at ffffffffa0152ba6 [dm_bufio]
#7 [ffff88209d5b1290] dm_bm_read_try_lock at ffffffffa015c878 [dm_persistent_data]
#8 [ffff88209d5b12e0] dm_tm_read_lock at ffffffffa015f7ad [dm_persistent_data]
#9 [ffff88209d5b12f0] bn_read_lock at ffffffffa016281b [dm_persistent_data]

And if this second thread in turn depends on the original allocation
succeeding, a hard lockup occurs: ext4_finish_bio() waits for
block_write_full_page() to complete, which in turn depends on the
original memory allocation succeeding, which in turn depends on the IPI
being executed on each core. For completeness, here is how the call
stack of the hung ext4_finish_bio() looks:

[427160.405277] NMI backtrace for cpu 23
[427160.405279] CPU: 23 PID: 4611 Comm: kworker/u98:7 Tainted: G W 3.12.47-clouder1 #1
[427160.405281] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
[427160.405285] Workqueue: writeback bdi_writeback_workfn (flush-252:148)
[427160.405286] task: ffff8825aa819830 ti: ffff882b19180000 task.ti: ffff882b19180000
[427160.405290] RIP: 0010:[<ffffffff8125be13>] [<ffffffff8125be13>] ext4_finish_bio+0x273/0x2a0
[427160.405291] RSP: 0000:ffff883fff3639b0 EFLAGS: 00000002
[427160.405292] RAX: ffff882b19180000 RBX: ffff883f67480a80 RCX: 0000000000000110
[427160.405292] RDX: ffff882b19180000 RSI: 0000000000000000 RDI: ffff883f67480a80
[427160.405293] RBP: ffff883fff363a70 R08: 0000000000014b80 R09: ffff881fff454f00
[427160.405294] R10: ffffea00473214c0 R11: ffffffff8113bfd7 R12: ffff880826272138
[427160.405295] R13: 0000000000000000 R14: 0000000000000000 R15: ffffea00aeaea400
[427160.405296] FS: 0000000000000000(0000) GS:ffff883fff360000(0000) knlGS:0000000000000000
[427160.405296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[427160.405297] CR2: 0000003c5b009c24 CR3: 0000000001c0b000 CR4: 00000000001407e0
[427160.405297] Stack:
[427160.405305] 0000000000000000 ffffffff8203f230 ffff883fff363a00 ffff882b19180000
[427160.405312] ffff882b19180000 ffff882b19180000 00000400018e0af8 ffff882b19180000
[427160.405319] ffff883f67480a80 0000000000000000 0000000000000202 00000000d219e720
[427160.405320] Call Trace:
[427160.405324] <IRQ>
[427160.405327] [<ffffffff8125c2c8>] ext4_end_bio+0xc8/0x120
[427160.405335] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405341] [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405345] [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405350] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405353] [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405358] [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405362] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405365] [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405369] [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405373] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405380] [<ffffffff812fad2b>] blk_update_request+0x21b/0x450
[427160.405385] [<ffffffff812faf87>] blk_update_bidi_request+0x27/0xb0
[427160.405389] [<ffffffff812fcc7f>] blk_end_bidi_request+0x2f/0x80
[427160.405392] [<ffffffff812fcd20>] blk_end_request+0x10/0x20
[427160.405400] [<ffffffff813fdc1c>] scsi_io_completion+0xbc/0x620
[427160.405404] [<ffffffff813f57f9>] scsi_finish_command+0xc9/0x130
[427160.405408] [<ffffffff813fe2e7>] scsi_softirq_done+0x147/0x170
[427160.405413] [<ffffffff813035ad>] blk_done_softirq+0x7d/0x90
[427160.405418] [<ffffffff8108ed87>] __do_softirq+0x137/0x2e0
[427160.405422] [<ffffffff81658a0c>] call_softirq+0x1c/0x30
[427160.405427] [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
[427160.405428] [<ffffffff8108e925>] irq_exit+0x95/0xa0
[427160.405431] [<ffffffff8106f755>] smp_call_function_single_interrupt+0x35/0x40
[427160.405434] [<ffffffff8165826f>] call_function_single_interrupt+0x6f/0x80
[427160.405436] <EOI>
[427160.405438] [<ffffffff813276e6>] ? memcpy+0x6/0x110
[427160.405440] [<ffffffff811dc6d6>] ? __bio_clone+0x26/0x70
[427160.405442] [<ffffffff81546db9>] __clone_and_map_data_bio+0x139/0x160
[427160.405445] [<ffffffff815471cd>] __split_and_process_bio+0x3ed/0x490
[427160.405447] [<ffffffff815473a6>] dm_request+0x136/0x1e0
[427160.405449] [<ffffffff812fbe0a>] generic_make_request+0xca/0x100
[427160.405451] [<ffffffff812fbeb9>] submit_bio+0x79/0x160
[427160.405453] [<ffffffff81144c3d>] ? account_page_writeback+0x2d/0x40
[427160.405455] [<ffffffff81144dbd>] ? __test_set_page_writeback+0x16d/0x1f0
[427160.405457] [<ffffffff8125b7a9>] ext4_io_submit+0x29/0x50
[427160.405459] [<ffffffff8125b8fb>] ext4_bio_write_page+0x12b/0x2f0
[427160.405461] [<ffffffff81252fe8>] mpage_submit_page+0x68/0x90
[427160.405463] [<ffffffff81253100>] mpage_process_page_bufs+0xf0/0x110
[427160.405465] [<ffffffff81254a80>] mpage_prepare_extent_to_map+0x210/0x310
[427160.405468] [<ffffffff8125a911>] ? ext4_writepages+0x361/0xc60
[427160.405472] [<ffffffff81283c09>] ? __ext4_journal_start_sb+0x79/0x110
[427160.405474] [<ffffffff8125a948>] ext4_writepages+0x398/0xc60
[427160.405477] [<ffffffff812fd358>] ? blk_finish_plug+0x18/0x50
[427160.405479] [<ffffffff81146b40>] do_writepages+0x20/0x40
[427160.405483] [<ffffffff811cec79>] __writeback_single_inode+0x49/0x2b0
[427160.405487] [<ffffffff810aeeef>] ? wake_up_bit+0x2f/0x40
[427160.405488] [<ffffffff811cfdee>] writeback_sb_inodes+0x2de/0x540
[427160.405492] [<ffffffff811a6e65>] ? put_super+0x25/0x50
[427160.405494] [<ffffffff811d00ee>] __writeback_inodes_wb+0x9e/0xd0
[427160.405495] [<ffffffff811d035b>] wb_writeback+0x23b/0x340
[427160.405497] [<ffffffff811d04f9>] wb_do_writeback+0x99/0x230
[427160.405500] [<ffffffff810a40f1>] ? set_worker_desc+0x81/0x90
[427160.405503] [<ffffffff810c7a6a>] ? dequeue_task_fair+0x36a/0x4c0
[427160.405505] [<ffffffff811d0bf8>] bdi_writeback_workfn+0x88/0x260
[427160.405509] [<ffffffff810bbb3e>] ? finish_task_switch+0x4e/0xe0
[427160.405511] [<ffffffff81653dac>] ? __schedule+0x2dc/0x760
[427160.405514] [<ffffffff810a61e5>] process_one_work+0x195/0x550
[427160.405517] [<ffffffff810a848a>] worker_thread+0x13a/0x430
[427160.405519] [<ffffffff810a8350>] ? manage_workers+0x2c0/0x2c0
[427160.405521] [<ffffffff810ae48e>] kthread+0xce/0xe0
[427160.405523] [<ffffffff810ae3c0>] ? kthread_freezable_should_stop+0x80/0x80
[427160.405525] [<ffffffff816571c8>] ret_from_fork+0x58/0x90
[427160.405527] [<ffffffff810ae3c0>] ? kthread_freezable_should_stop+0x80/0x80
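
The other half of the deadlock is the pre-patch ordering in
ext4_finish_bio() (simplified excerpt of the code changed by the diff
below): interrupts are disabled before spinning on the buffer-head bit
lock, so the page-draining IPI cannot be serviced while the lock is
contended:

	local_irq_save(flags);
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
	/* ... walk and complete the buffer heads of the page ... */
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
	local_irq_restore(flags);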

To fix the situation, this patch changes the order in which the bit
spinlock is taken and local interrupts are disabled. The expected
effect is that even if a core is spinning on the bit lock it will have
its interrupts enabled, and will thus be able to respond to IPIs. This
eventually allows memory allocations that require draining of the
per-cpu pages to succeed.

Signed-off-by: Nikolay Borisov <kernel@xxxxxxxx>
---
fs/ext4/page-io.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 84ba4d2..095331b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -96,8 +96,8 @@ static void ext4_finish_bio(struct bio *bio)
* We check all buffers in the page under BH_Uptodate_Lock
* to avoid races with other end io clearing async_write flags
*/
- local_irq_save(flags);
bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
+ local_irq_save(flags);
do {
if (bh_offset(bh) < bio_start ||
bh_offset(bh) + bh->b_size > bio_end) {
@@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
if (bio->bi_error)
buffer_io_error(bh);
} while ((bh = bh->b_this_page) != head);
- bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
local_irq_restore(flags);
+ bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
if (!under_io) {
#ifdef CONFIG_EXT4_FS_ENCRYPTION
if (ctx)
--
2.5.0
