Re: [PATCH] sched: Avoid that __wait_on_bit_lock() hangs

From: Oleg Nesterov
Date: Tue Aug 16 2016 - 09:09:22 EST


On 08/15, Bart Van Assche wrote:
>
> On 08/13/2016 09:32 AM, Oleg Nesterov wrote:
>> On 08/12, Bart Van Assche wrote:
>>> before I started testing. It took some time
>>> before I could reproduce the hang in truncate_inode_pages_range().
>>
>> All I can say is that this contradicts the previous testing results with
>> my previous patch or with your change in abort_exclusive_wait().
>
> Hello Oleg,
>
> My opinion is that all this means is that we do not yet have a full
> understanding of what is going on.

Sure.

> BTW, I have improved my page lock owner instrumentation patch such that
> it prints a call stack of the lock owner if lock_page() takes too long.
> The following call stack was reported:
>
> __lock_page / pid 8549 / m 0x2: timeout - continuing to wait for 8549
> [<ffffffff8102b316>] save_stack_trace+0x26/0x50
> [<ffffffff81152bee>] add_to_page_cache_lru+0x7e/0x170
> [<ffffffff8121bfc5>] mpage_readpages+0xc5/0x170
> [<ffffffff81215548>] blkdev_readpages+0x18/0x20
> [<ffffffff81163a68>] __do_page_cache_readahead+0x268/0x310
> [<ffffffff811640a8>] force_page_cache_readahead+0xa8/0x100
> [<ffffffff81164139>] page_cache_sync_readahead+0x39/0x40
> [<ffffffff81153967>] generic_file_read_iter+0x707/0x920
> [<ffffffff81215920>] blkdev_read_iter+0x30/0x40
> [<ffffffff811d4b4b>] __vfs_read+0xbb/0x130
> [<ffffffff811d4f31>] vfs_read+0x91/0x130
> [<ffffffff811d62b4>] SyS_read+0x44/0xa0
> [<ffffffff816281e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
>
> My understanding of mpage_readpages() is that the page unlock happens
> after readahead I/O completed (see also page_endio()). So this probably
> means that an I/O request submitted because of readahead code did not
> get completed. I will see whether I can find anything that's wrong in
> the block layer.

Perhaps. But this means another problem! Or you didn't wait long enough.
Or your previous testing was wrong.

Because, once again, your changes in abort_exclusive_wait(), and my
debugging patch which adds a wakeup into ClearPageLocked(), suggest that
the problem is NOT that the page is still locked.


I'd still like to know what happens with the last patch I sent (without
any other changes)... but now I am totally confused.

If only I could reproduce it. Or at least understand what you are doing to
hit this bug ;)

Oleg.