Re: [patch 04/22] fix deadlock in throttle_vm_writeout
From: Miklos Szeredi
Date: Thu Mar 01 2007 - 02:49:26 EST
> > From: Miklos Szeredi <mszeredi@xxxxxxx>
> >
> > This deadlock is similar to the one in balance_dirty_pages, but
> > instead of waiting in balance_dirty_pages after submitting a write
> > request, it happens during a memory allocation for filesystem B before
> > submitting a write request.
> >
> > It is easy to reproduce on a machine with not too much memory.
> > E.g. try this on 2.6.21-rc1 UML with 32MB (works on physical hw as
> > well):
> >
> > dd if=/dev/zero of=/tmp/tmp.img bs=1048576 count=40
> > mke2fs -j -F /tmp/tmp.img
> > mkdir /tmp/img
> > mount -oloop /tmp/tmp.img /tmp/img
> > bash-shared-mapping /tmp/img/foo 30000000
> >
> > The deadlock doesn't happen immediately, sometimes only after a few
> > minutes.
> >
> > Simplified stack trace for bash-shared-mapping after the deadlock:
> >
> > io_schedule_timeout
> > congestion_wait
> > balance_dirty_pages
> > balance_dirty_pages_ratelimited_nr
> > generic_file_buffered_write
> > __generic_file_aio_write_nolock
> > generic_file_aio_write
> > ext3_file_write
> > do_sync_write
> > vfs_write
> > sys_pwrite64
> >
> > and for [loop0]:
> >
> > io_schedule_timeout
> > congestion_wait
> > throttle_vm_writeout
> > shrink_zone
> > shrink_zones
> > try_to_free_pages
> > __alloc_pages
> > find_or_create_page
> > do_lo_send_aops
> > lo_send
> > do_bio_filebacked
> > loop_thread
> >
> > The requirement for the deadlock is that
> >
> > nr_writeback > dirty_thresh * 1.1 + margin
> >
> > Again margin seems to be in the 100 page range.
> >
> > The task of throttle_vm_writeout is to limit the rate at which
> > under-writeback pages are created due to swapping. There's no other
> > way direct reclaim can increase the nr_writeback + nr_file_dirty.
> >
> > So when there are few or no under-swap pages, it is safe for this
> > function to return. This ensures, that there's progress with writing
> > back dirty pages.
> >
>
> Would this also be solved by the below just-submitted bugfix? I guess not.
>
> I think the basic problem here is that the loop thread is reponsible for cleaning
> memory, but in throttle_vm_writeout(), the loop thread can get stuck waiting
> for some other thread to clean memory, but that ain't going to happen.
>
> throttle_vm_writeout() wasn't very well thought through, I suspect.
>
>
> I suspect we can just delete throttle_vm_writeout() now. The original
> rationale was:
>
> [PATCH] vm: pageout throttling
>
> With silly pageout testcases it is possible to place huge amounts of memory
> under I/O. With a large request queue (CFQ uses 8192 requests) it is
> possible to place _all_ memory under I/O at the same time.
>
> This means that all memory is pinned and unreclaimable and the VM gets
> upset and goes oom.
>
> The patch limits the amount of memory which is under pageout writeout to be
> a little more than the amount of memory at which balance_dirty_pages()
> callers will synchronously throttle.
>
> This means that heavy pageout activity can starve heavy writeback activity
> completely, but heavy writeback activity will not cause starvation of
> pageout. Because we don't want a simple `dd' to be causing excessive
> latencies in page reclaim.
>
> but now that we limit the amount of dirty MAP_SHARED memory, and given that
> the various pieces of the dirty-memory limiting puzzle also take the number
> of under-writeback pages into account, we should no longer be able to get
> in the situation where the total number of writeback+dirty pages exceeds
> dirty_ratio.
Only with swap, which can generate a lot of writeback pages, without
limiting. Does this matter? I guess that depends on the queue
length. If a lot of swap requests are using up memory, than that's
bad.
> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>
> throttle_vm_writeout() is designed to wait for the dirty levels to subside.
> But if the caller holds IO or FS locks, we might be holding up that writeout.
>
> So change it to take a single nap to give other devices a chance to clean some
> memory, then return.
>
Why does it nap unconditionally for GFP_NOFS/NOIO? That seems to be a
waste of time.
Btw, I did try not calling throttle_vm_writeout() for GFP_NOFS/NOIO
and IIRC it didn't help. I'll check again.
Thanks,
Miklos
> Cc: Nick Piggin <nickpiggin@xxxxxxxxxxxx>
> Cc: OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>
> Cc: Kumar Gala <galak@xxxxxxxxxxxxxxxxxxx>
> Cc: Pete Zaitcev <zaitcev@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> ---
>
> include/linux/writeback.h | 2 +-
> mm/page-writeback.c | 13 +++++++++++--
> mm/vmscan.c | 2 +-
> 3 files changed, 13 insertions(+), 4 deletions(-)
>
> diff -puN include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations include/linux/writeback.h
> --- a/include/linux/writeback.h~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
> +++ a/include/linux/writeback.h
> @@ -84,7 +84,7 @@ static inline void wait_on_inode(struct
> int wakeup_pdflush(long nr_pages);
> void laptop_io_completion(void);
> void laptop_sync_completion(void);
> -void throttle_vm_writeout(void);
> +void throttle_vm_writeout(gfp_t gfp_mask);
>
> /* These are exported to sysctl. */
> extern int dirty_background_ratio;
> diff -puN mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/page-writeback.c
> --- a/mm/page-writeback.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
> +++ a/mm/page-writeback.c
> @@ -296,11 +296,21 @@ void balance_dirty_pages_ratelimited_nr(
> }
> EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>
> -void throttle_vm_writeout(void)
> +void throttle_vm_writeout(gfp_t gfp_mask)
> {
> long background_thresh;
> long dirty_thresh;
>
> + if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
> + /*
> + * The caller might hold locks which can prevent IO completion
> + * or progress in the filesystem. So we cannot just sit here
> + * waiting for IO to complete.
> + */
> + congestion_wait(WRITE, HZ/10);
> + return;
> + }
> +
> for ( ; ; ) {
> get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
>
> @@ -317,7 +327,6 @@ void throttle_vm_writeout(void)
> }
> }
>
> -
> /*
> * writeback at least _min_pages, and keep writing until the amount of dirty
> * memory is less than the background threshold, or until we're all clean.
> diff -puN mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations mm/vmscan.c
> --- a/mm/vmscan.c~throttle_vm_writeout-dont-loop-on-gfp_nofs-and-gfp_noio-allocations
> +++ a/mm/vmscan.c
> @@ -952,7 +952,7 @@ static unsigned long shrink_zone(int pri
> }
> }
>
> - throttle_vm_writeout();
> + throttle_vm_writeout(sc->gfp_mask);
>
> atomic_dec(&zone->reclaim_in_progress);
> return nr_reclaimed;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/