Re: [PATCH 0/4] Improve swap page error handling

From: Hugh Dickins
Date: Mon Jan 08 2007 - 14:31:58 EST

Next message: Alexey Starikovskiy: "Re: Regression in kernel linux-2.6.20-rc1/2: Problems with poweroff"
Previous message: Evgeniy Polyakov: "[take31 7/10] kevent: Signal notifications."
In reply to: Richard Purdie: "[PATCH 0/4] Improve swap page error handling"
Next in thread: Richard Purdie: "[PATCH 0/4] Improve swap page error handling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, 8 Jan 2007, Richard Purdie wrote:
> Improve the error handling when writes fail to a swap page.
>
> Currently, the kernel will repeatedly retry the write which is unlikely
> to ever succeed. Instead we allow the pages to be unused and then marked
> as bad at which prevents reuse. It should hopefully be suitable for
> testing in -mm.
>
> These patches are a based on a patch by Nick Piggin and some of my own
> patches/bugfixes as discussed on LKML.

No, not this way, I'm afraid. Sorry, I don't remember the prior
discussion on LKML, must have flooded past when my attention was
elsewhere.

Is it worth doing this at all? Probably, but I've no experience
whatsoever of swap write errors, so it's hard for me to judge: my
guess is that many cases would turn out to be software errors (e.g.
lower level needing more memory to perform the write). But you'd
be right to counter: let's assume they're hardware errors, and
then fix up any software errors when reported.

If it is worth doing this, then you'll need to add code to write
back the swap header, to note the bad pages permanently: you may
well have been waiting to see what reception the patches so far
get, before embarking on that.

1/4 seemed mostly reasonable to me at first, though I had a few
nits about the way you'd divided it between try_to_unuse_entry
and try_to_unuse (e.g. the complication of !start_mm_p belongs
to a later patch, and the mm_users check belongs in try_to_unuse).

I was uneasy with 2/4, wondered if swap_free(entry, page) would
be a better direction to go than your swap_free_markbad(entry).

I disliked 3/4: that initial_locked argument to try_to_unuse_entry,
which (I think) was only necessary because you'd put the lock_page
in try_to_unuse_entry instead of try_to_unuse in the first one.

I rather liked 4/4, the way you let the page rotate around and at
that point discover the failure and fix it up. But that reveals
both the problem with your approach and its solution.

The killer problem, doing it this way, is that unuse_mm does
down_read(&mm->mmap_sem) for every mm it's tried on, but it's very
possible that the process doing shrink_page_list holds an mmap_sem:
most likely only for reading, but even a recursive down_read can
be deadlocked by a waiting down_write. There may be more such
problems, I didn't bother to think further once I'd seen that.

But it hints at the solution too. You're calling your
try_to_unuse_page_entry shortly before the call to try_to_unmap,
and try_to_unmap manages to find all the pages without needing
any mmap_sems at all.

I guess your approach simply highlights my laziness: all the mm
stuff that you factor out in 1/4 dates from before rmap came in.
It still serves a vital purpose, when dealing with pages freshly
read in from swap, when there's no record of what mm or tmpfs file
they belong to; but once that's known, it much quicker to use an
rmap technique, looping through the vmas of the relevant anon_vma
(taking its lock), than scanning all the (swapped) mms in the system.

All I bothered to do, after objrmap came in, was force a shortcut in
unuse_vma once the anon_vma is known (that page_address_in_vma call):
it didn't seem worth rejigging the whole mm loop. But now you have
another use for unusing a swap entry, that is surely the way to go:
the mm loop is unnecessary, or becomes unnecessary, once we know
the anon_vma to which the swap page belongs. Which is already so
for your failed-write swap pages.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Alexey Starikovskiy: "Re: Regression in kernel linux-2.6.20-rc1/2: Problems with poweroff"
Previous message: Evgeniy Polyakov: "[take31 7/10] kevent: Signal notifications."
In reply to: Richard Purdie: "[PATCH 0/4] Improve swap page error handling"
Next in thread: Richard Purdie: "[PATCH 0/4] Improve swap page error handling"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]