Re: [RFC PATCH] mm: silence soft lockups from unlock_page

From: Hugh Dickins
Date: Thu Jul 23 2020 - 23:46:18 EST


On Thu, Jul 23, 2020 at 5:47 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Jul 23, 2020 at 5:07 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > I say that for full disclosure, so you don't wrack your brains
> > too much, when it may still turn out to be a screwup on my part.
>
> Sounds unlikely.
>
> If that patch applied even reasonably closely, I don't see how you'd
> see a list corruption that wasn't due to the patch.
>
> You'd have had to use the wrong spinlock by mistake due to munging it,
> or something crazy like that.
>
> The main list-handling change is
>
> (a) open-coding of that finish_wait()
>
> (b) slightly different heuristics for removal in the wakeup function
>
> where (a) was because my original version of finishing the wait needed
> to do that return code checking.
>
> So a normal "finish_wait()" just does
>
> list_del_init(&wait->entry);
>
> where-as my open-coded one replaced that with
>
> if (!list_empty(&wait->entry)) {
> list_del(&wait->entry);
> ret = -EINTR;
> }
>
> and apart from that "set return to -EINTR because nobody woke us up",
> it also uses just a regular "list_del()" rather than a
> "list_del_init()". That causes the next/prev field to be poisoned
> rather than re-initialized. But that second change was because the
> list entry is on the stack, and we're not touching it any more and are
> about to return, so I removed the "init" part.
>
> Anyway, (a) really looks harmless. Unless the poisoning now triggers
> some racy debug test that had been hidden by the "init". Hmm.
>
> In contrast, (b) means that the likely access patterns of irqs
> removing the wait entry from the list might be very different from
> before. The old "autoremove" semantics would only remove the entry
> from the list when try_to_wake_up() actually woke things up. Now, a
> successful bit state _always_ removes it, which was kind of the point.
> But it might cause very different list handling patterns.
>
> All the actual list handling looks "obviously safe" because it's
> protected by the spinlock, though...
>
> If you do get oopses with the new patch too, try to send me a copy,
> and maybe I'll stare at exactly where it happens register contents and
> go "aah".

This new version is doing much better: many hours to go, but all
machines have got beyond the danger point where yesterday's version
was crashing - phew!

Hugh