Re: Back to the future.

From: Rafael J. Wysocki
Date: Sun Apr 29 2007 - 05:18:33 EST


On Sunday, 29 April 2007 10:23, Pavel Machek wrote:
> Hi!
>
> > > > The freezer has *caused* those deadlocks (eg by stopping threads that were
> > > > needed for the suspend writeouts to succeed!), not solved them.
> > >
> > > I can't remember anything like this, but I believe you have a specific test
> > > case in mind.
> >
> > Ehh.. Why do you think we _have_ that PF_NOFREEZE thing in the first place?
> >
> > Rafael, you really don't know what you're talking about, do you?
> >
> > Just _look_ at them. It's the IO threads etc that shouldn't be frozen,
> > exactly *because* they do IO. You claim that kernel threads shouldn't do
> > IO, but that's the point: if you cannot do IO when snapshotting to disk,
> > here's a damn big clue for you: how do you think that snapshot is going to
> > get written?
> >
> > I *guarantee* you that we've had a lot more problems with threads that
> > should *not* have been frozen than with those hypothetical threads that
> > you think should have been frozen.
>
> Well, we had nasty corruption on XFS, caused by a thread that was not
> frozen and should have been. (The other case leads "only" to deadlocks,
> so it is easier to debug.)
>
> The locking point.. when I added freezing to swsusp, I knew very
> little about kernel locking, so I "simply" decided to avoid the
> problem altogether... using the freezer.
>
> You may be right that locks are not a big problem for the hibernation
> after all; I just do not know.

Still, I think, if a kernel thread is a part of a device driver, then _in_
_principle_ it needs _some_ synchronization with the driver's suspend/freeze
and resume/thaw callbacks. For example, it's reasonable to assume that the
thread should be quiet between suspend/freeze and resume/thaw.

With the freezing of kernel threads we provide a simple means of such
synchronization: call try_to_freeze() at a suitable place in your kernel thread
and you're done. [Well, there should be a second part for making the thread
die if the thaw callback doesn't find the device, but that's in the works.]
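
To make the handshake concrete, here is a minimal userspace sketch using
pthreads. It is an analogy only, not kernel code: freeze_sim(), thaw_sim(),
try_to_freeze_sim(), worker_sim() and run_demo() are hypothetical names made
up for this illustration, standing in for the suspend/freeze side, the
resume/thaw side, try_to_freeze() and the driver's thread. It assumes the
thread has one safe point per loop iteration where no work is in flight.

```c
/* Userspace analogy only; NOT kernel code. All *_sim names are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct freezer_sim {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool freeze_requested;  /* set by the "suspend" side */
    bool frozen;            /* set by the worker once it is parked */
    bool stop;              /* asks the worker to exit */
    long work_done;         /* units of pretend device work */
};

static void sleep_us(long us)
{
    struct timespec ts = { us / 1000000, (us % 1000000) * 1000 };
    nanosleep(&ts, NULL);
}

/* Stand-in for try_to_freeze(): park here while a freeze is requested. */
static void try_to_freeze_sim(struct freezer_sim *f)
{
    pthread_mutex_lock(&f->lock);
    while (f->freeze_requested && !f->stop) {
        f->frozen = true;
        pthread_cond_broadcast(&f->cond);       /* tell the freezing side */
        pthread_cond_wait(&f->cond, &f->lock);  /* sleep until thawed */
    }
    f->frozen = false;
    pthread_mutex_unlock(&f->lock);
}

static void *worker_sim(void *arg)
{
    struct freezer_sim *f = arg;
    bool stop = false;

    while (!stop) {
        try_to_freeze_sim(f);       /* safe point: no work in flight here */
        pthread_mutex_lock(&f->lock);
        stop = f->stop;
        if (!stop)
            f->work_done++;         /* one unit of pretend device work */
        pthread_mutex_unlock(&f->lock);
        sleep_us(100);              /* pause between work items */
    }
    return NULL;
}

/* "Suspend" side: request a freeze and wait until the worker is parked. */
static void freeze_sim(struct freezer_sim *f)
{
    pthread_mutex_lock(&f->lock);
    f->freeze_requested = true;
    while (!f->frozen)
        pthread_cond_wait(&f->cond, &f->lock);
    pthread_mutex_unlock(&f->lock);
}

/* "Resume" side: release the worker, optionally asking it to exit. */
static void thaw_sim(struct freezer_sim *f, bool stop)
{
    pthread_mutex_lock(&f->lock);
    f->freeze_requested = false;
    f->stop = stop;
    pthread_cond_broadcast(&f->cond);
    pthread_mutex_unlock(&f->lock);
}

static long read_work(struct freezer_sim *f)
{
    pthread_mutex_lock(&f->lock);
    long n = f->work_done;
    pthread_mutex_unlock(&f->lock);
    return n;
}

/* Returns 0 if the worker stayed quiet between freeze and thaw. */
int run_demo(void)
{
    struct freezer_sim f = { .freeze_requested = false };
    pthread_t t;
    long before, after;

    pthread_mutex_init(&f.lock, NULL);
    pthread_cond_init(&f.cond, NULL);
    if (pthread_create(&t, NULL, worker_sim, &f) != 0)
        return -1;

    sleep_us(2000);             /* let the worker run for a bit */
    freeze_sim(&f);             /* returns only once the worker is parked */
    before = read_work(&f);
    sleep_us(20000);            /* the worker must make no progress here */
    after = read_work(&f);
    thaw_sim(&f, true);         /* thaw, then ask the worker to exit */

    pthread_join(t, NULL);
    pthread_cond_destroy(&f.cond);
    pthread_mutex_destroy(&f.lock);
    return before == after ? 0 : 1;
}
```

Calling run_demo() from main() should return 0. The point of the sketch is
that the thread only ever stops at the place it chose itself, so the freezing
side never has to guess what the thread was in the middle of.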

Without it, there may be race conditions that we are not even aware of and that
may trigger in, say, 1 in 10 suspends or so. I wish you good luck with
debugging such things.

Greetings,
Rafael