Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Matt Helsley
Date: Sat Nov 06 2010 - 01:18:59 EST


On Sat, Nov 06, 2010 at 12:06:09AM -0400, Oren Laadan wrote:
> On 11/05/2010 09:16 PM, Matt Helsley wrote:
> > On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote:
> >> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
> >>>> Oren noted that sometimes it's important to stop the process only
> >>>> for a few milliseconds while one checkpoints. In DMTCP, we do that
> >>>> by configuring with --enable-forked-checkpointing. This causes us
> >>>> to fork a child process taking advantage of copy-on-write and then
> >>>> checkpoint the memory pages of the child while the parent continues
> >>>> to execute.
> >>>
> >>> Interesting ... but while the process is only stopped for the duration
> >>> of the fork, it may be taking COW faults on almost every page it
> >>> touches. I think this will not work well for large HPC applications
> >>> that allocate most of physical memory as anonymous pages for the
> >>> application. It may even result in an OOM kill if you don't complete
> >>> the checkpoint of the child and have it exit in a timely manner.
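
For readers following along, here is a minimal userspace sketch of the
forked-checkpoint idea discussed above. It is not DMTCP's code: the
function names, the pickle format, and the temporary file are assumptions
made purely for illustration. The parent is paused only for the fork();
the child then writes out its copy-on-write snapshot while the parent
keeps running.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    Hypothetical helper for illustration. The child sees a copy-on-write
    snapshot of the parent's memory as of the fork().
    """
    pid = os.fork()
    if pid == 0:
        # Child: serialize the snapshot, then exit immediately without
        # running any of the parent's cleanup handlers.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            os._exit(status)
    return pid


def demo():
    state = {"step": 1, "data": list(range(1000))}
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        pid = forked_checkpoint(state, path)
        # The parent continues and modifies its state. Every page it
        # writes to while the child is alive is copied on demand.
        state["step"] = 2
        _, wait_status = os.waitpid(pid, 0)
        with open(path, "rb") as f:
            saved = pickle.load(f)
    finally:
        os.unlink(path)
    # (value in the checkpoint, value in the parent, child exit code)
    return saved["step"], state["step"], os.WEXITSTATUS(wait_status)
```

The checkpoint holds the pre-fork value even though the parent has moved
on. The cost Tony points out is visible in the structure: until the child
exits, each page the parent dirties needs a private copy, so a
write-heavy parent with a large anonymous footprint can temporarily need
close to twice its memory.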

<snip>

> > The current linux-cr approach to handling [dirty] pages doesn't use COW.
> > The tasks are frozen using the cgroup freezer and thus unable to modify
> > the pages. So we don't have to mess with page tables nor do we pay
> > any extra overhead for page faults.
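
As a rough illustration of the freezing step only (not the linux-cr
code itself), the cgroup freezer is driven from userspace through the
group's freezer.state file. The helper names, the polling loop, and the
timeout values below are assumptions; the directory is passed in rather
than hard-coded because the mount point varies between systems.

```python
import os
import time


def set_freezer_state(cgroup_dir, state):
    """Write "FROZEN" or "THAWED" to the group's freezer.state file."""
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)


def read_freezer_state(cgroup_dir):
    """Return the current contents of freezer.state, stripped."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()


def freeze_and_wait(cgroup_dir, timeout=5.0, poll=0.01):
    """Request a freeze and poll until the group reports FROZEN.

    The kernel may report FREEZING while some tasks have not yet been
    frozen, so the state is re-read until it settles or the (assumed)
    timeout expires. Returns True on success, False on timeout.
    """
    set_freezer_state(cgroup_dir, "FROZEN")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if read_freezer_state(cgroup_dir) == "FROZEN":
            return True
        time.sleep(poll)
    return False
```

A caller would run freeze_and_wait() on the freezer cgroup directory
that contains the tasks, take the checkpoint while nothing can touch the
pages, and then call set_freezer_state(cgroup_dir, "THAWED").
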
>
> The current linux-cr patchset leaves out any optimizations
> for simplicity of reviewing - first get it working and reviewed.
> We experimented with optimizations in previous systems.
>
> > If we ever implement thawed checkpointing -- checkpointing while
> > the task isn't frozen -- then we'd probably use COW and see
> > the same faults. The difference then would be that in-kernel we
> > wouldn't have one extra task per mm being checkpointed.
>
> Thawed checkpointing can be done without any COW tax, by leveraging
> the native hardware dirty bit in page tables. There is no need to
> trigger additional checkpoints. Tracking modified pages using the

s/checkpoints/faults/

Cheers,
-Matt Helsley