Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch

From: Grant Likely
Date: Sun Nov 21 2010 - 18:20:59 EST


On Tue, Nov 16, 2010 at 10:29 PM, Anton Blanchard <anton@xxxxxxxxxxx> wrote:
> Hi Grant,
[...]
> There are two usage scenarios for C/R in this environment:
>
> 1. Resource management. Any large HPC cluster should be 100% busy and
> as such you will often fill in the gaps with low priority jobs which
> may need to be preempted. These low priority jobs need to give up their
> resources (memory, interconnect resources etc) whenever something
> important comes in.
>
> 2. Fault tolerance. Failures are a fact of life for any decent sized
> cluster. As the cluster gets larger these failures become very common.
> Speaking from an industry perspective, MTBF rates measured in the order
> of several hours for large commodity clusters are not surprising. We at
> IBM improve on that with hardware and system design, but there is only
> so much you can do. The failures also happen at the Linux kernel level
> so even if we had 100% reliable systems we would still have issues.
>
> Now this is the pointy end of HPC, but similar issues are happening in
> the meat of the HPC market. One area we are seeing a lot of C/R
> interest is the EDA space. As ICs become more and more complex the
> amount of cluster compute power it takes to route, check, create masks
> etc grows so large that system reliability becomes an issue. Some tool
> vendors write their own application C/R, but there are a multitude of
> in house applications that have no C/R capability today.

I agree, and I think this is exactly the place where the discussions
about c/r need to be focused (the pointy end). I don't tend to swoon
at the idea of c/r'ing my desktop session because it doesn't represent
a real or interesting problem for me. However, I do see the value in
the scenarios described above. I have another for you; I peripherally
worked on a telephone switch system that used a form of C/R for the
call processing task to synchronise with a hot-standby node for
uninterrupted cut-over in the event of failure. /my/ concerns are
more of the, "what is the impact on the kernel?" type.

> You could argue that we should just add C/R capability to every HPC
> application and library people care about or rework them to be
> fault tolerant in software. Unfortunately I don't see either as being
> viable. There are so many applications, libraries and even programming
> languages in use for HPC that it would be a losing battle. If we
> did go down this route we would also be unable to leverage C/R for
> anything else.

Fair enough, and I do somewhat agree with this. However the question
remains, what are the constraints? What are the limitations and
boundaries? Oden describes the constrains on the current c/r patches.
How well do those match up with the use cases discussed above? How
does DMTCP match up with those use cases?

> I can understand the concern around finding a general
> purpose case, but I do believe many other solid uses for C/R outside of
> HPC will emerge.For example, there was interest from the embedded guys
> during the KS discussion and I can easily imagine using C/R to bring up
> firefox faster on a TV.

Heh, sounds like doing the initial-program-load (IPL) stage like I
used to do on telephone switch firmware. :-)

>
> The problems found in HPC often turn into more general problems down
> the track. I think back to the heated discussions we had around SMP back
> in the early 2000s when we had 32 core POWER4s and SGI had similar sized
> machines. Now a 24 core machine fits in 1U and can be purchased for
> under $5k. NUMA support, CPU affinity and multi queue scheduling are
> other areas that initially had a very small user base but have since
> become important features for many users.
>
> Anton
>



--
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/