Using cliques of nodes to store desktop grid checkpoints



Checkpoints that store intermediate results of computation have a fundamental
impact on the computing throughput of Desktop Grid systems such as BOINC.
Currently, BOINC workers store their checkpoints locally. A major limitation
of this approach is that whenever a worker leaves unfinished computation, no
other worker can proceed from the last stable checkpoint. This forces tasks to be
restarted from scratch when the original machine is no longer available.

To overcome this limitation, we propose sharing checkpoints among nodes.
To organize this mechanism, we arrange nodes to form complete graphs (cliques),
where nodes share all the checkpoints they compute. Cliques function as
survivable units, where checkpoints and tasks are not lost as long as one of the
nodes of the clique remains alive. To simplify construction and maintenance of
the cliques, we take advantage of the central supervisor of BOINC. To evaluate
our solution, we combine simulation with some real data to answer the most
fundamental question: what do we need to pay for increased throughput?
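
The clique idea described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the names `Node`, `Clique`, `save_checkpoint`, and `recover` are hypothetical, checkpoints are kept in memory rather than transferred over a network, and the role of the central BOINC supervisor in building the cliques is left out. It only shows why a task survives as long as one clique member remains alive.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical worker that keeps checkpoints in memory."""
    name: str
    alive: bool = True
    # task_id -> (step, state) of the latest checkpoint this node holds
    checkpoints: dict = field(default_factory=dict)


@dataclass
class Clique:
    """A fully connected group: every member receives every checkpoint."""
    members: list

    def save_checkpoint(self, task_id, step, state):
        # Replicate the checkpoint to every member that is currently alive.
        for node in self.members:
            if node.alive:
                node.checkpoints[task_id] = (step, state)

    def recover(self, task_id):
        # Return the most recent checkpoint held by any surviving member,
        # or None if no alive member holds one (the task must restart).
        found = [
            node.checkpoints[task_id]
            for node in self.members
            if node.alive and task_id in node.checkpoints
        ]
        return max(found, key=lambda c: c[0]) if found else None


clique = Clique([Node("a"), Node("b"), Node("c")])
clique.save_checkpoint("task-1", step=10, state="partial result")
clique.members[0].alive = False   # the original worker leaves
clique.members[1].alive = False   # a second member leaves
resumed = clique.recover("task-1")  # still held by node "c"
```

In this sketch the cost of survivability is visible directly: every checkpoint is stored once per clique member, so storage and transfer grow with clique size, which is the trade-off the evaluation question refers to.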


Desktop Grid


CoreGRID Integration Workshop, April 2008


