System-Level versus User-Defined Checkpointing



Checkpointing with rollback recovery is a very effective technique for tolerating transient faults and preventive shutdowns. In the past, most of the checkpointing schemes published in the literature were meant to be transparent to the application programmer and were implemented at the operating-system level. In recent years, there has been some work on higher-level forms of checkpointing. In this second approach, the user is responsible for checkpoint placement and is required to specify the checkpoint contents.
In this paper, we compare the two approaches: system-level and user-defined checkpointing. We discuss the pros and cons of both approaches and present an experimental study that was conducted on a commercial parallel machine.
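
To illustrate the user-defined approach described in the abstract, here is a minimal sketch in which the application itself chooses both where checkpoints are taken and what they contain. It is not the scheme evaluated in the paper: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `every` and `fail_at` parameters, the use of `pickle`, and the toy summation workload are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the user-selected state to disk, replacing any older checkpoint."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename only after the data is on disk, so a crash during the write
    # never leaves a partially written checkpoint under `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps (hypothetical workload)."""
    # Rollback recovery: resume from the last checkpoint if there is one.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    step, total = state["step"], state["total"]
    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated transient fault")
        total += step
        step += 1
        # Checkpoint placement is chosen by the programmer, and the
        # checkpoint contents are only the two variables needed to resume.
        if step % every == 0:
            save_checkpoint(path, {"step": step, "total": total})
    return total
```

Calling `run` a second time with the same `path` after a simulated fault resumes from the last saved step instead of restarting from zero. A system-level scheme would instead save the whole process image without any such calls in the application code.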




17th IEEE Symposium on Reliable Distributed Systems, SRDS'98, October 1998

Cited by

Year 2004 : 4 citations

 Sriram Krishnan and Dennis Gannon. Checkpoint and Restart for Distributed Components in XCAT3. In Grid 2004, 5th IEEE/ACM International Workshop on Grid Computing. IEEE Computer Society Press, November 2004.

 Pawel Czarnul, Arkadiusz Urbaniak, Marcin Fraczak, Maciej Dyczkowski, Bartlomiej Balcerek, "Towards Easy-to-Use Checkpointing of MPI Applications within CLUSTERIX", International Conference on Parallel Computing in Electrical Engineering, PARELEC'04, September 2004, Dresden, Germany

 Ashwin Raju Jeyakumar, "Metamori: A library for Incremental File Checkpointing", MSc Thesis, Faculty of the Virginia Polytechnic Institute and State University, Virginia Tech, Blacksburg, VA, June 3, 2004

 J. T. Rough, A. M. Goscinski, "The development of an efficient checkpointing facility exploiting operating systems services of the GENESIS cluster operating system", Future Generation Computer Systems, Special issue: Advanced services for clusters and internet computing, Volume 20, Issue 4 (May 2004)

Year 2003 : 1 citation

 Yuan Zijing, "Load Management System for Distributed Simulation", MSc Thesis, School of Computer Engineering, Nanyang Technological University, July 2003

Year 2000 : 3 citations

 E. Gendelman, L. F. Bic, M. Dillencourt. "An Application-Transparent, Platform Independent Approach to Rollback-Recovery for Mobile Agent Systems", 20th IEEE International Conference on Distributed Computing Systems. Taipei, Taiwan 2000.

 Kuo-Feng Ssu, "HETEROGENEOUS AND MOBILE RECOVERY", PhD Thesis, University of Illinois at Urbana-Champaign, 2000

 Xinfeng Ye, "A Checkpointing Scheme for Internet-based Computing Systems", Proceedings of International Conference on Software of 16th IFIP Congress, pp.539-546, 2000

Year 1999 : 2 citations

 K. F. Ssu, B. Yao and W. K. Fuchs. An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems, Lausanne, Switzerland, October 1999.

 Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson, "A Survey of Rollback-Recovery Protocols in Message Passing Systems", CMU Technical Report CMU-CS-99-148, June 1999, Carnegie Mellon University, USA.