Global Checkpointing for Distributed Programs



This paper presents a novel algorithm for checkpointing and rollback recovery in distributed systems. Processes belonging to the same program must periodically take a non-blocking coordinated global checkpoint, but only minimal overhead is imposed during normal computation. Messages can be delivered out of order, and the processes are not required to be deterministic. The non-blocking structure is an important characteristic, since it avoids placing a heavy burden on the application programs.
Our proposal also includes a damage assessment phase, unlike previous schemes that either assume that an error is detected immediately after it occurs (fail-stop) or simply ignore the damage caused by imperfect detection mechanisms. We present a possible way to evaluate the error detection latency, which enables us to assess the damage done and avoid the propagation of errors.




IEEE 11th Symposium on Reliable Distributed Systems SRDS-11, October 1992

Cited by

Year 2003 : 3 citations

 Hai Jin, Kai Hwang, "Distributed Checkpointing on Clusters with Dynamic Striping and Staggering", Lecture Notes in Computer Science, Springer-Verlag Heidelberg, Volume 2550 / 2002, July 2003

 G Cao, M Singhal, "Checkpointing with mutable checkpoints", in Theoretical Computer Science 290 (2003) 1127-1148

 Weigang Ni, S. V. Vrbsky, S. Ray, "Low-cost coordinated non-blocking checkpointing in mobile computing systems", Proceedings of the Eighth IEEE International Symposium on Computers and Communication (ISCC 2003), June 2003

Year 2001 : 5 citations

 Guohong Cao, Mukesh Singhal "Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems" IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 2, FEBRUARY 2001.

 Osada S, Higaki H "QoS-based checkpoint protocol for multimedia network systems" ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2001, PROCEEDINGS LECTURE NOTES IN COMPUTER SCIENCE, 2195: 574-581 2001

 Yoshinori Morita, Hiroaki Higaki, "Hybrid Checkpoint Protocol for Supporting Mobile-to-Mobile Communication", The 15th International Conference on Information Networking (ICOIN'01), January 2001, Beppu City, Oita, Japan

 Kengo Hiraga, Hiroaki Higaki, "Consistent Global Checkpoints in Multimedia Network Systems", The 15th International Conference on Information Networking (ICOIN'01), January 2001, Beppu City, Oita, Japan

 Yoshinori Morita, Hiroaki Higaki, "Checkpoint-Recovery for Mobile Computing Systems", 21st International Conference on Distributed Computing Systems Workshops (ICDCSW '01), April 2001, Mesa, Arizona

Year 2000 : 5 citations

 E. Gendelman, L. F. Bic, M. Dillencourt. "An Application-Transparent, Platform-Independent Approach to Rollback-Recovery for Mobile Agent Systems" 20th IEEE International Conference on Distributed Computing Systems. Taipei, Taiwan 2000.

 Kalaiselvi S, Rajaraman V "A survey of checkpointing algorithms for parallel and distributed computers" SADHANA-ACADEMY PROCEEDINGS IN ENGINEERING SCIENCES 25: 489-510 Part 5 OCT 2000

 JW Lin, SY Kuo, "Resolving error propagation in distributed systems", Information Processing Letters 74 (2000) 257-262.

 T Osman, A Bargiela, "FADI: A fault tolerant environment for open distributed computing", Software Journal, IEE Proceedings, June 2000, Volume 147, Issue 3

 Kuo-Feng Ssu, "HETEROGENEOUS AND MOBILE RECOVERY". PhD Thesis, University of Illinois at Urbana-Champaign, 2000

Year 1999 : 5 citations

 D. Manivannan, Mukesh Singhal "Quasi-Synchronous Checkpointing: Models, Characterization, and Classification" IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, Vol. 10, No. 7; JULY 1999, pp. 703-713.

 B. Yao, K. F. Ssu, and W. K. Fuchs, "Message Logging in Mobile Computing," Proceedings of IEEE Fault-Tolerant Computing Symposium, pp. 294-301, June 1999.

 Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson "A Survey of Rollback-Recovery Protocols in Message Passing Systems", CMU Technical Report CMU-CS-99-148, June 1999, Carnegie Mellon University, USA.

 E. Gendelman, L. F. Bic, M. Dillencourt. "An Efficient Checkpointing Algorithm for Distributed Systems Implementing Reliable Communication Channels." 18th Symposium on Reliable Distributed Systems, Lausanne, Switzerland 1999, pp 290-291.

 Lorenzo Alvisi, Sriram Rao, Syed Amir Husain, Asanka de Mel, Elmootazbellah Elnozahy, "An Analysis of Communication-Induced Checkpointing", Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, June 1999, Madison, Wisconsin

Year 1998 : 5 citations

 N. Neves and W. K. Fuchs, "Coordinated Checkpointing without Direct Coordination," Proceedings of IEEE International Computer Performance & Dependability Symposium, pp. 23-31, Sept. 1998.

 Guohong Cao, Mukesh Singhal "On Coordinated Checkpointing in Distributed Systems" IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 12, December 1998, pp. 1213-1225.

 B. Ramamurthy, S.J.Upadhyaya "Hardware Assisted Fast Recovery in Distributed Systems" Dependable Computing for Critical Applications 5, R.K.Iyer, M.Morganti, W.K.Fuchs, V.Gligor (Eds), IEEE Computer Society Press, 1998, ISBN 0-8186-7803-8, pp. 224-241.

 Denis Conan, Guy Bernard, "La reprise sur erreur par recouvrement arrière automatique dans les systèmes répartis", in: "Parallélisme et Répartition", Ed. J.-F. Myoupo, Hermès, pages 91-123, April 1998.

 Guohong Cao, Mukesh Singhal, "On the Impossibility of Min-Process Non-Blocking Checkpointing and An Efficient Checkpointing Algorithm for Mobile Computing Systems", 1998 International Conference on Parallel Processing, August 1998, Minneapolis, Minnesota

Year 1997 : 4 citations

 Adam Beguelin, Erik Seligman and Peter Stephan "Application Level Fault Tolerance in Heterogeneous Networks of Workstations", Journal of Parallel and Distributed Computing, vol. 43, no. 2, pp. 147-155, 1997

 G.G.Richard III, M. Singhal "Complete Process Recovery: Using Vector Time to Handle Multiple Failures in Distributed Systems" IEEE Concurrency, Vol. 5, No. 2, April-June 1997, pp. 50-59.

 Marc Zweiacker "The Persistent Object Group Service: An Approach to Fault Tolerance of Open Distributed Applications" Proc. 1997 IFIP WG 6.1 International Conference on Open Distributed Processing (ICODP'97), 27-30 May 1997, Toronto, Canada.

 N. Neves and W. K. Fuchs. Adaptive recovery for mobile environments. Communications of the ACM, 40(1):68-74, January 1997

Year 1996 : 7 citations

 E. N. Elnozahy, D.B. Johnson, Y.M.Wang "A Survey of Rollback-Recovery Protocols in Message-Passing Systems" Carnegie Mellon University, School of Computer Science, Report CMU-CS-96-181, October 1996.

 N.Neves, W.K.Fuchs. "Using Time to Improve the Performance of Coordinated Checkpointing", Proceedings of the International Performance and Dependability Symposium, IPDS96, Illinois, September 1996.

 A. Mostefaoui, Michel Raynal "Efficient Message Logging for Uncoordinated Checkpointing Protocols" Proc. Second European Dependable Computing Conference, Taormina, Italy, October 1996, Lecture Notes in Computer Science 1150, Springer Verlag, pp. 353-364, ISBN 3-540-61772-8.

 G. Muller, M. Banâtre, N. Peyrouze and B. Rochat. "Lessons from FTM: an experiment in design and implementation of a low-cost fault-tolerant system." In IEEE Transactions on Reliability, 45(2):332-340, Jun. 1996

 Knop P, Rego V, Sunderam V, "Fail-safe concurrency in the EcliPSe system", Concurrency-Practice and Experience, John Wiley & Sons Ltd, W Sussex, vol. 8, no. 4, MAY 1996, pp. 283-312.

 D. Manivannan and M. Singhal, "Quasi-synchronous checkpointing: Models, characterization, and classification", Ohio State University, Tech. Rep. OSU-CISRC-5/96TR33, 1996.

 J. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey, and P. Stephan "Dome: Parallel programming in a heterogeneous multi-user environment", Proceedings of the International Parallel Processing Symposium (IPPS), April 1996, Honolulu, Hawaii.

Year 1995 : 7 citations

 Ganesha Beedubail, Anish Karmarkar, Anil Gurijala, Willis Marti and Udo Pooch, "An Algorithm for Supporting Fault Tolerant Objects in Distributed Object Oriented Operating Systems", Technical Report (TR 95-019) April 1995, Department of Computer Science, Texas A&M University, College Station, TX - 77843.

 G. Cabillic, G.Muller, I. Puaut "The Performance of Consistent Checkpointing in Distributed Shared Memory Systems" IEEE 14th Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, 13-15 September 1995, IEEE Computer Society Press, ISBN 0-8186-7153-X.

 G.Deconinck, J.Vounckx, R.Lauwereins, J.A.Peperstraete. "A User-Triggered Checkpointing Library for Computation-Intensive Applications", Proc. 7th Int. Conf. on Parallel and Distributed Computing and Systems, Washington DC, pp. 321-324, Oct. 1995

 J. Ouyang and G. Heiser, "Checkpointing and Recovery for Distributed Shared Memory Applications", Proc. of the Fourth Int'l Workshop on Object Orientation in Operating Systems (IWOOOS'95), pp. 191-199, 1995.

 G. Beedubail et al., "An algorithm for supporting fault tolerant objects in distributed object oriented operating systems," In Proc. of International Workshop on Object-Orientation in Operating Systems, August 1995.

 José Nagib Cotrim Árabe, Adam Beguelin, Bruce Lowekamp, Erik Seligman, Michael Starkey, and Peter Stephan. Dome: Parallel programming in a heterogeneous multi-user environment. Technical Report CMU-CS-95-137, Carnegie Mellon University, April 1995.

 Y. Yong "Replay and Distributed Breakpoints in an OSF DCE Environment" M.Math Thesis, University of Waterloo, Ontario, Canada, 1995.

Year 1994 : 7 citations

 Seligman E., Beguelin A. High-Level Fault Tolerance in Distributed Programs. Technical Report CMU-CS-94-223, Carnegie-Mellon University, December 1994.

 E. N. Elnozahy, W. Zwaenepoel "On the Use and Implementation of Message Logging" Proceedings of the 24th Annual International Symposium on Fault-Tolerant Computing (FTCS-24), Austin, Texas, USA, 15-17 June 1994, pp. 298-307, IEEE Computer Society Press, ISBN 0-8186-5520-8.

 Gilles Muller, Mireille Hue, Nadine Peyrouze "Performance of Consistent Checkpointing in a Modular Operating System: Results of the FTM Experiment" in "Dependable Computing - EDCC-1" Klaus Echtle, Dieter Hammer, David Powell (Eds.) Lecture Notes in Computer Science 852, Springer Verlag, 1994, pp. 491-508. ISBN 3-540-58426-9.

 J.Leon, "An Application-Oriented Toolkit for Highly Available Distributed Scientific Computing", PhD Thesis Proposal, Department of Computer Science, Carnegie-Mellon University, 1994.

 K.S.Toi, C.J.Hou "Effective and Concurrent Checkpointing and Rollback Recovery in Distributed Systems" Tech. Report TR-ECE-94-05, Univ. of Wisconsin, Madison, March 1994.

 P.Krishna, N.H.Vaidya, D.K.Pradhan "Recovery in Multicomputers with Finite Error Detection Latency" Proceedings of the 23rd International Conference on Parallel Processing, August 1994

 E.Seligman, A.Beguelin "High-Level Fault-Tolerance in Distributed Programs" Tech Report Carnegie Mellon University, CMU-CS-94-223

Year 1993 : 5 citations

 David B. Johnson "Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs" Proceedings of the 12th Symposium on Reliable Distributed Systems, IEEE Computer Society, October 1993.

 P. Krishna, Nitin H. Vaidya, D.K.Pradhan "Independent Checkpointing and Recovery Scheme for Fail-Slow Processors" Technical Report 93-028, Dept. of Computer Science, Texas A&M University, College Station, USA.

 G. Deconinck, J. Vounckx, R. Lauwereins, and J. A. Peperstraete. Survey of backward error recovery techniques for multicomputers based on checkpointing and rollback. In Proc. IASTED Int. Conf. on Modelling and Simulation, pages 262-265, May 1993.

 Nitin H. Vaidya "Dynamic Cluster-Based Recovery: Pessimistic and Optimistic Schemes" Technical Report 93-027, Dept. of Computer Science, Texas A&M University, College Station, USA.

 G.Richard III, M.Singhal "Complete Process Recovery in Distributed Systems using Vector Time" Tech Report, Dept. Computer and Information Science, The Ohio State University. USA.