Checkpointing SPMD Applications on Transputer Networks





1994 Scalable High Performance Computing Conference (SHPCC 94), May 1994

Cited by

Year 1999 : 3 citations

 Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson "A Survey of Rollback-Recovery Protocols in Message Passing Systems", CMU Technical Report CMU-CS-99-148, June 1999, Carnegie Mellon University, USA.

 James S. Plank, Henri Casanova, Micah Beck and Jack Dongarra, "Deploying Fault Tolerance and Task Migration with NetSolve", Future Generation Computer Systems, Volume 15, 1999, pages 745 - 755. Elsevier.

 Hsu ST, Chang RC, "An implementation of using remote memory to checkpoint processes", Software-Practice & Experience, John Wiley & Sons Ltd, W Sussex, vol. 29, no. 11, Sep 1999, pp. 985-1004.

Year 1998 : 2 citations

 Plank JS, Casanova H, Beck M, Dongarra J, "Deploying fault-tolerance and task migration with NetSolve", Applied Parallel Computing, Lecture Notes in Computer Science, Springer-Verlag Berlin, 1541, 1998, pp. 418-432.

 James S. Plank, Kai Li, Michael Puening "Diskless Checkpointing" IEEE Trans. on Parallel and Distributed Systems, Vol. 9, No. 10, October 1998, pp. 972-986

Year 1997 : 3 citations

 Y.Chen, J.Plank, K.Li: "CLIP: A Checkpointing Tool for Message-Passing Parallel Programs", Proceedings of Supercomputing'97, 1997

 V.Naik, S.Midkiff, J.Moreira: "A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems", Proceedings of Supercomputing'97, 1997.

 Adam Beguelin, Erik Seligman and Peter Stephan "Application Level Fault Tolerance in Heterogeneous Networks of Workstations", Journal of Parallel and Distributed Computing, vol. 43, no. 2, pp. 147-155, 1997

Year 1996 : 5 citations

 J. S. Plank. Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques. In 15th Symposium on Reliable Distributed Systems, pages 76--85, October 1996.

 Tzi-cker Chiueh, Peitao Deng "Evaluation of Checkpoint Mechanisms for Massively Parallel Machines" Proc. Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS-26), June 25-26, 1996, Sendai, Japan, IEEE Computer Society, ISBN 0-8186-7261-7, pp. 370-371.

 J. Gehring, A. Reinefeld. "MARS - A Framework for Minimizing the Job Execution Time in a Metacomputing Environment" Future Generation Computer Systems, FGCS-12,1 (1996), Elsevier, pp. 87-99.

 J. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey, and P. Stephan "Dome: Parallel programming in a heterogeneous multi-user environment", Proceedings of the International Parallel Processing Symposium (IPPS), April 1996, Honolulu, Hawaii.

 Youngbae Kim, "Fault Tolerant Matrix Operations for Parallel and Distributed Systems", PhD Thesis, University of Tennessee, Knoxville, 1996

Year 1995 : 3 citations

 J.Plank, Y.Kim, J.J.Dongarra "Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations" Proceedings of the 25th International Symposium on Fault-Tolerant Computing (FTCS-25), California, USA, June 1995, pp. 351-360, IEEE Computer Society.

 José Nagib Cotrim Árabe, Adam Beguelin, Bruce Lowekamp, Erik Seligman, Michael Starkey, and Peter Stephan. "Dome: Parallel programming in a heterogeneous multi-user environment". Technical Report CMU-CS-95-137, Carnegie Mellon University, April 1995.

 J.Plank, M.Beck, G.Kingsley, K.Li "Libckpt: Transparent Checkpointing under Unix" Proceedings of the Usenix Winter 1995 Technical Conference, New Orleans, January 1995.

Year 1994 : 3 citations

 Seligman E., Beguelin A. High-Level Fault Tolerance in Distributed Programs. Technical Report CMU-CS-94-223, Carnegie Mellon University, December 1994.

 E.Seligman, A.Beguelin "High-Level Fault-Tolerance in Distributed Programs" Tech Report Carnegie Mellon University, CMU-CS-94-223.

 M.Beck, J.Plank, G.Kingsley. "Compiler-Assisted Checkpointing" University of Tennessee, Tech. Report CS-94-269, December 1994.