Checkpointing and recovery for distributed shared memory applications

open access
Abstract
This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to distributed shared memory applications. Two different mechanisms are presented to efficiently address the issue of message losses due to either site failures or unreliable non-FIFO channels. Both guarantee a correct and efficient recovery from a consistent distributed system state following a failure. A variant of the two-phase commit protocol is employed such that the communication overhead required to take a consistent checkpoint is the same as that of systems using a one-phase commit protocol, while our protocol utilises stable storage more efficiently. A consistent checkpoint is committed when the first phase of the protocol finishes.
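The commit rule described in the abstract (a checkpoint counts as committed once the first phase finishes) can be illustrated with a small, single-machine simulation. This is only a sketch under stated assumptions, not the protocol from the paper: the `Process` and `Coordinator` classes, the epoch numbering, and the treatment of the second phase as pure cleanup of superseded checkpoints are all hypothetical. Message loss and non-FIFO channels are not modelled, because the calls here are synchronous.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """A hypothetical node with local state and simulated stable storage."""
    pid: int
    state: dict = field(default_factory=dict)
    stable: dict = field(default_factory=dict)  # epoch -> saved copy of state
    alive: bool = True

    def take_tentative(self, epoch: int) -> bool:
        """Phase one: save local state under the new epoch and acknowledge."""
        if not self.alive:
            return False  # a failed site never acknowledges
        self.stable[epoch] = dict(self.state)
        return True

    def discard_older_than(self, epoch: int) -> None:
        """Phase two (assumed to be cleanup only): drop superseded checkpoints."""
        for old in [e for e in self.stable if e < epoch]:
            del self.stable[old]

    def restore(self, epoch: int) -> None:
        """Roll back to the saved state; stable storage survives the failure."""
        self.state = dict(self.stable[epoch])
        self.alive = True


class Coordinator:
    """Hypothetical coordinator driving the checkpoint rounds."""

    def __init__(self, processes):
        self.processes = processes
        self.committed_epoch = None
        self.next_epoch = 0

    def checkpoint(self) -> bool:
        epoch = self.next_epoch
        self.next_epoch += 1
        acks = [p.take_tentative(epoch) for p in self.processes]
        if not all(acks):
            # Abort: remove the tentative checkpoints of this round only,
            # leaving the last committed checkpoint untouched.
            for p in self.processes:
                p.stable.pop(epoch, None)
            return False
        # All acknowledgements are in: the checkpoint is committed here,
        # at the end of phase one.
        self.committed_epoch = epoch
        for p in self.processes:
            p.discard_older_than(epoch)
        return True

    def recover(self) -> None:
        """Restore every process from the last committed checkpoint."""
        if self.committed_epoch is None:
            raise RuntimeError("no committed checkpoint to recover from")
        for p in self.processes:
            p.restore(self.committed_epoch)
```

In this sketch a round that loses an acknowledgement is aborted without touching the previous committed checkpoint, so recovery always has a complete set of saved states to roll back to.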
Author(s)
Ouyang, Jinsong; Heiser, Gernot
Publication Year
1995
Resource Type
Conference Paper