Checkpointing and recovery for distributed shared memory applications

Ouyang, Jinsong; Heiser, Gernot

doi:10.1109/IWOOS.1995.470555

Access & Terms of Use

open access

CC BY-NC-ND 3.0

Abstract

This paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to distributed shared memory applications. Two different mechanisms are presented to efficiently address the issue of message losses due to either site failures or unreliable non-FIFO channels. Both guarantee a correct and efficient recovery from a consistent distributed system state following a failure. A variant of the two-phase commit protocol is employed such that the communication overhead required to take a consistent checkpoint is the same as that of systems using a one-phase commit protocol, while our protocol utilises stable storage more efficiently. A consistent checkpoint is committed when the first phase of the protocol finishes.
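The abstract gives no protocol details, so the following is only a generic sketch of coordinator-driven consistent checkpointing, not the authors' protocol. It illustrates the general idea of tentative checkpoints that become permanent only once every process has acknowledged the first phase. The `Process` class, the `checkpoint` function, and the in-memory fields standing in for stable storage are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Process:
    """Hypothetical participant; fields stand in for stable storage."""
    name: str
    state: int = 0
    tentative: Optional[int] = None   # written, but not yet committed
    committed: Optional[int] = None   # last checkpoint safe to recover from
    alive: bool = True

    def take_tentative(self) -> bool:
        # Phase 1: save the current state tentatively and acknowledge.
        if not self.alive:
            return False
        self.tentative = self.state
        return True

    def commit(self) -> None:
        # Promote the tentative checkpoint, replacing the older one.
        if self.tentative is not None:
            self.committed, self.tentative = self.tentative, None

    def abort(self) -> None:
        # Discard the tentative checkpoint; the committed one is untouched.
        self.tentative = None

    def recover(self) -> None:
        # Roll back to the last committed checkpoint after a failure.
        if self.committed is not None:
            self.state = self.committed
        self.alive = True


def checkpoint(processes: List[Process]) -> bool:
    """Coordinator (assumed): commit only if every process acknowledged."""
    acks = [p.take_tentative() for p in processes]
    if all(acks):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a checkpoint is promoted only when all acknowledgements arrive, a failure during the first phase leaves every process with its previous committed checkpoint, so recovery always starts from a state that all processes agree on. Message loss on unreliable non-FIFO channels, which the paper addresses, is deliberately left out of this sketch.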

Persistent link to this record

http://hdl.handle.net/1959.4/39900

Link to Publisher Version

http://dx.doi.org/10.1109/IWOOS.1995.470555

Author(s)

Ouyang, Jinsong; Heiser, Gernot

Publication Year

1995

Resource Type

Conference Paper

Files

9502.pdf

195.47 KB

Adobe Portable Document Format
