FADI: a fault-tolerant environment for distributed processing systems

Osman, TM (ORCID: https://orcid.org/0000-0001-8781-2658), 1998. FADI: a fault-tolerant environment for distributed processing systems. PhD, Nottingham Trent University.

10183026.pdf - Published version


Abstract

This thesis describes research on the development of a FAult-tolerant Distributed Processing Environment (FADI). The main motivation for designing FADI is to create an efficient, low-cost fault-tolerant environment that enables reliable execution of concurrent user-application processes in the presence of hardware faults affecting one or more of the distributed system nodes.

There are two aspects to any fault-tolerant system: error detection and fault recovery. In FADI, a user-transparent mechanism was developed for the detection of both permanent (power failures, network malfunction, etc.) and transient (temporary memory faults, radiation effects, etc.) processor node failures. The detection mechanism in FADI also allows the incorporation of additional user-programmed error checks and, by dynamically measuring the network traffic, it identifies the error latency, thus facilitating damage confinement and assessment.

The recovery of user-processes affected by faults is based on the principles of checkpointing and roll-back. The application processes record their execution states at regular time intervals (take a checkpoint), and upon the occurrence of a hardware failure, the failed process is restarted from the last recorded execution state (roll-back of the process).
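The checkpoint and roll-back cycle described above can be illustrated with a minimal sketch. It is not FADI's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the use of `pickle` and a local file, and the step-count interval (standing in for the regular time intervals mentioned in the abstract) are all assumptions made for illustration, and the checkpoint here is taken explicitly rather than transparently to the application.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Record the execution state atomically: write a temp file, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    os.replace(tmp_path, path)  # the old checkpoint stays valid until this succeeds


def load_checkpoint(path):
    """Return the last recorded state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum 1..n_steps, taking a checkpoint every `checkpoint_every` steps.

    `fail_at` is a hypothetical hook that simulates a hardware fault
    by raising an exception at the given step.
    """
    state = load_checkpoint(path)  # roll back to the last checkpoint, if any
    while state["step"] < n_steps:
        next_step = state["step"] + 1
        if fail_at is not None and next_step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += next_step
        state["step"] = next_step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same checkpoint path after a failure resumes from the last saved state instead of starting over; work done since that checkpoint is simply repeated.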

In contrast to redundancy-based fault-tolerant systems, the adopted approach requires no extra hardware and no replication of application processes. Instead, it optimises the use of an existing distributed system to save the checkpoints and to migrate the application processes from the faulty hardware to operating nodes. This reduces both the cost of building the recovery software and its overhead on the running time of the application processes. The novel checkpointing mechanism developed in the course of this research is user-transparent and computationally efficient, since it does not require freezing the application process while its checkpoint is being taken.

The checkpointing mechanism was initially evaluated in the context of stand-alone applications, and the experimental results showed that it is very robust. Subsequent research extended the checkpointing protocol to cover the inter-process communications that may take place between the distributed application processes. This led to the development of a novel algorithm that supports checkpointing and rollback of message-passing (interactive) processes in FADI. The algorithm introduces a novel technique to tolerate faults that might occur whilst inter-process messages are in transit. It also tolerates the duplication of inter-process messages and has a low failure-free overhead due to its policy of coordinated checkpointing and selective message logging.
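The abstract does not say how FADI recognises duplicated inter-process messages. One common technique, shown here purely as an assumption and not as the thesis's algorithm, is to tag each message with a per-sender sequence number and discard anything already delivered, for example when a rolled-back sender re-transmits messages it sent before the failure. The class and method names are hypothetical, and the sketch assumes messages from each sender arrive in order.

```python
class Receiver:
    """Delivers each (sender, seq) message at most once."""

    def __init__(self):
        self.last_seq = {}   # sender id -> highest sequence number delivered
        self.delivered = []  # payloads handed to the application, in order

    def receive(self, sender, seq, payload):
        """Deliver the message unless it was already seen.

        Returns True if delivered, False if dropped as a duplicate.
        """
        if seq <= self.last_seq.get(sender, 0):
            return False  # re-sent after a rollback, or duplicated in transit
        self.last_seq[sender] = seq
        self.delivered.append((sender, payload))
        return True
```

With this scheme a recovering sender can replay its messages without coordinating with the receiver, at the cost of one counter per sender.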

The performance studies indicate that FADI imposes a low overhead on the execution time of applications, and they confirm its potential in the context of computation-intensive scientific programs and distributed telemetry and telecontrol industrial systems.

Item Type: Thesis
Creators: Osman, T.M.
Date: 1998
ISBN: 9781369313185
Identifiers: PQ10183026 (Type: Other)
Divisions: Schools > School of Science and Technology
Record created by: Linda Sullivan
Date Added: 28 Aug 2020 12:34
Last Modified: 21 Jun 2023 08:19
URI: https://irep.ntu.ac.uk/id/eprint/40576
