Communication and Fault Tolerance in Parallel Computers (Thesis)
Report ID: TR-414-93Author: Sitaraman, Ramesh Kumar
Date: 1993-06-00
Pages: 116
Download Formats: |Postscript|
Abstract:
This thesis explores two fundamental issues in the design of large-scale parallel computers: {it communication /} and {it fault tolerance/}. In Chapter 1, we introduce and provide motivation for the problems that we study in this thesis. Chapter 2 examines several simple algorithms for routing packets on butterfly networks with bounded queues. Among other things, we show that for any greedy queuing protocol, a routing problem in which each of the $N$ inputs sends a packet to a randomly chosen output requires $O(log N)$ steps, with high probability, provided that the queue size is a sufficiently large, but fixed, constant. In Chapter 3, we analyze the fault-tolerance properties of several bounded-degree hypercubic networks that are commonly used for parallel computation. Among other things, we show that an $N$-node butterfly containing $N^{1-epsilon}$ worst-case faults (for any constant $epsilon > 0$) can emulate a fault-free butterfly of the same size with only constant slowdown. Similar results are proved for the shuffle-exchange graph. Hence, these networks become the first connected bounded-degree networks known to be able to sustain more than a constant number of worst-case faults without suffering more than a constant-factor slowdown in performance. In Chapter 4, we study the ability of array-based networks to tolerate faults. Among other things, we show that an $N imes N$ two-dimensional array can sustain $N^{1-epsilon}$ worst-case faults, for some fixed $epsilon < 1$, and still emulate a fully-functioning $N imes N$ array with only constant slowdown. In Chapter 5, we study a concurrent error detection scheme called Algorithm Based Fault Tolerance (ABFT). Unlike the schemes developed in Chapters 3 and 4 to tolerate permanent faults, the scheme studied in this chapter is primarily aimed at tolerating transient faults in a parallel computer. The main contribution of this chapter is to propose a simple and novel algorithm called RANDGEN to generate data-check relationships. By simply varying its parameters, RANDGEN can produce data-check relationships with a wide spectrum of properties, many of which have been considered important in recent ABFT designs.