Shirish Ranjit
Distributed computing has been a research topic since the birth of the computer. Early research and development in distributed computing focused on modeling hardware rather than software. Computer engineers at major companies such as AT&T, Xerox, and IBM, and at research institutions, developed hardware models for achieving distributed computing. Those studies and experiences set the stage for modeling distributed software.
The “Search for Extra-Terrestrial Intelligence” (SETI@HOME, http://setiathome.ssl.berkeley.edu/) project collects data from different radio telescopes and searches it for evidence of extra-terrestrial intelligence. Because the telescopes produce an enormous volume of data, processing it would take many years even on a fast supercomputer. For this reason, SETI@HOME takes advantage of distributed computing to process the very large data set.
The SETI@HOME model is a “massively” distributed computing model. The problem is sliced into smaller tasks, and each task is given to a participating client. A server keeps track of participating clients and their work units. If a participant fails to deliver results, the task is sent to a different participant. The client application runs as a screen saver, so that processing takes place only when the CPU on the client computer is idle.
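As a rough illustration of this model, the sketch below implements a minimal work-unit server in Python. The names here (`WorkServer`, `WorkUnit`, the `deadline_seconds` parameter) are our own inventions for illustration; SETI@HOME's actual server is far more elaborate.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class WorkUnit:
    unit_id: int
    data: bytes
    assigned_to: Optional[str] = None
    assigned_at: Optional[float] = None
    result: Optional[bytes] = None

class WorkServer:
    """Tracks participants and their work units; a unit whose result
    never arrives is handed to a different participant."""

    def __init__(self, deadline_seconds: float):
        self.deadline = deadline_seconds
        self.units: Dict[int, WorkUnit] = {}

    def add_unit(self, unit_id: int, data: bytes) -> None:
        self.units[unit_id] = WorkUnit(unit_id, data)

    def assign(self, client_id: str) -> Optional[WorkUnit]:
        """Hand out an unassigned unit, or one whose previous
        assignee has missed the deadline."""
        now = time.time()
        for unit in self.units.values():
            if unit.result is not None:
                continue  # already done
            expired = (unit.assigned_at is not None and
                       now - unit.assigned_at > self.deadline)
            if unit.assigned_to is None or expired:
                unit.assigned_to, unit.assigned_at = client_id, now
                return unit
        return None  # nothing left to assign right now

    def report(self, client_id: str, unit_id: int, result: bytes) -> None:
        """Accept a result only from the unit's current assignee."""
        unit = self.units[unit_id]
        if unit.assigned_to == client_id and unit.result is None:
            unit.result = result
```

The deadline-based reassignment is the essential design choice: the server never needs to know *why* a participant went silent, only that its result is overdue.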
Our observation of the SETI@HOME architecture and implementation is that the client software keeps a record of the data it has processed: the client application is able to pick up where it previously paused. Similarly, the server is able to piece together results received from the many clients that are processing different pieces of data at any given time.
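A minimal sketch of such client-side checkpointing, again with hypothetical names (the checkpoint file and the `analyze` stub are ours), might look like this:

```python
import json
import os

CHECKPOINT = "workunit.checkpoint"   # hypothetical checkpoint file name

def analyze(block):
    """Stand-in for the real signal analysis on one block of samples."""
    pass

def process(samples, block_size=1000):
    """Process a work unit in small blocks, recording progress after
    each block so a paused or interrupted client can resume."""
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["next_index"]   # resume where we paused
    for i in range(start, len(samples), block_size):
        analyze(samples[i:i + block_size])
        with open(CHECKPOINT, "w") as f:
            json.dump({"next_index": i + block_size}, f)
    os.remove(CHECKPOINT)   # the unit is complete; discard the checkpoint
```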
The SETI@HOME project collects data from each radio telescope and stores the collected data on high-capacity tape drives. The data is then divided into chunks of 50 seconds of a 20 kHz signal, which is 0.25 MB of data. “On the receiving end a 0.25 MB chunk will require 1.3 sec on an incoming T1 line of 190 kb/s, or 2.3 minutes on a 14.4 k baud line. Upon completion (typically after several days) of the data analysis for each chunk, a short message reporting candidate signals is presented to the customer and also returned to Big Science and to the University of Washington for post-processing” (Sullivan, Werthimer, et al., July 1997, Proceedings of the Fifth International Conference on Bioastronomy).
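A quick back-of-the-envelope check of the quoted figures, assuming 2 bits per sample (an assumption on our part; the quote does not state the sample encoding, but 2 bits is what makes the 0.25 MB chunk size work out):

```python
# Work-unit size: 50 seconds of a 20 kHz signal.
seconds = 50
samples_per_second = 20_000
bits_per_sample = 2        # assumed; yields the quoted 0.25 MB chunk

bits = seconds * samples_per_second * bits_per_sample
print(f"chunk size: {bits / 8 / 1_000_000:.2f} MB")        # 0.25 MB

# Transfer time over a 14.4 k baud line, matching the quoted 2.3 minutes.
print(f"14.4 k transfer: {bits / 14_400 / 60:.1f} min")    # ~2.3 min
```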
The article by Waldo, Wyant, Wollrath, and Kendall of Sun Microsystems Laboratories sets the stage for distributed computing and its challenges. The authors describe distributed computing by comparing it with local computing, examining how the same issues differ between the two settings. The article presents distributed computing more or less as remote procedure call (RPC) extended to the object-oriented paradigm: in distributed computing, nothing is known about a recipient other than that it supports a particular interface. The article also raises fundamental issues that pertain to distributed computing regardless of the hardware or software model:
* Latency: The difference in time between a local object invocation and the invocation of an operation on a remote object.
* Memory access: Memory cannot be accessed the same way by a local processor and a remote one. The article presents two choices: either all memory access must be controlled by the underlying system, or the programmer must be aware of the different kinds of access.
* Partial failure: One of the subsystems fails to respond. Since there is no global state, no agent can determine that a component has failed and inform the other components of the failure. This is a central reality of distributed computing (the sketch following this list illustrates the problem).
* Concurrency: The distributed environment introduces truly asynchronous operations, giving rise to the problem of concurrency. Although the article does not treat this as unique to distributed computing, it states that concurrency is either handled entirely or not handled at all; there is no partial solution to concurrency.
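The partial-failure point is easy to demonstrate. In the sketch below (our own illustration, not taken from the article), a client invokes a flaky UDP “server” that sometimes never replies; all the caller can do is bound its wait and retry, because nothing tells it whether the request was lost, the server died, or the reply is merely slow. The endpoint and message names are hypothetical.

```python
import random
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9999     # hypothetical local endpoint

def flaky_server():
    """A remote component that sometimes fails to respond at all."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((HOST, PORT))
        while True:
            data, addr = sock.recvfrom(1024)
            if random.random() < 0.3:
                continue                    # partial failure: silence
            sock.sendto(b"ack: " + data, addr)

def invoke(payload: bytes, timeout: float = 0.5, retries: int = 3) -> bytes:
    """Bound the wait and retry: the caller cannot tell a lost request,
    a dead server, and a slow reply apart."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(retries):
            sock.sendto(payload, (HOST, PORT))
            try:
                reply, _ = sock.recvfrom(1024)
                return reply
            except socket.timeout:
                continue                    # silence: retry blindly
    raise RuntimeError("no reply: failed, slow, or unreachable?")

threading.Thread(target=flaky_server, daemon=True).start()
time.sleep(0.2)                             # give the server time to bind
print(invoke(b"work unit 42"))
```

A purely local invocation never faces this choice; the timeout-and-retry loop exists only because, as the article argues, there is no global state from which failure could be read off.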
“One could take the position that the way an object deals with latency, memory access, partial failure, and concurrency control is really an aspect of the implementation of that object, and is best described as part of the ‘quality of service’ provided by that implementation” (Waldo, Wyant, Wollrath, and Kendall, November 1994, Sun Microsystems Technical Report). The authors argue that in addressing the quality of service of a distributed system, the issues of latency, memory access, partial failure, and concurrency are also addressed; these issues must be dealt with in any implementation, and by coding in a certain way we can avoid partial failure, memory access problems, and concurrency problems. However, the authors also claim that robustness is not a function of the implementation. Their position is that robustness is inherent to the architecture and design of the distributed computing environment.