Distributed operating system
A distributed operating system is the logical aggregation of operating system software over a collection of independent, networked, communicating, and spatially disseminated computational nodes. Individual system nodes each hold a discrete software subset of the global aggregate operating system. Each node-level software subset is a composition of two distinct provisioners of services.
The first is a ubiquitous minimal kernel, or microkernel, situated directly above each node’s hardware. The microkernel provides only the necessary mechanisms for a node's functionality. Second is a higher-level collection of system management components, providing all necessary policies for a node's individual and collaborative activities. This collection of management components exists immediately above the microkernel, and below any user applications or APIs that might reside at higher levels.
These two entities, the microkernel and the management components collection, work together. They support the global system’s goal of seamlessly integrating all network-connected resources and processing functionality into an efficient, available, and unified system. This seamless integration of individual nodes into a global system is referred to as transparency, or single system image: the illusion, presented to users, that the global system is a singular and local computational entity.
A distributed operating system is an operating system. This statement may be trivial, but it is not always obvious, because the distributed operating system is such an integral part of the distributed system. The idea is analogous to the consideration of a square. A square might not immediately be recognized as a rectangle. Although it possesses all requisite attributes defining a rectangle, a square’s additional attributes and specific configuration provide a disguise. At its core, the distributed operating system provides only the essential services and minimal functionality required of an operating system, but its additional attributes and particular configuration make it different. The distributed operating system fulfills its role as an operating system, and does so in a manner indistinguishable from a centralized, monolithic operating system. That is, although distributed in nature, it supports a single system image through the implementation of transparency; or, more simply said, the system’s appearance as a singular, local entity.
Mechanism and policy can be simply interpreted as "how something is done" versus "why something is done," respectively. Achieving this separation allows for an exceptionally loosely coupled, flexible, and scalable distributed operating system.
In a distributed operating system, the kernel is often defined by a relatively minimal architecture. A kernel of this design is referred to as a microkernel. The microkernel usually contains only the mechanisms and services which, if removed, would render a node or the global system functionally inoperable. The minimal nature of the microkernel strongly enhances a distributed operating system’s modular potential. The microkernel is generally implemented directly above its node’s hardware and resources; it is also common for a kernel to be identically replicated over all nodes in a system. The combination of a microkernel’s minimal design and ubiquitous node coverage enhances the global system's extensibility, and the ability to dynamically introduce new nodes or services.
However, these system management components have additional challenges with respect to supporting a node's responsibilities to the global system. In addition, the system management components accept the defensive responsibilities of reliability, availability, and persistence inherent in the distributed operating system. Quite often, any effort to realize a high level of success in a particular area incites conflict with similar efforts in other areas. Therefore, a consistent approach, balanced perspective, and a deep understanding of the overall system and its goals can help mitigate some complexity, and assist in quickly identifying potential points of diminishing returns. This is an example of why the separation of policy and mechanism is so critical.
The multi-level collaboration between a kernel and the system management components, and in turn between the distinct nodes in a distributed operating system, is the functional challenge of the distributed operating system. This is the point in the system that must maintain a perfect harmony of purpose, and simultaneously maintain a complete disconnect of intent from implementation. This challenge is the distributed operating system's opportunity to produce the foundation and framework for a reliable, efficient, available, robust, extensible, and scalable system. However, this opportunity comes at a very high cost in complexity.
These design and development considerations are critical and unforgiving. For instance, a deep understanding of a distributed operating system’s overall architectural and design detail is required at an exceptionally early point. There is an extensive array of design considerations inherent in the development of a distributed operating system. Each of these design considerations can potentially affect many of the others to a significant degree. This leads to a massive effort to balance the individual design considerations and many of their permutations. As an aid in this effort, most rely strongly on the immense amount of documented experience and research in distributed computing which exists, and continues even today.
The subject of distributed operating systems, however, has a much richer historical perspective. This is especially evident when considering distributed operating system design issues severally, and with respect to some of the primordial strides taken towards their realization. There are several instances of fundamental and pioneering implementations of primitive distributed operating system component concepts dating back to the early 1950s. Some of these very early individual steps were not focused directly on distributed computing, and at the time, many may not have realized their important impact. These pioneering efforts laid important groundwork, and inspired continued research in areas related to distributed computing.
Beginning in the mid-1970s, many research efforts produced important advances in distributed computing. These breakthroughs provided a solid, stable foundation for the continued efforts through the 1990s, mentioned earlier. Considering the modern distributed operating system and its future, one need look no further than the current challenges of many-core and multi-processor science. The accelerating proliferation of multi-processor and multi-core processor systems research has led to a resurgence of the distributed operating system concept. Many of these research efforts are investigating plausible paradigms impacting the future of distributed computing.
The unique nature of the distributed operating system is both subtle and complex. A distributed operating system’s hardware infrastructure elements are not centralized; that is, the elements are not in close proximity to one another at a single location. A given distributed operating system’s structure elements could reside in various rooms within a building, or in various buildings around the world. This geographic dissemination defines its decentralization; however, the distributed operating system is distributed, not simply decentralized.
This distinction is the source of the subtlety and complexity. While decentralized systems and distributed operating systems are both spatially diverse, it is the specific manner and relative degree of linkage between the elements, or nodes, in the systems that differentiates the two. In the case of these two types of operating system, these linkages are the lines of communication between the nodes of the system.
; centralized, decentralized, and distributed. In this examination, consider three tightly related aspects of their structure: organization, connection, and control. Organization will describe a system's physical arrangement characteristics, connection will involve the associations among constituent structural entities, and control will correlate the manner, necessity, and rationale of the earlier two considerations.
A centralized system is organized most simply: basically one real level of structure, where all constituent elements are highly influenced by and are ultimately dependent upon this organization. The decentralized system is a more federated structure composed of multiple levels, where subsets of a system’s entities unite. These entity subsets in turn unite at higher levels, in the direction of and ultimately culminating at the central master element. The (purely) distributed system has no discernible concept of, or indeed any necessity for, levels; it is purely an autonomous collection of discrete elements.
It is important to note that all of these systems are distributed, in that they comprise separate and distinct constituent elements connected together to form a system. This is a generic idea of the distributed organization of system elements; however, a distributed system is a quite specific entity unto itself. It is this distributed system concept that will be approached in the following sub-sections.
s -- each on a string, -- with the hand being the central figure. A decentralized system (or network system) incorporates a single-step direct, or multi-step indirect path between any given constituent element and the central entity. This can be understood by thinking of a corporate organizational chart, the first level connecting directly, and lower levels connecting indirectly through successively higher levels (no lateral “dotted” lines). Finally, the distributed operating system has no inherent pattern; direct and indirect connections are possible between any two given elements of the system. Consider the 1970s phenomenon of “string art” or a spirograph drawing as a fully connected system, and the spider’s web or the Interstate Highway System between U.S. cities as examples of a partially connected system.
Notice that in the directed systems (centralized and decentralized) there is more control, therefore easing the administration of processes, but constraining their possible scope of influence. On the other hand, the distributed operating system — without directed connections — is much more difficult to control, but is effectively limited in extensible scope only by the capabilities of its individual, autonomous, and interdependent nodes. The associations of the distributed operating system conform only to the needs imposed by its many design considerations, and not in any way by organizational limitations.
Transparency allows a user to accomplish a system-related objective with absolute minimal knowledge of the particular internal details related to the objective. A system or application may expose as much or as little transparency in a given area of functionality as deemed necessary. That is to say, the degree to which transparency is implemented can vary between subsets of functionality in a system or application. There are many specific areas of a system that can benefit from transparency: access, location, performance, naming, and migration, to name a few.
For example, a distributed operating system may present access to a hard drive as "C:" and access to a DVD as "G:". The user does not require any knowledge of device drivers or of the direct memory access techniques possibly used behind the scenes; both devices work the same way from the user's perspective. This example demonstrates a high level of transparency, and shows how low-level details are made somewhat "invisible" to the user through transparency. On the other hand, if a user desires to access another system or server, a host name or IP address may be required, along with a remote-machine user login and password. This would indicate a low degree of transparency, as detailed knowledge is required of the user in order to accomplish this task.
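The contrast between low and high transparency can be illustrated with a small Python sketch. Everything in it is hypothetical and introduced only for illustration: the `NameService` registry, the `Location` record, and the resource name `reports/q1` are assumptions, not part of any particular distributed operating system.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Location:
    """Where a resource physically lives (hidden from the user)."""
    host: str
    path: str


class NameService:
    """Hypothetical registry mapping logical names to physical locations."""

    def __init__(self) -> None:
        self._table: Dict[str, Location] = {}

    def register(self, name: str, location: Location) -> None:
        self._table[name] = location

    def resolve(self, name: str) -> Location:
        return self._table[name]


def read_low_transparency(host: str, path: str, user: str) -> str:
    # Low transparency: the caller must know the host, path, and login.
    return f"read {path} on {host} as {user}"


def read_high_transparency(names: NameService, name: str) -> str:
    # High transparency: the caller knows only a logical name;
    # the system resolves the physical details on the caller's behalf.
    loc = names.resolve(name)
    return f"read {loc.path} on {loc.host}"


names = NameService()
names.register("reports/q1", Location(host="node-17", path="/vol3/q1.dat"))
print(read_high_transparency(names, "reports/q1"))
```

In the second function the user supplies one logical name, while in the first the user carries the location and credential details personally, which is the inverse relation between transparency and required knowledge.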
Generally, transparency and user-required knowledge form an inverse relation. As transparency is designed and implemented into various areas of a system, great care must be taken not to adversely affect other areas of transparency and other basic design concerns. Transparency, as a design concept, is one of the grand challenges in the design of a distributed operating system, because it is one reason a complete upfront understanding of the system is necessary.
Inter-process communication (IPC) is the implementation of general communication, process interaction, and dataflow between threads and/or processes, both within a system node and between all nodes in a distributed operating system. The distributed nature of a system's nodes and the multi-level considerations of intra-node and inter-node requirements provide the baseline for high-level IPC design considerations. However, IPC in a distributed operating system is a low-level implementation. IPC is the low-level critical complement to the high-level concept of transparency. Many of the requirements and restrictions imposed on a system as a result of transparency will be accomplished directly or indirectly through IPC. In this sense, IPC is the greatest underlying concept in the low-level design considerations of a distributed operating system.
Process management provides policies and mechanisms for effective and efficient sharing of a system's distributed processing resources between that system's distributed processes. These policies and mechanisms support operations involving the allocation and de-allocation of processes and ports to processors, as well as provisions to run, suspend, migrate, halt, or resume execution of processes. While these distributed operating system resources and the operations on them can be either local or remote with respect to each other, the distributed operating system must still maintain the complete state of, and synchronization over, all processes in the system, and do so in a manner completely consistent with the user's unified system perspective.
As an example, load balancing is a common process management function. One consideration of load balancing is which process should be moved. The kernel may have several mechanisms, one of which might be priority-based choice. This mechanism in the kernel defines what can be done; in this case, choose a process based on some priority. The system management components would have policies implementing the decision making for this context. One of these policies would define what priority means, and how it is to be used to choose a process in this instance.
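The split described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real kernel interface: `select_by_priority` stands in for the kernel mechanism, and the two policy functions, the `Process` fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Process:
    pid: int
    cpu_seconds: float   # CPU time consumed so far
    memory_mb: int       # resident memory, a rough cost of moving the process


# Mechanism (kernel side): knows HOW to pick the highest-priority process,
# but has no opinion about what "priority" means.
def select_by_priority(
    processes: Sequence[Process],
    priority: Callable[[Process], float],
) -> Process:
    return max(processes, key=priority)


# Policies (management-component side): define what priority means here.
def prefer_cheap_to_move(p: Process) -> float:
    return -p.memory_mb          # smaller footprint means higher priority


def prefer_cpu_hungry(p: Process) -> float:
    return p.cpu_seconds         # heavier CPU user means higher priority


procs = [Process(1, 12.0, 900), Process(2, 40.0, 300), Process(3, 5.0, 50)]
print(select_by_priority(procs, prefer_cheap_to_move).pid)  # 3
print(select_by_priority(procs, prefer_cpu_hungry).pid)     # 2
```

Swapping the policy changes which process is chosen without touching the mechanism, which is the loose coupling the separation is meant to provide.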
such as memory, files, devices, etc. are distributed throughout a system, and at any given moment, any of these nodes may have light to idle workloads. Load sharing and load balancing require many policy-oriented decisions, ranging from finding idle CPUs to deciding when to move a process and which process to move. Many algorithms exist to aid in these decisions; however, this calls for a second level of decision-making policy in choosing the algorithm best suited for the scenario and the conditions surrounding it.
and security of a system's hardware, services, and data. Issues arising from availability failures or security violations are considered faults. Faults are physical or logical defects that can cause errors in the system. For a system to be reliable, it must somehow overcome the adverse effects of faults.
There are three general methods for dealing with faults: fault avoidance, fault tolerance, and fault detection and recovery. Fault avoidance comprises the proactive measures taken to minimize the occurrence of faults. These proactive measures can be in the form of transactions, replicated resources and processes, and primary back-ups of complete servers. Fault tolerance is the ability of a system to continue some meaningful level of operation in the face of a fault. Fault detection and recovery applies once a fault does occur: the system should detect the fault and have the capability to respond quickly and effectively to recover full functionality. In any event, any actions taken should make every effort to preserve the single system image.
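As a rough illustration of how replication, detection, and tolerance fit together, the following Python sketch reads from a list of replicas and falls back to the next one when a fault is observed. The function and exception names are hypothetical, and a real system would also need to repair or replace the failed replica to recover full functionality.

```python
from typing import Callable, List, Sequence, Tuple


class ReplicaError(Exception):
    """Raised by a replica that cannot serve the request."""


def read_with_failover(
    replicas: Sequence[Callable[[str], str]],
    key: str,
) -> str:
    """Try each replica in turn and return the first successful answer."""
    failures: List[Tuple[int, str]] = []
    for index, replica in enumerate(replicas):
        try:
            return replica(key)          # tolerance: any live replica will do
        except ReplicaError as exc:      # detection: the fault is observed
            failures.append((index, str(exc)))
    raise ReplicaError(f"all {len(replicas)} replicas failed: {failures}")


def broken_replica(key: str) -> str:
    raise ReplicaError("node unreachable")


def healthy_replica(key: str) -> str:
    return f"value-of-{key}"


print(read_with_failover([broken_replica, healthy_replica], "config"))
```

The caller sees a single answer and never learns which replica produced it, which is how the single system image is preserved through the fault.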
Performance is arguably the quintessential computing concern, and in the distributed operating system, it is no different. Many benchmark metrics exist for performance: throughput, job completions per unit time, system utilization, etc. Each of these benchmarks is more meaningful in describing some scenarios, and less in others. With respect to a distributed operating system, this consideration most often distills to a balance between process parallelism and IPC. Managing the task granularity of parallelism in a sensible relation to the messages required for support is extremely effective. Identifying when it is more beneficial to migrate a process to its data, rather than copy the data, is effective as well. Many process and resource management algorithms work to maximize performance.
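The choice between migrating a process and copying its data can be framed as a simple cost comparison. The Python sketch below uses a deliberately crude model (one link, fixed latency plus size divided by bandwidth); the function names, the default latency, and the example sizes are assumptions for illustration only.

```python
def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: float,
                     latency_s: float) -> float:
    """Estimated time to ship size_bytes across one network link."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def should_migrate_process(process_state_bytes: int, data_bytes: int,
                           bandwidth_bytes_per_s: float,
                           latency_s: float = 0.001) -> bool:
    """True when moving the process is estimated cheaper than copying its data."""
    move_process = transfer_seconds(process_state_bytes,
                                    bandwidth_bytes_per_s, latency_s)
    copy_data = transfer_seconds(data_bytes, bandwidth_bytes_per_s, latency_s)
    return move_process < copy_data


# An 8 MB process working on a 2 GB data set over a 125 MB/s link:
print(should_migrate_process(8_000_000, 2_000_000_000, 125_000_000))  # True
```

A realistic policy would also weigh the load on the destination node and how long the process is expected to keep running, which this sketch ignores.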
have an inherent need for synchronization. Three basic situations define the scope of this need:
There is a multitude of algorithms available for these scenarios, and each has many variations. Unfortunately, whenever synchronization is required, the opportunity for process deadlock usually exists.
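One common way to reason about deadlock, offered here only as an illustrative sketch and not as a technique named in the text above, is a wait-for graph: each process maps to the set of processes it is waiting on, and a cycle in that graph indicates a deadlock. The function name and process labels are hypothetical.

```python
from typing import Dict, List, Optional, Set


def find_deadlock(wait_for: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle of mutually waiting processes, or None if there is none."""
    visited: Set[str] = set()

    def visit(node: str, path: List[str]) -> Optional[List[str]]:
        if node in path:                       # back edge: a cycle exists
            return path[path.index(node):]
        if node in visited:                    # fully explored earlier, no cycle
            return None
        visited.add(node)
        for blocker in sorted(wait_for.get(node, ())):
            cycle = visit(blocker, path + [node])
            if cycle is not None:
                return cycle
        return None

    for start in sorted(wait_for):
        cycle = visit(start, [])
        if cycle is not None:
            return cycle
    return None


# P1 waits on P2, P2 waits on P3, P3 waits on P1: a three-way deadlock.
print(find_deadlock({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}))
```

In a distributed setting the hard part is assembling a consistent wait-for graph from nodes that each see only part of it; the sketch assumes the whole graph is already available in one place.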
in a distributed operating system is made possible through the modular characteristics of the microkernel. With the microkernel presenting an absolute minimal (but complete) set of primitives and basic functionally cohesive services, the higher-level management components can be composed in a similarly cohesive manner. This capability leads to exceptional flexibility in the management components collection; more importantly, it allows the opportunity to dynamically swap, upgrade, or install additional instances of components above the kernel.
One of the first solutions to these new questions was the DYSEAC, a self-described general-purpose synchronous computer that, at this point in history, exhibited signs of being much more than general-purpose. In one of the earliest publications of the Association for Computing Machinery, in April 1954, a researcher at the National Bureau of Standards (now the National Institute of Standards and Technology, NIST) presented a detailed implementation design specification of the DYSEAC. Without carefully reading the entire specification, one could be misled by summary language in the introduction as to the nature of this machine. The initial section of the introduction advises that major emphasis will be placed upon the requirements of the intended applications, and that these applications would require flexible communication. However, the suggestion that the external devices could be typewriters, magnetic media, and CRTs, together with the repeated use of the term “input-output operation,” could quickly limit any paradigm of this system to a complex centralized “ensemble.” Seemingly saving the best for last, the author eventually describes the true nature of the system.
While this more detailed description elevates the perception of the system, the best that can be distilled from this is some semblance of decentralized control. The avid reader, persevering in the investigation would get to a point at which the real nature of the system is divulged.
This is one of the earliest examples of a computer with distributed control. Dept. of the Army reports show it was certified reliable and passed all acceptance tests in April 1954. It was completed and delivered on time, in May 1954. It was also a portable computer: it was housed in a tractor-trailer, and had 2 attendant vehicles and 6 tons of refrigeration.
Described as an input-output system of experimental nature, the Lincoln TX-2 placed a premium on flexibility in its association of simultaneously operational input-output devices. The design of the TX-2 was modular, supporting a high degree of modification and expansion, as well as flexibility in the operation and programming of its devices. The system employed the Multiple-Sequence Program Technique.
This technique allowed for multiple program counters to each associate with one of 32 possible sequences of program code. These explicitly prioritized sequences could be interleaved and executed concurrently, affecting not only the computation in process, but also the control flow of sequences and switching of devices as well. Much discussion ensues related to the complexity and sophistication in the sequence capabilities of devices.
Similar to the previous system, the TX-2 discussion has a distinct decentralized theme until it is revealed that efficiencies in system operation are gained when separate programmed devices are operated simultaneously. It is also stated that the full power of the central unit can be utilized by any device; and it may be used for as long as the device's situation requires. In this, we see the TX-2 as another example of a system exhibiting distributed control, its central unit not having dedicated control.
One early memory access paradigm was Intercommunicating Cells, where a cell is composed of a collection of memory elements. A memory element was basically an electronic flip-flop or relay, capable of two possible values. Within a cell there are two types of elements, symbol and cell elements. Each cell structure stores data in a string of symbols, consisting of a name and a set of associated parameters. Consequently, a system's information is linked through various associations of cells.
Intercommunicating Cells fundamentally breaks from tradition in that it has no counters or any concept of addressing memory. The theory contends that addressing is a wasteful and non-valuable level of indirection. Information is accessed in two ways, direct and cross-retrieval. Direct retrieval looks to a name and returns a parameter set. Cross-retrieval projects through parameter sets and returns a set of names containing the given subset of parameters. This would be similar to a modified hash table data structure that would allow for multiple values (parameters) for each key.
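The two retrieval modes can be mimicked with an ordinary dictionary of sets, following the hash table analogy above. This Python sketch is a toy model only: the `CellStore` class and its method names are hypothetical, and its cross-retrieval scans every cell, so it does not reproduce the constant-time, address-free behaviour that the original hardware proposal aimed for.

```python
from typing import Dict, FrozenSet, Iterable, Set


class CellStore:
    """Toy model: each cell is a name plus a set of parameters."""

    def __init__(self) -> None:
        self._cells: Dict[str, FrozenSet[str]] = {}

    def store(self, name: str, parameters: Iterable[str]) -> None:
        self._cells[name] = frozenset(parameters)

    def direct(self, name: str) -> FrozenSet[str]:
        """Direct retrieval: look up a name, return its parameter set."""
        return self._cells[name]

    def cross(self, wanted: Iterable[str]) -> Set[str]:
        """Cross-retrieval: names whose parameters contain every wanted one."""
        wanted_set = frozenset(wanted)
        return {name for name, params in self._cells.items()
                if wanted_set <= params}


store = CellStore()
store.store("disk0", {"storage", "local", "fast"})
store.store("tape3", {"storage", "remote"})
store.store("cpu1", {"compute", "local"})

print(sorted(store.direct("tape3")))     # ['remote', 'storage']
print(sorted(store.cross({"storage"})))  # ['disk0', 'tape3']
print(sorted(store.cross({"local"})))    # ['cpu1', 'disk0']
```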
This early research into alternative memory describes a configuration ideal for the distributed operating system. The constant-time projection through memory for storing and retrieval would be inherently atomic and exclusive. The cellular memory's intrinsic distributed characteristics would be an invaluable benefit; however, the impact on the user, hardware, or application programming interfaces is uncertain. It is distinctly obvious that these early researchers had a distributed system concept in mind, as they state:
Memory coherence in shared virtual memory systems
Transactional Memory
Composable memory transactions
Transactional memory: architectural support for lock-free data structures
Software transactional memory for dynamic-sized data structures
Software transactional memory
Consensus in the presence of partial synchrony
The Byzantine Generals Problem
Fail-stop processors: an approach to designing fault-tolerant computing systems
Distributed snapshots: determining global states of distributed systems
Optimistic recovery in distributed systems
The Cronus distributed operating system
Design and development of MINIX distributed operating system
