History of general purpose CPUs

The history of general purpose CPUs is a continuation of the earlier history of computing hardware.

1950s: early designs

Each of the computer designs of the early 1950s was a unique design; there were no upward-compatible machines or computer architectures with multiple, differing implementations. Programs written for one machine would not run on another kind, even other kinds from the same company. This was not a major drawback at the time because there was not a large body of software developed to run on computers, so starting programming from scratch was not seen as a large barrier.

The design freedom of the time was very important, for designers were very constrained by the cost of electronics, yet just beginning to explore how a computer could best be organized. Some of the basic features introduced during this period included index registers (on the Ferranti Mark 1), a return-address saving instruction (UNIVAC I), immediate operands (IBM 704), and the detection of invalid operations (IBM 650).

By the end of the 1950s, commercial builders had developed factory-constructed, truck-deliverable computers. The most widely installed computer was the IBM 650, which used drum memory onto which programs were loaded using either paper tape or punched cards. Some very high-end machines also included core memory, which provided higher speeds. Hard disks were also starting to become popular.

A computer is an automatic abacus. The type of number system affects the way it works. In the early 1950s, most computers were built for specific numerical processing tasks, and many machines used decimal numbers as their basic number system; that is, the mathematical functions of the machines worked in base 10 instead of base 2 as is common today. These were not merely binary coded decimal: most machines actually had ten vacuum tubes per digit in each register. Some early Soviet computer designers implemented systems based on ternary logic; that is, a digit could have one of three states: +1, 0, or -1, corresponding to positive, zero, or negative voltage.

An early project for the U.S. Air Force, BINAC, attempted to make a lightweight, simple computer by using binary arithmetic. It deeply impressed the industry.

As late as 1970, major computer languages were unable to standardize their numeric behavior because decimal computers had groups of users too large to alienate.

Even when designers used a binary system, they still had many odd ideas. Some used sign-magnitude arithmetic (-1 = 10001), or ones' complement (-1 = 11110), rather than modern two's complement arithmetic (-1 = 11111). Most computers used six-bit character sets, because they adequately encoded Hollerith cards. It was a major revelation to designers of this period to realize that the data word should be a multiple of the character size. They began to design computers with 12-, 24- and 36-bit data words (e.g. see the TX-2).
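
The difference between these encodings is easy to see in a few lines of C. The following sketch reproduces the three representations of -1 quoted above; the 5-bit width and the helper names are illustrative, not taken from any particular machine:

    /* Sketch: three classic signed-number encodings at 5-bit width. */
    #include <stdio.h>

    #define BITS 5
    #define MASK ((1u << BITS) - 1)

    unsigned sign_magnitude(int v) {   /* sign bit, then |v| */
        return v < 0 ? (1u << (BITS - 1)) | (unsigned)(-v) : (unsigned)v;
    }
    unsigned ones_complement(int v) {  /* negative = bitwise NOT */
        return v < 0 ? ~(unsigned)(-v) & MASK : (unsigned)v;
    }
    unsigned twos_complement(int v) {  /* negative = NOT, then +1 */
        return (unsigned)v & MASK;
    }

    int main(void) {
        printf("sign-magnitude   -1 = %u (binary 10001)\n", sign_magnitude(-1));
        printf("ones' complement -1 = %u (binary 11110)\n", ones_complement(-1));
        printf("two's complement -1 = %u (binary 11111)\n", twos_complement(-1));
        return 0;
    }

Two's complement won out largely because addition and subtraction need no special cases for the sign, and there is only one representation of zero.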

In this era, Grosch's law dominated computer design: computing power grew as the square of computer cost, so a machine that cost twice as much could be expected to deliver roughly four times the performance. This favored buying ever-larger machines.

1960s: the computer revolution and CISC

One major problem with early computers was that a program for one would not work on others. Computer companies found that their customers had little reason to remain loyal to a particular brand, as the next computer they purchased would be incompatible anyway. At that point, price and performance were usually the only concerns.

In 1962, IBM tried a new approach to designing computers. The plan was to make an entire family of computers that could all run the same software, but with different performance levels and at different prices. As users' requirements grew, they could move up to larger computers and still keep all of their investment in programs, data and storage media.

In order to do this they designed a single reference computer called the System/360 (or S/360). The System/360 was a virtual computer, a reference instruction set and capabilities that all machines in the family would support. In order to provide different classes of machines, each computer in the family would use more or less hardware emulation, and more or less microprogram emulation, to create a machine capable of running the entire System/360 instruction set.

For instance, a low-end machine could include a very simple processor for low cost. However, this would require the use of a larger microcode emulator to provide the rest of the instruction set, which would slow the machine down. A high-end machine would use a much more complex processor that could directly process more of the System/360 design, thus running a much simpler and faster emulator.
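
The trade-off can be sketched as a toy interpreter: a simple engine implements a handful of micro-operations, and a microprogram held in a control store expands each architectural instruction into a sequence of them. The opcode, micro-ops and encodings below are invented for illustration and do not reflect the actual S/360 microcode:

    /* Toy microcoded machine: each architectural opcode indexes a
     * microprogram (a list of micro-ops) held in a "control store". */
    #include <stdio.h>

    enum uop { U_LOAD_A, U_LOAD_B, U_ADD, U_STORE, U_END };

    /* A memory-to-memory ADD expands into several micro-ops on an
     * engine that can only load, add registers, and store. */
    static const enum uop ucode_add_mem[] =
        { U_LOAD_A, U_LOAD_B, U_ADD, U_STORE, U_END };

    static int mem[16] = { [0] = 7, [1] = 5 };
    static int reg_a, reg_b;

    static void run_microprogram(const enum uop *up, int s1, int s2, int d) {
        for (; *up != U_END; up++) {
            switch (*up) {
            case U_LOAD_A: reg_a = mem[s1];        break;
            case U_LOAD_B: reg_b = mem[s2];        break;
            case U_ADD:    reg_a = reg_a + reg_b;  break;
            case U_STORE:  mem[d] = reg_a;         break;
            default: break;
            }
        }
    }

    int main(void) {
        /* One "complex" instruction: mem[2] = mem[0] + mem[1] */
        run_microprogram(ucode_add_mem, 0, 1, 2);
        printf("mem[2] = %d\n", mem[2]);   /* prints 12 */
        return 0;
    }

A low-end machine corresponds to a small engine with long microprograms; a high-end machine implements more of the architecture directly, so its microprograms shrink.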

IBM chose to make the reference instruction set quite complex, and very capable. This was a conscious choice. Even though the computer was complex, its "control store" containing the microprogram would stay relatively small, and could be made with very fast memory. Another important effect was that a single instruction could describe quite a complex sequence of operations. Thus the computers would generally have to fetch fewer instructions from the main memory, which could be made slower, smaller and less expensive for a given combination of speed and price.

As the S/360 was to be a successor to both scientific machines like the 7090 and data processing machines like the 1401, it needed a design that could reasonably support all forms of processing. Hence the instruction set was designed to manipulate not just simple binary numbers, but text, scientific floating-point (similar to the numbers used in a calculator), and the binary coded decimal arithmetic needed by accounting systems.

Almost all following computers included these innovations in some form. This basic set of features is now called a "complex instruction set computer" (CISC), a term not invented until many years later.

In many CISCs, an instruction could access either registers or memory, usually in several different ways. This made the CISCs easier to program, because a programmer could remember just thirty to a hundred instructions, and a set of three to ten addressing modes, rather than thousands of distinct instructions. This was called an "orthogonal instruction set." The PDP-11 and Motorola 68000 architectures are examples of nearly orthogonal instruction sets.
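
As an illustration, the sketch below models a few classic addressing modes in C; the mode names and machine state are generic, not those of the PDP-11 or 68000. Orthogonality means any operation (add, move, compare, and so on) can fetch its operands through any of these modes:

    /* Generic sketch of operand fetch under classic addressing modes. */
    #include <stdio.h>

    enum mode { IMMEDIATE, REGISTER, REG_INDIRECT, INDEXED };

    static int mem[32] = { [10] = 111, [14] = 222 };
    static int regs[8] = { [3] = 10 };

    static int fetch_operand(enum mode m, int field) {
        switch (m) {
        case IMMEDIATE:    return field;                 /* value in instruction */
        case REGISTER:     return regs[field];           /* value in a register  */
        case REG_INDIRECT: return mem[regs[field]];      /* register holds addr  */
        case INDEXED:      return mem[regs[3] + field];  /* base reg + offset    */
        }
        return 0;
    }

    int main(void) {
        printf("%d %d %d %d\n",
               fetch_operand(IMMEDIATE, 42),    /* 42  */
               fetch_operand(REGISTER, 3),      /* 10  */
               fetch_operand(REG_INDIRECT, 3),  /* 111 */
               fetch_operand(INDEXED, 4));      /* 222 */
        return 0;
    }

In an orthogonal set, the instruction and the addressing mode are independent fields, so the programmer learns the two small lists separately rather than one large combined list.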

There was also the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation, and Honeywell) that competed against IBM at this time, though IBM dominated the era with the S/360.

The Burroughs Corporation (which later merged with Sperry/Univac to become Unisys) offered an alternative to the S/360 with their B5000 series machines. In 1961, the B5000 had virtual memory, symmetric multiprocessing, a multi-programming operating system (Master Control Program, or MCP) written in ALGOL 60, and the industry's first recursive-descent compilers as early as 1963.

1970s: Large Scale Integration

In the 1960s, the Apollo guidance computer and the Minuteman missile made the integrated circuit economical and practical.

Around 1971, the first calculator and clock chips began to show that very small computers might be possible. The first microprocessor was the Intel 4004, designed in 1971 for a calculator company (Busicom) and produced by Intel. In 1972, Intel introduced a microprocessor having a different architecture: the 8008. The 8008 is the direct ancestor of the current Core i7, even now maintaining code compatibility: every instruction of the 8008's instruction set has a direct equivalent in the Intel Core i7's much larger instruction set, although the opcode values are different.

By the mid-1970s, the use of integrated circuits in computers was commonplace. The whole decade was one of upheaval caused by the shrinking price of transistors.

It became possible to put an entire CPU on a single printed circuit board. The result was that minicomputers, usually with 16-bit words and 4K to 64K of memory, became commonplace.

CISCs were believed to be the most powerful types of computers, because their microcode was small and could be stored in very high-speed memory. The CISC architecture also addressed the "semantic gap" as it was perceived at the time: the distance between the machine language and the higher-level languages people used to program a machine. It was felt that compilers could do a better job with a richer instruction set.

Custom CISCs were commonly constructed using "bit slice" computer logic such as the AMD 2900 chips, with custom microcode. A bit slice component is a piece of an ALU, register file or microsequencer. Most bit-slice integrated circuits were 4 bits wide.

By the early 1970s, the PDP-11 was developed, arguably the most advanced small computer of its day. Almost immediately, wider-word CISCs were introduced: the 32-bit VAX and the 36-bit PDP-10.

Also, to control a cruise missile, Intel developed a more capable version of its 8008 microprocessor, the 8080.

IBM continued to make large, fast computers. However, the definition of large and fast now meant more than a megabyte of RAM, clock speeds near one megahertz, and tens of megabytes of disk storage.

IBM's System/370 was a version of the 360 tweaked to run virtual computing environments. The virtual computer was developed in order to reduce the possibility of an unrecoverable software failure.

The Burroughs B5000/B6000/B7000 series reached its largest market share. It was a stack computer whose OS was programmed in a dialect of Algol.

All these different developments competed for market share.

Early 1980s: the lessons of RISC

In the early 1980s, researchers at UC Berkeley and IBM both discovered that most computer language compilers and interpreters used only a small subset of the instructions of a CISC. Much of the power of the CPU was simply being ignored in real-world use. They realized that by making the computer simpler and less orthogonal, they could make it faster and less expensive at the same time.

At the same time, CPU calculation became faster in relation to the time for necessary memory accesses. Designers also experimented with using large sets of internal registers. The idea was to cache intermediate results in the registers under the control of the compiler. This also reduced the number of addressing modes and orthogonality.

The computer designs based on this theory were called Reduced Instruction Set Computers, or RISC. RISCs generally had larger numbers of registers, accessed by simpler instructions, with a few instructions specifically to load and store data to memory. The result was a very simple core CPU running at very high speed, supporting the exact sorts of operations the compilers were using anyway.

A common variation on the RISC design employs the Harvard architecture, as opposed to the Von Neumann or stored-program architecture common to most other designs. In a Harvard architecture machine, the program and data occupy separate memory devices and can be accessed simultaneously. In Von Neumann machines, the data and programs are mixed in a single memory device, requiring sequential accessing, which produces the so-called "Von Neumann bottleneck."

One downside to the RISC design has been that the programs that run on them tend to be larger. This is because compilers have to generate longer sequences of the simpler instructions to accomplish the same results. Since these instructions need to be loaded from memory anyway, the larger code size offsets some of the RISC design's fast memory handling.

Recently, engineers have found ways to compress the reduced instruction sets so they fit in even smaller memory systems than CISCs. Examples of such compression schemes include the ARM's "Thumb" instruction set. In applications that do not need to run older binary software, compressed RISCs are coming to dominate sales.

Another approach to RISCs was the MISC, "niladic", or "zero-operand" instruction set. This approach realized that the majority of space in an instruction was used to identify its operands. These machines placed the operands on a push-down (last-in, first-out) stack. The instruction set was supplemented with a few instructions to fetch and store memory. Most such machines used simple caching to provide extremely fast RISC machines, with very compact code. Another benefit was that the interrupt latencies were extremely small, smaller than most CISC machines (a rare trait in RISC machines). The Burroughs large systems architecture used this approach. The B5000 was designed in 1961, long before the term "RISC" was invented. The architecture puts six 8-bit instructions in a 48-bit word, and was a precursor to VLIW design (see below: VLIW and EPIC).
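
A zero-operand encoding can be made concrete with a toy stack machine: because operands are implicit (always the top of the stack), each instruction needs little more than an opcode. The five-instruction machine below is invented for illustration:

    /* Toy zero-operand (stack) machine. Operands are implicit, so most
     * instructions carry no register or address fields at all. */
    #include <stdio.h>

    enum op { PUSH, ADD, MUL, PRINT, HALT };

    struct insn { enum op op; int imm; };   /* imm used only by PUSH */

    static int stack[64], sp;               /* sp = next free slot */

    static void run(const struct insn *p) {
        for (;; p++) {
            switch (p->op) {
            case PUSH:  stack[sp++] = p->imm;            break;
            case ADD:   sp--; stack[sp-1] += stack[sp];  break;
            case MUL:   sp--; stack[sp-1] *= stack[sp];  break;
            case PRINT: printf("%d\n", stack[sp-1]);     break;
            case HALT:  return;
            }
        }
    }

    int main(void) {
        /* Computes (2 + 3) * 4; prints 20. */
        const struct insn prog[] = {
            { PUSH, 2 }, { PUSH, 3 }, { ADD, 0 },
            { PUSH, 4 }, { MUL, 0 }, { PRINT, 0 }, { HALT, 0 },
        };
        run(prog);
        return 0;
    }

With no operand fields to encode, opcodes can be packed very densely, which is how the B5000 fit six instructions into one 48-bit word.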

The Burroughs architecture was one of the inspirations for Charles H. Moore's Forth programming language, which in turn inspired his later MISC chip designs. For example, his f20 cores had 31 5-bit instructions, which fit four to a 20-bit word.

RISC chips now dominate the market for 32-bit embedded systems. Smaller RISC chips are even becoming common in the cost-sensitive 8-bit embedded-system market. The main market for RISC CPUs has been systems that require low power or small size.

Even some CISC processors (based on architectures that were created before RISC became dominant), such as newer x86 processors, translate instructions internally into a RISC-like instruction set.

This may surprise many people, because the "market" is perceived to be desktop computers. x86 designs dominate desktop and notebook computer sales, but desktop and notebook computers are only a tiny fraction of the computers now sold. Most people in industrialised countries own more computers in embedded systems in their car and house than on their desks.

Mid-to-late 1980s: exploiting instruction level parallelism

In the mid-to-late 1980s, designers began using a technique known as "instruction pipelining", in which the processor works on multiple instructions in different stages of completion. For example, the processor may be retrieving the operands for the next instruction while calculating the result of the current one. Modern CPUs may use over a dozen such stages. MISC processors achieve single-cycle execution of instructions without the need for pipelining.
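
The overlap can be visualized with a small simulation. The sketch below uses a generic three-stage pipeline with illustrative stage names; it prints which instruction occupies each stage on each cycle, showing that N instructions finish in N + 2 cycles rather than 3N:

    /* Cycle-by-cycle view of a 3-stage pipeline. Instruction i enters
     * the first stage at cycle i, so three instructions are in flight
     * at once. */
    #include <stdio.h>

    #define N_INSNS 5

    int main(void) {
        const char *stages[] = { "FETCH", "EXEC ", "WRITE" };
        const int S = 3;
        /* With S stages, the last of N instructions retires after
         * N + S - 1 cycles instead of N * S for serial execution. */
        for (int cycle = 0; cycle < N_INSNS + S - 1; cycle++) {
            printf("cycle %d:", cycle);
            for (int s = 0; s < S; s++) {
                int insn = cycle - s;          /* insn occupying stage s */
                if (insn >= 0 && insn < N_INSNS)
                    printf("  %s i%d", stages[s], insn);
            }
            printf("\n");
        }
        return 0;
    }

Each individual instruction still takes three cycles, but one instruction completes per cycle once the pipeline is full.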

A similar idea, introduced only a few years later, was to execute multiple instructions in parallel on separate arithmetic logic units (ALUs). Instead of operating on only one instruction at a time, the CPU will look for several similar instructions that are not dependent on each other, and execute them in parallel. This approach is called superscalar processor design.

Such techniques are limited by the degree of instruction level parallelism (ILP), the number of non-dependent instructions in the program code. Some programs are able to run very well on superscalar processors due to their inherently high ILP, notably graphics. However, more general problems do not have such high ILP, which makes the achievable speedups from these techniques correspondingly lower.

Branching is one major culprit. For example, the program might add two numbers and branch to a different code segment if the number is bigger than a third number. In this case, even if the branch operation is sent to the second ALU for processing, it must still wait for the results from the addition. It thus runs no faster than if there were only one ALU. The most common solution for this type of problem is to use a type of branch prediction.
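
One widely used predictor (a classic textbook scheme, not tied to any particular processor) is a two-bit saturating counter per branch: the prediction only flips after two consecutive mispredictions, so an occasional surprise, such as a loop exit, does not disturb it. A minimal sketch, simplified to a single branch:

    /* Two-bit saturating-counter branch predictor: states 0-1 predict
     * "not taken", states 2-3 predict "taken". */
    #include <stdio.h>

    static int state = 2;                    /* start weakly "taken" */

    static int predict(void) { return state >= 2; }
    static void update(int taken) {
        if (taken  && state < 3) state++;
        if (!taken && state > 0) state--;
    }

    int main(void) {
        /* A loop branch: taken 7 times, then not taken once. */
        int outcomes[] = { 1, 1, 1, 1, 1, 1, 1, 0 };
        int n = 8, correct = 0;
        for (int i = 0; i < n; i++) {
            correct += (predict() == outcomes[i]);
            update(outcomes[i]);
        }
        printf("predicted %d of %d correctly\n", correct, n);  /* 7 of 8 */
        return 0;
    }

Real predictors index a table of such counters by branch address and often combine them with branch history, but the saturating counter is the basic building block.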

To further the efficiency of the multiple functional units available in superscalar designs, operand register dependencies were found to be another limiting factor. To minimize these dependencies, out-of-order execution of instructions was introduced. In such a scheme, instruction results that complete out of order must be re-ordered in program order by the processor for the program to be restartable after an exception. Out-of-order execution was the main advancement of the computer industry during the 1990s.
A similar concept is speculative execution, where instructions from one direction of a branch (the predicted direction) are executed before the branch direction is known. When the branch direction is known, the predicted direction and the actual direction are compared. If the predicted direction was correct, the speculatively-executed instructions and their results are kept; if it was incorrect, these instructions and their results are thrown out. Speculative execution, coupled with an accurate branch predictor, gives a large performance gain.

These advances, which were originally developed from research for RISC-style designs, allow modern CISC processors to execute twelve or more instructions per clock cycle, when traditional CISC designs could take twelve or more cycles to execute just one instruction.

The resulting instruction scheduling logic of these processors is large, complex and difficult to verify. Furthermore, the higher complexity requires more transistors, increasing power consumption and heat. In this respect RISC is superior because the instructions are simpler, have less interdependence and make superscalar implementations easier. However, as Intel has demonstrated, the concepts can be applied to a CISC design, given enough time and money.
Historical note: some of these techniques (e.g. pipelining) were originally developed in the late 1950s by IBM on their Stretch mainframe computer.

VLIW and EPIC

The instruction scheduling logic that makes a superscalar processor is just boolean logic. In the early 1990s, a significant innovation was to realize that the coordination of a multiple-ALU computer could be moved into the compiler, the software that translates a programmer's instructions into machine-level instructions.

This type of computer is called a very long instruction word (VLIW) computer.
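
The format can be sketched as a struct: one slot per functional unit, with all slots issued in the same cycle. The two-ALU word and operation names below are invented for illustration; the key point is that the compiler, not the hardware, guarantees the slots are independent:

    /* Sketch of a VLIW "long word" with one slot per ALU. */
    #include <stdio.h>

    enum alu_op { NOP, ADD, MUL };

    struct vliw_word { enum alu_op op0; int d0, a0, b0;
                       enum alu_op op1; int d1, a1, b1; };

    static int r[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

    static int alu(enum alu_op op, int a, int b) {
        return op == ADD ? a + b : op == MUL ? a * b : 0;
    }

    int main(void) {
        /* One cycle: ALU0 computes r0 = r1 + r2 while ALU1 computes
         * r3 = r4 * r5. Both slots read before either writes, since
         * they "issue" together. */
        struct vliw_word w = { ADD, 0, 1, 2, MUL, 3, 4, 5 };
        int t0 = alu(w.op0, r[w.a0], r[w.b0]);
        int t1 = alu(w.op1, r[w.a1], r[w.b1]);
        if (w.op0 != NOP) r[w.d0] = t0;
        if (w.op1 != NOP) r[w.d1] = t1;
        printf("r0=%d r3=%d\n", r[0], r[3]);   /* r0=3, r3=20 */
        return 0;
    }

When the compiler cannot find two independent operations, it must fill a slot with a NOP, which is one reason VLIW code tends to be bulky.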

Statically scheduling the instructions in the compiler (as opposed to letting the processor do the scheduling dynamically) can reduce CPU complexity. This can improve performance, reduce heat, and reduce cost.

Unfortunately, the compiler lacks accurate knowledge of runtime scheduling issues. Merely changing the CPU core frequency multiplier will have an effect on scheduling. Actual operation of the program, as determined by input data, will have major effects on scheduling. To overcome these severe problems a VLIW system may be enhanced by adding the normal dynamic scheduling, losing some of the VLIW advantages.

Static scheduling in the compiler also assumes that dynamically generated code will be uncommon. Prior to the creation of Java, this was in fact true. It was reasonable to assume that slow compiles would only affect software developers. Now, with JIT virtual machines for Java and .NET, slow code generation affects users as well.

There were several unsuccessful attempts to commercialize VLIW. The basic problem is that a VLIW computer does not scale to different price and performance points, as a dynamically scheduled computer can. Another issue is that compiler design for VLIW computers is extremely difficult, and the current crop of compilers (as of 2005) don't always produce optimal code for these platforms.

Also, VLIW computers optimise for throughput, not low latency, so they were not attractive to engineers designing controllers and other computers embedded in machinery. The embedded systems markets had often pioneered other computer improvements by providing a large market that did not care about compatibility with older software.

In January 2000, a company called Transmeta took the interesting step of placing a compiler in the central processing unit, and making the compiler translate from a reference byte code (in their case, x86 instructions) to an internal VLIW instruction set. This approach combines the hardware simplicity, low power and speed of VLIW RISC with the compact main memory system and software reverse-compatibility provided by popular CISC.

Intel's Itanium chip is based on what they call an Explicitly Parallel Instruction Computing (EPIC) design. This design supposedly provides the VLIW advantage of increased instruction throughput. However, it avoids some of the issues of scaling and complexity by explicitly providing, in each "bundle" of instructions, information concerning their dependencies. This information is calculated by the compiler, as it would be in a VLIW design. The early versions were also backward-compatible with current x86 software by means of an on-chip emulation mode. Integer performance was disappointing, and despite improvements, sales in volume markets continued to be low.

Multi-threading

Current designs work best when the computer is running only a single program. However, nearly all modern operating systems allow the user to run multiple programs at the same time. For the CPU to change over and do work on another program requires expensive context switching. In contrast, multi-threaded CPUs can handle instructions from multiple programs at once.

To do this, such CPUs include several sets of registers. When a context switch occurs, the contents of the "working registers" are simply copied into one of a set of registers for this purpose.

Such designs often include thousands of registers instead of hundreds as in a typical design. On the downside, registers tend to be somewhat expensive in chip space needed to implement them. This chip space might otherwise be used for some other purpose.
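
The idea can be sketched as hardware register banks selected per thread: a context switch then just changes which bank the core points at, rather than copying state through memory. The structure below is a simplification with invented names and sizes:

    /* Hardware multi-threading sketch: one register bank per thread. */
    #include <stdio.h>

    #define N_THREADS 4
    #define N_REGS    16

    static int banks[N_THREADS][N_REGS];   /* replicated register sets */
    static int *working_regs;              /* bank of the running thread */

    static void switch_to(int thread) {
        working_regs = banks[thread];      /* O(1): no memory copy */
    }

    int main(void) {
        switch_to(0);
        working_regs[5] = 111;             /* thread 0's r5 */
        switch_to(1);
        working_regs[5] = 222;             /* thread 1's r5: separate bank */
        switch_to(0);
        printf("thread 0 r5 = %d\n", working_regs[5]);   /* still 111 */
        return 0;
    }

The cost is visible in the arrays above: every architectural register is replicated once per supported thread, which is chip area that cannot be spent on caches or functional units.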

Multi-core

Multi-core CPUs are typically multiple CPU cores on the same die, connected to each other via a shared L2 or L3 cache, an on-die bus, or an on-die crossbar switch. All the CPU cores on the die share interconnect components with which to interface to other processors and the rest of the system. These components may include a front side bus interface, a memory controller to interface with DRAM, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. The terms multi-core and MPU (which stands for Micro-Processor Unit) have come into general usage for a single die that contains multiple CPU cores.

Intelligent RAM

One way to work around the Von Neumann bottleneck is to mix a processor and DRAM all on one chip.
  • The Berkeley IRAM Project
  • eDRAM
  • computational RAM

Reconfigurable logic

Another track of development is to combine reconfigurable logic with a general-purpose CPU. In this scheme, a special computer language compiles fast-running subroutines into a bit-mask to configure the logic. Slower, or less critical, parts of the program can be run by sharing their time on the CPU. This process has the capability to create devices such as software radios, by using digital signal processing to perform functions usually performed by analog electronics.

Open source processors

As the lines between hardware and software increasingly blur due to progress in design methodology and the availability of chips such as FPGAs and cheaper production processes, even open source hardware has begun to appear. Loosely knit communities like OpenCores have recently announced completely open CPU architectures such as the OpenRISC, which can be readily implemented on FPGAs or in custom produced chips, by anyone, without paying license fees, and even established processor manufacturers like Sun Microsystems have released processor designs (e.g. OpenSPARC) under open-source licenses.

Asynchronous CPUs

Yet another possibility is the "clockless CPU" (asynchronous CPU). Unlike conventional processors, clockless processors have no central clock to coordinate the progress of data through the pipeline.
Instead, stages of the CPU are coordinated using logic devices called "pipeline controls" or "FIFO sequencers." Basically, the pipeline controller clocks the next stage of logic when the existing stage is complete. In this way, a central clock is unnecessary.

It might be easier to implement high performance devices in asynchronous logic as opposed to clocked logic:
  • Components can run at different speeds in the clockless CPU. In a clocked CPU, no component can run faster than the clock rate.
  • In a clocked CPU, the clock can go no faster than the worst-case performance of the slowest stage. In a clockless CPU, when a stage finishes faster than normal, the next stage can immediately take the results rather than waiting for the next clock tick. A stage might finish faster than normal because of the particular data inputs (multiplication can be very fast if it is multiplying by 0 or 1), or because it is running at a higher voltage or lower temperature than normal.

Asynchronous logic proponents believe these capabilities would have these benefits:
  • lower power dissipation for a given performance level
  • highest possible execution speeds


The biggest disadvantage of the clockless CPU is that most CPU design tools assume a clocked CPU (a synchronous circuit), so making a clockless CPU (designing an asynchronous circuit) involves modifying the design tools to handle clockless logic and doing extra testing to ensure the design avoids metastability problems.

Even so, several asynchronous CPUs have been built, including
  • the ORDVAC and the identical ILLIAC I (1951)
  • the ILLIAC II (1962), the fastest computer in the world at the time
  • the Caltech Asynchronous Microprocessor, the world's first asynchronous microprocessor (1988)
  • the ARM-implementing AMULET (1993 and 2000)
  • the asynchronous implementation of the MIPS R3000, dubbed MiniMIPS (1998)
  • the SEAforth multi-core processor from Charles H. Moore

Optical communication

One interesting possibility would be to eliminate the front side bus. Modern vertical laser diodes enable this change. In theory, an optical computer's components could directly connect through a holographic or phased open-air switching system. This would provide a large increase in effective speed and design flexibility, and a large reduction in cost. Since a computer's connectors are also its most likely failure point, a busless system might be more reliable, as well.

In addition, current (2010) processors use 64- or 128-bit logic. Wavelength superposition could allow for data lanes and logic capacity many orders of magnitude greater, without additional space or copper wires.

Optical processors

Another farther-term possibility is to use light instead of electricity for the digital logic itself.
In theory, this could run about 30% faster and use less power, as well as permit a direct interface with quantum computational devices.
The chief problem with this approach is that for the foreseeable future, electronic devices are faster, smaller (i.e. cheaper) and more reliable.
An important theoretical problem is that electronic computational elements are already smaller than some wavelengths of light, and therefore even wave-guide based optical logic may be uneconomic compared to electronic logic.
The majority of development effort, as of 2006, is focused on electronic circuitry.
See also optical computing.

Timeline of events

  • 1971. Intel released the Intel 4004, the world's first commercially available microprocessor.
  • 1977. First VAX sold, a VAX-11/780.
  • 1978. Intel introduces the Intel 8086 and Intel 8088, the first x86 chips.
  • 1981. Stanford MIPS introduced, one of the first RISC designs.
  • 1982. Intel introduces the Intel 80286, which was the first Intel processor that could run all the software written for its predecessors, the 8086 and 8088.
  • 1985. Intel introduces the Intel 80386, which adds a 32-bit instruction set to the x86 microarchitecture.
  • 1989. Intel introduces the Intel 80486.
  • 1993. Intel launches the original Pentium microprocessor, the first processor with an x86 superscalar microarchitecture.
  • 1995. Intel introduces the Pentium Pro, which becomes the foundation for the Pentium II, Pentium III, Pentium M, and Intel Core architectures.
  • 2000. AMD announced the x86-64 extension to the x86 microarchitecture.
  • 2000. Analog Devices introduces the Blackfin architecture.
  • 2000. Intel introduces NetBurst, which would become the foundation of all Pentium 4 variants and later be phased out in favor of the Core architecture.
  • 2002. Intel releases a Pentium 4 with Hyper-Threading, the first modern desktop processor to implement simultaneous multithreading (SMT).
  • 2003. Intel introduces the Pentium M, a low-power mobile derivative of the Pentium Pro architecture.
  • 2005. AMD announced the Athlon 64 X2, the first x86 dual-core processor.
  • 2006. Intel introduces the Core line of CPUs based on a modified Pentium M design.
  • 2008. About ten billion CPUs were manufactured in 2008.
  • 2010. Intel introduced the Core i3, i5, and i7 processors.
