XCore XS1
The XCore XS1 is a 32-bit RISC microprocessor architecture designed by XMOS. The architecture is designed to be used for embedded systems, and instruction encoding is compact, using 16 bits for frequently used instructions (with up to three operands) and 32 bits for less frequently used instructions (with up to six operands).

Almost all instructions execute in a single cycle, and the architecture is event-driven in order to decouple the timings that a program needs to make from the execution speed of the program. A program will normally perform its computations and then wait for an event (e.g. a message, time, or external I/O event) before continuing.

Processors with this architecture include the XCore XS1-G4 and XCore XS1-L1.

Architecture

The architecture comprises a central execution unit that operates on a set of 25 registers and is surrounded by a number of _resources_ that perform operations that interact with the environment. Each thread has its own set of hardware registers, enabling threads to execute concurrently.
The instruction set comprises both a (more or less standard) sequential programming model, and instructions that implement multi-threading, multi-core and I/O operations.

Instruction encoding

Instructions can use between zero and six operands. Most common arithmetic operations (such as ADD, SUB, MULT) are three-operand instructions based on a set of 12 general purpose registers. Three operands can be identified using no more than 11 bits, enabling a set of 13 frequently used 3-operand instructions to be encoded in 16 bits. Other instructions that are encoded in 16 bits are branch operations and common loads and stores.
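The 11-bit figure follows from counting: three operands drawn from 12 registers give 12^3 = 1728 combinations, which is fewer than 2^11 = 2048. The sketch below checks this with a plain base-12 packing. The packing scheme and the function names are illustrative assumptions, not the actual XS1 bit layout.

```python
NUM_REGS = 12        # general purpose registers R0..R11
OPERAND_BITS = 11    # bits available for three register operands

def pack_operands(a, b, c):
    """Pack three register numbers into one integer using base-12 digits.

    Illustrative scheme only; the real XS1 encoding may arrange the bits
    differently.
    """
    for r in (a, b, c):
        if not 0 <= r < NUM_REGS:
            raise ValueError(f"register out of range: {r}")
    return (a * NUM_REGS + b) * NUM_REGS + c

def unpack_operands(value):
    """Inverse of pack_operands."""
    value, c = divmod(value, NUM_REGS)
    a, b = divmod(value, NUM_REGS)
    return a, b, c

# The largest packed value is 12**3 - 1 = 1727; 11 bits can hold up to 2047.
largest = pack_operands(11, 11, 11)
fits = largest < 2 ** OPERAND_BITS
opcode_bits = 16 - OPERAND_BITS   # bits left over in a 16-bit instruction
```

With 11 bits spent on operands, 5 bits of a 16-bit instruction remain to select the operation.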

Less frequently used instructions are encoded in 32 bits. These include instructions that operate on long immediate operands (such as far branches), instructions with a large number of operands (for example long multiply, which has four source and two destination operands), and rarely used instructions with fewer operands.

Sequential programming model

Each thread has access to 12 general purpose registers, R0 to R11. In addition there are four special purpose registers: SP (stack pointer), LR (link register, which stores the return address), CP (constant pool pointer, which points to a part of memory that stores constants) and DP (data pointer, which points to global variables). Beyond those 16 there are another 9 registers that store the PC, kernel PC, exception type, exception data, and saved copies of all those in case of an exception or interrupt. The instruction set is a load-store instruction set.

Almost all instructions execute in a single cycle. If an instruction does not need data from memory (for example, an arithmetic operation), the processor uses that cycle to prefetch a word of instructions. Because most instructions are encoded in 16 bits, and because most instructions are not loads or stores (a typical mix is 20% loads and stores, 80% other instructions), the prefetch mechanism can stay ahead of the instruction stream. This acts like a very small instruction cache, but its behaviour can be predicted at compile time, making timing behaviour as predictable as functional behaviour.
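The claim that prefetching keeps ahead of execution can be checked with a toy model. The sketch below is not a description of the real pipeline: the buffer capacity, the initial buffer contents, and the rule of one word fetched per non-memory instruction are all modelling assumptions, and `count_stalls` is a hypothetical helper.

```python
def count_stalls(stream, word_halfwords=2, buffer_cap=4):
    """Count cycles lost waiting for instruction text in a toy prefetch model.

    stream: iterable of (size_in_halfwords, is_memory_op) tuples.
    Assumptions (not taken from XS1 documentation): memory delivers one
    32-bit word per fetch, the buffer holds at most buffer_cap halfwords,
    and the buffer starts with one word already fetched.
    """
    buffered = word_halfwords
    stalls = 0
    for size, is_memory_op in stream:
        # If the instruction is not fully buffered, spend extra cycles fetching.
        while buffered < size:
            buffered = min(buffer_cap, buffered + word_halfwords)
            stalls += 1
        buffered -= size
        # A non-memory instruction leaves the memory port free for a prefetch.
        if not is_memory_op:
            buffered = min(buffer_cap, buffered + word_halfwords)
    return stalls

# 80% non-memory instructions, 20% loads/stores, all 16-bit (one halfword).
typical = ([(1, False)] * 4 + [(1, True)]) * 20
# Worst case for this model: nothing but 16-bit loads and stores.
all_memory = [(1, True)] * 10

typical_stalls = count_stalls(typical)
all_memory_stalls = count_stalls(all_memory)
```

In this model the typical mix never waits for instruction text, while a stream made only of loads and stores does, which is the intuition behind the 20%/80% argument.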

Instructions that access memory all use a base register: SP, DP, CP, PC or any general purpose register. In a single 16-bit instruction a thread can access:
  • Up to 64 words relative to the stack pointer (read or write, word access only)
  • Up to 64 words relative to the data pointer (read or write, word access only)
  • Up to 64 words relative to the constant pointer (read only, word access only)
  • Up to 12 words relative to any general purpose register (read and write, word access only)
  • An indexed word using any two general purpose registers
  • An indexed 16-bit quantity using any two general purpose registers
  • An indexed byte using any two general purpose registers

Larger sections of memory can be accessed by means of extended instructions, which extend the above ranges to 64 KBytes.
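The ranges above can be summarised as a small lookup. This sketch assumes that "up to N words" means word offsets 0 to N-1, that offsets are non-negative, and that the extended form applies to every base register; the function name and the base-register labels are hypothetical.

```python
# Short-form reach in words, taken from the list above.
SHORT_RANGE_WORDS = {"sp": 64, "dp": 64, "cp": 64, "gpr": 12}
EXTENDED_RANGE_BYTES = 64 * 1024   # reach of the extended (32-bit) form
WORD_BYTES = 4

def encoding_bits(base, word_offset):
    """Return 16 or 32 for the instruction size needed, or None if unreachable.

    Assumes zero-based, non-negative word offsets (an interpretation of the
    text, not a statement from the architecture manual).
    """
    if base not in SHORT_RANGE_WORDS:
        raise ValueError(f"unknown base register class: {base}")
    if word_offset < 0:
        return None
    if word_offset < SHORT_RANGE_WORDS[base]:
        return 16
    if word_offset * WORD_BYTES < EXTENDED_RANGE_BYTES:
        return 32
    return None
```

For example, under these assumptions `encoding_bits("sp", 63)` selects the short form, while `encoding_bits("sp", 64)` needs the extended form.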

This scheme is designed to densely encode the common cases found in many programming patterns: access to small stack frames, a small set of globals and constants, structures, and arrays. Access to bit fields that have an odd length is facilitated by means of sign and zero extend instructions.

All common arithmetic instructions are provided, including divide and remainder (which are the only instructions that are not single cycle). Comparison instructions compute a truth value (0 or 1) into a register, avoiding the use of flags. Many instructions have an immediate version that allows a single operand with a value between 0 and 11 inclusive, encoding many common cases such as "i = i + 1". In the case of bit operations such as shift, the immediate value encodes common cases. Extra instructions are provided for reversing bits and bytes, counting leading zeros, digital signal processing, and long integer arithmetic.
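As a reference for what some of these operations compute, the sketch below models them on 32-bit values. The function names are hypothetical and the definitions are the conventional ones; the exact XS1 semantics (for example the result of counting leading zeros on zero) should be checked against the architecture manual.

```python
MASK32 = 0xFFFFFFFF

def bit_reverse(x):
    """Reverse the order of the 32 bits in x."""
    return int(f"{x & MASK32:032b}"[::-1], 2)

def byte_reverse(x):
    """Reverse the order of the 4 bytes in x (endianness swap)."""
    return int.from_bytes((x & MASK32).to_bytes(4, "little"), "big")

def count_leading_zeros(x):
    """Number of zero bits above the most significant set bit (32 for x == 0)."""
    return 32 - (x & MASK32).bit_length()

def less_than(a, b):
    """Comparison that yields a truth value (0 or 1) as a result, not a flag."""
    return 1 if a < b else 0
```

Because a comparison leaves an ordinary 0 or 1 in a register, its result can be stored, combined with other values, or tested by a later conditional branch like any other data.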

The branch instructions include conditional and unconditional relative
branches. A branch using the address in a register is provided; a
relative branch which adds a scaled register operand to the program
counter is provided to support jump tables. Branches to up to instructions distance are encoded in a single word.
The procedure calling instructions include relative calls, calls via the
constant pool, indexed calls via a dedicated register and calls via a
register. Most calls within a single program module can be encoded in a
single instruction; inter-module calling requires at most two instructions.
It is up to the callee to save the link register if it is not a leaf function; a single instruction extends the stack and saves the link register.

Parallel Programming Model

The XS1 instruction set is designed to support both multi-threaded and multi-core computations. To this end it supports channel communication (to support distributed memory computations) and barriers and locks (to support shared memory computations).
A thread initiates execution on one or more newly
allocated threads by setting their initial register values.

Communication between threads is performed using channels that provide full-duplex data transfer between channel ends. This enables, amongst others, the implementation of CSP-based languages and languages based on the Pi calculus. The instruction set is agnostic as to where a channel is connected, whether that is inside a core or outside the core. Channels carry messages constructed from data and control tokens between the two channel ends. The control tokens can be used to encode communication protocols.

Channel ends have a buffer able to hold sufficient tokens to
allow at least one word to be buffered. If an output instruction
is executed when the channel is too full to take the data then
the thread which executed the instruction is paused. It is
restarted when there is enough room in the channel for the
instruction to successfully complete. Likewise, when
an input instruction is executed and there is not enough data
available then the thread is paused and will be restarted
when enough data becomes available.
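The pausing behaviour described above resembles a bounded blocking queue. The sketch below models one direction of a channel with Python's `queue.Queue`; treating a token as one byte and sizing the buffer at exactly one word are assumptions made for illustration, and `run_channel_demo` is a hypothetical name.

```python
import queue
import threading

TOKENS_PER_WORD = 4   # assumption: one data token per byte of a 32-bit word

def run_channel_demo(words):
    """Send 32-bit words token by token through a one-word buffer."""
    channel = queue.Queue(maxsize=TOKENS_PER_WORD)
    received = []

    def sender():
        for word in words:
            for token in word.to_bytes(TOKENS_PER_WORD, "big"):
                channel.put(token)   # pauses the sender while the buffer is full

    def receiver():
        for _ in words:
            # get() pauses the receiver until a token is available.
            tokens = bytes(channel.get() for _ in range(TOKENS_PER_WORD))
            received.append(int.from_bytes(tokens, "big"))

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

result = run_channel_demo([0xDEADBEEF, 1, 2, 3])
```

Neither side polls: each thread simply stops at `put` or `get` until the other side makes progress, which mirrors the pause and restart behaviour of the hardware threads.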

A thread can, with a single instruction, synchronise with a group of threads using a barrier synchronisation. Alternatively a thread can synchronise using a lock, providing mutual exclusion. In order to communicate data when using barriers and locks, threads can either write data into the registers of another thread, or they can access memory of another thread (provided both threads execute on the same core). If shared memory is used, then the compiler or the programmer must ensure that there are no race conditions.
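The two shared-memory mechanisms can be illustrated with software analogues. The sketch below uses Python's `threading.Barrier` and `threading.Lock` in place of the hardware synchroniser and lock resources; `run_workers` and its parameters are hypothetical.

```python
import threading

def run_workers(num_threads=4, increments=1000):
    """Each worker updates a shared counter under a lock, then meets at a barrier."""
    barrier = threading.Barrier(num_threads)
    lock = threading.Lock()
    state = {"counter": 0}
    seen_after_barrier = []

    def worker():
        for _ in range(increments):
            with lock:                 # mutual exclusion avoids a race condition
                state["counter"] += 1
        barrier.wait()                 # nobody continues until all have finished
        with lock:
            seen_after_barrier.append(state["counter"])

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"], seen_after_barrier

total, seen = run_workers()
```

The lock protects each update, and the barrier guarantees that every thread observes the final count once it proceeds.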

I/O and timing instructions

The XS1 architecture is event-driven. It has an instruction that can dispatch external events in addition to traditional interrupts. If the program chooses to use events, then the underlying processor has to expect an event and wait in a specific place so that it can be handled synchronously. If desired, I/O can be handled asynchronously using interrupts. Events and interrupts can be used on any resource that the implementation supports.

Common resources that are supported are ports (for external input and output), timers (that allow timing to a reference clock), channels (that allow communication and synchronization between threads within a core, and threads on different cores), locks (which allow controlled access to shared memory), and synchronizers (which implement barrier synchronizations between threads).
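The synchronous event model can be pictured as a thread that registers a handler per resource and then waits at one place in the program. The sketch below is a toy software model, not the XS1 mechanism itself; the class, its methods, and the resource names are hypothetical.

```python
import queue

class EventWaiter:
    """Toy model of a thread that enables events on resources, then waits."""

    def __init__(self):
        self._handlers = {}
        self._ready = queue.Queue()

    def enable(self, resource, handler):
        """Associate a handler with a resource and enable its events."""
        self._handlers[resource] = handler

    def signal(self, resource, data):
        """Called by the environment when a resource becomes ready."""
        if resource in self._handlers:
            self._ready.put((resource, data))

    def wait(self):
        """Block until an enabled resource is ready, then run its handler."""
        resource, data = self._ready.get()
        return self._handlers[resource](data)

waiter = EventWaiter()
waiter.enable("timer", lambda t: f"timeout at {t}")
waiter.enable("port", lambda v: f"pin value {v}")
waiter.signal("port", 1)
waiter.signal("timer", 500)
first = waiter.wait()
second = waiter.wait()
```

The program only reacts to events inside `wait`, which is what makes the handling synchronous with program flow; an interrupt, by contrast, could arrive at any point.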

Devices

The XS1 instruction set is implemented by the XCore XS1-G4 and XCore XS1-L1. The former is a four-core processing node, the latter a single-core processing node.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.