Memory management unit
A memory management unit (MMU), sometimes called a paged memory management unit (PMMU), is a computer hardware component responsible for handling accesses to memory requested by the CPU. Its functions include translation of virtual addresses to physical addresses (i.e., virtual memory management), memory protection, cache control, bus arbitration, and, in simpler computer architectures (especially 8-bit systems), bank switching.
How it works
Modern MMUs typically divide the virtual address space (the range of addresses used by the processor) into pages, each having a size which is a power of 2, usually a few kilobytes, but they may be much larger. The bottom n bits of the address (the offset within a page) are left unchanged. The upper address bits are the (virtual) page number. The MMU normally translates virtual page numbers to physical page numbers via an associative cache called a Translation Lookaside Buffer (TLB). When the TLB lacks a translation, a slower mechanism involving hardware-specific data structures or software assistance is used. The data found in such data structures are typically called page table entries (PTEs), and the data structure itself is typically called a page table. The physical page number is combined with the page offset to give the complete physical address.
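As a concrete illustration, the split into page number and offset and the table lookup can be sketched in C for a hypothetical machine with 4 KB pages and a single-level page table. The types, table size and function names are invented for this example; a real MMU consults the TLB first and walks a hardware-defined structure on a miss.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                    /* 4 KB pages: 12 offset bits   */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1)
#define NUM_PAGES  256                   /* covers a small 1 MB virtual space */

/* Hypothetical PTE: a physical frame number plus a valid bit. */
typedef struct {
    uint32_t frame;                      /* physical page (frame) number */
    bool     valid;                      /* is a physical frame mapped?  */
} PageTableEntry;

static PageTableEntry page_table[NUM_PAGES];

/* Translate a virtual address, or return false where a real MMU would
   raise a page fault. */
static bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number */
    uint32_t offset = vaddr & PAGE_MASK;     /* unchanged low bits  */

    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;

    *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return true;
}
```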
A PTE or TLB entry may also include information about whether the page has been written to (the dirty bit), when it was last used (the accessed bit, for a least recently used page replacement algorithm), what kind of processes (user mode, supervisor mode) may read and write it, and whether it should be cached.
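A hedged sketch of how such status and permission bits might be packed into a PTE word; the bit positions chosen here are purely illustrative and do not match any particular architecture's layout.

```c
#include <stdint.h>

/* Illustrative flag layout for a 32-bit PTE; real layouts differ per CPU. */
enum {
    PTE_VALID    = 1u << 0,  /* translation is usable                     */
    PTE_WRITABLE = 1u << 1,  /* writes allowed                            */
    PTE_USER     = 1u << 2,  /* accessible from user mode, not only       */
                             /* supervisor mode                           */
    PTE_ACCESSED = 1u << 3,  /* set when the page is used                 */
    PTE_DIRTY    = 1u << 4,  /* set when the page has been written to     */
    PTE_CACHED   = 1u << 5,  /* page contents may be cached               */
};

/* Frame number stored in the high bits, flags in the low bits. */
static inline uint32_t pte_make(uint32_t frame, uint32_t flags)
{
    return (frame << 12) | flags;
}

static inline int pte_is_dirty(uint32_t pte)
{
    return (pte & PTE_DIRTY) != 0;
}
```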
Sometimes, a TLB entry or PTE prohibits access to a virtual page, perhaps because no physical random access memory has been allocated to that virtual page. In this case the MMU signals a page fault to the CPU. The operating system (OS) then handles the situation, perhaps by trying to find a spare frame of RAM and setting up a new PTE to map it to the requested virtual address. If no RAM is free, it may be necessary to choose an existing page (known as a victim), using some replacement algorithm, and save it to disk (this is called "paging"). With some MMUs, there can also be a shortage of PTEs or TLB entries, in which case the OS will have to free one for the new mapping.
In some cases a "page fault" may indicate a software bug. A key benefit of an MMU is memory protection
Memory protection
Memory protection is a way to control memory access rights on a computer, and is a part of most modern operating systems. The main purpose of memory protection is to prevent a process from accessing memory that has not been allocated to it. This prevents a bug within a process from affecting...
: an OS can use it to protect against errant programs, by disallowing access to memory that a particular program should not have access to. Typically, an OS assigns each program its own virtual address space.
An MMU also reduces the problem of fragmentation of memory. After blocks of memory have been allocated and freed, the free memory may become fragmented (discontinuous) so that the largest contiguous block of free memory may be much smaller than the total amount. With virtual memory, a contiguous range of virtual addresses can be mapped to several non-contiguous blocks of physical memory.
In some early microprocessor designs, memory management was performed by a separate integrated circuit such as the VLSI VI475, the Motorola 68851 used with the Motorola 68020 CPU in the Macintosh II, or the Z8015 used with the Zilog Z80 family of processors. Later microprocessors such as the Motorola 68030 and the Zilog Z280 placed the MMU together with the CPU on the same integrated circuit, as did the Intel 80286 and later x86 microprocessors.
While this article concentrates on modern MMUs, commonly based on pages, early systems used a similar concept for base-limit addressing that further developed into segmentation. Such mechanisms are occasionally also present on modern architectures. The x86 architecture provided segmentation rather than paging in the 80286, and provides both paging and segmentation in the 80386 and later processors (although the use of segmentation is not available in 64-bit operation).
Examples
Most modern systems divide memory into pages that are 4 KB to 64 KB in size, often with the possibility of using huge pages from 2 MB to 512 MB in size. Page translations are cached in a TLB. Some systems, mainly older RISC designs, trap into the OS when a page translation is not found in the TLB. Most systems use a hardware-based tree walker. Most systems allow the MMU to be disabled; some disable the MMU when trapping into OS code.
VAX
VAX pages are 512 bytes, which is very small. An OS may treat multiple pages as if they were a single larger page; for example, Linux on VAX groups 8 pages together, so the system is viewed as having 4 KB pages. The VAX divides memory into four fixed-purpose regions, each 1 GB in size. They are:
- P0 space, which is used for general-purpose per-process memory such as heaps,
- P1 space, or control space, which is also per-process and is typically used for supervisor, executive, kernel, and user stacks and other per-process control structures managed by the operating system,
- S0 space, or system space, which is global to all processes and stores operating system code and data, whether paged or not, including pagetables,
- S1 space, which is unused and "Reserved to Digital".
Page tables are big linear arrays. Normally this would be very wasteful when addresses are used at both ends of the possible range, but the page table for applications is itself stored in the kernel's paged memory. Thus there is effectively a 2-level tree, allowing applications to have sparse memory layout without wasting lots of space on unused page table entries. The VAX MMU is notable for lacking an accessed bit. OSes which implement paging must find some way to emulate the accessed bit if they are to operate efficiently. Typically, the OS will periodically unmap pages so that page-not-present faults can be used to let the OS set an accessed bit.
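Two of the VAX-specific details above lend themselves to a short sketch: selecting one of the four 1 GB regions, and Linux's grouping of eight 512-byte hardware pages into one 4 KB software page. The use of the top two address bits to select the region is an assumption noted in the comment, not something stated explicitly in the text.

```c
#include <stdint.h>

#define VAX_HW_PAGE 512u        /* hardware page size                      */
#define LINUX_PAGE  4096u       /* eight hardware pages grouped by Linux   */

/* The four 1 GB regions are assumed to correspond to the top two address
   bits (00 = P0, 01 = P1, 10 = S0, 11 = S1). */
static const char *vax_region(uint32_t vaddr)
{
    switch (vaddr >> 30) {
    case 0:  return "P0 (per-process program region)";
    case 1:  return "P1 (per-process control region)";
    case 2:  return "S0 (system region)";
    default: return "S1 (reserved)";
    }
}

/* Which 4 KB software page an address falls in, and which of the eight
   underlying 512-byte hardware pages within that software page. */
static uint32_t sw_page(uint32_t vaddr)    { return vaddr / LINUX_PAGE; }
static uint32_t hw_subpage(uint32_t vaddr) { return (vaddr % LINUX_PAGE) / VAX_HW_PAGE; }
```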
ARM
ARM architecture-based application processors implement an MMU defined by ARM's Virtual Memory System Architecture. The current architecture defines PTEs for describing 4 KB and 64 KB pages, 1 MB sections and 16 MB super-sections; legacy versions also defined a 1 KB tiny page. The ARM MMU uses a two-level page table when using 4 KB and 64 KB pages, or just a one-level page table for 1 MB sections and 16 MB super-sections.
TLB updates are performed automatically by page-table walking hardware.
PTEs include read/write access permission based on privilege, cacheability information, an NX bit, and a non-secure bit.
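For the 4 KB small-page case, the two-level walk splits a 32-bit virtual address roughly as below. The exact bit positions (a 12-bit first-level index, an 8-bit second-level index and a 12-bit page offset) are an assumption based on the classic short-descriptor translation-table format, given here only as an illustration.

```c
#include <stdint.h>

/* Two-level split of a 32-bit virtual address for 4 KB small pages,
   assuming the classic short-descriptor translation-table format. */
typedef struct {
    uint32_t l1_index;   /* bits 31:20 - into the 4096-entry first level */
    uint32_t l2_index;   /* bits 19:12 - into the 256-entry second level */
    uint32_t offset;     /* bits 11:0  - byte within the 4 KB page       */
} ArmWalk;

static ArmWalk arm_split(uint32_t vaddr)
{
    ArmWalk w;
    w.l1_index = vaddr >> 20;
    w.l2_index = (vaddr >> 12) & 0xFF;
    w.offset   = vaddr & 0xFFF;
    return w;
}
```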
IBM System/370 and successors
The IBM System/370 has had an MMU since the early 1970s; it was initially known as a DAT (Dynamic Address Translation) box. It has the unusual feature of storing accessed and dirty bits outside of the page table. They refer to physical memory rather than virtual memory. They are accessed by special-purpose instructions. This reduces overhead for the OS, which would otherwise need to propagate accessed and dirty bits from the page tables to a more physically oriented data structure. This makes OS-level virtualization easier. These features have been inherited by succeeding mainframe architectures, up to the current z/Architecture.
DEC Alpha
The DEC Alpha processor divides memory into 8 KB pages. After a TLB miss, low-level firmware machine code (here called PALcode) walks a three-level tree-structured page table. Addresses are broken down as follows: 21 bits unused, 10 bits to index the root level of the tree, 10 bits to index the middle level of the tree, 10 bits to index the leaf level of the tree, and 13 bits that pass through to the physical address without modification. Full read/write/execute permission bits are supported.
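The 21/10/10/10/13 breakdown quoted above can be expressed directly in code; the structure and field names below are invented for this sketch.

```c
#include <stdint.h>

/* 8 KB pages => 13 offset bits; three 10-bit indices; top 21 bits unused. */
typedef struct {
    uint64_t root_index;    /* level-1 (root) page-table index   */
    uint64_t middle_index;  /* level-2 (middle) page-table index */
    uint64_t leaf_index;    /* level-3 (leaf) page-table index   */
    uint64_t offset;        /* byte offset within the 8 KB page  */
} AlphaWalk;

static AlphaWalk alpha_split(uint64_t vaddr)
{
    AlphaWalk w;
    w.offset       = vaddr & 0x1FFF;           /* low 13 bits  */
    w.leaf_index   = (vaddr >> 13) & 0x3FF;    /* next 10 bits */
    w.middle_index = (vaddr >> 23) & 0x3FF;    /* next 10 bits */
    w.root_index   = (vaddr >> 33) & 0x3FF;    /* next 10 bits */
    return w;                                  /* bits 63:43 are unused */
}
```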
MIPS
The MIPS architecture supports 1 to 64 entries in the TLB. The number of TLB entries is configurable at CPU configuration time, before synthesis. TLB entries are dual: each TLB entry maps a virtual page number (VPN2) to either one of two page frame numbers (PFN0 or PFN1), depending on the least significant bit of the virtual address that is not part of the page mask. This bit and the page mask bits are not stored in the VPN2. Each TLB entry has its own page size, which can be any value from 1 KB to 256 MB in multiples of 4. Each PFN in a TLB entry has a caching attribute, a dirty bit and a valid status bit. A VPN2 has a global status bit and an OS-assigned ID, which participates in the virtual-address TLB entry match if the global status bit is set to 0. A PFN stores the physical address without the page mask bits.
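A rough C model of one such dual TLB entry follows; the field widths and names are illustrative rather than an exact register layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* One output (even or odd page) of a dual MIPS-style TLB entry. */
typedef struct {
    uint64_t pfn;        /* physical frame number (no page-mask bits) */
    uint8_t  cache_attr; /* caching attribute                         */
    bool     dirty;
    bool     valid;
} TlbOutput;

/* Dual entry: one VPN2 maps two consecutive virtual pages. */
typedef struct {
    uint64_t  vpn2;      /* virtual page number / 2 (mask bits removed) */
    uint32_t  page_mask; /* selects the page size for this entry        */
    uint8_t   asid;      /* OS-assigned ID, matched unless global       */
    bool      global;
    TlbOutput even;      /* PFN0: used when the selecting bit is 0 */
    TlbOutput odd;       /* PFN1: used when the selecting bit is 1 */
} TlbEntry;

/* Pick PFN0 or PFN1 using the lowest virtual-address bit that is not
   covered by the page mask (the bit just above the page offset). */
static const TlbOutput *select_half(const TlbEntry *e, uint64_t vaddr,
                                    unsigned page_shift)
{
    return ((vaddr >> page_shift) & 1) ? &e->odd : &e->even;
}
```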
A TLB Refill exception is generated when there are no entries in the TLB that match the mapped virtual address. A TLB Invalid exception is generated when there is a match but the entry is marked invalid. A TLB Modified exception is generated when there is a match on a store but the dirty status is not set. If a TLB exception occurs while a TLB exception is already being processed (a double-fault TLB exception), it is dispatched to its own exception handler.
MIPS32 and MIPS32r2 support 32 bits of virtual address space and up to 36 bits of physical address space.
MIPS64 supports up to 64 bits of virtual address space and up to 59 bits of physical address space.
Sun 1
The original Sun 1 was a single-board computer built around the Motorola 68000 microprocessor and introduced in 1982. It included the original Sun 1 memory management unit, which provided address translation, memory protection, memory sharing and memory allocation for multiple processes running on the CPU. All access by the CPU to private on-board RAM, external Multibus memory, on-board I/O and Multibus I/O ran through the MMU, where it was translated and protected in a uniform fashion. The MMU was implemented in hardware on the CPU board.
The MMU consisted of a context register, a segment map and a page map. Virtual addresses from the CPU were translated into intermediate addresses by the segment map, which in turn were translated into physical addresses by the page map. The page size was 2 KB and the segment size was 32 KB, which gave 16 pages per segment. Up to 16 contexts could be mapped concurrently. The maximum logical address space for a context was 1024 pages, or 2 MB. The maximum physical address space that could be mapped simultaneously was also 2 MB.
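Using the figures above (2 KB pages, 32 KB segments, a 2 MB per-context space and a 4-bit context register), the two-stage lookup can be sketched as follows. The map dimensions and the way the segment map indexes the page map are inferred from those figures and are only illustrative, not a description of the actual Sun 1 hardware tables.

```c
#include <stdint.h>

#define PAGE_SHIFT    11                /* 2 KB pages                */
#define SEG_SHIFT     15                /* 32 KB segments            */
#define PAGES_PER_SEG 16
#define SEGS_PER_CTX  64                /* 2 MB / 32 KB              */
#define NUM_CONTEXTS  16                /* 4-bit context register    */

/* Segment map: per context, per segment, a group number into the page map. */
static uint16_t segment_map[NUM_CONTEXTS][SEGS_PER_CTX];
/* Page map: each group holds 16 entries giving physical page numbers. */
static uint16_t page_map[NUM_CONTEXTS * SEGS_PER_CTX][PAGES_PER_SEG];

static uint32_t sun1_translate(unsigned context, uint32_t vaddr)
{
    uint32_t seg    = (vaddr >> SEG_SHIFT) & (SEGS_PER_CTX - 1);
    uint32_t page   = (vaddr >> PAGE_SHIFT) & (PAGES_PER_SEG - 1);
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    uint16_t group  = segment_map[context & 0xF][seg];   /* intermediate  */
    uint16_t pframe = page_map[group][page];             /* physical page */

    return ((uint32_t)pframe << PAGE_SHIFT) | offset;
}
```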
The context register was important in a multitasking operating system because it allowed the CPU to switch between processes without reloading all the translation state information. The 4-bit context register could switch between 16 sections of the segment map under supervisor control, which allowed 16 contexts to be mapped concurrently. Each context had its own virtual address space. Sharing of virtual address space and inter-context communication could be provided by writing the same values into the segment or page maps of different contexts. Additional contexts could be handled by treating the segment map as a context cache and replacing out-of-date contexts on a least-recently-used basis.
The context register made no distinction between user and supervisor states; interrupts and traps did not switch contexts, which required that all valid interrupt vectors always be mapped in page 0 of the context, as well as the valid supervisor stack.
PowerPC
In PowerPC G1, G2, G3, and G4 processors, pages are normally 4 KB. After a TLB miss, the standard PowerPC MMU begins two simultaneous lookups. One lookup attempts to match the address with one of 4 or 8 Data Block Address Translation (DBAT) registers, or 4 or 8 Instruction Block Address Translation (IBAT) registers, as appropriate. The BAT registers can map linear chunks of memory as large as 256 MB, and are normally used by an OS to map large portions of the address space for the OS kernel's own use. If the BAT lookup succeeds, the other lookup is halted and ignored.
The other lookup, not directly supported by all processors in this family, is via a so-called "inverted page table", which acts as a hashed off-chip extension of the TLB. First, the top 4 bits of the address are used to select one of 16 segment registers. Then 24 bits from the segment register replace those 4 bits, producing a 52-bit address. The use of segment registers allows multiple processes to share the same hash table. The 52-bit address is hashed, then used as an index into the off-chip table. There, a group of 8 page table entries is scanned for one that matches. If none match, due to excessive hash collisions, the processor tries again with a slightly different hash function. If this too fails, the CPU traps into the OS (with the MMU disabled) so that the problem may be resolved. The OS needs to discard an entry from the hash table to make space for a new entry. The OS may generate the new entry from a more conventional tree-like page table or from per-mapping data structures, which are likely to be slower and more space-efficient. Support for no-execute control is in the segment registers, leading to 256 MB granularity.
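The lookup just described might be modelled roughly as follows. The two hash functions and the table dimensions are placeholders (the architecture's actual primary and secondary hash functions are not reproduced here), so this is only a structural sketch of the segment-register substitution and the two-pass scan of 8-entry groups.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTEG_SIZE 8            /* eight PTEs scanned per hash group     */
#define NUM_PTEGS 1024         /* arbitrary table size for this sketch  */

typedef struct {
    uint64_t tag;              /* identifies the 52-bit virtual page    */
    uint32_t pfn;              /* physical page number                  */
    bool     valid;
} HashedPte;

static uint32_t segment_regs[16];               /* hold 24-bit VSIDs       */
static HashedPte table[NUM_PTEGS][PTEG_SIZE];   /* the inverted page table */

/* Placeholder hashes standing in for the architecture's real functions. */
static uint32_t primary_hash(uint64_t va52)   { return (uint32_t)(va52 >> 12) % NUM_PTEGS; }
static uint32_t secondary_hash(uint64_t va52) { return (~primary_hash(va52)) % NUM_PTEGS; }

static bool lookup(uint32_t eaddr, uint32_t *pfn)
{
    /* Top 4 bits pick a segment register; its 24 bits replace them. */
    uint64_t vsid = segment_regs[eaddr >> 28] & 0xFFFFFF;
    uint64_t va52 = (vsid << 28) | (eaddr & 0x0FFFFFFF);
    uint64_t tag  = va52 >> 12;                 /* drop the page offset */

    uint32_t groups[2] = { primary_hash(va52), secondary_hash(va52) };
    for (int g = 0; g < 2; g++)                 /* primary hash, then rehash */
        for (int i = 0; i < PTEG_SIZE; i++) {
            const HashedPte *p = &table[groups[g]][i];
            if (p->valid && p->tag == tag) { *pfn = p->pfn; return true; }
        }
    return false;                               /* would trap into the OS */
}
```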
A major problem with this design is poor cache locality caused by the hash function. Tree-based designs avoid this by placing the page table entries for adjacent pages in adjacent locations. An operating system running on the PowerPC may minimize the size of the hash table to reduce this problem.
It is also somewhat slow to remove the page table entries of a process; the OS may avoid reusing segment values to delay facing this or it may elect to suffer the waste of memory associated with per-process hash tables. G1 chips do not search for page table entries, but they do generate the hash with the expectation that an OS will search the standard hash table via software. The OS can write to the TLB. G2, G3, and early G4 chips use hardware to search the hash table. The latest chips allow the OS to choose either method. On chips that make this optional or do not support it at all, the OS may choose to use a tree-based page table exclusively.
IA-32 / x86
The x86 architecture has evolved over a long time while maintaining full software compatibility, even for OS code. Thus the MMU is extremely complex, with many different possible operating modes. Normal operation of the traditional 80386 CPU and its successors (IA-32) is described here.
The CPU primarily divides memory into 4 KB pages. Segment registers, fundamental to the older 8088 and 80286 MMU designs, are avoided as much as possible by modern OSes, with one major exception: access to thread-specific data for applications or CPU-specific data for OS kernels, which is done with explicit use of the FS and GS segment registers. All memory access involves a segment register, chosen according to the code being executed. The segment register acts as an index into a table, which provides an offset to be added to the virtual address. Except when using FS or GS as described above, the OS ensures that the offset will be zero. After the offset is added, the address is masked to be no larger than 32 bits. The result may be looked up via a tree-structured page table, with the bits of the address being split as follows: 10 bits for the root of the tree, 10 bits for the leaves of the tree, and the 12 lowest bits copied directly to the result.
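The 10/10/12 split and the segment-base addition described above might be sketched as follows; the descriptor-table lookup that produces the segment base is omitted, and the base is passed in as a plain parameter.

```c
#include <stdint.h>

/* Classic 32-bit (non-PAE) two-level split: 10 + 10 + 12 bits. */
typedef struct {
    uint32_t dir_index;    /* page-directory index (root of the tree) */
    uint32_t table_index;  /* page-table index (leaves of the tree)   */
    uint32_t offset;       /* byte within the 4 KB page               */
} Ia32Split;

static Ia32Split ia32_split(uint32_t vaddr, uint32_t segment_base)
{
    /* Adding the segment base; the result wraps at 32 bits, which is the
       "masked to be no larger than 32 bits" step in the text. */
    uint32_t linear = vaddr + segment_base;

    Ia32Split s;
    s.dir_index   = linear >> 22;
    s.table_index = (linear >> 12) & 0x3FF;
    s.offset      = linear & 0xFFF;
    return s;
}
```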
Minor revisions of the MMU introduced with the Pentium allowed very large 4 MB pages by skipping the bottom level of the tree. Minor revisions of the MMU introduced with the Pentium Pro introduced the Physical Address Extension (PAE) feature, enabling 36-bit physical addresses via three-level page tables (with 2 + 9 + 9 bits for the three levels, from the root down, and the 12 lowest bits copied directly to the result; large pages become only 2 MB in size). In addition, the Page Attribute Table allowed specification of cacheability by looking up a few high bits in a small on-CPU table.
No-execute support was originally only provided on a per-segment basis, making it very awkward to use. More recent x86 chips provide a per-page no-execute bit in PAE mode. PaX is one way to emulate per-page no-execute support via the segments, at the cost of a performance loss and halving of the available address space.
x86-64
x86-64 is a 64-bit extension of x86 that almost entirely removes segmentation in favor of the flat memory model used by almost all operating systems for the 386 or newer processors. In long mode, all segment offsets are ignored, except for the FS and GS segments. When used with 4 KB pages, the page table tree has four levels instead of three. The virtual addresses are divided up as follows: 16 bits unused, 9 bits each for four tree levels (36 bits in total), and the 12 lowest bits unmodified. With 2 MB pages there are only three levels of page table, for a total of 27 bits used in paging and 21 bits of offset. Some newer CPUs also support 1 GB pages with two levels of paging and 30 bits of offset. CPUID can be used to determine whether 1 GB pages are supported. In all three cases, the 16 highest bits are required to be equal to the 48th bit; in other words, the low 48 bits are sign-extended to the higher bits. This is done to allow a future expansion of the addressable range without compromising backwards compatibility.
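The following sketch captures the two mechanical details of this paragraph: the 9/9/9/9/12 split of a virtual address when 4 KB pages are used, and the canonical-address check implied by the sign-extension rule. The level names follow common x86-64 usage.

```c
#include <stdbool.h>
#include <stdint.h>

/* Four 9-bit indices plus a 12-bit offset for 4 KB pages in long mode. */
typedef struct {
    unsigned pml4, pdpt, pd, pt;   /* the four tree levels, top down */
    unsigned offset;               /* byte within the 4 KB page      */
} X64Split;

static X64Split x64_split(uint64_t vaddr)
{
    X64Split s;
    s.pml4   = (vaddr >> 39) & 0x1FF;
    s.pdpt   = (vaddr >> 30) & 0x1FF;
    s.pd     = (vaddr >> 21) & 0x1FF;
    s.pt     = (vaddr >> 12) & 0x1FF;
    s.offset = vaddr & 0xFFF;
    return s;
}

/* Canonical check: bits 63:47 must all match, i.e. the high 16 bits must
   equal the 48th bit (the low 48 bits are sign-extended upward). */
static bool is_canonical(uint64_t vaddr)
{
    uint64_t upper = vaddr >> 47;          /* the 17 bits 63:47            */
    return upper == 0 || upper == 0x1FFFF; /* all zeros or all ones        */
}
```

A non-canonical address is rejected by the processor before any page-table walk takes place, which is why the check appears first in practice.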
At all levels of the page table, the page table entry includes a no-execute bit, used to mark pages as data in order to defeat stack-smashing buffer overflow attacks. It has no effect unless the NXE bit in the EFER register is set, which not all legacy operating systems do.
Unisys MCP Systems (Burroughs B5000)
Tanenbaum et al. have stated that the B5000 (and descendant systems) have no MMU. To understand the functionality provided by an MMU, it is instructive to study a counter-example, a system that achieves this functionality by other means. The B5000 was the first commercial system to support virtual memory after the Atlas. It provides the two functions of an MMU in different ways. Firstly, the mapping of virtual memory addresses: instead of needing an MMU, the MCP systems are descriptor-based. Each allocated memory block is given a master descriptor with the properties of the block, i.e., its size, its address, and whether it is present in memory. When a request is made to access the block for reading or writing, the hardware checks its presence via the presence bit (pbit) in the descriptor.
A pbit of 1 indicates the presence of the block, in which case the block can be accessed via the physical address in the descriptor. If the pbit is zero, an interrupt is generated for the MCP (operating system) to make the block present. If the address field is zero, this is the first access to this block, and it is allocated (an init pbit). If the address field is non-zero, it is a disk address of the block, which has previously been rolled out, so the block is fetched from disk, the pbit is set to 1, and the physical memory address is updated to point to the block in memory (another pbit). This makes descriptors equivalent to a page-table entry in an MMU system. System performance can be monitored through the number of pbit interrupts: init pbits indicate initial allocations, but a high level of other pbit interrupts indicates that the system may be thrashing.
Note that all memory allocation is therefore completely automatic (one of the features of modern systems) and there is no way to allocate blocks other than this mechanism. There are no such calls as malloc or dealloc, since memory blocks are also automatically discarded. The scheme is also lazy, since a block will not be allocated until it is actually referenced. When memory is near full, the MCP examines the working set, trying compaction (since the system is segmented, not paged), deallocating read-only segments (such as code-segments which can be restored from their original copy), and as a last resort, rolling dirty data segments out to disk.
Secondly, protection. Since all accesses are via the descriptor, the hardware can check that all accesses are within bounds and, in the case of a write, that the process has write permission. The MCP system is inherently secure and thus has no need of an MMU to provide this level of memory protection. Descriptors are read-only to user processes and may only be updated by the system (hardware or MCP). (Descriptors have a tag of 5, and odd-tagged words are read-only; code words have a tag of 3.)
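A loose sketch of the descriptor-mediated checks described in the last two paragraphs (presence, bounds and write permission) follows; the struct layout is invented for illustration and does not reproduce the B5000's actual tagged word format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative descriptor: not the real tagged-word layout. */
typedef struct {
    bool     present;     /* the pbit: block is in physical memory    */
    bool     read_only;   /* e.g., a code segment                     */
    uint32_t length;      /* size of the block in words               */
    uint32_t address;     /* physical address if present, else disk/0 */
} Descriptor;

typedef enum { ACCESS_OK, FAULT_PRESENCE, FAULT_BOUNDS, FAULT_WRITE } AccessResult;

/* Every access goes through the descriptor, so the checks happen here. */
static AccessResult check_access(const Descriptor *d, uint32_t index, bool is_write)
{
    if (index >= d->length)
        return FAULT_BOUNDS;          /* out-of-bounds access rejected      */
    if (is_write && d->read_only)
        return FAULT_WRITE;           /* no write permission                */
    if (!d->present)
        return FAULT_PRESENCE;        /* pbit interrupt: the MCP brings the */
                                      /* block in (or allocates it)         */
    return ACCESS_OK;
}
```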
Blocks can be shared between processes via copy descriptors in the process stack; thus, some processes may have write permission whereas others do not. A code segment is read-only, thus reentrant and shared between processes. Copy descriptors contain a 20-bit address field giving the index of the master descriptor in the master descriptor array. This also implements a very efficient and secure IPC mechanism. Blocks can easily be relocated, since only the master descriptor needs updating when a block's status changes.
The only other aspect is performance – do MMU- or non-MMU-based systems provide better performance? MCP systems may be implemented on top of standard hardware that does have an MMU (e.g., a standard PC). Even if the system implementation uses the MMU in some way, this will not be at all visible at the MCP level.