Class (file format)
In the Java programming language, source files (.java files) are compiled into (virtual) machine-readable class files which have a .class extension. Since Java is a platform-independent language, source code is compiled into an output format known as bytecode, which the compiler stores in a .class file. If a source file has more than one class, each class is compiled into a separate .class file. These .class files can be loaded by any Java Virtual Machine (JVM).
JVMs are available for many platforms, and a .class file compiled on one platform will execute in a JVM on another platform. This makes Java platform-independent.
History
Modification of the class file format is being considered under Java Specification Request (JSR) 202.
Sections
There are 10 basic sections to the Java class file structure:
- Magic Number: 0xCAFEBABE
- Version of Class File Format: the minor and major versions of the class file
- Constant Pool: pool of constants for the class
- Access Flags: for example whether the class is public, abstract, final, etc.
- This Class: the name of the current class
- Super Class: the name of the super class
- Interfaces: any interfaces implemented by the class
- Fields: any fields in the class
- Methods: any methods in the class
- Attributes: any attributes of the class (for example the name of the source file, etc.)
There is a handy mnemonic for remembering these 10:
My Very Cute Animal Turns Savage In Full Moon Areas.
Magic, Version, Constant, Access, This, Super, Interfaces, Fields, Methods, Attributes (MVCATSIFMA)
Magic Number
Class files are identified by the following 4-byte header (in hexadecimal):
CA FE BA BE
(the first 4 entries in the table below). The history of this magic number was explained by James Gosling:
"We used to go to lunch at a place called St Michael's Alley. According to local legend, in the deep dark past, the Grateful Dead used to perform there before they made it big. It was a pretty funky place that was definitely a Grateful Dead Kinda Place. When Jerry died, they even put up a little Buddhist-esque shrine. When we used to go there, we referred to the place as Cafe Dead. Somewhere along the line it was noticed that this was a HEX number. I was re-vamping some file format code and needed a couple of magic numbers: one for the persistent object file, and one for classes. I used CAFEDEAD for the object file format, and in grepping for 4 character hex words that fit after "CAFE" (it seemed to be a good theme) I hit on BABE and decided to use it. At that time, it didn't seem terribly important or destined to go anywhere but the trash-can of history. So CAFEBABE became the class file format, and CAFEDEAD was the persistent object format. But the persistent object facility went away, and along with it went the use of CAFEDEAD - it was eventually replaced by RMI."
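As a minimal sketch of how a tool might use the 4-byte header, the following Python snippet checks whether a byte string begins with CA FE BA BE. The function name `looks_like_class_file` and the sample inputs are illustrative assumptions, not part of any specification.

```python
import struct

CLASS_MAGIC = 0xCAFEBABE  # the 4-byte header CA FE BA BE


def looks_like_class_file(data: bytes) -> bool:
    """Return True if data begins with the class file magic number."""
    if len(data) < 4:
        return False
    # ">I" reads one unsigned 32-bit integer in big-endian byte order.
    (magic,) = struct.unpack_from(">I", data, 0)
    return magic == CLASS_MAGIC


# Hypothetical inputs: the start of a class file, and the start of a ZIP archive.
print(looks_like_class_file(bytes.fromhex("CAFEBABE00000032")))  # True
print(looks_like_class_file(b"PK\x03\x04"))                      # False
```

A check like this only identifies the file type; it says nothing about whether the rest of the file is well-formed.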
General layout
Because the class file contains variable-sized items and does not also contain embedded file offsets (or pointers), it is typically parsed sequentially, from the first byte toward the end. At the lowest level the file format is described in terms of a few fundamental data types:
- u1: an unsigned 8-bit integer
- u2: an unsigned 16-bit integer in big-endian byte order
- u4: an unsigned 32-bit integer in big-endian byte order
- table: an array of variable-length items of some type. The number of items in the table is identified by a preceding count number, but the size in bytes of the table can only be determined by examining each of its items.
Some of these fundamental types are then re-interpreted as higher-level values (such as strings or floating-point numbers), depending on context.
There is no enforcement of word alignment, and so no padding bytes are ever used.
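The fundamental types can be read with a few small helpers. The sketch below uses Python's `struct` module; the helper names `read_u1`, `read_u2`, and `read_u4` and the sample bytes are assumptions made for illustration. Note that the u4 value starts at an odd offset, which is fine because the format has no alignment or padding.

```python
import io
import struct


def read_u1(stream) -> int:
    """Read an unsigned 8-bit integer."""
    return struct.unpack(">B", stream.read(1))[0]


def read_u2(stream) -> int:
    """Read an unsigned 16-bit integer in big-endian byte order."""
    return struct.unpack(">H", stream.read(2))[0]


def read_u4(stream) -> int:
    """Read an unsigned 32-bit integer in big-endian byte order."""
    return struct.unpack(">I", stream.read(4))[0]


# Hypothetical sample: one u1, one u2, one u4, packed back to back.
stream = io.BytesIO(bytes([0x7F, 0x01, 0x02, 0x00, 0x00, 0x01, 0x00]))
a = read_u1(stream)  # 0x7F
b = read_u2(stream)  # 0x0102
c = read_u4(stream)  # 0x00000100, read from offset 3 (unaligned)
print(a, b, c)  # 127 258 256
```

Because every read advances the stream, calling the helpers in file order mirrors the sequential parsing described above.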
The overall layout of the class file is as shown in the following table.
byte offset | size | type or value | description |
---|---|---|---|
0 | 4 bytes | u1 = 0xCA | magic number (CAFEBABE) used to identify the file as conforming to the class file format |
1 | | u1 = 0xFE | |
2 | | u1 = 0xBA | |
3 | | u1 = 0xBE | |
4 | 2 bytes | u2 | minor version number of the class file format being used |
6 | 2 bytes | u2 | major version number of the class file format being used. Java SE 7 = 51 (0x33), Java SE 6 = 50 (0x32), J2SE 5.0 = 49 (0x31), JDK 1.4 = 48 (0x30), JDK 1.3 = 47 (0x2F), JDK 1.2 = 46 (0x2E), JDK 1.1 = 45 (0x2D). For details of earlier version numbers see footnote 1 in The Java Virtual Machine Specification, 2nd edition |
8 | 2 bytes | u2 | constant pool count, which governs the number of entries in the following constant pool table. This count is at least one greater than the actual number of entries; see the following discussion. |
10 | cpsize (variable) | table | constant pool table, an array of variable-sized constant pool entries, containing items such as literal numbers, strings, and references to classes or methods. Indexed starting at 1, with valid indexes running up to (constant pool count - 1) (see note). |
10+cpsize | 2 bytes | u2 | access flags, a bitmask |
12+cpsize | 2 bytes | u2 | identifies this class, index into the constant pool to a "Class"-type entry |
14+cpsize | 2 bytes | u2 | identifies super class, index into the constant pool to a "Class"-type entry |
16+cpsize | 2 bytes | u2 | interface count, number of entries in the following interface table |
18+cpsize | isize (variable) | table | interface table, an array of u2 indexes into the constant pool, one "Class"-type entry per interface (so isize is twice the interface count) |
18+cpsize+isize | 2 bytes | u2 | field count, number of entries in the following field table |
20+cpsize+isize | fsize (variable) | table | field table, variable-length array of fields |
20+cpsize+isize+fsize | 2 bytes | u2 | method count, number of entries in the following method table |
22+cpsize+isize+fsize | msize (variable) | table | method table, variable-length array of methods |
22+cpsize+isize+fsize+msize | 2 bytes | u2 | attribute count, number of entries in the following attribute table |
24+cpsize+isize+fsize+msize | asize (variable) | table | attribute table, variable-length array of attributes |
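The fixed-size part of the layout (the first 10 bytes) can be decoded in one step. The following is a sketch under stated assumptions: the function name `parse_header`, the returned dictionary keys, and the sample bytes are invented for illustration, and the version labels simply restate the major version numbers from the table.

```python
import struct

# Major version numbers and the releases they correspond to.
MAJOR_VERSIONS = {
    45: "JDK 1.1", 46: "JDK 1.2", 47: "JDK 1.3", 48: "JDK 1.4",
    49: "J2SE 5.0", 50: "Java SE 6", 51: "Java SE 7",
}


def parse_header(data: bytes) -> dict:
    """Decode magic, minor version, major version and constant pool count."""
    # ">IHHH" = u4, u2, u2, u2 in big-endian order with no padding (10 bytes).
    magic, minor, major, cp_count = struct.unpack_from(">IHHH", data, 0)
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file: bad magic number")
    return {
        "minor": minor,
        "major": major,
        "release": MAJOR_VERSIONS.get(major, "unknown"),
        "constant_pool_count": cp_count,
    }


# Hypothetical header: magic, minor 0, major 0x32, constant pool count 0x10.
header = parse_header(bytes.fromhex("CAFEBABE" "0000" "0032" "0010"))
print(header["major"], header["release"])  # 50 Java SE 6
print(header["constant_pool_count"])       # 16
```

Everything after offset 10 is variable-sized, so the remaining fields can only be located by walking the constant pool first.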
C programming language representation
The constant pool
The constant pool table is where most of the literal constant values are stored. This includes values such as numbers of all sorts, strings, identifier names, references to classes and methods, and type descriptors. All indexes, or references, to specific constants in the constant pool table are given by 16-bit (type u2) numbers, where index value 1 refers to the first constant in the table (index value 0 is invalid).
Due to historic choices made during the file format development, the number of constants in the constant pool table is not actually the same as the constant pool count which precedes the table. First, the table is indexed starting at 1 (rather than 0), so the count is one greater than the maximum index. Additionally, two types of constants (longs and doubles) take up two consecutive slots in the table, although the second such slot is a phantom index that is never directly used.
The type of each item (constant) in the constant pool is identified by an initial byte tag. The number of bytes following this tag and their interpretation are then dependent upon the tag value. The valid constant types and their tag values are:
Tag byte | Additional bytes | Description of constant |
---|---|---|
1 | 2+x bytes (variable) | UTF-8 (Unicode) string: a character string prefixed by a 16-bit number (type u2) indicating the number of bytes in the encoded string which immediately follows (which may be different than the number of characters). Note that the encoding used is not actually UTF-8, but involves a slight modification of the Unicode standard encoding form. |
3 | 4 bytes | Integer: a signed 32-bit two's complement number in big-endian format |
4 | 4 bytes | Float: a 32-bit single-precision IEEE 754 floating-point number |
5 | 8 bytes | Long: a signed 64-bit two's complement number in big-endian format (takes two slots in the constant pool table) |
6 | 8 bytes | Double: a 64-bit double-precision IEEE 754 floating-point number (takes two slots in the constant pool table) |
7 | 2 bytes | Class reference: an index within the constant pool to a UTF-8 string containing the fully qualified class name (in internal format) |
8 | 2 bytes | String reference: an index within the constant pool to a UTF-8 string |
9 | 4 bytes | Field reference: two indexes within the constant pool, the first pointing to a Class reference, the second to a Name and Type descriptor. |
10 | 4 bytes | Method reference: two indexes within the constant pool, the first pointing to a Class reference, the second to a Name and Type descriptor. |
11 | 4 bytes | Interface method reference: two indexes within the constant pool, the first pointing to a Class reference, the second to a Name and Type descriptor. |
12 | 4 bytes | Name and type descriptor: two indexes to UTF-8 strings within the constant pool, the first representing a name (identifier) and the second a specially encoded type descriptor. |
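The indexing rules and tag sizes above can be combined into a simple sequential walk over the table. This is a sketch only: it handles just the tags listed in the table, keeps each payload as raw bytes, and the function name `read_constant_pool` and the sample data are assumptions made for illustration.

```python
import struct

# Number of bytes following the tag, for the fixed-size constant types.
FIXED_SIZES = {3: 4, 4: 4, 5: 8, 6: 8, 7: 2, 8: 2, 9: 4, 10: 4, 11: 4, 12: 4}


def read_constant_pool(data: bytes, offset: int, count: int):
    """Return ({index: (tag, payload)}, offset just past the table)."""
    pool = {}
    index = 1  # the table is indexed starting at 1
    while index < count:  # valid indexes run from 1 to count - 1
        tag = data[offset]
        offset += 1
        if tag == 1:
            # UTF-8 string: u2 byte length, then that many bytes.
            (length,) = struct.unpack_from(">H", data, offset)
            offset += 2
            size = length
        elif tag in FIXED_SIZES:
            size = FIXED_SIZES[tag]
        else:
            raise ValueError(f"unknown constant pool tag {tag} at index {index}")
        pool[index] = (tag, data[offset:offset + size])
        offset += size
        # Longs and doubles occupy two slots; the second is never used.
        index += 2 if tag in (5, 6) else 1
    return pool, offset


# Hypothetical pool: a UTF-8 string, a Class reference to it, and a Long.
raw = (
    b"\x01" + struct.pack(">H", 16) + b"java/lang/Object"
    + b"\x07" + struct.pack(">H", 1)
    + b"\x05" + struct.pack(">q", 1)
)
pool, end = read_constant_pool(raw, 0, 5)
print(sorted(pool))  # [1, 2, 3]
print(end)           # 31
```

In this sample the count is 5 although only three constants are stored: index 0 is invalid and the Long at index 3 also consumes the phantom index 4.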
There are only two integral constant types, integer and long. Other integral types appearing in the high-level language, such as boolean, byte, and short, must be represented as an integer constant.
Class names in Java, when fully qualified, are traditionally dot-separated, such as "java.lang.Object". However, within the low-level Class reference constants, an internal form appears which uses slashes instead, such as "java/lang/Object".
The Unicode strings, despite the moniker "UTF-8 string", are not actually encoded according to the Unicode standard, although the encoding is similar. There are two differences (see UTF-8 for a complete discussion). The first is that the code point U+0000 is encoded as the two-byte sequence C0 80 (in hex) instead of the standard single-byte encoding 00. The second difference is that supplementary characters (those outside the BMP at U+10000 and above) are encoded using a surrogate-pair construction similar to UTF-16 rather than being directly encoded using UTF-8. In this case each of the two surrogates is encoded separately in UTF-8. For example U+1D11E is encoded as the 6-byte sequence ED A0 B4 ED B4 9E, rather than the correct 4-byte UTF-8 encoding of F0 9D 84 9E.
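The two differences from standard UTF-8 can be illustrated with a small encoder. This is a sketch rather than a reference implementation; the function name `encode_modified_utf8` is an assumption, and it relies on Python's `surrogatepass` error handler to emit the UTF-8 form of an individual surrogate.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text the way class file "UTF-8" constants are encoded."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Difference 1: U+0000 becomes the two-byte sequence C0 80.
            out += b"\xC0\x80"
        elif cp < 0x10000:
            # Everything else in the BMP is encoded as in standard UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Difference 2: split into a UTF-16 surrogate pair, then
            # encode each surrogate separately as a 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


print(encode_modified_utf8("\x00").hex())        # c080
print(encode_modified_utf8("\U0001D11E").hex())  # eda0b4edb49e
print("\U0001D11E".encode("utf-8").hex())        # f09d849e (standard UTF-8)
```

Plain ASCII text other than U+0000 encodes identically in both schemes, which is why the difference often goes unnoticed.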
Further reading
The Java Virtual Machine Specification is the official defining document of the Java Virtual Machine, which includes the class file format. Both the first and second editions of the book are freely available online for viewing and/or download.