Self-modifying code
Encyclopedia
In computer science
, self-modifying code is code that alters its own instructions while it is executing
- usually to reduce the instruction path length
and improve performance
or simply to reduce otherwise repetitively similar code, thus simplifying maintenance. Self modification is an alternative to the method of 'flag setting' and conditional program branching, used primarily to reduce the number of times a condition needs to be tested for. The term is usually only applied to code where the self-modification is intentional, not in situations where code accidentally modifies itself due to an error such as a buffer overflow
.
The method is frequently used for conditionally invoking test/debugging code without requiring additional computational overhead
for every input/output
cycle.
The modifications may be performed:-
In either case, the modifications may be performed directly to the machine code
instructions themselves, by overlaying new instructions over the existing ones (for example: altering a compare and branch to an unconditional branch or alternatively a 'NOP
'). In the IBM/360 and Z/Architecture
instruction set
, an EXECUTE (EX) instruction overlays its target instruction (in its 2nd byte) with the lowest order 8 bits of register 1, as a standard, legitimate method of (temporary) instruction modification.
. Instructions can be dynamically created in memory
(or else overlaid over existing code in non-protected program storage), in a sequence equivalent to the ones that a standard compiler may generate as the object code
(/binary file
). With modern processors, there can be unintended side effects on the CPU cache
that must be considered. The method was frequently used for testing 'first time' conditions, as in this suitably commented IBM/360 assembler example. It uses instruction overlay to reduce the instruction path length
by (N x 1)-1 where N is the number of records on the file (-1 being the overhead
to perform the overlay).
(Since the replacement unconditional branch is also slightly faster than a compare instruction, as well as reducing the overall path length, the saved difference in timing between the two instructions is magnified by a factor of N. The 'jump' instruction retains locality of reference and much higher 'visibility' by its close proximity to the overwritten instruction, despite adding an unnecessary extra instruction after the OPEN)
In later Operating systems for programs residing in protected storage
this technique could not be used and so changing the pointer to the subroutine
would be used instead. The pointer would reside in dynamic storage and could be altered at will after the first pass to bypass the OPEN (Having to load a pointer first instead of a direct branch & link to the subroutine would add N instructions to the path length - but there would be a corresponding reduction of N for the unconditional branch that would no longer be required).
allows programs to modify themselves; one batch
programming technique is to use self-modifying code. Clipper and SPITBOL
also provide facilities for explicit self-modification.
Other languages, such as Perl
, Python
and JavaScript
, allow programs to create new code at run-time and execute it using an eval
function, but do not allow existing code to be mutated. The illusion of modification (even though no machine code is really being overwritten) is achieved by modifying function pointers, as in this JavaScript example:
interpreters
can be considered to be, in one sense, 'self-modified' by data values extracted from the table entries (rather than specifically hand coded
in conditional statement
s of the form "IF inputx = 'yyy'").
In the early days of computers, self-modifying code was often used to reduce use of limited memory, or improve performance, or both. It was also sometimes used to implement subroutine calls and returns when the instruction set only provided simple branching or skipping instructions to vary the control flow
. This use is still relevant in certain ultra-RISC architectures, at least theoretically; see for example one instruction set computer
. Donald Knuth
's MIX
architecture also used self-modifying code to implement subroutine calls.
example:
repeat N times {
if STATE is 1
increase A by one
else
decrease A by one
do something with A
}
Self-modifying code in this case would simply be a matter of rewriting the loop like this:
repeat N times {
increase A by one
do something with A
}
when STATE has to switch {
replace the opcode "increase" above with the opcode to decrease, or vice versa
}
Note that 2-state replacement of the opcode
can be easily written as
'xor var at address with the value "opcodeOf(Inc) xor opcodeOf(dec)"'.
Choosing this solution must depend on the value of 'N' and the frequency of state changing.
and Apple II
. For example, on an IBM PC (or compatible
), the floppy disk
drive access instruction 'int
0x13' would not appear in the executable program's image but it would be written into the executable's memory image after the program started executing.
Self-modifying code is also sometimes used by programs that do not want to reveal their presence, such as computer virus
es and some shellcode
s. Viruses and shellcodes that use self-modifying code mostly do this in combination with polymorphic code
. Modifying a piece of running code is also used in certain attacks, such as buffer overflow
s.
systems have a fixed, pre-programmed learning algorithm
to adjust their parameter
s. However, since the 1980s Jürgen Schmidhuber
has published several self-modifying systems with the ability to change their own learning algorithm. They avoid the danger of catastrophic self-rewrites by making sure that self-modifications will survive only if they are useful according to a user-given fitness
, error
or reward function.
s are careful to remove such vulnerabilities as they become known. The concern is typically not that programs will intentionally modify themselves, but that they could be maliciously changed by an exploit
.
As consequence of the troubles that can be caused by these exploits, an OS feature called W^X
(for "write xor execute") has been developed which prohibits a program from making any page of memory both writable and executable. Some systems prevent a writable page from ever being changed to be executable, even if write permission is removed. Other systems provide a 'back door' of sorts, allowing multiple mappings of a page of memory to have different permissions. A relatively portable way to bypass W^X is to create a file with all permissions, then map the file into memory twice. On Linux, one may use an undocumented SysV shared memory flag to get executable shared memory without needing to create a file.
Regardless, at a meta-level, programs can still modify their own behavior by changing data stored elsewhere (see metaprogramming
) or via use of polymorphism
.
In some cases short sections of self-modifying code execute more slowly on modern processors. This is because a modern processor will usually try to keep blocks of code in its cache memory. Each time the program rewrites a part of itself, the rewritten part must be loaded into the cache again, which results in a slight delay, if the modified codelet shares the same cache line with the modifying code, as is the case when the modified memory address is located within a few bytes to the one of the modifying code.
The cache invalidation issue on modern processors usually means that self-modifying code would still be faster only when the modification will occur rarely, such as in the case of a state switching inside an inner loop.
Most modern processors load the machine code before they execute it, which means that if an instruction that is too near the instruction pointer is modified, the processor will not notice, but instead execute the code as it was before it was modified. See prefetch input queue
(PIQ). PC processors must handle self-modifying code correctly for backwards compatibility reasons but they are far from efficient at doing so.
's Ph.D.
thesis is a tiny Unix
kernel that takes a structured
, or even object oriented
, approach to self-modifying code, where code is created for individual quajects, like filehandles; generating code for specific tasks allows the Synthesis kernel to (as a JIT interpreter might) apply a number of optimizations
such as constant folding
or common subexpression elimination
.
The Synthesis kernel was extremely fast, but was written entirely in assembly. The resulting lack of portability has prevented Massalin's optimization ideas from being adopted by any production kernel. However, the structure of the techniques suggests that they could be captured by a higher level language
, albeit one more complex than existing mid-level languages. Such a language and compiler could allow development of faster operating systems and applications.
Paul Haeberli and Bruce Karsh have objected to the "marginalization" of self-modifying code, and optimization in general, in favor of reduced development costs.
because it makes code harder to read and maintain. There are however ways in which self modification is nevertheless deemed acceptable, such as when function pointer
s are dynamically altered, even though the effect is almost identical to direct modification. The subtle difference, in this case, is that a pointer variable
is altered, not actual program instructions. The change to the pointer is, in this case, equivalent to the setting of a 'flag
' (that might have been set as an alternative), except that the flag does not need to be tested each time thereafter.
At the machine instruction level, frequently self-modifying code can cause significant performance degradation on modern processors. The most common method of handling correctness in the case of self-modifying code in the processor is a full pipeline
flush, which carries a far stiffer penalty than branch misprediction recovery.
Self-modifying code cannot be used at all in some environments.
The most common such environments are:
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
, self-modifying code is code that alters its own instructions while it is executing
Execution (computers)
Execution in computer and software engineering is the process by which a computer or a virtual machine carries out the instructions of a computer program. The instructions in the program trigger sequences of simple actions on the executing machine...
- usually to reduce the instruction path length
Instruction path length
In computer performance, the instruction path length is the number of machine code instructions required to execute a section of a computer program. The total path length for the entire program could be deemed a measure of the algorithm's performance on a particular computer hardware...
and improve performance
Performance
A performance, in performing arts, generally comprises an event in which a performer or group of performers behave in a particular way for another group of people, the audience. Choral music and ballet are examples. Usually the performers participate in rehearsals beforehand. Afterwards audience...
or simply to reduce otherwise repetitively similar code, thus simplifying maintenance. Self modification is an alternative to the method of 'flag setting' and conditional program branching, used primarily to reduce the number of times a condition needs to be tested for. The term is usually only applied to code where the self-modification is intentional, not in situations where code accidentally modifies itself due to an error such as a buffer overflow
Buffer overflow
In computer security and programming, a buffer overflow, or buffer overrun, is an anomaly where a program, while writing data to a buffer, overruns the buffer's boundary and overwrites adjacent memory. This is a special case of violation of memory safety....
.
The method is frequently used for conditionally invoking test/debugging code without requiring additional computational overhead
Computational overhead
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal...
for every input/output
Input/output
In computing, input/output, or I/O, refers to the communication between an information processing system , and the outside world, possibly a human, or another information processing system. Inputs are the signals or data received by the system, and outputs are the signals or data sent from it...
cycle.
The modifications may be performed:-
- only during initialization - based on input parameters (when the process is more commonly described as software 'configurationComputer configurationIn communications or computer systems, a configuration is an arrangement of functional units according to their nature, number, and chief characteristics. Often, configuration pertains to the choice of hardware, software, firmware, and documentation...
' and is somewhat analogous, in hardware terms, to setting jumpersJumper (computing)In electronics and particularly computing, a jumper is a short length of conductor used to close a break in or bypass part of an electrical circuit...
for printed circuit boardPrinted circuit boardA printed circuit board, or PCB, is used to mechanically support and electrically connect electronic components using conductive pathways, tracks or signal traces etched from copper sheets laminated onto a non-conductive substrate. It is also referred to as printed wiring board or etched wiring...
s). Alteration of program entry pointers is an equivalent indirect method of self-modification, but requiring the co-existence of one or more alternative instruction paths, increasing the program sizeBinary fileA binary file is a computer file which may contain any type of data, encoded in binary form for computer storage and processing purposes; for example, computer document files containing formatted text...
. - throughout execution ('on-the-fly') - based on particular program states that have been reached during the execution
In either case, the modifications may be performed directly to the machine code
Machine code
Machine code or machine language is a system of impartible instructions executed directly by a computer's central processing unit. Each instruction performs a very specific task, typically either an operation on a unit of data Machine code or machine language is a system of impartible instructions...
instructions themselves, by overlaying new instructions over the existing ones (for example: altering a compare and branch to an unconditional branch or alternatively a 'NOP
NOP
In computer science, NOP or NOOP is an assembly language instruction, sequence of programming language statements, or computer protocol command that effectively does nothing at all....
'). In the IBM/360 and Z/Architecture
Z/Architecture
z/Architecture, initially and briefly called ESA Modal Extensions , refers to IBM's 64-bit computing architecture for IBM mainframe computers. IBM introduced its first z/Architecture-based system, the zSeries Model 900, in late 2000. Later z/Architecture systems include the IBM z800, z990, z890,...
instruction set
Instruction set
An instruction set, or instruction set architecture , is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O...
, an EXECUTE (EX) instruction overlays its target instruction (in its 2nd byte) with the lowest order 8 bits of register 1, as a standard, legitimate method of (temporary) instruction modification.
Application in low and high level languages
Self-modification can be accomplished in a variety of ways depending upon the programming language and its support for pointers and/or access to dynamic compiler or interpreter 'engines':-- overlay of existing instructions (or parts of instructions such as opcode, register, flags or address) or
- direct creation of whole instructions or sequences of instructions in memory
- creating or modification of source codeSource codeIn computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...
statements followed by a 'mini compile' or a dynamic interpretation (see evalEvalIn some programming languages, eval is a function which evaluates a string as though it were an expression and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval...
statement) - creating an entire program dynamically and then executing it
Assembly language
Self-modifying code is quite straightforward to implement when using assembly languageAssembly language
An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...
. Instructions can be dynamically created in memory
Memory
In psychology, memory is an organism's ability to store, retain, and recall information and experiences. Traditional studies of memory began in the fields of philosophy, including techniques of artificially enhancing memory....
(or else overlaid over existing code in non-protected program storage), in a sequence equivalent to the ones that a standard compiler may generate as the object code
Object code
Object code, or sometimes object module, is what a computer compiler produces. In a general sense object code is a sequence of statements in a computer language, usually a machine code language....
(/binary file
Binary file
A binary file is a computer file which may contain any type of data, encoded in binary form for computer storage and processing purposes; for example, computer document files containing formatted text...
). With modern processors, there can be unintended side effects on the CPU cache
CPU cache
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...
that must be considered. The method was frequently used for testing 'first time' conditions, as in this suitably commented IBM/360 assembler example. It uses instruction overlay to reduce the instruction path length
Instruction path length
In computer performance, the instruction path length is the number of machine code instructions required to execute a section of a computer program. The total path length for the entire program could be deemed a measure of the algorithm's performance on a particular computer hardware...
by (N x 1)-1 where N is the number of records on the file (-1 being the overhead
Computational overhead
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal...
to perform the overlay).
SUBRTN CLI SUBRTN,X'95' FIRST TIME HERE? (this instruction is immediately overlaid during 1st time through)
BNE OPENED (WILL DROP THROUGH IF CLI OPCODE, = x'95' HAS NOT BEEN CHANGED YET)
MVC SUBRTN(4),JUMP YES, OVERLAY THE TEST BY THE MOVE OF AN UNCONDITIONAL BRANCH (same length machine code)
OPEN INPUT and OPEN THE INPUT FILE since its first time through here
JUMP B OPENED THIS 4-BYTE UNCONDITIONAL BRANCH INSTRUCTION OVERLAYS THE 4-BYTE INSTRUCTION AT LABEL 'TEST'
OPENED GET INPUT NORMAL PROCESSING RESUMES HERE
...
(Since the replacement unconditional branch is also slightly faster than a compare instruction, as well as reducing the overall path length, the saved difference in timing between the two instructions is magnified by a factor of N. The 'jump' instruction retains locality of reference and much higher 'visibility' by its close proximity to the overwritten instruction, despite adding an unnecessary extra instruction after the OPEN)
In later Operating systems for programs residing in protected storage
Memory protection
Memory protection is a way to control memory access rights on a computer, and is a part of most modern operating systems. The main purpose of memory protection is to prevent a process from accessing memory that has not been allocated to it. This prevents a bug within a process from affecting...
this technique could not be used and so changing the pointer to the subroutine
Subroutine
In computer science, a subroutine is a portion of code within a larger program that performs a specific task and is relatively independent of the remaining code....
would be used instead. The pointer would reside in dynamic storage and could be altered at will after the first pass to bypass the OPEN (Having to load a pointer first instead of a direct branch & link to the subroutine would add N instructions to the path length - but there would be a corresponding reduction of N for the unconditional branch that would no longer be required).
High level languages
Some languages explicitly permit self-modifying code. For example, the ALTER verb in COBOLCOBOL
COBOL is one of the oldest programming languages. Its name is an acronym for COmmon Business-Oriented Language, defining its primary domain in business, finance, and administrative systems for companies and governments....
allows programs to modify themselves; one batch
Batch file
In DOS, OS/2, and Microsoft Windows, batch file is the name given to a type of script file, a text file containing a series of commands to be executed by the command interpreter....
programming technique is to use self-modifying code. Clipper and SPITBOL
SPITBOL compiler
SPITBOL is a compiled implementation of the SNOBOL4 language. Originally targeted for the IBM System/360 and System/370 family of computers, it has now been ported to most major microprocessors including the SPARC...
also provide facilities for explicit self-modification.
Other languages, such as Perl
Perl
Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
, Python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
and JavaScript
JavaScript
JavaScript is a prototype-based scripting language that is dynamic, weakly typed and has first-class functions. It is a multi-paradigm language, supporting object-oriented, imperative, and functional programming styles....
, allow programs to create new code at run-time and execute it using an eval
Eval
In some programming languages, eval is a function which evaluates a string as though it were an expression and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval...
function, but do not allow existing code to be mutated. The illusion of modification (even though no machine code is really being overwritten) is achieved by modifying function pointers, as in this JavaScript example:
Control tables
Control tableControl table
Control tables are tables that control the program flow or play a major part in program control. There are no rigid rules concerning the structure or content of a control table - its only qualifying attribute is its ability to direct program flow in some way through its 'execution' by an associated...
interpreters
Interpreter (computing)
In computer science, an interpreter normally means a computer program that executes, i.e. performs, instructions written in a programming language...
can be considered to be, in one sense, 'self-modified' by data values extracted from the table entries (rather than specifically hand coded
Hand coding
In computing, hand coding means editing the underlying representation of a document or a computer program, when tools that allow working on more sophisticated representation also exist. Typically this means editing the source code, or the textual representation of a document or a program, instead...
in conditional statement
Conditional statement
In computer science, conditional statements, conditional expressions and conditional constructs are features of a programming language which perform different computations or actions depending on whether a programmer-specified boolean condition evaluates to true or false...
s of the form "IF inputx = 'yyy'").
History
The IBM SSEC, demonstrated in January 1948, had the ability to modify its instructions or otherwise treat them exactly like data. However, the capability was rarely used in practice.In the early days of computers, self-modifying code was often used to reduce use of limited memory, or improve performance, or both. It was also sometimes used to implement subroutine calls and returns when the instruction set only provided simple branching or skipping instructions to vary the control flow
Control flow
In computer science, control flow refers to the order in which the individual statements, instructions, or function calls of an imperative or a declarative program are executed or evaluated....
. This use is still relevant in certain ultra-RISC architectures, at least theoretically; see for example one instruction set computer
One instruction set computer
A one instruction set computer , sometimes called an ultimate reduced instruction set computer , is an abstract machine that uses only one instruction – obviating the need for a machine language opcode...
. Donald Knuth
Donald Knuth
Donald Ervin Knuth is a computer scientist and Professor Emeritus at Stanford University.He is the author of the seminal multi-volume work The Art of Computer Programming. Knuth has been called the "father" of the analysis of algorithms...
's MIX
MIX
MIX is a hypothetical computer used in Donald Knuth's monograph, The Art of Computer Programming . MIX's model number is 1009, which was derived by combining the model numbers and names of several contemporaneous, commercial machines deemed significant by the author...
architecture also used self-modifying code to implement subroutine calls.
Usage
Self-modifying code can be used for various purposes:- Semi-automatic optimizingOptimization (computer science)In computer science, program optimization or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources...
of a state dependent loop. - Run-time code generation, or specialization of an algorithm in runtime or loadtime (which is popular, for example, in the domain of real-time graphics) such as a general sort utility - preparing code to perform the key comparison described in a specific invocation.
- Altering of inlinedInline functionIn various versions of the C and C++ programming languages, an inline function is a function upon which the compiler has been requested to perform inline expansion...
state of an objectObject (computer science)In computer science, an object is any entity that can be manipulated by the commands of a programming language, such as a value, variable, function, or data structure...
, or simulating the high-level construction of closuresClosure (computer science)In computer science, a closure is a function together with a referencing environment for the non-local variables of that function. A closure allows a function to access variables outside its typical scope. Such a function is said to be "closed over" its free variables...
. - Patching of subroutineSubroutineIn computer science, a subroutine is a portion of code within a larger program that performs a specific task and is relatively independent of the remaining code....
(pointer) address calling, usually as performed at load/initialization time of dynamic libraries, or else on each invocation, patching the subroutine's internal references to its parameters so as to use actual addresses of specific routines. (i.e. Indirect 'self-modification'). - Evolutionary computing systems such as genetic programmingGenetic programmingIn artificial intelligence, genetic programming is an evolutionary algorithm-based methodology inspired by biological evolution to find computer programs that perform a user-defined task. It is a specialization of genetic algorithms where each individual is a computer program...
. - Hiding of code to prevent reverse engineeringReverse engineeringReverse engineering is the process of discovering the technological principles of a device, object, or system through analysis of its structure, function, and operation...
(by use of a disassemblerDisassemblerA disassembler is a computer program that translates machine language into assembly language—the inverse operation to that of an assembler. A disassembler differs from a decompiler, which targets a high-level language rather than an assembly language...
or debuggerDebuggerA debugger or debugging tool is a computer program that is used to test and debug other programs . The code to be examined might alternatively be running on an instruction set simulator , a technique that allows great power in its ability to halt when specific conditions are encountered but which...
) or to evade detection by virus/spyware scanning software and the like. - Filling 100% of memory (in some architectures) with a rolling pattern of repeating opcodes, to erase all programs and data, or to burn-in hardware.
- CompressingData compressionIn computer science and information theory, data compression, source coding or bit-rate reduction is the process of encoding information using fewer bits than the original representation would use....
code to be decompressed and executed at runtime, e.g., when memory or disk space is limited. - Some very limited instruction setInstruction setAn instruction set, or instruction set architecture , is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O...
s leave no option but to use self-modifying code to perform certain functions. For example, a one instruction set computerOne instruction set computerA one instruction set computer , sometimes called an ultimate reduced instruction set computer , is an abstract machine that uses only one instruction – obviating the need for a machine language opcode...
(OISC) machine that uses only the subtract-and-branch-if-negative "instruction" cannot do an indirect copy (something like the equivalent of "*a = **b" in the C languageC (programming language)C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
) without using self-modifying code. - Altering instructions for fault-tolerance.
Optimizing a state-dependent loop
PseudocodePseudocode
In computer science and numerical computation, pseudocode is a compact and informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a programming language, but is intended for human reading rather than machine reading...
example:
repeat N times {
if STATE is 1
increase A by one
else
decrease A by one
do something with A
}
Self-modifying code in this case would simply be a matter of rewriting the loop like this:
repeat N times {
increase A by one
do something with A
}
when STATE has to switch {
replace the opcode "increase" above with the opcode to decrease, or vice versa
}
Note that 2-state replacement of the opcode
Opcode
In computer science engineering, an opcode is the portion of a machine language instruction that specifies the operation to be performed. Their specification and format are laid out in the instruction set architecture of the processor in question...
can be easily written as
'xor var at address with the value "opcodeOf(Inc) xor opcodeOf(dec)"'.
Choosing this solution must depend on the value of 'N' and the frequency of state changing.
Specialisation
Suppose a set of statistics such as average, extrema, location of extrema, standard deviation, etc. are to be calculated for some large data set. In a general situation, there may be an option of associating weights with the data, so each xi is associated with a wi and rather than test for the presence of weights at every index value, there could be two versions of the calculation, one for use with weights and one not, with one test at the start. Now consider a further option, that each value may have associated with it a boolean to signify whether that value is to be skipped or not. This could be handled by producing four batches of code, one for each permutation and code bloat results. Alternatively, the weight and the skip arrays could be merged into a temporary array (with zero weights for values to be skipped), at the cost of processing and still there is bloat. However, with code modification, to the template for calculating the statistics could be added as appropriate the code for skipping unwanted values, and for applying weights. There would be no repeated testing of the options and the data array would be accessed once, as also would the weight and skip arrays, if involved.Use as camouflage
Self-modifying code was used to hide copy protection instructions in 1980s disk-based programs for platforms such as IBM PCIBM PC
The IBM Personal Computer, commonly known as the IBM PC, is the original version and progenitor of the IBM PC compatible hardware platform. It is IBM model number 5150, and was introduced on August 12, 1981...
and Apple II
Apple II
The Apple II is an 8-bit home computer, one of the first highly successful mass-produced microcomputer products, designed primarily by Steve Wozniak, manufactured by Apple Computer and introduced in 1977...
. For example, on an IBM PC (or compatible
IBM PC compatible
IBM PC compatible computers are those generally similar to the original IBM PC, XT, and AT. Such computers used to be referred to as PC clones, or IBM clones since they almost exactly duplicated all the significant features of the PC architecture, facilitated by various manufacturers' ability to...
), the floppy disk
Floppy disk
A floppy disk is a disk storage medium composed of a disk of thin and flexible magnetic storage medium, sealed in a rectangular plastic carrier lined with fabric that removes dust particles...
drive access instruction 'int
Interrupt
In computing, an interrupt is an asynchronous signal indicating the need for attention or a synchronous event in software indicating the need for a change in execution....
0x13' would not appear in the executable program's image but it would be written into the executable's memory image after the program started executing.
Self-modifying code is also sometimes used by programs that do not want to reveal their presence, such as computer virus
Computer virus
A computer virus is a computer program that can replicate itself and spread from one computer to another. The term "virus" is also commonly but erroneously used to refer to other types of malware, including but not limited to adware and spyware programs that do not have the reproductive ability...
es and some shellcode
Shellcode
In computer security, a shellcode is a small piece of code used as the payload in the exploitation of a software vulnerability. It is called "shellcode" because it typically starts a command shell from which the attacker can control the compromised machine. Shellcode is commonly written in...
s. Viruses and shellcodes that use self-modifying code mostly do this in combination with polymorphic code
Polymorphic code
In computer terminology, polymorphic code is code that uses a polymorphic engine to mutate while keeping the original algorithm intact. That is, the code changes itself each time it runs, but the function of the code will not change at all...
. Modifying a piece of running code is also used in certain attacks, such as buffer overflow
Buffer overflow
In computer security and programming, a buffer overflow, or buffer overrun, is an anomaly where a program, while writing data to a buffer, overruns the buffer's boundary and overwrites adjacent memory. This is a special case of violation of memory safety....
s.
Self-referential machine learning systems
Traditional machine learningMachine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...
systems have a fixed, pre-programmed learning algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
to adjust their parameter
Parameter
Parameter from Ancient Greek παρά also “para” meaning “beside, subsidiary” and μέτρον also “metron” meaning “measure”, can be interpreted in mathematics, logic, linguistics, environmental science and other disciplines....
s. However, since the 1980s Jürgen Schmidhuber
Jürgen Schmidhuber
Jürgen Schmidhuber is a computer scientist and artist known for his work on machine learning, universal Artificial Intelligence , artificial neural networks, digital physics, and low-complexity art. His contributions also include generalizations of Kolmogorov complexity and the Speed Prior...
has published several self-modifying systems with the ability to change their own learning algorithm. They avoid the danger of catastrophic self-rewrites by making sure that self-modifications will survive only if they are useful according to a user-given fitness
Fitness function
A fitness function is a particular type of objective function that is used to summarise, as a single figure of merit, how close a given design solution is to achieving the set aims....
, error
Error function
In mathematics, the error function is a special function of sigmoid shape which occurs in probability, statistics and partial differential equations...
or reward function.
Operating systems
Because of the security implications of self-modifying code, all of the major operating systemOperating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...
s are careful to remove such vulnerabilities as they become known. The concern is typically not that programs will intentionally modify themselves, but that they could be maliciously changed by an exploit
Exploit (computer security)
An exploit is a piece of software, a chunk of data, or sequence of commands that takes advantage of a bug, glitch or vulnerability in order to cause unintended or unanticipated behavior to occur on computer software, hardware, or something electronic...
.
As consequence of the troubles that can be caused by these exploits, an OS feature called W^X
W^X
W^X is the name of a security feature present in the OpenBSD operating system. It is a memory protection policy whereby every page in a process' address space is either writable or executable, but not both simultaneously...
(for "write xor execute") has been developed which prohibits a program from making any page of memory both writable and executable. Some systems prevent a writable page from ever being changed to be executable, even if write permission is removed. Other systems provide a 'back door' of sorts, allowing multiple mappings of a page of memory to have different permissions. A relatively portable way to bypass W^X is to create a file with all permissions, then map the file into memory twice. On Linux, one may use an undocumented SysV shared memory flag to get executable shared memory without needing to create a file.
Regardless, at a meta-level, programs can still modify their own behavior by changing data stored elsewhere (see metaprogramming
Metaprogramming
Metaprogramming is the writing of computer programs that write or manipulate other programs as their data, or that do part of the work at compile time that would otherwise be done at runtime...
) or via use of polymorphism
Type polymorphism
In computer science, polymorphism is a programming language feature that allows values of different data types to be handled using a uniform interface. The concept of parametric polymorphism applies to both data types and functions...
.
Interaction of cache and self-modifying code
On architectures without coupled data and instruction cache (some ARM and MIPS cores) the cache synchronization must be explicitly performed by the modifying code (flush data cache and invalidate instruction cache for the modified memory area).In some cases short sections of self-modifying code execute more slowly on modern processors. This is because a modern processor will usually try to keep blocks of code in its cache memory. Each time the program rewrites a part of itself, the rewritten part must be loaded into the cache again, which results in a slight delay, if the modified codelet shares the same cache line with the modifying code, as is the case when the modified memory address is located within a few bytes to the one of the modifying code.
The cache invalidation issue on modern processors usually means that self-modifying code would still be faster only when the modification will occur rarely, such as in the case of a state switching inside an inner loop.
Most modern processors load the machine code before they execute it, which means that if an instruction that is too near the instruction pointer is modified, the processor will not notice, but instead execute the code as it was before it was modified. See prefetch input queue
Prefetch input queue
Fetching the instruction opcodes from program memory well in advance is known as prefetching and it is served by using prefetch input queue .The pre-fetched instructions are stored in data structure Queue. The fetching of opcodes well in advance, prior to their need for execution increases the...
(PIQ). PC processors must handle self-modifying code correctly for backwards compatibility reasons but they are far from efficient at doing so.
Massalin's Synthesis kernel
The Synthesis kernel presented in Dr. Alexia MassalinAlexia Massalin
Alexia Massalin is an American computer scientist and programmer. Massalin pioneered the concept of superoptimization, and designed the Synthesis kernel...
's Ph.D.
Doctor of Philosophy
Doctor of Philosophy, abbreviated as Ph.D., PhD, D.Phil., or DPhil , in English-speaking countries, is a postgraduate academic degree awarded by universities...
thesis is a tiny Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...
kernel that takes a structured
Structured programming
Structured programming is a programming paradigm aimed on improving the clarity, quality, and development time of a computer program by making extensive use of subroutines, block structures and for and while loops - in contrast to using simple tests and jumps such as the goto statement which could...
, or even object oriented
Object-oriented programming
Object-oriented programming is a programming paradigm using "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs. Programming techniques may include features such as data abstraction,...
, approach to self-modifying code, where code is created for individual quajects, like filehandles; generating code for specific tasks allows the Synthesis kernel to (as a JIT interpreter might) apply a number of optimizations
Compiler optimization
Compiler optimization is the process of tuning the output of a compiler to minimize or maximize some attributes of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied...
such as constant folding
Constant folding
Constant folding and constant propagation are related compiler optimizations used by many modern compilers. An advanced form of constant propagation known as sparse conditional constant propagation can more accurately propagate constants and simultaneously remove dead code.- Constant folding...
or common subexpression elimination
Common subexpression elimination
In computer science, common subexpression elimination is a compiler optimization that searches for instances of identical expressions , and analyses whether it is worthwhile replacing them with a single variable holding the computed value.- Example :In the following code: a = b * c + g; d = b * c...
.
The Synthesis kernel was extremely fast, but was written entirely in assembly. The resulting lack of portability has prevented Massalin's optimization ideas from being adopted by any production kernel. However, the structure of the techniques suggests that they could be captured by a higher level language
Programming language
A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....
, albeit one more complex than existing mid-level languages. Such a language and compiler could allow development of faster operating systems and applications.
Paul Haeberli and Bruce Karsh have objected to the "marginalization" of self-modifying code, and optimization in general, in favor of reduced development costs.
Advantages
- Fast pathFast pathFast path is a term used in computer science to describe a path with shorterinstruction path length through a program compared to the 'normal' path. For a fast path to be effective it must handle the most commonly occurring tasks more efficiently than the 'normal' path, leaving the latter to handle...
s can be established for a program's execution, reducing some otherwise repetitive conditional branches. - Self-modifying code can improve algorithmic efficiencyAlgorithmic efficiencyIn computer science, efficiency is used to describe properties of an algorithm relating to how much of various types of resources it consumes. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process, where the goal is to reduce...
.
Disadvantages
Self-modifying code is seen by some as a bad practiceBest practice
A best practice is a method or technique that has consistently shown results superior to those achieved with other means, and that is used as a benchmark...
because it makes code harder to read and maintain. There are however ways in which self modification is nevertheless deemed acceptable, such as when function pointer
Function pointer
A function pointer is a type of pointer in C, C++, D, and other C-like programming languages, and Fortran 2003. When dereferenced, a function pointer can be used to invoke a function and pass it arguments just like a normal function...
s are dynamically altered, even though the effect is almost identical to direct modification. The subtle difference, in this case, is that a pointer variable
Variable (programming)
In computer programming, a variable is a symbolic name given to some known or unknown quantity or information, for the purpose of allowing the name to be used independently of the information it represents...
is altered, not actual program instructions. The change to the pointer is, in this case, equivalent to the setting of a 'flag
Flag (computing)
In computer programming, flag can refer to one or more bits that are used to store a binary value or code that has an assigned meaning, but can refer to uses of other data types...
' (that might have been set as an alternative), except that the flag does not need to be tested each time thereafter.
At the machine instruction level, frequently self-modifying code can cause significant performance degradation on modern processors. The most common method of handling correctness in the case of self-modifying code in the processor is a full pipeline
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....
flush, which carries a far stiffer penalty than branch misprediction recovery.
Self-modifying code cannot be used at all in some environments.
The most common such environments are:
- Application software running under an operating system with strict W^X security cannot execute instructions in pages it is allowed to write to—only the operating system itself is allowed to both write instructions to memory and later execute those instructions.
- Many Harvard architectureHarvard architectureThe Harvard architecture is a computer architecture with physically separate storage and signal pathways for instructions and data. The term originated from the Harvard Mark I relay-based computer, which stored instructions on punched tape and data in electro-mechanical counters...
microcontrollers cannot execute instructions in RAM, but only instructions in memory that it cannot write to—ROM or non-self-programmable flash memory.
See also
- Algorithmic efficiencyAlgorithmic efficiencyIn computer science, efficiency is used to describe properties of an algorithm relating to how much of various types of resources it consumes. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or continuous process, where the goal is to reduce...
- evalEvalIn some programming languages, eval is a function which evaluates a string as though it were an expression and returns a result; in others, it executes multiple lines of code as though they had been included instead of the line including the eval...
statement - Just-in-time compilationJust-in-time compilationIn computing, just-in-time compilation , also known as dynamic translation, is a method to improve the runtime performance of computer programs. Historically, computer programs had two modes of runtime operation, either interpreted or static compilation...
: This technique can often give users many of the benefits of self-modifying code (except memory size) without the disadvantages. - PCASTLPCASTLThe PCASTL is an interpreted high-level programming language. It was created in 2008 by Philippe Choquette. The PCASTL is designed to ease the writing of self-modifying code...
- Quine (computing)
- Self-replicationSelf-replicationSelf-replication is any behavior of a dynamical system that yields construction of an identical copy of that dynamical system. Biological cells, given suitable environments, reproduce by cell division. During cell division, DNA is replicated and can be transmitted to offspring during reproduction...
- Reflection (computer science)Reflection (computer science)In computer science, reflection is the process by which a computer program can observe and modify its own structure and behavior at runtime....
External links
- Using self-modifying code under Linux
- Self-modifying C code (Web Archive version)
- "Synthesis: An Efficient Implementation of Fundamental Operating System Services": Alexia Massalin's Ph.D. thesis on the Synthesis kernel
- Futurist Programming
- Certified Self-Modifying Code
- Jürgen SchmidhuberJürgen SchmidhuberJürgen Schmidhuber is a computer scientist and artist known for his work on machine learning, universal Artificial Intelligence , artificial neural networks, digital physics, and low-complexity art. His contributions also include generalizations of Kolmogorov complexity and the Speed Prior...
's publications on self-modifying code for self-referential machine learning systems