Schönhage-Strassen algorithm
Encyclopedia
The Schönhage–Strassen algorithm is an asymptotically fast multiplication algorithm
Multiplication algorithm
A multiplication algorithm is an algorithm to multiply two numbers. Depending on the size of the numbers, different algorithms are in use...

 for large integer
Integer
The integers are formed by the natural numbers together with the negatives of the non-zero natural numbers .They are known as Positive and Negative Integers respectively...

s. It was developed by Arnold Schönhage
Arnold Schönhage
Arnold Schönhage is a mathematician and computer scientist and Professor Emeritus at Rheinische Friedrich-Wilhelms-Universität, Bonn. He was also professor in Tübingen and Konstanz...

 and Volker Strassen
Volker Strassen
Volker Strassen is a German mathematician, a professor emeritus in the department of mathematics and statistics at the University of Konstanz.-Biography:Strassen was born on April 29, 1936, in Düsseldorf-Gerresheim....

 in 1971. The run-time bit complexity is, in Big O notation
Big O notation
In mathematics, big O notation is used to describe the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. It is a member of a larger family of notations that is called Landau notation, Bachmann-Landau notation, or...

, O(N log N log log N). The algorithm uses recursive Fast Fourier transform
Fast Fourier transform
A fast Fourier transform is an efficient algorithm to compute the discrete Fourier transform and its inverse. "The FFT has been called the most important numerical algorithm of our lifetime ." There are many distinct FFT algorithms involving a wide range of mathematics, from simple...

s in rings
Ring (mathematics)
In mathematics, a ring is an algebraic structure consisting of a set together with two binary operations usually called addition and multiplication, where the set is an abelian group under addition and a semigroup under multiplication such that multiplication distributes over addition...

 with 22n + 1 elements, a specific type of number theoretic transform.

The Schönhage–Strassen algorithm was the asymptotically fastest multiplication method known from 1971 until 2007, when a new method, Fürer's algorithm
Fürer's algorithm
Fürer's algorithm is an integer multiplication algorithm for very large numbers possessing a very low asymptotic complexity. It was created in 2007 by Swiss mathematician Martin Fürer of Pennsylvania State University as an asymptotically faster algorithm than its predecessor, the...

, was announced with lower asymptotic complexity; however, Fürer's algorithm currently only achieves an advantage for astronomically large values and is not used in practice.

In practice the Schönhage–Strassen algorithm starts to outperform older methods such as Karatsuba and Toom–Cook multiplication for numbers beyond 2215 to 2217 (10,000 to 40,000 decimal digits). The GNU Multi-Precision Library
GNU Multi-Precision Library
The GNU Multiple Precision Arithmetic Library, also known as GMP, is a free library for arbitrary-precision arithmetic, operating on signed integers, rational numbers, and floating point numbers...

 uses it for values of at least 1728 to 7808 64-bit words (111,000 to 500,000 decimal digits), depending on architecture.

Applications of the Schönhage–Strassen algorithm include mathematical empiricism, such as the Great Internet Mersenne Prime Search
Great Internet Mersenne Prime Search
The Great Internet Mersenne Prime Search is a collaborative project of volunteers who use freely available computer software to search for Mersenne prime numbers. The project was founded by George Woltman, who also wrote the software Prime95 and MPrime for the project...

 and computing approximations of π, as well as practical applications such as Lenstra elliptic curve factorization
Lenstra elliptic curve factorization
The Lenstra elliptic curve factorization or the elliptic curve factorization method is a fast, sub-exponential running time algorithm for integer factorization which employs elliptic curves. For general purpose factoring, ECM is the third-fastest known factoring method...

, in which multiplication of polynomials with integer coefficients can be efficiently reduced to large integer multiplication.

Details

This section explains in detail how Schönhage–Strassen is implemented. It is based primarily on an overview of the method by Crandall and Pomerance in their Prime Numbers: A Computational Perspective. This variant differs somewhat from Schönhage's original method in that it exploits the discrete weighted transform
Discrete weighted transform
In mathematics, the discrete weighted transform is a variation on the discrete Fourier transform over arbitrary fields involving weighting the input before transforming it....

 to perform negacyclic convolutions more efficiently. Another source for detailed information is Knuth
Donald Knuth
Donald Ervin Knuth is a computer scientist and Professor Emeritus at Stanford University.He is the author of the seminal multi-volume work The Art of Computer Programming. Knuth has been called the "father" of the analysis of algorithms...

's The Art of Computer Programming.

Convolutions

Suppose we are multiplying two numbers like 123 and 456 using long multiplication with base B digits, but without performing any carrying. The result might look something like this:
1 2 3
× 4 5 6

6 12 18
5 10 15
4 8 12

4 13 28 27 18


This sequence (4, 13, 28, 27, 18) is called the acyclic or linear convolution of the two original sequences (1,2,3) and (4,5,6). Once you have the acyclic convolution of two sequences, computing the product of the original numbers is easy: you just perform the carrying (for example, in the rightmost column, you'd keep the 8 and add the 1 to the column containing 27). In the example this yields the correct product 56088.

There are two other types of convolutions that will be useful. Suppose the input sequences have n elements (here 3). Then the acyclic convolution has n+n−1 elements; if we take the rightmost n elements and add the leftmost n−1 elements, this produces the cyclic convolution:
28 27 18
+ 4 13

28 31 31


If we perform carrying on the cyclic convolution, the result is equivalent to the product of the inputs mod Bn − 1. In the example, 103 − 1 = 999, performing carrying on (28, 31, 31) yields 3141, and 3141 ≡ 56088 (mod 999).

Conversely, if we take the rightmost n elements and subtract the leftmost n−1 elements, this produces the negacyclic convolution
Negacyclic convolution
In mathematics, negacyclic convolution is a convolution between two vectors a and b.It is also called skew circular convolution or wrapped convolution. It results from multiplication of a skew circulant matrix, generated by vector a, with vector b.- See also :*Circular convolution theorem...

:
28 27 18
4 13

28 23 5


If we perform carrying on the negacyclic convolution, the result is equivalent to the product of the inputs mod Bn + 1. In the example, 103 + 1 = 1001, performing carrying on (28, 23, 5) yields 3035, and 3035 ≡ 56088 (mod 1001). The negacyclic convolution can contain negative numbers, which can be eliminated during carrying using borrowing, as is done in long subtraction.

Convolution theorem

Like other multiplication methods based on the Fast Fourier transform, Schönhage–Strassen depends fundamentally on the convolution theorem
Convolution theorem
In mathematics, the convolution theorem states that under suitableconditions the Fourier transform of a convolution is the pointwise product of Fourier transforms. In other words, convolution in one domain equals point-wise multiplication in the other domain...

, which provides an efficient way to compute the cyclic convolution of two sequences. It states that:
The cyclic convolution of two vectors can be found by taking the discrete Fourier transform
Discrete Fourier transform
In mathematics, the discrete Fourier transform is a specific kind of discrete transform, used in Fourier analysis. It transforms one function into another, which is called the frequency domain representation, or simply the DFT, of the original function...

 (DFT) of each of them, multiplying the resulting vectors element by element, and then taking the inverse discrete Fourier transform (IDFT).


Or in symbols:
CyclicConvolution(X, Y) = IDFT(DFT(X) · DFT(Y))


If we compute the DFT and IDFT using a fast Fourier transform
Fast Fourier transform
A fast Fourier transform is an efficient algorithm to compute the discrete Fourier transform and its inverse. "The FFT has been called the most important numerical algorithm of our lifetime ." There are many distinct FFT algorithms involving a wide range of mathematics, from simple...

 algorithm, and invoke our multiplication algorithm recursively to multiply the entries of the transformed vectors DFT(X) and DFT(Y), this yields an efficient algorithm for computing the cyclic convolution.

In this algorithm, it will be more useful to compute the negacyclic convolution; as it turns out, a slightly modified version of the convolution theorem (see discrete weighted transform
Discrete weighted transform
In mathematics, the discrete weighted transform is a variation on the discrete Fourier transform over arbitrary fields involving weighting the input before transforming it....

) can enable this as well. Suppose the vectors X and Y have length n, and a is a primitive root of unity of order
Order (group theory)
In group theory, a branch of mathematics, the term order is used in two closely related senses:* The order of a group is its cardinality, i.e., the number of its elements....

 2n (that is, a2n = 1 and a to all smaller powers is not 1). Then we can define a third vector A, called the weight vector, as:
A = (aj), 0 ≤ j < n
A−1 = (a−j), 0 ≤ j < n


Now, we can state:
NegacyclicConvolution(X, Y) = A−1 · IDFT(DFT(A · X) · DFT(A · Y))


In other words, it's the same as before except that the inputs are first multiplied by A, and the result is multiplied by A−1.

Choice of ring

The discrete Fourier transform is an abstract operation that can be performed in any algebraic ring
Ring (mathematics)
In mathematics, a ring is an algebraic structure consisting of a set together with two binary operations usually called addition and multiplication, where the set is an abelian group under addition and a semigroup under multiplication such that multiplication distributes over addition...

; typically it's performed in the complex numbers, but actually performing complex arithmetic to sufficient precision to ensure accurate results for multiplication is slow and error-prone. Instead, we will use the approach of the number theoretic transform, which is to perform the transform in the integers mod N for some integer N.

Just like there are primitive roots of unity of every order in the complex plane, given any order n we can choose a suitable N such that b is a primitive root of unity of order n in the integers mod N (in other words, bn ≡ 1 (mod N), and no smaller power of b is equivalent to 1 mod N).

The algorithm will spend most of its time performing recursive multiplications of smaller numbers; with a naive algorithm, these occur in a number of places:
  1. Inside the fast Fourier transform algorithm, where the primitive root of unity b is repeatedly powered, squared, and multiplied by other values.
  2. When taking powers of the primitive root of unity a to form the weight vector A and when multiplying A or A−1 by other vectors.
  3. When performing element-by-element multiplication of the transformed vectors.


The key insight to Schönhage–Strassen is to choose N, the modulus, to be equal to 2n + 1 for some integer n. This has a number of benefits in standard systems that represent large integers in binary form:
  • Any value can be rapidly reduced modulo 2n + 1 using only shifts and adds, as explained in the next section.
  • All primitive roots of unity in this ring can be written in the form 2k; consequently we can multiply or divide any number by a root of unity using a shift, and power or square a root of unity by operating only on its exponent.
  • The element-by-element recursive multiplications of the transformed vectors can be performed using a negacyclic convolution, which is faster than an acyclic convolution and already has "for free" the effect of reducing its result mod 2n + 1.


To make the recursive multiplications convenient, we will frame Schönhage–Strassen as being a specialized multiplication algorithm for computing not just the product of two numbers, but the product of two numbers mod 2n + 1 for some given n. This is not a loss of generality, since one can always choose n large enough so that the product mod 2n + 1 is simply the product.

Shift optimizations

In the course of the algorithm, there are many cases in which multiplication or division by a power of two (including all roots of unity) can be profitably replaced by a small number of shifts and adds. This makes use of the observation that:
k ≡ (−1)k (mod 2n + 1)

Note that a k-digit number in base 2n written in positional notation
Positional notation
Positional notation or place-value notation is a method of representing or encoding numbers. Positional notation is distinguished from other notations for its use of the same symbol for the different orders of magnitude...

 can be expressed as . It represents the number . Also note that for each , .

This makes it simple to reduce a number represented in binary mod 2n + 1: take the rightmost (least significant) n bits, subtract the next n bits, add the next n bits, and so on until the bits are exhausted. If the resulting value is still not between 0 and 2n, normalize it by adding or subtracting a multiple of the modulus 2n + 1. For example, if n=3 (and so the modulus is 23+1 = 9) and the number being reduced is 656, we have:
656 = 10100100002 ≡ 0002 − 0102 + 0102 − 12 = 0 − 2 + 2 − 1 = −1 ≡ 8 (mod 23 + 1).


Moreover, it's possible to effect very large shifts without ever constructing the shifted result. Suppose we have a number A between 0 and 2n, and wish to multiply it by 2k. Dividing k by n we find k = qn + r with r < n. It follows that:
A(2k) = A(2qn + r) = A[(2n)q(2r)] ≡ (−1)q(A shift-left r) (mod 2n + 1).


Since A is ≤ 2n and r < n, A shift-left r has at most 2n−1 bits, and so only one shift and subtraction (followed by normalization) is needed.

Finally, to divide by 2k, observe that squaring the first equivalence above yields:
22n ≡ 1 (mod 2n + 1)


Hence,
A/2k = A(2−k) ≡ A(22n − k) = A shift-left (2n − k) (mod 2n + 1).

Overview

The algorithm follows a split, evaluate (forward FFT), pointwise multiply, interpolate (inverse FFT), and combine phases similar to Karatsuba and Toom-Cook methods.

Given an input numbers x and y, and an integer N, the following algorithm computes the product xy mod 2N + 1. Provided N is sufficiently large this is simply the product.
  1. Split each input number into vectors X and Y of 2k parts each, where 2k divides N. (e.g. 12345678 -> (12, 34, 56, 78)).
  2. In order to make progress, it's necessary to use a smaller N for recursive multiplications. For this purpose choose n as the smallest integer at least 2N/2k + k and divisible by 2k.
  3. Compute the product of X and Y mod 2n + 1 using the negacyclic convolution:
    1. Multiply X and Y each by the weight vector A using shifts (shift the jth entry left by jn/2k).
    2. Compute the DFT of X and Y using the number-theoretic FFT (perform all multiplies using shifts; for the 2k-th root of unity, use 22n/2k).
    3. Recursively apply this algorithm to multiply corresponding elements of the transformed X and Y.
    4. Compute the IDFT of the resulting vector to get the result vector C (perform all multiplies using shifts). This corresponds to interpolation phase.
    5. Multiply the result vector C by A−1 using shifts.
    6. Adjust signs: some elements of the result may be negative. We compute the largest possible positive value for the jth element of C, (j + 1)22N/2k, and if it exceeds this we subtract the modulus 2n + 1.
  4. Finally, perform carrying mod 2N+1 to get the final result.


The optimal number of pieces to divide the input into is proportional to , where m is the number of input bits, and this setting achieves the running time of O(n log n log log n), so the parameter k should be set accordingly. In practice, it is set empirically based on the input sizes and the architecture, typically to a value between 4 and 16.

In step 2, the observation is used that:
  • Each element of the input vectors has at most n/2k bits;
  • The product of any two input vector elements has at most 2n/2k bits;
  • Each element of the convolution is the sum of at most 2k such products, and so cannot exceed 2n/2k + k bits.
  • n must be divisible by 2k to ensure that in the recursive calls the condition "2k divides N" holds in step 1.

Optimizations

This section explains a number of important practical optimizations that have been considered when implementing Schönhage–Strassen in real systems. It is based primarily on a 2007 work by Gaudry, Kruppa, and Zimmermann describing enhancements to the GNU Multi-Precision Library
GNU Multi-Precision Library
The GNU Multiple Precision Arithmetic Library, also known as GMP, is a free library for arbitrary-precision arithmetic, operating on signed integers, rational numbers, and floating point numbers...

.

Below a certain cutoff point, it's more efficient to perform the recursive multiplications using other algorithms, such as Toom–Cook multiplication. The results must be reduced mod 2n + 1, which can be done efficiently as explained above in Shift optimizations with shifts and adds/subtracts.

Computing the IDFT involves dividing each entry by the primitive root of unity 22n/2k, an operation that is frequently combined with multiplying the vector by A−1 afterwards, since both involve division by a power of two.

In a system where a large number is represented as an array of 2w-bit words, it's useful to ensure that the vector size 2k is also a multiple of the bits per word by choosing k ≥ w (e.g. choose k ≥ 5 on a 32-bit computer and k ≥ 6 on a 64-bit computer); this allows the inputs to be broken up into pieces without bit shifts, and provides a uniform representation for values mod 2n + 1 where the high word can only be zero or one.

Normalization involves adding or subtracting the modulus 2n+1; this value has only two bits set, which means this can be done in constant time on average with a specialized operation.

Iterative FFT algorithms such as the Cooley–Tukey FFT algorithm, although frequently used for FFTs on vectors of complex numbers, tend to exhibit very poor cache locality
Locality of reference
In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related storage locations being frequently accessed. There are two basic types of reference locality. Temporal locality refers to the reuse of specific data and/or resources...

with the large vector entries used in Schönhage–Strassen. The straightforward recursive, not in-place implementation of FFT is more successful, with all operations fitting in the cache beyond a certain point in the call depth, but still makes suboptimal use of the cache in higher call depths. Gaudry, Kruppa, and Zimmerman used a technique combining Bailey's 4-step algorithm with higher radix transforms that combine multiple recursive steps. They also mix phases, going as far into the algorithm as possible on each element of the vector before moving on to the next one.

The "square root of 2 trick", first described by Schönhage, is to note that, provided k ≥ 2, 23n/4−2n/4 is a square root of 2 mod 2n+1, and so a 4n-th root of unity (since 22n ≡ 1). This allows the transform length to be extended from 2k to 2k + 1.

Finally, the authors are careful to choose the right value of k for different ranges of input numbers, noting that the optimal value of k may go back and forth between the same values several times as the input size increases.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK