Implements the proven optimal (in a restricted but reasonable
model^{[1]}) inversion sequence for
Toom-Cook 3x3, a shift less compared to current GMP^{[3]}
code, and no temporaries. For processors without a native `addlsh1`

implementation, *Bodrato'07*^{[2]} uses
also a proven optimal evaluation sequence, formulas can be also found
on the Toom-Cook optimal formulas page.
On my laptop (Centrino) the proposed code shows a slight
speed-up. Timing graphs are normalised with respect to
GMP-4.2.2. *Bodrato-Zanoni* code uses the new inversion,
*Bodrato'07* uses both optimisations.

Latest code *Bodrato'09* introduces a new unpublished trick
to avoid half an operation.

Behaviour can be deeply different on different architectures, particularly those which have native `addlsh1`

. If you try the
code and obtain much different results, please let us know:
Bodrato-Zanoni
<>.

**Licence:** GNU GPLv3+.

**Download:** gmp-4.2.2/mpn/generic/mul_n.c. (and a
patch which *can* be
applied to gmp-impl.h)

**References:** [1]*What
About Toom-Cook Matrices Optimality?* by Marco Bodrato and
Alberto Zanoni

[2]*Towards Optimal Toom-Cook Multiplication for Univariate and Multivariate Polynomials in Characteristic 2 and 0* by Marco Bodrato

[3]GMP - The GNU Multi Precision library

Implements the proven optimal (in a restricted but reasonable
model^{[1]}) inversion sequence for
Toom-3.5 (4x3 and 5x2). Some further optimisation are applied:
adaptatation to "gmp^{[3]} instruction set", and a final saving of an
operation by interlacing interpolation and recomposition phases. Some
optimisation are present also in the evaluation phase^{[2]}.

**Licence:** GNU GPLv3+.

**Download:** toom3_5_mul.c.

**References:** [1]*What
About Toom-Cook Matrices Optimality?* by Marco Bodrato and
Alberto Zanoni

[2]*Towards Optimal Toom-Cook Multiplication for Univariate and Multivariate Polynomials in Characteristic 2 and 0* by Marco Bodrato

[3]GMP - The GNU Multi Precision library

Implements a new experimental strategy, mixing evaluation, interpolation and recomposition phases, to reduce the memory footprint and the linear complexity.

Two versions are available, the first one evaluating in {1,2,4} using exact division by 3 and by 45, the second one evaluating in {1/2,1,2} with divisions by 9 and by 15.

**Licence:** GNU GPLv3+.

**Download:** toom4.5 eval in 1,2,4, toom4.5 eval in 1/2,1,2.

**References:** None yet.

Implements a new experimental strategy, mixing evaluation, interpolation and recomposition phases, to reduce the memory footprint and the linear complexity.

All unbalanced evaluations are condensed in a single function. This
means that the function `mpn_toom6h_mul` implements both Toom-6
and Toom-6.5; i.e. 6x6, 7x6, 7x5, 8x5, 8x4 and 9x4. But 9x3, 10x3,
10x2 and 11x2 are not implemented at all.

The function
`mpn_toom8h_mul` implements both Toom-8 and Toom-8.5; i.e. 8x8,
9x8, 9x7, 10x7, 10x6, 11x6 and 11x5; on 64-bits CPU it implements also
12x5, 12x4 and 13x4. But 13x3, 14x3, 14x2 and 15x2 are not implemented
at all.

A squaring functions are given too, skiping the evaluation at infinity.

WARNING! Dont trust the _itch functions! The functions are *almost* NAILS-ready but has never been tested…

**Licence:** GNU GPLv3+.

**Download:** toom6.5_31ott2009.tbz; toom8.5_31ott2009.tbz.

**References:**
*High degree Toom'n'half for balanced and unbalanced multiplication* by Marco Bodrato.

Graphs above where obtained on my laptop (Centrino) comparing a developement version of GMP, and MPIR Version 1.3.0-rc1. Both compiled with `./configure && make`, without tuning.

All the three graphs shows the value: "cycles used for a multiplication"/"number of limbs^{3/2}", limbs are 32-bits long.

**Observations:**

- Toom-6.5 has a narrow range where it is the fastest algorithm, I fixed for it a range from 250 to 370 limbs; for operands above 370 limbs Toom-8.5 was used.
- The highly optimised Toom-8.5, running within MPIR library, gets
faster wrt. both Toom-4 and Toom-7
**before**they start being faster than Toom-3! - The new Toom-8.5 code is faster than both current GMP and MPIR from around 300 limbs up to arount 8,000 limbs.
**Tests with different CPU/compiler may give drastically different results!**

It is quite hard to understand which one of the many unbalanced Toom is faster for a given couple of operands. To start investigation I produced a table reporting the fastest Toom-NxM algorithm present in (future versions of) GMP.

**Download:** toom.fastest-flavour.dvi.bz2 (DVI).

Some concrete experiments on Toom 4x4, 5x5, 6x6, 7x7, and 4x3-5x2
(Toom3.5), 5x3 (with Toom4) algorithms. The code for interpolation was
semi-automatically generated, with comments indicating limits every
variable value is between, as an indication for mpn optimisation, none
of the implemented method was proven optimal though.

Graphs refer to code executed on my laptop (Centrino), with speedups
around 20% for balanced and around 35% for unbalanced, for
middle-sized integers.

**Licence:** GNU GPLv3+.

**Download:** GMP implementation of Toom-Cook 5, 6, 7, and 3.5 as a tar bzip2 file.

**References:**
*Towards Optimal Toom-Cook Multiplication for Univariate and Multivariate Polynomials in Characteristic 2 and 0* by Marco Bodrato

GMP implementation of
Toom-Cook 4 by Paul Zimmermann

*What About Toom-Cook Matrices
Optimality?* by Marco Bodrato and Alberto Zanoni

Implements the same sequences as above, within the LibTomMath
environment. On my laptop this patch obsoletes the Karatsuba
multiplication, it gives in fact:

`KARATSUBA_MUL_CUTOFF = 128
TOOM_MUL_CUTOFF = 129`

Not the same for squaring, where the tuning suggests:

TOOM_SQR_CUTOFF = 186

The original library was included by many projects, but it seems that it is not maintained any more… Let me know if you need a patch for your project, probably with a different licence.

**Licence:** GNU GPLv3+.

**Download:** libtommath/bn_mp_toom_mul.c, libtommath/bn_mp_toom_sqr.c.

**References:** [1]*What
About Toom-Cook Matrices Optimality?* by Marco Bodrato and
Alberto Zanoni,

[2]

Marco Bodrato - 19 settembre 2010