Re: LibTomMath forked [SSE2 addons]

From: Tom St Denis (tom_at_securescience.net)
Date: 06/30/04


Date: Wed, 30 Jun 2004 15:54:49 GMT

Some more timing stuff... I've added a time mult/square to LTC and I plan to
re-write the timing demo in LTM itself...

Anyways....

In LTC-0.98 [work in progress branch] it outputs [without SSE2] the
following. Note that a "digit" is 28 bits so 36 digits is roughly 1000
bits.

Timing Multiplying:
  4 digits: 1010 cycles
  8 digits: 2588 cycles
 12 digits: 4894 cycles
 16 digits: 7662 cycles
 20 digits: 11352 cycles
 24 digits: 15834 cycles
 28 digits: 20848 cycles
 32 digits: 26470 cycles
 36 digits: 32726 cycles
Timing Squaring:
  4 digits: 740 cycles
  8 digits: 1652 cycles
 12 digits: 3132 cycles
 16 digits: 4822 cycles
 20 digits: 6910 cycles
 24 digits: 9190 cycles
 28 digits: 11472 cycles
 32 digits: 14540 cycles
 36 digits: 17796 cycles

Now with SSE2

Timing Multiplying:
  4 digits: 400 cycles
  8 digits: 764 cycles
 12 digits: 1280 cycles
 16 digits: 1868 cycles
 20 digits: 3068 cycles
 24 digits: 4020 cycles
 28 digits: 5146 cycles
 32 digits: 6382 cycles
 36 digits: 7766 cycles
Timing Squaring:
  4 digits: 344 cycles
  8 digits: 644 cycles
 12 digits: 1056 cycles
 16 digits: 1550 cycles
 20 digits: 2202 cycles
 24 digits: 2866 cycles
 28 digits: 3566 cycles
 32 digits: 4310 cycles
 36 digits: 5122 cycles

Same code [tuned via mcpu/march] on my Athlon XP-M

Timing Multiplying:
  4 digits: 430 cycles
  8 digits: 826 cycles
 12 digits: 1364 cycles
 16 digits: 2054 cycles
 20 digits: 2910 cycles
 24 digits: 3881 cycles
 28 digits: 5024 cycles
 32 digits: 6343 cycles
 36 digits: 8168 cycles
Timing Squaring:
  4 digits: 458 cycles
  8 digits: 782 cycles
 12 digits: 1225 cycles
 16 digits: 1765 cycles
 20 digits: 2439 cycles
 24 digits: 3157 cycles
 28 digits: 3969 cycles
 32 digits: 4917 cycles
 36 digits: 5882 cycles

So you can see with SSE2 optimizations the P4 is roughly 1.1x faster and
with ALU is 4x slower. If anyone doubts my rants about how weak the P4 ALU
is compared to the Athlon... here's your proof ;-)

This is also why I'll be including the patched "mpi.c" in subsequent
releases of LibTomCrypt.

Note: if anyone wants to write similar patches for their boxes [say Altivec
or whatever ARM has] I'll gladly host them on libtomcrypt.org like I have
my SSE2 patches.

Tom



Relevant Pages

  • Re: LibTomMath forked [SSE2 addons]
    ... Tom St Denis wrote: ... |>Timing Multiplying: ... |> 4 digits: 400 cycles ...
    (sci.crypt)
  • Re: LibTomMath forked [SSE2 addons]
    ... > Timing Multiplying: ... > Timing Squaring: ... digits: 464 cycles ...
    (sci.crypt)
  • Re: Hex to ascii
    ... Subtracting to detect digits will average 5 iterations/digit, ... If all iterations except the final one is correctly predicted, then we get an absolute minimum of 5 cycles + a branch miss costing anywhere from 5 to 20 cycles (depending upon the current cpu architecture). ... Since my code needs 4 MUL operations, each of which costs 10-11 cycles on a P4, the entire code will take about 60-80 cycles on the same P4, while on any AMD or P6 style cpu we're down in the 30-50 cycle range. ...
    (comp.lang.asm.x86)
  • Re: Powers and logic
    ... quasi wrote: ... The problem was to find the sum of its digits. ... a terabyte of cycles is what -- half an hour of computing time? ...
    (sci.math)
  • Re: Spartan 3 clock to output tristate timing
    ... is the cycle timing fixed? ... On write cycles the FPGA has to latch the write data before it ... into a 1-cycle pulse used as a clock enable to latch the write data. ...
    (comp.arch.fpga)