Re: LibTomMath forked [SSE2 addons]

From: Tom St Denis (tom_at_securescience.net)
Date: 06/30/04


Date: Wed, 30 Jun 2004 16:08:18 GMT

This is funny.... check this out

> Now with SSE2 [and ICC -O3 -xN -ip]
>
> Timing Multiplying:
> 4 digits: 400 cycles
> 8 digits: 764 cycles
> 12 digits: 1280 cycles
> 16 digits: 1868 cycles
> 20 digits: 3068 cycles
> 24 digits: 4020 cycles
> 28 digits: 5146 cycles
> 32 digits: 6382 cycles
> 36 digits: 7766 cycles
> Timing Squaring:
> 4 digits: 344 cycles
> 8 digits: 644 cycles
> 12 digits: 1056 cycles
> 16 digits: 1550 cycles
> 20 digits: 2202 cycles
> 24 digits: 2866 cycles
> 28 digits: 3566 cycles
> 32 digits: 4310 cycles
> 36 digits: 5122 cycles

Now with GCC [v3.3.3, -march=pentium4 -mcpu=pentium4 -O3
-fomit-frame-pointer -funroll-loops]

Timing Multiplying:
  4 digits: 464 cycles
  8 digits: 740 cycles
 12 digits: 1118 cycles
 16 digits: 1616 cycles
 20 digits: 2162 cycles
 24 digits: 2858 cycles
 28 digits: 3590 cycles
 32 digits: 4400 cycles
 36 digits: 5478 cycles
Timing Squaring:
  4 digits: 460 cycles
  8 digits: 766 cycles
 12 digits: 1076 cycles
 16 digits: 1460 cycles
 20 digits: 1848 cycles
 24 digits: 2300 cycles
 28 digits: 2824 cycles
 32 digits: 3514 cycles
 36 digits: 4196 cycles

Seems GCC optimizes the code better than Intels own C compiler...

Ok I'm done now ;-) Sorry about all the posts... but it's just fun hacking
and might benefit someone.

Tom



Relevant Pages

  • Re: LibTomMath forked [SSE2 addons]
    ... Some more timing stuff... ... Timing Multiplying: ... digits: 1010 cycles ...
    (sci.crypt)
  • Re: LibTomMath forked [SSE2 addons]
    ... Tom St Denis wrote: ... |>Timing Multiplying: ... |> 4 digits: 400 cycles ...
    (sci.crypt)
  • Re: Hex to ascii
    ... Subtracting to detect digits will average 5 iterations/digit, ... If all iterations except the final one is correctly predicted, then we get an absolute minimum of 5 cycles + a branch miss costing anywhere from 5 to 20 cycles (depending upon the current cpu architecture). ... Since my code needs 4 MUL operations, each of which costs 10-11 cycles on a P4, the entire code will take about 60-80 cycles on the same P4, while on any AMD or P6 style cpu we're down in the 30-50 cycle range. ...
    (comp.lang.asm.x86)
  • Re: Powers and logic
    ... quasi wrote: ... The problem was to find the sum of its digits. ... a terabyte of cycles is what -- half an hour of computing time? ...
    (sci.math)