Re: LibTomMath forked [SSE2 addons]

From: Arnold Stang (stang13_at_anonymous.to)
Date: 06/30/04


Date: Wed, 30 Jun 2004 16:49:55 GMT


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Tom St Denis wrote:

| This is funny.... check this out
|
|
|>Now with SSE2 [and ICC -O3 -xN -ip]
|>
|>Timing Multiplying:
|> 4 digits: 400 cycles
|> 8 digits: 764 cycles
|> 12 digits: 1280 cycles
|> 16 digits: 1868 cycles
|> 20 digits: 3068 cycles
|> 24 digits: 4020 cycles
|> 28 digits: 5146 cycles
|> 32 digits: 6382 cycles
|> 36 digits: 7766 cycles
|>Timing Squaring:
|> 4 digits: 344 cycles
|> 8 digits: 644 cycles
|> 12 digits: 1056 cycles
|> 16 digits: 1550 cycles
|> 20 digits: 2202 cycles
|> 24 digits: 2866 cycles
|> 28 digits: 3566 cycles
|> 32 digits: 4310 cycles
|> 36 digits: 5122 cycles
|
|
| Now with GCC [v3.3.3, -march=pentium4 -mcpu=pentium4 -O3
| -fomit-frame-pointer -funroll-loops]
|
| Timing Multiplying:
| 4 digits: 464 cycles
| 8 digits: 740 cycles
| 12 digits: 1118 cycles
| 16 digits: 1616 cycles
| 20 digits: 2162 cycles
| 24 digits: 2858 cycles
| 28 digits: 3590 cycles
| 32 digits: 4400 cycles
| 36 digits: 5478 cycles
| Timing Squaring:
| 4 digits: 460 cycles
| 8 digits: 766 cycles
| 12 digits: 1076 cycles
| 16 digits: 1460 cycles
| 20 digits: 1848 cycles
| 24 digits: 2300 cycles
| 28 digits: 2824 cycles
| 32 digits: 3514 cycles
| 36 digits: 4196 cycles
|
| Seems GCC optimizes the code better than Intels own C compiler...
|
| Ok I'm done now ;-) Sorry about all the posts... but it's just fun
hacking
| and might benefit someone.
|
| Tom

Please, Tom --- take this to the sandbox.

Your friend, Arnold
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2rc1-nr1 (Windows XP)
Comment: Trusted Jewelers Since 1870

iD8DBQFA4u5HdF+zGr+78AURAv0VAKDisRHMrZtx7z3WE0ys1KSYnqm8swCg4Ptg
Gu1VKwxi/XhX0puU9ihweR4=
=eI17
-----END PGP SIGNATURE-----



Relevant Pages

  • Re: LibTomMath forked [SSE2 addons]
    ... Some more timing stuff... ... Timing Multiplying: ... digits: 1010 cycles ...
    (sci.crypt)
  • Re: LibTomMath forked [SSE2 addons]
    ... > Timing Multiplying: ... > Timing Squaring: ... digits: 464 cycles ...
    (sci.crypt)
  • Re: Hex to ascii
    ... Subtracting to detect digits will average 5 iterations/digit, ... If all iterations except the final one is correctly predicted, then we get an absolute minimum of 5 cycles + a branch miss costing anywhere from 5 to 20 cycles (depending upon the current cpu architecture). ... Since my code needs 4 MUL operations, each of which costs 10-11 cycles on a P4, the entire code will take about 60-80 cycles on the same P4, while on any AMD or P6 style cpu we're down in the 30-50 cycle range. ...
    (comp.lang.asm.x86)
  • Re: Powers and logic
    ... quasi wrote: ... The problem was to find the sum of its digits. ... a terabyte of cycles is what -- half an hour of computing time? ...
    (sci.math)