Re: LibTomMath forked [SSE2 addons]
From: Tom St Denis (tom_at_securescience.net)
Date: 06/30/04
- Next message: Michael Amling: "IND-CCA2 public only?"
- Previous message: Michael Amling: "Re: Beginner Qn: Encrypting small data"
- In reply to: Tom St Denis: "Re: LibTomMath forked [SSE2 addons]"
- Next in thread: Tom St Denis: "Re: LibTomMath forked [SSE2 addons]"
- Reply: Tom St Denis: "Re: LibTomMath forked [SSE2 addons]"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Date: Wed, 30 Jun 2004 15:54:49 GMT
Some more timing results... I've added multiply/square timing to LTC, and I
plan to rewrite the timing demo in LTM itself...
Anyways....
The LTC-0.98 build [work-in-progress branch] outputs the following on the P4
[without SSE2]. Note that a "digit" is 28 bits, so 36 digits is 1008 bits,
i.e. roughly 1000 bits.
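For anyone mapping the digit counts in the tables to key sizes, here is a minimal sketch of the conversion. It assumes only the 28-bit digit size stated above; the function name bits_for_digits is made up for illustration and is not part of LTM or LTC.

```python
# Assumption: 28-bit digits, as stated in the post.
DIGIT_BITS = 28


def bits_for_digits(digits: int, digit_bits: int = DIGIT_BITS) -> int:
    """Return the operand size in bits for a given number of digits."""
    return digits * digit_bits


# Print the operand size for every digit count used in the timing tables.
for digits in range(4, 37, 4):
    print(f"{digits:2d} digits: {bits_for_digits(digits):4d} bits")
```

So the largest entry in each table (36 digits) corresponds to 1008-bit operands.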
Timing Multiplying:
4 digits: 1010 cycles
8 digits: 2588 cycles
12 digits: 4894 cycles
16 digits: 7662 cycles
20 digits: 11352 cycles
24 digits: 15834 cycles
28 digits: 20848 cycles
32 digits: 26470 cycles
36 digits: 32726 cycles
Timing Squaring:
4 digits: 740 cycles
8 digits: 1652 cycles
12 digits: 3132 cycles
16 digits: 4822 cycles
20 digits: 6910 cycles
24 digits: 9190 cycles
28 digits: 11472 cycles
32 digits: 14540 cycles
36 digits: 17796 cycles
Now with SSE2
Timing Multiplying:
4 digits: 400 cycles
8 digits: 764 cycles
12 digits: 1280 cycles
16 digits: 1868 cycles
20 digits: 3068 cycles
24 digits: 4020 cycles
28 digits: 5146 cycles
32 digits: 6382 cycles
36 digits: 7766 cycles
Timing Squaring:
4 digits: 344 cycles
8 digits: 644 cycles
12 digits: 1056 cycles
16 digits: 1550 cycles
20 digits: 2202 cycles
24 digits: 2866 cycles
28 digits: 3566 cycles
32 digits: 4310 cycles
36 digits: 5122 cycles
Same code [tuned via mcpu/march] on my Athlon XP-M
Timing Multiplying:
4 digits: 430 cycles
8 digits: 826 cycles
12 digits: 1364 cycles
16 digits: 2054 cycles
20 digits: 2910 cycles
24 digits: 3881 cycles
28 digits: 5024 cycles
32 digits: 6343 cycles
36 digits: 8168 cycles
Timing Squaring:
4 digits: 458 cycles
8 digits: 782 cycles
12 digits: 1225 cycles
16 digits: 1765 cycles
20 digits: 2439 cycles
24 digits: 3157 cycles
28 digits: 3969 cycles
32 digits: 4917 cycles
36 digits: 5882 cycles
So you can see that with the SSE2 optimizations the P4 is roughly 1.1x faster
than the Athlon, and with plain ALU code it is about 4x slower [32726 vs 8168
cycles for a 36-digit multiply]. If anyone doubts my rants about how weak the
P4 ALU is compared to the Athlon... here's your proof ;-)
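The ratios can be checked directly from the 36-digit rows of the tables above. This is just a sketch of the arithmetic: the cycle counts are copied from this post, while the dictionary keys and the summarize helper are made-up names for illustration.

```python
# Cycle counts at 36 digits, copied from the tables in this post.
CYCLES_36 = {
    "mult": {"p4_alu": 32726, "p4_sse2": 7766, "athlon": 8168},
    "sqr": {"p4_alu": 17796, "p4_sse2": 5122, "athlon": 5882},
}


def summarize(op: str) -> dict:
    """Return the P4-vs-Athlon cycle ratios for one operation."""
    c = CYCLES_36[op]
    return {
        # Greater than 1 means the P4 ALU code needs more cycles than the Athlon.
        "p4_alu_vs_athlon": c["p4_alu"] / c["athlon"],
        # Greater than 1 means the P4 SSE2 code needs fewer cycles than the Athlon.
        "athlon_vs_p4_sse2": c["athlon"] / c["p4_sse2"],
    }


for op in ("mult", "sqr"):
    r = summarize(op)
    print(f"{op}: P4 ALU is {r['p4_alu_vs_athlon']:.2f}x slower, "
          f"P4 SSE2 is {r['athlon_vs_p4_sse2']:.2f}x faster")
```

On these numbers the multiply ratios come out near 4.0x and 1.05x, and the squaring ratios near 3.0x and 1.15x, so the "4x" figure is driven by the multiply timings.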
This is also why I'll be including the patched "mpi.c" in subsequent
releases of LibTomCrypt.
Note: if anyone wants to write similar patches for their boxes [say AltiVec
or whatever ARM has], I'll gladly host them on libtomcrypt.org alongside my
SSE2 patches.
Tom