Re: LibTomMath forked [SSE2 addons]

From: Tom St Denis (tom_at_securescience.net)
Date: 06/28/04


Date: Mon, 28 Jun 2004 18:07:57 GMT

Phil Carmody wrote:

> Tom St Denis <tom@securescience.net> writes:
>
>> I said I'd never put asm messiness in any mainline LT project... so I
>> forked LTM and created LTM-SSE. It's a very direct port of LTM that has
>> carefully dropped in SSE2 optimizations [in four files only, I might
>> add...]
>>
>> Here are the results [for Montgomery exptmod] with GCC 3.3.3 on a 2.8GHz
>> Northwood P4.
>>
>> LTM-SSE
>> CLK_PER_SEC == 2808950608
>> Exponentiating 513-bit => 611/sec, 5659540452 ticks
>>
>> LTM-031
>> CLK_PER_SEC == 2810915412
>> Exponentiating 513-bit => 269/sec, 5666527812 ticks
>
> That latter figure looks _way_ off the mark. A 512-bit (Barrett) expmod
> in plain C (using gcc) on my Duron/900 takes <1260000 ticks, which is
> 716/sec.

Let's review some facts though... Athlon has THREE ALUs and a 6-cycle
multiplier. P4 has 1 SSE port, [iirc] an 8-cycle [2 throughput, 8 latency]
multiplier, a 2-cycle latency add, etc, etc...
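
Since the two machines run at very different clock rates, it can help to put the quoted figures on a per-clock footing. The sketch below does only that arithmetic, using the CLK_PER_SEC and per-second numbers quoted above. The helper names are made up for illustration, and the Duron is assumed to run at a nominal 900 MHz.

```c
#include <assert.h>
#include <stdio.h>

/* Clock cycles spent per operation, given the clock rate in Hz
   and the measured throughput in operations per second. */
static double cycles_per_op(double clk_per_sec, double ops_per_sec)
{
    assert(ops_per_sec > 0.0);
    return clk_per_sec / ops_per_sec;
}

/* Print the three figures from the thread as cycles per exptmod. */
static void print_cycles(void)
{
    printf("LTM-SSE  P4/2.8GHz : %.0f cycles\n",
           cycles_per_op(2808950608.0, 611.0));
    printf("LTM-031  P4/2.8GHz : %.0f cycles\n",
           cycles_per_op(2810915412.0, 269.0));
    /* Assumption: Duron clock taken as exactly 900 MHz. */
    printf("plain C  Duron/900 : %.0f cycles\n",
           cycles_per_op(900e6, 716.0));
}
```

That works out to roughly 4.6M cycles for LTM-SSE, 10.4M for LTM-031 and 1.26M for the Duron figure, so the gap per clock is larger than the per-second numbers alone suggest.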

My later code [that I posted last night] was faster because I changed a
"movq" to a "movd", so I no longer broke alignment [loading with movq from
a non-aligned address has penalties; funny though that loading a full xmm
register non-aligned causes faults... ;-(]
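
For anyone following along, here is a minimal sketch of the alignment issue. It is not the LTM-SSE code. With SSE2 intrinsics, `_mm_load_si128` (movdqa) requires a 16-byte aligned address and faults otherwise, while `_mm_loadu_si128` (movdqu) accepts any address. The helper names `is_aligned` and `sum4_u32` are made up for illustration, and the non-SSE2 fallback is only there so the sketch builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Return 1 if p is a multiple of `align` bytes (align: power of two). */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Load 16 bytes from p and add up the four 32-bit lanes. */
static uint32_t sum4_u32(const void *p)
{
    uint32_t lanes[4];
#ifdef __SSE2__
    /* Aligned load only when it is safe; unaligned load otherwise. */
    __m128i v = is_aligned(p, 16)
              ? _mm_load_si128((const __m128i *)p)
              : _mm_loadu_si128((const __m128i *)p);
    _mm_storeu_si128((__m128i *)lanes, v);
#else
    memcpy(lanes, p, sizeof lanes);   /* portable fallback */
#endif
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

In a real inner loop you would not branch per load; you would arrange the data so the aligned form is always valid, or pick narrower loads as described above.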

IIRC I got the final code up to around 650/sec.

> Did you try a Barrett rather than Monty method? I considered Monty
> briefly, but preferred Barrett's simplicity.

It's Montgomery reduction [which is used in Comba mode].
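
As a reminder of what Montgomery reduction does, here is a single-word sketch with R = 2^32. It is a textbook REDC for one 32-bit odd modulus, not the multi-precision Comba routine in LTM, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compute -n^{-1} mod 2^32 for odd n by Newton iteration. */
static uint32_t mont_neg_inv(uint32_t n)
{
    assert(n & 1u);
    uint32_t x = n;                 /* n*n == 1 mod 8: 3 bits correct */
    for (int i = 0; i < 4; i++)
        x *= 2u - n * x;            /* correct bits: 6, 12, 24, 48 */
    return (uint32_t)0 - x;
}

/* REDC: for t < n * 2^32, return t * 2^-32 mod n. */
static uint32_t mont_redc(uint64_t t, uint32_t n, uint32_t nprime)
{
    uint32_t m = (uint32_t)t * nprime;   /* m = t * n' mod 2^32 */
    uint64_t mn = (uint64_t)m * n;
    uint64_t sum = t + mn;               /* low 32 bits are now zero */
    uint64_t carry = sum < t;            /* the sum can exceed 64 bits */
    uint64_t u = (sum >> 32) | (carry << 32);
    if (u >= n)
        u -= n;                          /* single final subtraction */
    return (uint32_t)u;
}

/* Convert a into Montgomery form: a * 2^32 mod n. */
static uint32_t mont_to(uint32_t a, uint32_t n)
{
    return (uint32_t)(((uint64_t)a << 32) % n);
}

/* Multiply two values that are already in Montgomery form. */
static uint32_t mont_mul(uint32_t am, uint32_t bm, uint32_t n, uint32_t np)
{
    return mont_redc((uint64_t)am * bm, n, np);
}
```

The point of the method is that the reduction needs only multiplies, adds and shifts, with no division by n inside the exponentiation loop. The one `%` above is only used to get into Montgomery form.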

Also, is your code MP or SP? Do you do explicit error checks or passive
ones? Etc, etc, etc.

Let's not forget that LTM is a full-fledged MP bignum lib [e.g. has to
maintain size/allocs for numbers] that does explicit error checking [e.g.
thread-safe error detection].
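
To illustrate the kind of overhead meant here, the sketch below shows a tiny variable-size integer where every routine tracks its allocation and returns an error code instead of setting a global. It is only an illustration: the `big_t` type, the `BIG_*` codes and the function names are hypothetical and are not LTM's API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { BIG_OKAY = 0, BIG_MEM = -2 };   /* hypothetical error codes */

typedef struct {
    uint32_t *dp;    /* digits, least significant first */
    size_t used;     /* digits currently in use */
    size_t alloc;    /* digits allocated */
} big_t;

static int big_init(big_t *a)
{
    assert(a != NULL);
    a->dp = NULL;
    a->used = 0;
    a->alloc = 0;
    return BIG_OKAY;
}

/* Ensure room for `digits` digits; report failure to the caller. */
static int big_grow(big_t *a, size_t digits)
{
    if (digits > a->alloc) {
        uint32_t *tmp = realloc(a->dp, digits * sizeof *tmp);
        if (tmp == NULL)
            return BIG_MEM;            /* a is left unchanged */
        for (size_t i = a->alloc; i < digits; i++)
            tmp[i] = 0;                /* zero the new digits */
        a->dp = tmp;
        a->alloc = digits;
    }
    return BIG_OKAY;
}

/* Set a to a 64-bit value, propagating any allocation error. */
static int big_set_u64(big_t *a, uint64_t v)
{
    int err = big_grow(a, 2);
    if (err != BIG_OKAY)
        return err;
    a->dp[0] = (uint32_t)v;
    a->dp[1] = (uint32_t)(v >> 32);
    a->used = a->dp[1] ? 2 : (a->dp[0] ? 1 : 0);
    return BIG_OKAY;
}

static void big_clear(big_t *a)
{
    free(a->dp);
    a->dp = NULL;
    a->used = 0;
    a->alloc = 0;
}
```

Because the status comes back through the return value, no shared state is involved, which is what makes this style safe across threads. The cost is a check after every call, which a fixed-size routine without checks does not pay.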

Tom