PolyMAC-1305 [was Re: Pelican MAC [speed and vectors]]

From: Tom St Denis (tomstdenis_at_gmail.com)
Date: 03/27/05


Date: 27 Mar 2005 04:49:22 -0800


Tom St Denis wrote:
> Three times faster in ISO C? ... Tom investigates... I always
assumed
> your speed was for ASM coding. Looking at your code you're using
> "doubles" ... hehehe. I can't use those in LTC.
>
> I'll look at adding PolyMAC though.

Been looking it over and the best way I can think to implement it is to
borrow code from TomsFastMath [sans the inline asm [*]] and add a flag
to detect 64-bit platforms.

So to compute the inside product it'll need either a 5x5 or 3x3
multiplier. On the k7 a multiplication takes 6 cycles which is 150
cycles per loop [9 cycles per byte].

On the k8 it's 3x3 multiplications of 8 cycles each which is 72 cycles
[4.5 cycles per byte].

Reduction modulo 2^130 - 5 requires a shift by 2^130 and a modulo by
2^130 [e.g. AND the most sigificant digit], then you multiply the
shifted value by five [or shift/add as it's just x<<2 + x] and add the
two.

let h = 2^130 - 6 represent the largest current MAC value
let r = 2^130 - 6 represent the largest [**] r value
let c = 2^129 - 1 represent the largest msg value

then the reduction is

h' == (h + c) * r
q == h' >> 130
r == h' % 2^130
h == 5q + h

Now h - p == 8847341539944400050047739793225973497823

You have to reduce it twice [at most]
[res=680564733841876926926749214863536422909]

So on a 64-bit platform you need 2*(6shifts + 6adds + 3subs) to reduce
the number. Assuming all single cycle and no stalls that's 30 cyles [2
cycles/byte]. So bare min without using the FPU is sitting around 6.5
cycles/byte on an AMD64. [well some of the reduction could be done in
parallel I think].

On the k7 though the multiplications still put you at 9 cycles/byte so
I don't see how you claim 3.2 cycles/byte vs the 8 cycles/byte my
Pelican implementation in PORTABLE C gets.

The timings I have so far are

Athlon XP-M Barton:
OMAC , 35/byte
PMAC , 37/byte
PELICAN, 13/byte

Athlon64:
OMAC , 24/byte
PMAC , 24/byte
PELICAN, 8/byte

Both are based on stock "optimized for speed" builds of the current
cvssnapshot of LTC 1.01.

I'm still going to try and implement it but I seriously doubt portable
code will be much faster than Pelican.

Tom



Relevant Pages

  • Re: 64-bit AES
    ... AES128 206 cycles (12.9 cycles/byte) ... the code gets a bit ugly since some instructions end up in odd places so ...
    (sci.crypt)
  • Diffie Hellman questions: GF(2^m) Exponentiation Modulo a Trinomial
    ... on my PIII, or about 1 M cycles per exponentiation, with both base and modulus ... For instance a simple GFmultiplier takes n shifts and n ... > XORs which amounts to n^2 single precision shifts and n^2 single precision ...
    (sci.crypt)
  • Re: Fermi
    ... Doubles take four cycles ... Consider a single precision multiplier that calculates A * B = AB. ... so most of the vector units will only do half the throughput for DP values. ...
    (comp.arch)
  • Re: Shannon -- the stream cipher, that is
    ... beats RC4. ... cycles per 64-byte block, which makes for 4.5 cycles/byte. ... Yeah, but I think those are becoming extinct. ...
    (sci.crypt)
  • Re: A basic code-hardware-timing question
    ... cycles taken by a divider and multiplier within a ... It is possible to write a combinatorial multiplier or divider, ... to the next iteration of the set of calculations... ...
    (comp.lang.verilog)

Quantcast