PolyMAC-1305 [was Re: Pelican MAC [speed and vectors]]
From: Tom St Denis (tomstdenis_at_gmail.com)
Date: 03/27/05
- Next message: Phil Carmody: "Re: VMPC free"
- Previous message: john_ramsden_at_sagitta-ps.com: "Re: SF: Incorrect equations"
- In reply to: Tom St Denis: "Re: Pelican MAC [speed and vectors]"
- Next in thread: Tom St Denis: "Re: PolyMAC-1305 [was Re: Pelican MAC [speed and vectors]]"
- Reply: Tom St Denis: "Re: PolyMAC-1305 [was Re: Pelican MAC [speed and vectors]]"
- Reply: D. J. Bernstein: "Re: Poly1305-AES"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Date: 27 Mar 2005 04:49:22 -0800
Tom St Denis wrote:
> Three times faster in ISO C? ... Tom investigates... I always
assumed
> your speed was for ASM coding. Looking at your code you're using
> "doubles" ... hehehe. I can't use those in LTC.
>
> I'll look at adding PolyMAC though.
Been looking it over and the best way I can think to implement it is to
borrow code from TomsFastMath [sans the inline asm [*]] and add a flag
to detect 64-bit platforms.
So to compute the inside product it'll need either a 5x5 or 3x3
multiplier. On the k7 a multiplication takes 6 cycles which is 150
cycles per loop [9 cycles per byte].
On the k8 it's 3x3 multiplications of 8 cycles each which is 72 cycles
[4.5 cycles per byte].
Reduction modulo 2^130 - 5 requires a shift by 2^130 and a modulo by
2^130 [e.g. AND the most sigificant digit], then you multiply the
shifted value by five [or shift/add as it's just x<<2 + x] and add the
two.
let h = 2^130 - 6 represent the largest current MAC value
let r = 2^130 - 6 represent the largest [**] r value
let c = 2^129 - 1 represent the largest msg value
then the reduction is
h' == (h + c) * r
q == h' >> 130
r == h' % 2^130
h == 5q + h
Now h - p == 8847341539944400050047739793225973497823
You have to reduce it twice [at most]
[res=680564733841876926926749214863536422909]
So on a 64-bit platform you need 2*(6shifts + 6adds + 3subs) to reduce
the number. Assuming all single cycle and no stalls that's 30 cyles [2
cycles/byte]. So bare min without using the FPU is sitting around 6.5
cycles/byte on an AMD64. [well some of the reduction could be done in
parallel I think].
On the k7 though the multiplications still put you at 9 cycles/byte so
I don't see how you claim 3.2 cycles/byte vs the 8 cycles/byte my
Pelican implementation in PORTABLE C gets.
The timings I have so far are
Athlon XP-M Barton:
OMAC , 35/byte
PMAC , 37/byte
PELICAN, 13/byte
Athlon64:
OMAC , 24/byte
PMAC , 24/byte
PELICAN, 8/byte
Both are based on stock "optimized for speed" builds of the current
cvssnapshot of LTC 1.01.
I'm still going to try and implement it but I seriously doubt portable
code will be much faster than Pelican.
Tom
- Next message: Phil Carmody: "Re: VMPC free"
- Previous message: john_ramsden_at_sagitta-ps.com: "Re: SF: Incorrect equations"
- In reply to: Tom St Denis: "Re: Pelican MAC [speed and vectors]"
- Next in thread: Tom St Denis: "Re: PolyMAC-1305 [was Re: Pelican MAC [speed and vectors]]"
- Reply: Tom St Denis: "Re: PolyMAC-1305 [was Re: Pelican MAC [speed and vectors]]"
- Reply: D. J. Bernstein: "Re: Poly1305-AES"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Relevant Pages
|