Salsa20 and SSE2?
From: Paul Rubin (//phr.cx_at_NOSPAM.invalid)
Date: 08/29/05
- Next message: mobius30: "Re: The importance of IVs"
- Previous message: Regis: "Re: Re-secured Algorithm?"
- Next in thread: tomstdenis_at_gmail.com: "Re: Salsa20 and SSE2?"
- Reply: tomstdenis_at_gmail.com: "Re: Salsa20 and SSE2?"
- Reply: D. J. Bernstein: "Re: Salsa20 and SSE2?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Date: 28 Aug 2005 18:44:44 -0700
I wonder if anyone (DJB?) has tried coding Salsa20 using SSE2
instructions along the lines of Matthijs van Duin's Altivec
implementation.
Say the input is x0,x1,...,x15. Let a,b,c,d be SSE2 registers holding
four 32-bit ints. Let "a <<< n" (three < signs) denote the value of a
after rotating each of the four ints leftward by n bits. Let "a <<<< n"
denote treating a as a single 128-bit quantity and rotating it left n bits.
The algorithm looks like (pseudocode):
# initial permutation
a = x0,x5,x10,x15
b = x4,x9,x14,x3
c = x8,x13,x2,x7
d = x12,x1,x6,x11
# ten pairs of rounds
for i in 1,2,3,4,5,6,7,8,9,10:
# round 1,3,...
b ^= (a+d) <<< 7
c ^= (b+a) <<< 9
d ^= (c+b) <<< 13
a ^= (d+c) <<< 18
# rearrange variables (parallel assignment)
a,b,c,d = a, (d <<<< 32), (b <<<< 32), (c <<<< 64)
# round 2, 4, ....
b ^= (a+c) <<< 7
d ^= (b+a) <<< 9
c ^= (d+b) <<< 13
a ^= (c+d) <<< 18
# (final permutation and add back to input omitted)
The initial permutation is just a bunch of 32-bit mov instructions
and SSE register loads.
The quarter-round operations look like:
# b ^= (a+d) <<< 7
movdqa t1, a # syntax is mov dst,src
paddd t1, d
pslld t2, t1, 7 # t2 = t1 << 7 (shift left)
psrld t1, t1, 25 # t1 >>= (32-7)
por t1, t2
pxor b, t1
and similarly for the other quarter-rounds. Note that we needed 3
instructions (PSLLD, PSRLD, POR) and an extra temp register to
accomplish a bit rotation.
Rearrangement of the variables uses the PSHUFD instruction:
# after the first set of quarter-rounds
pshufd t1, d, 0b01101100
pshufd c, d, 0b11000110
pshufd b, c, 0b01101100
b = t1 # re-alias
There's a similar arrangement after the second set of quarter rounds.
So we've used a total of seven sse2 registers:
a,b,c,d = the data being worked on
t1,t2 = two temp registers
b1 = alternate place for b (re-alias after each rearrangement,
saves a movdqa per round)
which works out rather nicely (ia32+sse has 8 sse registers, the
athlon-64 has 16).
Maybe some cycles can be saved by reordering somewhat. I'm not enough
of an asm whiz to understand where the stalls in this code will be or
how long they'll last. Or if the Athlon-64 have two sse pipes, maybe
the code can be unrolled to do two blocks simultaneously, using
twice as many registers.
Anyway, has this already been tried and is it a speedup on the usual
P3/P4/PM/Athlon{,-64}?
If not, I might give it a try, if no one else does.
- Next message: mobius30: "Re: The importance of IVs"
- Previous message: Regis: "Re: Re-secured Algorithm?"
- Next in thread: tomstdenis_at_gmail.com: "Re: Salsa20 and SSE2?"
- Reply: tomstdenis_at_gmail.com: "Re: Salsa20 and SSE2?"
- Reply: D. J. Bernstein: "Re: Salsa20 and SSE2?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Relevant Pages
|
|