Salsa20 and SSE2?

From: Paul Rubin (//phr.cx_at_NOSPAM.invalid)
Date: 08/29/05


Date: 28 Aug 2005 18:44:44 -0700

I wonder if anyone (DJB?) has tried coding Salsa20 using SSE2
instructions along the lines of Matthijs van Duin's Altivec
implementation.

Say the input is x0,x1,...,x15. Let a,b,c,d be SSE2 registers holding
four 32-bit ints. Let "a <<< n" (three < signs) denote the value of a
after rotating each of the four ints leftward by n bits. Let "a <<<< n"
denote treating a as a single 128-bit quantity and rotating it left n bits.

The algorithm looks like (pseudocode):

    # initial permutation
    a = x0,x5,x10,x15
    b = x4,x9,x14,x3
    c = x8,x13,x2,x7
    d = x12,x1,x6,x11

    # ten pairs of rounds
    for i in 1,2,3,4,5,6,7,8,9,10:
       # round 1,3,...
       b ^= (a+d) <<< 7
       c ^= (b+a) <<< 9
       d ^= (c+b) <<< 13
       a ^= (d+c) <<< 18

       # rearrange variables (parallel assignment)
       a,b,c,d = a, (d <<<< 32), (b <<<< 32), (c <<<< 64)

       # round 2, 4, ....
       b ^= (a+c) <<< 7
       d ^= (b+a) <<< 9
       c ^= (d+b) <<< 13
       a ^= (c+d) <<< 18

    # (final permutation and add back to input omitted)

The initial permutation is just a bunch of 32-bit mov instructions
and SSE register loads.

The quarter-round operations look like:

   # b ^= (a+d) <<< 7
   movdqa t1, a # syntax is mov dst,src
   paddd t1, d
   pslld t2, t1, 7 # t2 = t1 << 7 (shift left)
   psrld t1, t1, 25 # t1 >>= (32-7)
   por t1, t2
   pxor b, t1

and similarly for the other quarter-rounds. Note that we needed 3
instructions (PSLLD, PSRLD, POR) and an extra temp register to
accomplish a bit rotation.

Rearrangement of the variables uses the PSHUFD instruction:

   # after the first set of quarter-rounds
   pshufd t1, d, 0b01101100
   pshufd c, d, 0b11000110
   pshufd b, c, 0b01101100
   b = t1 # re-alias

There's a similar arrangement after the second set of quarter rounds.

So we've used a total of seven sse2 registers:

  a,b,c,d = the data being worked on
  t1,t2 = two temp registers
  b1 = alternate place for b (re-alias after each rearrangement,
    saves a movdqa per round)

which works out rather nicely (ia32+sse has 8 sse registers, the
athlon-64 has 16).

Maybe some cycles can be saved by reordering somewhat. I'm not enough
of an asm whiz to understand where the stalls in this code will be or
how long they'll last. Or if the Athlon-64 have two sse pipes, maybe
the code can be unrolled to do two blocks simultaneously, using
twice as many registers.

Anyway, has this already been tried and is it a speedup on the usual
P3/P4/PM/Athlon{,-64}?

If not, I might give it a try, if no one else does.



Relevant Pages

  • Re: Can you use SSE2 in DOS?
    ... >> I want to actually just demonstrate SSE2 instructions and XMM ... Operating System Support for FXSAVE and FXRSTOR instructions. ... contents of the XMM and MXCSR registers along with the contents ...
    (comp.lang.asm.x86)
  • Re: Two Click disassembly/reassembly
    ... Map the extra x86 registers to memory. ... > equivalents to the string instructions. ... > got such a limited RISC like instruction set that the assembler is more ...
    (alt.lang.asm)
  • Re: IBM 45nm -- new or licensed from Intel?
    ... 12 have 'L' sub registers, ... Just don't tell your compiler that they exist, ... to insert extra movzx instructions and avoids partial stalls all in one... ... If you want to compare the int results, you usually need to extend the ...
    (comp.arch)
  • Re: IBM 45nm -- new or licensed from Intel?
    ... registers are more restricted, but you do not need to use them (see ... instructions even if they had the option. ... For example the "low power" Silverthorne is ...
    (comp.arch)
  • Re: speed it up
    ... can load many registers at once from memory and put many instructions ... the inner loop is unrolled ... The above loop tells the compiler that 4 registers ...
    (comp.lang.cpp)