Re: Salsa20 and SSE2?

From: D. J. Bernstein (djb_at_cr.yp.to)
Date: 08/29/05


Date: Mon, 29 Aug 2005 06:49:46 +0000 (UTC)

Paul Rubin wrote:
> It's disappointing that the SSE2 instructions are so slow.

The biggest part of the problem is the number of instructions required:

   * XMM instructions can't simply add the way that LEA can. They have
     to copy, then add.

   * XMM instructions can't simply rotate the way that ROL can. They
     have to copy, then shift, then shift, then xor.

   * XMM instructions can't even copy in one instruction without serious
     P4 latency problems. They have to copy a 0, then xor.

Even if you aren't worried about P4 latency, the simple add-rotate-xor
in Salsa20 turns into at least seven XMM instructions. Each round has
four add-rotate-xor operations and three shuffles, for a total of at
least 31 XMM instructions per round.

For comparison, traditional non-XMM code takes 26.75 cycles per round on
the Athlon, and completely straightforward serial code takes 24.5 cycles
per round on the PowerPC G4.

> Alternatively, any chance of a "salsa40" function that uses 64-bit
> scalar operations? What are the obvious pitfalls in designing one
> that more or less imitates salsa20?

The main problem is that the 64-bit speedup is accompanied by a severe
32-bit slowdown. Shuffling sixteen 64-bit variables into x86 registers
takes a lot of time; also, add-with-carry is often fairly slow.

---D. J. Bernstein, Professor, Mathematics, Statistics,
and Computer Science, University of Illinois at Chicago



Relevant Pages

  • Re: shed escape somewhere than contact with Evelyns symbolic booking
    ... instructions on top of a light. ... It should joyously time once again Taysseer when the ... Every popular religious papers will round disappear the ...
    (sci.crypt)
  • Re: My $10 Ural Carb balancer
    ... another round of drinks. ... I picked up some supplies yesterday, and with instructions I found ...
    (rec.motorcycles.harley)
  • Re: My $10 Ural Carb balancer
    ... round of drinks. ... I picked up some supplies yesterday, and with instructions I found on the ...
    (rec.motorcycles.harley)
  • Re: SHA-256 performance on x86
    ... "Shill" wrote in message ... > instructions for one round of SHA-256. ... (You did not include those in your lower bound, ...
    (sci.crypt)
  • Re: Lies, damn lies and benchmarks
    ... When running using just the 16-bit registers, ... extra cycles when run on the 386 over the 286 (these were mostly system ... instructions which didn't get run too often anyways), ... The FPU was another story, the 287 FPU was usually run at an asynchronous ...
    (comp.security.misc)