Re: Salsa20 altivec timings
Date: 09/28/05

Date: 27 Sep 2005 17:32:24 -0700

xmath wrote:
> wrote:
> > xmath wrote:
> > > It's interesting to note that, due to unavoidable data-dependencies, a
> > > Salsa20 round must take at least 12 cycles, as absolute CPU-independent
> > > minimum (unless it offers combined add-rotate, rotate-xor, or xor-add
> > > instructions that execute in one cycle, something I've never seen in
> > > any CPU).
> >
> > ARM offers add-rotate but not the others [it can xor-rotate but not
> > rotate-xor].
> Interesting, didn't know that :-)
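
For reference, the 12-cycle figure quoted above falls out of the
quarter-round's dependency chain. Here is a minimal C sketch, assuming
the quarter-round definition and rotation constants (7, 9, 13, 18) from
the Salsa20 specification; the names rotl32 and quarterround are
illustrative, not taken from any particular implementation. Each line
is an add, then a rotate, then an xor, and each line needs the result
of the line before it, so that is 4 x 3 = 12 dependent operations. A
round runs four such quarter-rounds side by side, which is why even
perfect 4-way parallelism cannot get below 12 cycles without a fused
instruction.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (valid for 0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* One Salsa20 quarter-round on y[0..3], in place.
 * Every statement depends on the word written by the previous one,
 * so the four add/rotate/xor steps cannot overlap. */
static void quarterround(uint32_t y[4])
{
    y[1] ^= rotl32(y[0] + y[3], 7);   /* add, rotate, xor */
    y[2] ^= rotl32(y[1] + y[0], 9);   /* needs the new y[1] */
    y[3] ^= rotl32(y[2] + y[1], 13);  /* needs the new y[2] */
    y[0] ^= rotl32(y[3] + y[2], 18);  /* needs the new y[3] */
}
```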

ARMs are crazy that way. I mean, come on:

A = (B + C) >>> D

is a *common* operation :-)

To be fair

A = (B + C) >> D

is common enough (e.g. signal averaging), so I guess supporting rotates
was not much of a stretch.
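
To make the difference between the two forms concrete, here is a small
C sketch. The helper names (rotr32, add_rotr, add_shr) are made up for
illustration, and the sum is plain 32-bit arithmetic that wraps on
overflow. The shift form throws away the bits that fall off the bottom,
which is what you want for an average; the rotate form feeds them back
in at the top.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by d bits; d is reduced mod 32. */
static uint32_t rotr32(uint32_t x, unsigned d)
{
    d &= 31;
    return (x >> d) | (x << ((32 - d) & 31));
}

/* A = (B + C) >>> D : add, then rotate right. */
static uint32_t add_rotr(uint32_t b, uint32_t c, unsigned d)
{
    return rotr32(b + c, d);
}

/* A = (B + C) >> D : add, then logical shift right.
 * With d = 1 this is the average of two samples, ignoring overflow. */
static uint32_t add_shr(uint32_t b, uint32_t c, unsigned d)
{
    return (b + c) >> (d & 31);
}
```

With an even sum the two agree (6 and 10 both give 8 for D = 1); with
an odd sum the rotate moves the low bit up to bit 31 while the shift
drops it.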

> (Of course, to be relevant in this context it also needs to be able to
> do it on four 32-bit words in parallel in a single cycle, but still...)

There is a four-core ARM11 design out there ... hehehe, I know what you
mean... the best ARM does [iirc] is SIMD within a single 32-bit
register. So it's not quite up to this.

Would be cool to see an FPGA implementation and compare its size/speed
to other "typical" hardware ciphers.