Re: Salsa20 altivec timings

From: Milan VXdgsvt (milan_vxdgsvt_at_seznam.cz)
Date: 09/28/05


Date: Wed, 28 Sep 2005 07:29:37 +0000 (UTC)

xmath wrote:

> http://cds.xs4all.nl:8081/salsa/
>
> I get 276 cycles, or 4.31 cycles/byte. That's actually a bit more
> than twice as fast as djb's scalar implementation of Salsa20 on a G4.

Did you check the outputs? I believe that
        const vu32 vrr18 = vrr07 + vrr07;
should have been
        const vu32 vrr18 = vrr09 + vrr09;
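
The reason is just that the Salsa20 rotation distances are 7, 9, 13 and 18, so 9 + 9 gives the last one while 7 + 7 gives 14. Here is a minimal scalar sketch in plain C (not Altivec) to compare outputs against; rotl32 and quarterround_last are names I made up for illustration, and the last rotation is a parameter only so that the effect of the wrong constant can be seen.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by c bits (0 < c < 32). */
static uint32_t rotl32(uint32_t x, unsigned c) {
    return (x << c) | (x >> (32 - c));
}

/* Scalar Salsa20 quarterround on y[0..3], in place.
   The first three rotation distances are fixed at 7, 9, 13.
   The fourth is passed in (hypothetical parameter, for illustration):
   18 is the correct value, 14 is what vrr07 + vrr07 would give. */
static void quarterround_last(uint32_t y[4], unsigned last) {
    y[1] ^= rotl32(y[0] + y[3], 7);
    y[2] ^= rotl32(y[1] + y[0], 9);
    y[3] ^= rotl32(y[2] + y[1], 13);
    y[0] ^= rotl32(y[3] + y[2], last);
}
```

Starting from y = {1, 0, 0, 0}, a final rotation of 18 should give 0x08008145, 0x00000080, 0x00010200, 0x20500000. With 14 only the first word changes, so a wrong vrr18 would show up right away in any output comparison.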

I also find the reordering quite suspicious, but I don't have an
Altivec compiler to check it for sure:
        // 0 1 2 3         0 5 2 7         0 5 a f
        // 4 5 6 7  ---->  4 9 6 b  ---->  4 9 e 3
        // 8 9 a b         8 d a f         8 d 2 7
        // c d e f         c 1 e 3         c 1 6 b

        for (int i = 0; i < 20; i++) {
                z1 = y1 ^ vec_rl(y0 + y3, vrr07);
The y0 + y3 combines, in the second column, 5 and 1, while it should
combine 1 and d.
I've seen Altivec for the first time today, so maybe I'm just mistaken.
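
Without an Altivec compiler, the reordering can at least be checked on paper against the scalar column round. Below is a minimal sketch in plain C, assuming I have copied the columnround index tuples correctly from the Salsa20 specification; column_qr, layout and layout_matches_columnround are hypothetical names used only for this check.

```c
/* Word indices that the Salsa20 columnround passes to quarterround,
   in (y0, y1, y2, y3) order, one tuple per column.
   Assumption: copied from the specification, not from xmath's code. */
static const int column_qr[4][4] = {
    { 0,  4,  8, 12},
    { 5,  9, 13,  1},
    {10, 14,  2,  6},
    {15,  3,  7, 11},
};

/* The four vectors after the reordering, as in the rightmost matrix
   of the diagram: one row per vector, one entry per lane. */
static const int layout[4][4] = {
    {0x0, 0x5, 0xa, 0xf},
    {0x4, 0x9, 0xe, 0x3},
    {0x8, 0xd, 0x2, 0x7},
    {0xc, 0x1, 0x6, 0xb},
};

/* Returns 1 if, in every lane, vectors 0..3 hold exactly the words
   that columnround uses as y0..y3 of one quarterround, else 0. */
static int layout_matches_columnround(void) {
    for (int lane = 0; lane < 4; lane++)
        for (int row = 0; row < 4; row++)
            if (layout[row][lane] != column_qr[lane][row])
                return 0;
    return 1;
}
```

If those tuples are right, layout_matches_columnround() returns 1, and the first step in the second lane adds words 5 and 1 into word 9. In that case the reordering is fine and my objection to it is the mistaken part.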

  Milan