Re: Intel core 2 quad - faster XMM?




Paul Rubin wrote:
tomstdenis@xxxxxxxxx writes:
For bignum, no. XMM only has a 32x32 => 64 multiplier. So even though you
can pack two of them into one instruction (and it takes more than a cycle
to complete), you still need twice as many multiplies as with MULQ.

But I thought XMM bignum was already faster than MULQ even in the old
processors. Here's some old P4 timings from Eric Young:

That was the case on the P4 Prescott. It is not so on the later series,
and especially not on the C2D and AMD64 processors. MULQ is now fast
enough that the XMM multiply would basically have to take 1-2 cycles
to break even.
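
To make the operation count concrete, here is a rough sketch in C of a 64x64 => 128 product assembled from 32x32 => 64 partial products, which is the widest multiply a single XMM lane offers (PMULUDQ does two of them per instruction). The names u128_parts, mul64_from_32, mul64_native and mul64_self_check are made up for illustration, and unsigned __int128 is a GCC/Clang extension that I am assuming is available; on x86-64 those compilers typically turn it into a single 64-bit MUL. This is plain scalar code, not a tuned SSE2 routine, and it says nothing about actual cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit result as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128_parts;

/* 64x64 => 128 multiply using only 32x32 => 64 products. */
static u128_parts mul64_from_32(uint64_t a, uint64_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    /* Four partial products; two PMULUDQ instructions could cover these. */
    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* Sum of the middle 32-bit column; at most 3 * (2^32 - 1), so no overflow. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    u128_parts r;
    r.lo = (mid << 32) | (uint32_t)p00;
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}

/* Reference: one wide multiply via the compiler's 128-bit integer type. */
static u128_parts mul64_native(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    u128_parts r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}

/* Checks that both routines agree on a few inputs. */
static void mul64_self_check(void)
{
    const uint64_t v[] = { 0, 1, 0xFFFFFFFFULL, 0x100000000ULL,
                           0x0123456789ABCDEFULL, 0xFFFFFFFFFFFFFFFFULL };
    for (int i = 0; i < 6; i++) {
        for (int j = 0; j < 6; j++) {
            u128_parts x = mul64_from_32(v[i], v[j]);
            u128_parts y = mul64_native(v[i], v[j]);
            assert(x.lo == y.lo && x.hi == y.hi);
        }
    }
}
```

Besides the four multiplies, the 32-bit version also needs the shifts and adds to propagate carries between columns, and a bignum inner loop pays that overhead on every limb pair.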

As for bit-slicing, I imagine latency would be lower so yeah it'd help
there.

Designing new primitives is almost always a bad idea (since we already
have so many), but XMM is so ubiquitous that maybe there's some
justification for trying to find a way to use it.

Really though, they should just add AES operations to the XMM
instruction set. ;)

No, they should add AES to the ISA. When I was at AMD we looked at
that [briefly] and the result was that, with an FPU opcode issuing only
every 2 cycles, an AES built from discrete step instructions would be
slower than a pure integer AES.

The best way to both speed AES up and secure it against side channels
is to just have a one-shot AES instruction.

Tom
