Re: More LTC timings...
From: Eric Young (eay_nospam_at_pobox.com)
Date: 06/19/03
- Next message: bubba: "Re: Timings issue"
- Previous message: Eric Young: "Re: More LTC timings..."
- In reply to: Brian Gladman: "Re: More LTC timings..."
- Next in thread: Tom St Denis: "Re: More LTC timings..."
- Reply: Tom St Denis: "Re: More LTC timings..."
- Reply: Shill: "Re: More LTC timings..."
Date: Thu, 19 Jun 2003 08:59:45 +1000
Brian Gladman wrote:
> "Shill" <nobody@example.com> wrote in message
> news:bcppb1$2sjb$1@biggoron.nerim.net...
>
>>>As you suggest, in both cases this is almost certainly the stress
>>>on the processor caches - which is especially in evidence on the
>>>P4 with only 8k at L1.
>>
>>Northwood's 512 KB L2 cache is only 7 cycles away...
>>
>>www.aceshardware.com/Spades/read.php?article_id=20000192
>
>
> Interesting analysis. It seems that AES, which typically has large tables,
> hits the P4 quite hard though.
>
> But my main point is to back up Eric's comment that we should treat the
> often quoted naive TSC based algorithm timings with considerable scepticism.
>
> Brian Gladman
The classic case I remember was way back in the 'let's make a fast DES
implementation' days, when the DES variants needed for Unix
password hashing were being implemented for high speed.
I had a version that used 8 6-bit lookups, for 2k worth of tables.
UltraFast Crypt (I think that was the name) used 4 12-bit lookups,
for 64k worth of tables.
The 'measure speed' program used by UFC did repeated encrypts with
the same key value. By this benchmark, it was about 2 times faster
than the 2k table version. BUT, if you changed the password each
time, the 2k table version was 2 times faster. These numbers were
being generated on a machine with a 32k cache. With a single
repeated key, only the relevant values from the 64k table were
loaded into the L1 cache on the first pass, and all subsequent uses
came from the cache. The table was only ever partly loaded. For
password cracking, where every guess is a different password, the
UFC test was crud.
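To make the working-set effect concrete, here is a rough sketch in
Python. It is a toy model, not DES and not UFC: the lookup indices
come from a hash of the key, and the 32-byte cache line and 4-byte
table entry are assumptions chosen so that 8 tables of 64 entries
come to 2k and 4 tables of 4096 entries come to 64k. All names in it
are made up for illustration.

```python
import hashlib

LINE_BYTES = 32   # assumed cache line size
ENTRY_BYTES = 4   # assumed size of one table entry


def lookup_indices(key, n_tables, index_bits, rounds):
    """Yield (table, index) pairs for one toy 'encryption'.

    Indices are derived from SHA-256 of the key and round number.
    This only mimics key-dependent table lookups; it is not DES.
    """
    mask = (1 << index_bits) - 1
    for r in range(rounds):
        digest = hashlib.sha256(key + bytes([r])).digest()
        for t in range(n_tables):
            word = int.from_bytes(digest[2 * t:2 * t + 2], "big")
            yield t, word & mask


def touched_bytes(keys, n_tables, index_bits, rounds=16):
    """Bytes of table data (in whole cache lines) touched over all keys."""
    per_line = LINE_BYTES // ENTRY_BYTES
    lines = set()
    for key in keys:
        for t, i in lookup_indices(key, n_tables, index_bits, rounds):
            lines.add((t, i // per_line))
    return len(lines) * LINE_BYTES


if __name__ == "__main__":
    same = [b"secret"] * 1000                        # benchmark style: one key
    varied = [str(i).encode() for i in range(1000)]  # cracking style: new key each time
    for name, keys in (("same key", same), ("varied keys", varied)):
        small = touched_bytes(keys, n_tables=8, index_bits=6)
        big = touched_bytes(keys, n_tables=4, index_bits=12)
        print(f"{name}: 2k tables touch {small} bytes, 64k tables touch {big} bytes")
```

With one repeated key the big tables touch at most 64 cache lines,
which fits in a 32k cache easily. With a new key each time the
touched set grows toward the full 64k, which does not fit.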
The same table-loading effect applies to AES implementations that
use big tables. If you repeatedly encrypt
the same value, you will only load 160(?) values from your large
tables. How many times does a program repeatedly encrypt the same
value? I also tend to always measure cbc mode (perhaps I should
move to counter mode), since the calling convention and the 'byte
to word/endian conversion' overhead are also relevant.
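The 160 figure can be checked with a little arithmetic. The sketch
below assumes a table-driven AES that does 16 byte-indexed lookups
per round into four 256-entry tables; implementations with separate
last-round tables will differ. The function names are hypothetical.

```python
AES_ROUNDS = {128: 10, 192: 12, 256: 14}  # rounds per key size, from the AES spec
TABLE_ENTRIES = 4 * 256                   # assumed: four 256-entry round tables


def aes_lookups_per_block(key_bits=128, lookups_per_round=16):
    """Table lookups needed to encrypt one 16-byte block."""
    return AES_ROUNDS[key_bits] * lookups_per_round


def max_fraction_touched(key_bits=128):
    """Upper bound on the share of table entries one fixed block can touch."""
    return min(1.0, aes_lookups_per_block(key_bits) / TABLE_ENTRIES)


if __name__ == "__main__":
    for bits in (128, 192, 256):
        print(bits, aes_lookups_per_block(bits), max_fraction_touched(bits))
```

For a 128-bit key that is 10 x 16 = 160 lookups, so a benchmark that
encrypts the same block over and over can touch at most 160 of the
1024 entries, and fewer if some lookups repeat.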
I also believe I've seen quite a few algorithms where heavy
loop unrolling is bad for the Pentium 4. It has a trace cache
that holds about 12k decoded micro-ops.
To do the initial decode, instructions are only processed
1 at a time. If your code does not fit in the trace cache,
it keeps being re-decoded and is limited to about one instruction
per cycle.
eric