Re: Q: Fast computation of parity
From: Michael J. Fromberger (Michael.J.Fromberger_at_Clothing.Dartmouth.EDU)
Date: 12/31/03
- Next message: Tom St Denis: "Re: Q: Fast computation of parity"
- Previous message: David A. Scott: "Re: Cipher whitenoise, David Wagner etc."
- In reply to: Tom St Denis: "Re: Q: Fast computation of parity"
- Next in thread: Tom St Denis: "Re: Q: Fast computation of parity"
- Reply: Tom St Denis: "Re: Q: Fast computation of parity"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Date: Wed, 31 Dec 2003 15:51:45 -0500
In article
<SmEIb.236699$ea%.38292@news01.bloor.is.net.cable.rogers.com>,
"Tom St Denis" <tomstdenis@iahu.ca> wrote:
> "Michael J. Fromberger" <Michael.J.Fromberger@Clothing.Dartmouth.EDU> wrote
> in message
> news:Michael.J.Fromberger-81B301.12263031122003@merrimack.dartmouth.edu...
> > Done. I had to increase the number of tests again to get the resolution
> > high enough to see a difference on the P3.
> >
> > With:
> > gcc -ansi -Wall -O3 -fomit-frame-pointer -funroll-loops -o pars pars2.c
> >
> > # On G4:
> > % ./pars
> > Lookup: 2
> > XOR: 1
> >
> > # On PIII:
> > % ./pars
> > Lookup: 450000
> > XOR: 460000
>
> That's totally odd. Cuz on the x86 the dependency chain basically breaks
> any form of parallelism. So I'd think the lookup would be faster...
Well, it IS (moderately) faster given the P3 results I computed. The
x86 doesn't have enough general-purpose registers, so I suspect the XOR
based implementation winds up touching memory. I haven't checked. But
even if it doesn't, it's quite likely a set of four x86 XORL, MOVL, SHRL
sequences [which is more or less what parity1() is going to translate
into] takes longer than one such sequence plus an access to the cache.
On a PPC, I'd be very surprised to see the analogous instruction
sequence take significantly longer than a cache lookup, and indeed I'd
expect it to be shorter. In fact, if you explicitly unroll the
dependency chain, the XOR solution destroys the lookup table on a PPC:
int parity2(uint32_t x)
{
register uint32_t u, v, w, y;
u = (x >> 16) ^ x;
v = (u >> 8) ^ u;
w = (v >> 4) ^ v;
y = (w >> 2) ^ w;
return ((y >> 1) ^ y) & 1;
}
With this modification, my G4 gets even better:
% ./pars2
Lookup: 49
XOR: 13
And the P3 gets:
% ./pars2
Lookup: 450000
XOR: 110000
I hadn't considered that aspect of it in my original implementation.
So, thanks for putting me back into mind of it, Tom!
-M
-- Michael J. Fromberger | Lecturer, Dept. of Computer Science http://www.dartmouth.edu/~sting/ | Dartmouth College, Hanover, NH, USA
- Next message: Tom St Denis: "Re: Q: Fast computation of parity"
- Previous message: David A. Scott: "Re: Cipher whitenoise, David Wagner etc."
- In reply to: Tom St Denis: "Re: Q: Fast computation of parity"
- Next in thread: Tom St Denis: "Re: Q: Fast computation of parity"
- Reply: Tom St Denis: "Re: Q: Fast computation of parity"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Relevant Pages
|
|