Re: Q: Fast computation of parity

From: Michael J. Fromberger (Michael.J.Fromberger_at_Clothing.Dartmouth.EDU)
Date: 12/31/03


Date: Wed, 31 Dec 2003 15:51:45 -0500

In article
<SmEIb.236699$ea%.38292@news01.bloor.is.net.cable.rogers.com>,
 "Tom St Denis" <tomstdenis@iahu.ca> wrote:

> "Michael J. Fromberger" <Michael.J.Fromberger@Clothing.Dartmouth.EDU> wrote
> in message
> news:Michael.J.Fromberger-81B301.12263031122003@merrimack.dartmouth.edu...
> > Done. I had to increase the number of tests again to get the resolution
> > high enough to see a difference on the P3.
> >
> > With:
> > gcc -ansi -Wall -O3 -fomit-frame-pointer -funroll-loops -o pars pars2.c
> >
> > # On G4:
> > % ./pars
> > Lookup: 2
> > XOR: 1
> >
> > # On PIII:
> > % ./pars
> > Lookup: 450000
> > XOR: 460000
>
> That's totally odd. Cuz on the x86 the dependency chain basically breaks
> any form of parallelism. So I'd think the lookup would be faster...

Well, it IS (moderately) faster given the P3 results I computed. The
x86 doesn't have enough general-purpose registers, so I suspect the XOR
based implementation winds up touching memory. I haven't checked. But
even if it doesn't, it's quite likely a set of four x86 XORL, MOVL, SHRL
sequences [which is more or less what parity1() is going to translate
into] takes longer than one such sequence plus an access to the cache.

On a PPC, I'd be very surprised to see the analogous instruction
sequence take significantly longer than a cache lookup, and indeed I'd
expect it to be shorter. In fact, if you explicitly unroll the
dependency chain, the XOR solution destroys the lookup table on a PPC:

int parity2(uint32_t x)
{
  register uint32_t u, v, w, y;

  u = (x >> 16) ^ x;
  v = (u >> 8) ^ u;
  w = (v >> 4) ^ v;
  y = (w >> 2) ^ w;

  return ((y >> 1) ^ y) & 1;
}

With this modification, my G4 gets even better:

% ./pars2
Lookup: 49
XOR: 13

And the P3 gets:

% ./pars2
Lookup: 450000
XOR: 110000

I hadn't considered that aspect of it in my original implementation.
So, thanks for putting me back into mind of it, Tom!

-M

-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA


Relevant Pages

  • Re: Algorithm to compare two binary numbers
    ... which processors might have such an instruction) and the compiler is ... > 10,000,000 XOR operations, and need to count the bits of every XOR result, ... > either by lookup table or by a real-time counting function. ... and ended with a CPU time of between 0.8 and 2 seconds for all of it ...
    (comp.programming)
  • Re: (possible) Ratling Guardian Corpse
    ... the lookup table can be found in the executable ... the hex string above is the polynomial used in the CRC ... details as to how it works at the wikipedia link above, ... "30967707" instead, depending on how you're viewing the data, since x86 is ...
    (rec.games.roguelike.adom)
  • Re: Q: Fast computation of parity
    ... > Lookup: 2 ... > XOR: 1 ... This once again shows that the relative efficiency of ... algorithms/codes could be fairly hardware dependent. ...
    (sci.crypt)
  • Re: [PATCH] slab: optimize constant-size kzalloc calls
    ... increases kernel text slightly (~200 bytes for defconfig on x86) ... On Tue, 21 Mar 2006, Andrew Morton wrote: ... Because of the malloc_sizes lookup. ...
    (Linux-Kernel)
  • Re: Q: Fast computation of parity
    ... > Xor: 11888907 ... Lookup: 50 ... This is consistent with my expectations for a register-rich RISC ...
    (sci.crypt)