Re: Linear Feedback Shift Registers

Phil Carmody <thefatphil_demunged@xxxxxxxxxxx> wrote:
| Tom St Denis <tom@xxxxxxx> writes:
| > Phil Carmody <thefatphil_demun...@xxxxxxxxxxx> wrote:
| >> I saw 1/x being used before I saw x being used. (Not in defining/finding
| >> the polynomial, but in the actual generation of values from the LFSR.)
| >
| > Really? I was kinda joking about that. I know it was used in the
| > OMAC2 spec before he just replaced 1/x with x^2.
| When I first saw LFSRs, in hardware, the shift was done so that the
| XOR of all of the tapping points was injected into the shifted-in end
| of the register. This is the reverse of the usual practice in s/w where
| the shifted-out end of the register XORs with all of the tapping
| points. So I think it's fair to view it as division rather than
| multiplication.

I believe that it's the case that if some polynomial generates a
maximal-length sequence with one style of feedback wiring then you
can always find a corresponding reciprocal polynomial that gives the
same output sequence with the other style of wiring, and vice-versa.

But while simple, slow, small-polynomial hardware LFSRs [like I used
to build on perfboard! ;-} ] often used the "XOR all the taps into
the one input" method, these days fast, large-polynomial hardware LFSRs
tend to use the "XOR the output *into* various stage-to-stage taps"
method [which is truly the "division" method, b.t.w.], the reason being
that it's easier/faster to do the latter with FPGAs or hard-coded silicon,
since you don't have a huge, honking multi-input XOR that needs to
propagate in one clock(!) and you also have only *one* feedback wire
to route back from the output, not many.

In any event, even in hardware these days most people use some variant
of the "matrix multiplication" method [of which I've written here before]
when implementing checksums [such as CRC-32] in order to be able to
process more than one input data bit per clock. E.g., using ~500 XOR
gates [or equivalent lookup table bits in FPGA logic blocks] you can
perform a CRC-32 step on 32 bits of input data *per clock*. [The wiring
of the XORs is just the 32nd power of the 1-input-bit/step CRC-32 matrix.]
In fact, by using even more gates, it's possible to perform CRC-32 on
*64* bits of input data per clock. [You have to be a little careful
about alignment/fixup issues at the end of packets which aren't an
exact multiple of 64 bits long, but that's a mere detail... ;-} ]

The corresponding software technique is the venerable table-lookup method[1],
e.g., <>,
which contains C code [written by yours truly] for byte-at-a-time CRC-32.
Unfortunately, the needed tables rapidly explode in size when processing
more than one input byte per step: For a 32-bit polynomial, the table is
1 KiB for 1 byte/step, 256 KiB for 2 B/step, and *16 GiB* for 4 B/step.
Cache effects alone would make four calls of the 1 B/step version
*MUCH* faster than a single call of the 4 B/step version! ;-}


[1] Which I first learned from a paper by Stu Wecker [then at DEC]:

Wecker, S., "A Table-Lookup Algorithm for Software
Computation of Cyclic Redundancy Check (CRC),"
Digital Equipment Corporation memorandum, 1974.

There might have been prior art from IBM even then, but I'm not sure.

Rob Warnock <rpw3@xxxxxxxx>
627 26th Avenue <URL:>
San Mateo, CA 94403 (650)572-2607