Re: Linear Shiftback Registers



Phil Carmody <thefatphil_demunged@xxxxxxxxxxx> wrote:
+---------------
| Tom St Denis <tom@xxxxxxx> writes:
| > Phil Carmody <thefatphil_demun...@xxxxxxxxxxx> wrote:
| >> I saw 1/x being used before I saw x being used. (Not in defining/finding
| >> the polynomial, but in the actual generation of values from the LFSR.)
| >
| > Really? I was kinda joking about that. I know it was used in the
| > OMAC2 spec before he just replaced 1/x with x^2.
|
| When I first saw LFSRs, in hardware, the shift was done so that the
| XOR of all of the tapping points was injected into the shifted-in end
| of the register. This is the reverse of the usual practice in s/w where
| the shifted-out end of the register XORs with all of the the tapping
| points. So I think it's fair to view it as division rather than
| multiplication.
+---------------

I believe that it's the case that if some polynomial generates a
maximal-length sequence with one style of feedback wiring then you
can always find a corresponding reciprocal polynomial that gives the
same output sequence with the other style of wiring, and vice-versa.

But while simple, slow, small-polynomial hardware LFSRs [like I used
to build on perfboard! ;-} ] often used the "XOR all the taps into
the one input" method, these days fast, large-polynomial hardware LFSRs
tend to use the "XOR the output *into* various stage-to-stage taps"
method [which is truly the "division" method, b.t.w.], the reason being
that it's easier/faster to do the latter with FPGAs or hard-coded silicon,
since you don't have a huge, honking multi-input XOR that needs to
propagate in one clock(!) and you also have only *one* feedback wire
to route back from the output, not many.

In any event, even in hardware these days most people use some variant
of the "matrix multiplication" method [of which I've written here before]
when implementing checksums [such as CRC-32] in order to be able to
process more than one input data bit per clock. E.g., using ~500 XOR
gates [or equivalent lookup table bits in FPGA logic blocks] you can
perform a CRC-32 step on 32 bits of input data *per clock*. [The wiring
of the XORs is just the 32nd power of the 1-input-bit/step CRC-32 matrix.]
In fact, by using even more gates, it's possible to perform CRC-32 on
*64* bits of input data per clock. [You have to be a little careful
about alignment/fixup issues at the end of packets which aren't an
exact multiple of 64 bits long, but that's a mere detail... ;-} ]

The corresponding software technique is the venerable table-lookup method[1],
e.g., <http://www.faqs.org/faqs/compression-faq/part1/section-26.html>,
which contains C code [written by yours truly] for byte-at-a-time CRC-32.
Unfortunately, the needed tables rapidly explode in size when processing
more than one input byte per step: For a 32-bit polynomial, the table is
1 KiB for 1 byte/step, 256 KiB for 2 B/step, and *16 GiB* for 4 B/step.
Cache effects alone would make calling the 1 B/step version four times
be *MUCH* faster than calling the 4 B/step version once! ;-}


-Rob

[1] Which I first learned from a paper by Stu Wecker [then at DEC]:

Wecker, S., "A Table-Lookup Algorithm for Software
Computation of Cyclic Redundancy Check (CRC),"
Digital Equipment Corporation memorandum, 1974.

There might have been prior art from IBM even then, but I'm not sure.

-----
Rob Warnock <rpw3@xxxxxxxx>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

.