Re: Some questions about stream cyphers.

From: Terry Ritter (ritter_at_ciphersbyritter.com)
Date: 06/28/04


Date: 28 Jun 2004 01:38:22 -0700

da5id65536@yahoo.com (David Sexton) wrote in message news:<1ac1d2b9.0406260928.6aed7522@posting.google.com>...
> Ernst Lippe <ernstl-at-planet-dot-nl@ignore.this> wrote in message news:<pan.2004.06.14.20.24.13.411319@ignore.this>...
> > On Sat, 12 Jun 2004 13:38:06 -0700, David Sexton wrote:
> >
> > > Evaluating whether or not the P-values are evenly distributed is something
> > > that different people approach differently. I'd actually like to see some
> > > more discussion about this. Bartosz Zoltak and I have (one might almost
> > > say) argued about this. Masaglia uses an
> > > Anderson-Darling-Kolmogorov-Smirnov test implemented in RNGUtil.c
> > > available in
> > >
> > > http://www.csis.hku.hk/cisc/download/idetect/rng_src.zip
> > >
> > > He has published on the subject. See
> > >
> > > http://www.jstatsoft.org/v09/i02/ad.pdf
> >
> > The Kolmogorov-Smirnov test is "the" standard statistical test
> > if you want to compare an observed distribution with an expected
> > distribution.

Nonsense. Both K-S and chi-square are commonly used
to compare distributions. I suspect that chi-square
is more common.

> > > Ritter suggests partitioning the space between 0 and 1 and calculating a
> > > chi-squared statistic. See
> >
> > I assume that his approach is to partition all p values into B equally size
> > bins

Normally I work with raw statistic values, not p-values,
and thus avoid any error or approximation made in the
p-value computation. My distributions are never flat.
And when the expected distribution is not flat, it is
extremely common statistical practice for each tail
to be given its own bin with a reasonable expectation
value.

> and to take the expected number as N/B, where N is the total number
> > of different P-values. There are several reasons that I see problems with
> > this approach. First of all, it ignores the ordering between the bins.
> > If the p-values are not uniformly distributed it is very likely that
> > adjacent bins will have the same kind of deviation from the expected value,
> > but for the chi-square test it does not make any difference if the bins
> > are adjacent or not.

After reading that several times, I still have no
idea what it means.

Having bins certainly sets a fineness or precision
with which we compare distributions by chi square.
But the experimenter has complete control over that.

Presumably if there were some defect that occurred
only and exactly on a bin boundary, the problem would
be distributed in two bins rather than one. That may
reduce the detection effectiveness by half, we also
have twice as much chance to see it in the context
of random variation.

We are comparing distributions. If we are working
with p-values, the expected distribution should be
flat. So if the average of many sample distributions
is not flat, we can see that something is wrong even
without using a statistical test. That average occurs
naturally in chi square. The whole point of this is to
create insight, not to get some single resulting value
which somehow compares sizable vectors of values.

We delude ourselves to imagine that any computation
producing a single value is able to "best" capture
the complexity of the difference between two full
vectors or arrays of comparisons. The resulting
digest or statistic is at best shorthand for the real
situation. Since the first distribution we can get
to is the sample distribution over many trials, we
should be presenting that data as a graphic, which,
when scaled, just happens to represent the average
values actually used in chi square.

It is easy to compute both K-S and chi-square.
Excessive concentration on the supposed "best" single
value digest is not nearly as insightful as a graphical
comparison between distributions.

>The KS-test on the other hand is sensitive to these
> > differences because it uses the cumulative distribution. Another
> > reason to be suspicious of this approach is that the chi-square distribution
> > for this test is only an asymptotic approximation, that is only valid
> > when you have large expected values for each bin.
> >
> > Ernst Lippe
>
> Yes, you've got the idea. The reason I don't use this approach is
> precisely because I can't always guarantee large "expected" values in
> the chi-squared calculation.

Nonsense. Both sample size and bin width affect the
expected bin count. It is common statistical practice
to collect the ends of distributions into wide bins
with reasonable expectation values.

>The other issue you bring up is, I agree,
> also a valid concern.

Nonsense. If we do not bin a distribution of real
values, we have a problem even showing that distribution.
And if we do not show it, we only can discuss the
resulting digest or statistic, a single value which
is inherently deceptive.

Since asymptotic approximations seem to be the rule
in cryptography, that seems a strange complaint to
make here. Approximations are also commonly made in
the p-value conversion, but nothing is said about that.

Common statistical practice, taken over millions of
experiments by trained experimenters shows chi-square
to be a reasonable summary value when we have a reasonable
expectation count. And the main reason for that
requirement is not any asymptotic approximation in the
evaluation function but instead the inherent quantization
error that occurs when comparing integer counts to real
expectations.

 
> On the other hand, the problem with the KS test is the need to keep a
> (potentially) very large number of floating-point values in memory in
> order to sort them later. This is especially messy when it can't be
> easily determined beforehand how many values will have to be stored.

The real problem with the KS test is that it scores
the difference between distributions on the basis of
a single max (or min) difference value. Granted that
is the accumulated value, and not a single value. But
to me, that means random sampling extrema near the
start have a better chance of dominating the results,
whereas chi square inherently uses randomness-hiding
averaging.

Neither statistic is "right" with the other being
"wrong." Both can find problems, and they probably
find different types of problems.

---
Terry Ritter     1.2MB Crypto Glossary
http://www.ciphersbyritter.com/GLOSSARY.HTM


Relevant Pages

  • Re: "Robbins lemma"?
    ... >If X is a random variable with a Poisson distribution with ... >expectation lambda, and g is any function for which the ... The way he used it was in a situation where lambda is ... If one is computing a Bayes estimate with squared error ...
    (sci.math.research)
  • Re: infinite moments
    ... >>defined expectation or variance. ... For bounded distributions, all moments exist. ... if you truncate a distribution with no mean to the ... > representable range, you'll likely get behaviour which for all practical ...
    (sci.stat.math)
  • Re: infinite moments
    ... >expected value and variance of the process must be defined in the reals. ... >defined expectation or variance. ... In practice ... if you truncate a distribution with no mean to the ...
    (sci.stat.math)
  • Re: Memory As Captured By Distribution Functions
    ... the natural measure of central tendency for the cdf FX is ... The exponential distribution is of course a special case of the ... fY/X=x of conditional expectation. ... like the normal/Gaussian which have undefined PI maximum entropy. ...
    (sci.stat.math)
  • Re: how many bins for goodness of fit
    ...  So I did binning for the distribution ... 100 bins for 10000 sample points. ...   degrees of freedom. ... The chi-square test for goodness of fit is just the first tool. ...
    (sci.stat.math)