SOURCE CODE
FOR PROGRAMMERS who wish to test the randomness of a random number generator
The following is from the book
NUMERICAL RECIPES: The Art of Scientific Computing by William H. Press, et al.
There are several programming language versions of this book
available, including FORTRAN-77, FORTRAN-90, C, C++, Pascal, and BASIC. Machine-readable
versions of all the sample routines are also available in various languages
if you don't want to type in the code yourself.
The code presented below is FORTRAN, but should be easily translatable
to any programming language. This is an excellent book; you should own a copy.
Also see their web site at: http://numerical.recipes/
Chi-Square Test
Suppose that Ni is the number of events observed in the ith bin, and
that ni is the number expected according to some known distribution. Note
that the Ni’s are integers, while the ni’s may not be. Then the chi-square
statistic is
(13.5.1)
χ² = Σi (Ni - ni)² / ni
where the sum is over all bins. A large value of χ² indicates that the null
hypothesis (that the Ni’s are drawn from the population represented by the
ni’s) is rather unlikely.
Any term j in (13.5.1) with nj = Nj = 0 should be omitted from the
sum. A term with nj = 0, Nj ≠ 0 gives an infinite χ², as it should, since in
this case the Ni’s cannot possibly be drawn from the ni’s!
The chi-square probability function Q(χ²|ν) is an incomplete gamma function,
and was already discussed in §6.2 (see equation 6.2.18). Strictly speaking,
Q(χ²|ν) is the probability that the sum of the squares of ν random normal
variables of unit variance will be greater than χ². The terms in the sum
(13.5.1) are not individually normal. However, if either the number of bins
is large (>>1), or the number of events in each bin is large (>>1), then the
chi-square probability function is a good approximation to the distribution of
(13.5.1) in the case of the null hypothesis. Its use to estimate the significance
of the chi-square test is standard.
The appropriate value of ν, the number of degrees of freedom, bears some
additional discussion. If the data were collected in such a way that the ni’s
were all determined in advance, and there were no a priori constraints on
any of the Ni’s, then ν equals the number of bins NB (note that this is not
the total number of events!). Much more commonly, the ni’s are normalized
after the fact so that their sum equals the sum of the Ni’s, the total number
of events measured. In this case the correct value for ν is NB - 1. If the
model that gives the ni’s had additional free parameters that were adjusted
after the fact to agree with the data, then each of these additional "fitted"
parameters reduces ν by one additional unit. The number of these additional
fitted parameters (not including the normalization of the ni’s) is commonly
called the "number of constraints," so the number of degrees of freedom is
ν = NB - 1 when there are "zero constraints."
We have, then, the following program:
SUBROUTINE CHSONE(BINS,EBINS,NBINS,KNSTRN,DF,CHSQ,PROB)
Given the array BINS of length NBINS, containing the observed numbers of events, and an
array EBINS of length NBINS containing the expected numbers of events, and given the
number of constraints KNSTRN (normally zero), this routine returns (trivially) the number
of degrees of freedom DF, and (nontrivially) the chi-square CHSQ and the significance PROB.
A small value of PROB indicates a significant difference between the distributions BINS
and EBINS. Note that BINS and EBINS are both real arrays, although BINS will normally
contain integer values.
      DIMENSION BINS(NBINS),EBINS(NBINS)
      DF=NBINS-1-KNSTRN
      CHSQ=0.
      DO 11 J=1,NBINS
        IF(EBINS(J).LE.0.)PAUSE 'bad expected number'
        CHSQ=CHSQ+(BINS(J)-EBINS(J))**2/EBINS(J)
11    CONTINUE
      PROB=GAMMQ(0.5*DF,0.5*CHSQ)   ! Chi-square probability function. See §6.2.
      RETURN
      END
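As a usage illustration (ours, not from the book), here is a minimal driver that asks
whether 600 simulated die rolls are consistent with a fair die. The counts in BINS are
invented for the example, and TSTCH1 is just a made-up name:
      PROGRAM TSTCH1
C Hypothetical example: test 600 die rolls against a uniform expectation
C of 100 events per face. Assumes CHSONE and its dependencies (GAMMQ,
C GSER, GCF, GAMMLN) are compiled and linked.
      DIMENSION BINS(6),EBINS(6)
      DATA BINS/93.,110.,104.,95.,102.,96./
      DO 11 J=1,6
        EBINS(J)=100.
11    CONTINUE
      CALL CHSONE(BINS,EBINS,6,0,DF,CHSQ,PROB)
      WRITE(*,*) 'DF=',DF,' CHSQ=',CHSQ,' PROB=',PROB
      END
A PROB well above your chosen significance level (say 0.05) means no significant
departure from uniformity was detected.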
Next we consider the case of comparing two binned data sets. Let Ri be
the number of events in bin i for the first data set, and Si the number of events
in the same bin i for the second data set. Then the chi-square statistic is
(13.5.2)
χ² = Σi (Ri - Si)² / (Ri + Si)
Comparing (13.5.2) to (13.5.1), you should note that the denominator of
(13.5.2) is not just the average of Ri and Si (which would be an estimator of
ni in 13.5.1). Rather, it is twice the average, i.e. the sum. The reason is that each
term in a chi-square sum is supposed to approximate the square of a normally
distributed quantity with unit variance. The variance of the difference of two
normal quantities is the sum of their individual variances, not the average.
If the data were collected in such a way that the sum of the Ri’s is
necessarily equal to the sum of Si’s, then the number of degrees of freedom
is equal to one less than the number of bins, NB - 1 (that is, KNSTRN=0), the
usual case. If this requirement were absent, then the number of degrees of
freedom would be NB. Example: A birdwatcher wants to know whether the
distribution of sighted birds as a function of species is the same this year as
last. Each bin corresponds to one species. If the birdwatcher takes his data to
be the first 1000 birds that he saw in each year, then the number of degrees of
freedom is NB - 1. If he takes his data to be all the birds he saw on a random
sample of days, the same days in each year, then the number of degrees of
freedom is NB (KNSTRN= -1). In this latter case, note that he is also testing
whether the birds were more numerous overall in one year or the other: that is
the extra degree of freedom. Of course, any additional constraints on the data
set lower the number of degrees of freedom (i.e., increase KNSTRN to positive
values) in accordance with their number.
The program is
SUBROUTINE CHSTWO(BINS1,BINS2,NBINS,KNSTRN,DF,CHSQ,PROB)
Given the arrays BINS1 and BINS2, both of length NBINS, containing two sets of binned
data, and given the number of additional constraints KNSTRN (normally 0 or -1), this routine
returns the number of degrees of freedom DF, the chi-square CHSQ and the significance
PROB. A small value of PROB indicates a significant difference between the distributions
BINS1 and BINS2. Note that BINS1 and BINS2 are both real arrays, although they will
normally contain integer values.
      DIMENSION BINS1(NBINS),BINS2(NBINS)
      DF=NBINS-1-KNSTRN
      CHSQ=0.
      DO 11 J=1,NBINS
        IF(BINS1(J).EQ.0..AND.BINS2(J).EQ.0.)THEN
          DF=DF-1.              ! No data means one less degree of freedom.
        ELSE
          CHSQ=CHSQ+(BINS1(J)-BINS2(J))**2/(BINS1(J)+BINS2(J))
        ENDIF
11    CONTINUE
      PROB=GAMMQ(0.5*DF,0.5*CHSQ)   ! Chi-square probability function. See §6.2.
      RETURN
      END
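As with CHSONE, a brief driver sketch (ours, not from the book; the counts and the name
TSTCH2 are invented) comparing two binned data sets collected with equal totals, so that
KNSTRN=0 applies:
      PROGRAM TSTCH2
C Hypothetical example: two binned data sets with equal totals (1000 each),
C so the sums are constrained and KNSTRN=0 gives DF = NBINS - 1.
      DIMENSION BINS1(4),BINS2(4)
      DATA BINS1/180.,220.,310.,290./
      DATA BINS2/200.,240.,290.,270./
      CALL CHSTWO(BINS1,BINS2,4,0,DF,CHSQ,PROB)
      WRITE(*,*) 'DF=',DF,' CHSQ=',CHSQ,' PROB=',PROB
      END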
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov (or K-S) test is applicable to unbinned distributions
that are functions of a single independent variable, that is, to data
sets where each data point can be associated with a single number (lifetime of
each lightbulb when it burns out, or declination of each star). In such cases,
the list of data points can be easily converted to an unbiased estimator SN(x)
of the cumulative distribution function of the probability distribution from
which it was drawn: If the N events are located at values xi, i = 1,...,N,
then SN(x) is the function giving the fraction of data points to the left of a
given value x. This function is obviously constant between consecutive (i.e.
sorted into ascending order) xi’s, and jumps by the same constant 1/N at
each xi. (See Figure 13.5.1).
Figure 13.5.1. Kolmogorov-Smirnov statistic D. A measured distribution of values in x
(shown as N dots on the lower abscissa) is to be compared with a theoretical distribution
whose cumulative probability distribution is plotted as P(x). A step-function cumulative
probability distribution SN(x) is constructed, one which rises an equal amount at each
measured point. D is the greatest distance between the two cumulative distributions.
Different distribution functions, or sets of data, give different cumulative
distribution function estimates by the above procedure. However, all cumulative
distribution functions agree at the smallest allowable value of x (where
they are zero), and at the largest allowable value of x (where they are unity).
(The smallest and largest values might of course be ±∞). So it is the behavior
between the largest and smallest values that distinguishes distributions.
One can think of any number of statistics to measure the overall difference
between two cumulative distribution functions: The absolute value of the area
between them, for example. Or their integrated mean square difference. The
Kolmogorov-Smirnov D is a particularly simple measure: It is defined as the
maximum value of the absolute difference between two cumulative distribution
functions. Thus, for comparing one data set's SN(x) to a known cumulative
distribution function P(x), the K-S statistic is
(13.5.3)
D = max over -∞ < x < ∞ of |SN(x) - P(x)|
while for comparing two different cumulative distribution functions SN1(x)
and SN2(x), the K-S statistic is
(13.5.4)
D = max over -∞ < x < ∞ of |SN1(x) - SN2(x)|
What makes the K-S statistic useful is that its distribution in the case
of the null hypothesis (data sets drawn from the same distribution) can be
calculated, at least to useful approximation, thus giving the significance of
any observed nonzero value of D.
The function which enters into the calculation of the significance can be
written as the following sum,
(13.5.5)
QKS(λ) = 2 Σ (j=1 to ∞) (-1)^(j-1) e^(-2 j² λ²)
which is a monotone function with the limiting values
(13.5.6)
QKS(0) = 1    QKS(∞) = 0
In terms of this function, the significance level of an observed value of D (as
a disproof of the null hypothesis that the distributions are the same) is given
approximately by the formulas
(13.5.7)
Probability (D > observed) = QKS(√N D)
for the case (13.5.3) of one distribution, where N is the number of data points,
and
(13.5.8)
Probability (D > observed) = QKS(√(N1 N2 / (N1 + N2)) D)
for the case (13.5.4) of two distributions, where N1 is the number of data
points in the first distribution, N2 the number in the second.
The nature of the approximation involved in (13.5.7) and (13.5.8) is that
it becomes asymptotically accurate as the N’s become large. In practice
N = 20 is large enough, especially if you are being conservative and requiring
a strong significance level (0.01 or smaller).
So, we have the following routines for the cases of one and two distributions:
SUBROUTINE KSONE(DATA,N,FUNC,D,PROB)
Given an array of N values, DATA, and given a user-supplied function of a single variable
FUNC which is a cumulative distribution function ranging from 0 (for smallest values of its
argument) to 1 (for largest values of its argument), this routine returns the K-S statistic D
and the significance level PROB. Small values of PROB show that the cumulative distribution
function of DATA is significantly different from FUNC. The array DATA is modified by being
sorted into ascending order.
      DIMENSION DATA(N)
      CALL SORT(N,DATA)         ! Omit this call if the data are already sorted.
      EN=N
      D=0.
      FO=0.                     ! Data's c.d.f. before the next step.
      DO 11 J=1,N               ! Loop over the sorted data points.
        FN=J/EN                 ! Data's c.d.f. after this step.
        FF=FUNC(DATA(J))        ! Compare to the user-supplied function.
        DT=AMAX1(ABS(FO-FF),ABS(FN-FF))   ! Maximum distance.
        IF(DT.GT.D)D=DT
        FO=FN
11    CONTINUE
      PROB=PROBKS(SQRT(EN)*D)   ! Compute significance.
      RETURN
      END
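In the spirit of this page's purpose, testing a random number generator, here is a driver
sketch (ours, not from the book). It assumes some uniform (0,1) generator is linked; RAN1
from the book's Chapter 7 would do, or substitute your own. TSTKS1 and UNICDF are
invented names:
      PROGRAM TSTKS1
C Hypothetical example: draw 1000 uniform deviates and K-S test them
C against the uniform c.d.f. F(x) = x. RAN1 is assumed available.
      EXTERNAL UNICDF
      DIMENSION DATA(1000)
      IDUM=-7
      DO 11 J=1,1000
        DATA(J)=RAN1(IDUM)
11    CONTINUE
      CALL KSONE(DATA,1000,UNICDF,D,PROB)
      WRITE(*,*) 'D=',D,' PROB=',PROB
      END

      FUNCTION UNICDF(X)
C Cumulative distribution function of the uniform distribution on (0,1).
      UNICDF=AMIN1(1.,AMAX1(0.,X))
      RETURN
      END
A healthy generator should yield PROB values spread over (0,1) on repeated runs;
consistently tiny PROB signals trouble.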
SUBROUTINE KSTWO(DATA1,N1,DATA2,N2,D,PROB)
Given an array DATA1 of N1 values, and an array DATA2 of N2 values, this routine returns
the K-S statistic D, and the significance level PROB for the null hypothesis that the data
sets are drawn from the same distribution. Small values of PROB show that the cumulative
distribution function of DATA1 is significantly different from that of DATA2. The arrays
DATA1 and DATA2 are modified by being sorted into ascending order.
      DIMENSION DATA1(N1),DATA2(N2)
      CALL SORT(N1,DATA1)
      CALL SORT(N2,DATA2)
      EN1=N1
      EN2=N2
      J1=1                      ! Next value of DATA1 to be processed.
      J2=1                      ! Ditto, DATA2.
      FO1=0.                    ! Value of c.d.f. before the next step.
      FO2=0.                    ! Ditto, for DATA2.
      D=0.
1     IF(J1.LE.N1.AND.J2.LE.N2)THEN     ! If we are not done...
        IF(DATA1(J1).LT.DATA2(J2))THEN  ! Next step is in DATA1.
          FN1=J1/EN1
          DT=AMAX1(ABS(FN1-FO2),ABS(FO1-FO2))
          IF(DT.GT.D)D=DT
          FO1=FN1
          J1=J1+1
        ELSE                            ! Next step is in DATA2.
          FN2=J2/EN2
          DT=AMAX1(ABS(FN2-FO1),ABS(FO2-FO1))
          IF(DT.GT.D)D=DT
          FO2=FN2
          J2=J2+1
        ENDIF
        GO TO 1
      ENDIF
      PROB=PROBKS(SQRT(EN1*EN2/(EN1+EN2))*D)  ! Compute significance.
      RETURN
      END
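A matching two-sample sketch (again ours, not the book's; RAN1 as in the one-sample
example above):
      PROGRAM TSTKS2
C Hypothetical example: two samples from the same generator should look
C like draws from one distribution, i.e. PROB should not be small.
      DIMENSION DATA1(500),DATA2(500)
      IDUM=-13
      DO 11 J=1,500
        DATA1(J)=RAN1(IDUM)
        DATA2(J)=RAN1(IDUM)
11    CONTINUE
      CALL KSTWO(DATA1,500,DATA2,500,D,PROB)
      WRITE(*,*) 'D=',D,' PROB=',PROB
      END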
Both KSONE and KSTWO use the following routine for calculating the
function QKS:
FUNCTION PROBKS(ALAM)
Returns the Kolmogorov-Smirnov probability function QKS(ALAM) of equation (13.5.5).
      PARAMETER (EPS1=0.001, EPS2=1.E-8)
      A2=-2.*ALAM**2
      FAC=2.
      PROBKS=0.
      TERMBF=0.                 ! Previous term in sum.
      DO 11 J=1,100
        TERM=FAC*EXP(A2*J**2)
        PROBKS=PROBKS+TERM
        IF(ABS(TERM).LT.EPS1*TERMBF.OR.ABS(TERM).LT.EPS2*PROBKS)RETURN
        FAC=-FAC                ! Alternating signs in sum.
        TERMBF=ABS(TERM)
11    CONTINUE
      PROBKS=1.                 ! Get here only by failing to converge.
      RETURN
      END
REFERENCES AND FURTHER READING:
von Mises, Richard. 1964, Mathematical Theory of Probability and Statistics
(New York: Academic Press), Chapters IX(C) and IX(E).
6.2 Incomplete Gamma Function, Error
Function, Chi-Square Probability Function,
Cumulative Poisson Function
The incomplete gamma function is defined by
(6.2.1)
P(a,x) ≡ γ(a,x)/Γ(a) ≡ (1/Γ(a)) ∫ (0 to x) e^(-t) t^(a-1) dt    (a > 0)
It has the limiting values
(6.2.2)
P(a,0) = 0 and P(a,∞) = 1
The incomplete gamma function P(a,x) is monotonic and (for a greater than
one or so) rises from "near-zero" to "near-unity" in a range of x centered on
about a - 1, and of width about √a (see Figure 6.2.1).
Figure 6.2.1. The incomplete gamma function P(a,x) for four values of a.
The complement of P(a,x) is also confusingly called an incomplete gamma
function,
(6.2.3)
Q(a,x) ≡ 1 - P(a,x) ≡ Γ(a,x)/Γ(a) ≡ (1/Γ(a)) ∫ (x to ∞) e^(-t) t^(a-1) dt    (a > 0)
It has the limiting values
(6.2.4)
Q(a,0) = 1 and Q(a,∞) = 0
The notations P(a,x), γ(a,x), and Γ(a,x) are standard; the notation Q(a,x)
is specific to this book.
There is a series development for γ(a,x) as follows:
(6.2.5)
γ(a,x) = e^(-x) x^a Σ (n=0 to ∞) [Γ(a) / Γ(a+1+n)] x^n
One does not actually need to compute a new Γ(a + 1 + n) for each n; one
rather uses equation (6.1.3) and the previous coefficient.
A continued fraction development for Γ(a,x) is
(6.2.6)
Γ(a,x) = e^(-x) x^a · 1/(x + (1-a)/(1 + 1/(x + (2-a)/(1 + 2/(x + ···)))))    (x > 0)
It turns out that (6.2.5) converges rapidly for x less than about a + 1, while
(6.2.6) converges rapidly for x greater than about a + 1. In these respective
regimes each requires at most a few times √a terms to converge, and this many
only near x = a, where the incomplete gamma functions are varying most
rapidly. Thus (6.2.5) and (6.2.6) together allow evaluation of the function
for all positive a and x. An extra dividend is that we never need compute
a function value near zero by subtracting two nearly equal numbers. The
higher-level calls for P(a,x) and Q(a,x) are
FUNCTION GAMMP(A,X)
Returns the incomplete gamma function P(a,x).
      IF(X.LT.0..OR.A.LE.0.)PAUSE
      IF(X.LT.A+1.)THEN         ! Use the series representation.
        CALL GSER(GAMSER,A,X,GLN)
        GAMMP=GAMSER
      ELSE                      ! Use the continued fraction representation.
        CALL GCF(GAMMCF,A,X,GLN)
        GAMMP=1.-GAMMCF
      ENDIF
      RETURN
      END
FUNCTION GAMMQ(A,X)
Returns the incomplete gamma function Q(a,x) ≡ 1 - P(a,x).
      IF(X.LT.0..OR.A.LE.0.)PAUSE
      IF(X.LT.A+1.)THEN         ! Use the series representation...
        CALL GSER(GAMSER,A,X,GLN)
        GAMMQ=1.-GAMSER         ! ...and take its complement.
      ELSE                      ! Use the continued fraction representation.
        CALL GCF(GAMMCF,A,X,GLN)
        GAMMQ=GAMMCF
      ENDIF
      RETURN
      END
The argument GLN is returned by both the series and continued fraction
procedures, containing the value ln Γ(a); the reason for this is so that it is
available to you if you want to modify the above two procedures to give γ(a,x)
and Γ(a,x), in addition to P(a,x) and Q(a,x) (cf. equations 6.2.1 and 6.2.3).
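For instance, since Γ(a,x) = Q(a,x)Γ(a) by (6.2.3), one such modification might look like
the following sketch (ours, not the book's). GAMMIC is an invented name, and it reuses GLN
from the routines GSER and GCF listed just below; note there is no guard against EXP(GLN)
overflowing for large a:
      FUNCTION GAMMIC(A,X)
C Hypothetical sketch: returns the upper incomplete gamma function
C GAMMA(a,x) = Q(a,x)*GAMMA(a), reusing GLN = ln GAMMA(a).
      IF(X.LT.A+1.)THEN
        CALL GSER(GAMSER,A,X,GLN)
        GAMMIC=(1.-GAMSER)*EXP(GLN)
      ELSE
        CALL GCF(GAMMCF,A,X,GLN)
        GAMMIC=GAMMCF*EXP(GLN)
      ENDIF
      RETURN
      END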
The procedures GSER and GCF which implement (6.2.5) and (6.2.6) are
SUBROUTINE GSER(GAMSER,A,X,GLN)
Returns the incomplete gamma function P(a,x) evaluated by its series representation as
GAMSER. Also returns ln Γ(a) as GLN.
      PARAMETER (ITMAX=100,EPS=3.E-7)
      GLN=GAMMLN(A)
      IF(X.LE.0.)THEN
        IF(X.LT.0.)PAUSE
        GAMSER=0.
        RETURN
      ENDIF
      AP=A
      SUM=1./A
      DEL=SUM
      DO 11 N=1,ITMAX
        AP=AP+1.
        DEL=DEL*X/AP
        SUM=SUM+DEL
        IF(ABS(DEL).LT.ABS(SUM)*EPS)GO TO 1
11    CONTINUE
      PAUSE 'A too large, ITMAX too small'
1     GAMSER=SUM*EXP(-X+A*LOG(X)-GLN)
      RETURN
      END
SUBROUTINE GCF(GAMMCF,A,X,GLN)
Returns the incomplete gamma function Q(a,x) evaluated by its continued fraction
representation as GAMMCF. Also returns ln Γ(a) as GLN.
      PARAMETER (ITMAX=100,EPS=3.E-7)
      GLN=GAMMLN(A)
      GOLD=0.                   ! Previous value, tested against for convergence.
      A0=1.                     ! Setting up the A's and B's of equation (5.2.4)
      A1=X                      !   for evaluating the continued fraction.
      B0=0.
      B1=1.
      FAC=1.                    ! Renormalization factor, prevents overflow.
      DO 11 N=1,ITMAX
        AN=FLOAT(N)
        ANA=AN-A
        A0=(A1+A0*ANA)*FAC      ! One step of the recurrence (5.2.5).
        B0=(B1+B0*ANA)*FAC
        ANF=AN*FAC
        A1=X*A0+ANF*A1          ! The next step of the recurrence (5.2.5).
        B1=X*B0+ANF*B1
        IF(A1.NE.0.)THEN        ! Shall we renormalize?
          FAC=1./A1             ! Yes. Set FAC so it happens.
          G=B1*FAC              ! New value of answer.
          IF(ABS((G-GOLD)/G).LT.EPS)GO TO 1   ! Converged? If so, exit.
          GOLD=G                ! If not, save value.
        ENDIF
11    CONTINUE
      PAUSE 'A too large, ITMAX too small'
1     GAMMCF=EXP(-X+A*ALOG(X)-GLN)*G          ! Put factors in front.
      RETURN
      END
Below is function GAMMLN used by GSER and GCF above. There are a variety of methods in use for
calculating the function Γ(z) numerically, but none is quite as neat as the
approximation derived by Lanczos. This scheme is entirely specific to the gamma function,
seemingly plucked from thin air. We will not attempt to derive the approximation.
FUNCTION GAMMLN(XX)
Returns the value ln[Γ(XX)] for XX > 0. Full accuracy is obtained for XX > 1. Internal
arithmetic is done in double precision, a nicety that you can omit if five-figure
accuracy is good enough.
      REAL*8 COF(6),STP,HALF,ONE,FPF,X,TMP,SER
      DATA COF,STP/76.18009173D0,-86.50532033D0,24.01409822D0,
     *     -1.231739516D0,.120858003D-2,-.536382D-5,2.50662827465D0/
      DATA HALF,ONE,FPF/0.5D0,1.0D0,5.5D0/
      X=XX-ONE
      TMP=X+FPF
      TMP=(X+HALF)*LOG(TMP)-TMP
      SER=ONE
      DO 11 J=1,6
        X=X+ONE
        SER=SER+COF(J)/X
11    CONTINUE
      GAMMLN=TMP+LOG(STP*SER)
      RETURN
      END
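A quick sanity check (ours, not from the book): Γ(6) = 5! = 120, so GAMMLN(6.) should
agree with ALOG(120.) ≈ 4.7875 to single precision. TSTGLN is an invented name:
      PROGRAM TSTGLN
C Hypothetical check: GAMMA(6) = 5! = 120, so the two values printed
C below should match to single precision.
      WRITE(*,*) GAMMLN(6.), ALOG(120.)
      END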
Error Function
The error function and complementary error function are special cases of
the incomplete gamma function, and are obtained moderately efficiently by
the above procedures. Their definitions are
(6.2.7)
erf(x) = (2/√π) ∫ (0 to x) e^(-t²) dt
and
(6.2.8)
erfc(x) ≡ 1 - erf(x) = (2/√π) ∫ (x to ∞) e^(-t²) dt
The functions have the following limiting values and symmetries:
(6.2.9)
erf(0) = 0 erf(∞) = 1 erf(-x) = -erf(x)
(6.2.10)
erfc(0) = 1 erfc(∞) = 0 erfc(-x) = 2 - erfc(x)
They are related to the incomplete gamma functions by
(6.2.11)
erf(x) = P(½, x²)    (x ≥ 0)
and
(6.2.12)
erfc(x) = Q(½, x²)    (x ≥ 0)
Hence we have
FUNCTION ERF(X)
Returns the error function erf(X).
      IF(X.LT.0.)THEN
        ERF=-GAMMP(.5,X**2)
      ELSE
        ERF=GAMMP(.5,X**2)
      ENDIF
      RETURN
      END
FUNCTION ERFC(X)
Returns the complementary error function erfc(X).
      IF(X.LT.0.)THEN
        ERFC=1.+GAMMP(.5,X**2)
      ELSE
        ERFC=GAMMQ(.5,X**2)
      ENDIF
      RETURN
      END
If you care to do so, you can easily remedy the minor inefficiency in ERF
and ERFC, namely that Γ(0.5) = √π is computed unnecessarily when GAMMP
or GAMMQ is called. Before you do that, however, you might wish to consider
the following routine, based on Chebyshev fitting to an inspired guess as to
the functional form:
FUNCTION ERFCC(X)
Returns the complementary error function erfc(X) with fractional error
everywhere less than 1.2 × 10⁻⁷.
      Z=ABS(X)
      T=1./(1.+0.5*Z)
      ERFCC=T*EXP(-Z*Z-1.26551223+T*(1.00002368+T*(.37409196+
     *    T*(.09678418+T*(-.18628806+T*(.27886807+T*(-1.13520398+
     *    T*(1.48851587+T*(-.82215223+T*.17087277)))))))))
      IF(X.LT.0.)ERFCC=2.-ERFCC
      RETURN
      END
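A small cross-check (ours, not from the book) that the two routes agree; for reference,
erfc(1) ≈ 0.15730. TSTERF is an invented name:
      PROGRAM TSTERF
C Hypothetical check: gamma-function route (ERFC) vs. the Chebyshev
C fit (ERFCC) at a few points; the columns should agree closely.
      DO 11 J=1,3
        X=0.5*FLOAT(J)
        WRITE(*,*) X, ERFC(X), ERFCC(X)
11    CONTINUE
      END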
There are also some functions of two variables which are special cases of
the incomplete gamma function:
Cumulative Poisson Probability Function
Px(<k), for positive x and integer k ≥ 1, denotes the cumulative Poisson
probability function. It is defined as the probability that the number of Poisson
random events occurring will be between 0 and k - 1 inclusive, if the expected
mean number is x. It has the limiting values
(6.2.13)
Px(<1) = e^(-x)    Px(<∞) = 1
Its relation to the incomplete gamma function is simply
(6.2.14)
Px(<k) = Q(k,x) = GAMMQ(k,x)
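As a worked instance (ours, not from the book): with mean x = 2, the probability of seeing
fewer than k = 3 events is e⁻²(1 + 2 + 2) ≈ 0.6767, and GAMMQ(3.,2.) should reproduce it:
      PROGRAM TSTPOI
C Hypothetical check: Px(<3) for mean x=2 equals EXP(-2.)*(1.+2.+2.),
C so the two values printed below should match.
      WRITE(*,*) GAMMQ(3.,2.), EXP(-2.)*5.
      END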
Chi-Square Probability Function
P(χ²|ν) is defined as the probability that the observed chi-square for a
correct model should be less than a value χ². (We will discuss the use of
this function in Chapter 14.) Its complement Q(χ²|ν) is the probability that
the observed chi-square will exceed the value χ² by chance even for a correct
model. In both cases ν is an integer, the number of degrees of freedom. The
functions have the limiting values
(6.2.15)
P(0|ν) = 0 P(∞|ν) = 1
(6.2.16)
Q(0|ν) = 1 Q(∞|ν) = 0
and the following relation to the incomplete gamma functions,
(6.2.17)
P(χ²|ν) = P(ν/2, χ²/2)
(6.2.18)
Q(χ²|ν) = Q(ν/2, χ²/2)
REFERENCES AND FURTHER READING:
Abramowitz, Milton, and Stegun, Irene A. 1964, Handbook of Mathematical Functions, Applied
Mathematics Series, vol. 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapters 6, 7, and 26.
Pearson, K. (ed.) 1951, Tables of the Incomplete Gamma Function (Cambridge: Cambridge
University Press).
ALL OF THE ABOVE IS FROM THE FOLLOWING BOOK:
NUMERICAL RECIPES: The Art of Scientific Computing
William H. Press, Harvard-Smithsonian Center for Astrophysics
Brian P. Flannery, EXXON Research and Engineering Company
Saul A. Teukolsky, Department of Physics, Cornell University
William T. Vetterling, Polaroid Corporation
Cambridge University Press, 1986
This book contains a wealth of knowledge along with source code examples
on how to solve various kinds of programming problems.