The Structure of Language Source Code and Discussion of Newer & Better Portable Random Number Generators

SOURCE CODE

FOR PROGRAMMERS who wish to test the randomness of a random number generator

The following is from the book NUMERICAL RECIPES: The Art of Scientific Computing by William H. Press, et al. There are several programming language versions of this book available, including FORTRAN-77, FORTRAN-90, C, C++, Pascal, and BASIC. Machine-readable versions of all the sample routines are also available in various languages if you don't want to type in the code yourself. The code presented below is FORTRAN, but should be easily translatable to any programming language. This is an excellent book; you should own a copy.
Also see their web site at: http://numerical.recipes/


Chi-Square Test

Suppose that Ni is the number of events observed in the ith bin, and that ni is the number expected according to some known distribution. Note that the Ni’s are integers, while the ni’s may not be. Then the chi-square statistic is

(13.5.1)

chisq = Σi (Ni - ni)²/ni

where the sum is over all bins. A large value of chisq indicates that the null hypothesis (that the Ni’s are drawn from the population represented by the ni’s) is rather unlikely.

Any term j in (13.5.1) with nj = Nj = 0 should be omitted from the sum. A term with nj = 0, Nj ≠ 0 gives an infinite chisq, as it should, since in this case the Ni’s cannot possibly be drawn from the ni’s!

The chi-square probability function Q(chisq|ν) is an incomplete gamma function, and was already discussed in §6.2 (see equation 6.2.18). Strictly speaking Q(chisq|ν) is the probability that the sum of the squares of ν random normal variables of unit variance will be greater than chisq. The terms in the sum (13.5.1) are not individually normal. However, if either the number of bins is large (>>1), or the number of events in each bin is large (>>1), then the chi-square probability function is a good approximation to the distribution of (13.5.1) in the case of the null hypothesis. Its use to estimate the significance of the chi-square test is standard.

The appropriate value of ν, the number of degrees of freedom, bears some additional discussion. If the data were collected in such a way that the ni’s were all determined in advance, and there were no a priori constraints on any of the Ni’s, then ν equals the number of bins NB (note that this is not the total number of events!). Much more commonly, the ni’s are normalized after the fact so that their sum equals the sum of the Ni’s, the total number of events measured. In this case the correct value for ν is NB - 1. If the model that gives the ni’s had additional free parameters that were adjusted after the fact to agree with the data, then each of these additional "fitted" parameters reduces ν by one additional unit. The number of these additional fitted parameters (not including the normalization of the ni’s) is commonly called the "number of constraints," so the number of degrees of freedom is ν = NB - 1 when there are "zero constraints."

We have, then, the following program:

SUBROUTINE CHSONE(BINS,EBINS,NBINS,KNSTRN,DF,CHSQ,PROB)
   Given the array BINS of length NBINS, containing the observed numbers of events, and an
   array EBINS of length NBINS containing the expected numbers of events, and given the
   number of constraints KNSTRN (normally zero), this routine returns (trivially) the number
   of degrees of freedom DF, and (nontrivially) the chi-square CHSQ and the significance PROB.
   A small value of PROB indicates a significant difference between the distributions BINS
   and EBINS. Note that BINS and EBINS are both real arrays, although BINS will normally
   contain integer values.
DIMENSION BINS(NBINS),EBINS(NBINS)
DF=NBINS-1-KNSTRN
CHSQ=0.
DO 11 J=1,NBINS
     IF(EBINS(J).LE.0.)PAUSE 'bad expected number'
     CHSQ=CHSQ+(BINS(J)-EBINS(J))**2/EBINS(J)
11   CONTINUE
PROB=GAMMQ(0.5*DF,0.5*CHSQ)     //Chi-square probability function. See §6.2.
RETURN
END
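
Below is a minimal driver sketch (not from the book) showing one way CHSONE might be called, here testing whether 600 invented die-roll counts are consistent with a fair die. The program name, the counts, and the output format are made up for illustration, and it assumes CHSONE and GAMMQ (given later on this page) are compiled and linked together.

PROGRAM TSTCH1
DIMENSION BINS(6),EBINS(6)
DATA BINS/92.,108.,95.,113.,87.,105./     //Invented observed counts, summing to 600.
DO 11 J=1,6
     EBINS(J)=100.     //Expected count per face for a fair die.
11   CONTINUE
CALL CHSONE(BINS,EBINS,6,0,DF,CHSQ,PROB)     //KNSTRN=0: the usual normalized case.
WRITE(*,*) 'DF=',DF,' CHISQ=',CHSQ,' PROB=',PROB
END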

Next we consider the case of comparing two binned data sets. Let Ri be the number of events in bin i for the first data set, and Si the number of events in the same bin i for the second data set. Then the chi-square statistic is

(13.5.2)

chisq = Σi (Ri - Si)²/(Ri + Si)

Comparing (13.5.2) to (13.5.1), you should note that the denominator of (13.5.2) is not just the average of Ri and Si (which would be an estimator of ni in 13.5.1). Rather, it is twice the average, the sum. The reason is that each term in a chi-square sum is supposed to approximate the square of a normally distributed quantity with unit variance. The variance of the difference of two normal quantities is the sum of their individual variances, not the average.

If the data were collected in such a way that the sum of the Ri’s is necessarily equal to the sum of the Si’s, then the number of degrees of freedom is equal to one less than the number of bins, NB - 1 (that is, KNSTRN=0), the usual case. If this requirement were absent, then the number of degrees of freedom would be NB. Example: A birdwatcher wants to know whether the distribution of sighted birds as a function of species is the same this year as last. Each bin corresponds to one species. If the birdwatcher takes his data to be the first 1000 birds that he saw in each year, then the number of degrees of freedom is NB - 1. If he takes his data to be all the birds he saw on a random sample of days, the same days in each year, then the number of degrees of freedom is NB (KNSTRN=-1). In this latter case, note that he is also testing whether the birds were more numerous overall in one year or the other: that is the extra degree of freedom. Of course, any additional constraints on the data set lower the number of degrees of freedom (i.e., increase KNSTRN to positive values) in accordance with their number.

The program is

SUBROUTINE CHSTWO(BINS1,BINS2,NBINS,KNSTRN,DF,CHSQ,PROB)
   Given the arrays BINS1 and BINS2, both of length NBINS, containing two sets of binned
   data, and given the number of additional constraints KNSTRN (normally 0 or -1), this routine
   returns the number of degrees of freedom DF, the chi-square CHSQ and the significance
   PROB. A small value of PROB indicates a significant difference between the distributions
   BINS1 and BINS2. Note that BINS1 and BINS2 are both real arrays, although they will
   normally contain integer values.
DIMENSION BINS1(NBINS),BINS2(NBINS)
DF=NBINS-1-KNSTRN
CHSQ=0.
DO 11 J=1,NBINS
     IF(BINS1(J).EQ.0..AND.BINS2(J).EQ.0.)THEN
          DF=DF-1.     //No data means one less degree of freedom.
     ELSE
          CHSQ=CHSQ+(BINS1(J)-BINS2(J))**2/(BINS1(J)+BINS2(J))
     ENDIF
11   CONTINUE
PROB=GAMMQ(0.5*DF,0.5*CHSQ)     //Chi-square probability function. See §6.2
RETURN
END
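
As a usage sketch (not from the book) echoing the birdwatcher example above: five species are tallied on the same random sample of days in two years, so KNSTRN=-1 and the two totals need not match. The program name and counts are invented; CHSTWO and GAMMQ are assumed linked in.

PROGRAM TSTCH2
DIMENSION YR1(5),YR2(5)
DATA YR1/42.,15.,8.,27.,3./     //Invented sightings per species, year one.
DATA YR2/38.,19.,11.,21.,6./     //Ditto, year two; totals need not be equal.
CALL CHSTWO(YR1,YR2,5,-1,DF,CHSQ,PROB)     //KNSTRN=-1: totals not constrained.
WRITE(*,*) 'DF=',DF,' CHISQ=',CHSQ,' PROB=',PROB
END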



Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (or K-S) test is applicable to unbinned distributions that are functions of a single independent variable, that is, to data sets where each data point can be associated with a single number (lifetime of each lightbulb when it burns out, or declination of each star). In such cases, the list of data points can be easily converted to an unbiased estimator SN(x) of the cumulative distribution function of the probability distribution from which it was drawn: If the N events are located at values xi, i = 1,...,N, then SN(x) is the function giving the fraction of data points to the left of a given value x. This function is obviously constant between consecutive (i.e. sorted into ascending order) xi’s, and jumps by the same constant 1/N at each xi. (See Figure 13.5.1).
Figure 13.5.1. Kolmogorov-Smirnov statistic D. A measured distribution of values in x (shown as N dots on the lower abscissa) is to be compared with a theoretical distribution whose cumulative probability distribution is plotted as P(x). A step-function cumulative probability distribution SN(x) is constructed, one which rises an equal amount at each measured point. D is the greatest distance between the two cumulative distributions.

Different distribution functions, or sets of data, give different cumulative distribution function estimates by the above procedure. However, all cumulative distribution functions agree at the smallest allowable value of x (where they are zero), and at the largest allowable value of x (where they are unity). (The smallest and largest values might of course be ±∞). So it is the behavior between the largest and smallest values that distinguishes distributions.

One can think of any number of statistics to measure the overall difference between two cumulative distribution functions: The absolute value of the area between them, for example. Or their integrated mean square difference. The Kolmogorov-Smirnov D is a particularly simple measure: It is defined as the maximum value of the absolute difference between two cumulative distribution functions. Thus, for comparing one data set's SN(x) to a known cumulative distribution function P(x), the K-S statistic is

(13.5.3)

D = max |SN(x) - P(x)|,  -∞ < x < ∞

while for comparing two different cumulative distribution functions SN1(x) and SN2(x), the K-S statistic is

(13.5.4)
D = max |SN1(x) - SN2(x)|,  -∞ < x < ∞

What makes the K-S statistic useful is that its distribution in the case of the null hypothesis (data sets drawn from the same distribution) can be calculated, at least to useful approximation, thus giving the significance of any observed nonzero value of D.

The function which enters into the calculation of the significance can be written as the following sum,

(13.5.5)

QKS(λ) = 2 Σ(j=1 to ∞) (-1)^(j-1) e^(-2j²λ²)

which is a monotonic function with the limiting values

(13.5.6)
QKS(0) = 1            QKS(∞) = 0

In terms of this function, the significance level of an observed value of D (as a disproof of the null hypothesis that the distributions are the same) is given approximately by the formulas

(13.5.7)
Probability (D > observed) = QKS(√N D)

for the case (13.5.3) of one distribution, where N is the number of data points, and

(13.5.8)
Probability (D > observed) = QKS(√(N1 N2/(N1 + N2)) D)

for the case (13.5.4) of two distributions, where N1 is the number of data points in the first distribution, N2 the number in the second.

The nature of the approximation involved in (13.5.7) and (13.5.8) is that it becomes asymptotically accurate as the N’s become large. In practice N = 20 is large enough, especially if you are being conservative and requiring a strong significance level (0.01 or smaller).

So, we have the following routines for the cases of one and two distributions:

SUBROUTINE KSONE(DATA,N,FUNC,D,PROB)
   Given an array DATA of N values, and given a user-supplied function of a single variable,
   FUNC, which is a cumulative distribution function ranging from 0 (for the smallest values
   of its argument) to 1 (for the largest values of its argument), this routine returns the
   K-S statistic D and the significance level PROB. Small values of PROB show that the
   cumulative distribution function of DATA is significantly different from FUNC. The array
   DATA is modified by being sorted into ascending order.
DIMENSION DATA(N)
CALL SORT(N,DATA)     //If the data are already sorted into ascending order, then this call can be omitted.
EN=N
D=0.
FO=0.     //Data's c.d.f. before the next step.
DO 11 J=1,N     //Loop over the sorted data points.
     FN=J/EN     //Data's c.d.f. after this step.
     FF=FUNC(DATA(J))     //Compare to the user-supplied function.
     DT=AMAX1(ABS(FO-FF),ABS(FN-FF))     //Maximum distance.
     IF(DT.GT.D)D=DT
     FO=FN
11   CONTINUE
PROB=PROBKS(SQRT(EN)*D)     //Compute significance.
RETURN
END
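
The following driver (a sketch, not from the book) tests 20 values in [0,1) against the uniform cumulative distribution function F(x) = x. The sample values come from a simple golden-ratio sequence purely for illustration; the program and function names are invented, and SORT and PROBKS from this page are assumed to be linked in.

PROGRAM TSTKS1
EXTERNAL UNIF
DIMENSION DATA(20)
DO 11 J=1,20
     DATA(J)=AMOD(0.61803399*J,1.0)     //Stand-in sample values in [0,1).
11   CONTINUE
CALL KSONE(DATA,20,UNIF,D,PROB)
WRITE(*,*) 'D=',D,' PROB=',PROB
END

FUNCTION UNIF(X)
UNIF=X     //The uniform c.d.f. on [0,1].
RETURN
END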


SUBROUTINE KSTWO(DATA1,N1,DATA2,N2,D,PROB)
   Given an array DATA1 of N1 values, and an array DATA2 of N2 values, this routine returns
   the K-S statistic D, and the significance level PROB for the null hypothesis that the data
   sets are drawn from the same distribution. Small values of PROB show that the cumulative
   distribution function of DATA1 is significantly different from that of DATA2. The arrays
   DATA1 and DATA2 are modified by being sorted into ascending order.
DIMENSION DATA1(N1),DATA2(N2)
CALL SORT(N1,DATA1)
CALL SORT(N2,DATA2)
EN1=N1
EN2=N2
J1=1     //Next value of DATA1 to be processed.
J2=1     //Ditto, DATA2.
FO1=0.     //Value of DATA1's c.d.f. before the next step.
FO2=0.     //Ditto, for DATA2.
D=0.
1 IF(J1.LE.N1 .AND. J2.LE.N2) THEN     //If we are not done...
     IF(DATA1(J1).LT.DATA2(J2))THEN     //Next step is in DATA1.
          FN1=J1/EN1
          DT=AMAX1(ABS(FN1-FO2),ABS(FO1-FO2))
          IF(DT.GT.D)D=DT
          FO1=FN1
          J1=J1+1
     ELSE     //Next step is in DATA2.
          FN2=J2/EN2
          DT=AMAX1(ABS(FN2-FO1),ABS(FO2-FO1))
          IF (DT.GT.D)D=DT
          FO2=FN2
          J2=J2+1
     ENDIF
     GO TO 1
ENDIF
PROB=PROBKS(SQRT(EN1*EN2/(EN1+EN2))*D)     //Compute significance.
RETURN
END
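
A corresponding sketch (not from the book) for KSTWO on two invented samples follows; note that these N’s are far below the N ≈ 20 needed for the significance approximation (13.5.8) to be trustworthy, so it only illustrates the calling sequence. The names and values are made up; SORT and PROBKS are again assumed available.

PROGRAM TSTKS2
DIMENSION DAT1(5),DAT2(7)
DATA DAT1/0.12,0.31,0.50,0.68,0.90/     //Invented sample one.
DATA DAT2/0.05,0.22,0.35,0.51,0.64,0.80,0.93/     //Invented sample two.
CALL KSTWO(DAT1,5,DAT2,7,D,PROB)
WRITE(*,*) 'D=',D,' PROB=',PROB
END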


Both of the above routines use the following routine for calculating the function QKS:

FUNCTION PROBKS(ALAM)
PARAMETER (EPS1=0.001, EPS2=1.E-8)
A2=-2.*ALAM**2
FAC=2.
PROBKS=0.
TERMBF=0.     //Previous term in sum.
DO 11 J=1,100
     TERM=FAC*EXP(A2*J**2)
     PROBKS=PROBKS+TERM
     IF (ABS(TERM) .LT. EPS1*TERMBF .OR. ABS(TERM) .LT. EPS2*PROBKS) RETURN
     FAC=-FAC     //Alternating signs in sum.
     TERMBF=ABS(TERM)
11   CONTINUE
PROBKS=1.     //Get here only by failing to converge.
RETURN
END


REFERENCES AND FURTHER READING:
von Mises, Richard. 1964, Mathematical Theory of Probability and Statistics (New York: Academic Press), Chapters IX(C) and IX(E).

6.2 Incomplete Gamma Function, Error Function, Chi-Square Probability Function, Cumulative Poisson Function

The incomplete gamma function is defined by

(6.2.1)

P(a,x) ≡ γ(a,x)/Γ(a) ≡ (1/Γ(a)) ∫₀ˣ e⁻ᵗ tᵃ⁻¹ dt     (a > 0)

It has the limiting values

(6.2.2)
P(a,0) = 0      and     P(a,∞) = 1


The incomplete gamma function P(a,x) is monotonic and (for a greater than one or so) rises from "near-zero" to "near-unity" in a range of x centered on about a - 1, and of width about √a (see Figure 6.2.1).



Figure 6.2.1. The incomplete gamma function P(a,x) for four values of a.

The complement of P(a,x) is also confusingly called an incomplete gamma function,

(6.2.3)

Q(a,x) ≡ 1 - P(a,x) ≡ Γ(a,x)/Γ(a) ≡ (1/Γ(a)) ∫ₓ^∞ e⁻ᵗ tᵃ⁻¹ dt     (a > 0)

It has the limiting values

(6.2.4)
Q(a,0) = 1     and     Q(a,∞) = 0


The notations P(a,x), γ(a,x), and Γ(a,x) are standard; the notation Q(a,x) is specific to this book.

There is a series development for γ(a,x) as follows:

(6.2.5)

γ(a,x) = e⁻ˣ xᵃ Σ(n=0 to ∞) [Γ(a)/Γ(a+1+n)] xⁿ

One does not actually need to compute a new Γ(a + 1 + n) for each n; one rather uses equation (6.1.3) and the previous coefficient.

A continued fraction development for Γ(a,x) is

(6.2.6)

Γ(a,x) = e⁻ˣ xᵃ (1/(x+ (1-a)/(1+ 1/(x+ (2-a)/(1+ 2/(x+ ···))))))     (x > 0)


It turns out that (6.2.5) converges rapidly for x less than about a + 1, while (6.2.6) converges rapidly for x greater than about a + 1. In these respective regimes each requires at most a few times √a terms to converge, and this many only near x = a, where the incomplete gamma functions are varying most rapidly. Thus (6.2.5) and (6.2.6) together allow evaluation of the function for all positive a and x. An extra dividend is that we never need compute a function value near zero by subtracting two nearly equal numbers. The higher-level calls for P(a,x) and Q(a,x) are

FUNCTION GAMMP(A,X)
   Returns the incomplete gamma function P(a,x)
IF (X.LT.0. .OR. A.LE.0.)PAUSE
IF (X.LT.A+1.) THEN     //Use the series representation.
     CALL GSER(GAMSER,A,X,GLN)
     GAMMP=GAMSER
ELSE     //Use the continued fraction representation.
     CALL GCF(GAMMCF,A,X,GLN)
     GAMMP=1.-GAMMCF
ENDIF
RETURN
END


FUNCTION GAMMQ(A,X)
   Returns the incomplete gamma function Q(a,x) ≡ 1 - P(a,x).
IF (X.LT.0. .OR. A.LE.0.) PAUSE
IF(X.LT.A+1.)THEN     //Use the series representation.
     CALL GSER(GAMSER,A,X,GLN)
     GAMMQ=1.-GAMSER     //and take its complement.
ELSE     //Use the continued fraction representation.
     CALL GCF(GAMMCF,A,X,GLN)
     GAMMQ=GAMMCF
ENDIF
RETURN
END

The argument GLN is returned by both the series and continued fraction procedures, containing the value ln Γ(a); the reason for this is so that it is available to you if you want to modify the above two procedures to give γ(a,x) and Γ(a,x), in addition to P(a,x) and Q(a,x) (cf. equations 6.2.1 and 6.2.3).

The procedures GSER and GCF which implement (6.2.5) and (6.2.6) are

SUBROUTINE GSER(GAMSER,A,X,GLN)
Returns the incomplete gamma function P(a,x) evaluated by its series representation as
GAMSER. Also returns ln Γ(a) as GLN.
PARAMETER (ITMAX=100,EPS=3.E-7)
GLN=GAMMLN(A)
IF(X.LE.0.)THEN
     IF (X.LT.0.)PAUSE
     GAMSER=0.
     RETURN
ENDIF
AP=A
SUM=1./A
DEL=SUM
DO 11 N=1,ITMAX
     AP=AP+1.
     DEL=DEL*X/AP
     SUM=SUM+DEL
     IF(ABS(DEL).LT.ABS(SUM)*EPS)GO TO 1
11 CONTINUE
  PAUSE 'A too large, ITMAX too small'
1 GAMSER=SUM*EXP(-X+A*LOG(X)-GLN)
RETURN
END


SUBROUTINE GCF(GAMMCF,A,X,GLN)
   Returns the incomplete gamma function Q(a,x) evaluated by its continued fraction
   representation as GAMMCF. Also returns ln Γ(a) as GLN.
PARAMETER (ITMAX=100,EPS=3.E-7)
GLN=GAMMLN(A)
GOLD=0.     //This is the previous value, tested against for convergence.
A0=1.
A1=X     //We are here setting up the A's and B's of equation (5.2.4) for evaluating the continued fraction.
B0=0.
B1=1.
FAC=1.     //FAC is the renormalization factor for preventing overflow of the partial numerators and denominators.
DO 11 N=1,ITMAX
     AN=FLOAT(N)
     ANA=AN-A
     A0=(A1+A0*ANA)*FAC     //One step of the recurrence (5.2.5).
     B0=(B1+B0*ANA)*FAC
     ANF=AN*FAC
     A1=X*A0+ANF*A1     //The next step of the recurrence (5.2.5).
     B1=X*B0+ANF*B1
     IF(A1.NE.0.)THEN     //Shall we renormalize?
          FAC=1./A1     //Yes. Set FAC so it happens.
          G=B1*FAC     //New value of answer.
          IF(ABS((G-GOLD)/G).LT.EPS)GO TO 1     //Converged? If so, exit.
          GOLD=G     //If not, save value.
     ENDIF
11 CONTINUE
  PAUSE 'A too large, ITMAX too small'
1 GAMMCF=EXP(-X+A*ALOG(X)-GLN)*G     //Put factors in front.
RETURN
END


Below is function GAMMLN used by GSER and GCF above. There are a variety of methods in use for calculating the function Γ(z) numerically, but none is quite as neat as the approximation derived by Lanczos. This scheme is entirely specific to the gamma function, seemingly plucked from thin air. We will not attempt to derive the approximation.

FUNCTION GAMMLN(XX)
   Returns the value ln[Γ(XX)] for XX > 0. Full accuracy is obtained for XX > 1.
   Internal arithmetic will be done in double precision, a nicety that you can omit
   if five-figure accuracy is good enough.
REAL*8 COF(6),STP,HALF,ONE,FPF,X,TMP,SER
DATA COF,STP/76.18009173D0,-86.50532033D0,24.01409822D0,
*    -1.231739516D0,.120858003D-2,-.536382D-5,2.50662827465D0/
DATA HALF,ONE,FPF/0.5D0,1.0D0,5.5D0/
X=XX-ONE
TMP=X+FPF
TMP=(X+HALF)*LOG(TMP)-TMP
SER=ONE
DO 11 J=1,6
     X=X+ONE
     SER=SER+COF(J)/X
11 CONTINUE
GAMMLN=TMP+LOG(STP*SER)
RETURN
END
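
A one-line sanity check (a sketch, not from the book, with an invented program name): since Γ(6) = 5! = 120, exponentiating GAMMLN(6.0) should recover 120 to the routine's accuracy.

PROGRAM TSTGLN
WRITE(*,*) EXP(GAMMLN(6.0))     //Should print approximately 120.
END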


Error Function

The error function and complementary error function are special cases of the incomplete gamma function, and are obtained moderately efficiently by the above procedures. Their definitions are

(6.2.7)
erf(x) = (2/√π) ∫₀ˣ e⁻ᵗ² dt

and

(6.2.8)
erfc(x) ≡ 1 - erf(x) = (2/√π) ∫ₓ^∞ e⁻ᵗ² dt

The functions have the following limiting values and symmetries:

(6.2.9)
erf(0) = 0      erf(∞) = 1      erf(-x) = -erf(x)


(6.2.10)
erfc(0) = 1      erfc(∞) = 0      erfc(-x) = 2 - erfc(x)


They are related to the incomplete gamma functions by

(6.2.11)
erf(x) = P(½, x²)     (x ≥ 0)

and

(6.2.12)
erfc(x) = Q(½, x²)     (x ≥ 0)


Hence we have

FUNCTION ERF(X)
   Returns the error function erf(X).
IF(X.LT.0.)THEN
     ERF=-GAMMP(.5,X**2)
ELSE
     ERF=GAMMP(.5,X**2)
ENDIF
RETURN
END


FUNCTION ERFC(X)
   Returns the complementary error function erfc(X).
IF(X.LT.0.)THEN
     ERFC=1.+GAMMP(.5,X**2)
ELSE
     ERFC=GAMMQ(.5,X**2)
ENDIF
RETURN
END


If you care to do so, you can easily remedy the minor inefficiency in ERF and ERFC, namely that Γ(0.5) = √π is computed unnecessarily when GAMMP or GAMMQ is called. Before you do that, however, you might wish to consider the following routine, based on Chebyshev fitting to an inspired guess as to the functional form:

FUNCTION ERFCC(X)
   Returns the complementary error function erfc(X) with fractional error
   everywhere less than 1.2 × 10⁻⁷.
Z=ABS(X)
T=1./(1.+0.5*Z)
ERFCC=T*EXP(-Z*Z-1.26551223+T*(1.00002368+T*(.37409196+
*    T*(.09678418+T*(-.18628806+T*(.27886807+T*(-1.13520398+
*    T*(1.48851587+T*(-.82215223+T*.17087277)))))))))
IF (X.LT.0.) ERFCC=2.-ERFCC
RETURN
END
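
A quick check (a sketch, not from the book, with an invented program name): erfc(0) = 1 exactly, and erfc(0.5) ≈ 0.4795, so the following should reproduce both values to within the stated fractional error.

PROGRAM TSTERF
WRITE(*,*) ERFCC(0.0),ERFCC(0.5)     //Should print approximately 1.0 and 0.4795.
END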


There are also some functions of two variables which are special cases of the incomplete gamma function:


Cumulative Poisson Probability Function

Px(< k), for positive x and integer k ≥ 1, denotes the cumulative Poisson probability function. It is defined as the probability that the number of Poisson random events occurring will be between 0 and k - 1 inclusive, if the expected mean number is x. It has the limiting values

(6.2.13)
Px(< 1) = e⁻ˣ       Px(< ∞) = 1


Its relation to the incomplete gamma function is simply

(6.2.14)
Px(< k) = Q(k,x) = GAMMQ(k,x)
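
For example (a usage sketch, assuming GAMMQ above is linked in; the variable name is invented), the probability that fewer than 3 events occur when the mean is x = 2.5 is

P3=GAMMQ(3.,2.5)     //Px(< 3) for x = 2.5; approximately 0.544.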



Chi-Square Probability Function

P(chisq | ν) is defined as the probability that the observed chi-square for a correct model should be less than a value chisq. (We will discuss the use of this function in Chapter 14.) Its complement Q(chisq | ν) is the probability that the observed chi-square will exceed the value chisq by chance even for a correct model. In both cases ν is an integer, the number of degrees of freedom. The functions have the limiting values

(6.2.15)
P(0|ν) = 0     P(∞|ν) = 1

(6.2.16)
Q(0|ν) = 1     Q(∞|ν) = 0

and the following relation to the incomplete gamma functions,

(6.2.17)
P(chisq|ν) = P(ν/2, chisq/2) = GAMMP(ν/2, chisq/2)

(6.2.18)
Q(chisq|ν) = Q(ν/2, chisq/2) = GAMMQ(ν/2, chisq/2)
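
For example (a usage sketch, assuming GAMMQ above is linked in; the variable name is invented): the 5% critical value of chi-square with ν = 10 degrees of freedom is about 18.3, so

Q10=GAMMQ(5.,9.15)     //Q(18.3|10) = GAMMQ(0.5*10., 0.5*18.3), approximately 0.05.

This is exactly the call CHSONE makes above, with A = 0.5*DF and X = 0.5*CHSQ.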

REFERENCES AND FURTHER READING:

Abramowitz, Milton, and Stegun, Irene A. 1964, Handbook of Mathematical Functions, Applied Mathematics Series, vol. 55 (Washington: National Bureau of Standards; reprinted 1968 by Dover Publications, New York), Chapters 6, 7, and 26.

Pearson, K. (ed.) 1951, Tables of the Incomplete Gamma Function (Cambridge: Cambridge University Press).


ALL OF THE ABOVE IS FROM THE FOLLOWING BOOK:
NUMERICAL RECIPES: The Art of Scientific Computing
William H. Press, Harvard-Smithsonian Center for Astrophysics
Brian P. Flannery, EXXON Research and Engineering Company
Saul A. Teukolsky, Department of Physics, Cornell University
William T. Vetterling, Polaroid Corporation
Cambridge University Press, 1986
This book contains a wealth of knowledge along with source code examples on how to solve various kinds of programming problems. There are several programming language versions of this book available, including FORTRAN-77, FORTRAN-90, C, C++, Pascal, and BASIC. Machine-readable versions of all the sample routines are also available in various languages if you don't want to type in the code yourself. This is an excellent book; you should own a copy.
Also see their web site at: http://www.numerical-recipes.com/index.html
