SSJ
3.3.1
Stochastic Simulation in Java

Goodness-of-fit test statistics.
Classes  
class  FBar 
This class is similar to FDist, except that it provides static methods to compute or approximate the complementary distribution function of \(X\), which we define as \(\bar{F} (x) = P[X\ge x]\), instead of \(F (x)=P[X\le x]\).
class  FDist 
This class provides methods to compute (or approximate) the distribution functions of special types of goodness-of-fit test statistics.
class  GofFormat 
This class contains methods used to format results of GOF test statistics, or to apply a series of tests simultaneously and format the results.
class  GofStat 
This class provides methods to compute several types of EDF goodness-of-fit test statistics and to apply certain transformations to a set of observations.
class  KernelDensity 
This static class provides methods to compute a kernel density estimator from a set of \(n\) individual observations \(x_0, \ldots, x_{n-1}\), which define an empirical distribution.
This package contains tools for performing univariate goodness-of-fit (GOF) statistical tests. Methods for computing (or approximating) the distribution function \(F(x)\) of certain GOF test statistics, as well as their complementary distribution function \(\bar{F}(x) = 1 - F(x)\), are implemented in classes of package umontreal.ssj.probdist. Tools for computing the GOF test statistics and the corresponding \(p\)-values, and for formatting the results, are provided in classes GofStat and GofFormat.
We are concerned here with GOF test statistics for testing the hypothesis \(\mathcal{H}_0\) that a sample of \(N\) observations \(X_1,\ldots,X_N\) comes from a given univariate probability distribution \(F\). We consider tests such as those of Kolmogorov-Smirnov, Anderson-Darling, Cramér-von Mises, etc. These test statistics generally measure, in different ways, the distance between a continuous cumulative distribution function (cdf) \(F\) and the corresponding empirical distribution function (EDF) \(\hat{F}_N\) of \(X_1,\ldots,X_N\). They are also called EDF test statistics. The observations \(X_i\) are usually transformed into \(U_i = F (X_i)\), which satisfy \(0\le U_i\le 1\) and which follow the \(U(0,1)\) distribution under \(\mathcal{H}_0\). (This is called the probability integral transformation.) Methods for applying this transformation, as well as other types of transformations, to the observations \(X_i\) or \(U_i\) are provided in umontreal.ssj.gof.GofStat.
Then the GOF tests are applied to the \(U_i\) sorted in increasing order. The corresponding \(p\)-values are easily computed by calling the appropriate methods in the classes of package umontreal.ssj.probdist. If a GOF test statistic \(Y\) has a continuous distribution under \(\mathcal{H}_0\) and takes the value \(y\), its (right) \(p\)-value is defined as \(p = P[Y \ge y \mid\mathcal{H}_0]\). The test usually rejects \(\mathcal{H}_0\) if \(p\) is deemed too close to 0 (for a one-sided test) or too close to 0 or 1 (for a two-sided test).
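The transformation and sorting steps described above can be sketched in plain Java as follows. This is only an illustration: the class EdfSketch and its methods are hypothetical names, not part of SSJ, and the Kolmogorov-Smirnov statistic is computed from its standard textbook definition \(D_N = \max(D_N^+, D_N^-)\), with \(D_N^+ = \max_j (j/N - U_{(j)})\) and \(D_N^- = \max_j (U_{(j)} - (j-1)/N)\). The sample values in main are made up for the example.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical helper (not part of SSJ): probability integral transformation and KS statistic. */
final class EdfSketch {

    /** Returns the values U_i = F(X_i), sorted in increasing order. */
    static double[] sortedUniforms(double[] x, DoubleUnaryOperator cdf) {
        double[] u = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            u[i] = cdf.applyAsDouble(x[i]);
        }
        Arrays.sort(u);
        return u;
    }

    /** Kolmogorov-Smirnov statistic D_N = max(D_N^+, D_N^-) for already sorted U_(1) <= ... <= U_(N). */
    static double kolmogorovSmirnov(double[] sortedU) {
        int n = sortedU.length;
        double dPlus = 0.0;
        double dMinus = 0.0;
        for (int i = 0; i < n; i++) {
            // The 0-based index i corresponds to the order statistic U_(i+1).
            dPlus = Math.max(dPlus, (i + 1.0) / n - sortedU[i]);
            dMinus = Math.max(dMinus, sortedU[i] - (double) i / n);
        }
        return Math.max(dPlus, dMinus);
    }

    public static void main(String[] args) {
        // Illustrative data, tested against the exponential distribution with mean 1.
        double[] x = {0.1, 0.5, 1.2, 2.3};
        double[] u = sortedUniforms(x, t -> 1.0 - Math.exp(-t));
        System.out.println("D_N = " + kolmogorovSmirnov(u));
    }
}
```

The \(p\)-value would then be obtained by evaluating the complementary distribution function of \(D_N\) at the observed value, which this sketch does not attempt.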
In the case where \(Y\) has a discrete distribution under \(\mathcal{H}_0\), we distinguish the right \(p\)-value \(p_R = P[Y \ge y \mid\mathcal{H}_0]\) and the left \(p\)-value \(p_L = P[Y \le y \mid\mathcal{H}_0]\). We then define the \(p\)-value for a two-sided test as
\begin{align} p & = \left\{ \begin{array}{ll} p_R, & \mbox{if } p_R < p_L \\ 1 - p_L, & \mbox{if } p_R \ge p_L \mbox{ and } p_L < 0.5 \\ 0.5 & \mbox{otherwise.} \end{array} \right. \tag{pdisc} \end{align}
Why such a definition? Consider for example a Poisson random variable \(Y\) with mean 1 under \(\mathcal{H}_0\). If \(Y\) takes the value 0, the right \(p\)-value is \(p_R = P[Y \ge 0 \mid\mathcal{H}_0] = 1\). In the uniform case, this would obviously lead to rejecting \(\mathcal{H}_0\) on the basis that the \(p\)-value is too close to 1. However, \(P[Y = 0 \mid\mathcal{H}_0] = 1/e \approx 0.368\), so it does not really make sense to reject \(\mathcal{H}_0\) in this case. In fact, the left \(p\)-value here is \(p_L = 0.368\), and the \(p\)-value computed with the above definition is \(p = 1 - p_L \approx 0.632\). Note that if \(p_L\) is very small, in this definition, \(p\) becomes close to 1. If the left \(p\)-value were defined as \(p_L = 1 - p_R = P[Y < y \mid\mathcal{H}_0]\), this would also lead to problems. In the example, one would have \(p_L = 0\) in that case.
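The Poisson example can be checked numerically with a short sketch. The class DiscretePValueSketch and its methods are hypothetical names introduced here for illustration, not SSJ API. The method twoSided applies the three-case definition of \(p\) given above, and poissonCdf sums the Poisson probabilities directly, which is adequate only for small arguments.

```java
/** Hypothetical helper (not part of SSJ): two-sided p-value for a discrete test statistic. */
final class DiscretePValueSketch {

    /** Applies the three-case definition of p to the left and right p-values. */
    static double twoSided(double pLeft, double pRight) {
        if (pRight < pLeft) {
            return pRight;
        }
        if (pLeft < 0.5) {
            return 1.0 - pLeft;
        }
        return 0.5;
    }

    /** P[Y <= y] for a Poisson variable with mean lambda, by direct summation (returns 0 if y < 0). */
    static double poissonCdf(double lambda, int y) {
        double term = Math.exp(-lambda); // P[Y = 0]
        double sum = 0.0;
        for (int k = 0; k <= y; k++) {
            sum += term;
            term *= lambda / (k + 1);    // P[Y = k+1] from P[Y = k]
        }
        return sum;
    }

    public static void main(String[] args) {
        double lambda = 1.0;
        int y = 0;
        double pLeft = poissonCdf(lambda, y);            // P[Y <= 0] = 1/e
        double pRight = 1.0 - poissonCdf(lambda, y - 1); // P[Y >= 0] = 1
        System.out.println("p_L = " + pLeft + ", p_R = " + pRight
                + ", p = " + twoSided(pLeft, pRight));
    }
}
```

Here \(p_R = 1 \ge p_L\) and \(p_L < 0.5\), so the second case applies and \(p = 1 - 1/e\), in agreement with the value \(0.632\) quoted in the text.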
A very common type of test in the discrete case is the chi-square test, which applies when the possible outcomes are partitioned into a finite number of categories. Suppose there are \(k\) categories and that each observation belongs to category \(i\) with probability \(p_i\), for \(0\le i < k\). If there are \(n\) independent observations, the expected number of observations in category \(i\) is \(e_i = n p_i\), and the chi-square test statistic is defined as
\[ X^2 = \sum_{i=0}^{k-1} \frac{(o_i - e_i)^2}{e_i} \tag{chisquare0} \]
where \(o_i\) is the actual number of observations in category \(i\). Assuming that all \(e_i\)’s are large enough (a popular rule of thumb asks for \(e_i \ge 5\) for each \(i\)), \(X^2\) follows approximately the chi-square distribution with \(k-1\) degrees of freedom [207]. The class GofStat.OutcomeCategoriesChi2, a nested class defined inside the GofStat class, provides tools to automatically regroup categories in the cases where some \(e_i\)’s are too small.
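As an illustration of the formula for \(X^2\), the statistic can be computed directly from the observed counts \(o_i\) and the category probabilities \(p_i\). The sketch below is self-contained and does not use SSJ: the class ChiSquareSketch is a hypothetical name, the die-throw counts are made-up data, and no regrouping of categories with small \(e_i\) is attempted.

```java
/** Hypothetical helper (not part of SSJ): chi-square statistic from observed counts. */
final class ChiSquareSketch {

    /** Returns X^2 = sum over i of (o_i - e_i)^2 / e_i, with e_i = n * p_i and n = sum of o_i. */
    static double chiSquare(int[] observed, double[] probs) {
        if (observed.length != probs.length) {
            throw new IllegalArgumentException("observed and probs must have the same length");
        }
        long n = 0;
        for (int o : observed) {
            n += o;
        }
        double x2 = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double expected = n * probs[i];        // e_i
            double diff = observed[i] - expected;  // o_i - e_i
            x2 += diff * diff / expected;
        }
        return x2;
    }

    public static void main(String[] args) {
        // Made-up counts for 60 throws of a die assumed fair under H0, so each e_i = 10.
        int[] observed = {8, 12, 9, 11, 10, 10};
        double[] probs = new double[6];
        java.util.Arrays.fill(probs, 1.0 / 6.0);
        System.out.println("X^2 = " + chiSquare(observed, probs));
    }
}
```

For these counts the exact value is \(X^2 = (4 + 4 + 1 + 1 + 0 + 0)/10 = 1\), to be compared with the chi-square distribution with \(k-1 = 5\) degrees of freedom.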
The class GofFormat contains methods used to format results of GOF test statistics, or to apply several such tests simultaneously to a given data set and format the results to produce a report that also contains the \(p\)-values of all these tests. A C version of this class is actually used extensively in the package TestU01, which applies statistical tests to random number generators [133]. The class also provides tools to plot an empirical or theoretical distribution function, by creating a data file that contains a graphic plot in a format compatible with a given software package.