SSJ
3.3.1
Stochastic Simulation in Java
This class provides methods to compute several types of EDF goodness-of-fit test statistics and to apply certain transformations to a set of observations. More...
Classes

class OutcomeCategoriesChi2
This class helps manage the partition of possible outcomes into categories for applying chi-square tests. More...
Transforming the observations

static DoubleArrayList unifTransform (DoubleArrayList data, ContinuousDistribution dist)
Applies the probability integral transformation \(U_i = F (V_i)\) for \(i = 0, 1, …, n-1\), where \(F\) is a continuous distribution function, and returns the result as an array of length \(n\). More...

static DoubleArrayList unifTransform (DoubleArrayList data, DiscreteDistribution dist)
Applies the transformation \(U_i = F (V_i)\) for \(i = 0, 1, …, n-1\), where \(F\) is a discrete distribution function, and returns the result as an array of length \(n\). More...

static void diff (IntArrayList sortedData, IntArrayList spacings, int n1, int n2, int a, int b)
Assumes that the observations \(U_0,…,U_{n-1}\) contained in sortedData are already sorted in increasing order and computes the differences between the successive observations. More...

static void diff (DoubleArrayList sortedData, DoubleArrayList spacings, int n1, int n2, double a, double b)
Same as method diff(IntArrayList,IntArrayList,int,int,int,int), but for the continuous case. More...

static void iterateSpacings (DoubleArrayList data, DoubleArrayList spacings)
Applies one iteration of the iterated spacings transformation [112], [226]. More...

static void powerRatios (DoubleArrayList sortedData)
Applies the power ratios transformation \(W\) described in section 8.4 of Stephens [226]. More...
Computing EDF test statistics

static double EPSILONAD = Num.DBL_EPSILON/2
Used by andersonDarling(DoubleArrayList). More...

static double chi2 (double[] nbExp, int[] count, int smin, int smax)
Computes and returns the chi-square statistic for the observations \(o_i\) in count[smin...smax], for which the corresponding expected values \(e_i\) are in nbExp[smin...smax]. More...

static double chi2 (OutcomeCategoriesChi2 cat, int[] count)
Computes and returns the chi-square statistic for the observations \(o_i\) in count, for which the corresponding expected values \(e_i\) are in cat. More...

static double chi2 (IntArrayList data, DiscreteDistributionInt dist, int smin, int smax, double minExp, int[] numCat)
Computes and returns the chi-square statistic for the observations stored in data, assuming that these observations follow the discrete distribution dist. More...

static double chi2Equal (double nbExp, int[] count, int smin, int smax)
Similar to chi2(double[],int[],int,int), except that the expected number of observations per category is assumed to be the same for all categories, and equal to nbExp. More...

static double chi2Equal (DoubleArrayList data, double minExp)
Computes the chi-square statistic for a continuous distribution. More...

static double chi2Equal (DoubleArrayList data)
Equivalent to chi2Equal (data, 10). More...

static int scan (DoubleArrayList sortedData, double d)
Computes and returns the scan statistic \(S_n (d)\), defined in ( scan ). More...
static double cramerVonMises (DoubleArrayList sortedData)
Computes and returns the Cramér-von Mises statistic \(W_n^2\) (see [55], [224], [225] ). More...

static double watsonG (DoubleArrayList sortedData)
Computes and returns the Watson statistic \(G_n\) (see [238], [41] ). More...

static double watsonU (DoubleArrayList sortedData)
Computes and returns the Watson statistic \(U_n^2\) (see [55], [224], [225] ). More...

static double andersonDarling (DoubleArrayList sortedData)
Computes and returns the Anderson-Darling statistic \(A_n^2\) (see method andersonDarling(double[])). More...

static double andersonDarling (double[] sortedData)
Computes and returns the Anderson-Darling statistic \(A_n^2\) (see [165], [225], [6] ). More...

static double[] andersonDarling (double[] data, ContinuousDistribution dist)
Computes the Anderson-Darling statistic \(A_n^2\) and the corresponding \(p\)-value \(p\). More...

static double[] kolmogorovSmirnov (double[] sortedData)
Computes the Kolmogorov-Smirnov (KS) test statistics \(D_n^+\), \(D_n^-\), and \(D_n\) (see method kolmogorovSmirnov(DoubleArrayList)). More...

static double[] kolmogorovSmirnov (DoubleArrayList sortedData)
Computes the Kolmogorov-Smirnov (KS) test statistics \(D_n^+\), \(D_n^-\), and \(D_n\). More...

static void kolmogorovSmirnov (double[] data, ContinuousDistribution dist, double[] sval, double[] pval)
Computes the Kolmogorov-Smirnov (KS) test statistics and their \(p\)-values. More...

static double[] kolmogorovSmirnovJumpOne (DoubleArrayList sortedData, double a)
Computes the KS statistics \(D_n^+(a)\) and \(D_n^-(a)\) defined in the description of the method FDist.kolmogorovSmirnovPlusJumpOne, assuming that \(F\) is the uniform distribution over \([0,1]\) and that \(U_{(1)},…,U_{(n)}\) are in sortedData. More...

static double pDisc (double pL, double pR)
Computes a variant of the \(p\)-value \(p\) whenever a test statistic has a discrete probability distribution. More...
This class provides methods to compute several types of EDF goodness-of-fit test statistics and to apply certain transformations to a set of observations.
This includes the probability integral transformation \(U_i = F(X_i)\), as well as the power ratio and iterated spacings transformations [226]. Here, \(U_{(0)}, …, U_{(n-1)}\) stand for \(n\) observations \(U_0,…,U_{n-1}\) sorted in increasing order, where \(0\le U_i\le1\).
Note: This class uses the Colt library.
static double andersonDarling (DoubleArrayList sortedData)

Computes and returns the Anderson-Darling statistic \(A_n^2\) (see method andersonDarling(double[])).

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
static double andersonDarling (double[] sortedData)

Computes and returns the Anderson-Darling statistic \(A_n^2\) (see [165], [225], [6] ), defined by
\begin{align*} A_n^2 & = -n -\frac{1}{n} \sum_{j=0}^{n-1} \left\{ (2j+1)\ln(U_{(j)}) + (2n-1-2j) \ln(1-U_{(j)}) \right\}, \tag{Andar} \end{align*}
assuming that sortedData contains \(U_{(0)},…,U_{(n-1)}\) sorted in increasing order. When computing \(A_n^2\), all observations \(U_i\) are projected onto the interval \([\epsilon, 1-\epsilon]\) for some \(\epsilon > 0\), in order to avoid numerical overflow when taking the logarithm of \(U_i\) or \(1-U_i\). The variable EPSILONAD gives the value of \(\epsilon\).

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
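To illustrate equation (Andar), the statistic can be computed directly from the sorted uniforms. The sketch below is not the SSJ implementation; it is a minimal standalone version of the formula, with the same \([\epsilon, 1-\epsilon]\) clipping, using \(\epsilon = 2^{-53}\) to mirror EPSILONAD (the class name is invented for the example).

```java
// Minimal sketch of the Anderson-Darling statistic (Andar); not SSJ code.
// Assumes sortedData holds U_(0) <= ... <= U_(n-1) in [0,1].
public class AndersonDarlingSketch {
    // Assumed epsilon, mirroring EPSILONAD = DBL_EPSILON/2 = 2^{-53}.
    static final double EPSILON_AD = Math.ulp(1.0) / 2.0;

    public static double andersonDarling(double[] sortedData) {
        int n = sortedData.length;
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            // Clip U_j into [eps, 1 - eps] to avoid taking log(0).
            double u = Math.min(Math.max(sortedData[j], EPSILON_AD), 1.0 - EPSILON_AD);
            sum += (2 * j + 1) * Math.log(u) + (2 * n - 1 - 2 * j) * Math.log(1.0 - u);
        }
        return -n - sum / n;
    }
}
```

For a single observation \(U_{(0)} = 0.5\), the formula reduces to \(A_1^2 = -1 - 2\ln(0.5) = 2\ln 2 - 1\).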
static double[] andersonDarling (double[] data, ContinuousDistribution dist)

Computes the Anderson-Darling statistic \(A_n^2\) and the corresponding \(p\)-value \(p\). The \(n\) (unsorted) observations in data are assumed to be independent and to come from the continuous distribution dist. Returns the two-element array [\(A_n^2\), \(p\)].

data: array of observations
dist: assumed distribution of the observations
static double chi2 (double[] nbExp, int[] count, int smin, int smax)

Computes and returns the chi-square statistic for the observations \(o_i\) in count[smin...smax], for which the corresponding expected values \(e_i\) are in nbExp[smin...smax]. Assuming that \(i\) goes from 1 to \(k\), where \(k =\) smax-smin+1 is the number of categories, the chi-square statistic is defined as
\[ X^2 = \sum_{i=1}^k \frac{(o_i - e_i)^2}{e_i}. \tag{chi-square} \]
Under the hypothesis that the \(e_i\) are the correct expectations and if these \(e_i\) are large enough, \(X^2\) follows approximately the chi-square distribution with \(k-1\) degrees of freedom. If some of the \(e_i\) are too small, one can use OutcomeCategoriesChi2 to regroup categories.

nbExp: numbers expected in each category
count: numbers observed in each category
smin: index of the first valid data in count and nbExp
smax: index of the last valid data in count and nbExp
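Equation (chi-square) is a direct sum over the valid index range. The following is a minimal standalone sketch of that sum, not the SSJ implementation (the class name is invented for the example):

```java
// Minimal sketch of the chi-square statistic (chi-square); not SSJ code.
// Sums (o_i - e_i)^2 / e_i over the valid index range [smin, smax].
public class Chi2Sketch {
    public static double chi2(double[] nbExp, int[] count, int smin, int smax) {
        double x2 = 0.0;
        for (int i = smin; i <= smax; i++) {
            double d = count[i] - nbExp[i];
            x2 += d * d / nbExp[i];
        }
        return x2;
    }
}
```

For example, with expected counts {5, 5} and observed counts {4, 6}, the statistic is \((1 + 1)/5 = 0.4\).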
static double chi2 (OutcomeCategoriesChi2 cat, int[] count)

Computes and returns the chi-square statistic for the observations \(o_i\) in count, for which the corresponding expected values \(e_i\) are in cat. This assumes that cat.regroupCategories has been called beforehand to regroup categories, in order to make sure that the expected numbers in each category are large enough for the chi-square test.

cat: numbers expected in each category
count: numbers observed in each category
static double chi2 (IntArrayList data, DiscreteDistributionInt dist, int smin, int smax, double minExp, int[] numCat)

Computes and returns the chi-square statistic for the observations stored in data, assuming that these observations follow the discrete distribution dist. For dist, we assume that there is one set \(S=\{a, a+1,…, b-1, b\}\), where \(a<b\) and \(a\ge0\), for which \(p(s)>0\) if \(s\in S\) and \(p(s)=0\) otherwise.

Generally, it is not possible to divide the integers into intervals satisfying \(nP(a_0\le s< a_1)=nP(a_1\le s< a_2)=\cdots=nP(a_{j-1}\le s< a_j)\) for a discrete distribution, where \(n\) is the sample size, i.e., the number of observations stored in data. To perform a general chi-square test, the method starts from smin and finds the first non-negligible probability \(p(s)\ge\epsilon\), where \(\epsilon=\) DiscreteDistributionInt.EPSILON. It uses smax to allocate an array storing the number of expected observations (\(np(s)\)) for each \(s\ge\) smin. Starting from \(s=\) smin, the \(np(s)\) terms are computed and the allocated array grows if required, until a negligible probability term is found. This gives the number of expected observations for each category, where an outcome category corresponds here to an interval in which sample observations could lie. The categories are regrouped to have at least minExp expected observations per category. The method then counts the number of samples in each category and calls chi2(double[],int[],int,int) to get the chi-square test statistic. If numCat is not null, the number of categories after regrouping is returned in numCat[0]. The number of degrees of freedom is equal to numCat[0]-1. We usually choose minExp = 10.

data: observations, not necessarily sorted
dist: assumed probability distribution
smin: estimated minimum value of \(s\) for which \(p(s)>0\)
smax: estimated maximum value of \(s\) for which \(p(s)>0\)
minExp: minimum number of expected observations in each interval
numCat: one-element array that will be filled with the number of categories after regrouping
static double chi2Equal (double nbExp, int[] count, int smin, int smax)

Similar to chi2(double[],int[],int,int), except that the expected number of observations per category is assumed to be the same for all categories, and equal to nbExp.

nbExp: number of expected observations in each category (or interval)
count: number of counted observations in each category
smin: index of the first valid data in count
smax: index of the last valid data in count
static double chi2Equal (DoubleArrayList data, double minExp)

Computes the chi-square statistic for a continuous distribution. Here, the equiprobable case can be used. Assuming that data contains observations coming from the uniform distribution, the interval \([0,1]\) is divided into \(1/p\) subintervals, where \(p=\) minExp\(/n\), \(n\) being the sample size, i.e., the number of observations stored in data. For each subinterval, the method counts the number of observations it contains, and the chi-square statistic is computed using chi2Equal(double,int[],int,int). We usually choose minExp = 10.

data: array of observations in \([0,1)\)
minExp: minimum number of expected observations in each subinterval
static double chi2Equal (DoubleArrayList data)

Equivalent to chi2Equal (data, 10).

data: array of observations in \([0,1)\)
static double cramerVonMises (DoubleArrayList sortedData)

Computes and returns the Cramér-von Mises statistic \(W_n^2\) (see [55], [224], [225] ), defined by
\[ W_n^2 = \frac{1}{12n} + \sum_{j=0}^{n-1} \left(U_{(j)} - \frac{(j+0.5)}{n}\right)^2, \tag{CraMis} \]
assuming that sortedData contains \(U_{(0)},…,U_{(n-1)}\) sorted in increasing order.

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
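Equation (CraMis) is straightforward to evaluate from the sorted uniforms. The sketch below is a minimal standalone version of the formula, not the SSJ implementation (the class name is invented for the example):

```java
// Minimal sketch of the Cramér-von Mises statistic (CraMis); not SSJ code.
// Assumes sortedData holds U_(0) <= ... <= U_(n-1) in [0,1].
public class CramerVonMisesSketch {
    public static double cramerVonMises(double[] sortedData) {
        int n = sortedData.length;
        double w2 = 1.0 / (12.0 * n);
        for (int j = 0; j < n; j++) {
            double d = sortedData[j] - (j + 0.5) / n;
            w2 += d * d;
        }
        return w2;
    }
}
```

For the perfectly spread sample {0.25, 0.75}, every term of the sum vanishes and \(W_2^2 = 1/24\).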
static void diff (IntArrayList sortedData, IntArrayList spacings, int n1, int n2, int a, int b)

Assumes that the integer-valued observations \(U_0,…,U_{n-1}\) contained in sortedData are already sorted in increasing order, and computes the differences between the successive observations. Let \(D\) be the differences returned in spacings. The difference \(U_i - U_{i-1}\) is put in \(D_i\) for n1 < i <= n2, whereas \(U_{n1} - a\) is put into \(D_{n1}\) and \(b - U_{n2}\) is put into \(D_{n2+1}\). The number of observations must be greater than or equal to n2, we must have n1 < n2, and n1 and n2 must be greater than 0. The size of spacings will be at least \(n+1\) after the call returns.

sortedData: array of sorted observations
spacings: pointer to an array object that will be filled with spacings
n1: starting index, in sortedData, of the processed observations
n2: ending index, in sortedData, of the processed observations
a: minimum value of the observations
b: maximum value of the observations
static void diff (DoubleArrayList sortedData, DoubleArrayList spacings, int n1, int n2, double a, double b)

Same as method diff(IntArrayList,IntArrayList,int,int,int,int), but for the continuous case.

sortedData: array of sorted observations
spacings: pointer to an array object that will be filled with spacings
n1: starting index, in sortedData, of the processed observations
n2: ending index, in sortedData, of the processed observations
a: minimum value of the observations
b: maximum value of the observations
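The spacings layout described above can be sketched on plain arrays. This is not the SSJ implementation (which fills a DoubleArrayList in place and enforces the index preconditions); it simply returns a new array following the same placement rule, and the class name is invented for the example:

```java
// Minimal sketch of the continuous diff(...) spacings computation; not SSJ code.
// Fills spacings[n1..n2+1]: spacings[n1] = U_{n1} - a, spacings[i] = U_i - U_{i-1}
// for n1 < i <= n2, and spacings[n2+1] = b - U_{n2}.
public class DiffSketch {
    public static double[] diff(double[] sortedData, int n1, int n2, double a, double b) {
        double[] spacings = new double[n2 + 2];
        spacings[n1] = sortedData[n1] - a;
        for (int i = n1 + 1; i <= n2; i++)
            spacings[i] = sortedData[i] - sortedData[i - 1];
        spacings[n2 + 1] = b - sortedData[n2];
        return spacings;
    }
}
```

For sorted data {0.2, 0.5, 0.9} over \([0,1]\) with n1 = 0 and n2 = 2, the spacings are {0.2, 0.3, 0.4, 0.1}.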
static void iterateSpacings (DoubleArrayList data, DoubleArrayList spacings)

Applies one iteration of the iterated spacings transformation [112], [226]. Let \(U\) be the \(n\) observations contained in data, and let \(S\) be the spacings contained in spacings. Assumes that \(S[0..n]\) contains the spacings between the \(n\) real numbers \(U_0,…,U_{n-1}\) in the interval \([0,1]\). These spacings are defined by
\[ S_i = U_{(i)} - U_{(i-1)}, \qquad 0\le i \le n, \]
where \(U_{(-1)}=0\), \(U_{(n)}=1\), and \(U_{(0)},…,U_{(n-1)}\) are the \(U_i\) sorted in increasing order. These spacings may have been obtained by calling diff(DoubleArrayList,DoubleArrayList,int,int,double,double). This method transforms the spacings into new spacings, by a variant of the method described in section 11 of [177] and also by Stephens [226] : it sorts \(S_0,…,S_n\) to obtain \(S_{(0)} \le S_{(1)} \le S_{(2)} \le\cdots\le S_{(n)}\), computes the weighted differences
\begin{align*} S_0 & = (n+1) S_{(0)}, \\ S_1 & = n (S_{(1)}-S_{(0)}), \\ S_2 & = (n-1) (S_{(2)}-S_{(1)}), \\ & \vdots \\ S_n & = S_{(n)}-S_{(n-1)}, \end{align*}
and computes \(V_i = S_0 + S_1 + \cdots+ S_i\) for \(0\le i < n\). It then returns \(S_0,…,S_n\) in \(S[0..n]\) and \(V_0,…,V_{n-1}\) in \(V[0..n-1]\).

Under the assumption that the \(U_i\) are i.i.d. \(U(0,1)\), the new \(S_i\) can be considered as a new set of spacings having the same distribution as the original spacings, and the \(V_i\) are a new sample of i.i.d. \(U(0,1)\) random variables, sorted in increasing order.

This transformation is useful to detect clustering in a data set: a pair of observations that are close to each other is transformed into an observation close to zero. A data set with unusually clustered observations is thus transformed into a data set with an accumulation of observations near zero, which is easily detected by the Anderson-Darling GOF test.

data: array of observations
spacings: spacings between the observations, will be filled with the new spacings
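One iteration of the transformation can be sketched as follows. This is not the SSJ implementation: it takes the \(n+1\) spacings as a plain array (assumed to sum to 1) and returns the new sorted uniforms \(V_0,…,V_{n-1}\) rather than writing into the argument lists; the class name is invented for the example.

```java
import java.util.Arrays;

// Minimal sketch of one iterated-spacings step; not SSJ code. Sorts the
// n+1 spacings S[0..n], forms the weighted differences, and returns the
// partial sums V_0 <= ... <= V_{n-1}, which are again distributed as
// sorted i.i.d. U(0,1) under the null hypothesis.
public class IterateSpacingsSketch {
    public static double[] iterate(double[] spacings) {
        int n = spacings.length - 1;          // n observations, n+1 spacings
        double[] s = spacings.clone();
        Arrays.sort(s);                       // S_(0) <= ... <= S_(n)
        double[] w = new double[n + 1];
        w[0] = (n + 1) * s[0];                // S_0 = (n+1) S_(0)
        for (int i = 1; i <= n; i++)
            w[i] = (n + 1 - i) * (s[i] - s[i - 1]);
        double[] v = new double[n];           // V_i = S_0 + ... + S_i
        double sum = 0.0;
        for (int i = 0; i < n; i++) { sum += w[i]; v[i] = sum; }
        return v;
    }
}
```

For example, spacings {0.2, 0.3, 0.5} (from two observations 0.2 and 0.5) give weighted differences {0.6, 0.2, 0.2} and new sorted uniforms {0.6, 0.8}.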
static double[] kolmogorovSmirnov (double[] sortedData)

Computes the Kolmogorov-Smirnov (KS) test statistics \(D_n^+\), \(D_n^-\), and \(D_n\) (see method kolmogorovSmirnov(DoubleArrayList)). Returns the array [\(D_n^+\), \(D_n^-\), \(D_n\)].

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
static double[] kolmogorovSmirnov (DoubleArrayList sortedData)

Computes the Kolmogorov-Smirnov (KS) test statistics \(D_n^+\), \(D_n^-\), and \(D_n\) defined by
\begin{align} D_n^+ & = \max_{0\le j\le n-1} \left((j+1)/n - U_{(j)}\right), \tag{DNp} \\ D_n^- & = \max_{0\le j\le n-1} \left(U_{(j)} - j/n\right), \tag{DNm} \\ D_n & = \max (D_n^+, D_n^-), \tag{DN} \end{align}
and returns an array of length 3 that contains [\(D_n^+\), \(D_n^-\), \(D_n\)]. These statistics compare the empirical distribution of \(U_{(0)},…,U_{(n-1)}\), which are assumed to be in sortedData, with the uniform distribution over \([0,1]\).

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
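Equations (DNp)-(DN) translate into a single pass over the sorted sample. The sketch below is a minimal standalone version, not the SSJ implementation (the class name is invented for the example):

```java
// Minimal sketch of the KS statistics (DNp), (DNm), (DN); not SSJ code.
// Assumes sortedData holds U_(0) <= ... <= U_(n-1) in [0,1]; returns
// the array [D_n^+, D_n^-, D_n].
public class KolmogorovSmirnovSketch {
    public static double[] kolmogorovSmirnov(double[] sortedData) {
        int n = sortedData.length;
        double dPlus = 0.0, dMinus = 0.0;
        for (int j = 0; j < n; j++) {
            dPlus = Math.max(dPlus, (j + 1.0) / n - sortedData[j]);
            dMinus = Math.max(dMinus, sortedData[j] - (double) j / n);
        }
        return new double[] { dPlus, dMinus, Math.max(dPlus, dMinus) };
    }
}
```

For the sample {0.25, 0.75}, all three statistics equal 0.25.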
static void kolmogorovSmirnov (double[] data, ContinuousDistribution dist, double[] sval, double[] pval)

Computes the Kolmogorov-Smirnov (KS) test statistics and their \(p\)-values. This is to compare the empirical distribution of the (unsorted) observations in data with the theoretical distribution dist. The KS statistics \(D_n^+\), \(D_n^-\), and \(D_n\) are returned in sval[0], sval[1], and sval[2] respectively, and their corresponding \(p\)-values are returned in pval[0], pval[1], and pval[2].

data: array of observations to be tested
dist: assumed distribution of the observations
sval: values of the 3 KS statistics
pval: \(p\)-values for the 3 KS statistics
static double[] kolmogorovSmirnovJumpOne (DoubleArrayList sortedData, double a)

Computes the KS statistics \(D_n^+(a)\) and \(D_n^-(a)\) defined in the description of the method FDist.kolmogorovSmirnovPlusJumpOne, assuming that \(F\) is the uniform distribution over \([0,1]\) and that \(U_{(1)},…,U_{(n)}\) are in sortedData. Returns the array [\(D_n^+\), \(D_n^-\)].

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
a: size of the jump
static double pDisc (double pL, double pR)

Computes a variant of the \(p\)-value \(p\) whenever a test statistic has a discrete probability distribution. This \(p\)-value is defined as follows:
\begin{align*} p_L & = P[Y \le y], \\ p_R & = P[Y \ge y], \\ p & = \begin{cases} p_R, & \text{if } p_R < p_L, \\ 1 - p_L, & \text{if } p_R \ge p_L \text{ and } p_L < 0.5, \\ 0.5, & \text{otherwise.} \end{cases} \end{align*}
The function takes \(p_L\) and \(p_R\) as input and returns \(p\).

pL: left \(p\)-value
pR: right \(p\)-value
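The three-case definition above maps directly to code. This is a minimal standalone sketch of the rule, not the SSJ implementation (the class name is invented for the example):

```java
// Minimal sketch of pDisc; not SSJ code. Combines the left and right
// p-values of a discrete test statistic into a single p-value.
public class PDiscSketch {
    public static double pDisc(double pL, double pR) {
        if (pR < pL) return pR;          // p = p_R        if p_R < p_L
        if (pL < 0.5) return 1.0 - pL;   // p = 1 - p_L    if p_R >= p_L and p_L < 0.5
        return 0.5;                      // p = 0.5        otherwise
    }
}
```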
static void powerRatios (DoubleArrayList sortedData)

Applies the power ratios transformation \(W\) described in section 8.4 of Stephens [226]. Let \(U\) be the \(n\) observations contained in sortedData. Assumes that \(U\) contains \(n\) real numbers \(U_{(0)},…,U_{(n-1)}\) from the interval \([0,1]\), already sorted in increasing order, and computes the transformations
\[ U’_i = (U_{(i)} / U_{(i+1)})^{i+1}, \qquad i=0,…,n-1, \]
with \(U_{(n)} = 1\). These \(U’_i\) are sorted in increasing order and put back in sortedData[0...n-1]. If the \(U_{(i)}\) are i.i.d. \(U(0,1)\) sorted in increasing order, then the \(U’_i\) are also i.i.d. \(U(0,1)\).

This transformation is useful to detect clustering, as explained in iterateSpacings(DoubleArrayList,DoubleArrayList), except that here a pair of observations close to each other is transformed into an observation close to 1. An accumulation of observations near 1 is also easily detected by the Anderson-Darling GOF test.

sortedData: sorted array of real-valued observations in the interval \([0,1]\) that will be overwritten with the transformed observations
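The transformation can be sketched on a plain array. This is not the SSJ implementation (which overwrites its DoubleArrayList argument); it returns the sorted transformed values instead, and the class name is invented for the example:

```java
import java.util.Arrays;

// Minimal sketch of the power ratios transformation; not SSJ code.
// Computes U'_i = (U_(i) / U_(i+1))^(i+1) with U_(n) = 1, then returns
// the U'_i sorted in increasing order.
public class PowerRatiosSketch {
    public static double[] powerRatios(double[] sortedData) {
        int n = sortedData.length;
        double[] u = new double[n];
        for (int i = 0; i < n; i++) {
            double next = (i + 1 < n) ? sortedData[i + 1] : 1.0;  // U_(n) = 1
            u[i] = Math.pow(sortedData[i] / next, i + 1);
        }
        Arrays.sort(u);
        return u;
    }
}
```

For the sample {0.25, 0.5}: \(U’_0 = (0.25/0.5)^1 = 0.5\) and \(U’_1 = (0.5/1)^2 = 0.25\), so the sorted result is {0.25, 0.5}.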
static int scan (DoubleArrayList sortedData, double d)

Computes and returns the scan statistic \(S_n(d)\), defined in ( scan ). Let \(U\) be the \(n\) observations contained in sortedData. The \(n\) observations in \(U[0..n-1]\) must be real numbers in the interval \([0,1]\), sorted in increasing order. (See FBar.scan for the distribution function of \(S_n(d)\).)

sortedData: sorted array of real-valued observations in the interval \([0,1]\)
d: length of the test interval (\(\in(0,1)\))
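Conceptually, \(S_n(d)\) is the largest number of observations falling in any window of length \(d\). A two-pointer sweep over the sorted sample sketches this; note this is an illustrative sketch using closed windows, not the SSJ implementation, and the class name is invented for the example:

```java
// Minimal sketch of the scan statistic S_n(d); not SSJ code. Slides a
// window of length d over the sorted observations and returns the largest
// number of observations contained in one window.
public class ScanSketch {
    public static int scan(double[] sortedData, double d) {
        int n = sortedData.length, best = 0, left = 0;
        for (int right = 0; right < n; right++) {
            // Shrink the window from the left until it spans at most d.
            while (sortedData[right] - sortedData[left] > d) left++;
            best = Math.max(best, right - left + 1);
        }
        return best;
    }
}
```

For example, with sorted data {0.1, 0.15, 0.5} and d = 0.1, the densest window contains the two clustered points, so the statistic is 2.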
static DoubleArrayList unifTransform (DoubleArrayList data, ContinuousDistribution dist)

Applies the probability integral transformation \(U_i = F (V_i)\) for \(i = 0, 1, …, n-1\), where \(F\) is a continuous distribution function, and returns the result as an array of length \(n\). Here \(V\) represents the \(n\) observations contained in data, and \(U\) the returned transformed observations. If data contains random variables from the distribution function dist, then the result will contain uniform random variables over \([0,1]\).

data: array of observations to be transformed
dist: assumed distribution of the observations
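The probability integral transformation itself is one CDF evaluation per observation. The sketch below is not the SSJ implementation: instead of SSJ's ContinuousDistribution interface it takes the CDF as a function, illustrated here with the exponential(1) CDF \(F(x) = 1 - e^{-x}\) as an assumed example distribution; the class name is invented for the example.

```java
import java.util.function.DoubleUnaryOperator;

// Minimal sketch of the probability integral transformation; not SSJ code.
public class UnifTransformSketch {
    public static double[] unifTransform(double[] data, DoubleUnaryOperator cdf) {
        double[] u = new double[data.length];
        for (int i = 0; i < data.length; i++)
            u[i] = cdf.applyAsDouble(data[i]);   // U_i = F(V_i)
        return u;
    }

    // Example: exponential(1) observations become U(0,1) observations.
    public static double[] exponentialExample(double[] data) {
        return unifTransform(data, x -> 1.0 - Math.exp(-x));
    }
}
```

For instance, an exponential observation \(V = \ln 2\) maps to \(U = 1 - e^{-\ln 2} = 0.5\).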
static DoubleArrayList unifTransform (DoubleArrayList data, DiscreteDistribution dist)

Applies the transformation \(U_i = F (V_i)\) for \(i = 0, 1, …, n-1\), where \(F\) is a discrete distribution function, and returns the result as an array of length \(n\). Here \(V\) represents the \(n\) observations contained in data, and \(U\) the returned transformed observations.

Note: If the \(V_i\) are the values of random variables with distribution function dist, then the result will contain the values of discrete random variables distributed over the set of values taken by dist, not uniform random variables over \([0,1]\).

data: array of observations to be transformed
dist: assumed distribution of the observations
static double watsonG (DoubleArrayList sortedData)

Computes and returns the Watson statistic \(G_n\) (see [238], [41] ), defined by
\begin{align} G_n & = \sqrt{n} \max_{0\le j \le n-1} \left\{ (j+1)/n - U_{(j)} + \overline{U}_n - 1/2 \right\} \tag{WatsonG} \\ & = \sqrt{n}\left(D_n^+ + \overline{U}_n - 1/2\right), \nonumber \end{align}
where \(\overline{U}_n\) is the average of the observations \(U_{(j)}\), assuming that sortedData contains the sorted \(U_{(0)},…,U_{(n-1)}\).

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
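Using the second form of (WatsonG), the statistic is \(D_n^+\) plus a mean correction. The sketch below is a minimal standalone version, not the SSJ implementation (the class name is invented for the example):

```java
// Minimal sketch of the Watson G_n statistic (WatsonG); not SSJ code.
// Uses G_n = sqrt(n) * (D_n^+ + mean(U) - 1/2), with D_n^+ as in (DNp).
public class WatsonGSketch {
    public static double watsonG(double[] sortedData) {
        int n = sortedData.length;
        double dPlus = 0.0, mean = 0.0;
        for (int j = 0; j < n; j++) {
            dPlus = Math.max(dPlus, (j + 1.0) / n - sortedData[j]);
            mean += sortedData[j];
        }
        mean /= n;
        return Math.sqrt(n) * (dPlus + mean - 0.5);
    }
}
```

For the sample {0.25, 0.75}, \(\overline{U}_n = 1/2\), so \(G_2 = \sqrt{2}\, D_2^+ = \sqrt{2}/4\).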
static double watsonU (DoubleArrayList sortedData)

Computes and returns the Watson statistic \(U_n^2\) (see [55], [224], [225] ), defined by
\begin{align} W_n^2 & = \frac{1}{12n} + \sum_{j=0}^{n-1} \left\{U_{(j)} - \frac{(j + 0.5)}{n} \right\}^2, \nonumber \\ U_n^2 & = W_n^2 - n\left(\overline{U}_n - 1/2\right)^2, \tag{WatsonU} \end{align}
where \(\overline{U}_n\) is the average of the observations \(U_{(j)}\), assuming that sortedData contains the sorted \(U_{(0)},…,U_{(n-1)}\).

sortedData: array of sorted real-valued observations in the interval \([0,1]\)
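Equation (WatsonU) is the Cramér-von Mises sum (CraMis) minus a mean correction. The sketch below is a minimal standalone version, not the SSJ implementation (the class name is invented for the example):

```java
// Minimal sketch of the Watson U_n^2 statistic (WatsonU); not SSJ code.
// Computes W_n^2 as in (CraMis) and subtracts n * (mean(U) - 1/2)^2.
public class WatsonUSketch {
    public static double watsonU(double[] sortedData) {
        int n = sortedData.length;
        double w2 = 1.0 / (12.0 * n), mean = 0.0;
        for (int j = 0; j < n; j++) {
            double d = sortedData[j] - (j + 0.5) / n;
            w2 += d * d;
            mean += sortedData[j];
        }
        mean /= n;
        return w2 - n * (mean - 0.5) * (mean - 0.5);
    }
}
```

For the sample {0.25, 0.75}, \(\overline{U}_n = 1/2\) and the sum vanishes, so \(U_2^2 = W_2^2 = 1/24\).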
static double EPSILONAD = Num.DBL_EPSILON/2

Used by andersonDarling(DoubleArrayList). Num.DBL_EPSILON is usually \(2^{-52}\).