clQMC
An OpenCL library for quasi-Monte Carlo methods
We introduce clQMC, an OpenCL library for quasi-Monte Carlo methods. It provides facilities for using quasi-Monte Carlo methods to integrate functions over the unit hypercube in arbitrary dimension. Quasi-Monte Carlo methods can radically improve the accuracy of the estimator as compared to simple Monte Carlo. The main idea is to replace the independent uniform random numbers used in a simulation by quasi-Monte Carlo points.
With clQMC, quasi-Monte Carlo point sets are created on the host, and streams are attached to them to enumerate the points on the host or on the device. These streams act as sources of numbers in \((0,1)\) and can replace the streams from the clRNG library in code that uses them (we show how to do this in Usage Examples: From Monte Carlo to Randomized Quasi-Monte Carlo). The clProbDist library can also be used on top of clQMC for generating nonuniform variates. The design is closely inspired by that of the SSJ library [5].
Monte Carlo integration estimates the expectation \(\mu = \mathbb E[X]\) of a random variate \(X\) by the average
\[ \hat\mu_{\mathrm{mc}} = \frac1n \sum_{i=0}^{n-1} X_i \]
of \(n\) independent realizations \(X_0,\dots,X_{n-1}\) of \(X\). Each \(X_i\) is usually generated by simulation, using say \(s\) independent (pseudo)random numbers \(U_{i,1},\dots,U_{i,s}\) uniformly distributed over \((0,1)\) for its source of randomness, so one can write \(X_i = f(\boldsymbol U_i)\), where \(\boldsymbol U_i = (U_{i,1},\dots,U_{i,s})\) for \(i=0,\dots,n-1\), for some function \(f : [0,1)^s \to \mathbb R\). For standard Monte Carlo, the points \(\boldsymbol U_i\) are independent and uniformly distributed in the \(s\)-dimensional hypercube \((0,1)^s\).
The variance of the Monte Carlo estimator \(\hat\mu_{\mathrm{mc}}\) decreases as \(n^{-1}\). This slow convergence rate is due to the uneven coverage of the integration domain \((0,1)^s\) by the independent random points \(\boldsymbol U_0, \dots, \boldsymbol U_{n-1}\).
Quasi-Monte Carlo methods (see [8], [10], [2], [6], [7], [1], [9]) can provide faster convergence. They replace the independent random points \(\boldsymbol U_0, \dots, \boldsymbol U_{n-1}\) with deterministic, structured points \(\boldsymbol u_0, \dots, \boldsymbol u_{n-1}\) that cover the integration domain more evenly. Popular quasi-Monte Carlo methods include lattice rules and digital nets (which include Sobol sequences and polynomial lattice rules, for example). The quasi-Monte Carlo estimator is deterministic:
\[ \hat\mu_{\mathrm{qmc}} = \frac1n \sum_{i=0}^{n-1} f(\boldsymbol u_i). \]
For rank-1 lattice rules, the points are defined as
\[ \boldsymbol u_i = \frac{i \boldsymbol a \bmod n}{n} \]
for \(i=0,\dots,n-1\), where the quality of the point set depends on the choice of the generating vector \(\boldsymbol a \in \{1,\dots,n-1\}^s\). The modulo operation in the above formula is applied coordinate by coordinate. External software may be needed to find a good generating vector \(\boldsymbol a\) for a given integration problem (for a given dimension, number of points, etc.). We recommend Lattice Builder [4]. An important advantage of lattice rules is that enumerating their points requires only trivial computational effort, which makes them well suited to GPUs.
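For illustration, a coordinate of a rank-1 lattice point can be computed directly from this formula. The standalone helper below is a sketch in C, not part of the library; it uses 64-bit arithmetic to avoid overflow in the product \(i \cdot a_j\):

    #include <stdint.h>

    /* Coordinate j (0-based) of lattice point i for an n-point rank-1
       lattice rule with generating vector a[]:
       u_{i,j} = (i * a[j] mod n) / n. */
    double latticeCoordinate(uint32_t i, uint32_t j,
                             uint32_t n, const uint32_t* a)
    {
        return (double)(((uint64_t)i * a[j]) % n) / n;
    }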
Randomized Quasi-Monte Carlo produces an unbiased stochastic estimator
\[ \hat\mu_{\mathrm{rqmc}} = \frac1n \sum_{i=0}^{n-1} f(\boldsymbol U_i) \]
by randomizing the quasi-Monte Carlo points \(\boldsymbol u_i \mapsto \boldsymbol U_i\) in a way that preserves their mutual structure, and such that each randomized point \(\boldsymbol U_i\) is uniformly distributed in \((0,1)^s\), when taken separately.
Usually, one generates several replications of \(\hat\mu_{\mathrm{rqmc}}\) to estimate its mean and variance, thus obtaining both a better estimate of \(\mu\) and an indication of the accuracy of that estimate. To improve the estimate, it is generally more productive to increase the number of points rather than the number of randomizations.
For rank-1 lattice rules, the randomization is generally a periodic random shift, i.e., a random vector \(\boldsymbol U\) uniformly distributed in \((0,1)^s\) is added (modulo 1) to each point:
\[ \boldsymbol U_i = (\boldsymbol u_i + \boldsymbol U) \bmod 1, \]
for \(i=0,\dots,n-1\). The result is called a randomly shifted rank-1 lattice rule. See [2] and [3].
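For illustration, applying the shift to a single coordinate can be written as follows (a standalone sketch, not part of the library):

    /* Apply a periodic random shift to one coordinate: u and shift both
       lie in [0,1), so (u + shift) mod 1 needs at most one subtraction. */
    double shiftCoordinate(double u, double shift)
    {
        double v = u + shift;
        return (v >= 1.0) ? v - 1.0 : v;
    }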
We show here how existing Monte Carlo code that uses clRNG streams can be adapted to quasi-Monte Carlo methods by using clQMC streams instead. We further show how to combine clQMC and clRNG streams to apply randomized quasi-Monte Carlo.
Suppose we want to integrate
\[ f(u_1,\dots,u_s) = \prod_{j=1}^s \frac{u_j^2}{3}, \]
in dimension \(s=30\).
We want our implementation of \(f(u_1,\dots,u_s)\) to take as input a stream from either clRNG (for Monte Carlo) or clQMC (for quasi-Monte Carlo), and to obtain the successive coordinates \(u_1,\dots,u_s\) as successive outputs from the stream. However, the type of the stream and the function that generates outputs from the stream have different names in clRNG and clQMC, so we refer to them through the preprocessor symbols StreamType and nextCoordinate(), respectively. This allows us to define them later, in accordance with the simulation method we want to use.
The following function evaluates \(f(u_1,\dots,u_s)\) by generating the coordinates of the point \((u_1,\dots,u_s)\) through successive calls to nextCoordinate() on the input stream stream, as specified above:
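    /* Sketch only: the exact implementation is given in
       DocsTutorial/common.clh.  StreamType and nextCoordinate() are the
       preprocessor symbols introduced above. */
    double simulateOneRun(StreamType* stream)
    {
        double prod = 1.0;
        for (uint j = 0; j < DIMENSION; j++) {
            double u = nextCoordinate(stream);  /* coordinate u_j in (0,1) */
            prod *= u * u / 3.0;                /* factor u_j^2 / 3 of the product */
        }
        return prod;
    }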
The complete implementation is given in DocsTutorial/common.clh. Henceforth, we refer to the dimension \(s=30\) through the constant DIMENSION.
To perform Monte Carlo integration, we use the MRG31k3p generator from clRNG. Every work item evaluates the integrand \(f\) at nw distinct points, each provided by a distinct substream (see the clRNG documentation).
We define StreamType and nextCoordinate, which appear in the definition of simulateOneRun() given in Example Model, to use MRG31k3p streams:
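    /* Monte Carlo: map the generic symbols to clRNG's MRG31k3p API.
       CLRNG_ENABLE_SUBSTREAMS exposes the substream functions in device
       code. */
    #define CLRNG_ENABLE_SUBSTREAMS
    #include <clRNG/mrg31k3p.clh>

    #define StreamType     clrngMrg31k3pStream
    #define nextCoordinate clrngMrg31k3pRandomU01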
The OpenCL kernel stores at a unique location in the output array out the average of nw values of \(f\), obtained by invoking simulateOneRun() with nw different substreams:
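    /* Hedged sketch (the actual code is in DocsTutorial/example1_kernel.cl);
       the kernel name and argument list are illustrative, and nw is assumed
       to be defined at build time, e.g., through a -D compiler option. */
    __kernel void simulateMC(__global clrngMrg31k3pHostStream* hostStreams,
                             __global double* out)
    {
        size_t gid = get_global_id(0);

        /* Copy this work item's stream object into private memory. */
        clrngMrg31k3pStream stream;
        clrngMrg31k3pCopyOverStreamsFromGlobal(1, &stream, &hostStreams[gid]);

        /* Average nw evaluations of f, one per substream. */
        double sum = 0.0;
        for (uint i = 0; i < nw; i++) {
            sum += simulateOneRun(&stream);
            clrngMrg31k3pForwardToNextSubstreams(1, &stream);
        }
        out[gid] = sum / nw;
    }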
The above code assumes that the variable stream contains the stream object assigned to the current work item (in the sketch, it is copied from global memory with the usual clRNG device calls). The host is then responsible for averaging the values stored in out by the work items to obtain \(\hat\mu_{\mathrm{mc}}\).
Details about the clRNG device API (used in the above kernel) and about writing host code to send stream objects to the device are explained in the clRNG documentation.
The complete code for this example is given in DocsTutorial/example1.c and DocsTutorial/example1_kernel.cl.
Here, we replace the Monte Carlo points from Example 1: Monte Carlo with quasi-Monte Carlo points, by using rank-1 lattice rule streams from clQMC instead of MRG31k3p streams from clRNG. Every work item evaluates the integrand \(f\) at only nw of the lattice points.
We redefine StreamType and nextCoordinate to use lattice rule streams:
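    /* Quasi-Monte Carlo: map the generic symbols to clQMC's lattice rule
       API (type and function names as documented in the clQMC API
       reference). */
    #define StreamType     clqmcLatticeRuleStream
    #define nextCoordinate clqmcLatticeRuleNextCoordinate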
In the device code, we include the clQMC header for lattice rules:
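    /* Device-side API for lattice rules; the header path mirrors the
       layout of the clQMC distribution. */
    #include <clQMC/latticerule.clh>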
The kernel receives the lattice point set object as one of its arguments, declared as:
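    /* Illustrative kernel signature; the kernel name and the other
       arguments follow DocsTutorial/example2_kernel.cl only in spirit. */
    __kernel void simulateQMC(__global const clqmcLatticeRule* pointset,
                              __global double* out)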
This lattice point set object is stored in global memory and is shared across all work items. Each work item, however, attaches its own stream object to the point set by invoking clqmcLatticeRuleCreateOverStream():
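    /* Sketch of the stream setup; the exact signature (and possible extra
       arguments, such as the optional random shift used in Example 3) is
       given in the clQMC API reference. */
    clqmcLatticeRuleStream stream;
    clqmcLatticeRuleCreateOverStream(&stream, pointset,
            get_global_size(0),   /* partition into this many subsets */
            get_global_id(0));    /* subset assigned to this work item */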
The third argument to clqmcLatticeRuleCreateOverStream() defines a partition of the lattice point set into get_global_size(0) subsets of equal size, and the fourth argument selects the subset of index get_global_id(0) as the source of this work item's stream. The internal structure of the partitioning may depend on the type of point set used. We slightly modify the main loop to forward the stream to the next point instead of to the next substream:
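    /* Main loop sketch: average f over this work item's nw lattice points. */
    double sum = 0.0;
    for (uint i = 0; i < nw; i++) {
        sum += simulateOneRun(&stream);
        clqmcLatticeRuleForwardToNextPoint(&stream);  /* next point, not next substream */
    }
    out[get_global_id(0)] = sum / nw;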
On the host, we include the clQMC header file for lattice rules:
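    /* Host-side API; header paths as laid out in the clQMC package. */
    #include <clQMC/clQMC.h>
    #include <clQMC/latticerule.h>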
A lattice point set that contains n points is created by invoking clqmcLatticeRuleCreate() with a proper generating vector gen_vec (selected beforehand):
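    /* Sketch; the argument order is inferred from the surrounding text --
       see the clQMC API reference for the authoritative signature of
       clqmcLatticeRuleCreate(). */
    clqmcStatus err;
    size_t pointset_size;
    clqmcLatticeRule* pointset = clqmcLatticeRuleCreate(
            n,               /* number of points */
            DIMENSION,       /* dimension s */
            gen_vec,         /* generating vector, selected beforehand
                                (e.g., with Lattice Builder) */
            &pointset_size,  /* receives the object size in bytes */
            &err);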
The resulting object pointset, of size pointset_size in bytes, can be copied to the device using standard OpenCL techniques (not shown here).
The complete code for this example is given in DocsTutorial/example2.c and DocsTutorial/example2_kernel.cl.
To use randomized quasi-Monte Carlo, we can add a periodic random shift \(\boldsymbol U\), as explained in Randomized Quasi-Monte Carlo, to a lattice rule. As in Example 2: Quasi-Monte Carlo, each work item is assigned a specific subset of the whole point set. But here, a work item is also responsible for generating all replications realizations of \(f(\boldsymbol U_i)\) for each point \(\boldsymbol U_i\) it is assigned. Each replication requires a distinct random shift \(\boldsymbol U\), but within a replication the same shift applies to all points, so it is shared by all work items. We thus store the random shifts in advance on the host and copy them into the device's global memory in an array named shifts, composed of replications tuples of DIMENSION values (one value for each replication and each coordinate).
In the kernel code, for each replication of index k, we create a new lattice rule stream using the k-th random shift vector:
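    /* Sketch of the replications loop (the actual code is in
       DocsTutorial/example3_kernel.cl).  ASSUMPTION: the random shift is
       passed to clqmcLatticeRuleCreateOverStream() as an extra argument;
       check the clQMC API reference for the exact signature. */
    for (uint k = 0; k < replications; k++) {
        clqmcLatticeRuleStream stream;
        clqmcLatticeRuleCreateOverStream(&stream, pointset,
                get_global_size(0), get_global_id(0),
                &shifts[k * DIMENSION]);   /* k-th random shift vector */

        double sum = 0.0;
        for (uint i = 0; i < nw; i++) {
            sum += simulateOneRun(&stream);
            clqmcLatticeRuleForwardToNextPoint(&stream);
        }

        /* one output block per replication */
        out[k * get_global_size(0) + get_global_id(0)] = sum / nw;
    }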
The average associated with each random shift is stored at a distinct location in the output array out, which is divided into replications blocks of size get_global_size(0) (one block per replication).
In very high dimension, it might be preferable not to store all the random shifts in advance, but to pass random streams to the kernel and let each work item generate the shifts it needs at the moment it needs them.
The complete code for this example is given in DocsTutorial/example3.c and DocsTutorial/example3_kernel.cl.
We refine Example 3: Randomized Quasi-Monte Carlo to allow the work on different replications for the same points to be shared across multiple work items. We partition the replications into r / rw blocks of size rw, just as the whole point set is partitioned into N = n / nw subsets of equal cardinality nw. Each work item processes a distinct combination of replication block and point subset. More precisely, we assign to the gid-th work item the (gid / N)-th replication block for the (gid % N)-th point subset.
The kernel code from Example 3: Randomized Quasi-Monte Carlo is changed to compute the index (stored in the variable k) of each replication that a work item is responsible for, and to partition the point set into N (instead of get_global_size(0)) subsets by invoking clqmcLatticeRuleCreateOverStream():
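    /* Sketch (the actual code is in DocsTutorial/example4_kernel.cl); same
       assumption on the shift argument as in Example 3, and the output
       layout shown here is hypothetical. */
    size_t gid = get_global_id(0);

    for (uint j = 0; j < rw; j++) {
        uint k = (gid / N) * rw + j;   /* index of the current replication */

        clqmcLatticeRuleStream stream;
        clqmcLatticeRuleCreateOverStream(&stream, pointset,
                N,          /* N subsets instead of get_global_size(0) */
                gid % N,    /* this work item's point subset */
                &shifts[k * DIMENSION]);

        double sum = 0.0;
        for (uint i = 0; i < nw; i++) {
            sum += simulateOneRun(&stream);
            clqmcLatticeRuleForwardToNextPoint(&stream);
        }
        out[k * N + gid % N] = sum / nw;
    }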
The complete code for this example is given in DocsTutorial/example4.c and DocsTutorial/example4_kernel.cl.
This example is very general: it covers the extreme case where each work item takes care of all randomizations at a single point, the extreme case where each work item takes care of a single randomization at all points, and all intermediate cases. We recall that it is usually preferable to use as many points as possible to obtain an accurate estimator \(\hat\mu_{\mathrm{rqmc}}\), with a small number of randomizations (maybe 5 to 30), just enough to be able to estimate the variance of \(\hat\mu_{\mathrm{rqmc}}\).
For all features of the library to work properly, the CLQMC_ROOT environment variable must be set to point to the installation path of the clQMC package, that is, the directory under which lies the include/clQMC subdirectory. Means of setting an environment variable depend on the operating system used.
The API of each point set assumes that the point set objects are stored in a specific type of memory (the stream objects are always stored in private memory). It defaults to global memory, but can be customized by the user by setting, before including the device header file of the point set, the preprocessor symbol CLQMC_<POINTSET>_OBJ_MEM (where <POINTSET> is the uppercase name of the point set) to one of the supported memory-type values.
For example, to store lattice rule objects in constant memory, the device code should simply begin with:
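    /* The memory-type constant below is an ASSUMPTION modeled on clRNG's
       conventions; check clQMC/latticerule.clh for the exact supported
       values. */
    #define CLQMC_LATTICERULE_OBJ_MEM CLQMC_MEM_TYPE_CONSTANT
    #include <clQMC/latticerule.clh>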