clQMC
An OpenCL library for quasi-Monte Carlo methods
Introduction

We introduce clQMC, an OpenCL library for quasi-Monte Carlo methods. It provides facilities for using quasi-Monte Carlo methods to integrate functions over the unit hypercube in arbitrary dimension. Quasi-Monte Carlo methods can radically improve the accuracy of the estimator as compared to simple Monte Carlo. The main idea is to replace the independent uniform random numbers used in a simulation by quasi-Monte Carlo points.

With clQMC, quasi-Monte Carlo point sets are created on the host, and streams are attached to them to enumerate the points on the host or on the device. These streams act as sources of numbers in \((0,1)\) and can replace the streams from the clRNG library in code that uses them (we show how to do this in Usage Examples: From Monte Carlo to Randomized Quasi-Monte Carlo). The clProbDist library can also be used on top of clQMC for generating nonuniform variates. The design is closely inspired by that of the SSJ library [5].

Warning
This is only an API proposal. The actual library is not fully implemented. Only rank-1 lattice rules are implemented for illustration of the ideas.

About Quasi-Monte Carlo Methods

Monte Carlo

Monte Carlo integration estimates the expectation \(\mu = \mathbb E[X]\) of a random variate \(X\) by the average

\[ \hat\mu_{\mathrm{mc}} = \frac1n \sum_{i=0}^{n-1} X_i \]

of \(n\) independent realizations \(X_0,\dots,X_{n-1}\) of \(X\). Each \(X_i\) is usually generated by simulation, using say \(s\) independent (pseudo)random numbers \(U_{i,1},\dots,U_{i,s}\) uniformly distributed over \((0,1)\) for its source of randomness, so one can write \(X_i = f(\boldsymbol U_i)\), where \(\boldsymbol U_i = (U_{i,1},\dots,U_{i,s})\) for \(i=0,\dots,n-1\), for some function \(f : [0,1)^s \to \mathbb R\). For standard Monte Carlo, the points \(\boldsymbol U_i\) are independent and uniformly distributed in the \(s\)-dimensional hypercube \((0,1)^s\).

The variance of the Monte Carlo estimator \(\hat\mu_{\mathrm{mc}}\) decreases as \(n^{-1}\), so its standard deviation decreases as \(n^{-1/2}\). This slow convergence rate is due to the uneven coverage of the integration domain \((0,1)^s\) by the independent random points \(\boldsymbol U_0, \dots, \boldsymbol U_{n-1}\).

Quasi-Monte Carlo

Quasi-Monte Carlo methods (see [8], [10], [2], [6], [7], [1], [9]) can provide faster convergence. They replace the independent random points \(\boldsymbol U_0, \dots, \boldsymbol U_{n-1}\) with deterministic, structured points \(\boldsymbol u_0, \dots, \boldsymbol u_{n-1}\) that cover the integration domain more evenly. Popular quasi-Monte Carlo methods include lattice rules and digital nets (which include Sobol sequences and polynomial lattice rules, for example). The quasi-Monte Carlo estimator is deterministic:

\[ \hat\mu_{\mathrm{qmc}} = \frac1n \sum_{i=0}^{n-1} f(\boldsymbol u_i). \]

Rank-1 Lattice Rules

For rank-1 lattice rules, the points are defined as

\[ \boldsymbol u_i = \frac{i \boldsymbol a \bmod n}{n} \]

for \(i=0,\dots,n-1\), where the quality of the point set depends on the choice of the generating vector \(\boldsymbol a \in \{1,\dots,n-1\}^s\). The modulo operation in the above formula is applied coordinate by coordinate. External software may be needed to find a good generating vector \(\boldsymbol a\) for a given integration problem (dimension, number of points, etc.); we recommend Lattice Builder [4]. An important advantage of lattice rules is that enumerating their points requires only trivial computation, which is very helpful on a GPU.

Randomized Quasi-Monte Carlo

Randomized Quasi-Monte Carlo produces an unbiased stochastic estimator

\[ \hat\mu_{\mathrm{rqmc}} = \frac1n \sum_{i=0}^{n-1} f(\boldsymbol U_i) \]

by randomizing the quasi-Monte Carlo points \(\boldsymbol u_i \mapsto \boldsymbol U_i\) in a way that preserves their mutual structure, and such that each randomized point \(\boldsymbol U_i\) is uniformly distributed in \((0,1)^s\), when taken separately.

Usually, one generates several replications of \(\hat\mu_{\mathrm{rqmc}}\) to estimate its mean and variance, thus obtaining both a better estimate of \(\mu\) and an indication of the accuracy of that estimate. To improve the estimate, it is generally more productive to increase the number of points than the number of randomizations.

Randomly-Shifted Rank-1 Lattice Rules

For rank-1 lattice rules, the randomization is generally a periodic random shift, i.e., a random vector \(\boldsymbol U\) uniformly distributed in \((0,1)^s\) is added (modulo 1) to each point:

\[ \boldsymbol U_i = (\boldsymbol u_i + \boldsymbol U) \bmod 1, \]

for \(i=0,\dots,n-1\). The result is called a randomly-shifted rank-1 lattice rule. See [2] and [3].

Usage Examples: From Monte Carlo to Randomized Quasi-Monte Carlo

We show here how existing Monte Carlo code that uses clRNG streams can be adapted to quasi-Monte Carlo methods by using clQMC streams instead. We further show how to combine clQMC and clRNG streams to apply randomized quasi-Monte Carlo.

Example Model

Suppose we want to integrate

\[ f(u_1,\dots,u_s) = \prod_{j=1}^s 3 u_j^2, \]

in dimension \(s=30\).

We want our implementation of \(f(u_1,\dots,u_s)\) to take as input a stream from either clRNG (for Monte Carlo) or clQMC (for quasi-Monte Carlo), and to obtain the successive coordinates \(u_1,\dots,u_s\) as successive outputs from the stream. However, the type of the stream and the function that generates outputs from the stream have different names in clRNG and clQMC, so we refer to them through the preprocessor symbols StreamType and nextCoordinate(), respectively. This allows us to define them later, in accordance with the simulation method we want to use.

The following function evaluates \(f(u_1,\dots,u_s)\) by generating the coordinates of the point \((u_1,\dots,u_s)\) through successive calls to nextCoordinate() on the input stream stream, as specified above:

#define DIMENSION 30

double simulateOneRun(StreamType* stream)
{
    double ret = 1.0;
    for (uint j = 0; j < DIMENSION; j++) {
        double uj = nextCoordinate(stream);
        ret *= 3 * uj * uj;
    }
    return ret;
}

The complete implementation is given in DocsTutorial/common.clh. Henceforth, we refer to the dimension \(s=30\) through the constant DIMENSION.

Example 1: Monte Carlo

To perform Monte Carlo integration, we use the MRG31k3p generator from clRNG. Every work item evaluates the integrand \(f\) at nw distinct points, each provided by a distinct substream (see the documentation of clRNG).

We define StreamType and nextCoordinate, from the definition of simulateOneRun() given in Example Model, to use MRG31k3p streams:

#define StreamType clrngMrg31k3pStream
#define nextCoordinate clrngMrg31k3pRandomU01

The OpenCL kernel stores at a unique location in the output array out the average of nw values of \(f\), obtained by invoking simulateOneRun() with nw different substreams:

double sum = 0.0;
for (uint i = 0; i < nw; i++) {
    sum += simulateOneRun(&stream);
    clrngMrg31k3pForwardToNextSubstreams(1, &stream);
}
out[get_global_id(0)] = sum / nw;

The above code assumes that the variable stream already contains the stream object assigned to the current work item. The host is responsible for averaging the values stored in out by the work items to obtain \(\hat\mu_{\mathrm{mc}}\).

Details about the clRNG device API (used in the above kernel) and about writing host code to send stream objects to the device are explained in the clRNG documentation.

The complete code for this example is given in DocsTutorial/example1.c and DocsTutorial/example1_kernel.cl.

Example 2: Quasi-Monte Carlo

Here, we replace the Monte Carlo points from Example 1: Monte Carlo with quasi-Monte Carlo points by using rank-1 lattice rule streams from clQMC instead of MRG31k3p streams from clRNG. Every work item evaluates the integrand \(f\) at only nw of the lattice points.

We redefine StreamType and nextCoordinate to use lattice rule streams:

#define StreamType clqmcLatticeRuleStream
#define nextCoordinate clqmcLatticeRuleNextCoordinate

In the device code, we include the clQMC device header for lattice rules:

#include <clQMC/latticerule.clh>

The kernel receives the lattice point set object as one of its arguments, declared as:

__global const clqmcLatticeRule* pointset

This lattice point set object is stored in global memory and is shared across all work items. Each work item, however, attaches its own stream object to the point set by invoking clqmcLatticeRuleCreateOverStream():

clqmcLatticeRuleStream stream; // in private memory
clqmcLatticeRuleCreateOverStream(&stream, pointset,
        get_global_size(0), get_global_id(0),
        (void*)0);

The third argument to clqmcLatticeRuleCreateOverStream() defines a partition of the lattice point set into get_global_size(0) even subsets, and the fourth argument selects the subset of index get_global_id(0) as the source of this work item's stream. The internal structure of the partitioning may depend on the type of point set used. We slightly modify the main loop to forward the stream to the next point instead of to the next substream:

for (uint i = 0; i < nw; i++) {
    sum += simulateOneRun(&stream);
    clqmcLatticeRuleForwardToNextPoint(&stream);
}

On the host, we include the clQMC host header file for lattice rules:

#include <clQMC/latticerule.h>

A lattice point set that contains n points is created by invoking clqmcLatticeRuleCreate() with a proper generating vector gen_vec (selected beforehand):

size_t pointset_size;
cl_int gen_vec[] = { ... };
clqmcLatticeRule* pointset = clqmcLatticeRuleCreate(n, DIMENSION, gen_vec, &pointset_size, &err);

The resulting object pointset, of size pointset_size in bytes, can be copied to the device using standard OpenCL techniques (not shown here).

The complete code for this example is given in DocsTutorial/example2.c and DocsTutorial/example2_kernel.cl.

Example 3: Randomized Quasi-Monte Carlo

To use randomized quasi-Monte Carlo, we can add a periodic random shift \(\boldsymbol U\), as explained in Randomly-Shifted Rank-1 Lattice Rules, to a lattice rule. As in Example 2: Quasi-Monte Carlo, each work item is assigned a specific subset of the whole point set. Here, however, each work item is also responsible for generating all of the replications realizations of \(f(\boldsymbol U_i)\) for each point \(\boldsymbol U_i\) it is assigned. Each replication requires a distinct random shift \(\boldsymbol U\), but the same shift applies to all points, so it is shared by all work items. We thus generate the random shifts in advance on the host and copy them into the device's global memory as an array named shifts, composed of replications tuples of DIMENSION values (one tuple per replication, one value per coordinate).

In the kernel code, for each replication of index k, we create a new lattice rule stream using the k-th random shift vector:

clqmcLatticeRuleStream stream;
for (uint k = 0; k < replications; k++) {
    clqmcLatticeRuleCreateOverStream(&stream, pointset,
            get_global_size(0), get_global_id(0),
            &shifts[k * DIMENSION]);
    // ... (compute the value of `sum' here; this is unchanged)
    out[k * get_global_size(0) + get_global_id(0)] = sum / nw;
}

The average associated with each random shift is stored at a distinct location in the output array out, divided into replications blocks of size get_global_size(0) (one for each replication).

In very high dimension, it might be preferable not to store all the random shifts in advance but to pass random streams to the kernel and let the work items generate the single shift they need at the moment they need it.

The complete code for this example is given in DocsTutorial/example3.c and DocsTutorial/example3_kernel.cl.

Example 4: Advanced Randomized Quasi-Monte Carlo

We refine Example 3: Randomized Quasi-Monte Carlo to allow the work on different replications for the same points to be shared across multiple work items. We partition the replications into r / rw blocks of size rw, just as the whole point set is partitioned into N = n / nw subsets of equal cardinality nw. Each work item processes a distinct combination of replication block and point subset. More precisely, the gid-th work item is assigned the (gid / N)-th replication block for the (gid % N)-th point subset.

The kernel code from Example 3: Randomized Quasi-Monte Carlo is changed to compute the index (stored in the variable k) of each replication that a work item is responsible for, and to partition the point set into N (instead of get_global_size(0)) subsets by invoking clqmcLatticeRuleCreateOverStream():

for (uint kw = 0; kw < rw; kw++) {
    uint k = (gid / N) * rw + kw;
    uint j = gid % N;
    clqmcLatticeRuleCreateOverStream(&stream, pointset,
            N, j,
            &shifts[k * DIMENSION]);
    // ... (compute the value of `sum' here; this is unchanged)
    out[k * N + j] = sum / nw;
}

The complete code for this example is given in DocsTutorial/example4.c and DocsTutorial/example4_kernel.cl.

This example is very general: it covers the extreme case where each work item takes care of all randomizations at a single point, the opposite extreme where each work item takes care of a single randomization at all points, and every intermediate case. We recall that it is usually preferable to use as many points as possible to obtain an accurate estimator \(\hat\mu_{\mathrm{rqmc}}\), with a small number of randomizations (perhaps 5 to 30), just enough to estimate the variance of \(\hat\mu_{\mathrm{rqmc}}\).

Configuration

Environment variables

For all features of the library to work properly, the CLQMC_ROOT environment variable must be set to point to the installation path of the clQMC package, that is, the directory under which lies the include/clQMC subdirectory. Means of setting an environment variable depend on the operating system used.

Device memory types

The API of each point set assumes that the point set objects are stored in a specific type of memory (stream objects are always stored in private memory). It defaults to global memory, but can be customized by changing the value of the preprocessor symbol CLQMC_<POINTSET>_OBJ_MEM, where <POINTSET> is the uppercase name of the point set, before including the device header file of the point set.

For example, to store lattice rule objects in constant memory, the device code should simply begin with:

#define CLQMC_LATTICERULE_OBJ_MEM CLQMC_MEM_TYPE_CONSTANT
#include <clQMC/latticerule.clh>
Todo:
This is not implemented yet, but should be.