next up previous contents index
Next: Using polylines with gstat Up: Equations Previous: Change of support: details   Contents   Index


Latin hypercube sampling

Suppose we want to simulate the outcome of a single continuous random variable $Z$ with known distribution $F_Z$, to study how the outcome of a model $g(Z)$ depends on the distribution of $Z$. We could do this by using simple random sampling, i.e. randomly drawing values from $F_Z$. However, if $g$ is expensive to evaluate, other sampling strategies may be more efficient: it is for instance well known that stratified sampling can characterise the population equally well as simple random sampling with a smaller sample size. Stratified sampling works as follows: (i) the distribution of $Z$ is divided into $m$ segments, (ii) the distribution of $n$ samples over these segments is proportional to the probabilities of $Z$ falling in the segments, and (iii) each sample is drawn from its segment by simple random sampling. Maximal stratification takes place when the number of segments (strata) $m$ equals the number of data $n$, and when $Z$ has probability $m^{-1}$ of falling in each segment, and this is the most efficient.

When the model depends on two variables, like $h(Z_1 , Z_2)$, then values of both $Z_1$ and $Z_2$ can be drawn by simple random sampling or by stratified random sampling. The most efficient sample is maximally stratified for both $Z_1$ and $Z_2$ simultaneously. This means that a sample of size $n$ of pairs $(z_1 , z_2 )$ is marginally $n-$stratified for both $Z_1$ and $Z_2$. Such a sample is called a Latin hypercube sample [16].

When, in a spatial context, multiple Gaussian simulations are created, the simulations are independent in the sense that they are a random sample from the ensemble of all realisations that were possible under the model specified: at a specific location $x_0$ the subsequently realised values of the variable $Z(x_0)$, $(z^1(x_0), ..., z^n (x_0))$ are a simple random sample from the distribution of $Z(x_0)$. (The values of $Z^i(x)$ and $Z^j(x')$, $i \neq j, x \neq x'$ may be spatially dependent when both simulations were conditioned on a common data set.) When we want a stratified sample of $Z(x_0)$, then this could simply be drawn when $F_{Z(x_0)}$ were known. However, this is less trivial when $Z^i(x)$ and $Z^i(x')$, $x \neq x'$ still have to obey the prespecified spatial correlation. A procedure for obtaining a Latin hypercube sample in this context (multiple, spatially correlated variables) is given in [23] and this procedure has been implemented in gstat.

The procedure requires the marginal distribution of $Z(x_0)$. In case of unconditional simulations, this distribution will be constant everywhere, and when simple point kriging is used to obtain the Gaussian simulations, this distribution is ${\cal N}(\mu_{sk}, C(0))$ with $\mu_{sk}$ the simple kriging mean (sk_mean) and $C(0)$ the sill of the variogram used. If for instance 3 variables are defined in a command file, and they have marginal distributions of respectively ${\cal N}(4,3)$, ${\cal N}(2,5)$ and ${\cal N}(7,2.5)$, then this is defined in a command file by the command

marginals: 4, 3, 2, 5, 7, 2.5;

If no such command is given, marginal distributions for variable $i$ will be taken as ${\cal N}(\mu_{sk,i}, C_i(0))$, a normal distribution with mean $\mu_{sk,i}$ and variance $C_i(0)$ (the sill of the variogram for variable $i$). If simulations are conditional to known point data, then the marginal distributions are location specific, and are obtained from simple kriging (prediction and prediction variance) of the conditioning data. If, for 2 variables, such marginals reside in the maps pr1, var1, pr2 and var2, denoting marginal distributions (as maps) ${\cal N}($pr1,var1$)$ and ${\cal N}($pr2, var2$)$, then this is defined in a command file as

marginals: 'pr1', 'var1', 'pr2', 'var2';

In case of stratified simulation, only the two maps with the (stratified) marginals have to be specified in the marginals command. Note that the marginals construct limits the definition of location specific marginal distributions (for conditional simulation with Latin hypercube sampling) to gridded simulation.

A Latin hypercube sample is constructed from a simple random sample if the command

set lhs=1;

is set. The size of the (Latin hypercube) sample is set to 1000 by

set nsim=1000;

In order to obtain correct results, the sample size (the number of simulations) must be large: in small samples the spatial correlation may be disturbed by the Latin hypercube procedure [21].


next up previous contents index
Next: Using polylines with gstat Up: Equations Previous: Change of support: details   Contents   Index
Edzer Pebesma
1999-08-31