If the objective cannot be achieved with the available data, more information is required, and a proper sampling design should be drawn up before collecting new data. There are various ways to approach this; the main drivers are the available data, the type of problem at hand (revealed by the exploratory data analysis), the outcome of the data analysis, and the reason why the objective cannot be achieved. Since many factors influence which approach is best, we take a top-down approach here: we first discuss some general aspects related to sample locations, size and support, and then briefly describe the individual approaches, highlighting typical cases in which they are commonly applied.

If it is not clear which method to use, or if multiple methods seem equally adequate based on the Venn diagram, it is recommended to go through the brief introductions first. If the choice of method is still unclear, further reading or consultation of a more experienced practitioner is recommended.

Note that the list of approaches discussed here is not exhaustive. Sampling approaches that are more applicable in other fields were not considered, and a more advanced set of approaches, which are less commonly used but may be useful in certain cases, is discussed separately under “optimization”.

Sample locations

A Venn diagram providing an overview of different sampling design approaches is provided here. The expert can select one or more suitable methods from the Venn diagram; the same problem can often be tackled using various methods, and in practice a combination of approaches is frequently implemented. In any case, the sample locations should be selected so that extrapolation during the subsequent data analysis is avoided.

Many characterisation projects tend to focus their sampling efforts on the most affected areas, neglecting areas with lower activity concentration levels. Nonetheless, the supposedly least impacted zones must be sampled as well as the most impacted zones to obtain a realistic picture of the statistical distribution of the activity concentration. Confirming that some areas are non-impacted is often as important as (or even more important than) confirming historically impacted areas. From the point of view of waste volume management, transition zones are the most critical, since it is difficult to categorize them with respect to the reference thresholds. As the uncertainty is greatest in these zones, and proper delineation (with limited misclassification errors) depends on resolving it, the sampling distribution should favour them over areas that merely require confirmation of their impacted or non-impacted status.

We make a distinction here between probabilistic and non-probabilistic approaches, and between designs with equal and unequal probability of selection:

  1. Probabilistic sampling: We use the term here to indicate sampling strategies in which every element in the population has a non-zero probability of being selected. This probability should be known, or easily determined, so that proper inference on the total population can be made.
  2. Non-probabilistic sampling: We consider an approach to be non-probabilistic when certain elements in the population have a zero probability of being selected, or when the selection probability cannot be determined. Hence, a non-probabilistic sample cannot be used for inference on the total population without additional assumptions, and is only targeted at a specific part of it.
  3. Equal probability of selection: We use the term here to indicate sampling strategies in which all obtained samples had the same probability of being selected. The part of the population considered for sampling is therefore explored in a uniform way.
  4. Unequal probability of selection: We use the term here to indicate sampling strategies in which the obtained samples had different probabilities of being selected. The part of the population considered for sampling is therefore explored in a non-uniform way (a minimal illustration of the last two classes is sketched below).
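
To make the last two classes concrete, the sketch below draws an equal-probability and an unequal-probability sample from a set of candidate locations. The candidate population and the weighting by a prior contamination indicator are purely illustrative assumptions, not part of any specific design discussed here.

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Hypothetical population: 1000 candidate sample locations on a site,
    # each with an illustrative prior indication of contamination (0..1).
    n_population = 1000
    prior_indication = rng.uniform(0.0, 1.0, size=n_population)

    n_samples = 50

    # Equal probability of selection: every location is equally likely,
    # so the considered part of the site is explored uniformly.
    equal_prob = rng.choice(n_population, size=n_samples, replace=False)

    # Unequal probability of selection: inclusion probabilities are
    # proportional to the prior indication, favouring e.g. transition zones.
    weights = prior_indication / prior_indication.sum()
    unequal_prob = rng.choice(n_population, size=n_samples,
                              replace=False, p=weights)

    # In both cases the selection probabilities are known, so the design
    # is probabilistic and inference on the whole population is possible.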

Intersections in the Venn diagram indicate that the details of the respective sampling design approaches can be chosen such that they fall under either the probabilistic or the non-probabilistic class, and under either the equal or the unequal probability of selection class. The actual implementation of a sampling design approach cannot, however, be both probabilistic and non-probabilistic, or have both equal and unequal probability of selection, at the same time.

The different approaches listed here are discussed in greater detail on their respective pages. It should be noted, however, that in practice sampling design most often consists of a combination of these approaches, as objectives and/or sampling targets are often multiple in real life.

Sample size or density

The sample size can be determined according to the estimator used (mean, proportion, quantile, etc.) and the confidence interval required. In general, the sample size can be derived from the formulation of the error margin, i.e. the maximum acceptable difference between the observed sample mean and the true mean of the population (cf. §2.3.2, Pérot et al. (2019)).
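
As a minimal sketch of this calculation for the mean, assuming the standard large-sample formula n ≥ (z·σ/E)² and purely illustrative values for the standard deviation, error margin and confidence level (the exact formulation in Pérot et al. (2019) may differ in detail):

    from math import ceil
    from statistics import NormalDist

    # Illustrative assumptions: standard deviation estimated from prior
    # data, desired error margin on the mean, and confidence level.
    sigma = 2.5        # estimated standard deviation (e.g. Bq/g)
    margin = 0.5       # maximum acceptable error on the mean (same units)
    confidence = 0.95

    # Large-sample formula for the mean: n >= (z_{1-alpha/2} * sigma / E)^2
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n_required = ceil((z * sigma / margin) ** 2)

    print(n_required)  # 97 for these illustrative values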

Sometimes, because of access constraints, measurement costs are such that it is not feasible to perform many measurements, but it remains important to assess the representativeness of the sample before any statistical analysis. The representativeness can be studied through the evolution of bootstrap statistical indicators, such as the mean or the standard deviation, with the replicate size varying from a minimum up to the size of the reference sample (cf. §4.2, Pérot et al. (2019)). If the bootstrap estimator and its confidence interval stabilize, we can conclude that the data set size is adequate; otherwise, more measurement data are required. Wilks' method is another way to estimate the size of the data set required to estimate quantiles with a given confidence level (cf. §4.5, Pérot et al. (2019)).
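
The sketch below illustrates both ideas under simple assumptions: a bootstrap stabilization check on a synthetic lognormal data set that stands in for a real reference sample, and the classical first-order, one-sided Wilks formula for the number of observations needed so that the sample maximum bounds a given quantile with a given confidence.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Stand-in for a reference sample of activity concentrations (a
    # lognormal shape is a common assumption; purely illustrative here).
    reference = rng.lognormal(mean=0.0, sigma=1.0, size=200)

    # Bootstrap stabilization check: track the mean and its 95 % bootstrap
    # confidence interval as the replicate size grows to the full sample.
    for m in (20, 50, 100, 150, 200):
        boot_means = [rng.choice(reference, size=m, replace=True).mean()
                      for _ in range(2000)]
        lo, hi = np.percentile(boot_means, [2.5, 97.5])
        print(f"m={m:3d}  mean={np.mean(boot_means):.3f}  "
              f"95% CI=({lo:.3f}, {hi:.3f})")
    # If the estimate and its interval stop changing as m grows, the
    # sample size can be considered adequate for estimating the mean.

    # First-order, one-sided Wilks formula: smallest n such that the
    # sample maximum exceeds the gamma-quantile with confidence beta.
    gamma, beta = 0.95, 0.95
    n_wilks = int(np.ceil(np.log(1.0 - beta) / np.log(gamma)))
    print(n_wilks)  # 59 for the classical 95 %/95 % case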

Defining the sample size can be challenging when several physical parameters need to be assessed (e.g. total activity, activity concentration, comparison against thresholds) based on various data sources that may not always be representative. Moreover, a data set may contain a considerable number of values below the detection limit, and the required confidence levels may not always be unequivocally defined. Reducing the sample size generally increases the uncertainty and can result in severe under- or overestimation of the volume exceeding a threshold. Such deviations can be strongly reduced by combining a limited, higher-quality and more costly primary data set (e.g. laboratory sample measurements) with a large, cheap secondary data set (e.g. in-situ measurements).
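
One simple way to exploit such a combination, sketched below under the assumptions that the two data sets are collocated and approximately linearly related, is to calibrate the cheap secondary measurements against the primary ones by regression and estimate the exceedance fraction from the calibrated data; geostatistical data-fusion techniques (e.g. co-kriging) are the more rigorous alternative.

    import numpy as np

    rng = np.random.default_rng(seed=7)

    # Illustrative synthetic data: a small, accurate primary data set
    # (lab measurements) collocated with a subset of a large, noisy and
    # biased secondary data set (in-situ measurements).
    true_activity = rng.lognormal(mean=0.0, sigma=0.8, size=500)
    secondary = 0.7 * true_activity + rng.normal(0.0, 0.15, size=500)
    primary_idx = rng.choice(500, size=30, replace=False)
    primary = true_activity[primary_idx]  # lab values taken as accurate

    # Calibrate the secondary data against the collocated primary data
    # with a simple linear regression (least squares).
    slope, intercept = np.polyfit(secondary[primary_idx], primary, deg=1)
    calibrated = slope * secondary + intercept

    # Estimate the fraction above a threshold from the large calibrated
    # data set instead of from the 30 primary values alone.
    threshold = 2.0
    print(f"fraction above threshold: {np.mean(calibrated > threshold):.3f}")
    print(f"true fraction:            {np.mean(true_activity > threshold):.3f}")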

Sample support

In certain cases, the sample support is not really an issue, as the population consists of discrete objects that are measured in their entirety. In many cases, however, samples have to be taken from a continuum of material with a certain spatial support (i.e. a length, area or volume). The number of possible sample locations is then infinite, and an appropriate sample support should be defined. The selection of a sample support can be influenced by different factors; two common situations are discussed below.

When the sample support is far larger than the amount of material required for a measurement, homogenization or subsampling techniques can be considered to effectively homogenize the contents of the sample and reduce the number of measurements required.

If the most relevant sample support is of a practically infeasible magnitude, composite sampling techniques can be used to homogenize across the targeted support, resulting in a manageable set of samples that carry the targeted information.

Note that when samples with different supports are collected, measured and combined in a single analysis, regularization or correction for the support effect should be performed first.
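
As a minimal numerical illustration of the support effect, assuming (unrealistically) independent point values within each support, averaging point measurements to a larger support reduces the variance, which is why statistics from data sets with different supports are not directly comparable:

    import numpy as np

    rng = np.random.default_rng(seed=3)

    # Point-support "measurements" on a fine grid (illustrative only).
    points = rng.lognormal(mean=0.0, sigma=1.0, size=10_240)

    # Regularize to larger supports by averaging blocks of point values;
    # the variance shrinks as the support grows (the support effect).
    for block in (1, 4, 16, 64):
        block_means = points.reshape(-1, block).mean(axis=1)
        print(f"support={block:3d} points  variance={block_means.var():.3f}")
    # Data sets with different supports should therefore be corrected to
    # a common support before their statistics are combined. In real data,
    # spatial correlation makes the variance reduction smaller than here.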