To organize the different data analysis techniques, we use a Venn diagram. The categories are based on the four aspects of the data studied in the exploratory data analysis step: the requirement for multivariate methods, the presence of spatial structure, the presence of trends, and the requirement for robust methods. Methods that can handle two, three, or all four aspects are listed in the corresponding intersections. If none of these aspects apply, the method is listed outside the diagram. When it is unclear which method to use, or when multiple methods appear in the relevant section of the diagram, we recommend first reading the brief introductions. If the choice remains unclear, further reading or consultation with a more experienced person is recommended.

Validation techniques for assessing the results obtained

Cross-validation can be used to assess how the results of a statistical analysis will generalize to an independent data set. The model is initially fit on a training dataset, a set of values used to estimate the parameters of the model. Subsequently, the fitted model is used to predict the responses for the observations in a second dataset, called the validation dataset. This held-back sample of the full data provides an unbiased estimate of the skill of the final tuned model and can be used to compare and select between candidate models. A dataset can be repeatedly split into a training dataset and a validation dataset; cross-validation then combines (averages) the measures of predictive fitness to derive a more accurate estimate of model prediction performance. We distinguish two types of cross-validation: exhaustive and non-exhaustive. Exhaustive cross-validation methods learn and test on all possible ways to divide the original sample into a training and a validation set, whereas non-exhaustive methods consider only a subset of these splits.
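As a concrete illustration, non-exhaustive k-fold cross-validation can be sketched as follows. This is a minimal Python sketch; the mean-predictor "model" and the squared-error scoring function are purely illustrative assumptions, not methods prescribed above.

```python
import random

def k_fold_cv(data, k, fit, score):
    """Non-exhaustive k-fold cross-validation: repeatedly split the
    sample into a training and a validation part, fit on the training
    folds, score on the held-out fold, and average the scores."""
    data = list(data)
    random.Random(0).shuffle(data)  # fixed seed for reproducibility
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(training)
        scores.append(score(model, validation))
    return sum(scores) / k  # averaged measure of predictive fitness

# Toy usage: the "model" is just the training mean, scored by mean
# squared error on the held-out validation fold.
fit = lambda train: sum(train) / len(train)
score = lambda m, val: sum((x - m) ** 2 for x in val) / len(val)
mse = k_fold_cv(range(100), k=5, fit=fit, score=score)
```

An exhaustive variant (e.g. leave-one-out) would instead iterate over every possible split, which quickly becomes expensive for large samples.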

Attention should be paid to the re-estimation of extreme values, as they can significantly bias the cross-validation in the case of a skewed distribution. As a consequence, cross-validation may prefer a model globally because it better honours the extreme values, even though the characterisation objective focuses on a low threshold where that model is not satisfactory.
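This effect can be demonstrated with a small simulation. In this sketch, the lognormal sample, the two constant predictors, and the threshold value are all illustrative assumptions: the extreme values dominate the global squared error, so the globally preferred predictor is the worse one below the threshold.

```python
import random

rng = random.Random(1)
# Skewed (lognormal) sample: a handful of extreme values dominate.
sample = [rng.lognormvariate(0.0, 1.5) for _ in range(1000)]

# Two candidate constant "models": the sample mean (pulled towards the
# extremes) and the sample median (closer to the bulk of low values).
mean_pred = sum(sample) / len(sample)
median_pred = sorted(sample)[len(sample) // 2]

def mse(pred, values):
    return sum((v - pred) ** 2 for v in values) / len(values)

threshold = 1.0  # arbitrary low threshold for illustration
low = [v for v in sample if v <= threshold]

global_mean, global_median = mse(mean_pred, sample), mse(median_pred, sample)
low_mean, low_median = mse(mean_pred, low), mse(median_pred, low)
# Globally the mean-based predictor wins (it minimises squared error),
# but below the threshold the median-based predictor is far more accurate.
```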

Uncertainty & sensitivity analysis

The final outcome is inevitably affected by a certain degree of uncertainty, which often strongly impacts decision-making. Quantifying the uncertainty on the variables of interest is usually a multi-dimensional and therefore complex task. Ideally, a Bayesian inference approach would be preferable to Monte Carlo error propagation or to an approach based on a first-order Taylor expansion. The latter may suffer from important drawbacks related to non-Gaussian distributions and non-linear expressions, or from the omission of systematic uncertainties. Although a Bayesian inference approach, or at least Monte Carlo error propagation, would be the best technical choice, it may not be practical given its complexity and the individual character of each model and case. For a one-time case, the following alternative options could be considered:
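For reference, the Monte Carlo error propagation mentioned above can be sketched as follows. The exponential model and the Gaussian input N(0, 0.5) are illustrative assumptions chosen to show where a first-order Taylor expansion falls short for non-linear expressions.

```python
import math
import random
import statistics

def monte_carlo_propagation(f, dists, n=100_000, seed=0):
    """Monte Carlo error propagation: draw each input from its assumed
    distribution, push the samples through the model, and summarise the
    resulting output distribution."""
    rng = random.Random(seed)
    outputs = [f(*(d(rng) for d in dists)) for _ in range(n)]
    return statistics.mean(outputs), statistics.stdev(outputs)

# Hypothetical non-linear model f(x) = exp(x) with x ~ N(0, 0.5).
x_dist = lambda rng: rng.gauss(0.0, 0.5)
mc_mean, mc_std = monte_carlo_propagation(math.exp, [x_dist])

# A first-order Taylor expansion about the input mean would predict
# f(0) = 1 and sigma_f = |f'(0)| * 0.5 = 0.5; the Monte Carlo estimate
# shows both are underestimated for this non-linear case.
```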

In addition, sensitivity analysis is a valuable instrument that enriches the quantitative analysis of impact with a deeper investigation and identification of the sources of uncertainty. Among the most relevant techniques:

In particular, variance-based methods related to Sobol's sensitivity indices are the most popular among practitioners due to their versatility and ease of interpretation.
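A brute-force estimate of a first-order Sobol index, i.e. the share of the output variance explained by one input alone, can be sketched as follows. The additive model Y = X1 + 2·X2 with independent Uniform(0, 1) inputs is an illustrative assumption; for it, the first-order index of X1 is analytically Var(X1) / Var(Y) = 1/5.

```python
import random
import statistics

def first_order_sobol(model, n_outer=2000, n_inner=200, seed=0):
    """Brute-force first-order Sobol index of input X1:
    S1 = Var(E[Y | X1]) / Var(Y), with both inputs ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    cond_means, all_y = [], []
    for _ in range(n_outer):
        x1 = rng.random()  # fix X1 in the outer loop
        ys = [model(x1, rng.random()) for _ in range(n_inner)]
        cond_means.append(statistics.mean(ys))  # estimates E[Y | X1 = x1]
        all_y.extend(ys)
    return statistics.variance(cond_means) / statistics.variance(all_y)

# Hypothetical additive model Y = X1 + 2*X2; analytically S1 = 0.2.
s1 = first_order_sobol(lambda x1, x2: x1 + 2.0 * x2)
```

In practice, dedicated estimators (e.g. Saltelli's pick-freeze scheme) are far more sample-efficient than this double-loop sketch.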

Uncertainty and sensitivity analysis are crucial, as they help identify the factors at play (assumptions, variables, data, and uncertainties) and provide information on how strongly each factor drives the impacts of the various decision options. In particular, in view of nuclear site characterization: