arviz_stats.loo_subsample#
- arviz_stats.loo_subsample(data, observations, pointwise=None, var_name=None, reff=None, log_weights=None, log_p=None, log_q=None, seed=315, method='lpd', thin=None, log_lik_fn=None, param_names=None, log=True)[source]#
Compute PSIS-LOO-CV using sub-sampling.
Estimates the expected log pointwise predictive density (elpd) using Pareto smoothed importance sampling leave-one-out cross-validation (PSIS-LOO-CV) with sub-sampling for large datasets. Uses either log predictive density (LPD) or point log predictive density (PLPD) approximation and applies a difference estimator based on a simple random sample without replacement.
The PSIS-LOO-CV method is described in [1], [2]. The sub-sampling method is described in [3].
- Parameters:
- data
xarray.DataTreeorInferenceData Input data. It should contain the posterior and the log_likelihood groups.
- observations
intorndarray The sub-sample observations to use:
An integer specifying the number of observations to randomly sub-sample without replacement.
An array of integer indices specifying the exact observations to use.
- pointwise: bool, optional
If True the pointwise predictive accuracy will be returned. Defaults to
rcParams["stats.ic_pointwise"].- var_name
str, optional The name of the variable in log_likelihood groups storing the pointwise log likelihood data to use for loo computation.
- reff: float, optional
Relative MCMC efficiency,
ess / ni.e. number of effective samples divided by the number of actual samples. Computed from trace by default.- log_weights
xarray.DataArrayorELPDData, optional Smoothed log weights. Can be either:
A DataArray with the same shape as the log likelihood data
An ELPDData object from a previous
arviz_stats.loocall.
Defaults to None. If not provided, it will be computed using the PSIS-LOO method.
- log_p
ndarrayorxarray.DataArray, optional The (target) log-density evaluated at samples from the target distribution (p). If provided along with
log_q, approximate posterior correction will be applied.- log_q
ndarrayorxarray.DataArray, optional The (proposal) log-density evaluated at samples from the proposal distribution (q). If provided along with
log_p, approximate posterior correction will be applied.- seed: int, optional
Seed for random sampling.
- method: str, optional
Method used for approximating the pointwise log predictive density:
lpd: Use standard log predictive density approximation (default)plpd: Use point log predictive density approximation which requires alog_lik_fn.
- thin: int or str, optional
Thinning factor for posterior draws. Can be an integer to thin by that factor, “auto” to automatically determine thinning based on bulk and tail ESS, or None (default) to use all posterior draws. This value is stored in the returned
ELPDDataobject and will be automatically used byupdate_subsample.- log_lik_fn
callable, optional Custom log-likelihood function. The signature must be
log_lik_fn(observed, data)whereobservedis anDataArraycontaining one or more observations anddatais the fullDataTreeorInferenceData. The function must return aDataArray. Formethod="lpd"it must include dimensions("chain", "draw", *obs_dims)and contain the per-draw log-likelihood values. Formethod="plpd"it must have dimensions matching the observation dimensionsobs_dimsand provide the pointwise log predictive density.- param_names
list, optional List of parameter names to extract from the posterior. If None, all parameters are used. Recommended to pass the required parameter names from the posterior group that are necessary for the log-likelihood function.
- log: bool, optional
Whether the
log_lik_fnreturns log-likelihood (True) or likelihood (False). Default is True.
- data
- Returns:
ELPDDataObject with the following attributes:
elpd: approximated expected log pointwise predictive density (elpd)
se: standard error of the elpd (includes approximation and sampling uncertainty)
p: effective number of parameters
n_samples: number of samples in the posterior
n_data_points: total number of data points (N)
warning: True if the estimated shape parameter k of the Pareto distribution is >
good_kfor any observation in the subsample.elpd_i:
DataArraywith pointwise elpd values (filled with NaNs for non-subsampled points), only ifpointwise=True.pareto_k:
DataArraywith Pareto shape values for the subsample (filled with NaNs for non-subsampled points), only ifpointwise=True.scale: scale of the elpd results (“log”, “negative_log”, or “deviance”).
good_k: Threshold for Pareto k warnings.
approx_posterior: True if approximate posterior was used.
subsampling_se: Standard error estimate from subsampling uncertainty only.
subsample_size: Number of observations in the subsample (m).
log_p: Log density of the target posterior.
log_q: Log density of the proposal posterior.
thin: Thinning factor for posterior draws.
log_weights: Smoothed log weights.
loo_subsample_observations: Indices of subsampled observations.
elpd_loo_approx: Approximation for all N observations.
Warning
When using custom log-likelihood functions with auxiliary data (e.g., measurement errors, covariates, or any observation-specific parameters), that data must be stored in the
constant_datagroup of your DataTree/InferenceData object. During sub-sampling, data from this group is automatically aligned with the subset of observations being evaluated. This ensures that when computing the log-likelihood for observation i, the corresponding auxiliary data is correctly matched.If auxiliary data is not properly placed in this group, indexing mismatches can occur, leading to incorrect likelihood calculations.
See also
looStandard PSIS-LOO-CV.
update_subsampleUpdate a previously computed sub-sampled LOO-CV.
References
[1]Vehtari et al. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 27(5) (2017) https://doi.org/10.1007/s11222-016-9696-4 arXiv preprint https://arxiv.org/abs/1507.04544.
[2]Vehtari et al. Pareto Smoothed Importance Sampling. Journal of Machine Learning Research, 25(72) (2024) https://jmlr.org/papers/v25/19-556.html arXiv preprint https://arxiv.org/abs/1507.02646
[3]Magnusson, M., Riis Andersen, M., Jonasson, J., & Vehtari, A. Bayesian Leave-One-Out Cross-Validation for Large Data. Proceedings of the 36th International Conference on Machine Learning, PMLR 97:4244–4253 (2019) https://proceedings.mlr.press/v97/magnusson19a.html arXiv preprint https://arxiv.org/abs/1904.10679
Examples
Calculate sub-sampled PSIS-LOO-CV using 4 random observations:
In [1]: from arviz_stats import loo_subsample ...: from arviz_base import load_arviz_data ...: data = load_arviz_data("centered_eight") ...: loo_results = loo_subsample(data, observations=4, var_name="obs", pointwise=True) ...: loo_results ...: Out[1]: Computed from 2000 by 4 subsampled log-likelihood values from 8 total observations. Estimate SE subsampling SE elpd_loo -30.8 1.5 0.3 p_loo 0.9 ------ Pareto k diagnostic values: Count Pct. (-Inf, 0.70] (good) 4 100.0% (0.70, 1] (bad) 0 0.0% (1, Inf) (very bad) 0 0.0%
Return the pointwise values for the sub-sample:
In [2]: loo_results.elpd_i Out[2]: <xarray.DataArray 'elpd_i' (school: 8)> Size: 64B array([-4.89190585, nan, nan, -3.46496198, -3.4780878 , nan, nan, -3.95934834]) Coordinates: * school (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
We can also use custom log-likelihood functions with both lpd and plpd methods. Passing a custom log-likelihood function is required for the plpd method and optional for the lpd method. Note that in this example, the constant_data group already exists in this data object so we can add the sigma data array to it. In other cases, you may need to create the constant_data group to store your auxiliary data:
In [3]: import numpy as np ...: import xarray as xr ...: from scipy import stats ...: ...: sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]) ...: sigma_da = xr.DataArray(sigma, ...: dims=["school"], ...: coords={"school": data.observed_data.school.values}) ...: data['constant_data'] = ( ...: data['constant_data'].to_dataset().assign(sigma=sigma_da) ...: ) ...: ...: def log_lik_fn(obs_da, data): ...: theta = data.posterior["theta"] ...: sigma = data.constant_data["sigma"] ...: return stats.norm.logpdf(obs_da, loc=theta, scale=sigma) ...: ...: loo_results = loo_subsample( ...: data, ...: observations=4, ...: var_name="obs", ...: method="plpd", ...: log_lik_fn=log_lik_fn, ...: param_names=["theta"], ...: pointwise=True ...: ) ...: loo_results ...: Out[3]: Computed from 2000 by 4 subsampled log-likelihood values from 8 total observations. Estimate SE subsampling SE elpd_loo -30.7 1.4 0.2 p_loo 1.3 ------ Pareto k diagnostic values: Count Pct. (-Inf, 0.70] (good) 4 100.0% (0.70, 1] (bad) 0 0.0% (1, Inf) (very bad) 0 0.0%
We can also use the lpd approximation with a custom log-likelihood function, which receives full posterior samples. This should match the results from the default method using the full, pre-computed log-likelihood.
Passing a custom log-likelihood function is optional for the lpd method, but it is recommended in the large data case so that we can compute the log-likelihood on the fly:
In [4]: loo_results_lpd = loo_subsample( ...: data, ...: observations=4, ...: var_name="obs", ...: method="lpd", ...: log_lik_fn=log_lik_fn, ...: pointwise=True ...: ) ...: loo_results_lpd ...: Out[4]: Computed from 2000 by 4 subsampled log-likelihood values from 8 total observations. Estimate SE subsampling SE elpd_loo -30.8 1.5 0.3 p_loo 0.9 ------ Pareto k diagnostic values: Count Pct. (-Inf, 0.70] (good) 4 100.0% (0.70, 1] (bad) 0 0.0% (1, Inf) (very bad) 0 0.0%