diff --git a/DESCRIPTION b/DESCRIPTION index 70d4cc510..8e2c511e2 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -86,5 +86,5 @@ Additional_repositories: https://mc-stan.org/r-packages/ LazyData: TRUE Roxygen: list(markdown = TRUE) -RoxygenNote: 7.2.3 +RoxygenNote: 7.3.1 VignetteBuilder: knitr, rmarkdown diff --git a/NEWS.md b/NEWS.md index dad6bfcc3..400383f95 100644 --- a/NEWS.md +++ b/NEWS.md @@ -4,6 +4,9 @@ If you read this from a place other than ). This threshold depends on the Monte Carlo sample size and is often close to the former fixed threshold of 0.7 (a short introduction may also be found in the [LOO glossary](https://mc-stan.org/loo/reference/loo-glossary.html)). Correspondingly, the former "secondary" threshold of 0.5 is not used anymore either. # projpred 2.8.0 diff --git a/R/cv_varsel.R b/R/cv_varsel.R index ef1e4416a..472106aac 100644 --- a/R/cv_varsel.R +++ b/R/cv_varsel.R @@ -16,13 +16,13 @@ #' #' @inheritParams varsel #' @param cv_method The CV method, either `"LOO"` or `"kfold"`. In the `"LOO"` -#' case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV) +#' case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO-CV) #' is performed, which avoids refitting the reference model `nloo` times (in -#' contrast to a standard LOO CV). In the `"kfold"` case, a \eqn{K}-fold CV is +#' contrast to a standard LOO-CV). In the `"kfold"` case, a \eqn{K}-fold-CV is #' performed. See also section "Note" below. #' @param nloo **Caution:** Still experimental. Only relevant if `cv_method = #' "LOO"`. If `nloo` is smaller than the number of all observations, -#' approximate full LOO CV using probability-proportional-to-size-sampling +#' approximate full LOO-CV using probability-proportional-to-size-sampling #' (PPS) to make accurate computation only for `nloo` (anything from 1 to the #' number of all observations) leave-one-out folds (Magnusson et al., 2019). #' Smaller values lead to faster computation but higher uncertainty in the @@ -30,13 +30,13 @@ #' @param K Only relevant if `cv_method = "kfold"` and if `cvfits` is `NULL` #' (which is the case for reference model objects created by #' [get_refmodel.stanreg()] or [brms::get_refmodel.brmsfit()]). Number of -#' folds in \eqn{K}-fold CV. +#' folds in \eqn{K}-fold-CV. #' @param cvfits Only relevant if `cv_method = "kfold"`. The same as argument #' `cvfits` of [init_refmodel()], but repeated here so that output from #' [run_cvfun()] can be inserted here straightforwardly. #' @param validate_search A single logical value indicating whether to #' cross-validate also the search part, i.e., whether to run the search -#' separately for each CV fold (`TRUE`) or not (`FALSE`). We strongly do not +#' separately for each CV-fold (`TRUE`) or not (`FALSE`). We strongly do not #' recommend setting this to `FALSE`, because this is known to bias the #' predictive performance estimates of the selected submodels. However, #' setting this to `FALSE` can sometimes be useful because comparing the @@ -50,9 +50,9 @@ #' `NA`, then the PRNG state is reset (to the state before calling #' [cv_varsel()]) upon exiting [cv_varsel()]. Here, `seed` is used for #' clustering the reference model's posterior draws (if `!is.null(nclusters)` -#' or `!is.null(nclusters_pred)`), for subsampling PSIS-LOO CV folds (if +#' or `!is.null(nclusters_pred)`), for subsampling PSIS-LOO-CV folds (if #' `nloo` is smaller than the number of observations), for sampling the folds -#' in \eqn{K}-fold CV, and for drawing new group-level effects when predicting +#' in \eqn{K}-fold-CV, and for drawing new group-level effects when predicting #' from a multilevel submodel (however, not yet in case of a GAMM). #' @param parallel A single logical value indicating whether to run costly parts #' of the CV in parallel (`TRUE`) or not (`FALSE`). See also section "Note" @@ -68,22 +68,48 @@ #' @inherit varsel details return #' #' @note If `validate_search` is `FALSE`, the search is not included in the CV -#' so that only a single full-data search is run. +#' so that only a single full-data search is run. If the number of +#' observations is big, the fast PSIS-LOO-CV along the full-data search path +#' is likely to be accurate. If the number of observations is small or +#' moderate, the fast PSIS-LOO-CV along the full-data search path is likely to +#' have optimistic bias in the middle of the search path. This result can be +#' used to guide further actions and the optimistic bias can be greatly +#' reduced by using `validate_search = TRUE`. #' -#' For PSIS-LOO CV, \pkg{projpred} calls [loo::psis()] (or, exceptionally, +#' PSIS uses Pareto-\eqn{\hat{k}} diagnostic to assess the reliability of +#' PSIS-LOO-CV. Whether the Pareto-\eqn{\hat{k}} diagnostics are shown as +#' warnings, is controlled with a global option `projpred.warn_psis` (default +#' is `TRUE`). See [loo::loo-glossary] for how to interpret the +#' Pareto-\eqn{\hat{k}} values and the warning thresholds. \pkg{projpred} does +#' not support the usually recommended moment-matching (see +#' [loo::loo_moment_match()] and [brms::loo_moment_match()]), mixture +#' importance sampling (`vignette("loo2-mixis", package="loo")`), or +#' `reloo`-ing ([brms::reloo()]). If the reference model PSIS-LOO-CV +#' Pareto-\eqn{\hat{k}} values are good, but there are high +#' Pareto-\eqn{\hat{k}} values for the projected models, you can try +#' increasing the number of draws used for the PSIS-LOO-CV (`ndraws` in case +#' of `refit_prj = FALSE`; `ndraws_pred` in case of `refit_prj = TRUE`). If +#' increasing the number of draws does not help and if the reference model +#' PSIS-LOO-CV Pareto-\eqn{\hat{k}} values are high, and the reference model +#' PSIS-LOO-CV results change substantially when using moment-matching, +#' mixture importance sampling, or `reloo`-ing, we recommend to use +#' \eqn{K}-fold-CV within `projpred`. +#' +#' For PSIS-LOO-CV, \pkg{projpred} calls [loo::psis()] (or, exceptionally, #' [loo::sis()], see below) with `r_eff = NA`. This is only a problem if there #' was extreme autocorrelation between the MCMC iterations when the reference #' model was built. In those cases however, the reference model should not #' have been used anyway, so we don't expect \pkg{projpred}'s `r_eff = NA` to #' be a problem. #' -#' PSIS cannot be used if the draws have different (i.e., nonconstant) weights -#' or if the number of draws is too small. In such cases, \pkg{projpred} -#' resorts to standard importance sampling (SIS) and throws a warning about -#' this. Throughout the documentation, the term "PSIS" is used even though in -#' fact, \pkg{projpred} resorts to SIS in these special cases. +#' PSIS cannot be used if the number of draws or clusters is too small. In +#' such cases, \pkg{projpred} resorts to standard importance sampling (SIS) +#' and shows a message about this. Throughout the documentation, the term +#' "PSIS" is used even though in fact, \pkg{projpred} resorts to SIS in these +#' special cases. If SIS is used, check that the reference model PSIS-LOO-CV +#' Pareto-\eqn{\hat{k}} values are good. #' -#' With `parallel = TRUE`, costly parts of \pkg{projpred}'s CV are run in +#' With `parallel = TRUE`, costly parts of \pkg{projpred}'s CV can be run in #' parallel. Costly parts are the fold-wise searches and performance #' evaluations in case of `validate_search = TRUE`. (Note that in case of #' \eqn{K}-fold CV, the \eqn{K} reference model refits are not affected by @@ -332,7 +358,7 @@ cv_varsel.refmodel <- function( if (!is.null(search_out) && validate_search) { # Extract the fold-wise predictor rankings (to avoid passing the large # object `search_out` itself) and coerce them to a `list` (in a row-wise - # manner) which is needed for the K-fold CV parallelization: + # manner) which is needed for the K-fold-CV parallelization: search_out_rks <- search_out[["rk_foldwise"]] if (!is.null(search_out_rks)) { n_folds <- nrow(search_out_rks) @@ -374,7 +400,7 @@ cv_varsel.refmodel <- function( # passing the large object `search_path_fulldata` to loo_varsel(): NULL } else { - # For K-fold CV, `validate_search = FALSE` may not be combined with + # For K-fold-CV, `validate_search = FALSE` may not be combined with # `refit_prj = FALSE`, so element `predictor_ranking` is all we need: search_path_fulldata["predictor_ranking"] }, @@ -494,7 +520,7 @@ parse_args_cv_varsel <- function(refmodel, cv_method, nloo, K, cvfits, if (!validate_search && !refit_prj) { # Not allowed because this would induce a dependency between training and # test data: - stop("For K-fold CV, `validate_search = FALSE` may not be combined with ", + stop("For K-fold-CV, `validate_search = FALSE` may not be combined with ", "`refit_prj = FALSE`.") } } else { @@ -504,7 +530,7 @@ parse_args_cv_varsel <- function(refmodel, cv_method, nloo, K, cvfits, stop("nloo must be at least 1") } else if (nloo < refmodel[["nobs"]] && getOption("projpred.warn_subsampled_loo", TRUE)) { - warning("Subsampled PSIS-LOO CV is still experimental.") + warning("Subsampled PSIS-LOO-CV is still experimental.") } } @@ -522,12 +548,12 @@ parse_args_cv_varsel <- function(refmodel, cv_method, nloo, K, cvfits, return(nlist(cv_method, nloo, K, cvfits)) } -# PSIS-LOO CV ------------------------------------------------------------- +# PSIS-LOO-CV ------------------------------------------------------------- -# Workhorse function for a variable selection with PSIS-LOO CV +# Workhorse function for a variable selection with PSIS-LOO-CV # # Argument `validate_search` indicates whether the search is performed -# separately for each LOO CV fold (i.e., separately for each observation). For +# separately for each LOO-CV fold (i.e., separately for each observation). For # all other arguments, see the documentation of cv_varsel(). loo_varsel <- function(refmodel, method, nterms_max, ndraws, nclusters, ndraws_pred, nclusters_pred, refit_prj, @@ -543,7 +569,7 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, "probabilistic model for which the log-likelihood can be evaluated.") } - # Log-likelihood values for the reference model (necessary for the PSIS-LOO CV + # Log-likelihood values for the reference model (necessary for the PSIS-LOO-CV # weights, but also for performance statistics like ELPD, MLPD, and GMPD): if (refmodel$family$for_latent) { mu_offs_oscale <- refmodel$family$latent_ilink( @@ -580,14 +606,14 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, } n <- ncol(loglik_forPSIS) - # PSIS-LOO CV weights: + # PSIS-LOO-CV weights: if (length(unique(refmodel$wdraws_ref)) != 1) { stop("Currently, projpred requires the reference model's posterior draws ", "to have constant weights.") } if (nrow(loglik_forPSIS) <= 1) { stop("Currently, more than one posterior draw from the reference model is ", - "needed (because projpred relies on loo::psis() for PSIS-LOO CV).") + "needed (because projpred relies on loo::psis() for PSIS-LOO-CV).") } # Call loo::psis() and while doing so, catch messages and warnings via # capt_mssgs_warns() to filter out some of them. @@ -609,23 +635,13 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, # turn, i.e., brms's `reloo` argument) currently cannot be used because all # these techniques result in new MCMC draws for the reference model, meaning # that the projection would have to be adapted. Therefore, it is easier to - # recommend K-fold CV (for the reference model refits, i.e., brms's `reloo` + # recommend K-fold-CV (for the reference model refits, i.e., brms's `reloo` # argument, another reason is that they can quickly become as costly as - # K-fold CV). + # K-fold-CV). warn_pareto( - n07 = sum(pareto_k > 0.7), n05 = sum(0.7 >= pareto_k & pareto_k > 0.5), - warn_txt_start = paste0("In the calculation of the reference model's ", - "PSIS-LOO CV weights, "), - warn_txt_mid_common = paste0(" (out of ", n, ") Pareto k-values are "), - warn_txt_end = paste0( - ". Moment matching (see the loo package), mixture importance ", - "sampling (see the loo package), and `reloo`-ing (see the brms package) ", - "are not supported by projpred. If these techniques (run outside of ", - "projpred, i.e., for the reference model only; note that `reloo`-ing ", - "may be computationally costly) result in a markedly different ", - "reference model ELPD estimate than ordinary PSIS-LOO CV does, we ", - "recommend to use K-fold CV within projpred." - ) + n07 = sum(pareto_k > .ps_khat_threshold(dim(psisloo)[1])), n = n, + khat_threshold = .ps_khat_threshold(dim(psisloo)[1]), + warn_txt = "Some (%d / %d) Pareto k's for the reference model's PSIS-LOO weights are > %s." ) lw <- weights(psisloo) @@ -644,17 +660,16 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, log_ratios = -loglik_forPSIS ) # The k-values are h-specific (expectation-specific) here (see Vehtari et - # al., 2022, , beginning of + # al., 2024, , beginning of # section 3, section 3.2.8, appendix D, and appendix E). warn_pareto( - n07 = sum(y_lat_E$pareto_k > 0.7), - n05 = sum(0.7 >= y_lat_E$pareto_k & y_lat_E$pareto_k > 0.5), - warn_txt_start = paste0("In the recalculation of the latent response ", - "values, "), - warn_txt_mid_common = paste0( - " (out of ", n, ") expectation-specific Pareto k-values are " - ), - warn_txt_end = ". In general, we recommend K-fold CV in this case." + n07 = sum(y_lat_E$pareto_k > .ps_khat_threshold(dim(psisloo)[1])), n = n, + khat_threshold = .ps_khat_threshold(dim(psisloo)[1]), + warn_txt = paste0( + "In the recalculation of the latent response values, some (%d / %d) ", + "expectation-specific Pareto k-values are > %s.\n", + "In general, we recommend K-fold-CV in this case." + ) ) refmodel$y <- y_lat_E$value } @@ -713,7 +728,7 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, refdist_eval <- perf_eval_out[["p_ref"]] # Step 2: Weight the full-data performance evaluation results according to - # the PSIS-LOO CV weights. + # the PSIS-LOO-CV weights. if (refmodel$family$for_latent) { refdist_eval_mu_offs_oscale <- refmodel$family$latent_ilink( t(refdist_eval$mu_offs), cl_ref = refdist_eval$cl, @@ -770,14 +785,10 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, if (no_psis_eval) { if (getOption("projpred.warn_psis", TRUE)) { warning( - "In the recalculation of the reference model's PSIS-LOO CV ", - "weights for the performance evaluation, the number of draws ", - "after clustering or thinning is too small for Pareto ", - "smoothing. Using standard importance sampling (SIS) instead. ", - "Watch out for warnings thrown by the original-draws Pareto ", - "smoothing to see whether it makes sense to increase the number ", - "of draws (resulting from the clustering or thinning for the ", - "performance evaluation). Alternatively, K-fold CV can be used." + "Using standard importance sampling (SIS), as the number of ", + "draws or clusters is too small for PSIS. For improved ", + "accuracy, increase the number of draws or clusters, or use ", + "K-fold-CV." ) } # Use loo::sis(). @@ -792,8 +803,8 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, # Use loo::psis(). # Usually, we have a small number of projected draws here (400 by # default), which means that the 'loo' package will automatically - # perform the regularization from Vehtari et al. (2022, - # , appendix G). + # perform the regularization from Vehtari et al. (2024, + # , appendix G). importance_sampling_nm <- "psis" } } else { @@ -823,24 +834,19 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, if (length(mssgs_warns_capt) > 0) { warning(mssgs_warns_capt) } + if (importance_sampling_nm == "psis") { pareto_k_eval <- loo::pareto_k_values(sub_psisloo) warn_pareto( - n07 = sum(pareto_k_eval > 0.7), - n05 = sum(0.7 >= pareto_k_eval & pareto_k_eval > 0.5), - warn_txt_start = paste0( - "In the recalculation of the reference model's PSIS-LOO CV ", - "weights for the performance evaluation (based on clustered or ", - "thinned posterior draws), " - ), - warn_txt_mid_common = paste0( - " (out of ", nloo, ") Pareto k-values are " - ), - warn_txt_end = paste0( - ". Watch out for warnings thrown by the original-draws Pareto ", - "smoothing to see whether it makes sense to increase the number ", - "of draws (resulting from the clustering or thinning for the ", - "performance evaluation). Alternatively, K-fold CV can be used." + n07 = sum(pareto_k_eval > .ps_khat_threshold(dim(psisloo)[1])), n = n, + khat_threshold = .ps_khat_threshold(dim(sub_psisloo)[1]), + warn_txt = paste0( + "Some (%d / %d) Pareto k's for the reference model's PSIS-LOO ", + "weights given ", + ifelse(clust_used_eval, + paste0(nclusters_pred, " clustered "), + paste0(ndraws_pred, " posterior ")), + "draws are > %s." ) ) } @@ -928,7 +934,7 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, } verb_out(verb_txt_start, verb_txt_mid, "the performance evaluation with ", "`refit_prj = ", refit_prj, "` for each of the N = ", nloo, " ", - "LOO CV folds separately ...") + "LOO-CV folds separately ...") } one_obs <- function(run_index, verbose_search = verbose && @@ -1190,26 +1196,14 @@ loo_varsel <- function(refmodel, method, nterms_max, ndraws, return(out_list) } -warn_pareto <- function(n07, n05, warn_txt_start, warn_txt_mid_common, - warn_txt_end) { - if (!getOption("projpred.warn_psis", TRUE) || (n07 == 0 && n05 == 0)) return() - if (n07 > 0) { - warn_txt_mid <- paste0(n07, warn_txt_mid_common, "> 0.7") - if (n05 > 0) { - warn_txt_mid <- paste0(warn_txt_mid, " and ") - } - } else { - warn_txt_mid <- "" - } - if (n05 > 0) { - warn_txt_mid <- paste0(warn_txt_mid, n05, warn_txt_mid_common, "in the ", - "interval (0.5, 0.7]") - } - warning(warn_txt_start, warn_txt_mid, warn_txt_end) +warn_pareto <- function(n07, n, khat_threshold = 0.7, warn_txt) { + if (!getOption("projpred.warn_psis", TRUE) || (n07 == 0)) return() + warning(sprintf(warn_txt, n07, n, as.character(round(khat_threshold, 2))), + call. = FALSE) return() } -# K-fold CV --------------------------------------------------------------- +# K-fold-CV --------------------------------------------------------------- # Needed to avoid a NOTE in `R CMD check`: if (getRversion() >= package_version("2.15.1")) { @@ -1433,7 +1427,7 @@ get_kfold <- function(refmodel, K, cvfits, verbose) { verb_out("-----", verbose = verbose) } else { stop("For a reference model which is not of class `datafit`, either ", - "`cvfits` or `cvfun` needs to be provided for K-fold CV (see ", + "`cvfits` or `cvfun` needs to be provided for K-fold-CV (see ", "`?init_refmodel`).") } } else { @@ -1524,7 +1518,7 @@ get_kfold <- function(refmodel, K, cvfits, verbose) { #' cvfits = cv_fits, nterms_max = 3, nclusters = 5, #' nclusters_pred = 10, seed = 5555) #' -#' # Stratified K-fold CV is straightforward: +#' # Stratified K-fold-CV is straightforward: #' n_strat <- 3L #' set.seed(692) #' # Some example strata: @@ -1583,7 +1577,7 @@ run_cvfun.refmodel <- function(object, return(structure(cvfits, folds = folds)) } -# PSIS-LOO CV helpers ----------------------------------------------------- +# PSIS-LOO-CV helpers ----------------------------------------------------- # ## decide which points to go through in the validation (i.e., which points # ## belong to the semi random subsample of validation points) @@ -1645,3 +1639,23 @@ loo_subsample_pps <- function(nloo, lppd) { return(nlist(inds, wcv)) } + +#' Pareto-smoothing k-hat threshold +#' +#' Copied from loo package. Remove after loo package exposes this. +#' +#' Given sample size S computes khat threshold for reliable Pareto +#' smoothed estimate (to have small probability of large error). See +#' section 3.2.4, equation (13). Sample sizes 100, 320, 1000, 2200, +#' 10000 correspond to thresholds 0.5, 0.6, 0.67, 0.7, 0.75. Although +#' with bigger sample size S we can achieve estimates with small +#' probability of large error, it is difficult to get accurate MCSE +#' estimates as the bias starts to dominate when k > 0.7 (see Section 3.2.3). +#' Thus the sample size dependend k-ht threshold is capped at 0.7. +#' @param S sample size +#' @param ... unused +#' @return threshold +#' @noRd +.ps_khat_threshold <- function(S, ...) { + min(1 - 1 / log10(S), 0.7) +} diff --git a/R/varsel.R b/R/varsel.R index e80215f5b..b434a758c 100644 --- a/R/varsel.R +++ b/R/varsel.R @@ -5,7 +5,7 @@ #' known as solution path), i.e., the best submodel for each submodel size #' (number of predictor terms). The evaluation part determines the predictive #' performance of the submodels along the predictor ranking. A special method is -#' [varsel.vsel()] because it re-uses the search results from an earlier +#' [varsel.vsel()] which re-uses the search results from an earlier #' [varsel()] (or [cv_varsel()]) run, as illustrated in the main vignette. #' #' @param object An object of class `refmodel` (returned by [get_refmodel()] or @@ -141,8 +141,9 @@ #' #' L1 search is faster than forward search, but forward search may be more #' accurate. Furthermore, forward search may find a sparser model with -#' comparable performance to that found by L1 search, but it may also start -#' overfitting when more predictors are added. +#' comparable performance to that found by L1 search, but it may also +#' overfit when more predictors are added. This overfit can be detected +#' by running search validation (see [cv_varsel()]). #' #' An L1 search may select an interaction term before all involved lower-order #' interaction terms (including main-effect terms) have been selected. In diff --git a/man/cv_varsel.Rd b/man/cv_varsel.Rd index 55ecb679d..cb01e1974 100644 --- a/man/cv_varsel.Rd +++ b/man/cv_varsel.Rd @@ -62,13 +62,13 @@ when refitting the submodels for the performance evaluation (if \code{refit_prj} is \code{TRUE}).} \item{cv_method}{The CV method, either \code{"LOO"} or \code{"kfold"}. In the \code{"LOO"} -case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV) +case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO-CV) is performed, which avoids refitting the reference model \code{nloo} times (in -contrast to a standard LOO CV). In the \code{"kfold"} case, a \eqn{K}-fold CV is +contrast to a standard LOO-CV). In the \code{"kfold"} case, a \eqn{K}-fold-CV is performed. See also section "Note" below.} \item{nloo}{\strong{Caution:} Still experimental. Only relevant if \code{cv_method = "LOO"}. If \code{nloo} is smaller than the number of all observations, -approximate full LOO CV using probability-proportional-to-size-sampling +approximate full LOO-CV using probability-proportional-to-size-sampling (PPS) to make accurate computation only for \code{nloo} (anything from 1 to the number of all observations) leave-one-out folds (Magnusson et al., 2019). Smaller values lead to faster computation but higher uncertainty in the @@ -77,7 +77,7 @@ evaluation part. If \code{NULL}, all observations are used (as by default).} \item{K}{Only relevant if \code{cv_method = "kfold"} and if \code{cvfits} is \code{NULL} (which is the case for reference model objects created by \code{\link[=get_refmodel.stanreg]{get_refmodel.stanreg()}} or \code{\link[brms:get_refmodel.brmsfit]{brms::get_refmodel.brmsfit()}}). Number of -folds in \eqn{K}-fold CV.} +folds in \eqn{K}-fold-CV.} \item{cvfits}{Only relevant if \code{cv_method = "kfold"}. The same as argument \code{cvfits} of \code{\link[=init_refmodel]{init_refmodel()}}, but repeated here so that output from @@ -85,7 +85,7 @@ folds in \eqn{K}-fold CV.} \item{validate_search}{A single logical value indicating whether to cross-validate also the search part, i.e., whether to run the search -separately for each CV fold (\code{TRUE}) or not (\code{FALSE}). We strongly do not +separately for each CV-fold (\code{TRUE}) or not (\code{FALSE}). We strongly do not recommend setting this to \code{FALSE}, because this is known to bias the predictive performance estimates of the selected submodels. However, setting this to \code{FALSE} can sometimes be useful because comparing the @@ -182,9 +182,9 @@ results can be obtained again if needed. Passed to argument \code{seed} of \code{NA}, then the PRNG state is reset (to the state before calling \code{\link[=cv_varsel]{cv_varsel()}}) upon exiting \code{\link[=cv_varsel]{cv_varsel()}}. Here, \code{seed} is used for clustering the reference model's posterior draws (if \code{!is.null(nclusters)} -or \code{!is.null(nclusters_pred)}), for subsampling PSIS-LOO CV folds (if +or \code{!is.null(nclusters_pred)}), for subsampling PSIS-LOO-CV folds (if \code{nloo} is smaller than the number of observations), for sampling the folds -in \eqn{K}-fold CV, and for drawing new group-level effects when predicting +in \eqn{K}-fold-CV, and for drawing new group-level effects when predicting from a multilevel submodel (however, not yet in case of a GAMM).} \item{search_terms}{Only relevant for forward search. A custom character @@ -236,8 +236,9 @@ effect. L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with -comparable performance to that found by L1 search, but it may also start -overfitting when more predictors are added. +comparable performance to that found by L1 search, but it may also +overfit when more predictors are added. This overfit can be detected +by running search validation (see \code{\link[=cv_varsel]{cv_varsel()}}). An L1 search may select an interaction term before all involved lower-order interaction terms (including main-effect terms) have been selected. In @@ -275,22 +276,48 @@ that leads to the same results, \code{search_terms = c("x2", "x3")}). } \note{ If \code{validate_search} is \code{FALSE}, the search is not included in the CV -so that only a single full-data search is run. - -For PSIS-LOO CV, \pkg{projpred} calls \code{\link[loo:psis]{loo::psis()}} (or, exceptionally, +so that only a single full-data search is run. If the number of +observations is big, the fast PSIS-LOO-CV along the full-data search path +is likely to be accurate. If the number of observations is small or +moderate, the fast PSIS-LOO-CV along the full-data search path is likely to +have optimistic bias in the middle of the search path. This result can be +used to guide further actions and the optimistic bias can be greatly +reduced by using \code{validate_search = TRUE}. + +PSIS uses Pareto-\eqn{\hat{k}} diagnostic to assess the reliability of +PSIS-LOO-CV. Whether the Pareto-\eqn{\hat{k}} diagnostics are shown as +warnings, is controlled with a global option \code{projpred.warn_psis} (default +is \code{TRUE}). See \link[loo:loo-glossary]{loo::loo-glossary} for how to interpret the +Pareto-\eqn{\hat{k}} values and the warning thresholds. \pkg{projpred} does +not support the usually recommended moment-matching (see +\code{\link[loo:loo_moment_match]{loo::loo_moment_match()}} and \code{\link[brms:loo_moment_match.brmsfit]{brms::loo_moment_match()}}), mixture +importance sampling (\code{vignette("loo2-mixis", package="loo")}), or +\code{reloo}-ing (\code{\link[brms:reloo.brmsfit]{brms::reloo()}}). If the reference model PSIS-LOO-CV +Pareto-\eqn{\hat{k}} values are good, but there are high +Pareto-\eqn{\hat{k}} values for the projected models, you can try +increasing the number of draws used for the PSIS-LOO-CV (\code{ndraws} in case +of \code{refit_prj = FALSE}; \code{ndraws_pred} in case of \code{refit_prj = TRUE}). If +increasing the number of draws does not help and if the reference model +PSIS-LOO-CV Pareto-\eqn{\hat{k}} values are high, and the reference model +PSIS-LOO-CV results change substantially when using moment-matching, +mixture importance sampling, or \code{reloo}-ing, we recommend to use +\eqn{K}-fold-CV within \code{projpred}. + +For PSIS-LOO-CV, \pkg{projpred} calls \code{\link[loo:psis]{loo::psis()}} (or, exceptionally, \code{\link[loo:sis]{loo::sis()}}, see below) with \code{r_eff = NA}. This is only a problem if there was extreme autocorrelation between the MCMC iterations when the reference model was built. In those cases however, the reference model should not have been used anyway, so we don't expect \pkg{projpred}'s \code{r_eff = NA} to be a problem. -PSIS cannot be used if the draws have different (i.e., nonconstant) weights -or if the number of draws is too small. In such cases, \pkg{projpred} -resorts to standard importance sampling (SIS) and throws a warning about -this. Throughout the documentation, the term "PSIS" is used even though in -fact, \pkg{projpred} resorts to SIS in these special cases. +PSIS cannot be used if the number of draws or clusters is too small. In +such cases, \pkg{projpred} resorts to standard importance sampling (SIS) +and shows a message about this. Throughout the documentation, the term +"PSIS" is used even though in fact, \pkg{projpred} resorts to SIS in these +special cases. If SIS is used, check that the reference model PSIS-LOO-CV +Pareto-\eqn{\hat{k}} values are good. -With \code{parallel = TRUE}, costly parts of \pkg{projpred}'s CV are run in +With \code{parallel = TRUE}, costly parts of \pkg{projpred}'s CV can be run in parallel. Costly parts are the fold-wise searches and performance evaluations in case of \code{validate_search = TRUE}. (Note that in case of \eqn{K}-fold CV, the \eqn{K} reference model refits are not affected by @@ -330,20 +357,19 @@ cvvs <- cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, \dontshow{\}) # examplesIf} } \references{ -Magnusson, Måns, Michael Andersen, Johan Jonasson, and Aki Vehtari. 2019. -"Bayesian Leave-One-Out Cross-Validation for Large Data." In \emph{Proceedings of -the 36th International Conference on Machine Learning}, edited by Kamalika -Chaudhuri and Ruslan Salakhutdinov, 97:4244--53. Proceedings of Machine -Learning Research. PMLR. -\url{https://proceedings.mlr.press/v97/magnusson19a.html}. - -Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. "Practical Bayesian Model -Evaluation Using Leave-One-Out Cross-Validation and WAIC." \emph{Statistics and -Computing} 27 (5): 1413--32. \doi{10.1007/s11222-016-9696-4}. - -Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. -2022. "Pareto Smoothed Importance Sampling." arXiv. -\doi{10.48550/arXiv.1507.02646}. +Måns Magnusson, Michael Riis Andersen, Johan Jonasson, Aki Vehtari +(2020). Leave-one-out cross-validation for Bayesian model +comparison in large data. Proceedings of the 23rd International +Conference on Artificial Intelligence and Statistics (AISTATS), +PMLR 108:341-351. + +Aki Vehtari, Andrew Gelman, and Jonah Gabry (2017). Practical Bayesian Model +Evaluation Using Leave-One-Out Cross-Validation and WAIC. Statistics and +Computing, 27(5):1413--32. \doi{10.1007/s11222-016-9696-4}. + +Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah +Gabry (2024). Pareto smoothed importance sampling. Journal of +Machine Learning Research, 25(72):1-58. } \seealso{ \code{\link[=varsel]{varsel()}} diff --git a/man/run_cvfun.Rd b/man/run_cvfun.Rd index 516837730..8c7715ab9 100644 --- a/man/run_cvfun.Rd +++ b/man/run_cvfun.Rd @@ -91,7 +91,7 @@ cvvs_fw <- cv_varsel(ref, method = "forward", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters = 5, nclusters_pred = 10, seed = 5555) -# Stratified K-fold CV is straightforward: +# Stratified K-fold-CV is straightforward: n_strat <- 3L set.seed(692) # Some example strata: diff --git a/man/varsel.Rd b/man/varsel.Rd index 2cac4afb4..b9ade6fbb 100644 --- a/man/varsel.Rd +++ b/man/varsel.Rd @@ -165,7 +165,7 @@ variable selection. The search part determines the predictor ranking (also known as solution path), i.e., the best submodel for each submodel size (number of predictor terms). The evaluation part determines the predictive performance of the submodels along the predictor ranking. A special method is -\code{\link[=varsel.vsel]{varsel.vsel()}} because it re-uses the search results from an earlier +\code{\link[=varsel.vsel]{varsel.vsel()}} which re-uses the search results from an earlier \code{\link[=varsel]{varsel()}} (or \code{\link[=cv_varsel]{cv_varsel()}}) run, as illustrated in the main vignette. } \details{ @@ -185,8 +185,9 @@ effect. L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with -comparable performance to that found by L1 search, but it may also start -overfitting when more predictors are added. +comparable performance to that found by L1 search, but it may also +overfit when more predictors are added. This overfit can be detected +by running search validation (see \code{\link[=cv_varsel]{cv_varsel()}}). An L1 search may select an interaction term before all involved lower-order interaction terms (including main-effect terms) have been selected. In