Title: | Ensembles of Caret Models |
---|---|
Description: | Functions for creating ensembles of caret models: caretList() and caretStack(). caretList() is a convenience function for fitting multiple caret::train() models to the same dataset. caretStack() will make linear or non-linear combinations of these models, using a caret::train() model as a meta-model. |
Authors: | Zachary A. Deane-Mayer [aut, cre, cph], Jared E. Knowles [ctb], Antón López [ctb] |
Maintainer: | Zachary A. Deane-Mayer <[email protected]> |
License: | MIT + file LICENSE |
Version: | 4.0.2 |
Built: | 2025-01-16 06:12:12 UTC |
Source: | https://github.com/zachmayer/caretensemble |
Index a caret list to extract caret models into a new caretList object
## S3 method for class 'caretList'
object[index]
object |
an object of class caretList |
index |
selected index |
Converts object into a caretList
as.caretList(object)
object |
R Object |
a caretList object
Converts object into a caretList - For Future Use
## Default S3 method:
as.caretList(object)
object |
R object |
NA
Converts list to caretList
## S3 method for class 'list'
as.caretList(object)
object |
list of caret models |
a caretList object
This function provides a more robust series of diagnostic plots for a caretEnsemble object.
## S3 method for class 'caretStack'
autoplot(object, training_data = NULL, xvars = NULL, show_class_id = 2L, ...)
object |
a caretStack object |
training_data |
The data used to train the ensemble. Required if xvars is not NULL. Must be in the same row order as when the models were trained. |
xvars |
a vector of the names of x variables to plot against residuals |
show_class_id |
For classification only: which class level to show on the plot |
... |
ignored |
A grid of diagnostic plots. Top left is the range of the performance metric across each component model along with its standard deviation. Top right is the residuals from the ensembled model plotted against fitted values. Middle left is a bar graph of the weights of the component models. Middle right is the disagreement in the residuals of the component models (unweighted) across the fitted values. Bottom left and bottom right are the plots of the residuals against two random or user specified variables. Note that the ensemble must have been trained with savePredictions = "final", which is required to get residuals from the stack for the plot.
set.seed(42)
data(models.reg)
ens <- caretStack(models.reg[1:2], method = "lm")
autoplot(ens)
Take N objects of class caretList and concatenate them into a larger object of class caretList for future ensembling.
## S3 method for class 'caretList'
c(...)
... |
the objects of class caretList or train to bind into a caretList |
a caretList object
data(iris)
model_list1 <- caretList(Sepal.Width ~ .,
  data = iris,
  tuneList = list(
    lm = caretModelSpec(method = "lm")
  )
)
model_list2 <- caretList(Sepal.Width ~ .,
  data = iris, tuneLength = 1L,
  tuneList = list(
    rf = caretModelSpec(method = "rf")
  )
)
bigList <- c(model_list1, model_list2)
Take N objects of class train and concatenate them into an object of class caretList for future ensembling.
## S3 method for class 'train'
c(...)
... |
the objects of class train to bind into a caretList |
a caretList object
data(iris)
model_lm <- caret::train(Sepal.Length ~ .,
  data = iris,
  method = "lm"
)
model_rf <- caret::train(Sepal.Length ~ .,
  data = iris,
  method = "rf",
  tuneLength = 1L
)
model_list <- c(model_lm, model_rf)
Find a greedy, positive-only linear combination of several train objects.
Functions for creating ensembles of caret models: caretList and caretStack
caretEnsemble(all.models, excluded_class_id = 0L, tuneLength = 1L, ...)
all.models |
an object of class caretList |
excluded_class_id |
The integer level to exclude from binary classification or multiclass problems. By default no classes are excluded, as the greedy optimizer requires all classes because it cannot use negative coefficients. |
tuneLength |
The size of the grid to search for tuning the model. Defaults to 1, as the only parameter to optimize is the number of iterations, and the default of 100 works well. |
... |
additional arguments to pass caret::train |
greedyMSE works well when you want an ensemble that will never be worse than any single model in the dataset. In the worst case, it will select the single best model, if none of them can be ensembled to improve the overall score. It will also never assign any model a negative coefficient, which can help avoid unintuitive cases at prediction time (e.g. if the correlations between predictors break down on new data, negative coefficients can lead to bad results).
a caretEnsemble object
Every model in the "library" must be a separate train object. For example, if you wish to combine random forests with several different values of mtry, you must build a model for each value of mtry. If you use several values of mtry in one train model (e.g. tuneGrid = expand.grid(.mtry=2:5)), caret will select the best value of mtry before we get a chance to include it in the ensemble. By default, RMSE is used to ensemble regression models, and AUC is used to ensemble classification models. This function does not currently support multi-class problems.
Maintainer: Zachary A. Deane-Mayer [email protected] [copyright holder]
Other contributors:
Jared E. Knowles [email protected] [contributor]
Antón López [email protected] [contributor]
Useful links:
Report bugs at https://github.com/zachmayer/caretEnsemble/issues
set.seed(42)
models <- caretList(iris[1:50, 1:2], iris[1:50, 3], methodList = c("rpart", "rf"))
ens <- caretEnsemble(models)
summary(ens)
Build a list of train objects suitable for ensembling using the caretStack function.
caretList(
  ...,
  trControl = NULL,
  methodList = NULL,
  tuneList = NULL,
  metric = NULL,
  continue_on_fail = FALSE,
  trim = TRUE,
  aggregate_resamples = TRUE
)
... |
arguments to pass to caret::train |
trControl |
a trainControl object |
methodList |
optional, a character vector of caret models to ensemble. One of methodList or tuneList must be specified. |
tuneList |
optional, a NAMED list of caretModelSpec objects. This is much more flexible than methodList and allows the specification of model-specific parameters (e.g. passing trace=FALSE to nnet) |
metric |
a string, the metric to optimize for. If NULL, we will choose a good one. |
continue_on_fail |
logical, should a valid caretList be returned that excludes models that fail, default is FALSE |
trim |
logical, should the train models be trimmed to save memory and speed up stacking |
aggregate_resamples |
logical, whether to aggregate stacked predictions. Default is TRUE. |
A list of train objects. If a model fails to build and continue_on_fail is TRUE, it is dropped from the list.
caretList(
  Sepal.Length ~ Sepal.Width,
  head(iris, 50),
  methodList = c("glm", "lm"),
  tuneList = list(
    nnet = caretModelSpec(method = "nnet", trace = FALSE, tuneLength = 1L)
  )
)
A caret model specification consists of 2 parts: a model (as a string) and the arguments to the train call for fitting that model
caretModelSpec(method = "rf", ...)
method |
the modeling method to pass to caret::train |
... |
Other arguments that will eventually be passed to caret::train |
a list of lists
caretModelSpec("rf", tuneLength = 5L, preProcess = "ica")
Stack several train models using a train model.
caretStack(
  all.models,
  new_X = NULL,
  new_y = NULL,
  metric = NULL,
  trControl = NULL,
  excluded_class_id = 1L,
  original_features = NULL,
  aggregate_resamples = TRUE,
  ...
)
all.models |
a caretList, or an object coercible to a caretList (such as a list of train objects) |
new_X |
Data to predict on for the caretList, prior to training the stack (for transfer learning). If NULL, the stacked predictions will be extracted from the caretList models. |
new_y |
The outcome variable to predict on for the caretList, prior to training the stack (for transfer learning). If NULL, will use the observed levels from the first model in the caret stack. If 0, will include all levels. |
metric |
the metric to use for grid search on the stacking model. |
trControl |
a trainControl object to use for training the ensemble model. If NULL, will use defaultControl. |
excluded_class_id |
The integer level to exclude from binary classification or multiclass problems. |
original_features |
a character vector of the names of the original features to include in the stack or NULL to not include any features. These features will be added to the stacked predictions from the models to train the ensemble model. |
aggregate_resamples |
logical, whether to aggregate resamples by keys. Default is TRUE. |
... |
additional arguments to pass to the stacking model |
Uses either transfer learning or stacking to stack models. Assumes that all models were trained on the same number of rows of data, with the same target values. The features, cross-validation strategies, and model types (class vs reg) may vary, however. If your models were trained with different numbers of rows, please provide new_X and new_y so the models can predict on a common set of data for stacking.
If your models were trained on different columns, you should use stacking.
If you have both differing rows and columns in your model set, you are out of luck. You need at least a common set of rows during training (for stacking) or a common set of columns at inference time for transfer learning.
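To make the "common set of rows" requirement concrete, here is a minimal base R sketch of stacking. It does not use caretStack or caret at all: the two base models, the fold assignment, and the lm meta-model are illustrative assumptions. Each base model sees a different column of mtcars, but both produce out-of-fold predictions for the same rows, which is what lets a meta-model be fitted on top of them.

```r
# Base R sketch of stacking (not the caretStack implementation).
set.seed(42)
n <- nrow(mtcars)
fold_id <- sample(rep_len(1:5, n))

# Two base models trained on different columns but the same rows.
formulas <- list(wt = mpg ~ wt, hp = mpg ~ hp)

# Out-of-fold predictions: one column per base model, one row per observation.
oof <- sapply(formulas, function(f) {
  pred <- numeric(n)
  for (k in 1:5) {
    held_out <- fold_id == k
    fit <- lm(f, data = mtcars[!held_out, ])
    pred[held_out] <- predict(fit, newdata = mtcars[held_out, ])
  }
  pred
})

# The meta-model is trained on the base models' out-of-fold predictions.
stack_data <- data.frame(oof, mpg = mtcars$mpg)
meta <- lm(mpg ~ wt + hp, data = stack_data)
coef(meta)
```

If the base models had been fitted on different rows, the columns of `oof` would not line up, which is the situation where new_X and new_y are needed.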
S3 caretStack object
Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (2004). Ensemble Selection from Libraries of Models. https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf
models <- caretList(
  x = iris[1:50, 1:2],
  y = iris[1:50, 3],
  methodList = c("rpart", "glm")
)
caretStack(models, method = "glm")
Unlike caret::trainControl, this function defaults to 5 fold CV. CV is good for stacking, as every observation is in the test set exactly once. We use 5 instead of 10 to save compute time, as caretList is for fitting many models. We also construct explicit fold indexes and return the stacked predictions, which are needed for stacking. For classification models we return class probabilities.
defaultControl(
  target,
  method = "cv",
  number = 5L,
  savePredictions = "final",
  index = caret::createFolds(target, k = number, list = TRUE, returnTrain = TRUE),
  is_class = is.factor(target) || is.character(target),
  is_binary = length(unique(target)) == 2L,
  ...
)
target |
the target variable. |
method |
the method to use for trainControl. |
number |
the number of folds to use. |
savePredictions |
the type of predictions to save. |
index |
the fold indexes to use. |
is_class |
logical, is this a classification or regression problem. |
is_binary |
logical, is this binary classification. |
... |
other arguments to pass to caret::trainControl |
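The explicit fold indexes mentioned above can be illustrated with a short base R sketch. The helper name `make_train_folds_sketch` is hypothetical and is not part of the package. Unlike caret::createFolds, which as far as I understand balances folds on the target, this sketch assigns rows to folds purely at random. It returns the training rows for each fold, which is the shape expected by the `index` argument.

```r
# Hypothetical helper: training-row indexes for k-fold CV, in base R only.
make_train_folds_sketch <- function(n, k = 5L) {
  # Assign each of the n rows to exactly one held-out fold.
  fold_id <- sample(rep_len(seq_len(k), n))
  # For each fold, the training set is every row NOT held out in that fold.
  folds <- lapply(seq_len(k), function(f) which(fold_id != f))
  names(folds) <- paste0("Fold", seq_len(k))
  folds
}

set.seed(1)
folds <- make_train_folds_sketch(23L, k = 5L)
lengths(folds)
```

Because every row is left out of exactly one training set, every observation lands in the test set exactly once, which is the property that makes CV convenient for stacking.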
caret defaults to Accuracy for classification and RMSE for regression. For classification, I would rather use ROC.
defaultMetric(is_class, is_binary)
is_class |
logical, is this a classification or regression problem. |
is_binary |
logical, is this binary classification. |
This is a function to make a dotplot from a caretStack. It uses dotplot from the caret package on all the models in the ensemble, excluding the final ensemble model. At the moment, this function only works if the ensembling model has the same number of resamples as the component models.
## S3 method for class 'caretStack'
dotplot(x, ...)
x |
An object of class caretStack |
... |
passed to dotplot |
set.seed(42)
models <- caretList(
  x = iris[1:100, 1:2],
  y = iris[1:100, 3],
  methodList = c("rpart", "glm")
)
meta_model <- caretStack(models, method = "lm")
lattice::dotplot(meta_model)
A generic function to extract cross-validated accuracy metrics from model objects.
extractMetric(x, ...)
x |
An object from which to extract metrics. The specific method will be dispatched based on the class of x. |
... |
Additional arguments passed to the specific methods. |
extractMetric.train, extractMetric.caretList, extractMetric.caretStack
Extract the cross-validated accuracy metrics from each model in a caretList.
## S3 method for class 'caretList'
extractMetric(x, ...)
x |
a caretList object |
... |
passed to extractMetric.train |
A data.table with metrics from each model.
Extract the cross-validated accuracy metrics from the ensemble model and individual models in a caretStack.
## S3 method for class 'caretStack'
extractMetric(x, ...)
x |
a caretStack object |
... |
passed to extractMetric.train and extractMetric.caretList |
A data.table with metrics from the ensemble model and individual models.
Extract the cross-validated accuracy metrics and their SDs from a caret train model.
## S3 method for class 'train'
extractMetric(x, metric = NULL, ...)
x |
a train object |
metric |
a character string representing the metric to extract. If NULL, uses the metric that was used to train the model. |
... |
ignored |
A numeric representing the desired metric.
Greedy optimization for minimizing the mean squared error. Works for classification and regression.
greedyMSE(X, Y, max_iter = 100L)
X |
A numeric matrix of features. |
Y |
A numeric matrix of target values. |
max_iter |
An integer scalar of the maximum number of iterations. |
A list with components:
model_weights |
A numeric matrix of model_weights. |
RMSE |
A numeric scalar of the root mean squared error. |
max_iter |
An integer scalar of the maximum number of iterations. |
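The following base R sketch shows one way a greedy, positive-only optimizer of this kind can work. It is an assumption modelled on Caruana-style ensemble selection with replacement, not the package's actual greedyMSE code, and the name `greedy_mse_sketch` is hypothetical. It also handles only a single target column. At each iteration it adds one more "vote" for whichever column most reduces the MSE of the averaged prediction, so the final weights are non-negative and sum to one.

```r
# Hypothetical sketch of greedy forward selection with replacement.
greedy_mse_sketch <- function(X, y, max_iter = 100L) {
  counts <- numeric(ncol(X))       # votes collected by each column of X
  running_sum <- numeric(nrow(X))  # sum of the predictions selected so far
  for (i in seq_len(max_iter)) {
    # MSE of the averaged ensemble if column j received one more vote.
    candidate_mse <- vapply(seq_len(ncol(X)), function(j) {
      mean(((running_sum + X[, j]) / i - y)^2)
    }, numeric(1L))
    best <- which.min(candidate_mse)
    counts[best] <- counts[best] + 1
    running_sum <- running_sum + X[, best]
  }
  weights <- counts / sum(counts)
  names(weights) <- colnames(X)
  list(
    model_weights = weights,
    RMSE = sqrt(mean((drop(X %*% weights) - y)^2))
  )
}

set.seed(42)
y <- rnorm(50)
X <- cbind(good = y + rnorm(50, sd = 0.1), noisy = y + rnorm(50, sd = 1))
fit <- greedy_mse_sketch(X, y)
fit$model_weights
```

Note that this sketch always runs for max_iter iterations and does not keep track of the best ensemble seen so far, so it does not by itself guarantee the "never worse than the best single model" property.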
caret interface for greedyMSE. greedyMSE works well when you want an ensemble that will never be worse than any single predictor in the dataset. It does not use an intercept and it does not allow for negative coefficients. This makes it highly constrained, and in general it does not work well on standard classification and regression problems. However, it does work well when:
* The predictors are highly correlated with each other
* The predictors are highly correlated with the model
* You expect or want positive only coefficients
In the worst case, this method will select one input and use that, but in many other cases it will return a positive, weighted average of the inputs. Since it never uses negative weights, you never get into a scenario where one model is weighted negative and on new data you get weird predictions because a correlation changed. Since this model will always be a positive weighted average of the inputs, it will rarely do worse than the individual models on new data.
greedyMSE_caret()
Permute each variable in a dataset and use the change in predictions to calculate the importance of each variable. Based on the scikit-learn implementation of permutation importance: https://scikit-learn.org/stable/modules/permutation_importance.html. However, we don't compare to the target by a metric. We JUST look at the change in the model's predictions, as measured by MAE (for classification, this is like using a Brier score). We shuffle each variable and compare the predictions before and after the shuffle. The difference, measured as MAE, is the importance of that variable. We normalize by computing the MAE of the shuffled original predictions as an upper bound on the MAE and dividing by this value. So if shuffling a variable makes the predictions as bad as shuffling the output predictions themselves, that variable has an importance of 100%. Similarly, as with regular permutation importance, a variable that, when shuffled, gives the same MAE as the original model has an importance of 0.
This method cannot yield negative importances. It is merely a measure of how much the model uses the variable, and does not tell you which variables help or hurt generalization. Use the model's cross-validated metrics to assess generalization.
permutationImportance(model, newdata, normalize = TRUE)
model |
A train object from the caret package. |
newdata |
A data.frame of new data to use to compute importances. Can be the training data. |
normalize |
A logical indicating whether to normalize the importances to sum to one. |
A named numeric vector of variable importances.
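The procedure described for permutationImportance can be sketched in base R as follows. This is not the package's code. The function name `permutation_importance_sketch` and its `predict_fun` argument are hypothetical, and the exact order of the scaling steps is an assumption based on the description above. It measures only how much the predictions move when a column is shuffled, and never looks at the target.

```r
# Hypothetical sketch: prediction-change permutation importance in base R.
permutation_importance_sketch <- function(predict_fun, newdata, normalize = TRUE) {
  baseline <- predict_fun(newdata)
  # Upper bound: MAE between the predictions and a shuffled copy of themselves.
  upper <- mean(abs(baseline - sample(baseline)))
  imp <- vapply(names(newdata), function(v) {
    shuffled <- newdata
    shuffled[[v]] <- sample(shuffled[[v]])
    # MAE between predictions on shuffled data and the original predictions.
    mean(abs(predict_fun(shuffled) - baseline)) / upper
  }, numeric(1L))
  if (normalize) imp <- imp / sum(imp)
  imp
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
set.seed(42)
imp <- permutation_importance_sketch(
  function(d) predict(fit, newdata = d),
  mtcars[, c("wt", "hp", "qsec")]
)
imp
```

Here qsec is not used by the model, so shuffling it leaves the predictions unchanged and its importance is zero, while wt and hp share the remaining importance.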
This function plots the performance of each model in a caretList object.
## S3 method for class 'caretList'
plot(x, metric = NULL, ...)
x |
a caretList object |
metric |
which metric to plot |
... |
ignored |
A ggplot2 object
This function plots the performance of each model in a caretStack object.
## S3 method for class 'caretStack'
plot(x, metric = NULL, ...)
x |
a caretStack object |
metric |
which metric to plot. If NULL, will use the default metric used to train the model. |
... |
ignored |
a ggplot2 object
Make a matrix of predictions from a list of caret models
## S3 method for class 'caretList'
predict(
  object,
  newdata = NULL,
  verbose = FALSE,
  excluded_class_id = 1L,
  aggregate_resamples = TRUE,
  ...
)
object |
an object of class caretList |
newdata |
New data for predictions. It can be NULL, but this is ill-advised. |
verbose |
Logical. If FALSE, no progress bar is printed; if TRUE, a progress bar is shown. Default FALSE. |
excluded_class_id |
Integer. The class id to drop when predicting for multiclass |
aggregate_resamples |
logical, whether to aggregate resamples by keys. Default is TRUE. |
... |
Other arguments to pass to |
Make predictions from a caretStack. This function passes the data to each function in turn to make a matrix of predictions, and then multiplies that matrix by the vector of weights to get a single, combined vector of predictions.
## S3 method for class 'caretStack'
predict(
  object,
  newdata = NULL,
  se = FALSE,
  level = 0.95,
  excluded_class_id = 0L,
  return_class_only = FALSE,
  verbose = FALSE,
  aggregate_resamples = TRUE,
  ...
)
object |
a caretStack object |
newdata |
a new dataframe to make predictions on |
se |
logical, should prediction errors be produced? Default is FALSE. |
level |
the tolerance/confidence level to use |
excluded_class_id |
Which class to exclude from predictions. Note that if the caretStack was trained with an excluded_class_id, that class is ALWAYS excluded from the predictions from the caretList of input models. excluded_class_id for predict.caretStack is for the final ensemble model. So different classes could be excluded from the caretList models and the final ensemble model. |
return_class_only |
a logical indicating whether to return only the class predictions as a factor. If TRUE, the return will be a factor rather than a data.table. This is a convenience function, and should not be widely used. For example if you have a downstream process that consumes the output of the model, you should have that process consume probabilities for each class. This will make it easier to change prediction probability thresholds if needed in the future. |
verbose |
a logical indicating whether to print progress |
aggregate_resamples |
logical, whether to aggregate resamples by keys. Default is TRUE. |
... |
arguments to pass to |
Prediction weights are defined as variable importance in the stacked caret model. This is not available for all cases such as where the library model predictions are transformed before being passed to the stacking model.
a data.table of predictions
models <- caretList(
  x = iris[1:100, 1:2],
  y = iris[1:100, 3],
  methodList = c("rpart", "glm")
)
meta_model <- caretStack(models, method = "lm")
RMSE(predict(meta_model, iris[101:150, 1:2]), iris[101:150, 3])
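For the simplest case described above, where the combined prediction is the matrix of component predictions multiplied by a weight vector, the arithmetic looks like the base R sketch below. The numbers and model names are made up for illustration. A real caretStack meta-model may be non-linear, in which case the combination is not a plain matrix product.

```r
# Made-up predictions from two component models for three observations.
preds <- cbind(
  rpart = c(1.0, 2.0, 3.0),
  glm = c(1.5, 2.5, 2.0)
)
# Made-up non-negative weights that sum to one.
weights <- c(rpart = 0.25, glm = 0.75)

# One combined prediction per row: a weighted average of the columns.
combined <- drop(preds %*% weights)
combined
```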
Predict method for greedyMSE objects.
## S3 method for class 'greedyMSE'
predict(object, newdata, return_labels = FALSE, ...)
object |
A greedyMSE object. |
newdata |
A numeric matrix of new data. |
return_labels |
A logical scalar of whether to return labels. |
... |
Additional arguments. Ignored. |
A numeric matrix of predictions.
This is a function to print a caretStack.
## S3 method for class 'caretStack'
print(x, ...)
x |
An object of class caretStack |
... |
ignored |
models <- caretList(
  x = iris[1:100, 1:2],
  y = iris[1:100, 3],
  methodList = c("rpart", "glm")
)
meta_model <- caretStack(models, method = "lm")
print(meta_model)
Print method for greedyMSE objects.
## S3 method for class 'greedyMSE'
print(x, ...)
x |
A greedyMSE object. |
... |
Additional arguments. Ignored. |
This is a function to print a summary.caretList.
## S3 method for class 'summary.caretList'
print(x, ...)
x |
An object of class summary.caretList |
... |
ignored |
This is a function to print a summary.caretStack.
## S3 method for class 'summary.caretStack'
print(x, ...)
x |
An object of class summary.caretStack |
... |
ignored |
This function summarizes the performance of each model in a caretList object.
## S3 method for class 'caretList'
summary(object, metric = NULL, ...)
object |
a caretList object |
metric |
The metric to show. If NULL, will use the metric used to train each model. |
... |
passed to extractMetric |
A data.table with metrics from each model.
This is a function to summarize a caretStack.
## S3 method for class 'caretStack'
summary(object, ...)
object |
An object of class caretStack |
... |
ignored |
models <- caretList(
  x = iris[1:100, 1:2],
  y = iris[1:100, 3],
  methodList = c("rpart", "glm")
)
meta_model <- caretStack(models, method = "lm")
summary(meta_model)
This function makes sure the tuning parameters passed by the user are valid and have the proper naming, etc.
tuneCheck(x)
x |
a list of user-supplied tuning parameters and methods |
This is a function to extract variable importance from a caretStack.
## S3 method for class 'caretStack'
varImp(object, newdata = NULL, normalize = TRUE, ...)
object |
An object of class caretStack |
newdata |
the data to use for computing importance. If NULL, will use the stacked predictions from the models |
normalize |
a logical indicating whether to normalize the importances to sum to one. |
... |
passed to predict.caretList |
Variable importance for a greedyMSE model.
## S3 method for class 'greedyMSE'
varImp(object, ...)
object |
A greedyMSE object. |
... |
Additional arguments. Ignored. |
Used to weight deviations among ensembled model predictions
wtd.sd(x, w, na.rm = FALSE)
x |
a numeric vector |
w |
a vector of weights equal to length of x |
na.rm |
a logical indicating how to handle missing values, default = FALSE |
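As an illustration of what a weighted standard deviation computes, here is a minimal base R sketch. The name `weighted_sd_sketch` is hypothetical, and the choice of denominator (the sum of the weights, i.e. a population-style estimate) is an assumption. The package's wtd.sd may use a different correction, so the values need not match exactly.

```r
# Hypothetical sketch of a weighted standard deviation in base R.
weighted_sd_sketch <- function(x, w, na.rm = FALSE) {
  stopifnot(length(x) == length(w))
  if (na.rm) {
    # Drop pairs where either the value or its weight is missing.
    keep <- !is.na(x) & !is.na(w)
    x <- x[keep]
    w <- w[keep]
  }
  mu <- sum(w * x) / sum(w)            # weighted mean
  sqrt(sum(w * (x - mu)^2) / sum(w))   # weighted root mean squared deviation
}

# With equal weights this reduces to the population standard deviation.
weighted_sd_sketch(c(1, 2, 3, 4), w = c(1, 1, 1, 1))
```

Giving one value all of the weight yields a weighted SD of zero, which matches the intuition that down-weighted models contribute little to the measured disagreement.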