|Contact:||Torsten.Hothorn at R-project.org|
|Contributions:||Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide.|
|Citation:||Torsten Hothorn (2023). CRAN Task View: Machine Learning & Statistical Learning. Version 2023-07-20. URL https://CRAN.R-project.org/view=MachineLearning.|
|Installation:||The packages from this task view can be installed automatically using the ctv package. For example, |
Several add-on packages implement ideas and methods developed at the borderline between computer science and statistics - this field of research is usually referred to as machine learning. The packages can be roughly structured into the following topics:
Recursive Partitioning : Tree-structured models for regression, classification and survival analysis, following the ideas in the CART book, are implemented in rpart (shipped with base R) and tree. Package rpart is recommended for computing CART-like trees. A rich toolbox of partitioning algorithms is available in Weka, package RWeka provides an interface to this implementation, including the J4.8-variant of C4.5 and M5. The Cubist package fits rule-based models (similar to trees) with linear regression models in the terminal leaves, instance-based corrections and boosting. The C50 package can fit C5.0 classification trees, rule-based models, and boosted versions of these. pre can fit rule-based models for a wider range of response variable types.
Two recursive partitioning algorithms with unbiased variable selection and statistical stopping criterion are implemented in package party and partykit. Function
ctree() is based on non-parametric conditional inference procedures for testing independence between response and each input variable whereas
mob() can be used to partition parametric models. Extensible tools for visualizing binary trees and node distributions of the response are available in package party and partykit as well. Partitioning of mixed-effects models (GLMMs) can be performed with package glmertree; partitioning of structural equation models (SEMs) can be performed with package semtree. Graphical tools for the visualization of trees are available in package maptree.
Partitioning of mixture models is performed by RPMM.
Computational infrastructure for representing trees and unified methods for prediction and visualization is implemented in partykit. This infrastructure is used by package evtree to implement evolutionary learning of globally optimal trees. Survival trees are available in various packages.
Trees for subgroup identification with respect to heterogenuous treatment effects are available in packages partykit, model4you, dipm, quint,
pkg("MrSGUIDE") (and probably many more).
svm()from e1071 offers an interface to the LIBSVM library and package kernlab implements a flexible framework for kernel learning (including SVMs, RVMs and other kernel learning algorithms). An interface to the SVMlight implementation (only for one-against-all classification) is provided in package klaR.
tune()for hyper parameter tuning and function
errorest()(ipred) can be used for error rate estimation. The cost parameter C for support vector machines can be chosen utilizing the functionality of package svmpath. Data splitting for crossvalidation and other resampling schemes is available in the splitTools package. Package nestedcv provides nested cross-validation for glmnet and caret models. Functions for ROC analysis and other visualisation techniques for comparing candidate classifiers are available from package ROCR. Packages hdi and stabs implement stability selection for a range of models, hdi also offers other inference procedures in high-dimensional models.
stats::termplot()function package can be used to plot the terms in a model whose predict method supports
type="terms". The effects package provides graphical and tabular effect displays for models with a linear predictor (e.g., linear and generalized linear models). Friedman’s partial dependence plots (PDPs), that are low dimensional graphical renderings of the prediction function, are implemented in a few packages. gbm, randomForest and randomForestSRC provide their own functions for displaying PDPs, but are limited to the models fit with those packages (the function
partialPlotfrom randomForest is more limited since it only allows for one predictor at a time). Packages pdp, plotmo, and ICEbox are more general and allow for the creation of PDPs for a wide variety of machine learning models (e.g., random forests, support vector machines, etc.); both pdp and plotmo support multivariate displays (plotmo is limited to two predictors while pdp uses trellis graphics to display PDPs involving three predictors). By default, plotmo fixes the background variables at their medians (or first level for factors) which is faster than constructing PDPs but incorporates less information. ICEbox focuses on constructing individual conditional expectation (ICE) curves, a refinement over Friedman’s PDPs. ICE curves, as well as centered ICE curves can also be constructed with the
partial()function from the pdp package.
XAI : Most packages and functions from the last section “Visualization” belong to the field of explainable artificial intelligence (XAI). The meta packages DALEX and iml offer different methods to interpret any model, including partial dependence, accumulated local effects, and permutation importance. Accumulated local effects plots are also directly available in ALEPlot. SHAP (from SHapley Additive exPlanations) is one of the most frequently used techniques to interpret ML models. It decomposes - in a fair way - predictions into additive contributions of the predictors. For tree-based models, the very fast TreeSHAP algorithm exists. It is shipped directly with h2o, xgboost, and lightgbm. Model-agnostic implementations of SHAP are available in additional packages: fastshap mainly uses Monte-Carlo sampling to approximate SHAP values, while shapr and kernelshap provide implementations of KernelSHAP. SHAP values of any of these packages can be plotted by the package shapviz. A port to Python’s “shap” package is provided in shapper. Alternative decompositions of predictions are implemented in lime and iBreakDown.
|Core:||abess, e1071, gbm, kernlab, mboost, nnet, randomForest, rpart.|
|Regular:||adabag, ahaz, ALEPlot, arules, BART, bartMachine, BayesTree, BDgraph, Boruta, bst, C50, caret, CORElearn, Cubist, DALEX, deepnet, dipm, DoubleML, earth, effects, elasticnet, evclass, evreg, evtree, fastshap, frbs, gamboostLSS, glmertree, glmnet, glmpath, GMMBoost, grf, grplasso, grpreg, h2o, hda, hdi, hdm, iBreakDown, ICEbox, iml, ipred, islasso, joinet, kernelshap, klaR, lars, LiblineaR, lightgbm, lime, maptree, mlpack, mlr3, model4you, mpath, naivebayes, ncvreg, nestedcv, OneR, opusminer, pamr, party, partykit, pdp, penalized, penalizedLDA, picasso, plotmo, pre, quantregForest, quint, randomForestSRC, ranger, Rborist, rgenoud, RGF, RLT, Rmalschains, rminer, ROCR, RoughSets, RPMM, RSNNS, RWeka, RXshrink, sda, semtree, shapper, shapr, shapviz, SIS, splitTools, ssgraph, stabs, SuperLearner, svmpath, tensorflow, tgp, tidymodels, torch, tree, trtf, varSelRF, wsrf, xgboost.|