CRAN Task View: Sports Analytics
|Maintainer:||Benjamin S. Baumer, Quang Nguyen, Gregory J. Matthews|
|Contact:||ben.baumer at gmail.com|
|Contributions:||Suggestions and improvements for this task view are very welcome and can be made through issues or pull requests on GitHub or via e-mail to the maintainer address. For further details see the Contributing guide.|
|Citation:||Benjamin S. Baumer, Quang Nguyen, Gregory J. Matthews (2023). CRAN Task View: Sports Analytics. Version 2023-04-06. URL https://CRAN.R-project.org/view=SportsAnalytics.|
|Installation:||The packages from this task view can be installed automatically using the ctv package. For example, |
ctv::install.views("SportsAnalytics", coreOnly = TRUE) installs all the core packages or
ctv::update.views("SportsAnalytics") installs all packages that are not yet installed and up-to-date. See the CRAN Task View Initiative for more details.
This CRAN Task View contains a list of packages useful for sports analytics. Most of the packages are sport-specific and are grouped as such. However, we also include a General section for packages that provide ancillary functionality relevant to sports analytics (e.g., team-themed color palettes), and a Modeling section for packages useful for statistical modeling. Throughout the task view, and collected in the Related links section at the end, we have included a list of selected books and articles that use some of these packages in substantive ways. Our goal in compiling this list is to help researchers find the tools they need to complete their work in R.
To be considered for inclusion, the package must be useful for conducting sports analytics. Most packages provide functionality for some combination of:
- acquiring data for a specific sport or league
- performing common computations on sport-specific data
Esports and sports betting packages are within scope.
The list of packages is aspirationally comprehensive. If there is a sports analytics package on CRAN that we have missed, please let us know. Contributions are always welcome, and encouraged – please see the linked GitHub repository for details.
- nflverse is a collection of packages for obtaining and analyzing NFL data. The core nflverse includes nflfastR, nflseedR, nfl4th, nflreadr, and nflplotR.
- nflfastR contains functions to efficiently scrape NFL play-by-play data from 1999 to present. It is similar to nflscrapR, but much faster. All models required by nflfastR are hosted in fastrmodels.
- nflreadr efficiently downloads data from GitHub repositories of the nflverse project, including pre-computed nflfastR data frames.
- nfl4th consists of functions to calculate optimal Fourth Down decisions in the National Football League. Data on 4th downs is collected from NFL and ESPN.
- nflseedR contains functions for ranking NFL teams based on the complex NFL tie breaking rules. It includes division ranking, playoff seeding, and draft order.
- nflplotR includes functions for making NFL data visualization in ggplot2 easier.
- NFLSimulatoR consists of tools for simulating plays and drives and for evaluating in-game strategies in the NFL.
- fflr provides functions to access raw fantasy football data from the ESPN fantasy football API and to format that data.
- ffscrapr helps access various fantasy football APIs including MFL, Sleeper, ESPN, and Fleaflicker with a consistent interface and built-in authentication, rate-limiting, and caching.
- ffsimulator allows users to simulate fantasy football seasons using bootstrap resampling. Simulations are based on historical rankings and data from the package nflfastR. In addition, functions for computing optimal lineups and aggregating results are provided.
- gsisdecoder contains functions to decode NFL Player IDs for use in conjunction with the nflfastR package.
- cfbfastR provides functions for accessing college football play-by-play data from collegefootballdata.com.
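The bootstrap resampling idea mentioned for ffsimulator above can be illustrated in base R. This is only a minimal sketch under stated assumptions: the weekly scores, the 14-week season length, and the names `team_a`, `team_b`, and `sim_wins_a` are all hypothetical, and the code does not reproduce ffsimulator's actual algorithm or interface.

```r
set.seed(2023)

# Hypothetical weekly point totals for two fantasy teams (made-up numbers)
team_a <- c(112, 98, 130, 105, 121, 90, 117, 109)
team_b <- c(101, 125, 95, 110, 99, 118, 104, 107)

n_sims  <- 5000   # number of simulated seasons
n_weeks <- 14     # assumed season length

# For each simulated season, resample weekly scores with replacement
# and count the weeks in which team A outscores team B
sim_wins_a <- replicate(n_sims, {
  a <- sample(team_a, n_weeks, replace = TRUE)
  b <- sample(team_b, n_weeks, replace = TRUE)
  sum(a > b)
})

# Summarize the simulated distribution of team A's head-to-head wins
mean(sim_wins_a)
quantile(sim_wins_a, c(0.05, 0.95))
```

Resampling with replacement treats the observed weeks as a stand-in for the unknown score distribution, so the spread of `sim_wins_a` gives a rough sense of how much a season's outcome could vary by chance.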
- worldfootballR provides clean and tidy football data from a number of popular sites, including FBref, transfer and valuation data from Transfermarkt, and shooting location data from Understat and fotmob.
- socceR provides functions for evaluating soccer predictions and simulating results from soccer matches and tournaments.
- ggsoccer provides functions for visualizing soccer event data in ggplot2.
- footballpenaltiesBL contains data and plotting functions for analyzing penalty kicks in the German Men’s Bundesliga from 1963-64 to 2016-17.
- footBayes consists of functions for fitting widely known soccer models (double Poisson, bivariate Poisson, Skellam, Student’s t) through Hamiltonian Monte Carlo and Maximum Likelihood estimation approaches using Stan. The package also provides tools for visualizing team strengths and predicting match outcomes.
- itscalledsoccer enables access to American soccer (MLS, NWSL, and USL) data through the American Soccer Analysis app API.
- FPLdata contains functions for retrieving player attributes from Fantasy Premier League.
- EUfootball provides European football match results for top leagues in England, France, Germany, Italy, Spain, Netherlands, and Turkey from 2010-2011 to 2019-2020.
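Several of the packages above simulate or model match scores with Poisson distributions. As a rough illustration of the simplest such model (two independent Poisson goal counts, sometimes called a double Poisson model), the following base R sketch estimates outcome probabilities by simulation. The scoring rates `lambda_home` and `lambda_away` are hypothetical values chosen for illustration, and the code does not use the API of socceR, footBayes, or any other package listed here.

```r
set.seed(1)

n_sims      <- 10000
lambda_home <- 1.6   # hypothetical expected goals for the home team
lambda_away <- 1.1   # hypothetical expected goals for the away team

# Draw independent Poisson goal counts for each simulated match
home_goals <- rpois(n_sims, lambda_home)
away_goals <- rpois(n_sims, lambda_away)

# Classify each simulated match as a home win, away win, or draw
outcome <- ifelse(home_goals > away_goals, "home",
                  ifelse(home_goals < away_goals, "away", "draw"))

# Estimated outcome probabilities
probs <- prop.table(table(outcome))
probs
```

In practice the two rates would be estimated from data (for example, from team attack and defense strengths) rather than fixed by hand.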
- Historical baseball data is available through the Lahman package, which contains season-level data for Major League Baseball going back to 1871.
- retrosheet facilitates downloading game logs, team IDs, rosters, play-by-play files, and other files from Retrosheet.org, and returns the results as data frames. Local caching can be employed to improve efficiency. Note that the play-by-play data returned comes directly from the event files and is not parsed (i.e., Chadwick is not bundled).
- pitchRx (archived) provides access to pitch-level data through the Major League Baseball Advanced Media API. The package is featured prominently in Marchi, M., Albert, J., and Baumer, B. S. (2018). Analyzing baseball data with R (doi:10.1201/9781351107099). For a full description of the package see Sievert, C. (2014). Taming PITCHf/x Data with XML2R and pitchRx (doi:10.32614/RJ-2014-001).
- mlbstats provides functions for vector-based computation of many baseball statistics, both traditional and sabermetric.
- baseballDBR leverages the backend database functionality of dplyr to build local databases that mirror the data contained in Lahman. Like mlbstats, it also includes functions to compute baseball statistics, but on data frames rather than vectors.
- baseballr (archived) consists of functions for extracting and analyzing baseball data from various sources such as Baseball Reference, FanGraphs, and Baseball Savant.
- chess is an opinionated wrapper for R around python-chess. It reads and writes PGN files and SVGs of game boards.
- Like chess, bigchess reads and writes PGN files. bigchess provides an API to the UCI chess engines. bigchess is also able to read multiple game files at once without copying to RAM.
- rchess (archived) provides functions for chess move validation, piece movement, check detection, and plotting chess boards.
- yorkr provides functions for analyzing statistics of cricket players and teams based on Cricsheet data.
- cricketr is a collection of tools for analyzing cricket performances of players and teams based on ESPN Cricinfo Statsguru data.
- cricketdata includes functions to obtain international cricket data from two major sources, ESPNCricinfo and Cricsheet.
- howzatR consists of functions for calculating various cricket statistics.
GPS Tracking 📍
- trackeR and trackeRapp provide tools for analyzing running, cycling and swimming data from GPS-enabled tracking devices within R. These two packages allow users to tidy and explore data from workouts and competitions.
- rStrava contains functions to access Strava activity data from the Strava API.
- A detailed overview of tools for processing and analyzing tracking data can be found in the Tracking CRAN Task View.
- NHLData contains scores from NHL games dating back to 1917. Data are stored one season at a time, and each data set contains the scores for every game during that season.
- Access to data exposed by the NHL API is provided by the nhlapi and nhlscrape (archived) packages.
- fastRhockey provides API wrappers for the NHL and Premier Hockey Federation (PHF), formerly known as the National Women’s Hockey League (NWHL).
- runexp provides methods for estimating runs scored in softball. In particular, runexp centers around theoretical expectation using discrete Markov chains and empirical distribution using multinomial random simulation.
- SwimmeR reads swimming results in a variety of formats and returns them as a tidy data frame. It also includes functions for converting times between short-course yards (SCY), short-course meters (SCM), and long-course meters (LCM).
Track and Field 🏃
- teamcolors provides color palettes, ggplot2 themes, xaringan themes, and logos for professional teams across a variety of sports and leagues. teamcolors was originally designed to create the data graphics in Lopez et al. (2018) (doi:10.1214/18-AOAS1165).
- colorr contains color palettes for professional sports teams in the EPL, MLB, NBA, NHL, and NFL.
- nbapalettes contains color palettes inspired by NBA team jersey colors.
- sleeperapi offers functions for gathering data from the Sleeper API for fantasy sports.
- sportyR contains functions for creating ggplot2 representations of sports playing surfaces pursuant to rule-book specifications. This is particularly useful for plotting player tracking data.
- SportsTour provides functions for displaying tournament fixtures using knock-out and round robin methods.
- injurytools provides functionality for analyzing, visualizing, and modeling sports injuries.
A wide array of functions for modeling in sports analytics is available in base R (e.g., glm()). In addition, other CRAN Task Views such as Bayesian, MachineLearning, Robust, Spatial, and SpatioTemporal may contain appropriate packages for applying statistical methods to sports.
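As a small illustration of modeling with glm(), the sketch below fits a logistic regression for the probability of a home win as a function of a team rating difference. Everything about the data is an assumption made for the example: the games are simulated, and the variable names (`rating_diff`, `home_win`) and the slope used to generate outcomes are hypothetical.

```r
set.seed(42)

# Simulate toy data: home rating minus away rating, and a binary outcome
n_games     <- 500
rating_diff <- rnorm(n_games, mean = 0, sd = 1)
true_slope  <- 1.2   # assumed effect, used only to generate the toy data
home_win    <- rbinom(n_games, size = 1,
                      prob = plogis(true_slope * rating_diff))
games <- data.frame(home_win, rating_diff)

# Logistic regression: log-odds of a home win as a linear function
# of the rating difference
fit <- glm(home_win ~ rating_diff, data = games, family = binomial())
coef(fit)

# Predicted home-win probability for a hypothetical matchup
predict(fit, newdata = data.frame(rating_diff = 0.5), type = "response")
```

The same pattern extends to other families, such as `poisson()` for goal or run counts.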
- oddsapiR provides tools for accessing sports odds from The Odds API.
- odds.converter contributes functions for converting common sports betting odds types, including US odds, Hong Kong odds, Decimal odds, Indonesian odds, Malaysian odds, and raw probability.
- implied is a collection of functions that convert between bookmaker odds and probabilities, based on various algorithms.
- pinnacle.data contains Pinnacle market odds, highlighted by a dataset of all wagering lines for the 2016 MLB season.
- RKelly computes the Kelly criterion for betting and provides functions to calculate outcome probabilities for multi-leg contests.
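The odds conversions and the Kelly criterion mentioned above rest on short formulas, sketched here in base R. The helper names `us_to_decimal()`, `implied_prob()`, and `kelly_fraction()` are hypothetical and are not the functions exported by odds.converter, implied, or RKelly. The proportional rescaling used to remove the bookmaker margin is only one of several possible methods, and the odds and win probability are made-up inputs.

```r
# Convert American (US) odds to decimal odds
us_to_decimal <- function(us) {
  ifelse(us > 0, 1 + us / 100, 1 + 100 / abs(us))
}

# Implied probabilities from decimal odds, rescaled to sum to 1
# (simple proportional normalization of the bookmaker margin)
implied_prob <- function(decimal_odds) {
  raw <- 1 / decimal_odds
  raw / sum(raw)
}

# Kelly fraction of bankroll for a single bet with win probability p
kelly_fraction <- function(p, decimal_odds) {
  b <- decimal_odds - 1        # net payout per unit staked
  f <- (b * p - (1 - p)) / b
  pmax(f, 0)                   # stake nothing when the edge is negative
}

odds <- us_to_decimal(c(-150, 130))   # hypothetical two-way market
odds
implied_prob(odds)
kelly_fraction(p = 0.45, decimal_odds = odds[2])
```

The Kelly fraction is positive only when the bettor's own probability estimate exceeds the break-even probability implied by the odds, which is why `kelly_fraction()` floors the result at zero.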
|Core:||BAwiR, BradleyTerry2, Lahman, nflverse.|
|Regular:||AdvancedBasketballStats, baseballDBR, BasketballAnalyzeR, bigchess, cfbfastR, chess, colorr, combinedevents, cricketdata, cricketr, CSGo, elo, EloChoice, EloRating, EUfootball, fastRhockey, fastrmodels, fflr, ffscrapr, ffsimulator, fitzRoy, footballpenaltiesBL, footBayes, FPLdata, ggsoccer, gsisdecoder, hoopR, howzatR, implied, injurytools, itscalledsoccer, mlbstats, mvglmmRank, NBAloveR, nbapalettes, nfl4th, nflfastR, nflplotR, nflreadr, nflseedR, NFLSimulatoR, nhlapi, NHLData, odds.converter, oddsapiR, opendotaR, pinnacle.data, piratings, PlayerRatings, rbedrock, RDota2, retrosheet, RKelly, ROpenDota, rStrava, runexp, sleeperapi, socceR, SportsTour, sportyR, SwimmeR, teamcolors, trackeR, trackeRapp, uncmbb, volleystat, wehoop, welo, worldfootballR, yorkr.|
|Archived:||baseballr, nhlscrape, pitchRx, rchess.|
- Lopez, M. J., Matthews, G. J., and Baumer, B. S. (2018). How often does the best team win? A unified approach to understanding randomness in North American sport. The Annals of Applied Statistics, 12(4), 2483-2516.
- Constantinou, A. C., Fenton, N. E., and Neil, M. (2013). Profiting from an inefficient Association Football gambling market: Prediction, Risk and Uncertainty using Bayesian networks. Knowledge-Based Systems, 50, 60-86.
- Zuccolotto, P., and Manisera, M. (2020). Basketball data science: with applications in R. CRC Press.
- Marchi, M., Albert, J., and Baumer, B. S. (2018). Analyzing baseball data with R. 2nd edition. Chapman and Hall/CRC.
- Sievert, C. (2014). Taming PITCHf/x Data with XML2R and pitchRx. R Journal, 6(1).
- Bradley, R. A., and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324-345.