Thank you for your interest in contributing to baseballr! This package provides clean, tidy baseball data from a range of public sources – the MLB Stats API, FanGraphs, Baseball Reference, Baseball Savant (Statcast), the NCAA baseball stats site, Spotrac, the Chadwick Bureau register, and Retrosheet. Contributions of all kinds are welcome: bug reports, new endpoint wrappers, documentation fixes, and tests.
This document covers how to get set up, the conventions the codebase follows, and what a reviewable pull request looks like. When this guide differs from the current repository docs, treat CLAUDE.md and the current test implementations as authoritative.
Code of Conduct
This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. Report unacceptable behavior to the maintainer at saiem.gilani@gmail.com.
Development Setup
- Fork the repository (BillPetti/baseballr) to your own account on GitHub.
- Clone your fork locally.
- Install development dependencies from within R:
    # install.packages("devtools")
    devtools::install_dev_deps()
- Branch from master. master is the default and release branch. Active development is currently staged on development_branch; ask the maintainer which base to branch from if you are unsure, but the safe default is the latest master.
- Confirm the package builds and checks before you start:
    devtools::load_all()
    devtools::document()
    devtools::check()
baseballr requires R (>= 4.1.0) because the codebase uses the native pipe (|>).
Workflow
Build & Development Commands
# Regenerate roxygen documentation + NAMESPACE
devtools::document()
# Run all tests
devtools::test()
# Run a specific test file
testthat::test_file("tests/testthat/test-mlb_schedule.R")
# Full R CMD check
devtools::check()
# Install locally
devtools::install()
# Build pkgdown site locally
pkgdown::build_site()
Making Changes
- Make your changes on a feature branch.
- Keep code, tests, and roxygen/doc updates in the same PR when you change exported behavior.
- Run devtools::document() whenever you touch roxygen comments, add a function, or change a signature.
- Run devtools::test() and devtools::check() before opening the PR.
Adding a New Endpoint Wrapper
- Place the function in the appropriate R/ file for its data source (see the prefix table below), or create a new R/<source>_<topic>.R file if it does not fit an existing one.
- Follow the function pattern documented in CLAUDE.md: initialize the return variable before the tryCatch, parse the payload, run it through the janitor::clean_names() + make_baseballr_data() pipeline, and emit cli messages from the error/warning handlers.
- Add a roxygen block with @title, @param, @return (with a column table), @export, @family, and a runnable @examples block.
- Add a test in tests/testthat/ using the subset-direction column assertions.
- Run devtools::document() to regenerate man/ and NAMESPACE.
- Add a NEWS.md bullet and confirm the function is picked up by the _pkgdown.yml reference index.
Naming Conventions
Functions are named by their data source. New wrappers must use the matching prefix so the pkgdown starts_with() selectors pick them up automatically.
| Data Source | Prefix | Example |
|---|---|---|
| MLB Stats API | mlb_ | mlb_schedule(), mlb_pbp() |
| FanGraphs | fg_ | fg_batter_leaders(), fg_team_pitcher() |
| Baseball Reference | bref_ | bref_daily_batter(), bref_team_results() |
| Baseball Savant / Statcast | statcast_ / sc_ | statcast_search(), statcast_leaderboards() |
| NCAA baseball | ncaa_ | ncaa_schedule_info(), ncaa_roster() |
| Spotrac | sptrc_ | sptrc_team_active_payroll() |
| Chadwick Bureau register | chadwick_ | chadwick_player_lu() |
| Retrosheet | retrosheet_ | retrosheet_data() |
| Metrics | metrics_ (family) | woba_plus(), fip_plus() |
| Data loaders | load_ | load_ncaa_baseball_pbp(), load_umpire_ids() |
| Visualizations | (named) | ggspraychart() |
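As a rough illustration of how prefix-based selection behaves, the base R sketch below checks candidate function names against the prefixes from the table. The helper matches_source_prefix() and the second candidate name are hypothetical; the actual starts_with() selectors live in _pkgdown.yml and are not reproduced here.

```r
# Prefixes taken from the naming table above ("metrics_" is a family label,
# not a name prefix, so it is left out).
source_prefixes <- c("mlb_", "fg_", "bref_", "statcast_", "sc_", "ncaa_",
                     "sptrc_", "chadwick_", "retrosheet_", "load_")

# Hypothetical helper: TRUE when a function name starts with a known prefix.
matches_source_prefix <- function(fn_name, prefixes = source_prefixes) {
  any(startsWith(fn_name, prefixes))
}

matches_source_prefix("mlb_schedule")      # TRUE
matches_source_prefix("schedule_for_mlb")  # FALSE: prefix is not at the start
```

A name that fails this kind of check would need an explicit entry in the reference index rather than being picked up automatically.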
General Naming Rules
- Use the data-source prefix for every new exported wrapper.
- Use snake_case for function and argument names.
- Function default arguments should be a single value (e.g. output = "default"), with the valid choices documented and validated inside the function body – not a match.arg-style c(...) choice vector in the signature.
Data Processing Pipeline
Wrappers should funnel their parsed payload through the standard pipeline so the return object carries the baseballr_data class and metadata attributes:
raw_data |>
  janitor::clean_names() |>
  make_baseballr_data("Description of the data from <source>", Sys.time())
make_baseballr_data() sets the class to c("baseballr_data", "tbl_df", "tbl", "data.table", "data.frame") and attaches provenance attributes.
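To make the class-and-attributes idea concrete, here is a base R sketch of a stand-in constructor. It is not the package's make_baseballr_data(): the function name tag_source_data(), the class name tagged_source_data, and the attribute names data_type and data_timestamp are all assumptions chosen for illustration.

```r
# Hypothetical stand-in for a tagging constructor.
tag_source_data <- function(df, type, timestamp) {
  out <- as.data.frame(df)
  # Prepend a custom class; "data.frame" stays last so base methods still apply.
  class(out) <- c("tagged_source_data", "data.frame")
  attr(out, "data_type") <- type            # hypothetical attribute name
  attr(out, "data_timestamp") <- timestamp  # hypothetical attribute name
  out
}

raw_data <- data.frame(player_id = 1:2, hits = c(3L, 1L))
tagged <- tag_source_data(raw_data, "Example data from <source>", Sys.time())

inherits(tagged, "data.frame")  # TRUE
attr(tagged, "data_type")       # "Example data from <source>"
```

Because the object still inherits from data.frame, downstream code that expects a plain data frame keeps working while the extra attributes record where the data came from and when.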
Roxygen Documentation Checklist
- @title / @description
- @param for every argument (including ...)
- @return with a column-name/type table where applicable
- @export
- @family <source> Functions
- a runnable @examples block (wrap live-site calls in \donttest{} so R CMD check does not hit the network during routine checking)
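A roxygen block covering the checklist might look like the sketch below. The function example_endpoint(), its columns, and the family name are hypothetical, and the markdown table in @return assumes roxygen markdown support is enabled for the package.

```r
#' Example endpoint wrapper (hypothetical)
#'
#' @description Illustrates the roxygen fields from the checklist above.
#'
#' @param game_pk Integer game identifier.
#' @param ... Currently unused.
#' @return A data frame with the following columns:
#'
#'   |col_name |types     |
#'   |:--------|:---------|
#'   |game_pk  |integer   |
#'   |status   |character |
#'
#' @export
#' @family Example Functions
#' @examples
#' \donttest{
#'   example_endpoint(game_pk = 632970)
#' }
example_endpoint <- function(game_pk, ...) {
  # Placeholder body: a real wrapper would call the upstream source here.
  data.frame(game_pk = as.integer(game_pk), status = "placeholder")
}
```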
Coding Conventions
The authoritative reference is CLAUDE.md. Key rules:
- Native pipe |> everywhere. magrittr (%>%) is retained as a dependency only because some NCAA wrappers still use it; do not introduce new %>% usage.
- Return-value initialization (CRITICAL). Every wrapper that returns a variable assigned inside a tryCatch must initialize that variable (usually <- NULL) before the tryCatch. Otherwise an API error runs the error handler, the return variable is never bound, and return(<var>) throws object '<var>' not found instead of returning an empty value with a message.
- Messaging via cli. Use cli::cli_alert_danger() in error handlers, cli::cli_alert_warning() and cli::cli_alert_info() for warnings/info, and cli::cli_warn() / cli::cli_abort() for raised conditions. Never pass a raw condition object to a cli_* call (the message is glue-interpolated); pass conditionMessage(cond) through a value placeholder.
- Column-drift resilience. Upstream sites add and occasionally rename columns. When dropping a known-transient column, use dplyr::select(-dplyr::any_of("colname")) rather than the bare form so a schema change is survivable.
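The return-value initialization and messaging rules can be sketched together. This is a minimal illustration under stated assumptions, not the pattern from CLAUDE.md verbatim: hypothetical_wrapper(), report_failure(), and the fetch argument are invented names, and the sketch falls back to base message() so it runs even when cli is not installed.

```r
# Hypothetical reporter: uses cli when available, base message() otherwise.
report_failure <- function(msg) {
  if (requireNamespace("cli", quietly = TRUE)) {
    # The text goes through a value placeholder, so it is inserted as a value
    # rather than being parsed as a glue template itself.
    cli::cli_alert_danger("{msg}")
  } else {
    message(msg)
  }
}

# Hypothetical wrapper: `fetch` stands in for the upstream request + parsing.
hypothetical_wrapper <- function(fetch) {
  result <- NULL  # initialized BEFORE tryCatch so return() always has a binding
  tryCatch(
    expr = {
      result <- fetch()
    },
    error = function(e) {
      # Pass the message text, never the raw condition object.
      report_failure(conditionMessage(e))
    }
  )
  return(result)
}

ok  <- hypothetical_wrapper(function() data.frame(game_pk = 632970L))
bad <- hypothetical_wrapper(function() stop("upstream returned {invalid} payload"))
is.null(bad)  # TRUE: the failure is reported and NULL is returned
```

Removing the result <- NULL line makes the failing call end in an "object 'result' not found" error, which is the failure mode the rule guards against.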
Testing
Tests live in tests/testthat/. Many of them hit live sites and are gated / skipped so they do not run during routine R CMD check or on CI.
Live-API and NCAA caveats
- Live-API tests are gated. Network-dependent tests skip on CRAN and CI and should guard against empty/transient responses. Do not assume the network is available in a check run.
- The NCAA stats site aggressively IP-bans scrapers. Tests and development work that hit NCAA endpoints (ncaa_*) must be done sparingly and cached wherever possible. Do not run NCAA tests in a tight loop, and do not add tests that hammer NCAA endpoints. A single careless test run can get your IP (or a CI runner’s IP) banned. When iterating on NCAA wrappers, save a sample payload locally and develop against the cached fixture.
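One way to develop against a cached fixture is sketched below in base R. The helper cached_payload(), the fixture file name, and fake_fetch() are hypothetical; fake_fetch() stands in for a live ncaa_* call so the sketch never touches the network.

```r
# Hypothetical caching helper: call `fetch()` only when no fixture exists yet.
cached_payload <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the saved payload, no request is made
  }
  payload <- fetch()
  saveRDS(payload, path)
  payload
}

# Stand-in for a live NCAA call; counts how many times it is invoked.
n_requests <- 0
fake_fetch <- function() {
  n_requests <<- n_requests + 1
  data.frame(team_id = 1L, team_name = "Example")
}

fixture <- file.path(tempdir(), "ncaa_sample_payload.rds")
unlink(fixture)  # start from a clean slate for the demo

first  <- cached_payload(fixture, fake_fetch)
second <- cached_payload(fixture, fake_fetch)
n_requests  # 1: the second call was served from the fixture
```

During real development the fixture would live somewhere more durable than tempdir(), so that restarting the R session does not trigger another request.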
Test Pattern
Always use the subset direction for column assertions. Because the upstream sources add columns over time, a strict expect_equal on the full column set will break the moment a non-breaking column is added. The rule is: the expected column list must be a subset of the actual columns.
test_that("mlb_function returns expected columns", {
  skip_on_cran()
  skip_on_ci()
  x <- mlb_function(game_pk = 632970)
  # Skip-if-empty guard -- always right after the API call, before any
  # assertion that touches the data. Handles transient errors and outages.
  if (is.null(x) || !is.data.frame(x) || nrow(x) == 0) {
    skip("No rows returned from endpoint at test time")
  }
  cols_x <- c("col1", "col2")
  expect_true(all(cols_x %in% colnames(x))) # expected subset of actual
  expect_s3_class(x, "data.frame")
})
Anti-patterns to avoid:
# WRONG -- flags when upstream adds a column, even though it is non-breaking
expect_equal(sort(colnames(x)), sort(cols_x))
# WRONG -- same direction problem, just phrased with expect_in
expect_in(sort(colnames(x)), sort(cols_x))
For list-returning endpoints that sometimes return fewer elements than expected, use an inline helper so individual asserts skip gracefully rather than erroring.
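The helper itself is not reproduced in this guide. One possible shape, offered as an assumption rather than the project's actual helper, is a small function that treats a missing or empty list element as nothing to check and otherwise applies the subset-direction rule. The name cols_present_or_absent() and the sample data are hypothetical.

```r
# Hypothetical helper: TRUE when the element is absent/empty (nothing to
# check) or when every expected column is present; FALSE otherwise.
cols_present_or_absent <- function(lst, name, expected_cols) {
  el <- lst[[name]]
  if (is.null(el) || !is.data.frame(el) || nrow(el) == 0) {
    return(TRUE)  # element missing at test time: treat as a graceful pass
  }
  all(expected_cols %in% colnames(el))  # expected subset of actual
}

# Sample list-shaped result with only one of two possible elements.
res <- list(batting = data.frame(player_id = 1L, hits = 2L, extra = "x"))

cols_present_or_absent(res, "batting",  c("player_id", "hits"))  # TRUE
cols_present_or_absent(res, "pitching", c("player_id", "era"))   # TRUE (absent)
cols_present_or_absent(res, "batting",  c("player_id", "rbi"))   # FALSE
```

Inside a test, each call would be wrapped in expect_true(), so an element that is absent at test time does not raise an error while a genuinely missing column still fails.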
Documentation Maintenance
Several regeneration steps are part of the commit workflow whenever the relevant sources change. All of them are mechanical – never edit the generated regions by hand.
Markdown TOCs (doctoc)
NEWS.md, CLAUDE.md, CONTRIBUTING.md, .github/copilot-instructions.md, and .github/pull_request_template.md carry a doctoc-generated table of contents inside the standard marker comments. After editing any of those files, regenerate the TOC before committing:
Rscript tools/run_doctoc.R --maxlevel 2 \
  NEWS.md CLAUDE.md CONTRIBUTING.md \
  .github/copilot-instructions.md .github/pull_request_template.md
tools/run_doctoc.R is a no-deps, idempotent R replacement for the npm doctoc CLI – it produces output indistinguishable from the upstream tool and runs without Node.js. Use --maxlevel 2 so the TOC only lists # and ## headings. cran-comments.md is intentionally excluded.
README.md (rmarkdown)
README.md is rendered from README.Rmd. After editing README.Rmd, re-render before committing:
devtools::build_readme()
Commit README.Rmd and the regenerated README.md together. Never hand-edit README.md.
DESCRIPTION (usethis)
After editing DESCRIPTION (adding/removing packages, bumping versions, updating Authors@R, etc.), normalize formatting before committing:
usethis::use_tidy_description()
This re-orders fields, alphabetizes Imports/Suggests, and reflows long lines so subsequent diffs stay minimal. Run it even for one-line edits.
Release notes triad: NEWS.md / cran-comments.md / _pkgdown.yml
Three files describe the same release for different audiences. Whenever you add a NEWS.md bullet, think through all three before committing:
- NEWS.md – authoritative changelog for downstream users; rendered into the pkgdown changelog. New bullets go under the most recent unreleased version heading. Extend an existing subsection (### Bug fixes, ### New features, etc.) rather than starting a new one when the change is incremental.
- cran-comments.md – the short-form summary submitted to CRAN. Every behavioral or user-visible change in NEWS.md should be reflected here before submission. Purely internal changes (refactors, test infrastructure, dev tooling) can be omitted.
- _pkgdown.yml – the pkgdown reference index. New exported functions need to land in the right reference: section. The config uses starts_with() selectors (starts_with("mlb_"), starts_with("fg_"), etc.), so new functions matching those prefixes are picked up automatically; explicitly listed functions need a manual entry.
When a change touches the API surface (new export, deprecation, removal), include a one-line note in your commit message confirming you’ve checked all three files.
Commit Messages
Use Conventional Commits:
feat: add mlb_draft_pick_history() endpoint wrapper
fix: initialize return value before tryCatch in fg_team_pitcher()
docs: update NEWS.md for v1.6.0
test: add subset-direction column checks for statcast_search()
refactor: extract NCAA payload parsing into helper
chore: update .Rbuildignore patterns
ci: bump actions/checkout to v5
Prefer scoped subjects when useful (e.g. feat(mlb): ..., docs(contrib): ...). Use type!: or a BREAKING CHANGE: footer for breaking changes. Split unrelated work into separate commits for reviewability.
Important: Never include AI tools or assistants (e.g. Claude, Copilot) as commit co-authors. Omit all Co-Authored-By trailers that reference AI tools.
Pull Requests
- Target master (the default and release branch).
- Fill out the pull request template completely.
- Ensure devtools::check() passes with no new errors, warnings, or notes.
- Ensure devtools::document() has been run so man/ and NAMESPACE are current.
- Add or update tests for any changed behavior, respecting the live-API and NCAA IP-ban caveats above.
- Add a NEWS.md bullet (and reflect user-visible changes in cran-comments.md and _pkgdown.yml where applicable).
- Keep the PR focused; split unrelated changes into separate PRs.
Reporting Issues
When filing a bug report, please include:
- A minimal reprex (reproducible example) showing the call you made.
- The relevant identifier(s) you passed (e.g. game_pk, FanGraphs player id, season, team).
- The output of sessionInfo().
- The full error or warning message.
For NCAA-related issues, please note whether you may have been rate-limited or IP-banned (the symptom is usually an HTTP error or an empty/timeout response after repeated requests).
License
baseballr is released under the MIT License. By contributing, you agree that your contributions will be licensed under the same terms.
