The latest release of the baseballr package includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).
To look up teams, you can either load the teams for all divisions from the baseballr-data repository or access them directly from the NCAA website for a given year and division.
Loading from the baseballr-data repository:
library(baseballr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
ncaa_teams_df <- load_ncaa_baseball_teams()
From the NCAA website:
try(ncaa_teams(year = most_recent_ncaa_baseball_season(), division = "1"))
#> ── NCAA Baseball Teams data from stats.ncaa.org ───── baseballr 1.6.0 ──
#> ℹ Data updated: 2024-04-13 22:00:37 UTC
#> # A tibble: 305 × 8
#> team_id team_name team_url conference_id conference division year
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 458 Charlotte /team/4… 823 AAC 1 2024
#> 2 196 East Caroli… /team/1… 823 AAC 1 2024
#> 3 229 Fla. Atlant… /team/2… 823 AAC 1 2024
#> 4 404 Memphis /team/4… 823 AAC 1 2024
#> 5 574 Rice /team/5… 823 AAC 1 2024
#> 6 651 South Fla. /team/6… 823 AAC 1 2024
#> 7 718 Tulane /team/7… 823 AAC 1 2024
#> 8 9 UAB /team/9… 823 AAC 1 2024
#> 9 706 UTSA /team/7… 823 AAC 1 2024
#> 10 782 Wichita St. /team/7… 823 AAC 1 2024
#> # ℹ 295 more rows
#> # ℹ 1 more variable: season_id <chr>
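Either source gives a table with one row per team and season, so finding a school's ID is a plain filter on team_name and year. Here is a minimal sketch in base R, assuming the column names shown in the output above. The small data frame is a hand-built stand-in with three rows copied from that output, and find_team_id() is a hypothetical helper, not part of baseballr:

```r
# Stand-in for the teams table; rows copied from the printed output above.
teams <- data.frame(
  team_id   = c("458", "574", "718"),
  team_name = c("Charlotte", "Rice", "Tulane"),
  division  = c("1", "1", "1"),
  year      = c(2024, 2024, 2024),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the single team_id for an exact team name and
# season, and stop with a clear message if there is no match or several.
find_team_id <- function(teams, name, season) {
  rows <- which(teams$team_name == name & teams$year == season)
  hits <- unique(teams$team_id[rows])
  if (length(hits) != 1) {
    stop("Expected exactly one team_id for '", name, "', found ", length(hits))
  }
  hits
}

find_team_id(teams, "Rice", 2024)
```

Note that the NCAA abbreviates many names (for example "South Fla." and "Wichita St."), so an exact match needs the abbreviated spelling.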
The function, ncaa_team_player_stats(), requires the user to pass values for three parameters:
team_id: the numerical code used by the NCAA for each school
year: a four-digit year
type: whether to pull data for batters ("batting") or pitchers ("pitching")
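Since all three values are required, it can help to check them before making a request. The following is only a sketch: check_stats_args() is a hypothetical helper that is not part of baseballr, and the two accepted type values are taken from the examples in this document.

```r
# Hypothetical helper (not part of baseballr): check the three arguments
# before making a request, so mistakes fail early with a readable message.
check_stats_args <- function(team_id, year, type) {
  if (length(team_id) != 1 || is.na(suppressWarnings(as.numeric(team_id)))) {
    stop("team_id must be a single numeric code, e.g. 234")
  }
  if (length(year) != 1 || !grepl("^[0-9]{4}$", as.character(year))) {
    stop("year must be a four-digit year, e.g. 2024")
  }
  # match.arg() errors unless type matches one of the listed choices
  type <- match.arg(type, c("batting", "pitching"))
  list(team_id = as.numeric(team_id), year = as.integer(year), type = type)
}

args <- check_stats_args("234", 2024, "batting")
str(args)
```

The returned list uses the same names as the function's parameters, so it could be handed on with do.call() if you prefer that style.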
If you want to pull batting statistics for Florida State for the 2024 season, you would use the following:
team_id <- ncaa_teams_df %>%
dplyr::filter(.data$team_name == "Florida St.") %>%
dplyr::select("team_id") %>%
dplyr::distinct() %>%
dplyr::pull("team_id")
year <- most_recent_ncaa_baseball_season()
ncaa_team_player_stats(team_id = team_id, year = year, type = "batting")
#> ── NCAA Baseball Team Batting Stats data from stats.ncaa.org ───────────
#> ℹ Data updated: 2024-04-13 22:00:42 UTC
#> # A tibble: 36 × 35
#> year team_name team_id conference_id conference division player_id
#> <int> <chr> <dbl> <int> <chr> <dbl> <int>
#> 1 2024 Florida St. 234 821 ACC 1 2797459
#> 2 2024 Florida St. 234 821 ACC 1 2649339
#> 3 2024 Florida St. 234 821 ACC 1 2649334
#> 4 2024 Florida St. 234 821 ACC 1 2305362
#> 5 2024 Florida St. 234 821 ACC 1 2813214
#> 6 2024 Florida St. 234 821 ACC 1 2943072
#> 7 2024 Florida St. 234 821 ACC 1 2802463
#> 8 2024 Florida St. 234 821 ACC 1 2799847
#> 9 2024 Florida St. 234 821 ACC 1 2797460
#> 10 2024 Florida St. 234 821 ACC 1 2813213
#> # ℹ 26 more rows
#> # ℹ 28 more variables: player_url <chr>, player_name <chr>, Yr <chr>,
#> # Pos <chr>, Jersey <chr>, GP <dbl>, GS <dbl>, BA <dbl>, OBPct <dbl>,
#> # SlgPct <dbl>, R <dbl>, AB <dbl>, H <dbl>, `2B` <dbl>, `3B` <dbl>,
#> # TB <dbl>, HR <dbl>, RBI <dbl>, BB <dbl>, HBP <dbl>, SF <dbl>,
#> # SH <dbl>, K <dbl>, DP <dbl>, CS <dbl>, Picked <dbl>, SB <dbl>,
#> # RBI2out <dbl>
The same can be done for pitching, just by changing the type parameter:
ncaa_team_player_stats(team_id = team_id, year = year, type = "pitching")
#> ── NCAA Baseball Team Pitching Stats data from stats.ncaa.org ──────────
#> ℹ Data updated: 2024-04-13 22:00:47 UTC
#> # A tibble: 36 × 43
#> year team_name team_id conference_id conference division player_id
#> <int> <chr> <dbl> <int> <chr> <dbl> <int>
#> 1 2024 Florida St. 234 821 ACC 1 2797459
#> 2 2024 Florida St. 234 821 ACC 1 2649339
#> 3 2024 Florida St. 234 821 ACC 1 2649334
#> 4 2024 Florida St. 234 821 ACC 1 2305362
#> 5 2024 Florida St. 234 821 ACC 1 2813214
#> 6 2024 Florida St. 234 821 ACC 1 2943072
#> 7 2024 Florida St. 234 821 ACC 1 2802463
#> 8 2024 Florida St. 234 821 ACC 1 2799847
#> 9 2024 Florida St. 234 821 ACC 1 2797460
#> 10 2024 Florida St. 234 821 ACC 1 2813213
#> # ℹ 26 more rows
#> # ℹ 36 more variables: player_url <chr>, player_name <chr>, Yr <chr>,
#> # Pos <chr>, Jersey <chr>, GP <dbl>, App <dbl>, GS <dbl>, ERA <dbl>,
#> # IP <dbl>, H <dbl>, R <dbl>, ER <dbl>, BB <dbl>, SO <dbl>,
#> # SHO <dbl>, BF <dbl>, `P-OAB` <dbl>, `2B-A` <dbl>, `3B-A` <dbl>,
#> # Bk <dbl>, `HR-A` <dbl>, WP <dbl>, HB <dbl>, IBB <dbl>,
#> # `Inh Run` <dbl>, `Inh Run Score` <dbl>, SHA <dbl>, SFA <dbl>, …
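Because only type changes between the two calls, both tables can be collected in one pass. The sketch below assumes nothing about baseballr beyond the three parameter names. pull_both() is a hypothetical helper, and fake_fetch() is a stand-in that only echoes its arguments so the example runs offline; in real use you would pass ncaa_team_player_stats as fetch.

```r
# Hypothetical helper (not part of baseballr). `fetch` is any function with
# the arguments team_id, year and type.
pull_both <- function(team_id, year, fetch) {
  types <- c("batting", "pitching")
  out <- lapply(types, function(type) {
    # try() keeps one failed request from discarding the other result
    try(fetch(team_id = team_id, year = year, type = type), silent = TRUE)
  })
  names(out) <- types
  out
}

# Stand-in for the real function so the sketch runs offline; it only echoes
# its arguments and returns no statistics.
fake_fetch <- function(team_id, year, type) {
  data.frame(team_id = team_id, year = year, type = type)
}

stats <- pull_both(234, 2024, fake_fetch)
names(stats)
```

Returning a named list keeps the batting and pitching tables separate, which matters here because the two tables have different columns.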
The function depends on the user knowing the team_id used by the NCAA website. Given that, I’ve included an ncaa_school_id_lu() function so that users can find the team_id they need.
Just pass a string to the function and it will return possible matches based on the school’s name:
ncaa_school_id_lu("Vand")
#> ── NCAA Baseball Teams Information from baseballr data repository ──────
#> ℹ Data updated: 2024-01-09 04:44:08 UTC
#> # A tibble: 15 × 8
#> team_id team_name team_url conference_id conference division year
#> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 736 Vanderbilt /team/736… 911 SEC 1 2024
#> 2 736 Vanderbilt /team/736… 911 SEC 1 2023
#> 3 736 Vanderbilt /team/736… 911 SEC 1 2022
#> 4 736 Vanderbilt /team/736… 911 SEC 1 2021
#> 5 736 Vanderbilt /team/736… 911 SEC 1 2020
#> 6 736 Vanderbilt /team/736… 911 SEC 1 2019
#> 7 736 Vanderbilt /team/736… 911 SEC 1 2018
#> 8 736 Vanderbilt /team/736… 911 SEC 1 2017
#> 9 736 Vanderbilt /team/736… 911 SEC 1 2016
#> 10 736 Vanderbilt /team/736… 911 SEC 1 2015
#> 11 736 Vanderbilt /team/736… 911 SEC 1 2014
#> 12 736 Vanderbilt /team/736… 911 SEC 1 2013
#> 13 736 Vanderbilt /team/736… 911 SEC 1 2012
#> 14 736 Vanderbilt /team/736… 911 SEC 1 2011
#> 15 736 Vanderbilt /team/736… 911 SEC 1 2010
#> # ℹ 1 more variable: season_id <dbl>
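The lookup returns one row per season, so a single school can fill many rows. Here is a sketch of collapsing such a result to one row per school, assuming the column names shown above. The data frame is a hand-built stand-in with three rows copied from that output, and distinct_schools() is a hypothetical helper, not part of baseballr:

```r
# Stand-in for a lookup result; rows copied from the printed output above.
lookup <- data.frame(
  team_id   = c(736, 736, 736),
  team_name = c("Vanderbilt", "Vanderbilt", "Vanderbilt"),
  year      = c(2024, 2023, 2022),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per school, with the range of seasons covered.
distinct_schools <- function(lookup) {
  schools <- unique(lookup[, c("team_id", "team_name")])
  # Collect the seasons belonging to each remaining team_id
  years <- lapply(schools$team_id, function(id) lookup$year[lookup$team_id == id])
  schools$first_year <- vapply(years, min, numeric(1))
  schools$last_year  <- vapply(years, max, numeric(1))
  rownames(schools) <- NULL
  schools
}

distinct_schools(lookup)
```

With a broader search string that matches several schools, the result would have one row for each, which makes it easier to pick out the team_id you want.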