The latest release of the baseballr package includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).
The function, ncaa_scrape, requires the user to pass values for three parameters:
school_id: numerical code used by the NCAA for each school
year: a four-digit year
type: whether to pull data for batters or pitchers
If you want to pull batting statistics for Vanderbilt for the 2021 season, you would use the following:
library(baseballr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
ncaa_scrape(736, 2021, "batting") %>%
select("year":"OBPct")
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ───────────────────
#> ℹ Data updated: 2022-12-29 00:09:27 UTC
#> # A tibble: 41 × 12
#> year school confe…¹ divis…² Jersey Player Yr Pos GP GS
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2021 Vanderbi… SEC 1 51 Bradf… Fr OF 67 67
#> 2 2021 Vanderbi… SEC 1 25 Nolan… So INF 66 66
#> 3 2021 Vanderbi… SEC 1 99 Gonza… So INF 61 58
#> 4 2021 Vanderbi… SEC 1 9 Young… So INF 61 61
#> 5 2021 Vanderbi… SEC 1 12 Keega… Jr UT 60 60
#> 6 2021 Vanderbi… SEC 1 8 Thoma… Jr OF 59 57
#> 7 2021 Vanderbi… SEC 1 5 Rodri… So C 58 52
#> 8 2021 Vanderbi… SEC 1 16 Bulge… Fr UT 50 41
#> 9 2021 Vanderbi… SEC 1 6 Kolwy… Jr INF 43 39
#> 10 2021 Vanderbi… SEC 1 19 LaNev… So OF 37 19
#> # … with 31 more rows, 2 more variables: BA <dbl>, OBPct <dbl>, and
#> # abbreviated variable names ¹conference, ²division
The same can be done for pitching, just by changing the type parameter:
ncaa_scrape(736, 2021, "pitching") %>%
select("year":"ERA")
#> ── NCAA Baseball Team Stats data from stats.ncaa.org ───────────────────
#> ℹ Data updated: 2022-12-29 00:09:28 UTC
#> # A tibble: 41 × 12
#> year school confe…¹ divis…² Jersey Player Yr Pos GP App
#> <int> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 2021 Vanderbi… SEC 1 51 Bradf… Fr OF 67 67
#> 2 2021 Vanderbi… SEC 1 25 Nolan… So INF 66 66
#> 3 2021 Vanderbi… SEC 1 99 Gonza… So INF 61 61
#> 4 2021 Vanderbi… SEC 1 9 Young… So INF 61 61
#> 5 2021 Vanderbi… SEC 1 12 Keega… Jr UT 60 60
#> 6 2021 Vanderbi… SEC 1 8 Thoma… Jr OF 59 59
#> 7 2021 Vanderbi… SEC 1 5 Rodri… So C 58 58
#> 8 2021 Vanderbi… SEC 1 16 Bulge… Fr UT 50 50
#> 9 2021 Vanderbi… SEC 1 6 Kolwy… Jr INF 43 43
#> 10 2021 Vanderbi… SEC 1 19 LaNev… So OF 37 37
#> # … with 31 more rows, 2 more variables: GS <dbl>, ERA <dbl>, and
#> # abbreviated variable names ¹conference, ²division
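Because the two calls differ only in the type argument, both tables can be collected in one step. The following is a minimal sketch, not part of baseballr: the helper name scrape_both is hypothetical, and it assumes ncaa_scrape accepts its three arguments positionally, as in the calls above. If your installed version exposes the scraper under a different name, pass that function as scraper instead.

```r
# Hypothetical helper (not part of baseballr): fetch batting and pitching
# tables for one team-season and return them as a named list.
# `scraper` is assumed to take (school_id, year, type) positionally,
# matching the ncaa_scrape() calls shown above.
scrape_both <- function(school_id, year, scraper = baseballr::ncaa_scrape) {
  types <- c("batting", "pitching")
  results <- lapply(types, function(type) scraper(school_id, year, type))
  names(results) <- types
  results
}

# Usage (requires a network connection, so not run here):
# vandy <- scrape_both(736, 2021)
# vandy$batting
# vandy$pitching
```

Taking the scraper as an argument keeps the helper easy to try offline with a stand-in function before pointing it at the NCAA site.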
Now, the function is dependent on the user knowing the school_id used by the NCAA website. Given that, I’ve included an ncaa_school_id_lu function so that users can find the school_id they need.
Just pass a string to the function and it will return possible matches based on the school’s name:
ncaa_school_id_lu("Vand")
#> # A tibble: 10 × 6
#> school conference school_id year division conference_id
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Vanderbilt SEC 736 2013 1 911
#> 2 Vanderbilt SEC 736 2014 1 911
#> 3 Vanderbilt SEC 736 2015 1 911
#> 4 Vanderbilt SEC 736 2016 1 911
#> 5 Vanderbilt SEC 736 2017 1 911
#> 6 Vanderbilt SEC 736 2018 1 911
#> 7 Vanderbilt SEC 736 2019 1 911
#> 8 Vanderbilt SEC 736 2020 1 911
#> 9 Vanderbilt SEC 736 2021 1 911
#> 10 Vanderbilt SEC 736 2022 1 911
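Since the lookup returns one row per season, the ID can be extracted programmatically rather than read off the printout. Here is a minimal sketch, assuming the lookup has the school, school_id, and year columns shown above. The helper name get_school_id is hypothetical and not part of baseballr, and the small data frame is a stand-in copied from the output above so the example runs without a network connection.

```r
library(dplyr)

# Hypothetical helper (not part of baseballr): return the single school_id
# for an exact school name and season from a table shaped like the
# ncaa_school_id_lu() output above. Stops if the match is not unique.
get_school_id <- function(lookup, school_name, season) {
  ids <- lookup %>%
    filter(school == school_name, year == season) %>%
    pull(school_id) %>%
    unique()
  if (length(ids) != 1) {
    stop("Expected exactly one school_id, found ", length(ids))
  }
  ids
}

# Stand-in for the lookup result, copied from the output above.
lookup <- data.frame(
  school = "Vanderbilt",
  conference = "SEC",
  school_id = 736,
  year = 2013:2022,
  division = 1,
  conference_id = 911
)

vandy_id <- get_school_id(lookup, "Vanderbilt", 2021)

# With a network connection, the real calls would be (not run here):
# lookup <- ncaa_school_id_lu("Vand")
# ncaa_scrape(vandy_id, 2021, "batting")
```

Matching on the exact school name avoids picking up other schools whose names happen to contain the same search string.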