NCAA Scraping

The latest release of baseballr includes a function for acquiring player statistics from the NCAA’s website for baseball teams across the three major divisions (I, II, III).

To look up teams, you can either load the teams for all divisions from the baseballr-data repository or pull them directly from the NCAA website for a given year and division.

Loading from the baseballr-data repository:

library(baseballr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ncaa_teams_df <- load_ncaa_baseball_teams()

From the NCAA website:

try(ncaa_teams(year = most_recent_ncaa_baseball_season(), division = "1"))
#> ── NCAA Baseball Teams data from stats.ncaa.org ───── baseballr 1.6.0 ──
#> ℹ Data updated: 2024-04-13 22:00:37 UTC
#> # A tibble: 305 × 8
#>    team_id team_name    team_url conference_id conference division  year
#>    <chr>   <chr>        <chr>    <chr>         <chr>      <chr>    <dbl>
#>  1 458     Charlotte    /team/4… 823           AAC        1         2024
#>  2 196     East Caroli… /team/1… 823           AAC        1         2024
#>  3 229     Fla. Atlant… /team/2… 823           AAC        1         2024
#>  4 404     Memphis      /team/4… 823           AAC        1         2024
#>  5 574     Rice         /team/5… 823           AAC        1         2024
#>  6 651     South Fla.   /team/6… 823           AAC        1         2024
#>  7 718     Tulane       /team/7… 823           AAC        1         2024
#>  8 9       UAB          /team/9… 823           AAC        1         2024
#>  9 706     UTSA         /team/7… 823           AAC        1         2024
#> 10 782     Wichita St.  /team/7… 823           AAC        1         2024
#> # ℹ 295 more rows
#> # ℹ 1 more variable: season_id <chr>

The function ncaa_team_player_stats() requires values for three parameters:

- team_id: the numerical code the NCAA uses for each school
- year: a four-digit year
- type: whether to pull data for batters ("batting") or pitchers ("pitching")

If you want to pull batting statistics for Florida State for the 2024 season, you would use the following:


team_id <- ncaa_teams_df %>% 
  dplyr::filter(.data$team_name == "Florida St.") %>% 
  dplyr::select("team_id") %>% 
  dplyr::distinct() %>% 
  dplyr::pull("team_id")

year <- most_recent_ncaa_baseball_season()

ncaa_team_player_stats(team_id = team_id, year = year, type = "batting")
#> ── NCAA Baseball Team Batting Stats data from stats.ncaa.org ───────────
#> ℹ Data updated: 2024-04-13 22:00:42 UTC
#> # A tibble: 36 × 35
#>     year team_name   team_id conference_id conference division player_id
#>    <int> <chr>         <dbl>         <int> <chr>         <dbl>     <int>
#>  1  2024 Florida St.     234           821 ACC               1   2797459
#>  2  2024 Florida St.     234           821 ACC               1   2649339
#>  3  2024 Florida St.     234           821 ACC               1   2649334
#>  4  2024 Florida St.     234           821 ACC               1   2305362
#>  5  2024 Florida St.     234           821 ACC               1   2813214
#>  6  2024 Florida St.     234           821 ACC               1   2943072
#>  7  2024 Florida St.     234           821 ACC               1   2802463
#>  8  2024 Florida St.     234           821 ACC               1   2799847
#>  9  2024 Florida St.     234           821 ACC               1   2797460
#> 10  2024 Florida St.     234           821 ACC               1   2813213
#> # ℹ 26 more rows
#> # ℹ 28 more variables: player_url <chr>, player_name <chr>, Yr <chr>,
#> #   Pos <chr>, Jersey <chr>, GP <dbl>, GS <dbl>, BA <dbl>, OBPct <dbl>,
#> #   SlgPct <dbl>, R <dbl>, AB <dbl>, H <dbl>, `2B` <dbl>, `3B` <dbl>,
#> #   TB <dbl>, HR <dbl>, RBI <dbl>, BB <dbl>, HBP <dbl>, SF <dbl>,
#> #   SH <dbl>, K <dbl>, DP <dbl>, CS <dbl>, Picked <dbl>, SB <dbl>,
#> #   RBI2out <dbl>

The same can be done for pitching, just by changing the type parameter:

ncaa_team_player_stats(team_id = team_id, year = year, type = "pitching")
#> ── NCAA Baseball Team Pitching Stats data from stats.ncaa.org ──────────
#> ℹ Data updated: 2024-04-13 22:00:47 UTC
#> # A tibble: 36 × 43
#>     year team_name   team_id conference_id conference division player_id
#>    <int> <chr>         <dbl>         <int> <chr>         <dbl>     <int>
#>  1  2024 Florida St.     234           821 ACC               1   2797459
#>  2  2024 Florida St.     234           821 ACC               1   2649339
#>  3  2024 Florida St.     234           821 ACC               1   2649334
#>  4  2024 Florida St.     234           821 ACC               1   2305362
#>  5  2024 Florida St.     234           821 ACC               1   2813214
#>  6  2024 Florida St.     234           821 ACC               1   2943072
#>  7  2024 Florida St.     234           821 ACC               1   2802463
#>  8  2024 Florida St.     234           821 ACC               1   2799847
#>  9  2024 Florida St.     234           821 ACC               1   2797460
#> 10  2024 Florida St.     234           821 ACC               1   2813213
#> # ℹ 26 more rows
#> # ℹ 36 more variables: player_url <chr>, player_name <chr>, Yr <chr>,
#> #   Pos <chr>, Jersey <chr>, GP <dbl>, App <dbl>, GS <dbl>, ERA <dbl>,
#> #   IP <dbl>, H <dbl>, R <dbl>, ER <dbl>, BB <dbl>, SO <dbl>,
#> #   SHO <dbl>, BF <dbl>, `P-OAB` <dbl>, `2B-A` <dbl>, `3B-A` <dbl>,
#> #   Bk <dbl>, `HR-A` <dbl>, WP <dbl>, HB <dbl>, IBB <dbl>,
#> #   `Inh Run` <dbl>, `Inh Run Score` <dbl>, SHA <dbl>, SFA <dbl>, …

The function depends on the user knowing the team_id that the NCAA website uses. To help with that, I’ve included an ncaa_school_id_lu() function so that users can find the team_id they need.

Just pass a string to the function and it will return possible matches based on the school’s name:

ncaa_school_id_lu("Vand")
#> ── NCAA Baseball Teams Information from baseballr data repository ──────
#> ℹ Data updated: 2024-01-09 04:44:08 UTC
#> # A tibble: 15 × 8
#>    team_id team_name  team_url   conference_id conference division  year
#>      <dbl> <chr>      <chr>              <dbl> <chr>         <dbl> <dbl>
#>  1     736 Vanderbilt /team/736…           911 SEC               1  2024
#>  2     736 Vanderbilt /team/736…           911 SEC               1  2023
#>  3     736 Vanderbilt /team/736…           911 SEC               1  2022
#>  4     736 Vanderbilt /team/736…           911 SEC               1  2021
#>  5     736 Vanderbilt /team/736…           911 SEC               1  2020
#>  6     736 Vanderbilt /team/736…           911 SEC               1  2019
#>  7     736 Vanderbilt /team/736…           911 SEC               1  2018
#>  8     736 Vanderbilt /team/736…           911 SEC               1  2017
#>  9     736 Vanderbilt /team/736…           911 SEC               1  2016
#> 10     736 Vanderbilt /team/736…           911 SEC               1  2015
#> 11     736 Vanderbilt /team/736…           911 SEC               1  2014
#> 12     736 Vanderbilt /team/736…           911 SEC               1  2013
#> 13     736 Vanderbilt /team/736…           911 SEC               1  2012
#> 14     736 Vanderbilt /team/736…           911 SEC               1  2011
#> 15     736 Vanderbilt /team/736…           911 SEC               1  2010
#> # ℹ 1 more variable: season_id <dbl>
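
Because the lookup returns one row per season, you may want to narrow the result to a single team_id before passing it on. The following is a minimal sketch in base R, not part of baseballr itself: the `lookup` data frame is a hand-built stand-in for a few rows of the output above (so it runs without hitting the network), and the names `lookup`, `season`, and `vandy_id` are illustrative only.

```r
# Hand-built stand-in for a few rows of the ncaa_school_id_lu("Vand") output
# shown above, so the sketch runs offline. Only three columns are kept.
lookup <- data.frame(
  team_id   = c(736, 736, 736),
  team_name = c("Vanderbilt", "Vanderbilt", "Vanderbilt"),
  year      = c(2024, 2023, 2022)
)

# Keep the row for the season of interest and take its team_id.
season <- 2024
vandy_id <- unique(
  lookup$team_id[lookup$team_name == "Vanderbilt" & lookup$year == season]
)

vandy_id
```

With real data, you would replace the hand-built data frame with the result of ncaa_school_id_lu("Vand") and pass the resulting value as the team_id argument of ncaa_team_player_stats().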

Bill Petti

2016-11-22