Retrosheet remains one of the very best data resources for the game of baseball. While we are all used to play-by-play data being readily availabel through Baseball Savant, if you really want to do any kind of research relying on that kind of data before 2008, Retrosheet is the only source.

Getting the data, especially in bulk, can be a little tricky. The event files can be downloaded by year, or by decade, but they do not come in a straightforwad format. That means it isn’t as simple as using read_csv, or some other delimited file reading function and putting the files url as an argument.

The good news is the fine folks at the Chadwick Baseball Bureau make available a set of command line tools for parsing and formating the raw Retrosheet files. We can use these tools, and R, to create our very own event-level database in a matter of minutes.

Set up

First, you’ll need to download and install the command line tools from Chadwick onto your computer. You can find the latest release here.

Second, you will need to install baseballr if you haven’t already. You can install using the following code:

require(devtools)

install_github("BIllPetti/baseballr")

Now, you’re ready to go

Download Event and Roster files

A while back, the great Jim Albert put together a script for downloading and parsing these types of riles in R. When I went to adapt the script I noticed a few things were buggy (mostly, I think, due to Retrosheet changing their url structure for the files). I made some updates to the code and wrapped it in a single function get_retrosheet_data(). The function takes the following arguments:

  • path_to_directory: A path to a directory (existing or new) where you want the Retrosheet files to be located. If the folder does not currently exist it will be created, as well as three necessary subfolders (download.folder, unzipped, and zipped).
  • years_to_acquire: Which years you are interested in downloading. This can be a single integer, like 2018, or it can be a vector. You can include a funch of non-continous years (e.g. c(1935, 1977, 1986)), or the first and last years that you want to sequence (e.g. c(1990,1999)). More on that below.
  • sequence_years: If you provide two years in the previous argument and essentially want all years including and between those two to be downloaded, set this argument to TRUE. This makes it easy to grab large batches of files, or to build a comprehensive database in one step. This argument defaults to FALSE.

The function does the following under the hood:

1) Downloads the Retrosheet event file for a given year 2) Unzipps the files 3) Parses and formats both the events and roster files with Chadwich 4) Does some additional formating of the column names to make them more friendly across platforms 5) Removes the original Retrosheet files 6) If multiple seasons are selected, loops over those years and repeats the five steps above

Let’s try it out on a single season:

require(baseballr)
require(dplyr)
require(readr)

get_retrosheet_data(path_to_directory = "~/Desktop/retrosheet", 
                    years_to_acquire = 2018)

read_csv("~/Desktop/retrosheet/download.folder/unzipped/all2018.csv") %>%
  glimpse()			

Observations: 191,051
Variables: 98
$ game_id                   <chr> "ANA201804020", "ANA201804020", "ANA201…
$ away_team_id              <chr> "CLE", "CLE", "CLE", "CLE", "CLE", "CLE…
$ inn_ct                    <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
$ bat_home_id               <dbl> 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, …
$ outs_ct                   <dbl> 0, 1, 2, 0, 1, 2, 2, 0, 1, 1, 2, 0, 1, …
$ balls_ct                  <dbl> 2, 1, 1, 0, 0, 3, 1, 3, 0, 1, 1, 2, 3, …
$ strikes_ct                <dbl> 2, 2, 2, 0, 2, 2, 1, 1, 1, 1, 2, 2, 2, …
$ pitch_seq_tx              <chr> "BCBSX", "CBCX", "CBSFFX", "X", "CCT", …
$ away_score_ct             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ home_score_ct             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ bat_id                    <chr> "lindf001", "kipnj001", "ramij003", "co…
$ bat_hand_cd               <chr> "L", "L", "L", "R", "R", "R", "R", "L",…
$ resp_bat_id               <chr> "lindf001", "kipnj001", "ramij003", "co…
$ resp_bat_hand_cd          <chr> "L", "L", "L", "R", "R", "R", "R", "L",…
$ pit_id                    <chr> "ramij002", "ramij002", "ramij002", "cl…
$ pit_hand_cd               <chr> "R", "R", "R", "R", "R", "R", "R", "R",…
$ resp_pit_id               <chr> "ramij002", "ramij002", "ramij002", "cl…
$ resp_pit_hand_cd          <chr> "R", "R", "R", "R", "R", "R", "R", "R",…
$ pos2_fld_id               <chr> "maldm001", "maldm001", "maldm001", "go…
$ pos3_fld_id               <chr> "martj007", "martj007", "martj007", "al…
$ pos4_fld_id               <chr> "cozaz001", "cozaz001", "cozaz001", "ki…
$ pos5_fld_id               <chr> "valbl001", "valbl001", "valbl001", "ra…
$ pos6_fld_id               <chr> "simma001", "simma001", "simma001", "li…
$ pos7_fld_id               <chr> "uptoj001", "uptoj001", "uptoj001", "na…
$ pos8_fld_id               <chr> "troum001", "troum001", "troum001", "zi…
$ pos9_fld_id               <chr> "calhk001", "calhk001", "calhk001", "ch…
$ base1_run_id              <chr> NA, NA, NA, NA, NA, NA, "uptoj001", NA,…
$ base2_run_id              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ base3_run_id              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ event_tx                  <chr> "8/L+", "31/G", "31/G", "8/F", "K", "W"…
$ leadoff_fl                <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,…
$ ph_fl                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ bat_fld_cd                <dbl> 6, 4, 5, 4, 8, 7, 10, 3, 10, 9, 2, 9, 6…
$ bat_lineup_id             <dbl> 1, 2, 3, 1, 2, 3, 4, 4, 5, 6, 7, 5, 6, …
$ event_cd                  <dbl> 2, 2, 2, 2, 3, 14, 2, 2, 23, 2, 3, 3, 2…
$ bat_event_fl              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ ab_fl                     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TR…
$ h_fl                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 1, …
$ sh_fl                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ sf_fl                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ event_outs_ct             <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, …
$ dp_fl                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ tp_fl                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ rbi_ct                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
$ wp_fl                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ pb_fl                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ fld_cd                    <dbl> 8, 3, 3, 8, 0, 0, 2, 8, 7, 4, 0, 0, 4, …
$ battedball_cd             <chr> "L", "G", "G", "F", NA, NA, "P", "F", "…
$ bunt_fl                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ foul_fl                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ battedball_loc_tx         <chr> NA, NA, NA, NA, NA, NA, "2F", NA, NA, N…
$ err_ct                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ err1_fld_cd               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ err1_cd                   <chr> "N", "N", "N", "N", "N", "N", "N", "N",…
$ err2_fld_cd               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ err2_cd                   <chr> "N", "N", "N", "N", "N", "N", "N", "N",…
$ err3_fld_cd               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ err3_cd                   <chr> "N", "N", "N", "N", "N", "N", "N", "N",…
$ bat_dest_id               <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 1, …
$ run1_dest_id              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
$ run2_dest_id              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ run3_dest_id              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ bat_play_tx               <dbl> 8, 31, 31, 8, 2, NA, 2, 8, NA, 4, 2, 2,…
$ run1_play_tx              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ run2_play_tx              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ run3_play_tx              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ run1_sb_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run2_sb_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run3_sb_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run1_cs_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run2_cs_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run3_cs_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run1_pk_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run2_pk_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run3_pk_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ run1_resp_pit_id          <chr> NA, NA, NA, NA, NA, NA, "clevm001", NA,…
$ run2_resp_pit_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ run3_resp_pit_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ game_new_fl               <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ game_end_fl               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ pr_run1_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ pr_run2_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ pr_run3_fl                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ removed_for_pr_run1_id    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ removed_for_pr_run2_id    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ removed_for_pr_run3_id    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ removed_for_ph_bat_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ removed_for_ph_bat_fld_cd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ po1_fld_cd                <dbl> 8, 1, 1, 8, 2, 0, 2, 8, 0, 4, 2, 2, 0, …
$ po2_fld_cd                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ po3_fld_cd                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ass1_fld_cd               <dbl> 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ass2_fld_cd               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ass3_fld_cd               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ass4_fld_cd               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ass5_fld_cd               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ event_id                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, …
$ year                      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 201…

The roster data can also be found in the unzipped folder after processing.

So this is a single year. What if we wanted to grab all the available years and create our own database?

Build a Retrosheet Event and Roster database

The first step is acquiring all the data. We can do this using the sequencing functionality in get_retrosheet_data().

get_retrosheet_data("~/Desktop/retrosheet",
                    years_to_acquire = c(1919,2018),
                    sequence_years = T)

This function cal will acquire and parse all event and roster files for 1919 through 2018. The entire job takes about 20 minutes on my Macbook Air.

Once you have all the data downloaded the next step is to combine the event and roster files into two single files. To do that, we need to make sure all of the columns in each discrete file are of the same type (some get a little wonky depending on the season, etc.).

We’ll use the most recent file to set the column types. We create three vectors of column names, one each for the different types we encounter:

# creat column type mapping
setwd("~/Desktop/retrosheet/download.folder/unzipped/")

for_type_mapping <- read_csv("all2018.csv")

# for this next step you will need the `variable_list` function from my personal package (https://github.com/BillPetti/bpettir/blob/master/R/variable_list.R)

vars <- variable_list(for_type_mapping) %>%
  mutate(variable_name = as.character(variable_name))

character_vars <- filter(vars, class == "character") %>%
  pull(variable_name)

numeric_vars <- filter(vars, class == "numeric") %>%
  pull(variable_name)

logical_vars <- filter(vars, class == "logical") %>%
  pull(variable_name)

Next, we loop over each event file, format their columns appropriately using the vectors we just created, and then combine the files. (Note, I use furrr to try and speed things up):

# combine all event files

files <- dir(path = "~Desktop/retrosheet/download.folder/unzipped/", pattern = "all*")

load_format_retrosheet_event_files <- function(path) {
  
  df <- read_csv(path)
  
  df <- df %>%
    mutate_if(names(df) %in% character_vars, as.character) %>%
    mutate_if(names(df) %in% numeric_vars, as.numeric) %>%
    mutate_if(names(df) %in% logical_vars, as.logical)
  
  df
}

require(future)
require(furrr)

plan(multiprocess)

sample_combined_pbp <- files %>%
  future_map_dfr(~load_format_retrosheet_event_files(.), .progress = TRUE)

sample_combined_pbp <- sample_combined_pbp %>%
  janitor::clean_names()

And we do the same thing for the roster files:

# combine all roster files

load_format_retrosheet_roster_files <- function(path) {
  
  df <- read_csv(path)
  names(df) <-c("row", "Player.ID", "Last.Name", "First.Name",
                "Bats", "Pitches", "Team", "Pos", "year")
  df <- df %>%
    mutate_all(as.character)
  
  df <- df %>%
    select(Player.ID:year)
  
  df
}

files <- dir(path = "~/Desktop/retrosheet/download.folder/unzipped/", pattern = "roster*")

sample_combined_roster_files <- files %>%
  map(load_format_retrosheet_roster_files) %>%
  bind_rows()

sample_combined_roster_files <- sample_combined_roster_files %>%
  janitor::clean_names()

Now we are ready to upload our files into a database.

I use Postgres, and the following code assumes you already have a database set up where you can deposit the files. The code below creates the new tables and uploads the combined files we just created. It will also index the database using some common columns (or, at least, columns that find useful to have indexed).

(note: I have a package that allows me to connec to my databases with a single call to save time, the myDBconnections referrenced below.)

# add event and roster files to Postgres database

retrosheet_db <- myDBconnections::connect_retrosheet_postgreSQL()

dbWriteTable(conn = retrosheet_db, 
             name = "retro_rosters", 
             value = sample_combined_roster_files, 
             overwrite = TRUE, 
             row.names = FALSE)

dbWriteTable(conn = retrosheet_db, 
             name = "retro_pbp", 
             value = sample_combined_pbp, 
             overwrite = TRUE, 
             row.names = FALSE)

# create indices for pbp table

dbGetQuery(retrosheet_db, 
           'create index retro_pbp_year on retro_pbp (year)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_game_id on retro_pbp (game_id)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_away_team_id on retro_pbp (away_team_id)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_bat_id on retro_pbp (bat_id)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_pit_id on retro_pbp (pit_id)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_bat_hand_cd on retro_pbp (bat_hand_cd)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_event_tx on retro_pbp (event_tx)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_event_id on retro_pbp (event_id)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_pit_hand_cd on retro_pbp (pit_hand_cd)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_event_cd on retro_pbp (event_cd)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_event_outs_ct on retro_pbp (event_outs_ct)')
dbGetQuery(retrosheet_db, 
           'create index retro_pbp_inn_ct on retro_pbp (inn_ct)')
dbGetQuery(retrosheet_db, 
           'alter table retro_rosters alter column year type int using year::integer')

And now you have your very own Retrosheet event and roster database to query to your heart’s content!

I recommend running all the code above as a separate job in R (or multiple, sequential, jobs if you want to be able to troubleshoot different steps) to make things run as fast as possible.