Sunday, October 24, 2010

Postseason Homeruns

Before I get started on the meat of this post I want to do a bit of housekeeping. In my second post I credited Sean Forman for the database I downloaded from Baseball Databank. While Sean Forman does maintain Baseball Databank, the database is supplied and maintained by Sean Lahman who runs Sean Lahman's Baseball Archive. I will correct this mistake in my earlier post. Also it appears that when I retrieved the event files from Retrosheet I didn't do my due diligence by scrolling to the end of the page. It turns out that the event files from years prior to 1974 have missing games. My last post could have gone south quick if Lou Brock had stolen a base in one of those games (though I got lucky with the example I picked and really got to explore the database). However, Retrosheet provides box scores files from 1920 to 1973 and a smattering of years before. While these files provide less data than event files, they fill in the gaps of the missing games and extend our reach into history by 30 years. I will be adding this data to my database in the coming days.

Okay then. Last night while I was watching the Giants/Phillies game they mentioned that Jayson Werth is tied for the lead in postseason homeruns in the National League with 13. They showed who he was tied with and of course it was all players from the wild card era. Just using a total ignores all the great postseason sluggers. However it is worth noting that Babe Ruth (an American League player his entire career) had 15 postseason homeruns and didn't have division series and league championship series to play in. So this got me thinking. Who has the lowest at bat to homerun ratio in the postseason?

First let's start out with the top ten in raw postseason homeruns (this is through 2009):

select m.nameFirst, m.nameLast, sum(b.HR) as hrTotal
from master m, battingpost b
where b.playerID = m.playerID
group by b.playerID
order by hrTotal desc
limit 10;

+-----------+-----------+---------+
| nameFirst | nameLast | hrTotal |
+-----------+-----------+---------+
| Manny | Ramirez | 29 |
| Bernie | Williams | 22 |
| Derek | Jeter | 20 |
| Reggie | Jackson | 18 |
| Mickey | Mantle | 18 |
| Jim | Thome | 17 |
| Babe | Ruth | 15 |
| David | Justice | 14 |
| Alex | Rodriguez | 13 |
| Chipper | Jones | 13 |
+-----------+-----------+---------+
10 rows in set (0.24 sec)

Now let's see how these guys fare when we take into account the number of at bats it took them to amass these totals. I'm a bit torn about using at bats, because I don't really think it accurately captures "homerun chances", but neither does plate appearances (intentional walks get in the way there, we are talking about homerun hitters after all). For now I will use at bats. Maybe when I have enough time I'll come up with my own metric, or maybe a better one already exists.

select m.nameFirst, m.nameLast, sum(b.AB) / sum(b.HR) as abPerHR
from master m, battingpost b
where b.playerID = m.playerID
and b.HR > 0
group by b.playerID
order by abPerHR
limit 10;
+-----------+----------+---------+
| nameFirst | nameLast | abPerHR |
+-----------+----------+---------+
| Jim | Mason | 1.0000 |
| Bobby | Kielty | 1.0000 |
| Greg | Myers | 1.0000 |
| Geoff | Blum | 1.0000 |
| Matt | Stairs | 1.0000 |
| Tom | Wilson | 1.0000 |
| George | Shuba | 1.0000 |
| Chuck | Essegian | 1.5000 |
| Ed | Sprague | 2.0000 |
| So | Taguchi | 2.0000 |
+-----------+----------+---------+
10 rows in set (0.05 sec)

Well that isn't what we are looking for. Obviously we need to put some sort of lower limit on the number of homeruns hit. Let's try five.

select m.nameFirst, m.nameLast, sum(b.AB) / sum(b.HR) as abPerHR, sum(b.HR) HR
from master m, battingpost b
where b.playerID = m.playerID
and (select sum(HR)
from battingpost b_sub
where b_sub.playerID = b.playerID) > 5
group by b.playerID
order by abPerHR
limit 10;

+-----------+-----------+---------+------+
| nameFirst | nameLast | abPerHR | HR |
+-----------+-----------+---------+------+
| Carlos | Beltran | 7.4545 | 11 |
| Jim | Leyritz | 7.6250 | 8 |
| Juan | Gonzalez | 7.7500 | 8 |
| Troy | Glaus | 8.2222 | 9 |
| Babe | Ruth | 8.6000 | 15 |
| Bob | Robertson | 8.8333 | 6 |
| Gary | Matthews | 9.2857 | 7 |
| B.J. | Upton | 9.4286 | 7 |
| Matt | Holliday | 9.5000 | 6 |
| Ken | Caminiti | 9.8750 | 8 |
+-----------+-----------+---------+------+
10 rows in set (0.74 sec)

Impressive work by Carlo Beltran, and from our first list Babe Ruth isn't too far behind. Let's up our cutoff to thirteen so we get everyone from the original list.

select m.nameFirst, m.nameLast, sum(b.AB) / sum(b.HR) as abPerHR, sum(b.HR) HR
from master m, battingpost b
where b.playerID = m.playerID
and (select sum(HR)
from battingpost b_sub
where b_sub.playerID = b.playerID) >= 13
group by b.playerID
order by abPerHR;

+-----------+-----------+---------+------+
| nameFirst | nameLast | abPerHR | HR |
+-----------+-----------+---------+------+
| Babe | Ruth | 8.6000 | 15 |
| Jim | Thome | 12.1765 | 17 |
| Mickey | Mantle | 12.7778 | 18 |
| Manny | Ramirez | 14.1379 | 29 |
| Alex | Rodriguez | 15.3077 | 13 |
| Albert | Pujols | 15.3077 | 13 |
| Reggie | Jackson | 15.6111 | 18 |
| Jim | Edmonds | 17.6923 | 13 |
| Bernie | Williams | 21.1364 | 22 |
| Chipper | Jones | 25.6154 | 13 |
| Derek | Jeter | 27.9500 | 20 |
| David | Justice | 28.4286 | 14 |
+-----------+-----------+---------+------+
12 rows in set (0.71 sec)

Babe Ruth is on top as is to be expected. I no longer limited the list to 10, so we see that Derek Jeter and David Justice fall to the bottom of the list and Jim Edmonds and Albert Pujols break into the mix. But wait, what about Jayson Werth? This data is from 2009, so his 2 homeruns this year haven't been added yet. Werth has 13 homeruns in 153 at bats, which comes to 11.7692 at bats per homerun. That's good enough for second on this list, but well behind Carlos Beltran. By the way, Jeter had 0 homeruns in 30 at bats this postseason, A-Rod had 0 in 32, and Jim Thome had 0 in 10. So their stock all fell.

So who is the best postseason homerun hitter? Manny's 29 has to be respected and Carlos Beltran's 7.5 at bats per homerun with a total of 11 is extremely impressive, but in my opinion Babe Ruth's consistent power takes the cake.

Saturday, October 23, 2010

Validating the Database

Now that I have my database setup it is time to do a bit of validation of the data. The first thing I'll do is confirm Albert Pujols's homerun total.

select count(*)
from master m, eventlogs e
where m.retroID = e.batter
and m.nameFirst = "Albert"
and m.nameLast = "Pujols"
and e.hit_value = 4;

The master database, among other things, holds the IDs for players, managers, umpires, etc used in various systems. Since I'm querying my eventlogs database, which uses data provided by Retrosheet, I need to use retroID. The hit_value column corresponds to the number of bases the batter got for his hit. Since I'm looking for homeruns, I need to filter by the value 4. Here is the result:

+----------+
| count(*) |
+----------+
| 366 |
+----------+
1 row in set (49.11 sec)

A quick check of Baseball Reference shows that Albert Pujols has 408 homeruns for his career. However, my database currently only has data through 2009, so I need to subtract the 42 homeruns he had in 2010. 366, it works!!

Now let's go back at bit. The eventlogs database holds data from 1950 forward. Let's try calculating Lou Brock's stolen base total. This is going to be a bit more complicated. We first need to check whether Lou Brock is on base, so we will use the first_runner, second_runner, and third_runner fields. Then we need to check whether he stole a base in the event we are looking at, so we will use the SB_for_runner_on_1st_flag, SB_for_runner_on_2nd_flag, and SB_for_runner_on_3rd_flag fields. Additionally we can use the event_type field so that we only look at stolen base events. The value for stolen base events is 4. By combining this all together we get:

select count(*)
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and e.event_type = 4
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'));

+----------+
| count(*) |
+----------+
| 936 |
+----------+
1 row in set (51.88 sec)

Hmm, all the sources I've consulted show that Lou Brock stole 938 bases. So what is going on? Is something wrong with our query? Is something wrong with our data? Let's first break this down by year.

select substr(e.game_id, 4, 4) as year, count(*)
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and e.event_type = 4
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'))
group by year;

+------+----------+
| year | count(*) |
+------+----------+
| 1962 | 16 |
| 1963 | 24 |
| 1964 | 41 |
| 1965 | 63 |
| 1966 | 74 |
| 1967 | 52 |
| 1968 | 62 |
| 1969 | 53 |
| 1970 | 51 |
| 1971 | 64 |
| 1972 | 63 |
| 1973 | 70 |
| 1974 | 118 |
| 1975 | 56 |
| 1976 | 56 |
| 1977 | 35 |
| 1978 | 17 |
| 1979 | 21 |
+------+----------+
18 rows in set (55.25 sec)

A quick check shows that 1964 is the problem year. Baseball Reference says Lou had 43 stolen bases and our data says 41. Now let's see all 41 stolen base events we have in 1964 and figure out when the missing 2 occurred.

select e.game_id
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and e.event_type = 4
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'))
and substr(e.game_id, 4, 4) = "1964"
order by substr(e.game_id, 4);

+--------------+
| game_id |
+--------------+
| HOU196405020 |
| CIN196405241 |
| CIN196405242 |
| CHN196405280 |
| CHN196406040 |
| NYN196406110 |
| CHN196406120 |
| CHN196406130 |
| HOU196406160 |
| SLN196406180 |
| SLN196406230 |
| SLN196406230 |
| SLN196406281 |
| SLN196406290 |
| SLN196406300 |
| CIN196407050 |
| NYN196407110 |
| SLN196407170 |
| SLN196407210 |
| CHN196407290 |
| SLN196408010 |
| SLN196408040 |
| SLN196408070 |
| SLN196408090 |
| SLN196408100 |
| LAN196408150 |
| LAN196408150 |
| HOU196408180 |
| HOU196408190 |
| HOU196408190 |
| SFN196408220 |
| SFN196408220 |
| SLN196408240 |
| SLN196408280 |
| SLN196409020 |
| SLN196409040 |
| SLN196409060 |
| PHI196409090 |
| CHN196409110 |
| CIN196409191 |
| NYN196409220 |
+--------------+
41 rows in set (52.11 sec)

Comparing these results with the data at Baseball Reference shows that the missing stolen bases occured on April 25th at Philadelphia and June 12th at home as a Cub (the game_id is prefixed with the home team's abbreviation). So let's look at all of the stolen base events for those games.

select game_id, first_runner as 1st,
SB_for_runner_on_1st_flag as 1st_flag, second_runner as 2nd,
SB_for_runner_on_2nd_flag as 2nd_flag, third_runner as 3rd,
SB_for_runner_on_3rd_flag as 3rd_flag
from eventlogs e
where e.event_type = 4
and game_id in ("PHI196404250", "CHN196406120");

+--------------+----------+----------+----------+----------+-----+----------+
| game_id | 1st | 1st_flag | 2nd | 2nd_flag | 3rd | 3rd_flag |
+--------------+----------+----------+----------+----------+-----+----------+
| CHN196406120 | brocl102 | T | | F | | F |
| PHI196404250 | rodga101 | T | | F | | F |
| PHI196404250 | | F | cowab101 | T | | F |
+--------------+----------+----------+----------+----------+-----+----------+
3 rows in set (52.02 sec)

Hmm, on a whim what if we remove the event_type restriction?

select game_id, event_type as type, first_runner as 1st,
SB_for_runner_on_1st_flag as 1st_flag, second_runner as 2nd,
SB_for_runner_on_2nd_flag as 2nd_flag, third_runner as 3rd,
SB_for_runner_on_3rd_flag as 3rd_flag
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and game_id in ("PHI196404250", "CHN196406120")
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'));

+--------------+------+----------+----------+----------+----------+-----+----------+
| game_id | type | 1st | 1st_flag | 2nd | 2nd_flag | 3rd | 3rd_flag |
+--------------+------+----------+----------+----------+----------+-----+----------+
| CHN196406120 | 4 | brocl102 | T | | F | | F |
| CHN196406120 | 14 | | F | brocl102 | T | | F |
| PHI196404250 | 3 | brocl102 | T | | F | | F |
+--------------+------+----------+----------+----------+----------+-----+----------+
3 rows in set (52.42 sec)

Huh, stolen bases on non stolen base event types. 3 is strikeout and 14 is walk. I'm not sure if those are typos, or if my knowledge of the eventlog data is just lacking. Probably the latter, but nonetheless, if we go back to the original query and remove the event_type filter we should get the 938 we are looking for.

select count(*)
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'));

+----------+
| count(*) |
+----------+
| 938 |
+----------+
1 row in set (52.85 sec)

Wednesday, October 20, 2010

Setting Up The Database

In order to setup my database I relied heavily on Chapter 2 of Baseball Hacks. This book is a few years old, but it remains fairly relevant. I think it does a really good job of balancing baseball and programming and providing something for everybody no matter your level of expertise in either subject. Anyway, here is what you will need to do to set up a database like mine.

First off you need to get Perl and MySql. MySql is MySql, there is only one place to get it. You don't have all the options like you do with Perl. I'm using Windows, so for Perl I went with Strawberry Perl, but there are other options. If you want to run a web server too, there are package that combine MySql, Perl, and Apache (and maybe some other stuff too). The installs are pretty straightforward. The only thing I had trouble with was that MySql wasn't added to my path automatically, but that was easy to fix.

The first thing I tackled was Hack #10. If you don't have Baseball Hacks, you can still get a lot of helpful files at the book's website. This step is loading the database from Baseball Databank. This is pretty easy since Sean Lahman provides an *.sql file that can easily be imported. I logged into MySql, created a new database called baseball,:

create database baseball;

logged out, and imported the *.sql file:

mysql -u root -p -s baseball < /BDB-sql-2009-11-25.sql

That's it. Now you have a database with a whole host of information like you might see on the back of baseball card.

Next up is Hack #21. Completing this step will give you game logs from as far back as 1872. There aren't any individual stats, just team totals, batting lineups, etc. The first you need to do for this is go Retrosheet, download all of the game log files, and unzip them. Once you have them all unzipped, get rid of GL1871.txt. For whatever reason it is incomplete and will mess up your sql query if you include it. Next grab hack_21_game_log_header.csv from zip file on the Baseball Hacks website and copy that to a file called all.hdr.txt. Then concatenate all the game log files and append the result to all.hdr.txt. On Windows you would use this command:

copy *.TXT >> all.hdr.txt

Once that is complete, Baseball Hacks provides an excellent Perl script (in the same zip file from above) for generating sql table creation and import commands. Use it like so:

hack_20_check_field_sizes.pl -h -s "," -i all.hdr.txt > all.hdr.sql

Now just import the resulting sql file like before:

mysql -u root -p -s baseball < all.hdr.sql

Baseball Hacks loads all these different datasets into different databases. There is nothing wrong with that, but I kept mine all in the same one.

The last step is Hack #22. By far this step gave me the most trouble and I had to deviate quite a bit from the book. I'll try to document what I did, but I make no gurantees about how smoothly this will go, but the results are worth it. This step will give you event logs going back to 1950. Every plate appearance, every everything has an entry. From this you can build box scores for games, generate player totals at home for a particular season, against lefties, with RISP, whatever you want. This is the mother load.

Again the files needed for this step are available on Retrosheet. The Baseball Hacks script (hack_22_translate.pl) supposedly unzips the files for you, but I could get very little about this script to work. I ended up unzipping them myself. You will also need to a tool called bevent.exe to translate these files into comma separated lists. There is also a collection of tools called Chadwick, but they produce different (better) output than what the scripts in Baseball Hacks expect, so I didn't use them this time around. Perhaps I'll experiment in the future and post about it.

As I said the translate.pl script gave me all sorts of trouble, so what I ended up doing was modifying it to the point that it just printed out the bevent calls it needed to do and I pasted them into my command window. If you go this route you don't need to do it one by one. As long as each command is on a separate line you can select all, copy and paste. First however you need to copy the column names to the *.csv file you are going to use. There is a space in the file I just linked, so make sure you fix that and also make sure you have a return after the column names, otherwise your table definition command will be messed up. Another problem I had was that due to the size of my *.csv file (3Mb+) hack_20_check_field_sizes.pl took forever and in the end returned junk. So... I had to create a *.csv file that was generated using only some of the event files, created the *.sql file:

hack_20_check_field_sizes.pl -h -s "," -i temp.csv > pbp.sql

then edited it specify the right filenames and table names. Finally I was ready to import the data:

mysql -u root -p -s baseball < pbp.sql

But hold on, there's more. There are roster files that you also need to load into the database. This step went better. The hack_22_rosters.pl script worked much better, with only a bit of tweaking (the column names are messed up). However, now that I look at it I think I messed this up. But... I'm not really sure why this information is needed. I'll probably just drop this table add it back correctly if I feel like I need the information. I'm pretty sure however that I can get it elsewhere.

So, that is my database. By no means is this all the baseball data available for free on the Internet, but it is a very good starting point and I know it will keep me busy for a quite a while. If you choose to create your own database, I hope this helps at least a little bit. If I didn't make it clear, I think Baseball Hacks is a pretty good book to have around and it's only $5 used on Amazon right now. In the coming posts I will be exploring the data in my database and providing interesting queries I try out. I hope to show how fun and easy it is to discover interesting information about players once you have full control of the data.

Tuesday, October 19, 2010

First Post

Welcome to SaberNerd! As the name of this blog suggests, I am a fan of the type of baseball analysis called sabermetrics, but that is not the only thing this blog is going to focus on. While it will be stat-heavy, I will also focus on interesting historical aspects of baseball. One of the first such topics I plan to address is the evolution of the pitcher's mound. While most fans know about the mound being lowered prior to the 1969 season, many may not know that pitchers originally threw from flat ground and as late as the early 20th century there was no specification for mound height. However, in this post I want to introduce myself, explain my motivation for this blog, and describe a bit more about what you can expect in the coming posts.

I grew up in a Cardinals household, as did my dad (a result of the wide reach of KMOX). Based on this and when I was born, Ozzie Smith was my childhood hero. While the level of my attention to the team has vacillated throughout my life, St. Louis will always be my team. Perhaps prophetically, I also had a loose interest in the Twins due to Kirby Puckett and later Torii Hunter (from my home state of Arkansas) and rooted for them when they weren't playing the Cards. As of March 2010 I became a resident of Hopkins, MN (a suburb of the Twin Cities) and my first weekend here I saw an exhibition game between the Cardinals and Twins in the newly opened Target Field (I was wearing a Pujols jersey).

Another of my interests that has sustained itself since childhood is computers. In 2005 I graduated from the University of Tulsa with a degree in computer science. In the process I had developed a strong interest in artificial intelligence and the following fall I entered the AI masters program at the University of Georgia. By the next summer it was time for me to come up with a thesis proposal and was stumped. Having recently read Moneyball, I had a moment of inspiration. The result was a thesis comparing various machine learning techniques to predict player season totals for a handful of stats. I compared my results with the output of 3 publicly available projection systems. While I didn't beat the gold standard (Baseball Prospectus), I made an admirable showing.

After graduation I got a real job and my projection software gathered dust. I would get a new idea once in a while, code up something, get mediocre results, and not spend the time to improve them. Then a couple months ago I finally got around to reading Watching Baseball Smarter. It was an overall bad book with an interesting fact sprinkled here and there, but it piqued my interest enough to tackle Koppett's Concise History of Major League Baseball (still a work in process), which I had picked up on a whim at a library book sale a few months earlier. I also decided to watch Ken Burn's Baseball in conjunction. Additionally, somewhere in the middle of all that I became the last baseball fan to discover the great site Fangraphs.

Immersed in the history of my favorite sport and inspired by Fangraphs, I got the idea for this blog. I dug out Baseball Hacks, which I bought back in my thesis days, and built a respectable stats database (the subject of my next post). Early on this blog is going to focus on me exploring that new database. I'll provide detailed information on building the database, information on the tables and their contents, and interesting queries I build. I will also provide tidbits from my reading. As I get more comfortable with the database I plan to write a series of Sabermetrics 101 posts and ultimately tackle questions like "How many bases would Rickey Henderson have stolen if he started his career in 1935?". As I mentioned before, I will also be writing historical essays that don't directly relate to statistics. Lastly I will be reviving my projection software and will provide results prior to the next season.

I've laid out a pretty arduous plan for this blog, but I'm excited for the challenge. My hope is that my excitement for my topics will come across in the posts I write and that I will be able to develop an audience of like-minded individuals. Enjoy!