Wednesday, October 20, 2010

Setting Up The Database

In order to setup my database I relied heavily on Chapter 2 of Baseball Hacks. This book is a few years old, but it remains fairly relevant. I think it does a really good job of balancing baseball and programming and providing something for everybody no matter your level of expertise in either subject. Anyway, here is what you will need to do to set up a database like mine.

First off you need to get Perl and MySql. MySql is MySql, there is only one place to get it. You don't have all the options like you do with Perl. I'm using Windows, so for Perl I went with Strawberry Perl, but there are other options. If you want to run a web server too, there are package that combine MySql, Perl, and Apache (and maybe some other stuff too). The installs are pretty straightforward. The only thing I had trouble with was that MySql wasn't added to my path automatically, but that was easy to fix.

The first thing I tackled was Hack #10. If you don't have Baseball Hacks, you can still get a lot of helpful files at the book's website. This step is loading the database from Baseball Databank. This is pretty easy since Sean Lahman provides an *.sql file that can easily be imported. I logged into MySql, created a new database called baseball,:

create database baseball;

logged out, and imported the *.sql file:

mysql -u root -p -s baseball < /BDB-sql-2009-11-25.sql

That's it. Now you have a database with a whole host of information like you might see on the back of baseball card.

Next up is Hack #21. Completing this step will give you game logs from as far back as 1872. There aren't any individual stats, just team totals, batting lineups, etc. The first you need to do for this is go Retrosheet, download all of the game log files, and unzip them. Once you have them all unzipped, get rid of GL1871.txt. For whatever reason it is incomplete and will mess up your sql query if you include it. Next grab hack_21_game_log_header.csv from zip file on the Baseball Hacks website and copy that to a file called all.hdr.txt. Then concatenate all the game log files and append the result to all.hdr.txt. On Windows you would use this command:

copy *.TXT >> all.hdr.txt

Once that is complete, Baseball Hacks provides an excellent Perl script (in the same zip file from above) for generating sql table creation and import commands. Use it like so:

hack_20_check_field_sizes.pl -h -s "," -i all.hdr.txt > all.hdr.sql

Now just import the resulting sql file like before:

mysql -u root -p -s baseball < all.hdr.sql

Baseball Hacks loads all these different datasets into different databases. There is nothing wrong with that, but I kept mine all in the same one.

The last step is Hack #22. By far this step gave me the most trouble and I had to deviate quite a bit from the book. I'll try to document what I did, but I make no gurantees about how smoothly this will go, but the results are worth it. This step will give you event logs going back to 1950. Every plate appearance, every everything has an entry. From this you can build box scores for games, generate player totals at home for a particular season, against lefties, with RISP, whatever you want. This is the mother load.

Again the files needed for this step are available on Retrosheet. The Baseball Hacks script (hack_22_translate.pl) supposedly unzips the files for you, but I could get very little about this script to work. I ended up unzipping them myself. You will also need to a tool called bevent.exe to translate these files into comma separated lists. There is also a collection of tools called Chadwick, but they produce different (better) output than what the scripts in Baseball Hacks expect, so I didn't use them this time around. Perhaps I'll experiment in the future and post about it.

As I said the translate.pl script gave me all sorts of trouble, so what I ended up doing was modifying it to the point that it just printed out the bevent calls it needed to do and I pasted them into my command window. If you go this route you don't need to do it one by one. As long as each command is on a separate line you can select all, copy and paste. First however you need to copy the column names to the *.csv file you are going to use. There is a space in the file I just linked, so make sure you fix that and also make sure you have a return after the column names, otherwise your table definition command will be messed up. Another problem I had was that due to the size of my *.csv file (3Mb+) hack_20_check_field_sizes.pl took forever and in the end returned junk. So... I had to create a *.csv file that was generated using only some of the event files, created the *.sql file:

hack_20_check_field_sizes.pl -h -s "," -i temp.csv > pbp.sql

then edited it specify the right filenames and table names. Finally I was ready to import the data:

mysql -u root -p -s baseball < pbp.sql

But hold on, there's more. There are roster files that you also need to load into the database. This step went better. The hack_22_rosters.pl script worked much better, with only a bit of tweaking (the column names are messed up). However, now that I look at it I think I messed this up. But... I'm not really sure why this information is needed. I'll probably just drop this table add it back correctly if I feel like I need the information. I'm pretty sure however that I can get it elsewhere.

So, that is my database. By no means is this all the baseball data available for free on the Internet, but it is a very good starting point and I know it will keep me busy for a quite a while. If you choose to create your own database, I hope this helps at least a little bit. If I didn't make it clear, I think Baseball Hacks is a pretty good book to have around and it's only $5 used on Amazon right now. In the coming posts I will be exploring the data in my database and providing interesting queries I try out. I hope to show how fun and easy it is to discover interesting information about players once you have full control of the data.

No comments:

Post a Comment