Saturday, October 23, 2010

Validating the Database

Now that I have my database setup it is time to do a bit of validation of the data. The first thing I'll do is confirm Albert Pujols's homerun total.

select count(*)
from master m, eventlogs e
where m.retroID = e.batter
and m.nameFirst = "Albert"
and m.nameLast = "Pujols"
and e.hit_value = 4;

The master database, among other things, holds the IDs for players, managers, umpires, etc used in various systems. Since I'm querying my eventlogs database, which uses data provided by Retrosheet, I need to use retroID. The hit_value column corresponds to the number of bases the batter got for his hit. Since I'm looking for homeruns, I need to filter by the value 4. Here is the result:

+----------+
| count(*) |
+----------+
| 366 |
+----------+
1 row in set (49.11 sec)

A quick check of Baseball Reference shows that Albert Pujols has 408 homeruns for his career. However, my database currently only has data through 2009, so I need to subtract the 42 homeruns he had in 2010. 366, it works!!

Now let's go back at bit. The eventlogs database holds data from 1950 forward. Let's try calculating Lou Brock's stolen base total. This is going to be a bit more complicated. We first need to check whether Lou Brock is on base, so we will use the first_runner, second_runner, and third_runner fields. Then we need to check whether he stole a base in the event we are looking at, so we will use the SB_for_runner_on_1st_flag, SB_for_runner_on_2nd_flag, and SB_for_runner_on_3rd_flag fields. Additionally we can use the event_type field so that we only look at stolen base events. The value for stolen base events is 4. By combining this all together we get:

select count(*)
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and e.event_type = 4
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'));

+----------+
| count(*) |
+----------+
| 936 |
+----------+
1 row in set (51.88 sec)

Hmm, all the sources I've consulted show that Lou Brock stole 938 bases. So what is going on? Is something wrong with our query? Is something wrong with our data? Let's first break this down by year.

select substr(e.game_id, 4, 4) as year, count(*)
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and e.event_type = 4
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'))
group by year;

+------+----------+
| year | count(*) |
+------+----------+
| 1962 | 16 |
| 1963 | 24 |
| 1964 | 41 |
| 1965 | 63 |
| 1966 | 74 |
| 1967 | 52 |
| 1968 | 62 |
| 1969 | 53 |
| 1970 | 51 |
| 1971 | 64 |
| 1972 | 63 |
| 1973 | 70 |
| 1974 | 118 |
| 1975 | 56 |
| 1976 | 56 |
| 1977 | 35 |
| 1978 | 17 |
| 1979 | 21 |
+------+----------+
18 rows in set (55.25 sec)

A quick check shows that 1964 is the problem year. Baseball Reference says Lou had 43 stolen bases and our data says 41. Now let's see all 41 stolen base events we have in 1964 and figure out when the missing 2 occurred.

select e.game_id
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and e.event_type = 4
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'))
and substr(e.game_id, 4, 4) = "1964"
order by substr(e.game_id, 4);

+--------------+
| game_id |
+--------------+
| HOU196405020 |
| CIN196405241 |
| CIN196405242 |
| CHN196405280 |
| CHN196406040 |
| NYN196406110 |
| CHN196406120 |
| CHN196406130 |
| HOU196406160 |
| SLN196406180 |
| SLN196406230 |
| SLN196406230 |
| SLN196406281 |
| SLN196406290 |
| SLN196406300 |
| CIN196407050 |
| NYN196407110 |
| SLN196407170 |
| SLN196407210 |
| CHN196407290 |
| SLN196408010 |
| SLN196408040 |
| SLN196408070 |
| SLN196408090 |
| SLN196408100 |
| LAN196408150 |
| LAN196408150 |
| HOU196408180 |
| HOU196408190 |
| HOU196408190 |
| SFN196408220 |
| SFN196408220 |
| SLN196408240 |
| SLN196408280 |
| SLN196409020 |
| SLN196409040 |
| SLN196409060 |
| PHI196409090 |
| CHN196409110 |
| CIN196409191 |
| NYN196409220 |
+--------------+
41 rows in set (52.11 sec)

Comparing these results with the data at Baseball Reference shows that the missing stolen bases occured on April 25th at Philadelphia and June 12th at home as a Cub (the game_id is prefixed with the home team's abbreviation). So let's look at all of the stolen base events for those games.

select game_id, first_runner as 1st,
SB_for_runner_on_1st_flag as 1st_flag, second_runner as 2nd,
SB_for_runner_on_2nd_flag as 2nd_flag, third_runner as 3rd,
SB_for_runner_on_3rd_flag as 3rd_flag
from eventlogs e
where e.event_type = 4
and game_id in ("PHI196404250", "CHN196406120");

+--------------+----------+----------+----------+----------+-----+----------+
| game_id | 1st | 1st_flag | 2nd | 2nd_flag | 3rd | 3rd_flag |
+--------------+----------+----------+----------+----------+-----+----------+
| CHN196406120 | brocl102 | T | | F | | F |
| PHI196404250 | rodga101 | T | | F | | F |
| PHI196404250 | | F | cowab101 | T | | F |
+--------------+----------+----------+----------+----------+-----+----------+
3 rows in set (52.02 sec)

Hmm, on a whim what if we remove the event_type restriction?

select game_id, event_type as type, first_runner as 1st,
SB_for_runner_on_1st_flag as 1st_flag, second_runner as 2nd,
SB_for_runner_on_2nd_flag as 2nd_flag, third_runner as 3rd,
SB_for_runner_on_3rd_flag as 3rd_flag
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and game_id in ("PHI196404250", "CHN196406120")
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'));

+--------------+------+----------+----------+----------+----------+-----+----------+
| game_id | type | 1st | 1st_flag | 2nd | 2nd_flag | 3rd | 3rd_flag |
+--------------+------+----------+----------+----------+----------+-----+----------+
| CHN196406120 | 4 | brocl102 | T | | F | | F |
| CHN196406120 | 14 | | F | brocl102 | T | | F |
| PHI196404250 | 3 | brocl102 | T | | F | | F |
+--------------+------+----------+----------+----------+----------+-----+----------+
3 rows in set (52.42 sec)

Huh, stolen bases on non stolen base event types. 3 is strikeout and 14 is walk. I'm not sure if those are typos, or if my knowledge of the eventlog data is just lacking. Probably the latter, but nonetheless, if we go back to the original query and remove the event_type filter we should get the 938 we are looking for.

select count(*)
from master m, eventlogs e
where m.nameFirst = "Lou"
and m.nameLast = "Brock"
and ((e.first_runner = m.retroID and SB_for_runner_on_1st_flag = 'T')
or (e.second_runner = m.retroID and SB_for_runner_on_2nd_flag = 'T')
or (e.third_runner = m.retroID and SB_for_runner_on_3rd_flag = 'T'));

+----------+
| count(*) |
+----------+
| 938 |
+----------+
1 row in set (52.85 sec)

2 comments:

  1. Great stuff so far Arlo. Although I don't understand any of the technical aspects of your work, your passion still comes through your writing. You should really use this blog to help you develop a presentation for the 2012 SABR convention which is being held in Minneapolis.

    Andy

    ReplyDelete
  2. Oh wow, thanks for the heads up. I had no idea.

    ReplyDelete