Baseball Player Won-Loss Records

Constructing Ballpark-Specific Base-Out Transition Matrices

The probability of winning, given a base-out-score-inning state, is very much dependent on time and place. Individual positive offensive events are less valuable in higher-offense environments than in lower-offense environments. Hence, the win probability matrix used for a particular baseball game should be specific to the league (by which I mean a league-season, so that, for example, the 2004 American League is distinct from both the 2004 National League and the 2003 American League) and to the ballpark in which the game takes place. For example, by my estimates, a team that trailed by one run entering the bottom of the ninth inning had a 30% chance of winning if the game was played in Coors Field in 2000, but only a 14.5% chance of winning the same game if it was played in Dodger Stadium in 1968.*

*The percentages here are for an average team in the 2000 National League and 1968 National League, respectively, not specifically for the 2000 Rockies and 1968 Dodgers.

I construct a unique base-out transition matrix – from which I construct a unique win-probability matrix – for each ballpark, in each year, in each league that I consider.

Why 1 Year?
The use of one-year ballpark factors is fairly widely disdained in sabermetric circles as being generally inappropriate because of the large degree of noise which is inherent in a single year’s worth of data. If one’s primary purpose is prospective, then I think that this is probably true. Even if one’s primary purpose is explanatory, if one is only considering a single-value run factor along the lines of “Ballpark A increases runs scored by 5%,” then the noise inherent in a single season of data might well be sufficiently large that one would be better off using a multi-year park factor.

In fact, however, I am doing neither of these things. My purpose is explanatory – I am measuring the value of what actually happened – but I am looking at a much finer level of detail of data than simply looking at runs scored. As such, my “sample size” for a particular ballpark is not the 800 – 900 runs that were scored at that ballpark that year, but instead the 6,000 – 7,000 plate appearances that took place at that ballpark that year.

The purpose of this project is to measure the actual value of individual baseball players. That actual value, however, depends on what a ballpark actually did, not what it averaged. Technically, day-specific base-out transition matrices would be appropriate if possible, but, alas, they’re not. There are many reasons why the run-scoring environment of a ballpark may change from year to year, including:

(1) the league’s run-scoring environment may change,

(2) the efficiency of run-scoring may change (i.e., the expected runs (XR, RC, whatever) may not change, but the actual runs do, perhaps because teams hit better or worse than expected with runners in scoring position, for example),

(3) the conditions of the ballpark may change (wind, temperature, change in field dimensions), or

(4) hitters (or pitchers) may simply perform somewhat differently from one year to another.

For many purposes, it may be desirable to try to remove some of these reasons, particularly numbers 2 and 4. For my purposes, however, ALL of these reasons are valid reasons and will legitimately affect the win probabilities.

Having said that, it is still extremely important to attempt to control for anomalous results as much as possible in constructing ballpark-specific base-out transition matrices, and it is important to utilize as much data as possible. The technique I use to construct ballpark-specific base-out transition matrices is outlined below.

Constructing a Ballpark-Specific Base-Out Transition Matrix

Step 1

The first step in constructing a ballpark-specific base-out transition matrix is to construct a league-wide base-out transition matrix. Call this BOL.
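As a rough illustration of how such a matrix of counts might be tallied, the sketch below assumes a hypothetical list of plate-appearance records, each giving the state before and after the event. The state encoding is my assumption, not a detail confirmed by the text: 24 starting states (8 base configurations by 3 out counts) and 28 ending states, matching the 24-by-28 shape mentioned in Step 3, with the four extra ending columns left uninterpreted.

```python
N_START = 24  # 8 base configurations x 3 out counts
N_END = 28    # assumed: the 24 states above plus 4 additional ending states


def state_index(bases, outs):
    """Map a base configuration (bitmask 0-7) and an out count (0-2) to 0-23."""
    return outs * 8 + bases


def count_transitions(events):
    """Tally (start, end) state pairs into a 24-by-28 matrix of counts."""
    matrix = [[0] * N_END for _ in range(N_START)]
    for start, end in events:
        matrix[start][end] += 1
    return matrix


# Made-up plate appearances: (state before, state after).
events = [
    (state_index(0b000, 0), state_index(0b001, 0)),  # bases empty -> runner on first
    (state_index(0b001, 0), state_index(0b000, 2)),  # double play
    (state_index(0b000, 0), state_index(0b000, 1)),  # simple out
]
league_matrix = count_transitions(events)
```

Each cell holds a count of events rather than a probability, which is what makes the re-sizing in the later steps meaningful.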

For a particular ballpark, call it ballpark p, find all team combos that met in this ballpark as well as in at least one other ballpark. Note that teams that only played each other in one ballpark are not used in this calculation. In addition, inter-league games are not used here, since games played at American League ballparks use the designated hitter rule while games played at National League ballparks do not, which affects the relative run-scoring environments of the two leagues.*

*Technically, I consider all games played in American League ballparks to be “American League” games and all games played in National League ballparks to be “National League” games. Hence, inter-league games played between two teams at an American League ballpark are not considered to be in the same “league” as games played between the same two teams at a National League ballpark.
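The selection of team combinations could be sketched as follows. The game records, field names, and team and park codes are made up for illustration, and the list is assumed to have already been restricted to a single "league" in the sense of the footnote above.

```python
from collections import defaultdict


def qualifying_pairs(games, park):
    """Return team pairings that met in `park` and in at least one other park."""
    parks_by_pair = defaultdict(set)
    for game in games:
        # A pairing is unordered: who is home and who is away does not matter.
        pair = frozenset((game["home"], game["away"]))
        parks_by_pair[pair].add(game["park"])
    return {
        pair
        for pair, parks in parks_by_pair.items()
        if park in parks and len(parks) > 1
    }


# Hypothetical game log for one league.
games = [
    {"home": "FLA", "away": "MON", "park": "CHI"},
    {"home": "FLA", "away": "MON", "park": "MIA"},
    {"home": "MON", "away": "FLA", "park": "SJU"},
    {"home": "ATL", "away": "NYM", "park": "ATL"},
]
pairs = qualifying_pairs(games, "CHI")
```

In this toy log, the Atlanta pairing met in only one ballpark, so it would not qualify for that ballpark's calculation.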

Step 2

For each team combination within the same league which met in ballpark p and at least one other ballpark, calculate a base-out transition matrix for all of their games against each other in ballpark p. For teams j and k, call this BOpjk. Construct a second base-out transition matrix, then, for all games between teams j and k that did not take place in ballpark p. Call this matrix (BO’)pjk.

Step 3

Re-size each of these base-out transition matrices so that all of the BOpjk and (BO’)pjk, for all teams j and k, are the same size (by “size” I mean they should include the same number of events – i.e., plate appearances). That is, multiply each element of BOpjk by the ratio of the desired number of events (call it E) to the raw number of events in BOpjk. For example, suppose that BOpjk was a 3-by-3 matrix as shown below (in reality, of course, BOpjk will be a 24-by-28 matrix):


The size of BOpjk in this example is 60 (13+5+10+8+2+2+11+6+3). If the value of E to which one wanted to re-size this matrix was 12, then each element of this matrix would be multiplied by (12/60) = 0.2. Hence, the re-sized matrix would be the following:
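The arithmetic of this example can be reproduced with a short sketch. The nine counts are the ones listed in the sum above; arranging them row by row into a 3-by-3 matrix is my assumption, since the original figure is not reproduced here.

```python
def resize(matrix, target_events):
    """Scale every cell so that the cells sum to target_events."""
    total = sum(sum(row) for row in matrix)
    factor = target_events / total
    return [[cell * factor for cell in row] for row in matrix]


# The nine counts from the example, assumed to be laid out row by row.
raw = [
    [13, 5, 10],
    [8, 2, 2],
    [11, 6, 3],
]

resized = resize(raw, 12)  # factor = 12 / 60 = 0.2
for row in resized:
    print([round(cell, 1) for cell in row])
# [2.6, 1.0, 2.0]
# [1.6, 0.4, 0.4]
# [2.2, 1.2, 0.6]
```

The re-sized cells sum to 12, and the proportions between cells are unchanged.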

Step 4

Sum all BOpjk for ballpark p (the ballpark of interest here). Call this BOp. This is, in effect, a home-game transition matrix for ballpark p.

Re-size this sum, BOp, to be the same size as each of the (BO’)pjk. Sum all of the (BO’)pjk and the re-sized BOp, and call this (BO’)p. The home-game transition matrix, BOp, is included here with the same weight as other ballparks. This creates, in effect, a league-wide transition matrix for the teams that played in ballpark p, (BO’)p.
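A minimal sketch of Steps 3 and 4 together, using toy 2-by-2 matrices in place of the real 24-by-28 ones. The function and variable names are mine, and the counts are invented purely to show the mechanics.

```python
def total(matrix):
    return sum(sum(row) for row in matrix)


def resize(matrix, target_events):
    """Scale every cell so that the cells sum to target_events."""
    factor = target_events / total(matrix)
    return [[cell * factor for cell in row] for row in matrix]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def home_and_combined(in_park, out_of_park, events):
    """in_park / out_of_park: one count matrix per team pairing."""
    # Step 3: give every pairing's matrix the same size, E.
    in_park = [resize(m, events) for m in in_park]
    out_of_park = [resize(m, events) for m in out_of_park]
    # Step 4: the home matrix is the sum of the in-park matrices.
    home = in_park[0]
    for m in in_park[1:]:
        home = add(home, m)
    # The home matrix re-enters the combined sum with the weight of one pairing.
    combined = resize(home, events)
    for m in out_of_park:
        combined = add(combined, m)
    return home, combined


# Two hypothetical team pairings that met in ballpark p and elsewhere.
in_park = [[[6, 2], [1, 1]], [[3, 1], [1, 0]]]
out_of_park = [[[10, 6], [3, 1]], [[8, 8], [2, 2]]]
home, combined = home_and_combined(in_park, out_of_park, 100)
```

With two pairings and E = 100, the home matrix has size 200 and the combined matrix has size 300 (two out-of-park matrices plus the re-sized home matrix).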

Step 5

Now, re-size BOp and (BO’)p so that they are both the same size as BOL. The initial estimate of the base-out transition matrix for ballpark p is then equal to the following:

BOp = BOL + (BOp – (BO’)p)

In some cases, the difference (BOp – (BO’)p) may be very large relative to BOL for some cells. At the extreme, in fact, it is theoretically possible that (BOp – (BO’)p) may be a negative number which is greater in absolute value than BOL, so that some cells of BOp could become negative given this formula. This problem is avoided by restricting the maximum size of the (BOp – (BO’)p) term: in absolute value, the term cannot exceed T% of the corresponding cell of BOL. Here, T is equal to 1.96 times the (weighted) standard deviation of the percentage difference between (BOp – (BO’)p) and BOL for all cells across all of the ballparks of interest. The 1.96 figure was chosen because, for a data series which is normally distributed, 95% of all data points will be within 1.96 standard deviations of the mean (zero in this case).
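The capped adjustment can be sketched as below, again with toy 2-by-2 matrices and invented counts. The cap is passed in as a fraction (0.5 here, standing in for T%); the function name and the cell-by-cell clamping are my reading of the restriction described above.

```python
def total(matrix):
    return sum(sum(row) for row in matrix)


def resize(matrix, target_events):
    factor = target_events / total(matrix)
    return [[cell * factor for cell in row] for row in matrix]


def initial_park_matrix(league, home, combined, cap):
    """League matrix plus the (home - combined) difference, capped per cell."""
    size = total(league)
    home = resize(home, size)
    combined = resize(combined, size)
    result = []
    for l_row, h_row, c_row in zip(league, home, combined):
        row = []
        for l, h, c in zip(l_row, h_row, c_row):
            limit = cap * l  # largest allowed adjustment for this cell
            diff = max(-limit, min(limit, h - c))
            row.append(l + diff)
        result.append(row)
    return result


league = [[50, 30], [15, 5]]
home = [[6, 2], [1, 1]]        # re-sizes to [[60, 20], [10, 10]]
combined = [[10, 6], [3, 1]]   # re-sizes to [[50, 30], [15, 5]]
park = initial_park_matrix(league, home, combined, cap=0.5)
print(park)  # [[60.0, 20.0], [10.0, 7.5]]
```

Only the bottom-right cell is constrained in this toy case: the raw difference there is +5, but the cap allows at most 0.5 × 5 = 2.5. As long as the cap is below 100%, no cell can go negative.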

A unique value is calculated for T for every league. For the 2004 National League, for example, T had a value of 78%, so that no element of any ballpark-specific base-out transition matrix could be more than 78% greater or more than 78% less than the corresponding element of the league-wide base-out transition matrix. This restriction was binding in about 20% of all cases for the 2004 National League.*

*By construction, these 20% of cases account for approximately 5% of the total events within the league.
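One way the threshold T might be computed is sketched below. The text does not say what the weights are; weighting each cell by its league-wide count is purely my assumption (suggested by the footnote's remark that binding cells account for few events), as is treating the mean as exactly zero. The inputs are invented toy values.

```python
import math


def cap_threshold(league, diffs_by_park):
    """1.96 times the weighted standard deviation (around zero) of diff / league."""
    weighted_sq = 0.0
    weight_sum = 0.0
    for diff in diffs_by_park:
        for l_row, d_row in zip(league, diff):
            for l, d in zip(l_row, d_row):
                if l > 0:  # skip cells with no league-wide events
                    weighted_sq += l * (d / l) ** 2  # assumed weight: league count
                    weight_sum += l
    return 1.96 * math.sqrt(weighted_sq / weight_sum)


league = [[50, 30], [15, 5]]
# Hypothetical (home - combined) differences for two ballparks.
diffs_by_park = [
    [[10, -10], [-5, 5]],
    [[-10, 10], [5, -5]],
]
t = cap_threshold(league, diffs_by_park)
```

The result is a fraction (about 0.68 for these toy inputs), which would be read as T of about 68%.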

Step 6

Finally, after calculating values for BOp for every ballpark, all of the BOp matrices are summed, and re-sized, such that the size of the sum of the BOp matrices is equal to the size of BOL.

Let n equal the number of ballparks and let BOALL be the sum of the BOp. The final base-out transition matrix for ballpark p is then equal to the following:

BOp = BOp + (BOL - BOALL) / n

The final term, (BOL - BOALL) / n, is again restricted to be no greater (in absolute value) than the T figure constructed above.
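Read literally, this final step might look like the sketch below, with toy 2-by-2 matrices and invented counts. Interpreting the restriction as a cap of T% of the corresponding league cell, as in Step 5, is my assumption.

```python
def total(matrix):
    return sum(sum(row) for row in matrix)


def resize(matrix, target_events):
    factor = target_events / total(matrix)
    return [[cell * factor for cell in row] for row in matrix]


def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def tie_to_league(league, park_matrices, cap):
    """Apply the (league - all parks) / n correction to every park matrix."""
    n = len(park_matrices)
    summed = park_matrices[0]
    for m in park_matrices[1:]:
        summed = add(summed, m)
    summed = resize(summed, total(league))  # BOALL, same size as the league
    adjusted = []
    for park in park_matrices:
        rows = []
        for l_row, s_row, p_row in zip(league, summed, park):
            row = []
            for l, s, p in zip(l_row, s_row, p_row):
                limit = cap * l  # assumed: same per-cell cap as in Step 5
                correction = max(-limit, min(limit, (l - s) / n))
                row.append(p + correction)
            rows.append(row)
        adjusted.append(rows)
    return adjusted


league = [[50, 30], [15, 5]]
parks = [
    [[60, 20], [10, 10]],
    [[44, 36], [16, 4]],
]
final = tie_to_league(league, parks, cap=0.5)
print(final[0])  # [[59.0, 21.0], [11.0, 9.0]]
print(final[1])  # [[43.0, 37.0], [17.0, 3.0]]
```

Every ballpark receives the same cell-by-cell correction, which nudges the set of ballpark matrices toward the league-wide matrix.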

In words, what I do here is to construct a normalized base-out transition matrix for a ballpark and a normalized base-out transition matrix for all games played by the same teams at all of the ballparks at which they played (in effect, a ballpark-specific league-transition matrix). I then adjust the league base-out transition matrix by the difference between the ballpark-specific transition matrix and this latter transition matrix (i.e., the ballpark-specific league-transition matrix).

This basically becomes my ballpark-specific base-out transition matrix, subject to two general restrictions: (1) that the ballpark-specific matrix can’t be too different from the league-wide base-out transition matrix, and (2) that the sum of the ballpark-specific base-out transition matrices should be approximately equal to the league-wide base-out transition matrix.

Let me clarify this explanation with an example. For simplicity, I will use U.S. Cellular Field in Chicago within the 2004 National League. In 2004, the Florida Marlins hosted the Montreal Expos in two games which were moved to U.S. Cellular Field in Chicago because of a hurricane in Florida. Hence, U.S. Cellular Field only hosted two games in the 2004 National League. The base-out transition matrix from these two games would be the BOpjk matrix described earlier.

Florida and Montreal also played each other 17 other times at three different ballparks (Miami, Montreal, and San Juan). Base-out transition matrices for all of these ballparks are here.

A single base-out transition matrix is constructed for all of these 17 games played by Florida and Montreal outside of Chicago. This would be the (BO’)pjk matrix described above.

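Because Florida and Montreal were the only two teams to play there, the first of these matrices, BOpjk, is itself the home transition matrix, BOp, described above. The home transition matrix is re-sized and added to (BO’)pjk to produce the road transition matrix, (BO’)p, described above.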

These two matrices are then both re-sized to be the same size as the base-out transition matrix for the 2004 National League as a whole. These three base-out transition matrices, BOL, BOp, and (BO’)p can be found here.

The league-wide transition matrix is then adjusted by the difference between these two matrices, subject to the restrictions described above. The difference between the adjusted home and road transition matrices for U.S. Cellular Field in the 2004 National League is presented here. The numbers shown in bold here were greater in absolute value than the T% restriction described above. These values were therefore constrained in constructing the base-out transition matrix for U.S. Cellular Field for the 2004 National League.

The final step, then, is to sum up all of the ballpark-specific base-out transition matrices for a particular league and tie them back to the leaguewide base-out transition matrix. The final base-out transition matrix for U.S. Cellular Field in the 2004 National League is shown here.

Ballpark factors based on base-out transition matrices are explored in several additional articles. Ballpark Run Factors are explored here. The stability of these factors across seasons is explored here. Event-specific component park factors are explored here. A look at how the same ballpark played in different leagues is here. Finally, some park factors for ballparks that hosted only a handful of games are looked at here.
