Persistence Equations

(Factor A)_Even = a + b*(Factor A)_Odd

Interpretation of Persistence Equation

The coefficient b in the Persistence Equation measures the persistence of Factor A between the two samples (even plays vs. odd plays). The values of Factor A in the odd and even samples are both samples of Factor A's true value. Sample statistics tend to move toward their long-run value as the sample size increases, so an extreme value in one sample will usually be followed by a less extreme value in the other. Statisticians call this "regression to the mean".

The constant term, a, can be thought of as a measure of the extent to which Factor A regresses toward the mean. That is, setting a = (1-b)*(Factor A)_Baseline, one could re-write the Persistence Equation as follows:

(Factor A)_Even = b*(Factor A)_Odd + (1-b)*(Factor A)_Baseline

where (Factor A)_Baseline represents a baseline toward which Factor A regresses over time.
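
The link between the two forms of the equation can be sketched in a few lines of Python. This is only an illustration: the function names and the numbers (a = 0.3, b = 0.4) are hypothetical and are not estimates from the Player won-lost database. It assumes that a = (1-b)*(Factor A)_Baseline, so the baseline can be recovered as a/(1-b) whenever b is not equal to 1.

```python
def implied_baseline(a: float, b: float) -> float:
    """Baseline implied by matching a = (1 - b) * baseline."""
    if b == 1.0:
        raise ValueError("the baseline is undefined when b equals 1")
    return a / (1.0 - b)


def predict_even(odd_value: float, b: float, baseline: float) -> float:
    """Weighted average of the odd-sample value and the baseline."""
    return b * odd_value + (1.0 - b) * baseline


# Hypothetical coefficients, chosen only to illustrate the algebra.
a, b = 0.3, 0.4
baseline = implied_baseline(a, b)

# Both forms of the equation give the same prediction for any odd-sample value.
odd_value = 0.6
from_intercept_form = a + b * odd_value
from_baseline_form = predict_even(odd_value, b, baseline)
```

The smaller b is, the more weight falls on the baseline and the less on what the player actually did in the odd sample.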

There are two relevant results in interpreting the extent to which Factor A persists. The one most commonly used by sabermetricians is the correlation coefficient, r, or its square, r² (also written R²). R² measures the percentage of variation in the left-hand-side variable, (Factor A)_Even, that can be explained by the right-hand-side variable(s), here (Factor A)_Odd. This provides some indication of the magnitude of the persistence of Factor A.

To assess the significance of the persistence, however, one must look at the significance of the persistence coefficient, b. The estimated value of b will have a standard error associated with it. If one divides b by this standard error, the resulting value is called a t-statistic. The larger the t-statistic (in absolute value), the less likely that the true persistence coefficient is zero. As a (somewhat crude) rule of thumb, if the t-statistic is greater than 2, then we can be roughly 95% confident that the true value of b is different from zero (given that certain statistical assumptions about our equation are true).
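
Both results can be computed directly from the two samples. The sketch below uses NumPy and simulated data; the function name, the sample size, and the "true" coefficients are assumptions made for illustration and are not taken from the actual Player won-lost calculations. It fits the equation by least squares, then reports b, its standard error, the t-statistic, and R².

```python
import numpy as np


def fit_persistence_ols(odd, even):
    """Fit even = a + b*odd by least squares; return estimates and diagnostics."""
    odd = np.asarray(odd, dtype=float)
    even = np.asarray(even, dtype=float)
    n = odd.size
    X = np.column_stack([np.ones(n), odd])      # columns: constant, odd-sample value
    coef, *_ = np.linalg.lstsq(X, even, rcond=None)
    resid = even - X @ coef
    sigma2 = (resid @ resid) / (n - 2)          # residual variance, 2 estimated coefficients
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance matrix of (a, b)
    se_b = np.sqrt(cov[1, 1])
    ss_total = np.sum((even - even.mean()) ** 2)
    r2 = 1.0 - (resid @ resid) / ss_total
    return {"a": coef[0], "b": coef[1], "se_b": se_b,
            "t_b": coef[1] / se_b, "r2": r2}


# Simulated (hypothetical) data: 200 players, true a = 0.25 and true b = 0.5.
rng = np.random.default_rng(0)
odd = rng.normal(0.5, 0.05, size=200)
even = 0.25 + 0.5 * odd + rng.normal(0.0, 0.03, size=200)

result = fit_persistence_ols(odd, even)
significant = abs(result["t_b"]) > 2            # the rule of thumb described above
```

Note that R² and the t-statistic answer different questions: R² can be small even when the t-statistic is large, which would indicate persistence that is real but modest in size.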

Complications Estimating Persistence Equations

The basic Persistence Equation above:

(Factor A)_Even = a + b*(Factor A)_Odd

can be estimated by Ordinary Least Squares (OLS), which is one of the most basic statistical regression procedures out there. There are, however, two complications associated with estimating Persistence Equations.

The first issue is that, in order to ensure that the estimated value of b is not biased, the persistence equation should be fully specified. That is, if there are other variables that can be expected to affect (Factor A)_Even, these variables should be included on the right-hand side of the persistence equation along with (Factor A)_Odd. This is not a big deal for most of the Persistence Equations that I estimate here, but it can be an issue in regression analysis more generally and is always worth keeping in mind.

The second issue matters much more for the Persistence Equations that I estimate. The validity of OLS as an estimation technique is dependent on several assumptions about the distribution of the residual, or error, term in the persistence equation^*. One of these assumptions is that the variance of the error term is constant across all observations. That is, for example, OLS is only valid if the unexplained variation in player winning percentage is equal for all players. In this case, however, not only do we not want to assume this, but we actually know that it's wrong. Unexplained variation declines as the number of player games increases. Fortunately, there is a very easy way to adjust for this. Instead of OLS, I use Weighted Least Squares (WLS). This weights each observation by the number of player games over which the Factor has been compiled^**, squared^***. In this way, the results for players with more games played are weighted more heavily than the results for players with fewer games.

^* To be technically correct, the persistence equation should be written as follows:

(Factor A)_Even = a + b*(Factor A)_Odd + e

where e is the "error" or "residual" term that measures unexplained variation in (Factor A)_Even. The appropriateness of OLS is then dependent on a set of assumptions regarding the distribution of e.

^** The number of games is defined as the harmonic mean of the games over which (Factor A)_Odd and (Factor A)_Even are compiled.

^*** The decision to square the number of games in the weighting matrix was based on empirical experimentation, which considered several alternative weighting schemes based on the number of games (total games, the log of games, games squared, etc.).
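
The weighting scheme described above can be sketched as follows. This is a minimal illustration rather than the code actually used for these articles: the function names and the four-player data set are hypothetical, and it assumes that "weighting by games squared" means each squared residual is multiplied by the squared harmonic mean of the player's odd-sample and even-sample games.

```python
import numpy as np


def harmonic_mean_games(games_odd, games_even):
    """Harmonic mean of the games behind the odd and even samples."""
    g_odd = np.asarray(games_odd, dtype=float)
    g_even = np.asarray(games_even, dtype=float)
    return 2.0 * g_odd * g_even / (g_odd + g_even)


def fit_persistence_wls(odd, even, weights):
    """Choose a and b to minimise sum_i w_i * (even_i - a - b*odd_i)**2."""
    odd = np.asarray(odd, dtype=float)
    even = np.asarray(even, dtype=float)
    w = np.asarray(weights, dtype=float)
    X = np.column_stack([np.ones(odd.size), odd])
    # Scaling each row by sqrt(w_i) turns the weighted problem into plain least squares.
    root_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * root_w[:, None], even * root_w, rcond=None)
    return coef[0], coef[1]


# Hypothetical data for four players (not from the Player won-lost database).
odd = np.array([0.62, 0.48, 0.55, 0.51])
even = np.array([0.41, 0.52, 0.53, 0.50])
games_odd = np.array([10, 40, 80, 75])
games_even = np.array([12, 38, 82, 70])

games = harmonic_mean_games(games_odd, games_even)
weights = games ** 2                  # games squared, per the third footnote
a_wls, b_wls = fit_persistence_wls(odd, even, weights)
```

With these weights, the first player (about 11 games) has very little influence on the estimated a and b compared with the players who have roughly 40 to 80 games.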

Persistence Equations form the basis for dividing shared Player Game Points between batters and baserunners as well as between pitchers and fielders for several components. I also calculate and discuss specific Persistence Equations for Inter-Game Win Adjustments as a measure of "clutch".

Article last updated: July 15, 2019

All articles are written so that they pull data directly from the most recent version of the Player won-lost database. Hence, any numbers cited within these articles should automatically incorporate the most recent update to Player won-lost records. In some cases, however, the accompanying text may have been written based on previous versions of Player won-lost records. I apologize if this results in nonsensical text in any cases.
