Baseball Player Won-Loss Records



Statistical Calculations: Variance, Standard Deviation, and Correlation

Basic Formulas

Three basic statistical measures which I use quite a bit are variance, standard deviation, and correlation.

Variance measures the spread of a series: it is the average squared difference between each observation and the mean of the series. If a series is normally distributed, approximately 68% of all observations will fall within one standard deviation of the mean of the series, 95% within two standard deviations, and 99.7% within three standard deviations.

The conventional formula for the variance of a series, xi, with N observations and with mean m is the following:

$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{N} (x_i - m)^2}{N-1}$$

Standard deviation is equal to the square root of the variance.
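
As a rough illustration of the two formulas above, here is a minimal Python sketch. The function names and the sample winning percentages are made up for this example and do not come from the Player won-lost database.

```python
import math


def variance(xs):
    """Unweighted variance with the conventional N-1 denominator."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    m = sum(xs) / n  # mean of the series
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def std_dev(xs):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(xs))


# Hypothetical winning percentages, invented for illustration only.
win_pcts = [0.450, 0.500, 0.550]
print(variance(win_pcts))  # about 0.0025
print(std_dev(win_pcts))   # about 0.05
```

Python's standard `statistics.variance` and `statistics.stdev` use the same N-1 denominator, so they should agree with this sketch up to floating-point rounding.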

Correlation is a measure of how closely related two series are. Correlation, often identified by the letter r, takes on a value between -1 and 1. A correlation of 1 indicates that the two series, call them x1 and x2, are perfect positive linear functions of each other; that is, one could find constants a and b, with b > 0, such that x1 = a + b*x2. A correlation of -1 indicates that the two series are perfect negative linear functions of each other, so that x1 = a + b*x2 for some constants a and b, with b < 0. A correlation of zero indicates that there is no linear relationship between the two series; a nonlinear relationship may still exist.

The conventional formula for the correlation of series x1, with mean m1, and x2, with mean m2, is the following:

$$r_{12} = \frac{\sum_{i=1}^{N} (x_{1i} - m_1)(x_{2i} - m_2)}{\sqrt{\sum_{i=1}^{N} (x_{1i} - m_1)^2 \, \sum_{i=1}^{N} (x_{2i} - m_2)^2}}$$
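
The correlation formula above could be sketched in Python roughly as follows. The function name and the sample series are assumptions made for illustration. The example series are built so that x1 is an exact linear function of x2, which should give a correlation of 1 or -1 up to rounding.

```python
import math


def correlation(x1, x2):
    """Unweighted correlation between two series of equal length."""
    if len(x1) != len(x2):
        raise ValueError("series must have the same length")
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    # Numerator: sum of cross-products of deviations from the means.
    cross = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    # Denominator: square root of the product of the sums of squares.
    ss1 = sum((a - m1) ** 2 for a in x1)
    ss2 = sum((b - m2) ** 2 for b in x2)
    if ss1 == 0 or ss2 == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cross / math.sqrt(ss1 * ss2)


# Hypothetical series: x1 = a + b*x2 with a = 3 and b = 2 or b = -2.
x2 = [1.0, 2.0, 3.0, 4.0]
print(correlation([3 + 2 * v for v in x2], x2))  # about 1
print(correlation([3 - 2 * v for v in x2], x2))  # about -1
```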

Weighted Variance and Standard Deviation

The basic formulas for variance, standard deviation, and correlation shown above assume that every observation of a series carries equal weight. In many, perhaps most, cases, this is a perfectly reasonable assumption. In my work, however, where the observations are winning percentages (or wins, or any other measure, really) by players, this is not a reasonable assumption. In such a case, one should instead weight each player's total by the number of Player decisions accumulated by that player. This is actually quite easily done.

The formula for variance above in effect already takes a weighted average of (xi - m)^2, where the weight assigned to each observation is equal to 1/(N-1). Rather than constant weights, however, the more appropriate weight for each observation in this case is gi/G, where gi is the number of Player decisions associated with observation i and G is the total number of Player decisions across all i. That is, weighted variance is equal to the following:

$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{N} g_i (x_i - m)^2}{G}$$

Weighted standard deviation is then simply equal to the square root of weighted variance.
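
A minimal Python sketch of weighted variance and standard deviation might look like the following. Two things here are assumptions rather than anything stated in the text: the function names, and the choice to take m as the decision-weighted mean of the series (the text does not say which mean is used). The sample data are invented.

```python
import math


def weighted_variance(xs, gs):
    """Variance of xs, weighting observation i by gs[i] / sum(gs)."""
    if len(xs) != len(gs):
        raise ValueError("xs and gs must have the same length")
    total = sum(gs)  # G: total Player decisions across all observations
    if total <= 0:
        raise ValueError("total weight must be positive")
    # Assumption: m is the weighted mean, using the same weights.
    m = sum(g * x for x, g in zip(xs, gs)) / total
    return sum(g * (x - m) ** 2 for x, g in zip(xs, gs)) / total


def weighted_std_dev(xs, gs):
    """Square root of the weighted variance."""
    return math.sqrt(weighted_variance(xs, gs))


# Hypothetical players: winning percentages and Player decisions.
win_pcts = [0.400, 0.600]
decisions = [10, 30]
print(weighted_variance(win_pcts, decisions))  # about 0.0075
print(weighted_std_dev(win_pcts, decisions))
```

Note that with equal weights this sketch divides by N rather than N-1, so it will not exactly reproduce the unweighted formula for small N.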

Weighted Correlation

As with variance, the basic formula for correlation treats each observation of the two series, x1 and x2, as equal. The way to weight each observation differently is somewhat less obvious, however. This is because the weights in the unweighted correlation formula cancel out. In effect, the correlation formula above can be re-written as follows:

$$r_{12} = \frac{\sum_{i=1}^{N} \frac{1}{N-1} (x_{1i} - m_1)(x_{2i} - m_2)}{\sqrt{\sum_{i=1}^{N} \frac{1}{N-1} (x_{1i} - m_1)^2 \, \sum_{i=1}^{N} \frac{1}{N-1} (x_{2i} - m_2)^2}}$$

As with variance, this is, in effect, a weighted formula with equal weights of 1/(N-1) for each observation. Substituting in the same weights as for weighted variance, gi/G, produces the following formula for weighted correlation:

$$r_{12} = \frac{\sum_{i=1}^{N} \frac{g_i}{G} (x_{1i} - m_1)(x_{2i} - m_2)}{\sqrt{\sum_{i=1}^{N} \frac{g_i}{G} (x_{1i} - m_1)^2 \, \sum_{i=1}^{N} \frac{g_i}{G} (x_{2i} - m_2)^2}}$$
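
The weighted correlation formula above could be sketched in Python along these lines. As before, the function name and sample data are made up, and the use of decision-weighted means for m1 and m2 is an assumption, since the text does not specify how the means are computed.

```python
import math


def weighted_correlation(x1, x2, gs):
    """Correlation of x1 and x2, weighting observation i by gs[i] / sum(gs)."""
    if not (len(x1) == len(x2) == len(gs)):
        raise ValueError("x1, x2, and gs must have the same length")
    total = sum(gs)  # G: total Player decisions
    if total <= 0:
        raise ValueError("total weight must be positive")
    ws = [g / total for g in gs]  # weights gi/G, summing to 1
    # Assumption: the means are weighted with the same weights.
    m1 = sum(w * a for w, a in zip(ws, x1))
    m2 = sum(w * b for w, b in zip(ws, x2))
    cross = sum(w * (a - m1) * (b - m2) for w, a, b in zip(ws, x1, x2))
    ss1 = sum(w * (a - m1) ** 2 for w, a in zip(ws, x1))
    ss2 = sum(w * (b - m2) ** 2 for w, b in zip(ws, x2))
    if ss1 == 0 or ss2 == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cross / math.sqrt(ss1 * ss2)


# Hypothetical data: two measures for three players, plus decisions.
x1 = [0.450, 0.500, 0.600]
x2 = [0.470, 0.490, 0.580]
decisions = [20, 50, 30]
print(weighted_correlation(x1, x2, decisions))
```

Because the weights appear in both the numerator and the denominator, multiplying every gi by the same constant leaves the result unchanged, and equal weights give the same value as the unweighted formula.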

As a general rule, when I calculate variances, standard deviations, or correlations in any of my work, I will use weighted measures, using these formulas.

All articles are written so that they pull data directly from the most recent version of the Player won-lost database. Hence, any numbers cited within these articles should automatically incorporate the most recent update to Player won-lost records. In some cases, however, the accompanying text may have been written based on previous versions of Player won-lost records. I apologize if this results in nonsensical text in any cases.
