Baseball Player Won-Loss Records

Most Similar Players
Determining the Most Similar Players

One of the pages that can be reached from a player page identifies the players most similar to that player. These pages are available through the link labeled "Most Similar Players to" on player pages (e.g., Most Similar Players to Eddie Murray).

Comparison Options
The default for my similar-player pages is to show the 10 players with the most similar careers. I allow for several different options, however. As with all pages which rely upon positional averages and/or replacement levels, I allow one to choose the time period over which positional averages are calculated as well as the methodology for calculating positional averages for starting versus relief pitchers. These choices are described here. Beyond that, there are several options which one can use to customize one's similarity scores.

First, one can vary the number of players shown. The default number of players shown is ten.

Second, one can compare Player won-lost records over a specific age range. The default is to calculate similarity scores based on career values.

Third, one can choose to include pWins (which tie to team wins and hence incorporate context) or not. The default option is to include pWins (over positional average and replacement level) in the comparison (along with eWins).

Fourth, one can normalize season lengths (to 162 games) for all players and/or extrapolate missing player games before making comparisons. The default option is to normalize season lengths to 162 games and to extrapolate missing games.
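As a rough illustration of season-length normalization, the sketch below scales a season total proportionally to a 162-game schedule. Proportional scaling is an assumption here, since the article does not spell out the exact adjustment, and the function name and parameters are hypothetical. Extrapolating a player's missing games is a separate step that this sketch does not attempt.

```python
def normalize_to_162(value: float, season_games: int, target_games: int = 162) -> float:
    """Scale a season total to a target schedule length.

    Assumption: simple proportional scaling. The article does not
    describe the exact adjustment, so this is only illustrative.
    """
    if season_games <= 0:
        raise ValueError("season_games must be positive")
    return value * target_games / season_games


# A total earned over a 154-game schedule, expressed on a 162-game basis.
scaled = normalize_to_162(15.4, 154)
```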

Finally, one can assign unique weights for the six factors that are used for comparison: Batting, Baserunning, Pitching, Fielding, eWins, and pWins. The default weights are one for each of the six factors. Note that if the "context" option is set to "n" (No), the weight on pWins will be zero, regardless of what is entered here.
All of these options can be selected on the "Most Similar Players" page by filling in the appropriate boxes and clicking the "Go" button.

For example, the five players most similar to Tim Raines from ages 21 - 27 (Raines's prime, 1981 - 1987), excluding context (i.e., only considering eWins), with seasons normalized to 162 games, are here.
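The options above can be pictured as a small settings object. This is a minimal sketch, not the site's actual implementation: the class name, field names, and method are all hypothetical, and only the defaults and the rule that turning off context zeroes the pWins weight come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# The six weightable factors named in the article.
FACTORS = ("Batting", "Baserunning", "Pitching", "Fielding", "eWins", "pWins")


@dataclass
class SimilarityOptions:
    """Hypothetical container for the comparison options (names are illustrative)."""
    num_players: int = 10                          # default: ten players shown
    age_range: Optional[Tuple[int, int]] = None    # None means full career
    include_context: bool = True                   # include pWins by default
    normalize_seasons: bool = True                 # normalize to 162 games
    extrapolate_missing: bool = True               # extrapolate missing games
    weights: Dict[str, float] = field(
        default_factory=lambda: {f: 1.0 for f in FACTORS}
    )

    def effective_weights(self) -> Dict[str, float]:
        """Return the weights actually used; pWins is forced to zero without context."""
        w = dict(self.weights)
        if not self.include_context:
            w["pWins"] = 0.0
        return w


# Settings matching the Tim Raines example: five players, ages 21 - 27, no context.
raines_options = SimilarityOptions(num_players=5, age_range=(21, 27), include_context=False)
```

With these settings, `raines_options.effective_weights()` keeps a weight of one on the first five factors and zero on pWins, whatever value was entered for it.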

Identifying Most Similar Players
Measures Being Compared
There are four basic factors for which players can accumulate Player wins and losses: Batting, Baserunning, Pitching, and Fielding. There are two dimensions across which player similarity can be measured: quality and quantity.

To identify a player's "most similars", I look at up to seven breakdowns of Player won-lost records (batting, baserunning, starting pitching, relief pitching, fielding, and overall player value both in and out of context) and at two measures: total wins and wins over some benchmark.

For batting and baserunning, the benchmark I look at is average non-pitcher winning percentage. It is necessary to exclude pitchers in order to accurately compare players in DH leagues with players in non-DH leagues.

For pitching, the benchmark I look at is wins over positional average. I calculate separate positional averages for starting vs. relief pitching. I therefore include two pitching measures (relative to benchmark): starting pitching wins over positional average and relief pitching wins over positional average. For total wins, however, I use a single measure of total pitching wins.

For fielding, the benchmark I use is replacement level. I chose this to allow comparisons across fielding positions that control for the relative difficulty of different positions. That is, an average (0.500) fielding shortstop is not comparable to an average (0.500) fielding left fielder; but a below-average fielding shortstop may be more comparable to an above-average fielding second or third baseman. For example, the players most similar to Derek Jeter are second basemen and his top-10 similars list includes at least one third baseman, one catcher, and one outfielder. (Derek Jeter had a fairly unique career as measured by Player won-lost records and has no players whose careers were truly "similar".)

For each of these four separate factors, I use expected (context-neutral) wins.

In addition to the four factors discussed above, I also look at overall Player won-lost records. Here I do not look at total wins, but instead at wins over both positional average and replacement level. For all comparisons, I look at expected (context-neutral) wins. If desired, the comparison can also include pWins over positional average (pWOPA) and replacement level (pWORL), which tie to team wins and reflect the context in which they were earned.
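One way to read the measures described in this section is as a flat list of comparison variables, each tied to one of the six weightable factors. The listing below is an interpretation rather than the author's stated data layout: the variable names (including eWOPA and eWORL, formed by analogy with pWOPA and pWORL) and the mapping to factors are assumptions.

```python
# (variable name, weight factor) pairs. All names are illustrative labels.
VARIABLES = [
    ("batting_wins", "Batting"),
    ("batting_wins_over_nonpitcher_avg", "Batting"),
    ("baserunning_wins", "Baserunning"),
    ("baserunning_wins_over_nonpitcher_avg", "Baserunning"),
    ("pitching_wins", "Pitching"),              # single total-wins measure
    ("starting_pitching_wopa", "Pitching"),     # over starter positional average
    ("relief_pitching_wopa", "Pitching"),       # over reliever positional average
    ("fielding_wins", "Fielding"),
    ("fielding_worl", "Fielding"),              # over replacement level
    ("eWOPA", "eWins"),
    ("eWORL", "eWins"),
    ("pWOPA", "pWins"),                         # only when context is included
    ("pWORL", "pWins"),
]


def active_variables(include_context: bool = True) -> list:
    """Return the variable names used, dropping the pWins pair without context."""
    return [name for name, factor in VARIABLES
            if include_context or factor != "pWins"]
```

Under this reading there are thirteen variables when context is included and eleven when it is not.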

Identifying Most Similar Players
For each of the (up to thirteen) variables identified above, I begin by calculating the totals for the player of interest over the age range of interest (e.g., Starlin Castro at ages 20 - 22). I then normalize all of these figures by dividing by the standard deviation for these figures across all player-seasons for which I have calculated Player won-lost records. This puts everything on the same scale so that, for example, baserunning is weighted the same as batting, and wins, wins over average, and wins over replacement level are all given equal weight. These figures serve as the baseline numbers against which all other players are then compared.
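The scaling step can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the dictionary layout are hypothetical, and the population standard deviation is used because the article does not say which variant applies.

```python
from statistics import pstdev


def stdev_by_variable(player_seasons):
    """Standard deviation of each variable across all player-seasons.

    player_seasons: list of dicts mapping variable name -> value.
    Assumption: population standard deviation (the article does not specify).
    """
    variables = player_seasons[0].keys()
    return {v: pstdev([row[v] for row in player_seasons]) for v in variables}


def normalize(totals, stdevs):
    """Divide each of a player's totals by that variable's standard deviation."""
    scaled = {}
    for v, value in totals.items():
        if stdevs[v] == 0:
            raise ValueError(f"variable {v!r} has zero spread and cannot be scaled")
        scaled[v] = value / stdevs[v]
    return scaled


# Toy numbers only: batting varies ten times as much as baserunning here.
seasons = [{"batting": 1.0, "baserunning": 0.1},
           {"batting": 3.0, "baserunning": 0.3}]
stdevs = stdev_by_variable(seasons)
baseline = normalize({"batting": 4.0, "baserunning": 0.4}, stdevs)
```

After scaling, the toy batting and baserunning totals land on the same scale, which is the point of the normalization: neither factor dominates simply because its raw numbers are larger.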

To find the "most similar" players, the same figures are calculated for every other player for whom I have calculated Player won-lost records over the age range of interest. For a given player and each measure, I calculate the difference between that player's value and the baseline value calculated above, and square that difference. Squaring the difference has two effects. First, it treats a value slightly higher than the baseline value the same as a value slightly lower than the baseline value. Second, it spreads out the scale, increasing the penalty for being very different. In this way, being a little bit different at everything will produce greater similarity than being identical at some things but very different at some other things.
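A small made-up example shows the effect of squaring. Both hypothetical candidates below are off from the baseline by the same total absolute amount, but the one that is far off on a single measure ends up with a much larger sum of squared differences. The numbers and names are invented for illustration.

```python
def squared_differences(candidate, baseline):
    """Squared difference between candidate and baseline on each measure."""
    return [(c - b) ** 2 for c, b in zip(candidate, baseline)]


baseline = [0.0, 0.0, 0.0, 0.0]

# Off by one unit (in either direction) on every measure.
a_little_off_everywhere = [1.0, -1.0, 1.0, -1.0]
# Identical on three measures, off by four units on the last.
far_off_on_one = [0.0, 0.0, 0.0, 4.0]

spread_total = sum(squared_differences(a_little_off_everywhere, baseline))
concentrated_total = sum(squared_differences(far_off_on_one, baseline))
```

Here `spread_total` is 4.0 and `concentrated_total` is 16.0, so the first candidate counts as the more similar player.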

For every player, a weighted sum of these squared differences is then calculated, using the selected weights as described above. Players are then sorted by this weighted sum of squared differences. The n players with the smallest sums are presented as the "Most Similar" players, where n is chosen as described above.
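The ranking step might look something like the sketch below. It assumes all values have already been scaled as described earlier, and that each variable carries the weight of the factor it belongs to. The function name, argument layout, and player labels are hypothetical.

```python
import heapq


def most_similar(baseline, candidates, weights, n=10):
    """Return the n candidates with the smallest weighted sum of squared differences.

    baseline:   dict of variable -> scaled value for the player of interest
    candidates: dict of player name -> dict of variable -> scaled value
    weights:    dict of variable -> weight (assumed to come from the factor weights)
    """
    def distance(values):
        return sum(weights[v] * (values[v] - baseline[v]) ** 2 for v in baseline)

    scored = ((distance(values), name) for name, values in candidates.items())
    return [(name, d) for d, name in heapq.nsmallest(n, scored)]


# Toy data with invented players and values.
baseline = {"batting": 1.0, "fielding": 0.0}
candidates = {
    "Player A": {"batting": 1.1, "fielding": 0.0},
    "Player B": {"batting": 2.0, "fielding": 1.0},
    "Player C": {"batting": 1.0, "fielding": 0.5},
}
top_two = most_similar(baseline, candidates, {"batting": 1.0, "fielding": 1.0}, n=2)
```

With equal weights, Player A ranks first and Player C second. Setting the fielding weight to zero would move Player C to the top, since that player's batting value matches the baseline exactly.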

Example Sets of Most Similar Players
I'll conclude this article with a few example sets of most similar players.

Example 1: Top 10 Most Similar Players to Jim Rice, full career, all factors equally weighted, including pWins, seasons not normalized