On SABR-L, Ralph Caola found that the linear equation
Winning Percentage = ½ + u , (1)
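is a good approximation to Bill James' "Pythagorean Win Formula." Here u = (X - Y)/(X + Y), where X is the number of runs a team scores and Y is the number of runs it allows.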
This brief note shows that you can get a very similar result for any win-formula which only depends on u. Note that this includes James' Generalized Pythagorean formula,
Winning Percentage = X^{α} / (X^{α} + Y^{α}) , (2)
where α is a number chosen to give the best fit to the data for a given set of baseball teams. For the 1981 MLB season James found that the best value for α was 1.83.
Formula (2) doesn't look like it can be written in terms of u, but
X = ½ (X+Y) (1 + u) , (3)
and
Y = ½ (X+Y) (1 - u) , (4)
so it's easy to see that (2) can be written as
Winning Percentage = (1 + u)^{α} / [(1 + u)^{α} + (1 - u)^{α}] . (5)
The Generalized Pythagorean Win Formula is a function of u = (X-Y)/(X+Y).
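As a quick sanity check on the step from (2) to (5), here is a minimal Python sketch that evaluates both forms and confirms they agree. The function names and the run totals (800 scored, 700 allowed) are made up for illustration; they are not taken from the 2003 data.

```python
def relative_runs(x, y):
    """u = (X - Y) / (X + Y) for runs scored x and runs allowed y."""
    return (x - y) / (x + y)

def pythagorean(x, y, alpha=2.0):
    """Formula (2): winning percentage from runs scored and allowed."""
    return x**alpha / (x**alpha + y**alpha)

def pythagorean_u(u, alpha=2.0):
    """Formula (5): the same quantity written in terms of u alone."""
    return (1 + u)**alpha / ((1 + u)**alpha + (1 - u)**alpha)

# Hypothetical run totals, chosen only for illustration.
x, y = 800.0, 700.0
u = relative_runs(x, y)

# James' 1981 exponent, alpha = 1.83; the two numbers printed should match.
print(pythagorean(x, y, 1.83), pythagorean_u(u, 1.83))

# Doubling both run totals leaves u, and hence the result, unchanged.
print(pythagorean(2 * x, 2 * y, 1.83))
```

The common factor ½ (X+Y) from (3) and (4) cancels between numerator and denominator, which is why only u survives.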
So how well does (2) or (5) work? Well, look at Figure 1 below:
Figure 1: 2003 MLB Wins, Losses, and Runs
The x-axis of Figure 1 is the variable u, and the y-axis is the winning percentage of each team. The team in the lower left-hand corner is, of course, Detroit; the double dot at the very top of the right-hand side is the Yankees (left part of dot) and the Braves (right part of dot); and the dot furthest to the right is the Mariners. (See the raw data for this graph here.) The green curve, which you can hardly see, is the standard Pythagorean rule, (2) with α = 2. The blue curve is the best fit to the data, which for 2003 is α = 2.00791. There is very little difference between the two curves, and a straight line, such as (1), would fit the data just as well, as Caola found.
OK, why does the linear form work? Well, looking at the figure, the obvious reason is that there is enough scatter to allow you to fit just about any curve. But there's more to it than that. Let's assume that there is some magic formula, f(u) which relates the ratio u = (X-Y)/(X+Y) to winning percentage. That is, winning percentage doesn't care about the total number of runs scored, or the run distribution, but only u. This function f(u) doesn't have to look much like (2) or (5), but it has to follow certain rules:
f(-1) = 0 . (6)
f(1) = 1 . (7)
f(0) = ½ . (8)
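f'(u) > 0 , (-1 < u < 1) (9),
where f'(u) is the derivative of f with respect to u. This means that a curve drawn to fit the data in Figure 1 must always rise as it goes to the right. Finally, swapping runs scored and runs allowed must swap wins and losses, which gives the reciprocity condition
f(-u) = 1 - f(u) . (10)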
Of course, the Generalized Pythagorean Win formula (2) or (5) obeys these rules.
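The rules (6) through (10) can be spot-checked numerically. The sketch below tests a candidate f on an evenly spaced grid over [-1, 1]; the helper name check_rules, the grid size, and the tolerance are my own choices, and passing on a grid is evidence rather than a proof.

```python
def f(u, alpha=1.83):
    """Generalized Pythagorean formula (5), with James' 1981 exponent."""
    return (1 + u)**alpha / ((1 + u)**alpha + (1 - u)**alpha)

def check_rules(func, n=200, tol=1e-12):
    """Spot-check rules (6)-(10) for func on n+1 grid points in [-1, 1]."""
    grid = [-1 + 2 * k / n for k in range(n + 1)]
    return {
        # (6) and (7): f(-1) = 0 and f(1) = 1
        "endpoints": abs(func(-1.0)) < tol and abs(func(1.0) - 1) < tol,
        # (8): f(0) = 1/2
        "midpoint": abs(func(0.0) - 0.5) < tol,
        # (9): strictly rising from left to right (checked on the grid only)
        "increasing": all(func(a) < func(b) for a, b in zip(grid, grid[1:])),
        # (10): reciprocity, f(-u) = 1 - f(u)
        "reciprocal": all(abs(func(-u) - (1 - func(u))) < tol for u in grid),
    }

print(check_rules(f))                      # generalized Pythagorean formula
print(check_rules(lambda u: 0.5 + u))      # the linear form (1)
```

The linear form (1) passes every check except the endpoint rules, since it gives 1.5 at u = 1. That is a reminder that it is meant as an approximation near u = 0, not as a formula valid over the whole range.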
Now if you look at the data for 2003, you'll see that u is always between -0.23 and 0.12. Since u could range from -1 to 1, the actual range is rather small. In calculus, they always told us that this means that we can expand f(u) in a Taylor series about u = 0:
f(u) = ½ + f'(0) u + ½ f''(0) u^{2} + 1/6 f'''(0) u^{3} + ... (11)
In fact, because of (10), all even derivatives of f of order two and higher vanish at u = 0. We're left with
f(u) = ½ + f'(0) u + 1/6 f'''(0) u^{3} + O[u^{5}] , (12)
where O[u^{5}] means that the first term left out of (12) is some constant times u^{5}. Since |u| < ¼, this is going to be quite a small number for reasonable functions f(u).
Note that for the Generalized Pythagorean formula (2), we have
f(u) = ½ + ½ α u - 1/6 α (α^{2} - 1) u^{3} + O[u^{5}] . (13)
Formula (1) is just the first two terms of (13) with α = 2.
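To see how much each term of (13) matters over the range of u discussed here, the following sketch compares the exact formula (5) with its linear and cubic truncations. The function names and the sample values of u are illustrative choices of mine.

```python
def exact(u, alpha):
    """Generalized Pythagorean formula (5)."""
    return (1 + u)**alpha / ((1 + u)**alpha + (1 - u)**alpha)

def linear(u, alpha):
    """First two terms of (13)."""
    return 0.5 + 0.5 * alpha * u

def cubic(u, alpha):
    """First three terms of (13)."""
    return linear(u, alpha) - alpha * (alpha**2 - 1) * u**3 / 6

alpha = 2.0
for u in (0.05, 0.12, 0.25):
    e = exact(u, alpha)
    print(f"u={u:4.2f}  exact={e:.5f}  "
          f"linear error={linear(u, alpha) - e:+.5f}  "
          f"cubic error={cubic(u, alpha) - e:+.5f}")
```

For α = 2, formula (5) simplifies to ½ + u/(1 + u^{2}), which makes the numbers easy to check by hand. At u = ¼ the linear form overshoots by about 0.015, a bit over two wins in a 162-game season, while the cubic form is within about 0.001.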
For any reasonable Major League season, meaning no team worse than Detroit '03, u is going to be between -¼ and ¼. This means that |u^{3}| will be less than 1/64, and |u^{5}| will be less than 1/1024. Unless f'''(0) is a rather large number, the cubic and higher order terms in (12) are going to be really small, and we can just use the linear term. If there were a quadratic term in (12) then it might be significant, but the reciprocity condition (10) eliminates it. In most cases, then, a linear version of (12) is all that's necessary:
f(u) = ½ + f'(0) u . (14)
Caola's formula (1) is (14) with f'(0) = 1.
As an example, I refit the raw data for 2003 to both linear and cubic polynomials based on (12). The results are shown in Figure 2:
These fit at least as well as the Pythagorean formulas in Figure 1. The only thing the cubic term does is to make a better fit to the Tigers, but the linear estimate of Detroit's wins is no worse than it is for many other teams.
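For readers who want to try a fit of this kind without gnuplot, here is a minimal least-squares sketch in plain Python. It is not the script used for Figure 2. It assumes the intercept is held at ½ and the quadratic term is omitted, as (12) requires, and it uses synthetic data generated from the α = 2 Pythagorean formula on 30 evenly spaced values of u between -0.23 and 0.12, not the actual 2003 results.

```python
def fit_linear(us, ws):
    """Least-squares slope b for w = 1/2 + b*u (intercept fixed at 1/2)."""
    num = sum(u * (w - 0.5) for u, w in zip(us, ws))
    den = sum(u * u for u in us)
    return num / den

def fit_cubic(us, ws):
    """Least-squares (b, c) for w = 1/2 + b*u + c*u**3 (no quadratic term)."""
    s2 = sum(u**2 for u in us)
    s4 = sum(u**4 for u in us)
    s6 = sum(u**6 for u in us)
    t1 = sum(u * (w - 0.5) for u, w in zip(us, ws))
    t3 = sum(u**3 * (w - 0.5) for u, w in zip(us, ws))
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s2 * s6 - s4 * s4
    b = (t1 * s6 - t3 * s4) / det
    c = (s2 * t3 - s4 * t1) / det
    return b, c

# Synthetic "teams": NOT real data, just points on the alpha = 2 curve.
us = [-0.23 + 0.35 * k / 29 for k in range(30)]
ws = [(1 + u)**2 / ((1 + u)**2 + (1 - u)**2) for u in us]

print("linear slope:", fit_linear(us, ws))
print("cubic (b, c):", fit_cubic(us, ws))
```

With noise-free input like this, the fitted slope should land close to 1 and the cubic coefficient close to -1, the values (13) gives for α = 2. Real team data will scatter around the curve, so fitted coefficients there will differ.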
While a general winning percentage formula must follow all the rules given by equations (6) through (10), if we restrict ourselves to Major League Baseball it's unlikely that we'll ever see a season where one team has a value of |u| > ¼. As a result, we can always use a linearized winning percentage formula, or, at worst, a cubic form. The reciprocity constraint (10) says that such an equation must have the form (12), i.e., there is no quadratic term. Since the cubic term is quite small, a linear equation will be adequate. If we don't want to do a fit to the data, equation (1) is a reasonable approximation.
Of course this works only for teams which are at approximately the same ability level (where Tigers ≅ Yankees). If we're talking 1927 Yankees against the Little Sisters of the Poor, all bets are off, especially if Sister Theresa hangs her curve ball.
Finally, this works only if the One True Winning Formula (TM) only depends on the relative number of runs scored, (X-Y)/(X+Y). If there is some dependence on the total number of runs scored, X+Y, or on how the runs are scored, e.g., in big innings or a run at a time, then we'll have to look at equations other than (14).
Plots and fits were made with gnuplot, and converted to PNG format using the convert utility in the ImageMagick package.
These opinions are mine. My family, friends, employers, co-workers, neighbors, and pets may share these opinions, but they probably don't.
Current URL:
http://www.rcjhawk.us/baseball/linwin/index.html