Dash Davidson, Peter Rosenbloom and Jared Cross*

In creating Steamer Projections we analyzed how best to use historical statistics to predict future ones. We broke batting and pitching performance into a series of components and used regression analysis to find the most effective way to combine previous years’ performances and the league average (regression to the mean) to predict future performance.

Now that the season is over and we have concrete data on how our system has performed, we need to find out in what areas our projections were accurate, and in what areas we need to improve. We intend to add park factors and age effects to our system for the 2010 season but are there other improvements we need to make?

The best way to figure this out is to compare and contrast the results of our system with those of other, “rival” systems. One such comparison was already done for us. For our analysis, we examined key rate statistics for both hitters and pitchers (OPS and ERA) as well as key counting statistics (Wins, Ks, IP, HR, RBI, PA). We also decided to analyze our data from a purely fantasy baseball oriented standpoint, choosing the most pertinent fantasy-oriented stat, SGPs (Standings Gain Points), and seeing which of the eight systems offered the best projections for fantasy domination, and how and why.

**The forecasting systems:**

**Marcel** – The monkey. Uses three years' worth of data, weighting recent years more heavily, adjusting for age, and regressing to the mean. Marcel uses the same weights and age adjustments for each component. Marcel projected rate stats and used community forecasts for playing time.

**Steamer** – Our system. Steamer forecasts used the last 3 years' worth of data for hitters and pitchers. Our 2009 projections did not utilize park factors or minor league statistics. Steamer is similar to Marcel except that a) although we always weight recent seasons more heavily, we have different weights for each component and regress some components more heavily than others, and b) we did not take aging into account. Also, Steamer only projected rate stats (this includes things like “stolen base attempts per times on base” and RBI/AB); for playing time we used PECOTA's depth chart projections.

**PECOTA** – PECOTA not only creates a projection for player X, it retrospectively creates projections for the most X-like players in history and adjusts X's projection based on how all of the X-like players performed relative to their would-be projections. Fancy. The projections we used came from the PECOTA weighted mean spreadsheet, not the fantasy baseball depth charts.

**CHONE** – CHONE uses 4 years of data for hitters and 3 years for pitchers. It utilizes batted ball data (the numbers of line drives, pop ups, etc.), minor league statistics, and batters' weight, and it adjusts for league, park, and age.

**ZiPS** – Like CHONE, ZiPS uses 3 years of data for pitchers and 4 years of data for most hitters (3 years for players under 24 and over 38). It also uses minor league statistics and park factors but has different aging curves for different player types. It uses GB/FB and handedness to project pitchers' BABIP.

**Sporting News** – Sporting News publishes a widely used fantasy baseball guide each year. Although we don’t know this to be true, we suspect that their projections are created by an expert rather than a formula. This is the un-Marcel.

**Fantistics** – We analyzed Fantistics on the advice of Eric Mulkowsky, who said that this system was particularly good at projecting playing time.

If we run a similar analysis in future years, we will include OLIVER, CAIRO, Baseball Info Solutions (Bill James) and possibly Baseball HQ and other projection systems in the comparison.

**Missing data** – 475 hitters had 50 or more PA in 2009. 465 of these hitters had projections from each of the big three (CHONE, PECOTA, and ZiPS), and 438 were projected by Marcel. We looked at these 438 hitters. Projection systems that projected fewer players (Steamer and Sporting News were the main guilty parties) were given the Marcel projection for each player they missed. This allowed for a comparison of all 438 hitters across systems. Sporting News and Steamer only projected about 270 players each. Systems could still beat the monkey so long as the projections they actually made were better than Marcel's.
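As a rough sketch of this fallback step (the column names and values below are hypothetical, not our actual data), filling each system's missing projections with Marcel's is a one-liner in pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical projected SGPs per system; NaN where a system made no projection.
proj = pd.DataFrame({
    "marcel":  [4.1, 3.0, 5.2, 2.8],
    "steamer": [4.3, np.nan, 5.0, np.nan],  # Steamer skipped two players
})

# Fall back to Marcel wherever Steamer made no projection, so every
# system covers the same 438-player pool.
proj["steamer_filled"] = proj["steamer"].fillna(proj["marcel"])
```

With this in place, every system can be scored on an identical set of players.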

**Methods**

We’ll explain the details below but we should point out that we relied heavily on the methods outlined in the following forecast evaluation studies:

Nate Silver’s analysis of his 2007 hitter projections

Tom Tango’s comments on Nate Silver’s work and the ensuing thread.

An article from statspeak.net from March of 2009 that’s no longer available.

Let’s start with hitters and our measure of fantasy baseball goodness:

**Hitter SGPs**

| System | Avg SGP | Stdev SGP | R with actual | RMSE* | Uniqueness (R with Avg) |
|---|---|---|---|---|---|
| 2008 | 4.19 | 3.48 | 0.653 | 2.67 | – |
| Marcel | 4.64 | 2.31 | 0.697 | 2.52 | 0.961 |
| Steamer | 4.90 | 2.53 | 0.696 | 2.53 | 0.961 |
| PECOTA | 4.51 | 2.87 | 0.657 | 2.65 | 0.949 |
| Chone | 5.33 | 2.32 | 0.687 | 2.56 | 0.934 |
| ZiPS | 5.17 | 2.77 | 0.688 | 2.55 | 0.945 |
| Sporting News | 5.23 | 3.11 | 0.707 | 2.49 | 0.963 |
| Fantistics | 5.42 | 3.24 | 0.723 | 2.43 | 0.950 |
| Avg Projection | 5.02 | 2.60 | 0.729 | 2.41 | – |
| Actual (2009) | 4.25 | 3.52 | – | – | – |

We looked at root mean square error (RMSE) in addition to correlation (R) because correlation mostly tells us whether a system has players projected in the correct order. A system could wildly over- or under-estimate the variance in true performance levels and still be well correlated with actual results. Such a system would, however, have a high RMSE. Before finding the root mean square errors, all of the projections were “normalized,” meaning that each system's projections were multiplied by the ratio of the average actual SGPs to the average projected SGPs, so that the normalized systems all projected an average of 4.25 SGPs. We did this so that systems weren't overly punished for missing league offensive levels or for being optimistic/pessimistic. (We do the same normalization later on when looking at the RMSE in OPS.)
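The normalization and RMSE steps above can be sketched in a few lines; this is a minimal illustration of the method as described, not our actual evaluation code:

```python
import numpy as np

def normalized_rmse(projected, actual):
    """Scale projections so their mean matches the actual mean,
    then compute root mean square error against actual results."""
    projected = np.asarray(projected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Multiply by (avg actual / avg projected) so the system isn't
    # penalized for missing overall league offensive levels.
    scaled = projected * (actual.mean() / projected.mean())
    return np.sqrt(np.mean((scaled - actual) ** 2))
```

A system that gets every player's relative level right but runs uniformly optimistic scores a zero here, which is exactly the behavior the normalization is meant to produce.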

Since Marcel is smarter than most monkeys we know, we added a considerably dumber monkey, “2008,” who expects every player to perform at exactly the same level he did the year before. This dumber monkey indeed struggled while Marcel finished in the middle of the pack.

Also worth noting, each of the projection systems has a smaller standard deviation across its projections than the standard deviation of actual results from 2008 or 2009. This is as it should be. The projection systems are trying to forecast **true** talent, whereas the variance in actual results is a combination of the variance in true talent and the variance in luck.
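A quick simulation illustrates the point: even a system that nails every player's true talent will show a smaller spread than actual results, because observed variance is true-talent variance plus luck variance. (The talent distribution and 500-PA binomial seasons below are made-up numbers for illustration only.)

```python
import numpy as np

rng = np.random.default_rng(0)

# 438 hitters with normally distributed "true" rate talent (made-up spread).
true_talent = rng.normal(0.340, 0.020, size=438)

# Give each one a 500-PA season: observed rate = talent plus binomial luck.
pa = 500
observed = rng.binomial(pa, np.clip(true_talent, 0, 1)) / pa

# The observed results spread out more than true talent does, so even a
# perfect projection system should have a smaller stdev than actual results.
print(round(true_talent.std(), 4), round(observed.std(), 4))
```

So a projection system whose stdev matched the actual results' stdev would actually be overstating the spread in talent.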

You’ve probably also noticed that Fantistics beat the pants off the other systems with Sporting News finishing 2nd. Perhaps it shouldn’t be surprising that projection systems aimed exclusively at winning fantasy baseball leagues did the best at projecting fantasy baseball value.

And, if you want the best linear equation of these systems for projecting actual 2009 SGPs:

Actual = 0.527*Fantistics + 0.409*Chone + 0.243*SportingNews – 0.761 (R² = 0.547)

While Sporting News projected SGPs better than Chone, Chone added more information to Fantistics because Chone was the most unique system while Sporting News was the least unique system.
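A combination equation like the one above is just an ordinary least-squares fit of actual results on several systems' projections. This sketch (with hypothetical projection arrays) shows one way to recover such weights:

```python
import numpy as np

def best_linear_combo(actual, **systems):
    """Least-squares weights (plus intercept) for combining several
    systems' projections into a single forecast of actual results."""
    names = list(systems)
    # Design matrix: one column per system, plus a column of ones
    # for the intercept term.
    X = np.column_stack([systems[n] for n in names] + [np.ones(len(actual))])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(actual, dtype=float), rcond=None)
    return dict(zip(names + ["intercept"], coefs))
```

In practice the interesting part is which systems get nonzero weight: a unique system can earn a large coefficient even if its standalone correlation is lower.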

What systems were the most similar? Chone was the most similar to Zips (R = 0.935). Steamer was most similar to Marcel (R = 0.954). Pecota wasn’t that similar to any system but closest to Chone (0.902) and Zips (0.903). Surprisingly, Fantistics was most similar to Steamer (0.916) and Sporting News (0.913) and Sporting News was most similar to Marcel (0.948). Most dissimilar systems? Chone and Fantistics (0.856).
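Pairwise similarities like these are simply the correlation matrix of the systems' projections. With a hypothetical frame of per-player projections (one column per system), pandas computes the whole table at once:

```python
import pandas as pd

# Hypothetical projected SGPs, one column per system (illustrative values).
proj = pd.DataFrame({
    "chone":   [5.1, 2.3, 4.0, 6.2],
    "zips":    [5.0, 2.5, 3.8, 6.4],
    "steamer": [4.2, 3.1, 4.4, 5.5],
})

# Pearson correlation between every pair of systems: the "similarity" table.
similarity = proj.corr()
```

Scanning the off-diagonal entries of `similarity` for the largest and smallest values gives the most- and least-similar pairs directly.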

How robust was this result? Could Fantistics just have gotten lucky? One quick and dirty way to analyze this is to split the data into halves and see whether we would have come to the same conclusions looking at either half. We assigned a random number to each player and split our 438 players into two subsets of 219 players based on their random numbers.
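The split itself can be done with a single shuffle; here is a minimal version of that step (player indices stand in for the actual hitters):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

players = np.arange(438)             # stand-ins for the 438 hitters
shuffled = rng.permutation(players)  # random order, equivalent to random keys
subset1, subset2 = shuffled[:219], shuffled[219:]
```

Each system's correlation with actual results is then computed separately within `subset1` and `subset2` and compared across the halves.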

| System | R (subset 1) | R (subset 2) |
|---|---|---|
| 2008 | 0.635 | 0.670 |
| Marcel | 0.668 | 0.726 |
| Steamer | 0.680 | 0.712 |
| PECOTA | 0.639 | 0.675 |
| Chone | 0.679 | 0.695 |
| ZiPS | 0.672 | 0.705 |
| Sporting News | 0.694 | 0.721 |
| Fantistics | 0.708 | 0.738 |
| Avg Projection | 0.711 | 0.748 |

We could certainly come to different conclusions about the relative quality of Marcel, Steamer, and Sporting News depending on which half we look at. Neither our 2008 monkey nor PECOTA looks good in either half, Fantistics wins both halves by a solid margin, and for both halves we'd be best off taking the average of the projections.

In order to see why Fantistics was successful we need to look at how well each system did in projecting several other metrics.

**Hitter OPS**

| System | Avg OPS | Stdev OPS | R with actual | RMSE* | Uniqueness (R with avg) |
|---|---|---|---|---|---|
| 2008 | 0.763 | 0.183 | 0.401 | 1.84 | – |
| Marcel | 0.775 | 0.066 | 0.590 | 1.62 | 0.965 |
| Steamer | 0.780 | 0.066 | 0.567 | 1.66 | 0.955 |
| PECOTA | 0.763 | 0.084 | 0.564 | 1.66 | 0.926 |
| Chone | 0.770 | 0.075 | 0.638 | 1.55 | 0.954 |
| ZiPS | 0.770 | 0.083 | 0.623 | 1.57 | 0.964 |
| Sporting News | 0.781 | 0.074 | 0.568 | 1.65 | 0.945 |
| Fantistics | 0.787 | 0.086 | 0.583 | 1.63 | 0.925 |
| Avg Projection | 0.775 | 0.071 | 0.624 | 1.57 | – |
| Actual (2009) | 0.769 | 0.116 | – | – | – |

**Note, these numbers (with the exception of stdev) are all weighted by 2009 PA.**

While SGPs may be the most meaningful metric for fantasy players, OPS is likely seen as more important by sabermetricians. And here, Chone dominates with ZiPS coming in 2nd. Even more impressively, Chone actually beats the **average** projection. Fantistics and PECOTA get points for being the most unique systems. Steamer doesn't do that well here, beating only PECOTA, and there only by a nose. We have some serious work to do for the 2010 version. All systems badly beat our 2008 monkey, but only Chone and ZiPS beat Tango's monkey.

Chone and ZiPS are similar to each other here (R = 0.956), which makes sense given their methodologies. The simple systems, Marcel and Steamer, are also similar (0.970). Fantistics and Sporting News are actually both most similar to Marcel (0.883 and 0.937, respectively).

And, if you want the best equation to project 2009 OPS:

ActualOPS = 0.716*Chone + 0.199*ZiPS + 0.081 (R² = 0.414)

Although this equation doesn’t do much better than simply using Chone.

Looking at the same two subsets we used for SGPs we get:

| System | R (subset 1) | R (subset 2) |
|---|---|---|
| 2008 | 0.455 | 0.354 |
| Marcel | 0.607 | 0.571 |
| Steamer | 0.569 | 0.565 |
| PECOTA | 0.581 | 0.550 |
| Chone | 0.668 | 0.604 |
| ZiPS | 0.640 | 0.605 |
| Sporting News | 0.589 | 0.543 |
| Fantistics | 0.636 | 0.533 |
| Avg Projection | 0.642 | 0.602 |

I think this suggests that the difference between Chone and ZiPS might not be significant given our sample size. Chone and ZiPS are the top 2 systems for both subsets and beat PECOTA by a healthy margin for both pools of players. I would feel reasonably confident in saying that Chone and ZiPS are ahead of the pack right now but not at all confident in saying that Chone is better than ZiPS.

So, if Fantistics was only in the middle of the pack in projecting OPS, how did it dominate SGP’s?

**Hitter PA**

| System | Avg PA | Stdev PA | R w/ actual |
|---|---|---|---|
| 2008 | 367 | 215 | 0.642 |
| “Marcel” | 413 | 130 | 0.657 |
| “Steamer” | 419 | 133 | 0.666 |
| PECOTA | 416 | 150 | 0.565 |
| Chone | 486 | 95 | 0.580 |
| ZiPS | 457 | 135 | 0.583 |
| Sporting News | 442 | 158 | 0.694 |
| Fantistics | 448 | 181 | 0.721 |
| Avg Projection | 440 | 123 | 0.732 |
| Actual (2009) | 384 | 203 | – |

OK, so Chone and ZiPS aren't really trying to project playing time and, despite their excellence in projecting hitter quality (as evidenced by OPS), don't do well here. PECOTA doesn't try to project playing time in its weighted mean forecasts but does in its depth charts (used by Steamer). The community forecasts that Marcel used do reasonably well in projecting playing time, but not as well as the fantasy baseball gurus. Limiting this to the systems that try to project playing time (and using their proper names this time) we have:

| System | R with actual |
|---|---|
| Community Forecasts | 0.657 |
| Pecota Depth Charts for Fantasy | 0.666 |
| Sporting News | 0.694 |
| Fantistics | 0.721 |

Fantistics really does excel at projecting playing time. One advantage they have is that they update their projections throughout the offseason, and these playing time projections are from immediately prior to the start of the season. Looking again at the two subsets:

| System | R (subset 1) | R (subset 2) |
|---|---|---|
| Community Forecasts | 0.629 | 0.683 |
| Pecota Depth Charts | 0.662 | 0.671 |
| Sporting News | 0.709 | 0.683 |
| Fantistics | 0.717 | 0.727 |

Fantistics wins both subsets. They look like the authority on playing time.

In our quest to win our fantasy baseball leagues, we also need to project run production (R and RBI) and stolen bases. Let’s look at how each system did at projecting these, independent of playing time.

**(R + RBI)/PA**

| System | Avg | Stdev | R w/ actual |
|---|---|---|---|
| 2008 | 0.244 | 0.085 | 0.410 |
| Marcel | 0.246 | 0.030 | 0.606 |
| Steamer | 0.254 | 0.032 | 0.587 |
| PECOTA | 0.245 | 0.040 | 0.603 |
| Chone | 0.251 | 0.036 | 0.642 |
| ZiPS | 0.250 | 0.041 | 0.634 |
| Sporting News | 0.251 | 0.037 | 0.548 |
| Fantistics | 0.253 | 0.043 | 0.553 |
| Avg Projection | 0.250 | 0.034 | 0.645 |
| Actual (2009) | 0.241 | 0.051 | – |

**Avg and R are weighted by 2009 PA**

This is a bit surprising. It looks like Fantistics was successful in projecting SGPs in spite of missing on Runs and RBIs. Chone and ZiPS show up on top here, so perhaps successfully projecting R and RBI hinges on successfully projecting hitter quality as much as anything else. Anyway, chalk up another win for Chone.

Looking at the two subsets:

| System | R (subset 1) | R (subset 2) |
|---|---|---|
| 2008 | 0.444 | 0.381 |
| Marcel | 0.621 | 0.592 |
| Steamer | 0.590 | 0.584 |
| PECOTA | 0.621 | 0.586 |
| Chone | 0.649 | 0.636 |
| ZiPS | 0.645 | 0.624 |
| Sporting News | 0.564 | 0.537 |
| Fantistics | 0.590 | 0.514 |
| Avg Projection | 0.663 | 0.625 |

Chone and ZiPS finish one and two in both subsets. Fantistics looks to have had a few fluke values that drastically affected their overall results in this category and created their wildly different results across subsets.

**SB/PA**

| System | Avg | Stdev | R w/ actual |
|---|---|---|---|
| 2008 | 0.0172 | 0.0216 | 0.774 |
| Marcel | 0.0167 | 0.0130 | 0.811 |
| Steamer | 0.0169 | 0.0132 | 0.809 |
| PECOTA | 0.0167 | 0.0160 | 0.837 |
| Chone | 0.0167 | 0.0155 | 0.848 |
| ZiPS | 0.0172 | 0.0176 | 0.842 |
| Sporting News | 0.0172 | 0.0165 | 0.825 |
| Fantistics | 0.0168 | 0.0179 | 0.795 |
| Avg Projection | 0.0170 | 0.0151 | 0.847 |
| Actual (2009) | 0.0163 | 0.0180 | – |

**Avg and R are weighted by 2009 PA**

Another win for Chone, but this one is close, with Chone, ZiPS, PECOTA and Sporting News all in the same neighborhood.

Here the standard deviations point to something that Steamer is doing wrong. Steamer, like Marcel, regresses to the league average and gives everyone stolen bases. Do we need to use speed scores? Dash calculated 2008 speed scores (using Bill James's five-factor system) to see whether they would have added information to our SB projections but, given our small sample size, they didn't make a statistically significant improvement. We'll be looking for ways to improve our SB projections for the upcoming season. Will age adjustments do the trick?

Looking at the two subsets:

| System | R (subset 1) | R (subset 2) |
|---|---|---|
| 2008 | 0.802 | 0.748 |
| Marcel | 0.819 | 0.802 |
| Steamer | 0.829 | 0.789 |
| PECOTA | 0.851 | 0.823 |
| Chone | 0.831 | 0.859 |
| ZiPS | 0.839 | 0.842 |
| Sporting News | 0.846 | 0.803 |
| Fantistics | 0.814 | 0.775 |
| Avg Projection | 0.856 | 0.830 |

The two subsets look pretty different in this case leading us to believe that our one year sample might not be enough to confidently draw conclusions about which systems are the best at projecting stolen bases. We might be able to distinguish the top 4 from the rest.

**HR/AB**

We have put some effort into improving our HR predictions for the upcoming season, and Greg Rybarczyk was even kind enough to send us Hit Tracker data, which we haven't figured out how to utilize to our advantage yet. Anyway, we wanted to take a look at how we did projecting HR/AB.

| System | Avg | Stdev | R w/ actual |
|---|---|---|---|
| 2008 | 0.0301 | 0.0238 | 0.621 |
| Marcel | 0.0313 | 0.0122 | 0.717 |
| Steamer | 0.0319 | 0.0126 | 0.713 |
| PECOTA | 0.0307 | 0.0149 | 0.733 |
| Chone | 0.0310 | 0.0150 | 0.752 |
| ZiPS | 0.0313 | 0.0159 | 0.749 |
| Sporting News | 0.0319 | 0.0140 | 0.720 |
| Fantistics | 0.0330 | 0.0170 | 0.730 |
| Avg Projection | 0.0316 | 0.0139 | 0.755 |
| Actual (2009) | 0.0320 | 0.0189 | – |

**Avg and R are weighted by 2009 PA**

Chone and Zips win again here with Chone edging out ZiPS.

It makes sense, perhaps, that Marcel would have a low standard deviation in HR rates since it regresses all components equally but why does Steamer have a low standard deviation in projected HR’s? Are we not being aggressive enough in our HR projections?

One way to analyze this is to graph actual HR/AB vs. predicted HR/AB for each system and look at the slope of each line. The slope **should**, in theory, be 1.

| System | Slope |
|---|---|
| Marcel | 1.023 |
| Steamer | 0.988 |
| PECOTA | 0.887 |
| Chone | 0.909 |
| ZiPS | 0.853 |
| Sporting News | 0.894 |
| Fantistics | 0.800 |

Actually, by the look of this data, Steamer is being just as aggressive as it should be. One way to interpret these slopes is that when Fantistics projects a player to have 50% more HRs than a peer, he turns out to have only 0.800*50% = 40% more. Fantistics may be overly aggressive in projecting HRs. Still, this was a weak performance by Steamer in projecting individual HR/AB, which suggests that we need to regress to a height/weight mean instead of a league mean in our projections.
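The slope in question is simply the coefficient from regressing actual rates on projected rates. A minimal sketch of that calculation (the sample rates below are made-up numbers, not real projections):

```python
import numpy as np

def calibration_slope(projected, actual):
    """Slope of the least-squares line through (projected, actual) points.
    A slope of 1 means projected differences between players are fully
    realized; below 1 means the system over-states the spread."""
    projected = np.asarray(projected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    slope, _intercept = np.polyfit(projected, actual, 1)
    return slope
```

Applying this per system to the 438 hitters' projected and actual HR/AB values yields the slope column above.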

Also, we have to take our victories where we can get them, and Steamer edged out Sporting News (you'd need to see the next decimal place but trust me on this one) in projecting the **overall** HR rate for this group of players.

Looking at the two subsets:

| System | R (subset 1) | R (subset 2) |
|---|---|---|
| 2008 | 0.641 | 0.600 |
| Marcel | 0.743 | 0.687 |
| Steamer | 0.734 | 0.689 |
| PECOTA | 0.751 | 0.714 |
| Chone | 0.778 | 0.724 |
| ZiPS | 0.780 | 0.715 |
| Sporting News | 0.760 | 0.677 |
| Fantistics | 0.747 | 0.712 |
| Avg Projection | 0.778 | 0.729 |

ZiPS wins one and Chone wins one but it’s crowded at the top and the evidence might not be compelling enough to say that ZiPS and Chone are better at projecting HR’s than Pecota or Fantistics.

**Conclusions:**

It’s hard to know exactly what to take from this. Chone and Zips seem to stand out in projecting hitter quality and they have somewhat similar methodologies which gives some hints about how to make good forecasts. Fantistics succeeds in projecting SGP’s best based largely on its success in projecting playing time which suggests, perhaps, that other systems haven’t put enough thought into how best to project playing time.

It’s worth noting, also, that for the fantasy player, not all playing time is created equal. If Jose Reyes and a #6 hitter are both projected for the same number of plate appearances and the same number of SGP’s, you’re probably better off taking Reyes. When he’s playing, he’s getting more plate appearances and when he’s not, he’s on the DL and you can play someone who, although they might project to zero SGP’s over replacement, is better than an empty slot.

*Up next: Pitchers*
