The Curse of... oh, let's say, Clay Bellinger:

Wednesday, October 17, 2007

CAIRO Projections v0.1

One of the areas of baseball analysis that I've taken an interest in is projections. Some of the people I consider the best analysts out there have spent a lot of time and effort devising various projection systems. With Dan Szymborski's ZiPS, Sean Smith's CHONE, Tango Tiger's Marcels, Nate Silver's PECOTA, and some others I'm surely forgetting, there's no shortage of systems out there.
I have a stubborn side to me though. I have this burning need to understand how all these numbers work instead of just taking them as presented. So I've taken to trying to calculate a lot of stuff myself. That's why I decided to come up with my own projection system, which I've code-named CAIRO after my favorite bad baseball player. It will eventually stand for something, but I'm not quite sure what yet.

I don't expect CAIRO to be any better than any of the systems mentioned above, but at least now I don't have to wait for the others to come out. The other thing I want is to make this a reasonably open-source system. The heart of CAIRO is Tango Tiger's Marcel system, but with a few of my own tweaks. This is the methodology I am currently using.

1) Park adjust and league adjust the component stats for each season from 2003-2007. I am including 2006 and 2007 MLEs (major league equivalencies) but for now I've only projected people who appeared in the majors in 2007 except for a couple of Yankee farmhands.
2) Weight each season using a 5/4/3/2/1 weight (most recent season weighted most heavily) for batters. For pitchers I use a 7/5/3/2/1 weight, as I believe pitchers' most recent performance should be weighted a little more heavily since pitchers are more likely to change their true talent level, both positively and negatively.
3) Add in a percentage of plate appearances or innings based on league average performance (regression towards the mean).
4) Adjust the appropriate components for the player's age (hits, xbh, HR, BB, K, SB).
5) Park-adjust the final stat line for the player's expected park/league. For pitchers I include an adjustment for the projected defense behind them.
6) This is an almost entirely objective system, but I have made/will make playing time adjustments for some players.
7) Last, wait for everyone to tell me how wrong the projections are.
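
For anyone who wants to play along at home, here is a rough sketch of steps 2 through 4 for a single component rate (say, HR per plate appearance). It is not the actual CAIRO spreadsheet logic. The function name, the `regression_opps` amount of league-average playing time, and the `age_factor` multiplier are all made-up placeholders, since the real regression amounts and aging curves are not spelled out above.

```python
# Season weights from step 2, most recent season first.
BATTER_WEIGHTS = (5, 4, 3, 2, 1)
PITCHER_WEIGHTS = (7, 5, 3, 2, 1)


def project_rate(seasons, league_rate, weights=BATTER_WEIGHTS,
                 regression_opps=1200.0, age_factor=1.0):
    """Project a per-opportunity rate for one component stat.

    seasons: list of (events, opportunities) tuples, most recent first,
             already park- and league-adjusted (step 1).
    league_rate: league-average events per opportunity.
    regression_opps: hypothetical amount of league-average playing time
             mixed in (step 3); the real value is an assumption here.
    age_factor: hypothetical multiplier standing in for the age
             adjustment (step 4).
    """
    # Step 2: weighted totals of events and opportunities.
    weighted_events = sum(w * e for w, (e, _) in zip(weights, seasons))
    weighted_opps = sum(w * o for w, (_, o) in zip(weights, seasons))

    # Step 3: regression towards the mean by adding league-average
    # performance over regression_opps extra opportunities.
    regressed = (weighted_events + league_rate * regression_opps) / (
        weighted_opps + regression_opps
    )

    # Step 4: apply the (placeholder) age adjustment.
    return regressed * age_factor


# Made-up example: HR and PA over three seasons, most recent first.
hr_seasons = [(30, 600), (20, 550), (25, 580)]
projected_hr_rate = project_rate(hr_seasons, league_rate=0.03)
print(round(projected_hr_rate * 650, 1), "HR per 650 PA")
```

The thing to notice is that a player with fewer weighted plate appearances gets pulled harder towards league average, because the league-average playing time makes up a bigger share of his total.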

I have included defensive projections using Zone Rating as well. I used a weighted average from 2002-2007 with regression and aging factored in. I included a lot of players who don't have significant playing time at specific positions so small sample size caveats apply, even with the regression that I added in.

This is all very much a work in progress and I expect to continue tweaking it during the offseason, so if anyone sees any numbers that look wrong or has any questions about the methodology feel free to let me know.

Since this is supposed to be a Yankee site, here are the Yankees' projections.

NAME AGE TEAM POS G PA AB H 2B 3B HR RBI BB SO SB CS DP AVG OBP SLG
Alberto Gonzalez 25 NYA ss 75 259 237 60 11 1 4 28 18 34 0 0 0 .255 .311 .363
Alex Rodriguez 32 NYA 3b 157 686 580 174 27 1 42 124 89 127 19 4 17 .300 .402 .570
Andy Phillips 31 NYA 1b 63 198 180 48 9 1 4 24 13 34 1 1 4 .266 .310 .403
Angel Chavez 26 NYA ss 75 280 264 68 14 1 7 34 14 54 0 0 0 .259 .297 .395
Bobby Abreu 34 NYA rf 155 676 565 161 37 2 19 89 101 120 23 6 10 .284 .392 .462
Brett Gardner 24 NYA cf 83 331 297 78 11 3 2 31 31 64 0 0 0 .262 .332 .341
Bronson Sardinha 25 NYA rf 85 322 294 69 12 2 9 39 26 62 0 0 0 .234 .296 .378
Chris Basak 29 NYA 3b 69 239 215 56 11 1 6 30 21 42 0 0 0 .263 .329 .411
Derek Jeter 34 NYA ss 151 688 608 193 35 3 16 90 62 102 18 5 17 .318 .387 .463
Doug Mientkiewicz 34 NYA 1b 79 291 254 66 15 1 7 35 29 39 1 1 7 .262 .338 .409
Eric Duncan 23 NYA 1b 79 303 274 63 14 1 8 36 26 56 0 0 0 .231 .299 .379
Hideki Matsui 34 NYA lf 137 579 506 147 30 2 22 86 65 73 2 1 12 .291 .370 .490
Jason Giambi 37 NYA dh 107 422 339 87 15 0 22 66 70 82 1 0 6 .256 .398 .498
Johnny Damon 34 NYA cf 137 607 539 154 27 4 16 78 60 75 21 5 7 .286 .357 .438
Jorge Posada 36 NYA c 141 553 473 139 31 1 21 82 71 93 2 1 14 .294 .391 .499
Jose Molina 33 NYA c 60 195 179 45 10 0 4 22 10 39 1 0 4 .248 .281 .369
Kevin Reese 30 NYA lf 61 234 212 52 7 1 5 26 18 40 0 0 0 .245 .309 .361
Melky Cabrera 23 NYA cf 139 564 502 144 24 5 9 64 48 64 6 3 7 .287 .345 .407
Robinson Cano 25 NYA 2b 152 624 581 181 41 4 18 90 31 76 3 3 17 .312 .347 .490
Shelley Duncan 28 NYA dh 83 311 281 72 13 1 16 50 26 64 0 0 1 .256 .317 .479
Wil Nieves 30 NYA c 42 147 136 32 6 0 2 15 8 18 0 0 1 .233 .274 .340
Wilson Betemit 26 NYA 1b 109 308 272 71 14 1 12 44 30 70 1 1 4 .263 .332 .451


NAME AGE TEAM ERA G W L IP H ER HR BB SO
Andy Pettitte 36 NYA 4.21 36 13 9 201 199 94 19 57 149
Brian Bruney 26 NYA 5.20 32 2 2 33 31 19 4 22 30
Carl Pavano 32 NYA 4.49 10 3 3 50 53 25 6 11 31
Chase Wright 25 NYA 4.76 17 3 3 51 57 27 6 23 30
Chien-Ming Wang 28 NYA 3.93 35 13 9 202 210 88 12 56 95
Chris Britton 25 NYA 3.79 46 4 2 57 57 24 5 19 39
Colter Bean 31 NYA 4.44 13 2 1 24 23 12 2 16 22
Darrell Rasner 27 NYA 4.77 13 3 2 45 51 24 6 13 26
Edwar Ramirez 27 NYA 3.73 39 4 2 55 45 23 6 25 70
Ian Kennedy 23 NYA 4.19 38 11 9 181 179 84 21 72 143
Jeff Karstens 25 NYA 5.82 13 2 4 51 61 33 9 18 27
Jim Brower 35 NYA 5.88 22 1 2 26 28 17 4 11 19
Joba Chamberlain 22 NYA 3.68 55 10 7 149 136 61 16 48 150
Jose Veras 27 NYA 4.29 27 2 2 38 38 18 4 15 29
Kei Igawa 28 NYA 5.72 25 6 9 138 158 88 29 46 102
Kyle Farnsworth 32 NYA 4.23 58 4 2 57 53 27 7 24 58
Luis Vizcaino 33 NYA 4.46 71 4 4 73 69 36 8 33 62
Mariano Rivera 38 NYA 2.74 66 6 2 75 67 23 4 15 69
Matt DeSalvo 27 NYA 6.82 15 2 4 53 63 40 8 37 26
Mike Mussina 39 NYA 4.26 32 11 8 173 177 82 19 40 129
Philip Hughes 22 NYA 3.73 35 11 6 157 152 65 14 50 124
Roger Clemens 45 NYA 3.69 22 7 4 103 96 42 9 30 85
Ron Villone 38 NYA 4.60 45 3 4 59 56 30 7 29 47
Ross Ohlendorf 25 NYA 4.77 16 4 3 64 73 34 9 15 39
Sean Henn 27 NYA 5.90 19 2 2 32 36 21 5 19 21
Tyler Clippard 23 NYA 5.81 19 4 5 81 92 52 16 35 58


Player Age Team LG Pos G Innings PO A E DP PM CH ZR PM +/- RS RS/162
Doug Mientkiewicz 34 NYY AL 1B 86 669 626 53 4 63 119 137 .870 4 3 7
Jason Giambi 37 NYY AL 1B 60 459 402 32 5 38 72 90 .796 -4 -3 -9
Andy Phillips 31 NYY AL 1B 66 428 342 45 4 41 80 94 .846 1 1 2
Wilson Betemit 27 NYY AL 1B 36 255 88 63 3 20 69 84 .816 -1 -1 -5
Shelley Duncan 29 NYY AL 1B 31 218 51 62 3 18 61 73 .834 1 1 6
Robinson Cano 26 NYY AL 2B 140 1213 320 397 13 100 367 441 .833 4 3 4
Wilson Betemit 27 NYY AL 2B 11 84 53 13 0 7 18 22 .813 -1 -1 -9
Alex Rodriguez 33 NYY AL 3B 156 1337 115 273 16 31 299 394 .758 -3 -3 -3
Wilson Betemit 27 NYY AL 3B 41 288 23 63 4 9 64 81 .786 0 0 -2
Johnny Damon 35 NYY AL CF 117 980 284 16 3 3 280 318 .881 0 0 0
Melky Cabrera 24 NYY AL CF 62 505 153 23 3 4 153 173 .880 2 1 4
Hideki Matsui 34 NYY AL LF 105 892 204 6 4 1 199 238 .833 -7 -6 -10
Melky Cabrera 24 NYY AL LF 72 599 139 7 1 1 134 159 .840 -3 -2 -6
Johnny Damon 35 NYY AL LF 52 423 116 3 2 0 112 129 .869 0 0 0
Bobby Abreu 34 NYY AL RF 128 1089 240 7 4 1 237 275 .864 -2 -2 -2
Derek Jeter 34 NYY AL SS 152 1300 225 373 15 85 369 457 .806 -11 -8 -9
Alberto Gonzalez 25 NYY AL SS 35 225 49 18 2 3 56 66 .838 -1 0 0
Wilson Betemit 27 NYY AL SS 18 114 22 21 2 5 29 36 .793 -2 -1 -19


The full spreadsheet is available here.

Update: Version 1.3 is now available. I added more minor league data and changed some of my pitching algorithms. Link is here.

It's important to remember that any projection system is inherently limited. We're dealing with athletes playing games, and their true talent can change in ways that can't be forecasted. In addition, fluke seasons happen, both good and bad. I think that on a team level projections are a useful tool for understanding probabilities, but at the end of the day that's all they are. Probabilities, not predictions.
--Posted at 6:47 am by SG / 37 Comments | - (3252)

Comments


I really think it would be OK to delete Pavano, since he will miss the entire season “rehabbing” from surgery.  Unless you’d rather place a small wager on that projection.  I’ll take the under on IP.

I don’t think Basak is with the Yankees any more.  I’m pretty sure he was picked up by Minnesota; I was at a Yankees/Red-Wings game at the end of the year, and they (the Wings) had a player named Chris Basak that the Scranton crowd cheered, so…

Do you have their relative linear-weights numbers yet?  That is, +/- relative to position?  For example, looking at Gonzalez and Gardner, at first glance their numbers look very low (.674 and .673 OPS, respectively).  However, I imagine when compared to league-average SS and CF, they aren’t that bad (less than -10 runs for sure), though I could be very wrong.  Add in their plus (or in Gardner’s case, plus-plus) speed, and reputations for good defense, and they could be very useful bench players next year.

I am assuming there is some reason none of the minor league guys are projected to steal a base? If Gardner has 331 at bats without recording a steal I will eat my hat.

I really think it would be OK to delete Pavano, since he will miss the entire season “rehabbing” from surgery.

I kept Pavano in there strictly as a reminder to Brian Cashman.  I am positive he will not pitch in 2008.

I don’t think Basak is with the Yankees any more.

You’re right.  Moving players to their right teams will be an ongoing project, especially the minor leaguers.

Do you have their relative linear-weights numbers yet?

Yeah, I do have that but I didn’t include it in the spreadsheet I uploaded yet.  Gonzalez’s line would be equivalent to -13 runs above an average SS over 650 plate appearances.  That’s a touch better than replacement level if he’s at least average defensively.  Gardner’s line would be -20 runs above an average CF over 650 plate appearances, which is just about where replacement level starts.  He could be good enough defensively to take a chunk out of that, but there’s no way to know that.

I am assuming there is some reason none of the minor league guys are projected to steal a base?

Yeah.  Looks like I didn’t put my SB/CS numbers in the right place for my major league equivalencies.  It’ll be fixed in Cairo v0.2.

A-Rod: .300/.402/.570
Abreu: .284/.392/.462
Jeter: .318/.387/.463
Matsui: .291/.370/.490
Giambi: .256/.398/.498
Posada: .294/.391/.499
Cano: .312/.347/.490

if the Yankees met these predictions, they’d win 100+ games.

I have no faith in Giambi hitting his projection but it would depend on how much those guys can play too.  If you are penciling in Giambi for 150 games you’re probably over-estimating.

My early estimates with these projections allocating playing time to the bench has the offense scoring around 930 runs next year.  That’s predicated on keeping Posada and Rodriguez though.

Cool stuff, SG, and thanks.  I’m hoping (with a fan’s bias) that Melky will best his projection by figuring out a way to not take a month to get going or burn out in September.

“That’s predicated on keeping Posada and Rodriguez though.”

Man I hate sentences like that.

“if the Yankees met these predictions, they’d win 100+ games.”

These are not, overall, better numbers than this group of players had in 2007.  We KNOW, as best as you can know such things, the Yankees (with ARod and Posada) will score a boatload of runs.  We KNOW, as best as you can know such things, that Wang will be good and Pettitte at least decent.  Our final number of wins will depend on whether 2 of Hughes, Joba, IPK, and Moose turn in legit 150+ IP seasons of league average or above pitching, and whether the bullpen, whoever’s out there, holds together.

These are not, overall, better numbers than this group of players had in 2007.

i would think that Giambi alone makes this projection collectively better than the 2007 performance.

but i get your point.

Posada and A-Rod are going to regress some from last year.  and Giambi is simply not going to hit that projection.

SG already has Posada and A-Rod regressing from last year (as well he should):

2007 (from ESPN)
Posada—338/426/543
A-Rod—314/422/645

CAIRO
Posada—.294/.391/.499
A-Rod— .300/.402/.570

I think the Moose projection is worse than the Giambi projection, although I was amazed to see that Moose had a 4.11 FIP this year.  Maybe he really will bounce back a bit.

I should add to my previous point that I think this is what will make 2008 particularly exciting.

When was the last time the Yankees’ success depended on pitchers maturing rather than not declining too much?

SG:

One major difference I see between CAIRO and say, PECOTA, is that CAIRO does not factor comparable players into its projection.  Is this to save time, or due to a lack of data, or are you philosophically opposed to comparables in some way?  Given your explanation, I assume that time is the greatest factor.

When was the last time the Yankees’ success depended on pitchers maturing rather than not declining too much?

Back when we were riding the Scott Kamieniecki, Clay Parker, and Jeff Johnson train…

One major difference I see between CAIRO and say, PECOTA, is that CAIRO does not factor comparable players into its projection.  Is this to save time, or due to a lack of data, or are you philosophically opposed to comparables in some way?

Mainly time. I do think adding similar players can be somewhat beneficial, but I’m not sure of the best way to integrate it, or that the additional effort needed is justified by whatever marginal improvements it may add.  I’m open to adding it in the future of course.

Two questions about CAIRO (that show my ignorance about this stuff): 1) Where do the raw statistics come from and how granular are they?  2) I see that you age-adjust the component stats and regress the number of plate appearances towards the mean.  Do you also regress the component stats towards the mean?

1) Where do the raw statistics come from and how granular are they?

I pull the raw stats from Baseball Prospectus, example here.  They’re pretty granular although I am not making use of every single field for now.

2) I see that you age-adjust the component stats and regress the number of plate appearances towards the mean.  Do you also regress the component stats towards the mean?

Yes, a percentage of league average performance in every component stat is added to every player’s line, not just plate appearances/innings.

I’ll take the over on Jeter and Cano (offensively and defensively) and the under on Giambi and Abreu.

How are the W/L records for the pitchers calculated?

most of the numbers seem to make sense, with the exception of some of the pitching stats. How does a guy who wins 19 games 2 years in a row project to go 13-9 even though his ERA projects to be below 4? I realize that wins are only partially subject to the pitcher’s performance, but for the yankees anyone with an under 4 ERA should have at least 15 wins. Does this system at all take into account team offense as far as wins go?

I’ll take the over on Jeter and Cano (offensively and defensively) and the under on Giambi and Abreu.

I’d take the over on Cano for both, Jeter’s right around where I’d think he should be in both areas.  I’d also take the under on Giambi, Abreu, and probably Posada.

How are the W/L records for the pitchers calculated?

Divide innings by 9 to get a # of decisions, then use the pythagorean formula and the team’s projected offense to calculate a winning percentage. (team projected runs per game squared) / (team projected runs per game squared plus pitcher’s RA squared) times # of decisions for wins.  Then decisions minus wins for losses.  I wouldn’t pay much attention to it.

I realize that wins are only partially subject to the pitcher’s performance, but for the yankees anyone with an under 4 ERA should have at least 15 wins.

Looks like I didn’t get all the run support data in v0.1, I had every team averaging 5 runs a game.  If we use my 930 run estimate for the 2008 Yankees then Pettitte and Wang would both project to go 15-7.
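
Here is a quick sketch of that W/L calculation for anyone who wants to check the math. It uses the usual pythagorean orientation (runs scored squared over runs scored squared plus runs allowed squared). The function name and the rounding are my own assumptions for illustration, and I'm plugging in the projected ERA from the table as a stand-in for RA, which is also an assumption.

```python
def project_record(innings, pitcher_ra, team_runs_per_game, exponent=2):
    """Rough W/L projection for a pitcher.

    innings: projected innings pitched.
    pitcher_ra: pitcher's projected runs allowed per 9 innings
                (ERA is used as a stand-in in the example below).
    team_runs_per_game: projected run support per game.
    Returns (wins, losses) as whole numbers; rounding is an assumption.
    """
    # One decision per nine innings.
    decisions = innings / 9

    # Pythagorean winning percentage with the team's offense as runs
    # scored and the pitcher's RA as runs allowed.
    scored = team_runs_per_game ** exponent
    allowed = pitcher_ra ** exponent
    win_pct = scored / (scored + allowed)

    wins = round(decisions * win_pct)
    losses = round(decisions) - wins
    return wins, losses


# Pettitte's projected line: 201 IP, 4.21 ERA.
print(project_record(201, 4.21, 5.0))        # flat 5 runs per game
print(project_record(201, 4.21, 930 / 162))  # 930-run offense
```

With those assumptions the two calls come out to 13-9 and 15-7 for Pettitte, in line with the numbers above. Other pitchers may not match exactly, since RA and ERA differ.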

I missed most of last night’s game but the video on No Maas suggests that Pedroia might have been trying to slap the ball out of the 1b’s glove as Arod was so condemned for attempting, but Sawx fans contend he was just diving.  Anyone have an opinion?

It certainly appears as if Pedroia was trying to slap the ball out of Martinez’s glove, but then again much of Boston’s hatred of A-Rod is because a) he was a hair’s breadth away from being their SS four years ago and b) the fact that he went to the Yanks instead.  To me, Manny’s pose after his HR, with his team still down by four runs, was far more lame - but that’s Manny.

Hey i was wondering if anyone knew when humberto sanchez is supposed to be available to pitch again.  I want to say he had tommy john in april but not sure. Because if he could be ready by mid-season you would hope he could help out that bullpen since JOBA will be in the starting rotation

Last thing I read about Sanchez is his recovery was going ok and they think he’ll be in spring training although may not be ready when it starts.  It’d be great if he could be Joba in 2008, but I wouldn’t count on that yet.

It’d be great if he could be Joba in 2008, but I wouldn’t count on that yet.

True for a number of reasons. 1) Who knows how he responds when he really starts throwing - that’s when we see how he recovers 2) It may end his elbow problems, but there have also been concerns about his conditioning 3) He hasn’t yet had any experience or success in the majors. But his potential does have me drooling. Two years ago at the Futures game, he was the object of scouts’ affections, even more so than Hughes.

The way I think about projections is this:  They represent what you might expect the player to do during the upcoming season given his previous performances, his age, and his team.  What I find more interesting, and I guess what makes sports so captivating, is that some players will surprise you by either outperforming or underperforming any reasonable expectation.

I really wonder whether THAT factor, deviation from the projected performance, might itself become susceptible to prediction.  One case that comes to mind is Josh Beckett.  Seems like a lot of people were predicting a big year from him based on what appeared to be a determination on his part to alter his pitch selection (not rely so heavily on fastballs).  If a player is planning on changing his approach, or hooks up with a different set of coaches, or is in a contract year, etc., isn’t it conceivable that those kinds of subjective factors could play a useful role in their projections?

The conventional wisdom on Tommy John surgery is that velocity comes back in a year, but command takes longer.  See Dotel, Octavio.  I think the best case scenario for Sanchez is that he comes up at the end of August. It’s more likely that we won’t see him until ‘09.

The same goes for JB Cox and Mark Melancon, most likely, both of whom had the surgery as well.

i thought Cox had elbow, but not Tommy John, surgery?

I really wonder whether THAT factor, deviation from the projected performance, might itself become susceptible to prediction.

I’ve always been fascinated by the psychological elements of baseball, and I’d love to see both the psychological and statistical analysis come together for something like this.  Earlier in the season I tried to defend Torre’s decision to not play Giambi at 1B, stating that a good defensive first baseman’s value is not just tied to his zone rating, but also the confidence he inspires in his teammates.  Unfortunately there’s no easy way to quantify something like “Jeter knows Giambi is a bad first baseman, so he tries to be too precise with his throw, and ends up bouncing the ball in the dirt,” but factors like that HAVE to exist.

Changing the topic a little bit: If A-Rod decides to opt out and the Yankees don’t pursue him, maybe they can go after Ian Stewart, the stud 3B prospect the Rockies have. After all, he is expendable since the Rox have Garrett Atkins. Just a thought…

I’m having trouble deciding whether or not it would be cool staking out Steinbrenner’s house in Tampa for the results of these meetings. It’s probably not.

How freakin’ ridiculous is it that we still do not know if Torre is returning or not?

Unless they are spending this time trying to convince Torre to either A. Take a one year deal or B. Take a significant pay cut, it is just a total insult to Torre.

Unless they are spending this time trying to convince Torre to either A. Take a one year deal or B. Take a significant pay cut, it is just a total insult to Torre.

Or they could really be trying to determine if they want to keep him as a manager, or offer him a different position in the system.  Or they could have decided they want to keep him, and know what parameters (money and length) they want, but also want to make some other coaching changes (e.g. Guidry), and are trying to determine how to approach that, since Torre is very loyal to his coaches.

I know we all want closure on this either way, but let’s wait and see what happens before we decide that this is so insulting to Torre.  If it goes longer than Friday though it is probably getting a little silly.

This picture leaves little doubt as to what Pedroia was doing:
SlappyOHairless.gif

How freakin’ ridiculous is it that we still do not know if Torre is returning or not?

everyone keeps repeating this, and i still don’t understand it.

why do the fans seem to think the Yankees owe them this answer before some sort of arbitrary deadline?  it’s like someone (Mike and the Maddog?) floated this idea that we should be outraged and everyone has run with it.

no one here knows anything about what is going on or how much the FO is in communication with Torre.  i would guess he is in the loop.

and i would also guess he is coming back.

during this 12 days of “total insult” to Torre, he has earned $230K.  just “waiting”.  what an insult. 

everyone keeps saying it is an insult to Torre, and yet no one has actually explained WHY.  why is it an insult?

it’s as if people think it would have been better for him to be fired the day after the season ended instead of waiting 2 weeks for a new contract.

to me, there are a lot of good reasons to let him go and a lot of good reasons for him to come back.  the FO is doing what they SHOULD be doing and weighing all of those reasons to make the best decision.  whether or not the fans are restless for an answer is completely irrelevant.

to me, there are a lot of good reasons to let him go and a lot of good reasons for him to come back.  the FO is doing what they SHOULD be doing and weighing all of those reasons to make the best decision.  whether or not the fans are restless for an answer is completely irrelevant.

Ditto.  Though as I said above, it really should be resolved by tomorrow.  If they are waiting on doing anything with anyone else until they figure out who the next manager is, they really should get that figured out by the end of this week.
