Statistics are a bit like guns. I'm really glad that we have experts in their use protecting and helping our society, but that doesn't mean I trust just anyone to wield them. They can do a lot of damage in the wrong hands. So when I saw this piece on LAG Confidential, I had to say something.
Sure, it's rivalry week. Banter pieces are part of the sports blogging tradition in such times. But it's one thing to poke fun at a rival, reminding them of triumphs through history, and it's quite another to veil oneself in the superiority of statistics and attempt to make a genuine argument. To my mind, that warrants a genuine refutation.
To be fair to him, Steffen does some good work, including this (much more thoughtful) piece previewing San Jose's season. This latest article, however, is a dud.
It does not take a statistical wizard to see the most gaping holes in Steffen's analysis. His sample of two games is so small that professional statisticians would not merely say we can glean very little from it; they would caution against drawing any conclusions at all. His assumption that the rate of 7.5 shots per game will hold up (and make San Jose the least-productive shooting team of all time) is just as absurd as claiming that the Quakes are on pace to be the greatest defensive team in league history (because of their 0.5 goals-against average) or that Chris Wondolowski is destined to break his own single-season goals record with 34 (since he's bagging one a game). Although you never know with that last one.
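To see how little work the "on pace" logic is doing, here is a minimal sketch of the arithmetic. It assumes a 34-game regular season (implied by the 34-goal figure above) and two games played; the name `on_pace` and the constant `SEASON_GAMES` are my own illustrative choices, not anything from Steffen's piece.

```python
SEASON_GAMES = 34  # assumption: 34-game regular season


def on_pace(total_so_far: float, games_played: int,
            season_games: int = SEASON_GAMES) -> float:
    """Naively project a season total by assuming the current per-game rate holds."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return total_so_far / games_played * season_games


# Two-game totals implied by the per-game rates quoted above.
shots_taken = on_pace(15, 2)     # 7.5 shots per game
goals_conceded = on_pace(1, 2)   # 0.5 goals-against average
wondo_goals = on_pace(2, 2)      # one goal a game

print(shots_taken, goals_conceded, wondo_goals)
```

All three projections come out of the same one-line formula, so anyone who accepts the first has no principled reason to reject the other two.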
There are numerous other obvious problems, such as the fact that 34-15=19, not 23 (the shot differential he claims). Or that the score in the Portland game was 2-1, not 2-0. Or the fact that Occam's razor does not instruct us to value "luck" as a more reliable assumption than "style/skill of the defense," but rather to make fewer assumptions. It's also some random 14th-century English friar's rule of thumb; it's not like it's a cornerstone of modern science.
But I wanted to get into two bigger issues in a bit more depth.
1) Confounding Variables
This is the biggest cardinal sin in statistics, and the one I can most directly pick apart in Steffen's argument. It's logically invalid to infer that a hypothesized variable has a causal effect simply because the effect occurs. Put more simply: just because something happens doesn't mean that your preferred reason for it is the cause. Steffen's hypothesis is that the huge volume of shots conceded is a product of a terrible defense. But what other explanations for the data might there be?
Having reported on both games and covered Dominic Kinnear for over a year now, I have a hypothesis of my own: his extremely conservative tactical approach when protecting a lead produces a much higher rate of shots conceded in those circumstances than when the game is tied or his team is losing. The data, which I hesitate to even use given the tiny sample, bear this out:
| SJ Game State | Shots Conceded | Shots on Goal Conceded | Danger Zone Shots Conceded | Goals Conceded |
Even normalizing for the amount of time San Jose spent in each game state (75 minutes tied, 105 minutes winning) doesn't close the gap nearly enough to disregard the effect.
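For anyone who wants to run that normalization themselves, a minimal sketch follows. The minutes by game state are the ones quoted above, and the 34 total shots conceded comes from the shot-differential arithmetic earlier. The split of those 34 shots between game states is a HYPOTHETICAL placeholder, not the real data, and the helper name `per_90` is my own.

```python
def per_90(count: float, minutes: float) -> float:
    """Convert a raw count into a rate per 90 minutes spent in a game state."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return count * 90.0 / minutes


# Minutes San Jose spent in each game state across the two games.
minutes_by_state = {"tied": 75, "winning": 105}

# HYPOTHETICAL placeholder split of the 34 shots conceded.
# Swap in the actual counts by game state before drawing any conclusion.
shots_conceded = {"tied": 8, "winning": 26}

for state, minutes in minutes_by_state.items():
    rate = per_90(shots_conceded[state], minutes)
    print(f"{state}: {rate:.1f} shots conceded per 90 minutes")
```

The point of dividing by minutes is simply that 105 minutes of protecting a lead gives opponents more time to shoot than 75 minutes at level terms, so raw counts alone would overstate the gap.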
Teams that are winning tight games tend to defend their lead and sit deep. San Jose in particular, with two very good but very slow center backs, plays a deep line to start with. That would lead to an unusually high number of shots conceded, but it also generally lowers the quality of shots conceded, and the likelihood that they're converted. This not only explains the data, it also coheres with what we know about the team's coaching and personnel, and the general statistical effects of high versus low lines.
The major point, here, is that statsheads must be humble about how hard it is to accurately quantify the beautiful game, and must carefully match what they see in the data to what they see on the pitch.
2) Unwarranted Assumptions
Regardless of any of the above claims, the second statistical cardinal sin is assuming causal (or at least correlative) value for variables that don't deserve it. There are a few issues here, including the assumption that a higher volume of shots is good: the team with the most shots in MLS right now (Columbus Crew) is joint bottom, whereas the team with the fewest shots (San Jose) is joint top. I'm not suggesting that the opposite assumption is true, just that the assumption he makes would require a lot more evidence to back up, and doesn't look particularly solid through two weeks.
However, there's one assumption that he frequently references that I'd like to interrogate further. The idea that a reliance on crosses is not a viable strategy is an interesting argument, although it flies in the face of the analysis one of his own colleagues did about the 2015 Eastern Conference Champion Columbus Crew.
It also doesn't help his argument that in its first two games, San Jose was out-crossed by each of its opponents, by a cumulative count of 46-24. Or that it strung together 471 passes at an 84% completion percentage against the Timbers, outdoing its supposedly slicker opponents in both categories. Of course, none of that made the article, because it would contradict the #narrative that San Jose is, in the words used on his American Soccer Analysis podcast, a "neanderthal" team.
I'm not suggesting San Jose will look like Barcelona in 2016. Far from it. But a lot has changed since the Bash Bros of 2013: the line-leading striker is 5'8", the wingers and central midfielders are much more technical, and the quality throughout has seen major improvements. Of the 6 midfielders and forwards in their starting XI against Portland, just two were even on the roster as of June 2015. As such, I would caution anyone against making bold claims about how this team plays before we see more of it.
If crosses are such a terrible thing, however, the great irony is that this may in fact be the worst possible thing for Steffen's argument: it could be that the reason San Jose has conceded so many shots but so few goals is that it forced its opponents into pumping in 46 crosses that typically led to shots of low quality. Essentially, he'd be required to contradict one of his points in order to make the other.
Regardless of the validity of particular analytical methods, the one thing for certain is that it takes a great leap beyond what they reveal to reach some of the conclusions Steffen does: San Jose is terrible. Even "historically bad." They'd be the worst team in league history without Chris Wondolowski. Even with him, they'll "end up being one of the worst MLS teams of all time." While he insists that "the numbers really do back up my case," his data support almost none of that, unless you buy that shot volume in a two-game sample is the best predictor of success. Which you shouldn't.
The reality is not the opposite; they're not the best team of all time on an inevitable march to legend. The reality is something a bit more moderate: they're a very good defensive team that is still finding itself offensively, and whether it does so will determine whether they're just outside the playoff bubble or comfortably within it.
So just remember: use statistics wisely, make sure your conclusions actually match your data, and ~~buy European retirees~~ hold your team's beat reporters up to their end of the bargain on rivalry week.