Correlation and Causation

I had to laugh when I caught sight of this chart earlier this evening:


That said, it’s still not as funny as “Figure 1” in George Yule’s classic work from 1926, “Why do we Sometimes get Nonsense-Correlations between Time-Series?–A Study in Sampling and the Nature of Time-Series”. Click this link – I won’t spoil it for you.
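Yule's phenomenon is easy to reproduce in a few lines: two *independent* random walks will, far more often than intuition suggests, show a strong sample correlation, simply because each series wanders in some direction over time. This is just an illustrative sketch — the walk length, trial count, and 0.5 threshold are arbitrary choices of mine, not anything from Yule's paper.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def random_walk(n, rng):
    """Cumulative sum of i.i.d. Gaussian steps."""
    level, walk = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

rng = random.Random(0)
trials = 500
# Count how many pairs of independent walks look "strongly correlated".
strong = sum(
    1
    for _ in range(trials)
    if abs(pearson_r(random_walk(200, rng), random_walk(200, rng))) > 0.5
)
print(f"{strong}/{trials} pairs of independent walks had |r| > 0.5")
```

For comparison, two independent white-noise series of the same length would almost never exceed |r| > 0.5 — the trending, non-stationary nature of the walks is what manufactures the "nonsense correlation".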

In my opinion, what’s even more dangerous is the abuse of R². In research using large data sets (most notoriously time series), you’ll often see relationships between sets quantified in terms of this metric. More than once I’ve heard students say that they ran a regression and got a “good R²”. That raises two questions: what counts as “good”, and what does R² really signify? My guidance was always to have them start by plotting the data, taking a lesson from Anscombe’s quartet. But that guidance isn’t very practical if one is evaluating thousands of time series, and it’s ultimately susceptible to the same error as the two figures above: attributing “causation” to “correlation”. Yule’s figure and the chart above are problematic precisely because they look so convincing. If a correlation is what you’re looking for, you’ll be tempted to stop analyzing your data the moment you see that “cluster” of data points.
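Anscombe's lesson is worth seeing in numbers: his four data sets (from the 1973 paper "Graphs in Statistical Analysis") share essentially the same correlation — and hence the same R² — yet plotting reveals a clean linear trend, a curve, an outlier-driven fit, and a single high-leverage point. A quick sketch:

```python
# Anscombe's quartet: four data sets with nearly identical summary
# statistics but radically different shapes when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def pearson_r(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# All four data sets print r ≈ 0.82 — the same "good R²" of about 0.67.
for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    r = pearson_r(x, y)
    print(f"r = {r:.3f}, R^2 = {r * r:.3f}")
```

The summary statistics are indistinguishable; only a plot tells you which regression you can trust.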

For this reason, I recommend working with a statistician when you’re formulating experimental designs. This requires a bit of patience on both sides, because the experimentalist needs to be able to accept input and the statistician may need to be educated about the nature of the experiment. There will probably be an initial delay in your work. But if you can establish a good relationship built on trust and respect (and hopefully, fun!), I guarantee it will pay scientific dividends!