After the back and forth with Arnold Kling yesterday in this post, I thought an additional post might be a good idea.
Arnold made a point that the prediction error was understated in my simple graph. This is true, as I simply graphed the actual data along with the estimated value of the data from a linear model and did not graph the predicted (future) values vs. actual future values (we can do this since it is a contrived example for purposes of demonstration).
The first chart shows the real danger of mistaking an exponential process for a linear one. The dark blue series is the actual data; the red series is the model's estimated historical values. The green values are the predictions from the linear model. As we can see, the linear model provides a very decent fit to the historical data. For those who know something about statistics/regression analysis, the adjusted R2 is 0.92, which is a very good fit. But you can see that in terms of predictions the linear model does worse and worse with each additional increment in the explanatory variable. This highlights another pitfall that is easy to fall into with statistics: chasing a high adjusted R2. While a high value does mean your model explains what happened historically, as we can see it is in no way a guarantee that the model will explain the future well at all.
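To make the point concrete, here is a minimal sketch in Python with NumPy. The growth rate, noise level, seed, and the 30/20 split between history and forecast are all assumptions chosen for illustration, not the numbers behind the chart, so the adjusted R2 it produces will not match 0.92 exactly.

```python
import numpy as np

def adjusted_r2(y, y_hat, n_predictors=1):
    """Adjusted R-squared for a fit with n_predictors explanatory variables."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    n = len(y)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Contrived exponential process plus a random error term (assumed parameters).
rng = np.random.default_rng(0)
x = np.arange(1, 51, dtype=float)
y = np.exp(0.05 * x) + rng.normal(0.0, 0.1, size=x.size)

# Fit a straight line to the first 30 observations only (the "history").
n_hist = 30
slope, intercept = np.polyfit(x[:n_hist], y[:n_hist], 1)
fitted = intercept + slope * x[:n_hist]

# Extend the same line over the remaining 20 observations (the "future").
forecast = intercept + slope * x[n_hist:]
forecast_errors = y[n_hist:] - forecast

in_sample_adj_r2 = adjusted_r2(y[:n_hist], fitted)
in_sample_rmse = np.sqrt(np.mean((y[:n_hist] - fitted) ** 2))

print(f"adjusted R2 on history: {in_sample_adj_r2:.3f}")
print(f"RMSE on history:        {in_sample_rmse:.3f}")
print(f"first forecast error:   {forecast_errors[0]:.3f}")
print(f"last forecast error:    {forecast_errors[-1]:.3f}")
```

Under these assumptions the line fits the history well, yet the forecast error keeps growing the further out you go, which is the pattern the green series shows.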
The difficulty is that you may not be able to tell you have an exponential process simply by looking at the data. The first 30 or so observations look pretty linear, especially when you throw in a random error term as I did. So what do you do? Look at the residuals. In the second chart, the vertical bars represent the difference between the observed historical values and the values the model fitted to them. Notice that the residuals are mostly positive at the ends and negative in the middle. This systematic movement in the residuals indicates you are missing something with your simple linear model, and that you may have problems predicting future values of the process you are analyzing. The solid line in the graph is a third-order polynomial, included to show that the residuals do indeed tend to follow a systematic pattern.
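The residual check described above can be sketched the same way. Again, the data-generating parameters and seed are assumptions for illustration, and the choice to compare the first five, middle ten, and last five residuals is just one hypothetical way to put numbers on "positive at the ends, negative in the middle".

```python
import numpy as np

# Contrived exponential history with noise (assumed parameters, as before).
rng = np.random.default_rng(1)
x = np.arange(1, 31, dtype=float)
y = np.exp(0.05 * x) + rng.normal(0.0, 0.1, size=x.size)

# Residuals from a straight-line fit: observed minus fitted.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Average residual at each end and in the middle of the sample.
left_end_mean = residuals[:5].mean()
middle_mean = residuals[10:20].mean()
right_end_mean = residuals[-5:].mean()

# Third-order polynomial through the residuals, like the solid line in the chart.
cubic_coeffs = np.polyfit(x, residuals, 3)
trend = np.polyval(cubic_coeffs, x)

# Share of residual variation that the cubic picks up. If the linear model
# were adequate, the residuals would be noise and this share would be small.
trend_share = 1.0 - np.sum((residuals - trend) ** 2) / np.sum(residuals ** 2)

print(f"mean residual, first 5:   {left_end_mean:+.3f}")
print(f"mean residual, middle 10: {middle_mean:+.3f}")
print(f"mean residual, last 5:    {right_end_mean:+.3f}")
print(f"share explained by cubic: {trend_share:.2f}")
```

A large share explained by the cubic says only that the residuals are not random; it does not say what the right model is.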
Now I would be remiss not to mention a word of caution. This analysis of the residuals does not say with certainty that you have an exponential process. What it says is that you have missed something. It could be a relevant variable, for instance if the process you are analyzing has a cyclical component. Or it is possible you have misspecified your model (which is the case in this example, though we only know that because the example is contrived). Ideally, you should learn about whatever process you are analyzing, or if you don't have time to do that, talk to somebody who has. This way you might get an idea of how you'd expect the data to behave (i.e., is it exponential, cyclical, linear, etc.). And always graph your data. It is much, much easier to see relationships in data when you graph it. Skipping the graphing stage will increase the likelihood of you doing something stupid.
Posted by Steve at October 31, 2003 11:32 AM