What Is a Residual Value in Statistics?
If you’re analyzing data with linear regression, especially the Ordinary Least Squares (OLS) method, it’s important to understand what a residual is. Why does this matter? Because several assumption tests in OLS regression are computed directly on the residuals. That’s why you need a solid understanding of what residuals are and how to calculate them.
In this article, Kanda Data will walk you through the definition of residuals and how to compute them.
Understanding the Definition of a Residual
Let’s start with the basics. A residual is the difference between the actual observed value and the predicted value from a regression model. In simpler terms, it’s the gap between the actual Y and the predicted Y values.
Now, to make sense of that, we need to understand what we mean by actual Y and predicted Y. If you’re familiar with regression analysis, you’ve probably heard of the dependent variable: the variable that’s influenced by one or more independent variables.
The actual Y refers to the value of the dependent variable collected from your data — whether it’s cross-sectional or time-series data. So, the values of the dependent variable you gather through surveys or experiments are what we call actual Y values.
On the other hand, predicted Y values are generated after you run a regression analysis and obtain the intercept and regression coefficients. Why run the regression first? Because predicted Y values are calculated from the estimated regression equation, which looks something like this:
Predicted Y = Intercept + (Coefficient1 × X1) + (Coefficient2 × X2) + … + (Coefficientn × Xn)
The number of coefficients depends on how many independent variables you include in your regression. For example, if your model has four predictors, you’ll end up with four estimated coefficients, plus the intercept.
Once you’ve got your regression equation, you can start calculating the predicted Y for each observation in your dataset. From there, calculating the residual is straightforward:
Residual = Actual Y – Predicted Y
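To make this concrete, here is a minimal Python sketch, assuming you have statsmodels installed; the data, values, and variable names are made up purely for illustration:

```python
# A minimal sketch with made-up data: fit an OLS model with statsmodels,
# compute the predicted Y values, then the residuals.
import numpy as np
import statsmodels.api as sm

# Actual Y (the dependent variable) and two predictors, X1 and X2
y = np.array([10.0, 12.5, 14.0, 15.5, 18.0, 21.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([3.0, 1.5, 4.0, 2.0, 5.0, 3.5])

X = sm.add_constant(np.column_stack([x1, x2]))  # add the intercept column
model = sm.OLS(y, X).fit()                      # estimates intercept and coefficients

predicted_y = model.predict(X)   # Predicted Y from the estimated equation
residuals = y - predicted_y      # Residual = Actual Y - Predicted Y

# statsmodels stores the same values in model.resid
print(residuals)
```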
Residuals Can Be Positive or Negative
Once you’ve calculated the residuals, you’ll notice something interesting: residuals can be either positive or negative. So how do we interpret that?
A positive residual means the actual Y is higher than the predicted Y. In other words, the model underestimated the value. A negative residual means the actual Y is lower than the predicted Y, meaning the model overestimated the value.
These differences give you insight into how well (or poorly) your model is performing.
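If you want to see this in code, a short loop over the sketch above makes the interpretation explicit (again, the numbers are illustrative only):

```python
# Continuing the earlier sketch: the sign of each residual shows whether
# the model under- or overestimated that observation.
for actual, predicted, resid in zip(y, predicted_y, residuals):
    if resid > 0:
        label = "underestimated"
    elif resid < 0:
        label = "overestimated"
    else:
        label = "exact fit"
    print(f"actual={actual:.1f}  predicted={predicted:.2f}  "
          f"residual={resid:+.2f}  ({label})")
```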
Using Residuals in Classical Assumption Tests
Residuals are not just for measuring prediction errors. They also play an essential role in classical assumption testing in regression analysis. One key assumption in OLS regression is that residuals must be normally distributed.
To check this, we perform normality tests on the residuals, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test. If the p-value from the test is greater than 0.05, you fail to reject the null hypothesis of normality, so you can treat the residuals as normally distributed. That means your regression model satisfies one of the core assumptions of OLS.
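Here’s a hedged sketch of how you could run these tests in Python with SciPy, reusing the residuals from the earlier example; your own statistical software will report the same kind of output:

```python
# Normality checks on the residuals from the earlier sketch.
from scipy import stats

# Shapiro-Wilk is usually preferred for small samples.
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p_value:.3f}")

# Kolmogorov-Smirnov against a normal distribution with the residuals'
# own mean and standard deviation. (Strictly speaking, estimating those
# parameters from the data calls for the Lilliefors correction, available
# in statsmodels, but this shows the basic idea.)
stat, p_value = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std(ddof=1)))
print(f"Kolmogorov-Smirnov: D={stat:.3f}, p={p_value:.3f}")

# A p-value above 0.05 means you fail to reject normality.
```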
Once this condition is met, you can proceed to test other assumptions like homoscedasticity and multicollinearity.
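As a preview of those checks, here’s a rough sketch using statsmodels, again building on the earlier example; treat it as an outline rather than a full tutorial:

```python
# Follow-up assumption checks, assuming model and X from the earlier
# sketch are still in scope.
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Breusch-Pagan tests the residuals for homoscedasticity:
# a p-value above 0.05 suggests constant error variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3f}")

# Multicollinearity is checked on the predictors, not the residuals;
# VIF values well above 10 are a common warning sign.
for i in range(1, X.shape[1]):  # skip the intercept column
    print(f"VIF for X{i}: {variance_inflation_factor(X, i):.2f}")
```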
Conclusion
That’s a wrap on what residuals are, how to calculate them, and why they matter in regression analysis. And guess what? Residuals are used in other types of regression tests too, but we’ll save that for the next article.
Thanks for reading! I hope this guide helped clarify the concept of residuals and gave you some new insights into your regression work. Stay tuned for more data tips and tutorials from Kanda Data!