Normality Test in Regression: Should We Test the Raw Data or the Residuals?

When we analyze data using linear regression with the OLS method, several assumptions must be met. When the classical assumptions hold, the OLS estimator is unbiased and has the smallest variance among all linear unbiased estimators. This is what we refer to as the Best Linear Unbiased Estimator (BLUE).

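One assumption that is commonly tested is normality, which matters mainly for the validity of t-tests, F-tests, and confidence intervals, especially in small samples. However, many people often ask me: in a regression normality test, are we supposed to test the raw data collected from the field, or the residuals?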

Don’t worry—this article from Kanda Data will discuss in depth the correct way to conduct a normality test in OLS regression analysis.

Understanding the Basics of the Normality Test in Regression

As mentioned earlier, the normality test is one of the assumption tests in linear regression using the OLS method. It is often described loosely as checking whether "the data" are normally distributed, which raises the question: which data?

To answer this, we need to understand the theoretical foundation of OLS regression. In theory, the normality assumption applies to the error term of the regression model, which we estimate with the residuals. It does not apply to the raw data.

Based on this concept, we can already answer the question in the article title: the normality test in regression is conducted on the residuals, not on the raw data.

Still, many are unaware that residuals are different from the raw data obtained during research. Therefore, to clarify, I emphasize once again: in the normality test for regression, we are testing the residuals, as the assumption of OLS regression is that the residuals are normally distributed.
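To make the distinction concrete, here is a minimal simulation sketch in Python, assuming NumPy and SciPy are available. The data are entirely made up: a skewed independent variable, a linear relationship, and normally distributed errors. Residuals are computed as observed minus predicted, which the next section explains in detail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Simulated data (an assumption for illustration): a strongly skewed
# predictor and normally distributed errors around a linear relationship.
n = 500
x = rng.exponential(scale=2.0, size=n)
errors = rng.normal(loc=0.0, scale=1.0, size=n)
y = 1.0 + 3.0 * x + errors

# Fit OLS with an intercept and compute residuals (observed minus predicted).
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

# Shapiro-Wilk test on the raw dependent variable and on the residuals.
_, p_raw = stats.shapiro(y)
_, p_resid = stats.shapiro(residuals)

print(f"Shapiro-Wilk p-value, raw y:     {p_raw:.4g}")
print(f"Shapiro-Wilk p-value, residuals: {p_resid:.4g}")
```

In this setup the raw dependent variable inherits the skew of x, so a normality test on it rejects decisively even though the errors were generated from a normal distribution. The residual p-value will typically be far larger. Testing the raw data here would wrongly suggest the assumption is violated.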

To understand residuals more clearly, let’s explore what they are, so you can distinguish them from the raw data collected during research.

Understanding Residuals

A residual is the difference between the actual observed value and the predicted value. In the context of OLS linear regression, the observed value refers to the value of the dependent variable.

So, in simple terms, a residual is the difference between the actual observed value of the dependent variable and the predicted value of that same variable.

To get the observed value, we simply refer to the data collected directly from the field. To obtain the predicted value, we need to first estimate the regression equation. This involves processing the data to generate the intercept and regression coefficients for each independent variable.

Once we have those coefficients, we can calculate the predicted value of the dependent variable. Then, the residual can be calculated by subtracting the predicted value from the observed value.

Residuals can be either positive or negative: (a) A positive residual occurs when the observed value is greater than the predicted value; (b) A negative residual occurs when the observed value is less than the predicted value.
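The calculation described above can be sketched in a few lines of Python with NumPy. The numbers and variable names below are hypothetical and serve only to show the steps: estimate the coefficients, compute the predicted values, then subtract them from the observed values.

```python
import numpy as np

# Hypothetical data: x is the independent variable, y the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Design matrix with a column of ones so the model includes an intercept.
X = np.column_stack([np.ones_like(x), x])

# Estimate the intercept and the regression coefficient by least squares.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predicted values from the fitted equation, then the residuals.
y_pred = X @ coef
residuals = y - y_pred

for obs, pred, res in zip(y, y_pred, residuals):
    sign = "positive" if res > 0 else "negative" if res < 0 else "zero"
    print(f"observed={obs:.2f}  predicted={pred:.2f}  residual={res:+.3f} ({sign})")
```

When the model includes an intercept, OLS residuals sum to zero up to rounding error, which makes a quick sanity check on the calculation.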

How to Perform a Normality Test on Residuals

There are several ways to test the normality of residuals. You can: (a) Use statistical analysis, or (b) Use visual diagrams (e.g., histograms, Q-Q plots).
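For the visual route, the sketch below shows one way to obtain the numbers behind a histogram and a Q-Q plot in Python, assuming NumPy and SciPy. The `residuals` array is a simulated placeholder. In practice you would substitute the residuals from your own model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=100)  # placeholder for residuals from your model

# Q-Q plot coordinates: theoretical normal quantiles against sorted residuals.
# The fitted line's correlation r is close to 1 when the points are nearly straight.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

# Histogram counts: a roughly symmetric, bell-shaped profile supports normality.
counts, bin_edges = np.histogram(residuals, bins=10)

print("First Q-Q points:", list(zip(theoretical_q[:3].round(2), ordered_resid[:3].round(2))))
print(f"Q-Q line correlation r = {r:.3f}")
print("Histogram counts:", counts.tolist())
```

If matplotlib is installed, passing `plot=matplotlib.pyplot` to `stats.probplot` draws the Q-Q plot directly. Points that follow the straight line closely suggest approximately normal residuals, while systematic curvature at the ends points to skewness or heavy tails.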

However, the most commonly used method among researchers is statistical testing. Once you’ve calculated the residuals, you can use one of the following tests: (a) Kolmogorov-Smirnov test, or (b) Shapiro-Wilk test.

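You can use either of these tests, and they often lead to the same conclusion, although Shapiro-Wilk is generally the more powerful of the two, particularly in small samples. Note that when the mean and standard deviation are estimated from the residuals themselves, the Kolmogorov-Smirnov test should be applied with the Lilliefors correction. If you’d like, you can even use both to cross-check your analysis.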

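All you need to do is input the residual values into statistical software and run one of the normality tests mentioned above. The null hypothesis of both tests is that the residuals are normally distributed, so at the 5% significance level the rule of thumb is: (a) If the p-value > 0.05, we fail to reject the null hypothesis, and the residuals can be treated as normally distributed; (b) If the p-value ≤ 0.05, we reject the null hypothesis and conclude that the residuals are not normally distributed.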
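As a rough sketch of how this looks in code, the Python snippet below runs both tests with SciPy and applies the 0.05 rule. The `residuals` array is simulated for illustration, and the `verdict` helper is a hypothetical function written for this example, not part of any library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
residuals = rng.normal(loc=0.0, scale=2.0, size=80)  # placeholder residuals

ALPHA = 0.05  # significance level used in the rule of thumb


def verdict(p_value, alpha=ALPHA):
    """Translate a p-value into the decision about the normality hypothesis."""
    if p_value > alpha:
        return "fail to reject normality"
    return "reject normality"


# Shapiro-Wilk test applied directly to the residuals.
sw_stat, sw_p = stats.shapiro(residuals)

# Kolmogorov-Smirnov test against the standard normal distribution.
# The residuals are standardized first because "norm" means mean 0, sd 1.
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")

print(f"Shapiro-Wilk:       W={sw_stat:.4f}, p={sw_p:.4f} -> {verdict(sw_p)}")
print(f"Kolmogorov-Smirnov: D={ks_stat:.4f}, p={ks_p:.4f} -> {verdict(ks_p)}")
```

One caveat about this sketch: because the mean and standard deviation are estimated from the same residuals, the plain Kolmogorov-Smirnov p-value tends to be too large. The Lilliefors variant, available for example as `lilliefors` in `statsmodels.stats.diagnostic`, adjusts for this.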

Since one of the assumptions of OLS regression is that residuals must be normally distributed, we hope to see a p-value greater than 0.05. Also, data measured using interval or ratio scales are more likely to meet the normality assumption compared to data measured using ordinal scales.

Final Answer to the Question

Based on the explanation above, the question in the article title is now answered: In linear regression using the OLS method, the normality test is performed on the residuals, not on the raw data.

This distinction is important so that readers can differentiate between the normality test in regression and other types of statistical normality tests. Thanks to statistical software, we no longer have to manually calculate residuals—they can be generated automatically and then tested for normality.

That concludes this article. I hope it has been useful and provided new insights for all of you. If you found this article helpful, don’t hesitate to share it with others. Stay tuned for more updates from Kanda Data in the next article!
