Assumption Tests in Linear Regression Using Survey Data
The linear regression method most commonly used by researchers is Ordinary Least Squares (OLS). However, when applying linear regression with the OLS method, several assumptions must be met to ensure that the estimation results are consistent and unbiased.
If you are conducting a survey-based study involving respondents as research samples and you choose to analyze the data using multiple linear regression, it is important that you thoroughly understand the assumption tests that must be performed. In this article, Kanda Data will discuss the assumption tests required when applying multiple linear regression to survey data.
1. Normally Distributed Residuals
The first assumption you need to understand is that the residuals in OLS regression must be normally distributed. To test this assumption, you first need to calculate the residual values from the regression model you have developed.
So, what are residuals? A residual is the difference between the actual Y value and the predicted Y value, that is, between the observed value and the value predicted by the regression model.
Once you have the residuals, you need to perform a normality test to ensure that they are normally distributed. You can test residual normality using statistical tests or graphical methods; however, statistical tests are more objective than visual inspection.
Commonly used normality tests include the Shapiro-Wilk test and the Kolmogorov-Smirnov test. If the test results show a p-value greater than 0.05, it can be concluded that the residuals are normally distributed. Conversely, if the p-value is less than 0.05, it indicates that the residuals are not normally distributed.
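As an illustrative sketch, the steps above (fit the model, compute residuals, test them for normality) can be done in Python with NumPy and SciPy. The data here are simulated and the variable names are hypothetical, standing in for whatever survey variables you actually have:

```python
import numpy as np
from scipy import stats

# Simulated survey-style data: outcome y and two predictors x1, x2.
rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(5, 1, n)
x2 = rng.normal(3, 1, n)
y = 2 + 1.5 * x1 + 0.8 * x2 + rng.normal(0, 1, n)

# Fit OLS via least squares and compute residuals (actual - predicted).
X = np.column_stack([np.ones(n), x1, x2])  # first column is the intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Shapiro-Wilk test: a p-value above 0.05 suggests the residuals
# are consistent with a normal distribution.
w_stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {w_stat:.4f}, p-value = {p_value:.4f}")
```

The Kolmogorov-Smirnov test could be substituted via `scipy.stats.kstest`, though Shapiro-Wilk is generally preferred for the moderate sample sizes typical of surveys.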
2. Constant Residual Variance
The next assumption in regression analysis using survey data is that the residual variance across the predictor variables should be constant. This condition is referred to as homoscedasticity.
If the residual variance is not constant, the model exhibits heteroscedasticity, which violates the assumption. Ideally, a good regression model should meet the homoscedasticity assumption. To test this assumption, one commonly used method is the Breusch-Pagan test.
3. No Strong Correlation Among Independent Variables
Another important assumption to ensure consistent and unbiased estimates is the absence of strong correlation among independent variables.
If the independent variables are highly correlated, the coefficient estimates become unstable and their standard errors are inflated. This issue is known as multicollinearity.
To prevent this problem, it is important to verify that the regression model is free from multicollinearity. You can assess this by checking the correlation among independent variables or by calculating the Variance Inflation Factor (VIF).
VIF is one of the most widely used indicators among researchers. If the VIF value is less than 10, it is generally concluded that there is no multicollinearity problem.
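The VIF for predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. As a sketch, assuming only NumPy and simulated predictors (the correlation between x1 and x2 below is built in deliberately):

```python
import numpy as np

# Simulated predictors: x2 is partly driven by x1, x3 is independent.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(5, 1, n)
x2 = 0.5 * x1 + rng.normal(0, 1, n)
x3 = rng.normal(2, 1, n)
predictors = np.column_stack([x1, x2, x3])

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing
    predictor j on the remaining predictors (with an intercept)."""
    n_obs, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n_obs), np.delete(X, j, axis=1)])
        target = X[:, j]
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        ss_res = np.sum((target - others @ coef) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return out

for name, v in zip(["x1", "x2", "x3"], vif(predictors)):
    print(f"VIF({name}) = {v:.2f}")  # values below 10: no multicollinearity concern
```

In practice you would pass your actual predictor columns; statsmodels also provides a ready-made `variance_inflation_factor` function that computes the same quantity.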
4. Linearity Test
Another key assumption is linearity. Since we are using linear regression analysis, the relationship between the independent and dependent variables should follow a linear pattern.
Therefore, scatter plots of the variables used should ideally show a roughly linear pattern. You need to test the linearity of the regression model; if the plotted data points follow a linear trend, the linearity assumption is considered met.
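Alongside visual inspection of scatter plots, a simple numeric check is to ask whether adding a squared term meaningfully improves the fit; if it barely does, a linear form is adequate. This is an illustrative sketch with simulated data, not a formal test (a formal alternative would be the Ramsey RESET test):

```python
import numpy as np

# Simulated data generated with a genuinely linear relationship.
rng = np.random.default_rng(7)
n = 100
x = rng.normal(5, 1, n)
y = 2 + 1.5 * x + rng.normal(0, 1, n)

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ coef
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Compare a straight-line fit with one that adds a squared term.
r2_linear = r_squared(np.column_stack([np.ones(n), x]), y)
r2_quadratic = r_squared(np.column_stack([np.ones(n), x, x ** 2]), y)

# A negligible improvement suggests the linear specification is adequate.
print(f"R^2 linear    = {r2_linear:.4f}")
print(f"R^2 quadratic = {r2_quadratic:.4f}")
print(f"improvement   = {r2_quadratic - r2_linear:.4f}")
```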
In summary, when working with survey data, which is typically cross-sectional, it is essential to pay attention to the assumption tests required for multiple linear regression. These assumption tests should be aligned with the four main points discussed above.
Thank you for reading this article. We hope it is useful for those who need it. Stay tuned for more updates from Kanda Data in the next article!