A multiple linear regression model must satisfy several assumptions for the ordinary least squares (OLS) method to produce the best linear unbiased estimator. One of these assumptions is non-multicollinearity.
What is the non-multicollinearity assumption in linear regression? It requires that there be no strong correlation among the independent variables; only when this holds can OLS deliver the best linear unbiased estimator.
If the independent variables in a multiple linear regression equation are strongly correlated, the coefficient estimates become unstable and their standard errors are inflated. Unreliable estimates can cause the conclusions from the research to misrepresent the population.
Is it true that the multicollinearity test is only conducted in multiple linear regression equations?
As I wrote in the previous paragraph, the non-multicollinearity assumption test is aimed at ensuring that there is no strong correlation between the independent variables. Thus, the non-multicollinearity assumption test is performed on multiple linear regression equations with at least two independent variables.
Maybe you will ask: does simple linear regression need to be tested for the non-multicollinearity assumption? The answer is no. Simple linear regression consists of only one dependent variable and one independent variable, so there is no second independent variable for it to correlate with. Therefore, simple linear regression does not require a non-multicollinearity test.
How to detect multicollinearity
Detecting multicollinearity in a multiple linear regression equation can be done by examining the correlations between the independent variables. The easiest approach is to check the variance inflation factor (VIF) of each independent variable.
A large VIF value indicates a multicollinearity problem in multiple linear regression. On the other hand, a small VIF value indicates that the multiple linear regression equation is free from multicollinearity problems.
Books and previous research generally use 10 as the cutoff: if the VIF value is less than 10, it is assumed that there is no multicollinearity problem in the multiple linear regression; conversely, if the VIF value is more than 10, the linear regression equation is assumed to have a multicollinearity problem.
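For readers who want to see the mechanics, the VIF for variable j is computed as VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing X_j on the other independent variables. Below is a minimal sketch in Python with statsmodels (my choice of tooling; the article itself uses SPSS) showing both the by-hand calculation and the library shortcut, on hypothetical toy data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical toy data: three independent variables, 15 observations.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "X1": rng.normal(size=15),
    "X2": rng.normal(size=15),
    "X3": rng.normal(size=15),
})

def vif_by_hand(X: pd.DataFrame, col: str) -> float:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from
    regressing column `col` on all the other independent variables."""
    others = sm.add_constant(X.drop(columns=col))
    r_squared = sm.OLS(X[col], others).fit().rsquared
    return 1.0 / (1.0 - r_squared)

for col in X.columns:
    print(col, round(vif_by_hand(X, col), 3))

# statsmodels provides the same diagnostic directly (index 0 is the constant):
exog = sm.add_constant(X)
print([variance_inflation_factor(exog.values, i + 1) for i in range(X.shape[1])])
```

With uncorrelated toy columns like these, both approaches should print VIF values close to 1.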
What if it turns out that my regression equation has a multicollinearity problem?
The researcher needs to take action if the multiple linear regression equation has a multicollinearity problem. As I wrote in the previous paragraph, a regression equation with a multicollinearity problem produces unstable estimates with inflated standard errors.
Several ways can be used to overcome the problem of multicollinearity, namely: (1) omit variables that have a high VIF value (sketched in code after this list); (2) in cross-sectional data, replace outlier observations with new data from the field; (3) add or reduce the number of observations; (4) transform the variables; (5) apply other methods consistent with statistical rules.
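As a hedged illustration of option (1), the sketch below repeatedly drops the predictor with the largest VIF until every remaining VIF falls below 10. The function name and the threshold default are my own choices, not a standard API:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the predictor with the highest VIF until all
    remaining predictors have VIF below `threshold`."""
    X = X.copy()
    while X.shape[1] > 1:
        exog = sm.add_constant(X)
        vifs = pd.Series(
            [variance_inflation_factor(exog.values, i + 1)
             for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            break
        X = X.drop(columns=vifs.idxmax())  # remove the worst offender first
    return X
```

Dropping one variable at a time matters, because removing a single offender can sharply lower the VIFs of the variables that remain.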
An example of how to solve multicollinearity in research data
To help you better understand how to solve multicollinearity in multiple linear regression with the OLS method, Kanda Data has prepared a mini research example. The mini research in this example is intended to determine the effect of advertising costs, marketing personnel, store operating costs, and bonuses for marketing personnel on car sales.
Data has been collected from 15 car stores in an area. The multiple linear regression equation is specified as follows:
Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + e
Description:
Y = car sales (units)
X1 = advertising cost (USD)
X2 = marketing personnel (people)
X3 = store operating cost (USD)
X4 = bonus for marketing personnel (USD)
b0 = intercept
b1, b2, b3, b4 = regression coefficients
e = disturbance error
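Before moving to SPSS, here is a rough code equivalent of this specification in Python with statsmodels (an assumption on my part; the file name car_sales.csv and its column names are hypothetical placeholders):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names matching the specification above.
data = pd.read_csv("car_sales.csv")  # 15 stores: Y, X1, X2, X3, X4
model = smf.ols("Y ~ X1 + X2 + X3 + X4", data=data).fit()
print(model.summary())
```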
The collected data is then entered into SPSS. Based on the equation specification, the Y variable is assigned as the dependent variable, while the X1-X4 variables are assigned as the independent variables.
Setting the “Variable View” page in SPSS according to the equation specifications can be seen in the image below:
The research data that has been inputted on the SPSS “Data View” page can be seen in the image below:
How to generate the VIF value in SPSS
After the data is entered, you need to display the VIF values to detect multicollinearity. Generating the VIF values follows almost the same steps as a usual multiple linear regression analysis, except that you must also check the collinearity diagnostics option.
Click Analyze -> Regression -> Linear. Next, move car sales into the Dependent box and move advertising costs, marketing personnel, store operating costs, and bonuses for marketing personnel into the Independent(s) box, as shown below:
Next, click Statistics at the top right. When the “Linear Regression: Statistics” window appears, tick “Collinearity diagnostics” and click Continue, as shown below:
After clicking Continue and then OK, the SPSS output will appear. Pay attention to the Coefficients table: to the right of the Sig. column there are two additional columns, Tolerance and VIF (Tolerance is simply the reciprocal of VIF), as shown in the image below:
Based on the output above, the VIF values for the independent variables are:
Advertising cost (X1) -> VIF = 17.374
Marketing Personnel (X2) -> VIF = 1279.992
Store operating costs (X3) -> VIF = 7.898
Bonus for marketing personnel (X4) -> VIF = 1381.480
Based on these values, we can conclude that the advertising cost (X1), marketing personnel (X2), and bonus for marketing personnel (X4) variables have multicollinearity problems because their VIF values are > 10.
How to solve multicollinearity
My approach to solving multicollinearity starts with the independent variables that have the largest VIF values. Based on the SPSS output, the first step I took was to review the data for the marketing personnel (X2) and bonus for marketing personnel (X4) variables.
Why did I choose to examine these two variables? Because both have extremely high VIF values, and the two values are nearly identical.
Next, I checked the data for the marketing personnel and bonus for marketing personnel variables. In this study, the bonus is paid according to the same standard at every store: the more marketing personnel a store has, the higher its total bonus. In other words, X4 is almost a perfect linear function of X2, which explains the huge, nearly matching VIFs. More details can be seen in the data below:
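A quick way to confirm this diagnosis in code is to check the pairwise correlation between the two columns; this is a hedged sketch reusing the hypothetical file and column names from the earlier snippet:

```python
import pandas as pd

data = pd.read_csv("car_sales.csv")  # hypothetical file from the earlier sketch
# A correlation close to 1.0 confirms that X4 is almost a linear function of X2.
print(data["X2"].corr(data["X4"]))
```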
I excluded the bonus for marketing personnel (X4) variable to solve the multicollinearity problem. The variable specification in SPSS then consists of only three independent variables, as shown in the figure below:
Then the analysis was rerun in SPSS in the same way. The SPSS output can be seen in the table below:
Based on the output above, the VIF values for the independent variables are:
Advertising cost (X1) -> VIF = 8.579
Marketing Personnel (X2) -> VIF = 3.081
Store operating costs (X3) -> VIF = 5.879
Based on the VIF values of the new multiple linear regression equation, the advertising cost (X1), marketing personnel (X2), and store operating cost (X3) variables are now free from multicollinearity problems, with all VIF values < 10. In particular, the advertising cost (X1) and marketing personnel (X2) variables, whose VIF values previously exceeded 10, now fall below the cutoff.
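In code form, the fix amounts to refitting without X4 and recomputing the diagnostics; here is a hedged sketch using the same hypothetical setup as the earlier snippets:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("car_sales.csv")   # hypothetical file name
X = data[["X1", "X2", "X3"]]          # X4 excluded to break the collinearity
exog = sm.add_constant(X)
vifs = [variance_inflation_factor(exog.values, i + 1) for i in range(X.shape[1])]
print(dict(zip(X.columns, vifs)))     # all values should now fall below 10
```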
Closing remarks
Based on what I have written in the previous paragraphs, it can be concluded that several approaches can be used to overcome a multicollinearity problem. My advice is to start with the method demonstrated here: review and omit the variable with the largest VIF. If that still does not work, you can try one of the other methods.
That is the article I can share with all of you. I hope it is helpful and provides added value for those who need this material. Look out for next week's article update!