In multiple linear regression analysis using cross-sectional data, there are several assumption tests that must be conducted to obtain the best linear unbiased estimator. It is crucial to understand which assumption tests are required for research utilizing cross-sectional data. This is important because the assumption tests for cross-sectional, time series, and panel data differ in some respects.
In this article, Kanda Data focuses specifically on cross-sectional data, which is widely used globally by researchers based on survey results across various fields of research. To ensure the validity and interpretability of the multiple linear regression estimation results, we need to have a solid understanding of at least the basic assumption tests required.
Before delving deeper into the assumption tests for linear regression, which typically uses the ordinary least squares (OLS) method, let’s first review the basic theory of multiple linear regression and the definition of cross-sectional data.
Definition of Multiple Linear Regression
Based on its basic theory, multiple linear regression can be defined as an inferential statistical technique used to model the relationship between a dependent variable (Y) and more than one independent variable (X1, X2, …, Xn). According to this definition, a model with at least two independent variables is referred to as multiple linear regression.
From this definition, we can then construct the general equation for multiple linear regression as follows:
Y = β0 + β1X1 + β2X2 + … + βnXn + e
where:
Y is the dependent variable,
X1, X2, …, Xn are the independent variables (at least two independent variables),
β0 is the constant (intercept),
β1, β2, …, βn are the regression coefficients (one for each independent variable),
e is the residual or error term.
As previously mentioned, multiple linear regression generally uses the Ordinary Least Squares (OLS) method. OLS is a method used to estimate regression coefficients in linear regression by minimizing the sum of the squared residuals.
Residuals can be defined as the difference between the observed values and the predicted values, that is, the difference between the actual Y and the predicted Y. To obtain efficient, consistent, and unbiased estimates, we must ensure that all of the assumptions required by OLS regression are met.
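To make this concrete, here is a minimal sketch in Python (a language chosen only for this illustration) showing how the OLS coefficients minimize the sum of squared residuals via the normal equations. The data values are purely hypothetical.

```python
import numpy as np

# Hypothetical observations: two independent variables and one dependent variable
X1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X2 = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
Y = np.array([5.0, 9.0, 11.0, 17.0, 18.0])

# Design matrix with a column of ones for the constant (beta_0)
X = np.column_stack([np.ones_like(X1), X1, X2])

# OLS via the normal equations: beta = (X'X)^(-1) X'Y,
# the solution that minimizes the sum of squared residuals
beta = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# Residuals: actual Y minus predicted Y
residuals = Y - X @ beta

print("Estimated coefficients (beta_0, beta_1, beta_2):", beta)
print("Sum of squared residuals:", np.sum(residuals ** 2))
```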
Definition of Cross-Sectional Data
Before discussing the assumptions for multiple linear regression on cross-sectional data, it’s essential to first understand the definition of cross-sectional data. Cross-sectional data refers to data collected at a specific point in time from various subjects or units of observation, such as individuals, households, or companies.
From this definition, it is clear that cross-sectional data is collected at a single point in time. For example, a researcher might observe rice production in 100 farmer households in the XYZ region. The researcher collects data on rice production during the study period from each farmer household through structured interviews.
Thus, the rice production data collected from the 100 farmer households represents data gathered from 100 subjects or units of observation at a single point in time. The collected data is called cross-sectional data. Hopefully, this example helps clarify the definition of cross-sectional data.
Case Study Example: Assumption Tests for Multiple Linear Regression on Cross-Sectional Data
To better understand the implementation of assumption tests required for multiple linear regression using cross-sectional data, let’s discuss a case study. Suppose a researcher observes the influence of advertising costs and the number of marketing staff on product sales at 100 convenience stores in City ABC.
The researcher collects data from 100 convenience stores as subjects or units of observation, and the data is collected at a single point in time. Based on this case study, we can specify the multiple linear regression equation as follows:
Y = β0 + β1(Advertising Costs) + β2(Marketing Staff) + e
Before conducting multiple linear regression analysis based on this equation, we must perform several assumption tests to obtain the best linear unbiased estimator. Let's discuss each assumption test required for this case study.
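As a hedged illustration, the sketch below shows how this case-study equation could be estimated in Python with the statsmodels library. The data are simulated and the column names (advertising_costs, marketing_staff, sales) are assumptions made only for the example; the fitted residuals are reused in the assumption-test sketches that follow.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated (hypothetical) data for 100 convenience stores
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "advertising_costs": rng.uniform(5, 50, 100),   # hypothetical cost units
    "marketing_staff": rng.integers(1, 10, 100),    # number of staff
})
df["sales"] = (20 + 3 * df["advertising_costs"]
               + 8 * df["marketing_staff"]
               + rng.normal(0, 10, 100))            # hypothetical sales with noise

# Fit the multiple linear regression with OLS
X = sm.add_constant(df[["advertising_costs", "marketing_staff"]])
model = sm.OLS(df["sales"], X).fit()
print(model.summary())

# Residuals (actual sales minus predicted sales), used by the tests below
residuals = model.resid
```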
Residual Normality Assumption
The first assumption to test is the normality of residuals. Recall that residuals are the differences between the actual Y values and the predicted Y values. This assumption requires that the distribution of the residuals follows a normal distribution.
There are several methods to test the normality of residuals. Two commonly used tests are the Kolmogorov-Smirnov test and the Shapiro-Wilk test. Additionally, we can support these tests by visualizing the data with a residual histogram or a P-P plot.
After performing the normality test using either the Kolmogorov-Smirnov or Shapiro-Wilk test, if the p-value is greater than 0.05, we fail to reject the null hypothesis, indicating that the residuals are normally distributed. Conversely, if the p-value is less than 0.05, we reject the null hypothesis, indicating that the residuals are not normally distributed.
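Continuing from the fitted model above, a minimal sketch of both tests with scipy might look like this; the 0.05 threshold mirrors the decision rule just described.

```python
from scipy import stats

# Shapiro-Wilk test on the OLS residuals
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Kolmogorov-Smirnov test against a normal distribution parameterized by the
# residuals' own mean and standard deviation (estimating the parameters from
# the data makes this approximate; a Lilliefors-style correction is stricter)
ks_stat, ks_p = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std()))

print(f"Shapiro-Wilk p-value:       {shapiro_p:.4f}")
print(f"Kolmogorov-Smirnov p-value: {ks_p:.4f}")
# p-value > 0.05: residuals consistent with a normal distribution
# p-value < 0.05: residuals not normally distributed
```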
Homoscedasticity Assumption
The next assumption is homoscedasticity, which states that the variance of the residuals must remain constant across all levels of the independent variables. If the residual variance is not constant, heteroscedasticity is present. The regression equation we build should be free of heteroscedasticity (that is, the homoscedasticity assumption should be met).
To detect heteroscedasticity in the regression equation from the case study, we can use the Breusch-Pagan test or the White test. Additionally, we can create a scatter plot between the residuals and predicted values to ensure that heteroscedasticity is absent.
If the Breusch-Pagan test yields a p-value greater than 0.05, we fail to reject the null hypothesis, indicating that the homoscedasticity assumption is met. If the p-value is less than 0.05, we reject the null hypothesis, indicating that heteroscedasticity is present (the homoscedasticity assumption is not met).
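Assuming the same fitted model, the Breusch-Pagan test is available in statsmodels; a short sketch:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: regresses the squared residuals on the explanatory
# variables (X must include the constant term, as added earlier)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, X)

print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")
# p-value > 0.05: homoscedasticity assumption met
# p-value < 0.05: heteroscedasticity present
```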
Non-Multicollinearity Assumption
The non-multicollinearity assumption test can only be conducted in multiple linear regression. Multicollinearity occurs when there is a very strong relationship between two or more independent variables in the regression model. For instance, in the case study, if there is a strong relationship between advertising costs and the number of marketing staff, the regression equation suffers from multicollinearity.
Multicollinearity does not bias the OLS coefficients, but it inflates their standard errors, making the estimates unstable and difficult to interpret. To detect multicollinearity, we generally look at the Variance Inflation Factor (VIF) or Tolerance values.
If the VIF values for advertising costs and marketing staff are each below 10, then the non-multicollinearity assumption is met. Thus, we can conclude that there is no strong correlation between advertising costs and the number of marketing staff.
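A sketch of the VIF check with statsmodels, again reusing the design matrix X from the fitted model above (the constant column is skipped because its VIF is not meaningful here):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute the VIF for each independent variable
for i, name in enumerate(X.columns):
    if name == "const":
        continue  # skip the intercept column added by sm.add_constant
    vif = variance_inflation_factor(X.values, i)
    print(f"VIF for {name}: {vif:.2f}")
# VIF below 10 for every independent variable: non-multicollinearity assumption met
```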
Linearity Assumption
The linearity assumption is an additional check we can conduct. It states that the relationship between the independent variables and the dependent variable must be linear. It can be assessed with scatter plots of the dependent variable against each independent variable, or against the predicted values.
In the case study, if the scatter plots of sales against advertising costs and against the number of marketing staff show a straight-line pattern, the linearity assumption is met. If the pattern is curved or otherwise non-linear, the assumption is not met.
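As a simple visual check, and still assuming the simulated data above, the scatter plots could be drawn with matplotlib:

```python
import matplotlib.pyplot as plt

# Scatter plots of the dependent variable against each independent variable
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(df["advertising_costs"], df["sales"])
axes[0].set_xlabel("Advertising costs")
axes[0].set_ylabel("Sales")
axes[1].scatter(df["marketing_staff"], df["sales"])
axes[1].set_xlabel("Marketing staff")
axes[1].set_ylabel("Sales")
plt.tight_layout()
plt.show()
# A roughly straight-line pattern suggests the linearity assumption is met;
# a curved pattern suggests it is not.
```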
Solutions for Violating Assumptions
If one of the assumptions is not met, several steps can be considered: transforming the variables, adding or removing variables, or re-specifying the equation.
For example, logarithmic transformation can be used to address heteroscedasticity or non-normal residuals. Adding or removing variables can help resolve multicollinearity.
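For instance, still using the simulated data above and assuming all sales values are positive, a log transformation of the dependent variable could be sketched as follows; the assumption tests would then be repeated on the new residuals.

```python
import numpy as np
import statsmodels.api as sm

# Log-transform the dependent variable and refit the model
df["log_sales"] = np.log(df["sales"])   # requires strictly positive sales values
log_model = sm.OLS(df["log_sales"], X).fit()
log_residuals = log_model.resid         # re-run the assumption tests on these
print(log_model.summary())
```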
It’s fascinating to discuss solutions if assumption tests fail to meet the requirements. However, since this article is already lengthy and the key points have been covered, we will discuss these solutions in an article to be published at a later time.
Conclusion
Assumption tests in multiple linear regression are essential to ensure that the resulting model is reliable and provides the best linear unbiased estimator. In cross-sectional data, it is important to test for residual normality, homoscedasticity, multicollinearity, and linearity to ensure that the regression analysis provides valid results.
This concludes the article from Kanda Data. We hope it is helpful and provides new insights for those in need. Stay tuned for updates on Kanda Data’s articles in the coming weeks. Thank you.