For researchers and students writing their theses, Ordinary Least Squares (OLS) linear regression is likely a familiar method. It is one of the most commonly used techniques for analyzing the effect of independent variables on a dependent variable.
OLS linear regression is typically associated with numerical variables measured on an interval or ratio scale. In practice, however, we often encounter cases where we want to include an important categorical variable, such as business type or region, that is suspected to affect the dependent variable being measured.
The question then arises: can we include these categorical variables in an OLS linear regression model? The answer is yes. However, they require special treatment. This article by Kanda Data will discuss how to incorporate these categorical variables, their limitations, and how to ensure that the assumptions of OLS linear regression remain satisfied.
Understanding Categorical Variables
In statistics, you might recall the scales of data measurement. Data measurement scales can be divided into four types: nominal, ordinal, interval, and ratio scales. Interval and ratio scales are known as numerical variables, whereas nominal and ordinal scales are known as categorical variables.
In this article, the focus will be on nominal-scale categorical variables. A nominal categorical variable represents categories or groups without a specific order. Essentially, this variable only differentiates between groups without any inherent ranking. Examples include gender (male, female), employment status (permanent, contract, freelance), and region (rural, urban).
The defining characteristic is that these categories distinguish groups without implying any order or ranking. Such categorical variables carry no inherent numerical values, which means we cannot simply assign arbitrary numbers to the category labels and insert them into a regression model; those numbers could be misinterpreted as values with a specific order or distance.
Assumptions of OLS Linear Regression
Before incorporating categorical variables, it is crucial to first understand the basic assumptions of OLS regression. A number of assumptions must be met to produce the Best Linear Unbiased Estimator (BLUE).
If we refer back to a textbook such as the Theory of Econometrics, about 14 assumptions of OLS linear regression are discussed. However, I will outline the minimum set of assumptions that needs to be checked to ensure our regression model meets the required conditions.
Since the primary focus of this article is not to detail all of them, I will summarize the key OLS assumptions we need to know. First, the variance of the residuals must be constant (homoscedasticity). Second, the residuals must be normally distributed. Third, there must be no strong correlation among the independent variables (no multicollinearity). Finally, the relationship between the independent and dependent variables must be linear. If you are working with time series data, you also need to test for autocorrelation in the residuals.
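As a rough illustration of these checks, the sketch below fits an OLS model on simulated data (all variable names and true coefficients are assumed for illustration) and runs three crude diagnostics: a Shapiro-Wilk test on the residuals, a simple correlation between fitted values and squared residuals as an informal heteroscedasticity check, and a manually computed variance inflation factor (VIF) for multicollinearity. In practice, formal tests such as Breusch-Pagan are preferable; this is only a minimal sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200
X1 = rng.normal(10, 2, n)
X2 = rng.normal(5, 1, n)
# Simulated "true" model (coefficients assumed for illustration)
y = 3 + 0.8 * X1 + 1.5 * X2 + rng.normal(0, 1, n)

# Fit OLS via the design matrix and least squares
X = np.column_stack([np.ones(n), X1, X2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# 1. Normality of residuals (Shapiro-Wilk; p > 0.05 suggests normality)
w_stat, p_norm = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_norm:.3f}")

# 2. Informal homoscedasticity check: squared residuals vs. fitted values
r_het, p_het = stats.pearsonr(fitted, residuals ** 2)
print(f"Heteroscedasticity check p-value: {p_het:.3f}")

# 3. Multicollinearity: VIF for each predictor
def vif(X_pred, j):
    """VIF of column j: regress it on the other predictors."""
    others = np.delete(X_pred, j, axis=1)
    A = np.column_stack([np.ones(len(X_pred)), others])
    coef, *_ = np.linalg.lstsq(A, X_pred[:, j], rcond=None)
    resid = X_pred[:, j] - A @ coef
    r2 = 1 - resid.var() / X_pred[:, j].var()
    return 1 / (1 - r2)

X_pred = np.column_stack([X1, X2])
print("VIFs:", [round(vif(X_pred, j), 2) for j in range(2)])
```

Since X1 and X2 are generated independently here, the VIFs should be close to 1; values above roughly 10 are a common rule of thumb for problematic multicollinearity.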
After understanding these required assumptions, the addition of categorical variables must not violate them. So, where do we place these categorical variables in an OLS linear regression analysis? We can include them as dummy variables. Let’s explore this further!
Categorical Variables as Dummy Variables
To include nominal-scale categorical variables, we can input them as dummy variables. These dummy variables will sit alongside other independent variables in our linear regression equation.
A dummy variable is a binary variable that represents the presence of a category. The scoring technique for dummy variables uses scores of 0 and 1. For example, suppose we want to determine whether an import policy has affected domestic production over the past 15 years. We can create a dummy variable where the period before the import policy is scored 0, and the period after the policy is scored 1.
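The before/after scoring described above can be sketched in a few lines of pandas. The year range and policy year below are assumed purely for illustration, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical 15-year annual dataset
years = list(range(2010, 2025))
df = pd.DataFrame({"year": years})

# Assume the import policy takes effect in 2018 (hypothetical):
# years before the policy are scored 0, years from 2018 onward are scored 1
policy_year = 2018
df["import_policy"] = (df["year"] >= policy_year).astype(int)

print(df)
```

The resulting `import_policy` column can then be entered into the regression model alongside the other independent variables.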
The regression model, for instance, can be formulated as follows:
Y = β0 + β1X1 + β2X2 + β3D + ε
Where:
Y : Dependent variable
X1, X2 : Continuous independent variables
D : Dummy variable (0 or 1)
β0 : Intercept
β1, β2 : Coefficients of the independent variables X1, X2
β3 : Estimated coefficient of the dummy variable
ε : Error term
One thing we need to understand is that the interpretation of a dummy variable differs slightly from that of a continuous independent variable: its coefficient represents the average difference in the dependent variable between the category scored 1 and the reference category scored 0, holding the other variables constant.
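To make this interpretation concrete, the sketch below fits the equation above on simulated data in which the group scored 1 is constructed to have an intercept shift of 4 units. All numbers are assumed for illustration; the point is that the estimated β3 recovers the average group difference, holding X1 and X2 constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X1 = rng.normal(50, 10, n)
X2 = rng.normal(20, 5, n)
D = (rng.random(n) < 0.5).astype(float)  # dummy: 0 or 1

# True model (assumed): the D = 1 group has an intercept shift of +4
y = 10 + 0.5 * X1 + 1.2 * X2 + 4.0 * D + rng.normal(0, 2, n)

# Fit Y = b0 + b1*X1 + b2*X2 + b3*D by least squares
X = np.column_stack([np.ones(n), X1, X2, D])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b3 = beta

# b3 estimates the average difference in Y between the D = 1 group
# and the reference group (D = 0), holding X1 and X2 constant
print(f"Estimated dummy coefficient β3: {b3:.2f}")
```

The estimate of β3 should land near the simulated shift of 4, which is exactly how the coefficient is read in practice: the mean difference relative to the reference category.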
Limitations of Dummy Variables in OLS Regression Analysis
Although we can include categorical variables in an OLS linear regression equation, there are limitations. If we create a dummy for every category of a variable and keep the intercept in the model, the dummies always sum to one and cause perfect multicollinearity; this is known as the dummy variable trap, and it is avoided by omitting one category as the reference. Furthermore, if a variable has many categories, using dummies can make the model overly complex and prone to overfitting.
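The dummy variable trap is easy to see with pandas. In the sketch below, a hypothetical `region` variable with three categories is encoded two ways: with a dummy for every category (the columns sum to one, which is collinear with the intercept) and with `drop_first=True`, which omits one category as the reference group.

```python
import pandas as pd

df = pd.DataFrame({"region": ["rural", "urban", "suburban",
                              "rural", "urban", "suburban"]})

# A dummy for every category: the three columns always sum to 1,
# which is perfectly collinear with the model's intercept (the trap)
full = pd.get_dummies(df["region"], dtype=int)
print(full.sum(axis=1).unique())  # prints [1]

# drop_first=True omits the first category as the reference group
safe = pd.get_dummies(df["region"], drop_first=True, dtype=int)
print(safe.columns.tolist())
```

Here pandas drops the alphabetically first category, so the remaining coefficients are interpreted relative to that omitted reference group.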
Dummy variables are best suited to nominal scales. For ordinal scales, a dummy approach can be used, but it does not capture the ordering information efficiently.
Conclusion
Categorical variables measured on a nominal scale can be included in OLS linear regression, but not directly. We need to convert these variables into dummy variables so they can be mathematically interpreted within the OLS linear regression model.
The use of dummy variables allows us to analyze differences between groups quantitatively. However, it is necessary to pay attention to several important limitations, such as the dummy variable trap, the selection of the reference category, and model complexity.
With proper understanding, these nominal-scale categorical variables can actually enrich regression analysis and provide better insights. That concludes this article from Kanda Data for now. Hopefully, it is useful and provides additional insight for all of us. Stay tuned for the next article update from Kanda Data in future educational posts.
