Multiple linear regression analysis on time series data, along with its assumption tests, can be performed using R Studio. In a previous article, I explained how to conduct multiple linear regression analysis and assumption tests for cross-sectional data.
To read the complete article, please refer to my earlier post titled "How to Perform Multiple Linear Regression Analysis Using R Studio: A Complete Guide." In this tutorial, Kanda Data will specifically outline how to conduct the analysis and interpretation using time series data.
The fundamental difference lies in the assumption tests, which slightly differ between cross-sectional data and time series data. For time series data, multiple linear regression requires an autocorrelation test, whereas for cross-sectional data, the autocorrelation test is not needed.
I would like to emphasize that multiple linear regression analysis typically employs the Ordinary Least Squares (OLS) method. Therefore, we need to perform several assumption tests required for OLS linear regression. In this article, we will practice using a sample case study involving time series data.
In this guide, Kanda Data will thoroughly explain how to conduct multiple linear regression analysis on time series data using R Studio, including various diagnostic tests or OLS assumption tests such as normality test, homoscedasticity test, multicollinearity test, linearity test, and autocorrelation test, to ensure that the regression model is valid and scientifically justifiable.
Case Study: Multiple Linear Regression Analysis on Time Series Data
Before diving into the technical details of multiple linear regression analysis and its assumption tests, it is important to understand that multiple linear regression is used when there is more than one independent variable affecting a dependent variable.
Establishing the regression equation specification is the initial step we need to take when conducting multiple linear regression analysis on time series data. To facilitate understanding of this fundamental theory, let's create a sample research case.
A researcher observed that, according to theory and previous research, inflation and unemployment rates are determinants of economic growth. The researcher wants to verify whether inflation and unemployment rates negatively affect economic growth.
Therefore, the researcher observed inflation, unemployment rates, and economic growth in country ABC over 30 quarterly periods, successfully gathering 30 observations for each of the variables studied.
Based on this case example, the first step we need to take is to construct a multiple linear regression equation. The multiple linear regression equation for the given case study can be formulated as follows:
Y = β₀ + β₁X₁ + β₂X₂ + ε
Where:
Y is economic growth (%) as the dependent variable,
X₁ is the inflation rate (%) as the first independent variable,
X₂ is the unemployment rate (%) as the second independent variable,
β₀ is the intercept (constant),
β₁ and β₂ are the regression coefficients indicating the change in Y for a one-unit change in X₁ and X₂, respectively,
ε is the error term (residual).
After constructing the specification of the multiple linear regression equation, the next step is to tabulate the data collected over the 30 observed time periods. The tabulated data, organized according to the specified regression equation, are shown in the table below:
Period | Inflation_Rate (X1) | Unemployment_Rate (X2) | Economic_Growth (Y) |
--- | --- | --- | --- |
1 | 2.5 | 5.0 | 3.1 |
2 | 2.7 | 4.8 | 3.3 |
3 | 3.0 | 4.6 | 3.0 |
4 | 3.1 | 4.7 | 2.9 |
5 | 3.3 | 5.1 | 2.8 |
6 | 2.9 | 5.0 | 3.1 |
7 | 3.4 | 5.2 | 2.7 |
8 | 3.5 | 5.1 | 2.8 |
9 | 3.2 | 4.9 | 3.2 |
10 | 2.8 | 4.7 | 3.4 |
11 | 2.6 | 4.6 | 3.5 |
12 | 2.5 | 4.4 | 3.7 |
13 | 2.9 | 4.3 | 3.4 |
14 | 3.2 | 4.5 | 3.1 |
15 | 3.6 | 4.7 | 2.8 |
16 | 3.8 | 5.0 | 2.5 |
17 | 3.9 | 5.2 | 2.3 |
18 | 4.1 | 5.3 | 2.1 |
19 | 4.0 | 5.1 | 2.4 |
20 | 3.7 | 5.0 | 2.6 |
21 | 3.6 | 4.8 | 2.9 |
22 | 3.3 | 4.7 | 3.0 |
23 | 3.1 | 4.6 | 3.2 |
24 | 2.9 | 4.5 | 3.3 |
25 | 2.7 | 4.4 | 3.4 |
26 | 2.8 | 4.3 | 3.5 |
27 | 2.6 | 4.5 | 3.7 |
28 | 2.5 | 4.6 | 3.8 |
29 | 2.4 | 4.7 | 3.9 |
30 | 2.3 | 4.9 | 4.0 |
Multiple Linear Regression Analysis Command in R Studio and Interpretation of the Results
Once we have the data to be used for the analysis in this article, you need to download and install R and R Studio on your computer. If R Studio has been successfully installed, the next step is to conduct the multiple linear regression analysis.
After opening R Studio, the next step is to input the data into R Studio for analysis. There are two ways to do this: importing the data directly from Excel or typing it directly into the command line in R Studio. In this article, I will demonstrate the second method.
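For reference, the first method (importing from Excel) could look like the sketch below. Note that the file name "regression_data.xlsx" is only an assumption for illustration; the readxl package must be installed first, and the Excel file is assumed to contain one column per variable.

# Hypothetical example of importing the data from Excel (method one)
install.packages("readxl")
library(readxl)

# Assumes a file with columns Inflation_Rate, Unemployment_Rate, Economic_Growth
data <- read_excel("regression_data.xlsx")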
For the second method, copy the data for each variable from Excel, paste it into R Studio, and separate the values with commas (,). Then enter the following command:
# Inputting the data
data <- data.frame(
  Inflation_Rate = c(2.5, 2.7, 3.0, 3.1, 3.3, 2.9, 3.4, 3.5, 3.2, 2.8, 2.6, 2.5, 2.9, 3.2, 3.6, 3.8, 3.9, 4.1, 4.0, 3.7, 3.6, 3.3, 3.1, 2.9, 2.7, 2.8, 2.6, 2.5, 2.4, 2.3),
  Unemployment_Rate = c(5.0, 4.8, 4.6, 4.7, 5.1, 5.0, 5.2, 5.1, 4.9, 4.7, 4.6, 4.4, 4.3, 4.5, 4.7, 5.0, 5.2, 5.3, 5.1, 5.0, 4.8, 4.7, 4.6, 4.5, 4.4, 4.3, 4.5, 4.6, 4.7, 4.9),
  Economic_Growth = c(3.1, 3.3, 3.0, 2.9, 2.8, 3.1, 2.7, 2.8, 3.2, 3.4, 3.5, 3.7, 3.4, 3.1, 2.8, 2.5, 2.3, 2.1, 2.4, 2.6, 2.9, 3.0, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8, 3.9, 4.0)
)
The next step is to perform multiple linear regression analysis using R Studio. To conduct a multiple linear regression analysis, enter the command below:
# Performing multiple linear regression analysis
model <- lm(Economic_Growth ~ Inflation_Rate + Unemployment_Rate, data = data)
# Viewing the summary of the results
summary(model)
After pressing Enter or clicking "Run," the analysis output will appear as follows:
Call:
lm(formula = Economic_Growth ~ Inflation_Rate + Unemployment_Rate,
data = data)
Residuals:
Min 1Q Median 3Q Max
-0.39920 -0.05473 0.00262 0.08124 0.31354
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.07429 0.50282 14.069 6.00e-14 ***
Inflation_Rate -0.77174 0.06999 -11.026 1.68e-11 ***
Unemployment_Rate -0.32915 0.12634 -2.605 0.0148 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1523 on 27 degrees of freedom
Multiple R-squared: 0.9054, Adjusted R-squared: 0.8984
F-statistic: 129.1 on 2 and 27 DF, p-value: 1.503e-14
Based on the analysis results, there are at least three key values that need to be interpreted, which include the R-squared value, the F-statistic value, and the T-statistic values. The R-squared value of 0.9054 can be interpreted as indicating that 90.54% of the variation in the economic growth variable can be explained by variations in the inflation rate and unemployment rate variables, while the remaining 9.46% is explained by other variables not included in this regression equation.
Next, the F-statistic value of 129.1 with a p-value of 1.503e-14 (p-value < 0.05) indicates that, simultaneously, the inflation rate and unemployment rate significantly affect economic growth.
The coefficient for the inflation rate variable is -0.77174 with a t value of -11.026 and a p-value of 1.68e-11 (p-value < 0.05), which can be interpreted as the inflation rate having a significant negative partial effect on economic growth (assuming the unemployment rate variable is constant).
Similarly, the coefficient for the unemployment rate variable is -0.32915 with a t value of -2.605 and a p-value of 0.0148 (p-value < 0.05), indicating that the unemployment rate has a significant negative partial effect on economic growth (assuming the inflation rate variable is constant).
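As an optional supplement (not part of the original output), confidence intervals for the coefficients can be reported alongside the t tests using base R's confint() on the fitted model:

# 95% confidence intervals for the regression coefficients
confint(model, level = 0.95)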
To ensure that the analysis results and interpretation yield the best linear unbiased estimator, it is necessary to conduct assumption tests, which will be discussed in the next section.
Command for Residual Normality Test in R Studio
The first assumption that needs to be tested is to ensure that the residuals from the multiple linear regression equation in the above case study follow a normal distribution. The residual normality test can be performed using the Shapiro-Wilk test or by inspecting the QQ plot.
In this article, we will use both methods. Type the following command to perform the residual normality test:
# Residual normality test using Shapiro-Wilk
shapiro.test(residuals(model))
# QQ plot of the residuals
qqnorm(residuals(model))
qqline(residuals(model))
The output from the Shapiro-Wilk test for the analysis we conducted is as follows:
Shapiro-Wilk normality test
data: residuals(model)
W = 0.96792, p-value = 0.4839
The analysis results show that the W value is 0.96792 with a p-value of 0.4839 (p-value > 0.05), indicating that the residuals follow a normal distribution. Additionally, we can check the graph from the QQ plot as shown below:
[Figure: Normal QQ plot of the model residuals]
The QQ plot shows that the points closely follow the straight line, from which it can be concluded that the residuals are normally distributed.
Heteroskedasticity Analysis Command in R Studio
The next assumption test we need to perform is to detect the presence of heteroskedasticity. In multiple linear regression for time series data using the OLS method, it is assumed that the residual variance is constant (homoscedasticity). Therefore, we need to ensure that there is no heteroskedasticity in our regression equation.
Heteroskedasticity can be detected using the Breusch-Pagan test in R Studio. First, we need to install the "lmtest" package. Use the command below to test for heteroskedasticity in the regression equation.
# Using the lmtest package for the Breusch-Pagan test
install.packages("lmtest")
library(lmtest)
bptest(model)
The output of the Breusch-Pagan test analysis can be seen as follows:
studentized Breusch-Pagan test
data: model
BP = 11.09, df = 2, p-value = 0.003906
The analysis results show that the Breusch-Pagan statistic is 11.09 with a p-value of 0.003906 (p-value < 0.05), indicating a heteroskedasticity problem: the residual variance is not constant across the range of independent variable values. Further action is needed when a regression equation exhibits heteroskedasticity, such as transforming the variables or using heteroskedasticity-robust standard errors.
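As one possible remedy (a sketch, not part of the original tutorial), heteroskedasticity-robust standard errors can be computed with the sandwich package together with coeftest() from lmtest; the choice of the "HC1" estimator here is an assumption, and other HC variants are available:

# Install and load the sandwich package for robust covariance estimators
install.packages("sandwich")
library(sandwich)
library(lmtest)

# Re-test the coefficients using heteroskedasticity-robust (HC1) standard errors
coeftest(model, vcov = vcovHC(model, type = "HC1"))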
Multicollinearity Analysis Command in R Studio
The next assumption test is to check for multicollinearity in our time series regression equation. In multiple linear regression equations, it is assumed that there is no strong correlation between independent variables. One way to check for multicollinearity is by using the Variance Inflation Factor (VIF).
To obtain the Variance Inflation Factor (VIF) values, we need to install the "car" package in R Studio. Please enter the command below for the multicollinearity test in the regression analysis.
# Using the car package to calculate VIF
install.packages("car")
library(car)
vif(model)
The output VIF values based on the analysis results are as follows:
Inflation_Rate Unemployment_Rate
       1.58248           1.58248
The analysis results show that the VIF value for each independent variable is 1.58248, which is well below the common threshold of 10. Thus, it can be concluded that there is no multicollinearity between the independent variables in the multiple linear regression equation in the above case study using time series data.
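As an optional supplement (not part of the original tutorial), the pairwise correlation between the independent variables can also be inspected directly with base R's cor():

# Correlation matrix of the independent variables
cor(data[, c("Inflation_Rate", "Unemployment_Rate")])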
Linearity Analysis Command in R Studio
The linearity test can be performed by plotting the residuals against the fitted values. Please enter the following command for the linearity test in R Studio.
# Plot residuals vs fitted values
plot(fitted(model), residuals(model))
abline(h = 0, col = "red")
The resulting plot output is as follows:
[Figure: Plot of residuals versus fitted values]
The analysis results show that the points are randomly scattered around the horizontal line, indicating that the multiple linear regression model meets the linearity assumption.
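For a more formal check of the linearity assumption (an optional supplement to the visual inspection, not part of the original tutorial), the Ramsey RESET test from the lmtest package can be used; a p-value above 0.05 would support the linear specification:

# Ramsey RESET test for functional form (lmtest package loaded earlier)
library(lmtest)
resettest(model)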
Autocorrelation Analysis Command in R Studio
The final assumption test that needs to be performed in multiple linear regression analysis for time series data is the autocorrelation test. This test is not necessary for cross-section data. The purpose of the autocorrelation test is to examine whether there is a correlation between residuals at time period t and residuals at time period t-1. One commonly used test for this is the Durbin-Watson test.
Since we have already installed the lmtest package during the heteroskedasticity test, we only need to write the following command in R Studio:
# Load package lmtest
library(lmtest)
# Perform the Durbin-Watson autocorrelation test
dwtest(model)
After entering the command and pressing "Enter", the output of the autocorrelation test will appear as follows:
Durbin-Watson test
data: model
DW = 0.58787, p-value = 5.165e-07
alternative hypothesis: true autocorrelation is greater than 0
For accurate interpretation, we need to refer to the Durbin-Watson table to find the critical values dL and dU for the number of observations and independent variables used. However, a DW value of 0.58787 is very low (far below 2), indicating potential positive autocorrelation: the residuals of one period are positively correlated with the residuals of the previous period. The p-value of 5.165e-07 (p-value < 0.05) confirms that significant positive autocorrelation is present.
Autocorrelation is an undesirable condition in OLS linear regression, so researchers need to consider steps to address it in the multiple linear regression equation, for example by re-estimating the model with autocorrelation-consistent standard errors or a feasible GLS procedure such as Cochrane-Orcutt.
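As one possible remedy (a sketch, not part of the original tutorial, and assuming HAC standard errors suit your research design), Newey-West heteroskedasticity- and autocorrelation-consistent standard errors can be computed with the sandwich package installed earlier:

# Newey-West HAC standard errors for autocorrelation-robust inference
library(sandwich)
library(lmtest)
coeftest(model, vcov = NeweyWest(model))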
Conclusion
Conducting multiple linear regression analysis in R Studio involves several steps, from estimating the model to testing its assumptions. The validity of the results depends on assumption tests such as the residual normality, heteroskedasticity, multicollinearity, linearity, and autocorrelation tests.
This is the article that Kanda Data can write and share with you on this occasion. Hopefully, it is useful and provides solutions for those conducting multiple linear regression analysis on time series data using R Studio. Happy learning!