How to Perform Multiple Linear Regression Analysis Using R Studio: A Complete Guide

Multiple linear regression analysis in R Studio is carried out by running a short series of commands. Because it is important to understand both how to run the analysis and how to interpret its output, Kanda Data has written this article on the topic.

As we all know, multiple linear regression is a statistical method used to analyze the relationship between a dependent variable and two or more independent variables. With multiple linear regression, we can analyze how two or more independent variables influence the dependent variable.

Additionally, we should understand that multiple linear regression typically uses the Ordinary Least Squares (OLS) method. Therefore, it is necessary to test the assumptions that OLS linear regression requires.

In this article, Kanda Data will discuss how to perform multiple linear regression analysis using R Studio, including some diagnostic tests or OLS assumption tests to ensure the regression model analyzed is scientifically valid and reliable.

Basic Theory of Multiple Linear Regression

Before we delve into the technical aspects of multiple linear regression analysis and assumption tests, it is important to first understand the basic theory. Multiple linear regression is used when more than one independent variable influences a dependent variable.

When using multiple linear regression analysis, the first step is to establish the regression equation specification. To make it easier to understand the implementation of this basic theory, we will create a case study example.

A researcher observes that, based on theory and practical experience, advertising expenditure and the number of marketing staff are determinants of product sales. Therefore, the researcher conducts observations on companies in city XYZ. From the total number of companies in XYZ, the researcher selects a sample of 15 companies.

Based on this case study, the first step is to establish the multiple linear regression equation. The regression equation based on this case study can be formulated as follows:

𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + 𝜖

Where:

𝑌 is the product sales (in thousands of units) as the dependent variable

𝑋1 is the advertising cost (in hundreds of US dollars) as the 1st independent variable,

𝑋2 is the number of marketing staff (employees) as the 2nd independent variable,

𝛽0 is the intercept (constant),

𝛽1 and 𝛽2 are the regression coefficients indicating the change in 𝑌 for a one-unit change in 𝑋1 and 𝑋2 respectively, with the other variable held constant,

𝜖 is the error term (residual).

Once the regression equation is specified, the next step is to tabulate the data collected from the 15 sampled companies. The research data input according to the specified multiple linear regression equation is shown in the table below:

Sales (Y)    Advertising_Cost (X1)    Marketing_Staff (X2)
180          33                       40
360          36                       100
570          42.9                     110
750          45                       120
900          45.6                     150
1080         48.6                     180
1140         54                       180
1290         54.6                     190
1320         54                       180
1470         52.5                     190
1500         58.5                     250
1560         58.8                     270
1620         54.9                     290
1230         57                       250
1050         45                       240

Multiple Linear Regression Commands in R Studio and Interpretation of Results

Now that we have the data for this practice analysis, download and install R and R Studio on your laptop if you have not already done so. Once you have confirmed that both are installed correctly, the next step is to perform the multiple linear regression analysis.

After opening R Studio, the next step is to input the data for analysis into R Studio. There are two ways to do this: you can either import the data directly from Excel or input it manually using commands in R Studio.

In this article, I will demonstrate the second method of inputting data manually. For instructions on importing data from Excel, please refer to my previous article, which you can find on this website.

Copy the values of each variable from Excel, paste them inside the c() function for that variable, and separate them with commas (,). Then write the following command:

# Input data
data <- data.frame(
  Advertising_Cost = c(33, 36, 42.9, 45, 45.6, 48.6, 54, 54.6, 54, 52.5, 58.5, 58.8, 54.9, 57, 45),
  Marketing_Staff = c(40, 100, 110, 120, 150, 180, 180, 190, 180, 190, 250, 270, 290, 250, 240),
  Sales = c(180, 360, 570, 750, 900, 1080, 1140, 1290, 1320, 1470, 1500, 1560, 1620, 1230, 1050)
)
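Before fitting the model, it can help to confirm that the data frame was typed in correctly. The sketch below is an optional check, not part of the original workflow: it rebuilds the same data frame and uses base R functions to verify the number of rows, the column types, and the absence of missing values.

```r
# Same 15 observations as in the table above
data <- data.frame(
  Advertising_Cost = c(33, 36, 42.9, 45, 45.6, 48.6, 54, 54.6, 54, 52.5, 58.5, 58.8, 54.9, 57, 45),
  Marketing_Staff = c(40, 100, 110, 120, 150, 180, 180, 190, 180, 190, 250, 270, 290, 250, 240),
  Sales = c(180, 360, 570, 750, 900, 1080, 1140, 1290, 1320, 1470, 1500, 1560, 1620, 1230, 1050)
)

# Show the structure: three numeric columns with 15 values each are expected
str(data)

# Stop with an error if a value was dropped or a column was read as text
stopifnot(nrow(data) == 15, all(sapply(data, is.numeric)))

# Count missing values per column (all counts should be zero)
print(colSums(is.na(data)))

# Descriptive statistics per variable, useful for spotting typing mistakes
print(summary(data))
```

If a vector has a different length from the others, data.frame() itself stops with an error, which is usually the first sign of a missing comma or a skipped value.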

The next step is to perform a multiple linear regression analysis using R Studio. Then, we can write the summary command to review the results of the analysis we have conducted in R Studio. To carry out the multiple linear regression analysis, please use the following command:

# Perform multiple linear regression
model <- lm(Sales ~ Advertising_Cost + Marketing_Staff, data = data)

# View summary of results
summary(model)

Upon running the command, the output will display as follows:

Call:

lm(formula = Sales ~ Advertising_Cost + Marketing_Staff, data = data)

Residuals:

    Min      1Q  Median      3Q     Max

-260.13  -46.99   -1.20   41.60  276.11

Coefficients:

                   Estimate Std. Error t value Pr(>|t|)  

(Intercept)      -1068.7074   275.0397  -3.886  0.00217 **

Advertising_Cost    34.6276     8.1262   4.261  0.00111 **

Marketing_Staff      2.3403     0.9185   2.548  0.02556 *

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 127 on 12 degrees of freedom

Multiple R-squared:  0.9285,  Adjusted R-squared:  0.9166

F-statistic:  77.9 on 2 and 12 DF,  p-value: 1.337e-07

Based on the results of the analysis, at least three values need to be interpreted, including the R-squared value, the F-statistic, and the T-statistic. The analysis shows an R-squared value of 0.9285, which can be interpreted as 92.85% of the variation in the sales variable being explained by the variation in the advertising cost and marketing staff variables, with the remaining 7.15% being explained by other variables not included in this regression equation.

Next, the F-statistic of 77.9 with a p-value of 1.337e-07 (p-value < 0.05) indicates that, simultaneously, advertising cost and marketing staff significantly affect product sales.

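The t-statistic for the Advertising Cost variable is 4.261 (coefficient estimate 34.6276) with a p-value of 0.00111 (p-value < 0.05), indicating that advertising cost has a significant partial effect on product sales (assuming the marketing staff variable is held constant). Likewise, the t-statistic for the Marketing Staff variable is 2.548 (coefficient estimate 2.3403) with a p-value of 0.02556 (p-value < 0.05), indicating that the number of marketing staff also has a significant partial effect on product sales (assuming advertising cost is held constant).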
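The values interpreted above can also be pulled out of the fitted model directly rather than read off the printed summary. The following is a minimal sketch using base R only. The company with an advertising cost of 50 and 200 marketing staff is a hypothetical example introduced here for illustration and is not part of the case study data.

```r
# Rebuild the case study data and refit the model so the example is self-contained
data <- data.frame(
  Advertising_Cost = c(33, 36, 42.9, 45, 45.6, 48.6, 54, 54.6, 54, 52.5, 58.5, 58.8, 54.9, 57, 45),
  Marketing_Staff = c(40, 100, 110, 120, 150, 180, 180, 190, 180, 190, 250, 270, 290, 250, 240),
  Sales = c(180, 360, 570, 750, 900, 1080, 1140, 1290, 1320, 1470, 1500, 1560, 1620, 1230, 1050)
)
model <- lm(Sales ~ Advertising_Cost + Marketing_Staff, data = data)

# Extract the coefficient table (estimates, std. errors, t values, p-values)
model_summary <- summary(model)
coefficient_table <- coef(model_summary)
print(coefficient_table)

# Extract R-squared as a plain number
r_squared <- model_summary$r.squared
print(r_squared)

# 95% confidence intervals for the intercept and both slopes
print(confint(model, level = 0.95))

# Hypothetical company (illustrative values only): advertising cost of 50
# (hundreds of US dollars) and 200 marketing staff
new_company <- data.frame(Advertising_Cost = 50, Marketing_Staff = 200)
predicted_sales <- predict(model, newdata = new_company)
print(predicted_sales)

# The same prediction computed by hand from the fitted equation
b <- coef(model)
manual_prediction <- b["(Intercept)"] + b["Advertising_Cost"] * 50 + b["Marketing_Staff"] * 200
print(manual_prediction)
```

The hand calculation simply substitutes the two values into the estimated equation, so it should agree with predict() up to rounding.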

To ensure that the analysis and interpretation result in the best linear unbiased estimator, we need to conduct the necessary assumption tests, which will be discussed in the next section.

Command for Residual Normality Analysis in R Studio and Interpretation of Results

The first assumption we need to test is whether the residuals of the multiple linear regression equation from the case study follow a normal distribution. The residual normality test can be performed using the Shapiro-Wilk test or by observing a QQ plot.

In this article, we will use both methods. Please type the following command to perform the residual normality test:

# Residual normality test using Shapiro-Wilk
shapiro.test(residuals(model))

# QQ Plot
qqnorm(residuals(model))
qqline(residuals(model))

The output from the Shapiro-Wilk test is as follows:

Shapiro-Wilk normality test

data:  residuals(model)
W = 0.94428, p-value = 0.4393

The analysis shows that the W value is 0.94428 with a p-value of 0.4393 (p-value > 0.05), so we fail to reject the null hypothesis of normality and can treat the residuals as normally distributed. The QQ Plot supports this conclusion: the points follow the straight reference line.
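The decision rule described above can also be written out in code. This is a small sketch under the assumption of a 0.05 significance level, the same threshold used in the text. It reads the statistic and the p-value from the object returned by shapiro.test() instead of from the printed output.

```r
# Rebuild the case study data and refit the model so the example is self-contained
data <- data.frame(
  Advertising_Cost = c(33, 36, 42.9, 45, 45.6, 48.6, 54, 54.6, 54, 52.5, 58.5, 58.8, 54.9, 57, 45),
  Marketing_Staff = c(40, 100, 110, 120, 150, 180, 180, 190, 180, 190, 250, 270, 290, 250, 240),
  Sales = c(180, 360, 570, 750, 900, 1080, 1140, 1290, 1320, 1470, 1500, 1560, 1620, 1230, 1050)
)
model <- lm(Sales ~ Advertising_Cost + Marketing_Staff, data = data)

# Run the Shapiro-Wilk test on the residuals and keep the result object
residual_values <- residuals(model)
normality_test <- shapiro.test(residual_values)

# The statistic and p-value are stored as elements of the result
w_statistic <- unname(normality_test$statistic)
p_value <- normality_test$p.value

# Significance level assumed here, matching the 0.05 used in the text
alpha <- 0.05
if (p_value > alpha) {
  message("Fail to reject normality of the residuals (p = ", round(p_value, 4), ")")
} else {
  message("Residuals deviate from normality (p = ", round(p_value, 4), ")")
}
```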

Command for Non-Heteroskedasticity Analysis in R Studio and Interpretation of Results

The next assumption test is to detect the presence of heteroskedasticity. In multiple linear regression using the OLS method, it is assumed that the variance of the residuals is constant (homoskedasticity). Therefore, we need to ensure that there is no heteroskedasticity in our regression equation.

Heteroskedasticity detection can be tested using the Breusch-Pagan test in R Studio. First, we need to install the ‘lmtest’ package. Use the following command for the non-heteroskedasticity test in the regression equation:

# Using the lmtest package for the Breusch-Pagan test
install.packages("lmtest")
library(lmtest)
bptest(model)

The output of the Breusch-Pagan test is as follows:

studentized Breusch-Pagan test

data:  model
BP = 1.3924, df = 2, p-value = 0.4985

The analysis shows that the Breusch-Pagan value is 1.3924 with a p-value of 0.4985 (p-value > 0.05), so there is no evidence of heteroskedasticity. In other words, the residual variance can be treated as constant across the range of the independent variables.
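To see what the test is doing, the studentized Breusch-Pagan statistic can be reproduced by hand with base R. The sketch below assumes the usual studentized (Koenker) form of the test: regress the squared residuals on the same independent variables, multiply the R-squared of that auxiliary regression by the sample size, and compare the result with a chi-squared distribution whose degrees of freedom equal the number of independent variables. It is meant as an illustration of the logic, not as a replacement for bptest().

```r
# Rebuild the case study data and refit the model so the example is self-contained
data <- data.frame(
  Advertising_Cost = c(33, 36, 42.9, 45, 45.6, 48.6, 54, 54.6, 54, 52.5, 58.5, 58.8, 54.9, 57, 45),
  Marketing_Staff = c(40, 100, 110, 120, 150, 180, 180, 190, 180, 190, 250, 270, 290, 250, 240),
  Sales = c(180, 360, 570, 750, 900, 1080, 1140, 1290, 1320, 1470, 1500, 1560, 1620, 1230, 1050)
)
model <- lm(Sales ~ Advertising_Cost + Marketing_Staff, data = data)

# Auxiliary regression: squared residuals on the same independent variables
auxiliary_data <- data
auxiliary_data$Squared_Residuals <- residuals(model)^2
auxiliary_model <- lm(Squared_Residuals ~ Advertising_Cost + Marketing_Staff,
                      data = auxiliary_data)

# Test statistic: sample size times the R-squared of the auxiliary regression
n_obs <- nrow(data)
bp_statistic <- n_obs * summary(auxiliary_model)$r.squared

# Degrees of freedom: number of independent variables (intercept excluded)
bp_df <- length(coef(auxiliary_model)) - 1

# Upper-tail chi-squared probability gives the p-value
bp_p_value <- pchisq(bp_statistic, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_statistic, df = bp_df, p_value = bp_p_value))
```

A large statistic (small p-value) would mean the squared residuals vary systematically with the independent variables, which is the pattern heteroskedasticity produces.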

Command for Multicollinearity Analysis in R Studio and Interpretation of Results

The next assumption test is to check for multicollinearity in the regression equation from the case study. In multiple linear regression, it is assumed that there is no strong correlation between the independent variables. One way to check this is by using the Variance Inflation Factor (VIF).

To obtain the VIF values in R Studio, we need to install the ‘car’ package. Use the following command for the multicollinearity test in regression:

# Using the car package to calculate VIF
install.packages("car")
library(car)
vif(model)

The VIF values based on the analysis output are:

Advertising_Cost  Marketing_Staff

         3.61358          3.61358

The analysis shows that the VIF for both advertising cost and marketing staff is 3.61358 (with only two independent variables, the two VIF values are always identical). This is less than 10, the threshold used in this article, indicating no serious multicollinearity between the independent variables in the multiple linear regression equation in the case study.
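The VIF can also be computed by hand, which makes its meaning clearer. For each independent variable, regress it on the other independent variables and take 1 / (1 - R-squared) of that regression. The following sketch uses base R only. With two independent variables, both values reduce to 1 / (1 - r^2), where r is the correlation between them.

```r
# Rebuild the case study data so the example is self-contained
data <- data.frame(
  Advertising_Cost = c(33, 36, 42.9, 45, 45.6, 48.6, 54, 54.6, 54, 52.5, 58.5, 58.8, 54.9, 57, 45),
  Marketing_Staff = c(40, 100, 110, 120, 150, 180, 180, 190, 180, 190, 250, 270, 290, 250, 240),
  Sales = c(180, 360, 570, 750, 900, 1080, 1140, 1290, 1320, 1470, 1500, 1560, 1620, 1230, 1050)
)

# R-squared from regressing each independent variable on the other one
r2_advertising <- summary(lm(Advertising_Cost ~ Marketing_Staff, data = data))$r.squared
r2_staff <- summary(lm(Marketing_Staff ~ Advertising_Cost, data = data))$r.squared

# VIF = 1 / (1 - R-squared) for each independent variable
vif_manual <- c(
  Advertising_Cost = 1 / (1 - r2_advertising),
  Marketing_Staff = 1 / (1 - r2_staff)
)
print(vif_manual)

# Cross-check through the correlation between the two independent variables
predictor_correlation <- cor(data$Advertising_Cost, data$Marketing_Staff)
vif_from_correlation <- 1 / (1 - predictor_correlation^2)
print(vif_from_correlation)
```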

Command for Linearity Analysis in R Studio and Interpretation of Results

The final assumption test we will perform in this article is the linearity test. Linearity can be tested by plotting the residuals against the predicted values. Please write the following command for the linearity test in R Studio:

# Plot residuals vs fitted values
plot(fitted(model), residuals(model))
abline(h = 0, col = "red")

The output plot generated is as follows:

The resulting plot shows that the points are scattered randomly around the horizontal line, indicating that the multiple linear regression model meets the assumption of linearity.
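Judging randomness by eye is difficult with only 15 points, so it can help to overlay a smoothed trend line on the same plot. The sketch below is an optional extension using the base R lowess() function. The axis labels, title, and colors are illustrative choices. A smoothed curve that stays roughly flat around zero is consistent with linearity, while a clear bend suggests a nonlinear relationship.

```r
# Rebuild the case study data and refit the model so the example is self-contained
data <- data.frame(
  Advertising_Cost = c(33, 36, 42.9, 45, 45.6, 48.6, 54, 54.6, 54, 52.5, 58.5, 58.8, 54.9, 57, 45),
  Marketing_Staff = c(40, 100, 110, 120, 150, 180, 180, 190, 180, 190, 250, 270, 290, 250, 240),
  Sales = c(180, 360, 570, 750, 900, 1080, 1140, 1290, 1320, 1470, 1500, 1560, 1620, 1230, 1050)
)
model <- lm(Sales ~ Advertising_Cost + Marketing_Staff, data = data)

fitted_values <- fitted(model)
residual_values <- residuals(model)

# Residuals vs fitted values with a horizontal reference line at zero
plot(fitted_values, residual_values,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted values")
abline(h = 0, col = "red")

# Overlay a lowess smoother to make any systematic curvature easier to see
smooth_curve <- lowess(fitted_values, residual_values)
lines(smooth_curve, col = "blue", lty = 2)
```

Base R also offers plot(model, which = 1), which draws a similar residuals vs fitted plot with a smoother in a single call.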

Conclusion

Conducting multiple linear regression in R Studio involves several steps, from analyzing the model to testing assumptions. Valid results depend on assumption tests, such as residual normality, heteroskedasticity, multicollinearity, and linearity.

This article aims to provide new insights into how to perform multiple linear regression analysis using R Studio. That concludes this article by Kanda Data. We hope it is beneficial for you. Stay tuned for the next Kanda Data article update next week.