Reasons Why the R-Squared Value in Time Series Data Is Higher Than in Cross-Section Data
If you’re doing regression analysis, R-squared is one of the most important metrics you need to understand. R-squared shows how much of the variation in the dependent variable can be explained by the variation in the independent variables in a regression model.
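For reference, the standard formula compares the unexplained (residual) variation with the total variation in the dependent variable:

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

The smaller the residuals are relative to the total variation, the closer R-squared gets to 1.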
If you often read research articles published in international or national journals, you might notice something interesting: most regression models that use time series data tend to have higher R-squared values compared to those using cross-section data.
Why does this happen? Does a high R-squared truly reflect a good regression model? And if your R-squared is low, can the model still be accepted and used?
In this article, Kanda Data will discuss this from several perspectives. By the end, I hope you’ll be able to draw your own conclusion about why R-squared values in time series data tend to be higher than those in cross-section data.
Time Series Data Has a More Consistent and Patterned Structure
Let me start by saying that time series data tends to have a more consistent and patterned structure. If we look again at the characteristics of time series data, we’ll notice that it is measured from a single subject/unit over several time periods.
For example, think of Indonesia’s per capita consumption data from 2000 to 2025: a single series of yearly observations collected over that 26-year span.
From this, we can see that time series data often contains long-term trends; the series may rise or fall over the observation period. It may also contain cycles or seasonal patterns. Time series data is also prone to autocorrelation, where the value in the current period is influenced by the value from the previous period.
These characteristics make time series data more structured, forming repeated patterns. These repeating patterns are one of the reasons why time series data tends to produce higher R-squared values than cross-section data.
For example, consider monthly rice consumption, which generally increases along with population growth. If we put this variable into a regression model along with income and rice production, such consistent patterns will likely produce a higher R-squared.
Now let’s compare this to cross-section data. In cross-section data, the observations are many different entities measured at a single point in time. For instance, when conducting a survey for a thesis, you might collect household income data from 200 respondents; that is cross-section data.
Of course, household incomes vary widely depending on each respondent’s socioeconomic conditions. This large variation makes it harder for the regression model to capture consistent patterns. As a result, R-squared values tend to be lower.
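To make this contrast concrete, here is a minimal simulation sketch in Python (using numpy and statsmodels; the income and consumption figures are made up for illustration, not real survey data). A trending yearly series and a heterogeneous cross-section of households are fitted with the same simple regression:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# --- Time series: 26 yearly observations sharing an upward trend ---
years = np.arange(2000, 2026)
income = 100 + 3 * (years - 2000) + rng.normal(0, 2, years.size)
consumption = 20 + 0.8 * income + rng.normal(0, 3, years.size)
r2_ts = sm.OLS(consumption, sm.add_constant(income)).fit().rsquared

# --- Cross-section: 200 households with large unexplained variation ---
hh_income = rng.lognormal(mean=8, sigma=0.8, size=200)
hh_consumption = 20 + 0.8 * hh_income + rng.normal(0, 5000, size=200)
r2_cs = sm.OLS(hh_consumption, sm.add_constant(hh_income)).fit().rsquared

print(f"Time series R-squared:   {r2_ts:.3f}")  # typically very high
print(f"Cross-section R-squared: {r2_cs:.3f}")  # typically much lower
```

The exact numbers depend on the random seed, but the pattern is stable: the shared trend lets the time series model explain most of the variance, while household-level noise keeps the cross-section R-squared low.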
So from this first explanation, we can already understand the differences between the characteristics of time series and cross-section data. Now let’s look at it from other perspectives.
The Influence of Autocorrelation and Non-Stationarity
In time series data, autocorrelation and non-stationarity can significantly affect the size of the R-squared. This is why when using time series data, we need to run autocorrelation tests and stationarity tests.
Autocorrelation occurs when there’s a relationship between the errors/residuals at the current time (t) and those from previous periods (t-1, t-2, etc.). In time series data, the probability of autocorrelation is quite high because time-based observations are generally dependent.
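As a quick illustration, the Durbin-Watson statistic is a common first check for first-order autocorrelation in the residuals. Here is a small sketch with simulated data (statsmodels provides durbin_watson; a value near 2 suggests no autocorrelation, while values well below 2 point to positive autocorrelation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(7)

# Simulate a series whose errors follow an AR(1) process, so each
# period's error carries over part of the previous period's error
n = 100
x = np.linspace(0, 10, n)
errors = np.zeros(n)
for t in range(1, n):
    errors[t] = 0.8 * errors[t - 1] + rng.normal(0, 1)
y = 2 + 3 * x + errors

residuals = sm.OLS(y, sm.add_constant(x)).fit().resid
print(f"Durbin-Watson: {durbin_watson(residuals):.2f}")  # well below 2 here
```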
Then what does non-stationary data look like? A non-stationary series has statistical properties, such as its mean or variance, that change over time. A typical example is GDP data that consistently increases every year.
Non-stationary data can lead to spurious regression, where the R-squared is high and variable X can even appear highly significant despite having no real relationship with Y at all. This is what we consider a biased, misleading result.
GDP increases every year, and household consumption also increases every year. A regression of one on the other will likely produce an R-squared above 0.90 simply because both variables share a trend, so the relationship must be verified using proper time series methods such as stationarity tests, ECM/VECM, ARDL, and others.
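The classic way to see spurious regression in action is to regress one random walk on another that is completely independent of it. A minimal sketch (both series are simulated, so by construction there is no real relationship):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Two independent random walks: non-stationary by construction,
# with no true relationship between them
n = 200
y = np.cumsum(rng.normal(size=n))
x = np.cumsum(rng.normal(size=n))

result = sm.OLS(y, sm.add_constant(x)).fit()
print(f"R-squared:     {result.rsquared:.3f}")   # often deceptively high
print(f"p-value for x: {result.pvalues[1]:.4f}") # often looks 'significant'

# First-differencing both series (a basic route to stationarity)
# makes the illusion disappear
diffed = sm.OLS(np.diff(y), sm.add_constant(np.diff(x))).fit()
print(f"R-squared after differencing: {diffed.rsquared:.3f}")  # near zero
```

Run this a few times with different seeds and you will regularly see high R-squared values and tiny p-values from two series that have nothing to do with each other; after differencing, the apparent relationship collapses.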
On the other hand, in cross-section data there is no time-related dependency: the error term from one respondent does not affect other respondents. This explains why, even when the model is correctly specified, the variables make sense theoretically, and the method used is appropriate, cross-section data still tends to yield lower R-squared values than time series data.
The Amount of Variation Explained by the Model Is Higher in Time Series Data
Conceptually, if we go back to the formula, R-squared essentially compares the variance explained by the model with the total variance in the data. In time series data, the change from one year to the next is relatively small compared to the long time span, and the data often has identifiable patterns.
If the model can capture these patterns (trend, cycle, or seasonality) then a large portion of the variance can be explained well. For variables like GDP, stock prices, or rainfall, it is common to see high R-squared values simply because the data naturally follows time-related patterns.
On the other hand, a low R-squared in cross-section data does not mean the model is bad. It can actually provide valuable insight, for example, that the data is highly heterogeneous, the independent variables cannot explain all differences between respondents, or human behavior is very diverse. This is why regression models that involve human behavior often have very low R-squared values.
Conclusion
In general, I can summarize that there are three main reasons why the R-squared value in time series data tends to be higher than in cross-section data. You can see the main points in each subheading of this article.
What I want to emphasize at the end of this article is that R-squared is not the only indicator of model quality. A good model still needs to pass various tests: stationarity and autocorrelation tests (for time series), heteroskedasticity tests, residual normality, multicollinearity checks, and linearity tests. Choosing the right model specification is also crucial.
By understanding the characteristics of both types of data, I hope you can be wiser in evaluating model quality and avoid common misinterpretations in regression analysis. That’s all for now, stay tuned for more articles from Kanda Data. Have a great day!