Introduction
Linear Regression is one of the most common methods in statistics. It is used to predict and make inferences about a continuous outcome. In this tutorial we will learn how to do the following:
1. Conduct simple linear regression in R.
2. Interpret regression output.
3. Generate confidence intervals for regression parameters.
4. Compute predicted values.
5. Plot the predicted values.
Throughout this process, I will demonstrate 1) how to access the most important output in a quick-and-dirty manner and 2) how to format it appropriately.
Data Overview
During this tutorial, I will be illustrating how to conduct simple linear regression with simulated undergraduate educational data called `student`. Rows correspond to students, while columns correspond to overall grades in Intro to Psychology, Abnormal Psychology, and Statistics for Psychologists. We will be trying to predict a student's grade in statistics with their grade in Intro to Psychology.
id | Psych101 | Abnormal | Statistics |
---|---|---|---|
1 | 78.20424 | 72.90162 | 65.39501 |
2 | 75.22972 | 64.10788 | 70.95324 |
3 | 77.24524 | 64.95519 | 56.25551 |
4 | 73.74729 | 68.36619 | 50.70919 |
5 | 65.71232 | 88.81546 | 65.56294 |
6 | 68.73337 | 58.99921 | 32.08966 |
Using lm()
`lm()` is the function used to conduct linear regression in R. For a detailed help file, you can type `?lm` in your console in RStudio. For now, we will concern ourselves with two primary arguments: `formula =` and `data =`. The `formula` argument takes an R formula, which is composed of two parts. The first is your dependent variable, in our case `Statistics`. The other portion is the independent variable(s), in our case `Psych101`. These two parts are separated by a tilde (`~`). Taken together, the formula defines the regression line. The `data` argument takes the data frame that stores our observed data.
Let’s give it a try with the `student` data and predict the undergrads’ performance in statistics from their performance in Intro to Psychology.
lm(formula = Statistics~Psych101, data = student)
##
## Call:
## lm(formula = Statistics ~ Psych101, data = student)
##
## Coefficients:
## (Intercept) Psych101
## 26.4104 0.5053
Wait a minute…
You may be wondering, “Where the heck did the intercept come from? I didn’t include it in my formula!” Intercepts are so commonplace that R assumes you want to include one in the model. You can make the intercept explicit by including a 1 on the right-hand side of the tilde (i.e., `Statistics ~ 1 + Psych101`). If you are absolutely certain that the intercept is 0, you can tell R to suppress the intercept entirely by including a 0 in the formula (i.e., `Statistics ~ 0 + Psych101`). This is rarely done, because we would be positing that the regression line goes directly through the origin. That is a strong assertion! We will leave the formula as is for now.
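To see these formula variants side by side, here is a minimal sketch using simulated stand-in data (the `student` data frame is not reproduced here, so `toy` is a made-up example):

```r
# Toy data standing in for the student data frame
set.seed(1)
toy <- data.frame(x = rnorm(100, mean = 70, sd = 5))
toy$y <- 25 + 0.5 * toy$x + rnorm(100, sd = 10)

coef(lm(y ~ x,     data = toy))  # implicit intercept
coef(lm(y ~ 1 + x, data = toy))  # explicit intercept; identical fit
coef(lm(y ~ 0 + x, data = toy))  # intercept suppressed; slope only
```

The first two calls return identical coefficients; the third returns only a slope.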
You may also notice that `lm()` didn’t produce very much information! It just reproduced our call (the code we just ran) and the regression coefficients. While that is nice, we want more! To get more information, we can save the output of `lm()` to an object. The object created by `lm()` can be indexed using square brackets or the `$` operator, just like any other object in R. This allows us to access a ton of useful information. If we call `summary()` on the object, a bunch of additional output is produced!
I will make the intercept explicit in my formula below and also save the output to an object (`simple_reg`) so we can begin to explore its contents.
# Fitting a linear regression model
simple_reg <- lm(formula = Statistics ~ 1 + Psych101, data = student)
# Summarizing the simple regression model
summary(simple_reg)
##
## Call:
## lm(formula = Statistics ~ 1 + Psych101, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.703 -11.963 1.259 11.701 34.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.4104 8.6319 3.060 0.0026 **
## Psych101 0.5053 0.1206 4.188 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.31 on 158 degrees of freedom
## Multiple R-squared: 0.09992, Adjusted R-squared: 0.09422
## F-statistic: 17.54 on 1 and 158 DF, p-value: 4.662e-05
TLDR: `lm()` assumes that you want to include an intercept in your regression equation. `summary()` produces more information about the regression when used on an object created by `lm()`.
What is all this output!?
Before, when we called `lm()`, the output only contained the regression coefficients and the call. Now, with `summary()`, there is a TON of information to decipher. Let’s print the output again and tie some brief descriptions to what we see.
##
## Call:
## lm(formula = Statistics ~ 1 + Psych101, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.703 -11.963 1.259 11.701 34.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.4104 8.6319 3.060 0.0026 **
## Psych101 0.5053 0.1206 4.188 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.31 on 158 degrees of freedom
## Multiple R-squared: 0.09992, Adjusted R-squared: 0.09422
## F-statistic: 17.54 on 1 and 158 DF, p-value: 4.662e-05
Output | Coefficient column? | Description |
---|---|---|
Residuals | | Quartiles of the residuals. Should be approximately symmetric and centered on zero. |
Coefficients | | A table of statistics related to the regression coefficients. |
Estimate | Yes | The estimate of the regression coefficient. |
Std. Error | Yes | The standard error of the regression coefficient. |
t value | Yes | The test statistic testing the null \(H_0: b_i = 0\). |
Pr(>\|t\|) | Yes | The two-tailed probability of the observed test statistic given the null. |
Residual Standard Error | | The expected deviation of observed scores from the regression line. |
Multiple R-squared | | The proportion of variance in the outcome accounted for by all predictors in the model. |
F-statistic | | An F-test of the null that \(H_0: \frac{MSR}{MSE} = 1\). |
P-Value | | The probability of the observed F given the null. |
While many of these statistics are insightful, let's keep things simple. For now, let’s stick to interpreting the regression coefficients. Below is an example of how to interpret them.
We would expect a student who scored a 0 in Psych 101 to score approximately 26.41 in statistics. However, for every 1 point increase in Psych 101, we would expect students to score 0.51 points higher in statistics.
Is the Intercept Meaningful?
So, as the data is now, the intercept value is not very informative. While some students struggled, no student got anywhere near a score of 0. What can we do to make the intercept more meaningful? Well, we could shift the predictor so that 0 equals some meaningful value (e.g., the mean). This is done by subtracting a value from the independent variable (\(X - \bar x\)). This is what people mean when they say they have “centered” their independent variables. I’ll go ahead and illustrate how this is done in `dplyr`.
student<-student%>% # create an object "student" from another object called "student"
mutate(Psych101_c = Psych101 - mean(Psych101, na.rm = TRUE)) # Create centered predictor Psych101_c
id | Psych101 | Abnormal | Statistics | Psych101_c |
---|---|---|---|---|
1 | 78.20424 | 72.90162 | 65.39501 | 7.366426 |
2 | 75.22972 | 64.10788 | 70.95324 | 4.391905 |
3 | 77.24524 | 64.95519 | 56.25551 | 6.407432 |
4 | 73.74729 | 68.36619 | 50.70919 | 2.909481 |
5 | 65.71232 | 88.81546 | 65.56294 | -5.125489 |
6 | 68.73337 | 58.99921 | 32.08966 | -2.104443 |
# Fitting a linear regression model
simple_reg<-lm(formula = Statistics~1+Psych101_c, data = student)
# Summarizing the simple regression model
summary(simple_reg)
##
## Call:
## lm(formula = Statistics ~ 1 + Psych101_c, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.703 -11.963 1.259 11.701 34.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.2037 1.2107 51.377 < 2e-16 ***
## Psych101_c 0.5053 0.1206 4.188 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.31 on 158 degrees of freedom
## Multiple R-squared: 0.09992, Adjusted R-squared: 0.09422
## F-statistic: 17.54 on 1 and 158 DF, p-value: 4.662e-05
How does centering change our interpretation?
We would expect a student who achieved an average score in Psych 101 to score approximately 62.2 in statistics.
However, for every 1 point increase in Psych 101, we would expect students to score 0.51 points higher in statistics.
Making Your Reader’s Job Easy
So far you have made a predictive model, made inferences about the strength of association between independent variables and dependent variables, and also ensured all of your parameters are meaningful. Great work! Part of conducting these analyses is also communicating them to readers, stakeholders, and colleagues. It is your job as the analyst to make this as easy as possible.
One way to ease your readers’ effort is to report standardized regression coefficients (\(\beta\)) as well as unstandardized coefficients (\(b\)). Reporting standardized coefficients allows your readers to more easily compare the strength of association between variables on different scales.
Method 1: Standardizing Variables Before Modeling
You can calculate standardized coefficients in a couple of different ways. I will demonstrate two. The first is to simply standardize your predictor and outcome (i.e., convert them to z-scores) prior to including them in the linear model. This can be done in a simple data-management step using `dplyr`.
student<-student%>% # create an object "student" from another object called "student"
mutate(Psych101_s = Psych101_c/sd(Psych101_c, na.rm = TRUE), # Create a new standardized IV/DV Psych101_s/Statistics_s
Statistics_s = (Statistics - mean(Statistics, na.rm = TRUE))/sd(Statistics, na.rm = TRUE))
s_simple_reg<-lm(Statistics_s ~ Psych101_s, data = student)
summary(s_simple_reg)
##
## Call:
## lm(formula = Statistics_s ~ Psych101_s, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.34303 -0.74346 0.07824 0.72713 2.14546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.791e-16 7.524e-02 0.000 1
## Psych101_s 3.161e-01 7.548e-02 4.188 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9517 on 158 degrees of freedom
## Multiple R-squared: 0.09992, Adjusted R-squared: 0.09422
## F-statistic: 17.54 on 1 and 158 DF, p-value: 4.662e-05
Method 2: Using a Conversion Factor after Modeling
Another way to approach standardization is to use the model we fit previously and convert the regression coefficients with the following formula.
\[\beta = b\times\frac{s_x}{s_y}\]
This rescales the regression coefficients so that they are in standard deviation units. To extract only the regression coefficients from an object created by `lm()`, we can use `coef()`.
# Calculating standard deviations for IV and DV
sd_x<-sd(student$Psych101, na.rm = TRUE)
sd_y<-sd(student$Statistics, na.rm = TRUE)
# Extracting the coefficients
my_coef<-coef(simple_reg)
# Pulling out the slope
my_slope<-my_coef[2]
# Converting the unstandardized slope to a standardized slope
my_slope*(sd_x/sd_y)
## Psych101_c
## 0.3161004
TLDR: You can create standardized regression coefficients by either standardizing all variables before modeling or multiplying the coefficients by a conversion factor.
A Fun Conceptual Proof of the Standardized Regression Weight Formula
The standardized regression weight is calculated using the formula below. \[\beta = b\times\frac{s_x}{s_y}\]
Slope can be conceptualized as rise over run (i.e., \(\frac{\Delta y}{\Delta x}\)):
\[=\frac{\Delta y}{\Delta x}\times \frac{s_x}{s_y}\]
One can simply rearrange the formula so that it becomes a complex fraction.
\[=\frac{\frac{\Delta y}{s_y}}{\frac{\Delta x}{s_x}}\]
Dividing \(\Delta y\) and \(\Delta x\) by the standard deviations of y and x converts these values to z-scores. Thus, the standardized slope is equal to the change in standardized y over the change in standardized x!
\[=\frac{{\Delta z_y}}{{\Delta z_x}}\]
TLDR: Standardized slopes are equal to the standard deviation change in y for every 1 standard deviation change in x.
How Does Simple Regression Relate to Correlation?
The correlation between two variables is equal to the standardized regression slope in the case of simple regression. Why?
We know that: \[b = \frac{cov(x,y)}{var(x)}\] and
\[\beta = b\times\frac{s_x}{s_y}\] Plugging our formula for b into our formula for beta, we get \[\beta = \frac{cov(x,y)}{var(x)}\times\frac{s_x}{s_y}\]
We can rewrite this into a complex fraction: \[\beta = \frac{\frac{cov(x,y)}{s_y}}{ \frac{var(x)}{s_x}}\]
Since the variance of a variable is equal to its standard deviation squared, dividing a variable’s variance by its standard deviation yields its standard deviation. \[\beta = \frac{\frac{cov(x,y)}{s_y}}{ s_x}\]
We can rewrite the above equation for clarity, which gives us the formula for correlation! \[\beta = \frac{cov(x,y)}{s_ys_x} = cor(x,y)\]
It is important to note that when you have multiple independent variables, correlation and regression coefficients are different things!
TLDR: For simple regression, the standardized regression coefficient equals the bivariate correlation.
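A quick numerical check of this result, using made-up data rather than the `student` data:

```r
# Simulate two correlated variables
set.seed(42)
x <- rnorm(200)
y <- 0.4 * x + rnorm(200)

b    <- coef(lm(y ~ x))[2]  # unstandardized slope
beta <- b * sd(x) / sd(y)   # apply the conversion factor
all.equal(unname(beta), cor(x, y))  # TRUE: beta equals the correlation
```

The equality holds exactly (up to floating-point precision), because both quantities reduce to \(cov(x,y)/(s_x s_y)\).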
What Proportion of Variance is Explained by an Independent Variable?
The standardized regression coefficients allow us to begin to talk about the proportion of variance explained by certain predictors! This is super useful when we are trying to assess overall model fit and variable importance. Let's decompose the variance of y based on our model.
Model: \[y = b\times x + e\]
Applying covariance algebra (and assuming x and e are uncorrelated): \[var(y) = b^2\times var(x)+var(e)\]
To find the proportion of variance explained in an outcome variable by x, we can simply take the portion of variance in y attributable to x (i.e., \(b^2\times var(x)\)) and divide it by the total variance!
\[\frac{b^2\times var(x)}{var(y)}\]
What do we know about the variance of y and the variance of x when we use standardized regression coefficients? Well, x and y have both been rescaled so that they are on a standard normal distribution. This means that \(var(y)=var(x)=1\). Plugging those values into the above equation gives us:
\[\frac{b^2\times 1}{1}=b^2\]
Let’s see how much variance Psych 101 performance explains in statistics performance.
# Extracting model coefficients from standardized model
my_coefs<-coef(s_simple_reg)
# Pulling out the slope
slope<-my_coefs[2]
# Squaring the slope to calculate proportion of variance explained in y
slope^2
## Psych101_s
## 0.09991946
The output from `summary()` provides a statistic called multiple R-squared, which is the proportion of variance explained by the model as a whole. Since there is only one predictor in the model, we expect the multiple R-squared to equal the squared standardized regression coefficient.
# Printing summary of the unstandardized model
summary(simple_reg)
##
## Call:
## lm(formula = Statistics ~ 1 + Psych101_c, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.703 -11.963 1.259 11.701 34.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.2037 1.2107 51.377 < 2e-16 ***
## Psych101_c 0.5053 0.1206 4.188 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.31 on 158 degrees of freedom
## Multiple R-squared: 0.09992, Adjusted R-squared: 0.09422
## F-statistic: 17.54 on 1 and 158 DF, p-value: 4.662e-05
TLDR: The proportion of variance explained in an outcome by a given variable is equal to its squared standardized regression coefficient!
Uncertainty in the Parameter Estimates
Just like the central limit theorem tells us that \(\bar y\) is an imperfect estimator of \(\mu\), a regression model’s slopes and intercept are imperfect estimators of their population parameters. Conveniently, regression parameters share some distributional properties with \(\bar y\). They are unbiased, meaning that \(E(\hat \beta) = \beta\). Furthermore, when certain assumptions are met (more on that later), we assume that
\[\hat \beta \sim N(\beta, \sigma_\beta)\]
\(\sigma_\beta\) is unknown, however, so we approximate it with \(s_\beta = \sqrt\frac{MSE}{\sum (x_i-\bar x)^2}\). I will spare you a lengthy description of why this is the way it is! In the near future I hope to write a post on robust standard errors, and I will dive more into these issues then.
Note that \(s_\beta\) is simply the standard error of the regression coefficient provided in the `summary()` output!
##
## Call:
## lm(formula = Statistics ~ 1 + Psych101_c, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.703 -11.963 1.259 11.701 34.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.2037 1.2107 51.377 < 2e-16 ***
## Psych101_c 0.5053 0.1206 4.188 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.31 on 158 degrees of freedom
## Multiple R-squared: 0.09992, Adjusted R-squared: 0.09422
## F-statistic: 17.54 on 1 and 158 DF, p-value: 4.662e-05
TLDR: We are uncertain about regression parameters just like we are in sample means. While the expected value of the regression parameters is the population parameter, each estimated regression weight also has a standard error.
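A small simulation can build intuition for this: repeatedly draw samples from a known population, refit the regression, and compare the spread of the slope estimates to the analytic standard error. This is a sketch with made-up population values, not the `student` data:

```r
# Sampling distribution of the slope via simulation
set.seed(3)
n <- 160
slopes <- replicate(2000, {
  x <- rnorm(n, mean = 70, sd = 6)
  y <- 26 + 0.5 * x + rnorm(n, sd = 15)  # true slope = 0.5
  coef(lm(y ~ x))[2]
})
mean(slopes)  # close to 0.5, illustrating unbiasedness
sd(slopes)    # close to the analytic SE, roughly 15 / (6 * sqrt(n))
```

The empirical standard deviation of the simulated slopes approximates \(s_\beta\), which is what `summary()` reports from a single sample.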
Using these standard errors, we can calculate t-statistics just like a normal t-test!
\[t(df) = \frac{\hat \beta - H_0}{s_\beta}\]
Where \(df = n-k\) and k equals the number of estimated coefficients (including the intercept).
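We can reproduce the slope’s t statistic by hand. The estimate and standard error below are copied (rounded) from the `summary(simple_reg)` output above:

```r
b  <- 0.5053  # slope estimate from the summary output
se <- 0.1206  # its standard error
t_stat <- (b - 0) / se  # testing H0: beta = 0
t_stat  # roughly 4.19, matching the printed t value
2 * pt(abs(t_stat), df = 158, lower.tail = FALSE)  # two-tailed p; matches the printed Pr(>|t|) up to rounding
```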
summary(simple_reg)
##
## Call:
## lm(formula = Statistics ~ 1 + Psych101_c, data = student)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.703 -11.963 1.259 11.701 34.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.2037 1.2107 51.377 < 2e-16 ***
## Psych101_c 0.5053 0.1206 4.188 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.31 on 158 degrees of freedom
## Multiple R-squared: 0.09992, Adjusted R-squared: 0.09422
## F-statistic: 17.54 on 1 and 158 DF, p-value: 4.662e-05
The test statistics suggest that it would be rather surprising to observe these regression coefficients if the intercept and slope were truly 0. The probabilities of observing an intercept and a slope as large as we did, given that they were 0 in the population, were approximately 2e-16 and 4.66e-05, respectively.
Generating Confidence Intervals
Just like hypothesis testing is the same for regression coefficients and sample means, generating confidence intervals shouldn’t be anything new either! We can do it using the formula below.
\[95\% CI = \hat \beta\pm t_{.975, df}\times s_\beta\]
This information isn’t generated automatically by `summary()`. Instead, we can use the `confint()` function to calculate confidence intervals for regression parameters. Furthermore, we can use the `level =` argument to define how stringent we want the interval to be! The default is .95.
# 95 % Confidence Interval
confint(simple_reg, level = .95)
## 2.5 % 97.5 %
## (Intercept) 59.8124236 64.5950394
## Psych101_c 0.2669927 0.7435791
# 99% Confidence Interval
confint(simple_reg, level = .99)
## 0.5 % 99.5 %
## (Intercept) 59.0469846 65.3604784
## Psych101_c 0.1907169 0.8198549
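To confirm that `confint()` is just applying the formula above, we can rebuild the 95% interval for the slope by hand (estimate and SE copied, rounded, from the summary output):

```r
b  <- 0.5053                  # slope estimate
se <- 0.1206                  # its standard error
t_crit <- qt(.975, df = 158)  # critical t for a 95% interval
c(lower = b - t_crit * se, upper = b + t_crit * se)
# roughly 0.267 and 0.744, matching the confint() output
```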
Making Predictions
In research, we are often concerned with making inferences about how independent variables relate to an outcome. However, in practice, many people want to predict an important outcome. For example, in an educational setting, a teacher may want to identify how well certain students will do in a class based on past performance and allocate time appropriately to help every student achieve their goals! In an organizational setting, a practitioner may be interested in identifying high-quality leaders based on some test.
The `predict()` function allows researchers to make predictions both for the original sample used to fit the model (the training sample) and for new samples. To make predictions for the training sample, simply pass an object generated by `lm()` to `predict()`.
# Making predictions for the training set and storing them in train_pred
train_pred<-predict(simple_reg)
#Printing the first few predictions
glimpse(train_pred)
## Named num [1:160] 65.9 64.4 65.4 63.7 59.6 ...
## - attr(*, "names")= chr [1:160] "1" "2" "3" "4" ...
You can also make predictions for a new sample. `new_student` contains information from students that have yet to take statistics. I can use the regression model to make predictions about how well they will perform by passing the object created by `lm()` to `predict()` and supplying the new data frame via the `newdata =` argument. Note that the variables used to train the model must have the same names in the new data frame.
# The model used a centered version of the predictor
# Creating this variable in the new data
new_student<-new_student%>%
mutate(Psych101_c = Psych101-mean(Psych101, na.rm = TRUE))
# Storing my prediction in an object titled new_pred
new_pred<-predict(simple_reg, newdata = new_student)
# taking a peek at the first few predictions
glimpse(new_pred)
## Named num [1:100] 57.5 69.3 67.5 60.5 52.7 ...
## - attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
Confidence Intervals for Predictions
A natural way to think about making predictions using the linear model is that we are creating a linear combination of our observed variables. When generating confidence intervals for the prediction, it is more useful to think about it the opposite way: we are creating a linear combination of slopes!
This change in perspective is because, in regression, we assume that the independent variables are observed without error (i.e., have no variance), but the parameter estimates are noisy and have variability (i.e., a standard error)!
Let’s set up the covariance algebra problem! We are trying to make a prediction for the first student in `new_student`.
id | Statistics | Psych101_c |
---|---|---|
161 | 63.77125 | -9.247285 |
The formula for this prediction is:
\[\hat y = b_0 + (-9.25)\times b_1\]
Through covariance algebra, we can see that the variance of the prediction is
\[ var(\hat y) = var(b_0) + (-9.25)^2 \times var(b_1) + 2\times(-9.25)\times cov(b_0, b_1) \]
You should know that regression coefficients have variances (their squared standard errors). Solving for those is easy! But coefficient covariances?! Madness!!! I won’t go into too much detail about these, but they are just a variation on the standard error formula. Again, I will save the details for a later post, but for simple regression the covariance between the intercept and the slope is
\[cov(b_0, b_1) = \frac{-\bar x \times MSE}{\sum (x_i-\bar x)^2}\]
(Since we centered our predictor, \(\bar x = 0\), so this covariance is essentially zero.)
This all may seem like a lot of work, but it can be calculated in the general case quite easily with some matrix algebra! For simplicity's sake, we will assume that these values are given in the covariance matrix below (note: this was generated by `vcov()`).
## (Intercept) Psych101_c
## (Intercept) 1.465873e+00 -9.096594e-17
## Psych101_c -9.096594e-17 1.455622e-02
Below, I plug the relevant values from the `vcov()` matrix into the formula and solve for the variance of the prediction.
## [1] 2.71
Thus, the variance of our prediction is 2.71, and the standard error of our prediction is \[\sqrt{var(\hat y)} = \sqrt {2.71} = 1.65 \]
Using this standard error we could calculate the confidence interval using the same method as every other CI.
\[95\%\ CI = \hat y \pm t(df)\times s_{pred}\] \[57.53 \pm 1.98\times 1.65\]
## Upper
## [,1]
## [1,] 60.78262
## Lower
## [,1]
## [1,] 54.2798
Skip this formula if you aren’t interested, but for those who are, the general matrix-algebra formula for the variance of a prediction is:
\[var(\hat y) = x_{new}^\top \Sigma_b x_{new}\]
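Here is a sketch of that matrix computation in R, plugging in the `vcov()` values printed earlier along with the first new student’s centered predictor value (-9.247285):

```r
# Coefficient covariance matrix (values from vcov(simple_reg) above)
Sigma_b <- matrix(c(1.465873e+00, -9.096594e-17,
                    -9.096594e-17, 1.455622e-02),
                  nrow = 2, byrow = TRUE)
x_new <- c(1, -9.247285)  # leading 1 multiplies the intercept
var_yhat <- t(x_new) %*% Sigma_b %*% x_new
drop(var_yhat)        # roughly 2.71, matching the scalar calculation
sqrt(drop(var_yhat))  # roughly 1.65, the SE of the prediction
```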
TLDR: Covariance algebra is important and can be used to derive all sorts of things! Our predictions are uncertain because the regression coefficients are uncertain! It is nice to quantify the uncertainty in our prediction using confidence intervals.
Regression Confidence Intervals in Practice
The great thing is, you will hopefully never have to do the math outlined above, unless you are writing your own software or really interested in the inner workings of statistical methods.
`predict()` can generate multiple types of confidence intervals, and we will use it to do all the heavy lifting I did by hand above. To generate confidence intervals, we simply add the `interval` argument to the `predict()` function. There are two different types of intervals that can be generated. I outline each conceptually below one more time to highlight the differences.
interval = "confidence"
This type of interval generates the confidence interval for the best prediction given the data. This is the interval I described above. It acknowledges the uncertainty in our regression parameters but does not address the magnitude of error in our predictions. You can think of this as the confidence interval for the regression line.
new_pred_conf<-predict(simple_reg, interval = "confidence", newdata = new_student)
print_head(new_pred_conf)
fit | lwr | upr |
---|---|---|
57.53121 | 54.27943 | 60.78299 |
69.32022 | 65.19929 | 73.44114 |
67.46659 | 64.02007 | 70.91312 |
60.53627 | 58.01899 | 63.05356 |
52.68478 | 47.59845 | 57.77111 |
65.75152 | 62.83300 | 68.67003 |
interval = "prediction"
This type of interval generates the confidence interval for future observations of a student’s score. It explicitly acknowledges how wrong our model was in the training data. You can think of this as the confidence interval for individual data points.
new_pred_pred<-predict(simple_reg, interval = "prediction", newdata = new_student)
print_head(new_pred_pred)
fit | lwr | upr |
---|---|---|
57.53121 | 27.10900 | 87.95341 |
69.32022 | 38.79287 | 99.84756 |
67.46659 | 37.02296 | 97.91023 |
60.53627 | 30.18379 | 90.88876 |
52.68478 | 22.01220 | 83.35736 |
65.75152 | 35.36313 | 96.13991 |
Visualizing these different confidence intervals is helpful to illustrate how they relate to the data. The red lines depict `interval = "confidence"`, while the blue lines depict `interval = "prediction"`. The black line is the regression line. The red confidence intervals are associated with the regression line, while the blue confidence intervals are associated with new observations.
Creating Pretty Plots
My hope is that you now have a reasonable foundation in how to conduct simple linear regression. Now we will focus on reporting regression analyses. Plotting is pretty easy to do, too. I will show you how to do it by hand with ggplot2 as well as automatically with `sjPlot`. For publication purposes, what you will typically see is a single line illustrating the relationship between one IV and the DV.
ggplot2: Manual method
Step 1: Extract the coefficients from the model
# Pulling the coefficients from the model
coefficients<-coef(simple_reg)
# Printing the coefficients
coefficients
## (Intercept) Psych101_c
## 62.2037315 0.5052859
int<-coefficients[1]
slp<-coefficients[2]
Step 2: Initialize the plot
p<-ggplot(data = student, aes(x = Psych101_c, y = Statistics ))
p
Step 3: Add geom_abline
p<-p+
geom_abline(intercept =int, slope = slp)
p
Step 4: Add points
p<-p+
geom_point()
p
Step 5: Define Limits
Limits should be set to adequately visualize the relationship between the variables of interest.
p<-p+
lims(x = c(-70, 30),
y = c(0, 100))
p
sjPlot: Automated Plotting
`sjPlot` has a function called `plot_model()` that works with most models in R. There are a ton of different options, but I will only talk about one. `plot_model()` takes an `lm` object (or a different type of model) and plots it; however, the default plot depends on a variety of things! If you are only interested in looking at the predicted values, you need to set the `type =` argument to `"pred"`.
plot_model(simple_reg,
type = "pred")
## $Psych101_c
This returns a list of `ggplot` objects that can be edited and added to like any other plot. I can add `geom_point()` to the `plot_model()` output.
# Creating plot
p<-plot_model(simple_reg,
type = "pred")
# Pulling out part of list
p<-p$Psych101_c
# Adding layer
p+
geom_point(data = student, aes(x = Psych101_c, y = Statistics))
Getting Creative With Plotting
Sometimes, we may be interested in making some really nice visualizations for a grant or a publication. These may include labels, the predicted values, and color. Let’s take a look at how to make beautiful regression plots that move beyond the bare-bones methods above.
Step 0: Prepare the Data Frame
# Create a plot_data data frame from student
plot_data<-student
# Add Predictions
plot_data$pred<-predict(simple_reg)
# Add Residuals (function is new but self-explanatory)
plot_data$resid<-resid(simple_reg)
# Is the observation under predicted value (if so TRUE)
plot_data<-plot_data%>%
mutate(negative = if_else(resid<=0, TRUE, FALSE))
print_head(plot_data)
id | Psych101 | Abnormal | Statistics | Psych101_c | Psych101_s | Statistics_s | pred | resid | negative |
---|---|---|---|---|---|---|---|---|---|
1 | 78.20424 | 72.90162 | 65.39501 | 7.366426 | 0.7317649 | 0.1983203 | 65.92588 | -0.5308739 | TRUE |
2 | 75.22972 | 64.10788 | 70.95324 | 4.391905 | 0.4362824 | 0.5437335 | 64.42290 | 6.5303377 | FALSE |
3 | 77.24524 | 64.95519 | 56.25551 | 6.407432 | 0.6365004 | -0.3696489 | 65.44132 | -9.1858028 | TRUE |
4 | 73.74729 | 68.36619 | 50.70919 | 2.909481 | 0.2890215 | -0.7143223 | 63.67385 | -12.9646611 | TRUE |
5 | 65.71232 | 88.81546 | 65.56294 | -5.125489 | -0.5091551 | 0.2087560 | 59.61389 | 5.9490424 | FALSE |
6 | 68.73337 | 58.99921 | 32.08966 | -2.104443 | -0.2090508 | -1.8714236 | 61.14039 | -29.0507307 | TRUE |
Step 1: Extract the coefficients from the model
# Pulling the coefficients from the model
coefficients<-coef(simple_reg)
# Printing the coefficients
coefficients
## (Intercept) Psych101_c
## 62.2037315 0.5052859
int<-coefficients[1]
slp<-coefficients[2]
Step 2: Initialize the plot
p<-ggplot(data = plot_data, aes(x = Psych101_c, y = Statistics ))
p
Step 3: Add geom_abline
p<-p+
geom_abline(intercept = int, slope = slp)
p
Step 4: Add points with color
p<-p+
geom_point(aes(color = negative))
p
Step 5: Define Limits.
Limits should be set to adequately visualize the relationship between the variables of interest.
p<-p+
lims(x = c(-70, 30),
y = c(0, 100))
p
Step 6: Add Lines
p<-p+
geom_segment(aes(xend = Psych101_c, yend = pred), lty = 3)
p
Step 7: Create Label Data
label_data<-sample_n(plot_data, size = 20)%>%
mutate(resid = round(resid, 2))
Step 8: Add Labels
p<-p+
geom_text(data = label_data,
aes(label = resid),
nudge_y = .5)
p
Step 9: Format
p+
theme(legend.position = "none", # Removing legend
panel.grid = element_blank())+
labs(x = "Psych 101", # Adding labels
title = "Plot of Statistics Final Grades Regressed on Psych 101",
subtitle = "Residual values reported as labels")
Creating Pretty Tables
The `apaTables` package contains some extremely useful functions for generating APA-style tables in Word documents. We can simply pass the regression object, define where we want the document saved, and it will generate an APA table!
apa.reg.table(simple_reg, filename = "my_regression.doc")
There are two notable alternatives that come in handy when working with HTML and LaTeX documents. I will talk about one from the `sjPlot` package; the other, stargazer, is semi-redundant, so I will just provide a link to its documentation below. The `tab_model()` function in `sjPlot` produces an HTML table that can be copied and pasted. It is not perfectly APA, but it is really close. Plus, all of the options allow you to get even closer! There is even an option to standardize the regression coefficients, making much of our code above unnecessary! If you’re interested, check out the documentation on the developer’s website.
tab_model(simple_reg)
Statistics | |||
---|---|---|---|
Predictors | Estimates | CI | p |
(Intercept) | 62.20 | 59.81 – 64.60 | <0.001 |
Psych 101 c | 0.51 | 0.27 – 0.74 | <0.001 |
Observations | 160 | ||
R^{2} / R^{2} adjusted | 0.100 / 0.094 |
Text by James Rigby. Image by Nikolay Georgiev from Pixabay.