Linear Regression


Regression is a technique used to predict a dependent variable from one or more independent variables. Linear regression, as the name implies, is used when the relationship between the dependent and independent variables is linear. That relationship is described by an equation referred to as a model or the line of best fit (LOBF). The general form is:
Y = \beta_0 + \beta_1X
This is just the equation of a line, y = mx + b, written in the notation used in statistics. As in the equation for a line, \beta_0 is the y-intercept and \beta_1 is the slope. Y is referred to as the dependent variable or response, and X is the independent variable or predictor.

Topics Covered

Common Questions

Correlation

The Model

Interpretation

Hypothesis Test and Confidence Interval

Residual

Common Questions

In a chapter on linear regression, the following kinds of tasks come up most often:

  • Find and interpret the correlation, r
  • Write the theoretical and estimated model
  • Read software output (JMP, Excel)
  • Interpret the slope (b_1) and y-intercept (b_0)
  • Interpret the coefficient of determination, R^2
  • Conduct a hypothesis test for \beta_1
  • Create a confidence interval for \beta_1
  • Calculate a residual

This is not a definitive list. Ask your professor or instructor for other types of questions that will be on the exam.

Correlation

Correlation measures the strength and direction of the linear relationship between two variables. The sample correlation is represented as r, while the population correlation is represented as \rho (Greek letter rho). Keep in mind that r always falls between -1 and 1, inclusive.

When r > 0, the variables will have the following characteristics:

  • The variables are positively correlated
  • As one variable increases, the other tends to increase
  • The points on a scatter plot trend upward

When r < 0, the following characteristics apply:

  • The variables are negatively correlated
  • As one variable increases, the other tends to decrease
  • The points on a scatter plot trend downward

When r = 1, the variables have a perfect positive linear relationship. This means that when the variables are plotted on a scatter plot, they form a straight upward sloping line. When r = -1, the variables have a perfect negative linear relationship, and the plotted points form a straight downward sloping line. When r = 0, there is no linear relationship between the variables. When they are plotted, the points show no upward or downward linear trend; they may form a shapeless cloud or follow a nonlinear pattern such as a curve.
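The three cases above can be checked numerically. The following is a minimal sketch, not a replacement for JMP or Excel: the function name `pearson_r` and the small data sets are made up for illustration, and the function uses the standard sample correlation formula.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # straight upward line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # straight downward line
r_none = pearson_r(x, [4, 1, 0, 1, 4])   # symmetric curve, y = (x - 3)^2

print(r_up, r_down, r_none)
```

The last data set has a clear pattern, yet r is 0 because the pattern is not linear. This is why r should be read as a measure of linear association only.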

Here are some examples of two variables plotted on a scatter plot, along with their correlations:


The Model

To cover the following topics, we will be using an example from the video game PlayerUnknown’s Battlegrounds (PUBG). It is a game similar to Fortnite, where 100 players are dropped onto a map. The goal of the game is to eliminate all other players using a variety of weapons and be the last one standing. In this example, we will try to predict how much total damage a player dealt to other players given how many kills they got in a match. The first step is to write the Theoretical Model:

Damage Dealt = \beta_0 + \beta_1(kills) + \varepsilon or Y = \beta_0 + \beta_1(X) + \varepsilon

Some important things to note when writing the theoretical model:

  • Capital letters (Y, X) and Greek letters (\beta_0, \beta_1) are used
  • \beta_0: population y-intercept
  • \beta_1: population slope
  • \varepsilon: random error, only gets included with the theoretical model
  • Y = Damage Dealt: the dependent or response variable
  • X = kills: the independent or predictor variable

Now we can use either Excel or JMP to run the regression analysis and write the Estimated Model. The following output is from Excel, but JMP output should have the same information.

To write the estimated model we have to look at the third table, under the Coefficients column. The first row is dedicated to information about the y-intercept and the second row is dedicated to information about the slope. So our estimated model will look like this:

\widehat{\text{Damage Dealt}} = 39.590 + 99.604(kills) or \hat{y} = 39.590 + 99.604(x)

Keep the following in mind when writing the estimated model:

  • Add a hat (\hat{y}) to the dependent variable
  • Use lower case letters, such as y, b_0, b_1
  • The random error term (\varepsilon) is not included
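Excel and JMP obtain b_0 and b_1 by ordinary least squares. As a rough sketch of that calculation, the code below fits a line to a small data set. The function name `fit_line` and the numbers are hypothetical; they are not the PUBG data behind the output above, so the coefficients differ from 39.590 and 99.604.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # estimated slope
    b0 = mean_y - b1 * mean_x    # estimated y-intercept
    return b0, b1

# Hypothetical data, invented for illustration (NOT the PUBG data set)
kills = [0, 1, 2, 3, 4]
damage = [50, 130, 250, 330, 440]

b0, b1 = fit_line(kills, damage)
print(f"y-hat = {b0:.3f} + {b1:.3f}(x)")  # y-hat = 44.000 + 98.000(x)
```

Note that the printed model follows the conventions listed above: a hat on the response, lower case b_0 and b_1, and no error term.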


Interpretation

We are usually asked to interpret the slope, y-intercept, and coefficient of determination (R^2), and in most cases it can be done in a formulaic way. The following interpretations are generalized and will have to be put in the context of the problem.

  • Y-intercept (b_0): When X is equal to 0, the predicted value of Y is b_0
  • Slope (b_1): When X increases by 1 unit, the predicted value of Y changes by b_1
  • R^2: the proportion of variability in the response that can be explained by the model.
    • This value is found in the first table.

Example:

  • Y-intercept: When a player has 0 kills, the predicted total damage dealt is 39.590.
  • Slope: When a player's kill count increases by 1, their predicted total damage dealt increases by 99.604.
  • R^2: 80.95% of the variability in damage dealt can be explained by the model.
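The y-intercept and slope interpretations can be verified by plugging values into the estimated model. This is a small sketch using the coefficients from the output above; the function name `predict_damage` is made up for illustration.

```python
B0 = 39.590   # estimated y-intercept from the regression output
B1 = 99.604   # estimated slope from the regression output

def predict_damage(kills):
    """Predicted total damage dealt for a given number of kills."""
    return B0 + B1 * kills

# Y-intercept: the prediction when kills = 0
at_zero = predict_damage(0)

# Slope: the change in the prediction when kills goes up by 1
change_per_kill = predict_damage(4) - predict_damage(3)

print(round(at_zero, 3), round(change_per_kill, 3))  # 39.59 99.604
```

The change per kill is the same no matter which two consecutive kill counts are compared, because the model is a straight line.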


Hypothesis Test and Confidence Interval

One way to determine if there is a linear relationship between two variables is by evaluating their correlation. Another way is by evaluating the slope from the model. If the population slope (\beta_1) is zero, there is no linear relationship between the variables and the predictor variable is not significant. To evaluate the slope, we can either perform a hypothesis test or construct a confidence interval. When the confidence level matches the significance level of the two-sided test (for example, a 95% CI and \alpha = 0.05), both approaches lead to the same conclusion.

Hypothesis Test

Hypothesis tests for the slope all have the same setup:

H_0: \beta_1 = 0

H_a: \beta_1 \neq 0

t = \frac{b_1}{SE_{b_1}}

The test statistic and the corresponding P-Value can be found in the third table of our output.

Note: Sometimes the test statistic is not given and you will have to calculate it by hand.

EX:

Our test statistic is equal to:

t = \frac{99.604}{1.529} \approx 65.1

Since the corresponding P-Value is approximately 0, we reject the null hypothesis and conclude that \beta_1 differs from 0.
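When the test statistic has to be calculated by hand, it is a single division. The sketch below repeats the calculation with the slope and standard error from the output; the variable names are chosen for illustration.

```python
b1 = 99.604     # estimated slope from the regression output
se_b1 = 1.529   # standard error of the slope from the regression output

# Under H_0: beta_1 = 0, the test statistic is the slope divided by its SE
t_stat = b1 / se_b1

print(round(t_stat, 1))  # 65.1
```

The P-Value itself depends on the degrees of freedom, which come from the sample size, so it is read from the software output rather than computed here.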

Confidence Interval

The general formula for the CI is:

b_1\pm (\text{critical value}) \times (SE_{b_1})

Sometimes the CI will be given to you in the summary table. If you are asked to calculate it by hand, most of the information should be given to you in the summary output. The only value that changes will be the critical value, which depends on the confidence level. After the CI is calculated, we check whether 0 is inside or outside the interval. If 0 is outside the CI, we conclude that \beta_1 significantly differs from 0. If 0 is inside the CI, we conclude that \beta_1 does not significantly differ from 0.

EX:

To create the CI, we pull the values straight from the third table in the summary output. For a 95% CI we will approximate the critical value by using 2.

99.604 \pm 2\times 1.529 = (96.546,102.662)

We notice that 0 is not in our CI, so we can conclude that \beta_1 significantly differs from 0.
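The same interval can be reproduced in a few lines. This is a minimal sketch: the function name `slope_ci` is made up for illustration, and the critical value of 2 is the approximation used above for a 95% CI, not an exact t critical value.

```python
def slope_ci(b1, se_b1, critical_value):
    """Return (lower, upper) for b1 +/- critical_value * se_b1."""
    margin = critical_value * se_b1
    return b1 - margin, b1 + margin

# Slope and standard error from the regression output;
# 2 is the approximate critical value for a 95% CI
lower, upper = slope_ci(99.604, 1.529, 2)

# If 0 falls outside the interval, beta_1 significantly differs from 0
contains_zero = lower <= 0 <= upper

print(round(lower, 3), round(upper, 3), contains_zero)  # 96.546 102.662 False
```

Changing the confidence level only changes the `critical_value` argument; the slope and standard error stay the same.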

Residual

A residual tells us how far off our prediction was. We can calculate a residual for every observation in the data set. If an observation has a residual of 0, our model predicted its response exactly. A positive residual means the model underpredicted, and a negative residual means it overpredicted. The formula is as follows:

residual = observed - predicted

EX:

Suppose we are asked to calculate the residual for when a player has 2 kills. In our data set, we observe that a player with 2 kills dealt a total damage of 200. The predicted value is calculated using our line of best fit.

Observed = 200

Predicted = 39.590 + 99.604(2) = 238.798

Residual = 200 - 238.798 = -38.798
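The residual calculation can be sketched in code as a check on the arithmetic. The coefficients and the observed value are the ones from this example; the function names are made up for illustration.

```python
def predict_damage(kills):
    """Predicted damage from the estimated model y-hat = 39.590 + 99.604(x)."""
    return 39.590 + 99.604 * kills

def residual(observed, predicted):
    """Residual = observed - predicted."""
    return observed - predicted

observed = 200                    # damage observed for a player with 2 kills
predicted = predict_damage(2)     # value on the line of best fit
res = residual(observed, predicted)

print(round(predicted, 3), round(res, 3))  # 238.798 -38.798
```

The residual is negative, so the model predicted more damage than this player actually dealt.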