Simple Linear Regression.


What is Regression?

Given a scatter plot with a line of best fit running through it, Regression lets us assess how well the X variable explains the changes in the Y variable. For example: what percentage of the variation in Sales is explained by changes in Price?

That percentage is called ‘R Squared’. When R Squared is high, the changes in Y are largely explained by the changes in X. If the R Squared is 95%, we say that 95% of the variation in Sales is explained by changes in Price and the remaining 5% is due to error. The error referred to here is the vertical distance between each dot in the scatter and the line of best fit.
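As a rough illustration of how such a percentage can be worked out, the sketch below computes R Squared for a small Price and Sales data set. The figures are invented for this example, and the helper names `fit_line` and `r_squared` are hypothetical; they do not come from any particular library.

```python
# Hypothetical Price (X) and Sales (Y) figures, invented for illustration only.
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [9.8, 8.1, 6.2, 3.9, 2.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of the variation in ys explained by the fitted line (0 to 1)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Unexplained variation: squared distances between the dots and the line.
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared distances between the dots and the mean of Y.
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


r2 = r_squared(prices, sales)
print(f"R Squared: {r2:.1%} explained, {1 - r2:.1%} due to error")
```

Because the invented Sales figures fall almost exactly on a straight line, the printed R Squared is close to 100%; noisier data would give a lower value.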

A Regression output table has many other components, including confidence intervals, P Values and t ratios.

For a better understanding of how to interpret Regression Models, you are strongly encouraged to revise this topic on your own.

What is Correlation?

Correlation is a numerical way of interpreting the relationship between two variables. A Regression analysis uses the ‘least squares method’ to fit a line through a scatter plot, and the quality of that fit is measured by R Squared.

The Correlation coefficient, r, is the square root of R Squared, taking on the sign (+ or -) of the slope of the line of best fit. When r is high and positive, we say that there is a positive correlation, e.g., when X goes up, Y goes up.

Correlation, or r, measures how tightly the scatter dots cluster around the line of best fit, and its sign tells us whether Y goes up or down as X goes up. The value of r always lies between -1 and 1. An r of 0 (zero) means that there is no linear relationship between X and Y. An r of 1 means that there is a perfect positive relationship between X and Y: every dot lies on the line and, when X goes up, Y goes up. An r of -1 is the perfect negative counterpart, where Y goes down as X goes up.
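To make the link between r and R Squared concrete, here is a minimal sketch that computes r in two ways: directly, and as the square root of R Squared carrying the sign of the slope. The data and the function names `pearson_r` and `r_from_r_squared` are assumptions made up for this example.

```python
import math


def _sums(xs, ys):
    """Return (sxy, sxx, syy): sums of cross-products and squares about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """Correlation coefficient r, computed directly."""
    sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def r_from_r_squared(xs, ys):
    """Square root of R Squared, given the sign of the fitted slope."""
    sxy, sxx, syy = _sums(xs, ys)
    slope = sxy / sxx
    # For a single X variable, R Squared equals sxy^2 / (sxx * syy).
    r_squared = sxy ** 2 / (sxx * syy)
    return math.copysign(math.sqrt(r_squared), slope)


# Hypothetical data: one series rises with X, the other falls.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.1, 3.9, 6.2, 8.0, 9.8]
falling = [9.8, 8.0, 6.2, 3.9, 2.1]

for label, ys in [("rising", rising), ("falling", falling)]:
    print(label, round(pearson_r(xs, ys), 4), round(r_from_r_squared(xs, ys), 4))
```

Both routes should agree for each series, with a positive r for the rising data and a negative r for the falling data.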

Recap:

A line of best fit is a line drawn through a scatter plot so that the total squared vertical distance between the line and the scatter data points is as small as possible. This is traditionally called a ‘Least Squares Line’ and it follows the formula Y = mX + C, where m is the slope and C is the intercept.

Imagine a line running through a scatter plot. Each point on that line has an X and a Y value. The least squares method says that, for each scatter dot, we subtract the line's Y value at that X from the dot's Y value. Our intention is to sum all these differences. However, some of them are negative, because some scatter dots lie below the line, and they would cancel out the positive ones, so we square each difference before summing. This sum is called the ‘Sum of Squared Errors’ (SSE). The Line of Best Fit is the one for which the SSE is the lowest number it can be. The smaller that minimum SSE, the closer the dots are to the line and the better the line predicts Y values, given values of X.
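The recap can be sketched in a few lines of code. The example below uses the standard closed-form formulas for the least squares slope and intercept, then checks that nudging the line away from the fitted values raises the Sum of Squared Errors. The data points and the function names are invented for illustration.

```python
def sum_squared_errors(xs, ys, m, c):
    """SSE of the line Y = m*X + c against the scatter points."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (m, c) that minimise the SSE, using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical scatter points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 5.8, 8.3, 9.6]

m, c = least_squares_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, m, c)
print(f"fitted line: Y = {m:.3f}X + {c:.3f}, SSE = {best_sse:.4f}")

# Any other line, here made by nudging the slope or intercept, has a higher SSE.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]
for dm, dc in nudges:
    other_sse = sum_squared_errors(xs, ys, m + dm, c + dc)
    print(f"m {dm:+.1f}, C {dc:+.1f}: SSE = {other_sse:.4f}")
```

Every nudged line prints a larger SSE than the fitted one, which is what "least squares" means in practice.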

Published by:

MiguelAngelMudoy

Welcome to all of you :-) This is my own Personal Blog Site as a Life-Long Learning Professional in the ever-changing broad field of Data Science. I can only hope that you will find it at least helpful… Happy Learning!
