Linear Regression Example

This problem uses the following (x,y) data points: (3,6), (10,7), (15,13), (19,14), and (25,22). These points make up the data for our problem. They are "observed" values. These observed points are graphed in the following chart.

[Chart: the five observed data points]

Even though these points are not on a straight line, they do look like they could be close to a straight line drawn among them. We are assuming that there is some linear relation between the x and the y values. A strict linear relation, one governed by a constraint such as y=0.5x+4.5, would mean that the points would have to line up; they would all have to be on the line y=0.5x+4.5. We know that these points are not on any single line, so there cannot be a strict linear relation between the x and the y values. However, there could be an underlying linear relation that allows for some natural variability. If these points represent measurements from the real world, then we can expect some real-world variability to throw the values off from a strict linear relation.
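The claim that the five points are not on any single line can be checked directly. The following is a minimal sketch in Python; the function name are_collinear is made up for this illustration, and the check assumes the first two points are distinct.

```python
# Observed data points from the text.
points = [(3, 6), (10, 7), (15, 13), (19, 14), (25, 22)]

def are_collinear(pts):
    """Return True if every point lies on the line through the first two points.

    Hypothetical helper for illustration; assumes pts[0] and pts[1] differ.
    """
    (x0, y0), (x1, y1) = pts[0], pts[1]
    # A point (x, y) is on the line through the first two points exactly when
    # (x1 - x0) * (y - y0) equals (y1 - y0) * (x - x0).
    return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0) for x, y in pts[2:])

print(are_collinear(points))  # False
```

Because the observed values are whole numbers, the comparison is exact. With measured decimal values, a small tolerance would be a safer choice than strict equality.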

The question is: which straight line would best represent the underlying relationship between the x and y values? There are infinitely many choices. Let us take a look at two of them, y=(1/2)x +4.5 and y=x-3. Clearly these are different lines. The figure below plots these lines along with a re-plot of the observed data points, this time connected by solid line segments.

[Chart: the observed data points, connected by line segments, with the lines y=(1/2)x +4.5 and y=x-3]

Note that each line hits two of the original points, but misses the others. Since the points are not on a single line, any line that we draw is bound to miss some of the points. Which of our two lines is a better fit? The line for y=(1/2)x +4.5 looks like it is closer to more of the points, but it does not give a feel for the rising values; it looks too flat. On the other hand, the line for y=x-3 gives a good feel for the data points, but it is way off at the lowest x value. How do we choose between these two proposed lines?

For either proposed line, we can find the "expected" value of y for any given value of x. In particular, we can plug in our observed x values, and calculate the expected y value. The chart below does just that.

Row     Column B  Column C  Column D    Column E     Column F     Column G  Column H     Column I
        observed  observed  expected    (observed    square of    expected  (observed    square of
        x         y         y=0.5x+4.5  - expected)  (observed    y=x-3     - expected)  (observed
                                                     - expected)                         - expected)
row 1   3         6         6           0            0            0         6            36
row 2   10        7         9.5         -2.5         6.25         7         0            0
row 3   15        13        12          1            1            12        1            1
row 4   19        14        14          0            0            16        -2           4
row 5   25        22        17          5            25           22        0            0
Totals  72        62        58.5        3.5          32.25        57        5            41

Columns B and C give the observed x and y values in rows 1 through 5. Column D gives the expected y value, for each observed x value, based on the equation y=(1/2)x +4.5. Column G gives the expected y value, for each observed x value, based on the equation y=x-3. Once we have both the observed and the expected y values we can compute the difference between them. This is done in column E for y=(1/2)x +4.5 and in column H for y=x-3. Then, we can find the square of each of these differences. Squaring makes every value non-negative, so positive and negative differences cannot cancel each other, and large differences produce much larger squared values. The squared differences are in columns F and I. Finally, we obtain the sum of the squared differences, shown in the Totals row. Thus we find that the sum of the squared differences for y=(1/2)x +4.5 is 32.25 and the sum of the squared differences for y=x-3 is 41. We judge which line is a better fit by taking the line with the smaller sum of the squared differences. In this case, we would say that y=(1/2)x +4.5 is a better fit than y=x-3.
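The column-by-column arithmetic described above can be reproduced in a few lines. This is a sketch in Python, not part of the original worksheet; the function name sum_squared_differences is an assumption chosen for this illustration.

```python
# Observed data points from the text.
points = [(3, 6), (10, 7), (15, 13), (19, 14), (25, 22)]

def sum_squared_differences(pts, m, b):
    """Sum of (observed y - expected y)^2, where expected y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in pts)

sse_half = sum_squared_differences(points, 0.5, 4.5)   # y = (1/2)x + 4.5
sse_unit = sum_squared_differences(points, 1.0, -3.0)  # y = x - 3
print(sse_half, sse_unit)  # 32.25 41.0
```

The two printed values correspond to the totals of the two squared-difference columns, so the smaller one identifies the better of the two candidate lines.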

But, is y=(1/2)x +4.5 the line with the best fit? We just made up y=(1/2)x +4.5 as an example. There should be lines with a better fit. How do we find the line with the best fit, the line that will produce the least sum of the squared differences? For the simple case of two variables, it turns out that the line of best fit will have the form y=mx+b and the m and the b will satisfy the following system of linear equations:

N b + (Sx) m = Sy
(Sx) b + (Sx2) m = Sxy
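where N is the number of observed points, Sx is the sum of the x values, Sy is the sum of the y values, Sx2 is the sum of the squares of the x values, and Sxy is the sum of the products x*y. In the following table, Columns B and C give the observed data. Column D gives the square of each x value. Column E gives the product of each pair of x, y values. And the Totals row gives the totals for the columns.
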
Row     Column B  Column C  Column D  Column E  Column F                Column G     Column H
        observed  observed                                              (observed    square of
        x         y         x^2       x*y       y=0.728107*x+1.915254   - expected)  (observed
                                                                                     - expected)
row 1   3         6         9         18        4.099575                1.900425     3.61161518
row 2   10        7         100       70        9.196324                -2.196324    4.82383911
row 3   15        13        225       195       12.836859               0.163141     0.02661499
row 4   19        14        361       266       15.749287               -1.749287    3.06000501
row 5   25        22        625       550       20.117929               1.882071     3.54219125
Totals  72        62        1320      1099      61.999974               2.6E-05      15.0642655

From this we see that
N=5,     Sx=72,     Sy=62,     Sx2=1320,     and     Sxy=1099
Thus, the equations given above can be re-written as
5 b + 72 m = 62
72 b + 1320 m = 1099
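One way to solve this pair of equations by hand is Cramer's rule. This is a choice made for this illustration; the text does not say which method was used. In the worked arithmetic below, S_x, S_y, S_{x^2}, and S_{xy} are simply LaTeX renderings of Sx, Sy, Sx2, and Sxy.

```latex
\begin{align*}
m &= \frac{N\,S_{xy} - S_x S_y}{N\,S_{x^2} - (S_x)^2}
   = \frac{5(1099) - 72(62)}{5(1320) - 72^2}
   = \frac{5495 - 4464}{6600 - 5184}
   = \frac{1031}{1416} \approx 0.728107 \\[4pt]
b &= \frac{S_{x^2} S_y - S_x S_{xy}}{N\,S_{x^2} - (S_x)^2}
   = \frac{1320(62) - 72(1099)}{5(1320) - 72^2}
   = \frac{81840 - 79128}{1416}
   = \frac{2712}{1416} \approx 1.915254
\end{align*}
```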
The solution to this system of linear equations gives b=1.915254 and m=0.728107. Therefore, the equation y = 0.728107 x  +  1.915254 should give the best fit to the data. Column F of the chart above gives the expected y values for each of the observed x values, according to that equation. Column G gives the differences between the observed and the expected values. Column H gives the squares of the differences. The TOTALS row extends to give the sum of the squared differences as 15.0642655, a value less than half of our previous best value, 32.25. The graph of that equation is given below.

[Chart: graph of the line y = 0.728107 x  +  1.915254]
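The whole procedure (form the sums, solve the two equations, then total the squared differences) can be sketched in Python as follows. The function name fit_line is an assumption for this illustration, and the closed-form expressions for m and b are the standard solution of the two-equation system rather than anything prescribed by the text.

```python
# Observed data points from the text.
points = [(3, 6), (10, 7), (15, 13), (19, 14), (25, 22)]

def fit_line(pts):
    """Solve N*b + Sx*m = Sy and Sx*b + Sx2*m = Sxy; return (m, b)."""
    n = len(pts)
    sx = sum(x for x, _ in pts)        # Sx
    sy = sum(y for _, y in pts)        # Sy
    sx2 = sum(x * x for x, _ in pts)   # Sx2
    sxy = sum(x * y for x, y in pts)   # Sxy
    det = n * sx2 - sx * sx            # zero only when all x values are equal
    m = (n * sxy - sx * sy) / det
    b = (sx2 * sy - sx * sxy) / det
    return m, b

m, b = fit_line(points)
# Sum of squared differences between observed and expected y values.
sse = sum((y - (m * x + b)) ** 2 for x, y in points)
print(round(m, 6), round(b, 6), round(sse, 4))
```

The rounded slope and intercept should agree with the coefficients quoted above, and the sum of squared differences should agree with the total of Column H to the precision shown there.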

What can we say as a result of all of this? We can say that, assuming there is an underlying linear relationship between the x and y values that we observed, the equation of the line that best fits the data is y = 0.728107 x  +  1.915254, because that equation gives the smallest sum of the squared differences between observed and expected values.

©Roger M. Palay
Saline, MI 48176
August, 2010