1 00:00:05,440 --> 00:00:10,840 So, we saw in the previous video that the house prices 2 00:00:10,840 --> 00:00:14,510 were distributed over the area in an interesting way, 3 00:00:14,510 --> 00:00:17,400 certainly not the kind of linear way. 4 00:00:17,400 --> 00:00:19,580 And we wouldn't necessarily expect linear regression 5 00:00:19,580 --> 00:00:22,030 to do very well at predicting house price, 6 00:00:22,030 --> 00:00:24,410 just given latitude and longitude. 7 00:00:24,410 --> 00:00:26,610 We can kind of develop an intuition more 8 00:00:26,610 --> 00:00:32,030 by plotting the relationship between latitude and house 9 00:00:32,030 --> 00:00:38,950 prices-- which doesn't look very linear-- or the longitude 10 00:00:38,950 --> 00:00:45,310 and the house prices, which also looks pretty nonlinear. 11 00:00:45,310 --> 00:00:48,290 So, we'll try fitting it in a linear regression anyway. 12 00:00:48,290 --> 00:00:52,550 So, let's call it latlonlm. 13 00:00:52,550 --> 00:00:56,210 And we'll use the LM command, linear model, 14 00:00:56,210 --> 00:01:02,010 to predict house prices based on latitude and longitude using 15 00:01:02,010 --> 00:01:05,319 the Boston data set. 16 00:01:05,319 --> 00:01:08,250 If we take a look at our linear regression, 17 00:01:08,250 --> 00:01:12,430 we see that r squared is around 0.1, which is not great. 18 00:01:12,430 --> 00:01:16,450 The latitude is not significant, which 19 00:01:16,450 --> 00:01:18,510 means the north-south differences aren't 20 00:01:18,510 --> 00:01:21,250 going to be really used at all. 21 00:01:21,250 --> 00:01:23,990 Longitude is significant, and it's negative. 22 00:01:23,990 --> 00:01:26,940 Which we can interpret as, as we go towards the oceans-- 23 00:01:26,940 --> 00:01:31,690 we go towards the east-- house prices decrease linearly. 24 00:01:31,690 --> 00:01:34,330 So this all seems kind of unlikely, 25 00:01:34,330 --> 00:01:37,450 but let's work with it. 26 00:01:37,450 --> 00:01:40,380 So let's see how this linear regression 27 00:01:40,380 --> 00:01:43,289 model looks on a plot. 28 00:01:43,289 --> 00:01:48,259 So let's plot the census tracts again. 29 00:01:48,259 --> 00:01:49,120 OK. 30 00:01:49,120 --> 00:01:52,620 Now, remember before, we had-- from the previous video-- 31 00:01:52,620 --> 00:01:55,420 we plotted the above-median house prices. 32 00:01:55,420 --> 00:01:58,350 So we're going to do that one more time. 33 00:01:58,350 --> 00:02:14,130 Median was 21.2. 34 00:02:14,130 --> 00:02:18,390 We had-- the color was red. 35 00:02:18,390 --> 00:02:20,240 And we used solid dots. 36 00:02:23,829 --> 00:02:24,329 Ha. 37 00:02:24,329 --> 00:02:24,890 Oops. 38 00:02:24,890 --> 00:02:26,020 See what I did there? 39 00:02:26,020 --> 00:02:28,390 I used the plot command, instead of the points command, 40 00:02:28,390 --> 00:02:30,390 and it plotted just the new points. 41 00:02:30,390 --> 00:02:34,820 I meant to plot the original points 42 00:02:34,820 --> 00:02:39,750 and use the points command to plot it 43 00:02:39,750 --> 00:02:42,090 on top of the existing plot. 44 00:02:42,090 --> 00:02:42,590 OK. 45 00:02:42,590 --> 00:02:43,700 So that's more like it. 46 00:02:43,700 --> 00:02:49,860 So now we have the median values with the above median value 47 00:02:49,860 --> 00:02:51,760 census tracts. 48 00:02:51,760 --> 00:02:54,700 So, OK, we want to see, now, the question 49 00:02:54,700 --> 00:02:57,310 we're going to ask, and then plot, 50 00:02:57,310 --> 00:03:02,510 is what does a linear regression model think is above median. 51 00:03:02,510 --> 00:03:05,170 So we could just do this pretty easily. 52 00:03:05,170 --> 00:03:13,660 We have latlonlm$fitted.values and this is what the linear 53 00:03:13,660 --> 00:03:17,760 regression model predicts for each of the 506 census tracts. 54 00:03:17,760 --> 00:03:20,970 So we'll plot these on top. 55 00:03:20,970 --> 00:03:24,260 Boston$LON-- take all the census tracts, 56 00:03:24,260 --> 00:03:34,380 such that the latlonlm's fitted values are above the median. 57 00:03:34,380 --> 00:03:35,630 Take the latitudes, too. 58 00:03:43,490 --> 00:03:48,270 And I'm going to make them blue, but let's pause for a moment 59 00:03:48,270 --> 00:03:48,770 and think. 60 00:03:48,770 --> 00:03:53,910 If we use the dots again, we'll cover up the red dots 61 00:03:53,910 --> 00:03:56,810 and cover up some of the black dots. 62 00:03:56,810 --> 00:03:59,060 What we won't be able to see is where 63 00:03:59,060 --> 00:04:01,220 the red dots and the blue dots match up. 64 00:04:01,220 --> 00:04:02,720 You know, we're interested in seeing 65 00:04:02,720 --> 00:04:06,370 how the linear regression matches up with the truth. 66 00:04:06,370 --> 00:04:08,040 So it'd be ideal if we could plot 67 00:04:08,040 --> 00:04:11,840 the linear regression blue dots on top of the red dots, 68 00:04:11,840 --> 00:04:15,470 in some way that we can still see the red dots. 69 00:04:15,470 --> 00:04:17,300 It turns out that you can actually 70 00:04:17,300 --> 00:04:19,980 pass in characters to this PCH option. 71 00:04:19,980 --> 00:04:22,990 So since we're talking about money, 72 00:04:22,990 --> 00:04:27,620 let's plot dollar signs instead of points. 73 00:04:27,620 --> 00:04:28,970 And there you have it. 74 00:04:28,970 --> 00:04:34,270 So, the linear regression model has plotted a dollar sign 75 00:04:34,270 --> 00:04:35,730 for every time it thinks the census 76 00:04:35,730 --> 00:04:37,520 tract is above median value. 77 00:04:37,520 --> 00:04:38,990 And you can see that, indeed, it's 78 00:04:38,990 --> 00:04:41,390 almost as-- you can see the sharp line 79 00:04:41,390 --> 00:04:44,240 that the linear regression defines. 80 00:04:44,240 --> 00:04:46,270 And how it's pretty much vertical, 81 00:04:46,270 --> 00:04:49,530 because remember before, the latitude variable 82 00:04:49,530 --> 00:04:53,510 was not very significant in the regression. 83 00:04:53,510 --> 00:04:57,800 So that's interesting and pretty wrong. 84 00:04:57,800 --> 00:04:59,260 One thing that really stands out is 85 00:04:59,260 --> 00:05:04,240 how it says Boston is mostly above median. 86 00:05:04,240 --> 00:05:06,200 Even knowing-- we saw it right from the start-- 87 00:05:06,200 --> 00:05:09,190 there's a big non-red spot, right 88 00:05:09,190 --> 00:05:12,540 in the middle of Boston, where the house 89 00:05:12,540 --> 00:05:13,750 prices were below the median. 90 00:05:13,750 --> 00:05:17,150 So the linear regression model isn't really doing a good job. 91 00:05:17,150 --> 00:05:18,740 And it's completely ignored everything 92 00:05:18,740 --> 00:05:21,600 to the right side of the picture.