1 00:00:04,520 --> 00:00:08,650 Before we jump into R, let's understand the data. 2 00:00:08,650 --> 00:00:10,900 Each entry of this data set corresponds 3 00:00:10,900 --> 00:00:14,560 to a census tract, a statistical division of the area that 4 00:00:14,560 --> 00:00:18,250 is used by researchers to break down towns and cities. 5 00:00:18,250 --> 00:00:21,250 As a result, there will usually be multiple census tracts 6 00:00:21,250 --> 00:00:23,020 per town. 7 00:00:23,020 --> 00:00:25,620 LON and LAT are the longitude and latitude 8 00:00:25,620 --> 00:00:28,740 of the center of the census tract. 9 00:00:28,740 --> 00:00:33,240 MEDV is the median value of owner-occupied homes, measured 10 00:00:33,240 --> 00:00:36,280 in thousands of dollars. 11 00:00:36,280 --> 00:00:39,650 CRIM is the per capita crime rate. 12 00:00:39,650 --> 00:00:42,220 ZN is related to how much of the land 13 00:00:42,220 --> 00:00:45,560 is zoned for large residential properties. 14 00:00:45,560 --> 00:00:50,270 INDUS is the proportion of the area used for industry. 15 00:00:50,270 --> 00:00:53,920 CHAS is 1 if a census tract is next to the Charles 16 00:00:53,920 --> 00:00:56,870 River, which I drew before. 17 00:00:56,870 --> 00:00:59,860 NOX is the concentration of nitric oxides 18 00:00:59,860 --> 00:01:03,290 in the air, a measure of air pollution. 19 00:01:03,290 --> 00:01:07,830 RM is the average number of rooms per dwelling. 20 00:01:07,830 --> 00:01:10,530 AGE is the proportion of owner-occupied units 21 00:01:10,530 --> 00:01:13,550 built before 1940. 22 00:01:13,550 --> 00:01:15,950 DIS is a measure of how far the tract is 23 00:01:15,950 --> 00:01:18,690 from centers of employment in Boston. 24 00:01:18,690 --> 00:01:22,510 RAD is a measure of closeness to important highways. 25 00:01:22,510 --> 00:01:26,560 TAX is the property tax per $10,000 of value. 26 00:01:26,560 --> 00:01:31,070 And PTRATIO is the pupil to teacher ratio by town.
27 00:01:31,070 --> 00:01:34,039 Let's switch over to R now. 28 00:01:34,039 --> 00:01:38,509 So let's begin to analyze our data set with R. First of all, 29 00:01:38,509 --> 00:01:41,880 we'll load the data set into the boston variable. 30 00:01:46,770 --> 00:01:49,300 If we look at the structure of the Boston data set, 31 00:01:49,300 --> 00:01:52,880 we can see all the variables we talked about before. 32 00:01:52,880 --> 00:01:55,580 There are 506 observations corresponding 33 00:01:55,580 --> 00:02:01,050 to 506 census tracts in the Greater Boston area. 34 00:02:01,050 --> 00:02:03,640 We are interested in building a model initially 35 00:02:03,640 --> 00:02:07,280 of how prices vary by location across a region. 36 00:02:07,280 --> 00:02:10,460 So let's first see how the points are laid out. 37 00:02:10,460 --> 00:02:16,190 Using the plot command, we can plot the latitude and longitude 38 00:02:16,190 --> 00:02:19,100 of each of our census tracts. 39 00:02:21,850 --> 00:02:24,060 This picture might be a little bit meaningless to you 40 00:02:24,060 --> 00:02:28,180 if you're not familiar with the Boston, Massachusetts area, 41 00:02:28,180 --> 00:02:31,050 but I can tell you that the dense central core of points 42 00:02:31,050 --> 00:02:33,550 corresponds to Boston city, Cambridge 43 00:02:33,550 --> 00:02:38,700 city, and other nearby urban cities. 44 00:02:38,700 --> 00:02:41,180 Still, let's try and relate it back to that picture 45 00:02:41,180 --> 00:02:44,030 we saw in the first video, where I showed you the river 46 00:02:44,030 --> 00:02:44,980 and where MIT was. 47 00:02:44,980 --> 00:02:49,040 So we want to show all the points that lie along 48 00:02:49,040 --> 00:02:50,860 the Charles River in a different color. 49 00:02:50,860 --> 00:02:53,070 We have a variable, CHAS, that tells us 50 00:02:53,070 --> 00:02:55,860 if a point is on the Charles River or not.
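The loading, structure check, and first scatter plot described in this part of the video could look roughly like the sketch below. The file name `boston.csv` is an assumption (the transcript never names the file), and the five-row data frame is a made-up stand-in with invented values, so the snippet runs without the real 506-row data set.

```r
# Assumption: the real data is read from a CSV file; the file name is a guess.
# boston <- read.csv("boston.csv")

# Made-up stand-in with a few of the columns described above (all values invented).
boston <- data.frame(
  TRACT = c(2011, 3531, 3540, 4001, 5082),
  LON   = c(-70.955, -71.069, -71.075, -71.010, -71.120),
  LAT   = c(42.255, 42.212, 42.215, 42.180, 42.300),
  MEDV  = c(24.0, 22.7, 18.5, 33.4, 12.1),
  CHAS  = c(0, 1, 1, 0, 0),
  NOX   = c(0.538, 0.605, 0.605, 0.469, 0.713)
)

# Show the structure: one row per census tract, one column per variable.
str(boston)

# Scatter plot of tract centers: longitude on the x-axis, latitude on the y-axis.
plot(boston$LON, boston$LAT)
```

With the real data, `str(boston)` should report the 506 observations mentioned in the video.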
51 00:02:55,860 --> 00:02:58,760 So to put points on an already-existing plot, 52 00:02:58,760 --> 00:03:00,920 we can use the points command, which 53 00:03:00,920 --> 00:03:04,560 looks very similar to the plot command, 54 00:03:04,560 --> 00:03:07,560 except it operates on a plot that already exists. 55 00:03:07,560 --> 00:03:12,520 So let's plot just the points where the Charles River 56 00:03:12,520 --> 00:03:14,880 variable is set to one. 57 00:03:27,880 --> 00:03:30,090 Up to now it looks pretty much like the plot command, 58 00:03:30,090 --> 00:03:32,240 but here's where it's about to get interesting. 59 00:03:32,240 --> 00:03:35,240 We can pass a color, such as blue, 60 00:03:35,240 --> 00:03:37,050 to plot all these points in blue. 61 00:03:37,050 --> 00:03:39,329 And this would plot blue hollow circles 62 00:03:39,329 --> 00:03:41,030 on top of the black hollow circles. 63 00:03:41,030 --> 00:03:42,870 Which would look all right, but I 64 00:03:42,870 --> 00:03:45,880 think I'd much prefer to have solid blue dots. 65 00:03:45,880 --> 00:03:47,940 To control how the points are plotted, 66 00:03:47,940 --> 00:03:52,210 we use the pch option, which you can read about more in the help 67 00:03:52,210 --> 00:03:54,760 documentation for the points command. 68 00:03:54,760 --> 00:03:57,520 But I'm going to use pch 19, which 69 00:03:57,520 --> 00:04:01,620 is a solid version of the dots we already have on our plot. 70 00:04:01,620 --> 00:04:03,290 So by running this command, you see 71 00:04:03,290 --> 00:04:06,470 we have some blue dots in our plot now. 72 00:04:06,470 --> 00:04:10,910 These are the census tracts that lie along the Charles River. 73 00:04:10,910 --> 00:04:12,750 But maybe it's still a little bit confusing, 74 00:04:12,750 --> 00:04:15,820 and you'd like to know where MIT is in this picture. 75 00:04:15,820 --> 00:04:17,470 So we can do that too.
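The overlay of Charles River tracts described above can be sketched as follows. The data frame is a made-up stand-in with invented values, and the helper name `on_river` is introduced here only for illustration.

```r
# Made-up stand-in for the real data (values invented for illustration).
boston <- data.frame(
  LON  = c(-70.955, -71.069, -71.075, -71.010, -71.120),
  LAT  = c(42.255, 42.212, 42.215, 42.180, 42.300),
  CHAS = c(0, 1, 1, 0, 0)
)

# Base plot: every tract as a default hollow black circle.
plot(boston$LON, boston$LAT)

# Logical vector that is TRUE for tracts next to the Charles River.
on_river <- boston$CHAS == 1

# points() draws on the existing plot; pch = 19 is a solid circle.
points(boston$LON[on_river], boston$LAT[on_river], col = "blue", pch = 19)
```

The same logical vector indexes both `LON` and `LAT`, which keeps the x and y coordinates paired to the same tracts.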
76 00:04:17,470 --> 00:04:22,660 I looked up which census tract MIT is in, 77 00:04:22,660 --> 00:04:27,520 and it's census tract 3531. 78 00:04:27,520 --> 00:04:29,000 So let's plot that. 79 00:04:29,000 --> 00:04:36,780 We add another point, the longitude of MIT, 80 00:04:36,780 --> 00:04:42,520 which is in tract 3531, and the latitude of MIT, 81 00:04:42,520 --> 00:04:51,040 which is in census tract 3531. 82 00:04:51,040 --> 00:04:52,990 I'm going to plot this one in red, 83 00:04:52,990 --> 00:04:57,280 so we can tell it apart from the other Charles River dots. 84 00:04:57,280 --> 00:05:01,540 And again, I'm going to use a solid dot to do it. 85 00:05:01,540 --> 00:05:03,090 Can you see it on the little picture? 86 00:05:03,090 --> 00:05:05,420 This little red dot, right in the middle. 87 00:05:05,420 --> 00:05:06,880 That's exactly what we were looking 88 00:05:06,880 --> 00:05:11,980 at from the picture in Video One. 89 00:05:11,980 --> 00:05:13,410 What other things can we do? 90 00:05:13,410 --> 00:05:17,220 Well, this data set was originally constructed 91 00:05:17,220 --> 00:05:19,000 to investigate questions about how 92 00:05:19,000 --> 00:05:21,230 air pollution affects prices. 93 00:05:21,230 --> 00:05:24,360 So the air pollution variable is this NOX variable. 94 00:05:24,360 --> 00:05:28,020 Let's have a look at the distribution of NOX. 95 00:05:31,190 --> 00:05:32,840 summary(boston$NOX). 96 00:05:32,840 --> 00:05:37,260 So we see that the minimum value is 0.385, 97 00:05:37,260 --> 00:05:41,280 the maximum value is 0.87 and the median 98 00:05:41,280 --> 00:05:46,350 and the mean are about 0.53, 0.55. 99 00:05:46,350 --> 00:05:49,790 So let's just use the value of 0.55, 100 00:05:49,790 --> 00:05:51,950 it's kind of in the middle. 101 00:05:51,950 --> 00:05:53,810 And we'll look at just the census 102 00:05:53,810 --> 00:05:56,970 tracts that have above-average pollution.
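Marking a single tract and then summarizing the pollution variable, as described above, might look like the sketch below. The data frame is a made-up stand-in (the coordinates given for tract 3531 are invented, not MIT's real location), and the tract column is assumed to be named `TRACT`.

```r
# Made-up stand-in for the real data (values invented for illustration).
boston <- data.frame(
  TRACT = c(2011, 3531, 3540, 4001, 5082),
  LON   = c(-70.955, -71.069, -71.075, -71.010, -71.120),
  LAT   = c(42.255, 42.212, 42.215, 42.180, 42.300),
  NOX   = c(0.538, 0.605, 0.605, 0.469, 0.713)
)

plot(boston$LON, boston$LAT)

# Select the one row whose tract number is 3531 (assumed column name: TRACT).
is_mit <- boston$TRACT == 3531

# Draw that tract as a solid red dot on top of the existing plot.
points(boston$LON[is_mit], boston$LAT[is_mit], col = "red", pch = 19)

# Minimum, quartiles, median, mean, and maximum of the pollution variable.
print(summary(boston$NOX))
```

On the real data, the summary is where the cutoff of 0.55 used in the video comes from; the toy values here will give different numbers.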
103 00:05:56,970 --> 00:06:00,200 So we'll use the points command again 104 00:06:00,200 --> 00:06:01,550 to plot just those points. 105 00:06:04,100 --> 00:06:11,110 So, points, the latitude, no, the longitude first. 106 00:06:11,110 --> 00:06:15,710 So we want the census tracts with NOX levels 107 00:06:15,710 --> 00:06:19,540 greater than or equal to 0.55. 108 00:06:19,540 --> 00:06:24,580 We want the latitude of those same census tracts. 109 00:06:24,580 --> 00:06:29,210 Again, only if the NOX is greater than or equal to 0.55. 110 00:06:29,210 --> 00:06:32,600 And I guess a suitable color for nasty pollution 111 00:06:32,600 --> 00:06:35,140 would be a bright green. 112 00:06:35,140 --> 00:06:40,280 And again, we'll use the solid dots. 113 00:06:40,280 --> 00:06:44,530 So you can see it is pretty much the same as the other commands. 114 00:06:44,530 --> 00:06:46,490 Wow, okay. 115 00:06:46,490 --> 00:06:49,360 So those are all the points that have got above-average pollution. 116 00:06:49,360 --> 00:06:51,200 Looks like my office is right in the middle. 117 00:06:53,620 --> 00:06:55,080 Now it kind of makes sense, though, 118 00:06:55,080 --> 00:06:58,750 because that's the dense urban core of Boston. 119 00:06:58,750 --> 00:07:00,870 If you think of anywhere where pollution would be, 120 00:07:00,870 --> 00:07:03,830 you'd think it'd be where the most cars and the most people 121 00:07:03,830 --> 00:07:04,330 are. 122 00:07:06,920 --> 00:07:10,090 So that's kind of interesting. 123 00:07:10,090 --> 00:07:12,240 Now, before we do anything more, we 124 00:07:12,240 --> 00:07:14,450 should probably look at how prices vary over the area 125 00:07:14,450 --> 00:07:16,430 as well. 126 00:07:16,430 --> 00:07:18,540 So let's make a new plot. 127 00:07:18,540 --> 00:07:20,340 This one's got a few too many things on it. 128 00:07:20,340 --> 00:07:24,240 So we'll just plot again the longitude 129 00:07:24,240 --> 00:07:26,910 and the latitude for all census tracts.
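The pollution overlay described above uses the same filtering pattern with a numeric threshold. A minimal sketch, using a made-up stand-in data frame with invented values and a helper name, `polluted`, that is not from the video:

```r
# Made-up stand-in for the real data (values invented for illustration).
boston <- data.frame(
  LON = c(-70.955, -71.069, -71.075, -71.010, -71.120),
  LAT = c(42.255, 42.212, 42.215, 42.180, 42.300),
  NOX = c(0.538, 0.605, 0.605, 0.469, 0.713)
)

plot(boston$LON, boston$LAT)

# TRUE for tracts at or above the 0.55 cutoff chosen in the video.
polluted <- boston$NOX >= 0.55

# Longitude goes first (x-axis), then latitude (y-axis), both filtered the same way.
points(boston$LON[polluted], boston$LAT[polluted], col = "green", pch = 19)
```

Storing the condition in one variable avoids typing the threshold twice, which is where a mismatch between the longitude and latitude filters could otherwise creep in.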
130 00:07:26,910 --> 00:07:30,370 That kind of resets our plot. 131 00:07:30,370 --> 00:07:34,050 If we look at the distribution of the housing prices 132 00:07:34,050 --> 00:07:40,460 (boston$MEDV), we see that the minimum price 133 00:07:40,460 --> 00:07:43,500 (remember units are thousands of dollars, 134 00:07:43,500 --> 00:07:45,740 so the median value of owner-occupied homes 135 00:07:45,740 --> 00:07:51,350 is in thousands of dollars) is around five, 136 00:07:51,350 --> 00:07:53,730 the maximum is around 50. 137 00:07:53,730 --> 00:07:59,750 So let's plot again only the above-average price points. 138 00:07:59,750 --> 00:08:01,880 So we'll go: points, with boston$LON[boston$MEDV>=21.2]. 139 00:08:01,880 --> 00:08:24,720 We can also plot the latitude: boston$LAT[boston$MEDV>=21.2]. 140 00:08:24,720 --> 00:08:26,610 We'll reuse that red color we used for MIT. 141 00:08:29,310 --> 00:08:32,120 And one more time, we'll do the solid dots. 142 00:08:34,760 --> 00:08:38,650 So what we see now are all the census tracts 143 00:08:38,650 --> 00:08:43,110 with above-average housing prices. 144 00:08:43,110 --> 00:08:46,140 As you can see, it's definitely not simple. 145 00:08:46,140 --> 00:08:50,510 There are census tracts of above-average and below-average 146 00:08:50,510 --> 00:08:52,820 mixed in with each other. 147 00:08:52,820 --> 00:08:54,640 But there are some patterns. 148 00:08:54,640 --> 00:08:59,130 For example, look at that dense black bit in the middle. 149 00:08:59,130 --> 00:09:01,710 That corresponds to most of the city of Boston, 150 00:09:01,710 --> 00:09:04,670 especially the southern parts of the city. 151 00:09:04,670 --> 00:09:06,810 Also, on the Cambridge side of the river, 152 00:09:06,810 --> 00:09:09,580 there's a big chunk there of dots that are black, 153 00:09:09,580 --> 00:09:13,930 that are not red, that are also presumably below average.
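Resetting the plot and highlighting the higher-priced tracts, as described above, could be sketched like this. The data frame is a made-up stand-in with invented values; the 21.2 cutoff is the one dictated in the video, and the name `expensive` is introduced only for this example.

```r
# Made-up stand-in for the real data (values invented; MEDV is in thousands of dollars).
boston <- data.frame(
  LON  = c(-70.955, -71.069, -71.075, -71.010, -71.120),
  LAT  = c(42.255, 42.212, 42.215, 42.180, 42.300),
  MEDV = c(24.0, 22.7, 18.5, 33.4, 12.1)
)

# Calling plot() again starts a fresh plot, discarding the earlier overlays.
plot(boston$LON, boston$LAT)

# Distribution of median home values.
print(summary(boston$MEDV))

# TRUE for tracts whose median home value is at least 21.2 (thousand dollars).
expensive <- boston$MEDV >= 21.2

# Solid red dots for those tracts; the rest stay as hollow black circles.
points(boston$LON[expensive], boston$LAT[expensive], col = "red", pch = 19)
```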
154 00:09:13,930 --> 00:09:16,770 So there's definitely some structure to it, 155 00:09:16,770 --> 00:09:18,970 but it's certainly not simple in relation 156 00:09:18,970 --> 00:09:21,420 to latitude and longitude at least. 157 00:09:21,420 --> 00:09:24,450 We will explore this more in the next video.