1 00:00:04,730 --> 00:00:08,680 To download the data that we'll be working with in this video, 2 00:00:08,680 --> 00:00:12,900 click on the hyperlink given in the text above this video. 3 00:00:12,900 --> 00:00:14,070 Don't use Internet Explorer. 4 00:00:14,070 --> 00:00:20,050 Chrome, Safari, or Firefox should all work fine. 5 00:00:20,050 --> 00:00:21,700 After you click on the hyperlink, 6 00:00:21,700 --> 00:00:25,020 it will take you to a page that looks like this. 7 00:00:25,020 --> 00:00:28,280 Go ahead and copy all the text on this page 8 00:00:28,280 --> 00:00:37,090 by first selecting all of it and then hitting Control C on a PC 9 00:00:37,090 --> 00:00:39,680 or Command C on a Mac. 10 00:00:39,680 --> 00:00:42,740 Then go to a simple text editor, like 11 00:00:42,740 --> 00:00:46,600 Notepad on a PC or TextEdit on a Mac, 12 00:00:46,600 --> 00:00:50,480 and paste what you just copied into the text editor 13 00:00:50,480 --> 00:00:55,060 with Control V on a PC or Command V on a Mac. 14 00:00:55,060 --> 00:01:03,820 Then go ahead and save this file with the name movielens.txt, 15 00:01:03,820 --> 00:01:05,220 for text. 16 00:01:05,220 --> 00:01:10,620 Save it somewhere that you can easily navigate to in R. Now, 17 00:01:10,620 --> 00:01:13,570 let's switch to R and load our data. 18 00:01:13,570 --> 00:01:17,370 First, in your R console, navigate to the directory 19 00:01:17,370 --> 00:01:19,170 where you just saved that file. 20 00:01:28,660 --> 00:01:31,110 And click OK. 21 00:01:31,110 --> 00:01:34,750 Now, to load our data, we'll be using a slightly different 22 00:01:34,750 --> 00:01:36,520 command this time. 23 00:01:36,520 --> 00:01:38,550 Our data is not a CSV file. 24 00:01:38,550 --> 00:01:40,840 It's a text file, where the entries 25 00:01:40,840 --> 00:01:43,560 are separated by a vertical bar.
26 00:01:43,560 --> 00:01:46,750 So we'll call our data set movies, 27 00:01:46,750 --> 00:01:50,509 and then we'll use the read.table function, where 28 00:01:50,509 --> 00:01:55,620 the first argument is the name of our data file in quotes. 29 00:01:55,620 --> 00:02:00,000 The second argument is header=FALSE. 30 00:02:00,000 --> 00:02:02,170 This is because our data doesn't have 31 00:02:02,170 --> 00:02:05,440 a header or a variable name row. 32 00:02:05,440 --> 00:02:13,150 And then the next argument is sep="|", the vertical bar, 33 00:02:13,150 --> 00:02:17,870 which can be found above the Enter key on your keyboard. 34 00:02:17,870 --> 00:02:20,920 We need one more argument, which is quote="\"". 35 00:02:28,790 --> 00:02:31,650 Close the parentheses, and hit Enter. 36 00:02:31,650 --> 00:02:34,400 That last argument just makes sure that our text 37 00:02:34,400 --> 00:02:37,270 was read in properly. 38 00:02:37,270 --> 00:02:39,340 Let's take a look at the structure of our data 39 00:02:39,340 --> 00:02:40,900 using the str function. 40 00:02:46,400 --> 00:02:53,210 We have 1,682 observations of 24 different variables. 41 00:02:53,210 --> 00:02:57,530 Since our variables didn't have names, header equaled false, 42 00:02:57,530 --> 00:03:03,090 R just labeled them with V1, V2, V3, et cetera. 43 00:03:03,090 --> 00:03:05,640 But from the MovieLens documentation, 44 00:03:05,640 --> 00:03:07,810 we know what these variables are. 45 00:03:07,810 --> 00:03:11,830 So we'll go ahead and add in the column names ourselves.
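As a rough sketch of the loading step just described, the R code below reads a vertical-bar-separated file with read.table. So that it runs without movielens.txt on hand, it first writes three made-up rows in the same 24-field layout to a temporary file. The sample rows, the make_row helper, and the temporary path are assumptions for illustration only. With the real file, you would pass "movielens.txt" as the first argument instead.

```r
# Hypothetical helper: builds one "|"-separated row with 24 fields
# (ID, title, release date, empty video release date, URL, 19 genre flags).
make_row <- function(id, title) {
  paste(c(id, title, "01-Jan-1995", "", "http://example.com", rep(0, 19)),
        collapse = "|")
}

# Made-up sample rows (not the real MovieLens data). The apostrophe in the
# second title shows why the quote argument matters.
sample_lines <- c(
  make_row(1, "Movie A (1995)"),
  make_row(2, "Someone's Movie (1995)"),
  make_row(3, "Movie C (1995)")
)

# Write the rows to a temporary file standing in for movielens.txt.
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# header = FALSE: the file has no row of variable names.
# sep = "|": fields are separated by a vertical bar.
# quote = "\"": only double quotes count as quote characters, so an
# apostrophe in a title is read as ordinary text.
movies <- read.table(path, header = FALSE, sep = "|", quote = "\"")

# Columns are labeled V1, V2, ..., V24 because no names were supplied.
str(movies)
```

Leaving quote at its default would also treat the apostrophe as a quote character, which is one way text in a file like this can be read in incorrectly.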
46 00:03:11,830 --> 00:03:18,270 To do this, start by typing colnames, for column names, 47 00:03:18,270 --> 00:03:20,320 and then in parentheses, the name of our data 48 00:03:20,320 --> 00:03:24,010 set, movies, and then equals, and we'll 49 00:03:24,010 --> 00:03:26,410 use the c function, where we're going 50 00:03:26,410 --> 00:03:28,520 to list all of the variable names, 51 00:03:28,520 --> 00:03:32,500 each of them in double quotes and separated by commas. 52 00:03:32,500 --> 00:03:38,840 So first, we have "ID", the ID of the movie, then "Title", 53 00:03:38,840 --> 00:03:50,590 "ReleaseDate", "VideoReleaseDate", "IMDB", 54 00:03:50,590 --> 00:03:54,090 "Unknown"-- this is the unknown genre-- 55 00:03:54,090 --> 00:04:02,030 and then our 18 other genres-- "Action", "Adventure", 56 00:04:02,030 --> 00:04:14,620 "Animation", "Children's", "Comedy", "Crime", 57 00:04:14,620 --> 00:04:27,620 "Documentary", "Drama", "Fantasy", "FilmNoir", 58 00:04:27,620 --> 00:04:45,730 "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", 59 00:04:45,730 --> 00:04:47,630 "War", and "Western". 60 00:04:50,320 --> 00:04:50,820 Go 61 00:04:50,820 --> 00:04:54,690 ahead and close the parentheses, and hit Enter. 62 00:04:54,690 --> 00:04:56,450 Let's see what our data looks like now 63 00:04:56,450 --> 00:05:00,780 using the str function again. 64 00:05:00,780 --> 00:05:03,980 We can see that we have the same number of observations 65 00:05:03,980 --> 00:05:06,780 and the same number of variables, but each of them 66 00:05:06,780 --> 00:05:10,660 now has the name that we just gave it. 67 00:05:10,660 --> 00:05:14,900 We won't be using the ID, release date, video release 68 00:05:14,900 --> 00:05:17,600 date, or IMDB variables. 69 00:05:17,600 --> 00:05:19,850 So let's go ahead and remove them.
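The naming step described above can be sketched in R as follows. To keep the example self-contained, it uses a small placeholder data frame with 24 unnamed columns in place of the real movies data. The placeholder values are an assumption for illustration, while the 24 names are the ones listed in the video.

```r
# Placeholder data: 2 rows, 24 columns, automatically labeled V1..V24
# (a stand-in for the real data set, which has 1,682 rows).
movies <- as.data.frame(matrix(0, nrow = 2, ncol = 24))

# Assign all 24 variable names at once with the c function.
colnames(movies) <- c(
  "ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB",
  "Unknown", "Action", "Adventure", "Animation", "Children's",
  "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir",
  "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller",
  "War", "Western"
)

# Same number of rows and columns as before, now with readable names.
str(movies)
```

One thing to keep in mind: a name containing an apostrophe, such as "Children's", has to be wrapped in backticks when used with the dollar sign, as in movies$`Children's`.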
70 00:05:19,850 --> 00:05:24,780 To do this, we type the name of our data set-- movies$-- 71 00:05:24,780 --> 00:05:27,630 the name of the variable we want to remove, 72 00:05:27,630 --> 00:05:31,460 and then just say =NULL, in capital letters. 73 00:05:31,460 --> 00:05:35,400 This will just remove the variable from our data set. 74 00:05:35,400 --> 00:05:43,260 Let's repeat this with release date, video release 75 00:05:43,260 --> 00:05:49,080 date, and IMDB. 76 00:05:55,280 --> 00:05:58,520 And there are a few duplicate entries in our data set, 77 00:05:58,520 --> 00:06:02,100 so we'll go ahead and remove them with the unique function. 78 00:06:02,100 --> 00:06:03,480 So just type the name of our data 79 00:06:03,480 --> 00:06:04,680 set, movies = unique(movies). 80 00:06:10,530 --> 00:06:12,660 Let's take a look at our data one more time. 81 00:06:16,300 --> 00:06:21,910 Now, we have 1,664 observations, a few fewer than before, 82 00:06:21,910 --> 00:06:26,600 and 20 variables-- the title of the movie, the unknown genre 83 00:06:26,600 --> 00:06:32,000 label, and then the 18 other genre labels. 84 00:06:32,000 --> 00:06:34,880 In this video, we've seen one example 85 00:06:34,880 --> 00:06:37,830 of how to prepare data taken from the internet 86 00:06:37,830 --> 00:06:39,480 to work with it in R. 87 00:06:39,480 --> 00:06:42,130 In the next video, we'll use this data 88 00:06:42,130 --> 00:06:46,060 set to cluster our movies using hierarchical clustering.
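The two cleanup steps from this video, removing variables by setting them to NULL and dropping duplicate rows with unique, might look like this in R. This is a minimal sketch on a made-up three-row data frame with only two genre columns. The rows and values are assumptions for illustration, not the real MovieLens data.

```r
# Made-up data: the first and third rows describe the same movie
# under different IDs.
movies <- data.frame(
  ID = 1:3,
  Title = c("Movie A", "Movie B", "Movie A"),
  ReleaseDate = c("01-Jan-1995", "01-Jan-1996", "01-Jan-1995"),
  VideoReleaseDate = NA,
  IMDB = c("http://example.com/a", "http://example.com/b",
           "http://example.com/a"),
  Unknown = c(0, 0, 0),
  Action = c(1, 0, 1)
)

# Assigning NULL to a column removes it from the data frame.
movies$ID <- NULL
movies$ReleaseDate <- NULL
movies$VideoReleaseDate <- NULL
movies$IMDB <- NULL

# unique keeps the first copy of each distinct row and drops the rest.
movies <- unique(movies)

# Two rows remain, with the columns Title, Unknown, and Action.
str(movies)
```

In this sketch the order matters: the two "Movie A" rows only become identical once the ID column is gone, so unique is called after the removals.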