1
00:00:09,630 --> 00:00:11,710
There are a lot of visualizations in the world

2
00:00:11,710 --> 00:00:13,640
and we don't have time for them all.

3
00:00:13,640 --> 00:00:16,620
So let's focus on one particularly abused plot

4
00:00:16,620 --> 00:00:18,780
type, the pie chart.

5
00:00:18,780 --> 00:00:21,160
We have a specimen right here.

6
00:00:21,160 --> 00:00:23,980
This is a pie chart of phone application crashes,

7
00:00:23,980 --> 00:00:25,790
showing what percentage of all crashes

8
00:00:25,790 --> 00:00:28,710
took place in each mobile operating system.

9
00:00:28,710 --> 00:00:30,260
This data set contains information

10
00:00:30,260 --> 00:00:34,570
for all versions of Apple's iOS, which is used in the iPhone,

11
00:00:34,570 --> 00:00:37,830
as well as the various versions of Google's Android.

12
00:00:37,830 --> 00:00:39,630
There are many things wrong with this plot,

13
00:00:39,630 --> 00:00:42,100
but let's break down exactly what.

14
00:00:42,100 --> 00:00:44,230
Putting aside, for a moment, that there are far too

15
00:00:44,230 --> 00:00:46,950
many labels, check out the ordering of the labels

16
00:00:46,950 --> 00:00:49,670
corresponding to iOS.

17
00:00:49,670 --> 00:00:52,260
Two sensible ways of ordering iOS data

18
00:00:52,260 --> 00:00:53,750
might be by decreasing percentage

19
00:00:53,750 --> 00:00:55,880
or by version number.

20
00:00:55,880 --> 00:01:01,050
Instead, we start at the top with iOS 3.13, with 0%,

21
00:01:01,050 --> 00:01:06,090
and then jump to iOS 4.2.10, with 12.64%,

22
00:01:06,090 --> 00:01:13,050
before going back down to iOS 3.2, with 0.00% again.

23
00:01:13,050 --> 00:01:15,510
Which brings us to the number of labels.

24
00:01:15,510 --> 00:01:17,350
Many of the segments are so narrow that they

25
00:01:17,350 --> 00:01:20,200
can't be seen, although technically, all data is

26
00:01:20,200 --> 00:01:23,100
retained, because every segment is labeled.

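[Editor's sketch] The two orderings mentioned above, by version number and by decreasing percentage, can be illustrated with a few lines of Python. This is only a minimal sketch, not the lecturer's code: it uses just the three label/percentage pairs read out in the lecture, keeps the version labels exactly as transcribed, and `version_key` is a hypothetical helper name introduced here for illustration.

```python
# Only the three (label, percent) pairs read out in the lecture; the full
# data set behind the pie chart is not reproduced here.
crashes = [
    ("iOS 3.13", 0.0),
    ("iOS 4.2.10", 12.64),
    ("iOS 3.2", 0.0),
]


def version_key(label):
    """Hypothetical helper: turn 'iOS 4.2.10' into (4, 2, 10).

    Comparing tuples of integers orders versions numerically, so that
    3.2 sorts before 3.13 (a plain string sort would get this wrong).
    """
    version = label.split()[-1]
    return tuple(int(part) for part in version.split("."))


# Ordering 1: by version number.
by_version = sorted(crashes, key=lambda row: version_key(row[0]))

# Ordering 2: by decreasing percentage. Python's sort is stable, so rows
# with equal percentages keep their original relative order.
by_percent = sorted(crashes, key=lambda row: row[1], reverse=True)

for label, percent in by_version:
    print(f"{label:<12}{percent:6.2f}%")
```

Either ordering gives the reader a rule for where to look for a given segment, which the order in the original chart does not.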
27
00:01:23,100 --> 00:01:25,080
If we look at iOS, we see that there

28
00:01:25,080 --> 00:01:29,090
are only three major versions, 3, 4, and 5, suggesting

29
00:01:29,090 --> 00:01:31,970
we can compress down the iOS segments to just three

30
00:01:31,970 --> 00:01:35,740
segments, while retaining most of the information.

31
00:01:35,740 --> 00:01:38,160
At the least, the versions that differ in the third number

32
00:01:38,160 --> 00:01:43,240
should be combined, and all data points of 0% should be removed.

33
00:01:43,240 --> 00:01:45,460
The more fundamental concern of this visualization

34
00:01:45,460 --> 00:01:46,880
is that it might really be showing

35
00:01:46,880 --> 00:01:48,289
the percentage of the phone market

36
00:01:48,289 --> 00:01:51,460
using each operating system, and says nothing about whether one

37
00:01:51,460 --> 00:01:53,729
operating system crashes more than the other, which

38
00:01:53,729 --> 00:01:56,190
is the focus of this visualization.

39
00:01:56,190 --> 00:01:59,420
Our next pie chart has its own share of problems.

40
00:01:59,420 --> 00:02:01,740
This is a plot of how many shark attacks have

41
00:02:01,740 --> 00:02:04,890
been attributed to each type of shark.

42
00:02:04,890 --> 00:02:07,050
Firstly, the pie chart is, for some reason,

43
00:02:07,050 --> 00:02:09,780
plotted on a hemisphere, a graphical effect that

44
00:02:09,780 --> 00:02:12,710
adds nothing, but has the effect of vertically compressing

45
00:02:12,710 --> 00:02:14,660
the pie chart.

46
00:02:14,660 --> 00:02:17,630
Next, there is the issue of label orientation.

47
00:02:17,630 --> 00:02:21,370
While the caption, "Shark species (total/deaths)",

48
00:02:21,370 --> 00:02:23,840
and the label, "White shark", are horizontal,

49
00:02:23,840 --> 00:02:26,570
the rest are vertical and hard to read.

50
00:02:26,570 --> 00:02:29,329
They are in order, however, which does help.

51
00:02:29,329 --> 00:02:32,190
But the "Others" segment is unfortunately large,

52
00:02:32,190 --> 00:02:35,240
and it is unclear whether that is due to there being a lot of attacks

53
00:02:35,240 --> 00:02:37,570
by many species, or to the species not

54
00:02:37,570 --> 00:02:40,579
being known for many attacks.

55
00:02:40,579 --> 00:02:43,090
Finally, at a glance, it is hard to distinguish

56
00:02:43,090 --> 00:02:46,030
the magnitude of differences between the orange, green,

57
00:02:46,030 --> 00:02:49,250
blue, and brown segments in the top part of the pie chart,

58
00:02:49,250 --> 00:02:52,770
and we must resort to the labels to distinguish between them.

59
00:02:52,770 --> 00:02:56,380
There is no meaning in the colors; they are arbitrary.

60
00:02:56,380 --> 00:02:58,790
Finally, we'll look at a pie chart I made,

61
00:02:58,790 --> 00:03:02,440
of the origins of the international students at MIT.

62
00:03:02,440 --> 00:03:04,370
I made this chart with the default settings

63
00:03:04,370 --> 00:03:06,430
in Google Sheets.

64
00:03:06,430 --> 00:03:08,090
First of all, not all of the segments

65
00:03:08,090 --> 00:03:09,940
are labeled, so that data is lost,

66
00:03:09,940 --> 00:03:12,680
for the Middle East, Africa, Oceania,

67
00:03:12,680 --> 00:03:14,940
and the unknown regions.

68
00:03:14,940 --> 00:03:18,460
Second, again, we have colors that are arbitrary and almost

69
00:03:18,460 --> 00:03:20,090
close enough to be confusing.

70
00:03:20,090 --> 00:03:23,940
The difference between Asia and Africa's colors is subtle.

71
00:03:23,940 --> 00:03:26,000
And of course, the 3D effect on the pie

72
00:03:26,000 --> 00:03:29,620
chart adds nothing, but does play a subtle trick on the eye.

73
00:03:29,620 --> 00:03:32,210
Due to the 3D effect, the blue and red segments

74
00:03:32,210 --> 00:03:34,990
actually look larger, which at a glance,

75
00:03:34,990 --> 00:03:38,730
may lead the viewer to overestimate their size.

76
00:03:38,730 --> 00:03:41,150
What we are going to do now is switch over to R

77
00:03:41,150 --> 00:03:44,690
and plot this data more appropriately, using ggplot.

78
00:03:44,690 --> 00:03:46,190
And then we'll return to the slides,

79
00:03:46,190 --> 00:03:49,670
to discuss some more possibilities for this data.