1
00:00:00,120 --> 00:00:02,500
The following content is
provided under a Creative

2
00:00:02,500 --> 00:00:03,910
Commons license.

3
00:00:03,910 --> 00:00:06,950
Your support will help MIT
OpenCourseWare continue to

4
00:00:06,950 --> 00:00:10,600
offer high quality educational
resources for free.

5
00:00:10,600 --> 00:00:13,500
To make a donation or view
additional materials from

6
00:00:13,500 --> 00:00:17,780
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:17,780 --> 00:00:19,030
ocw.mit.edu.

8
00:00:29,180 --> 00:00:31,450
PROFESSOR: As we come
close to testing, we

9
00:00:31,450 --> 00:00:32,530
have shrinkage here.

10
00:00:32,530 --> 00:00:34,130
People probably left home.

11
00:00:34,130 --> 00:00:37,700
Hopefully, everybody who left
home finished their report.

12
00:00:37,700 --> 00:00:40,280
So you guys have all
looked into how to

13
00:00:40,280 --> 00:00:41,400
do the final project.

14
00:00:41,400 --> 00:00:43,900
And have all the ideas how
to go and optimize.

15
00:00:47,110 --> 00:00:50,235
How many people have downloaded,
compiled, and ran,

16
00:00:50,235 --> 00:00:51,490
and you know what's going on?

17
00:00:51,490 --> 00:00:52,010
OK.

18
00:00:52,010 --> 00:00:52,450
Good.

19
00:00:52,450 --> 00:00:52,660
Good.

20
00:00:52,660 --> 00:00:53,910
Good.

21
00:00:55,930 --> 00:00:57,180
Exactly.

22
00:01:00,710 --> 00:01:01,950
It's happening right now.

23
00:01:01,950 --> 00:01:02,800
OK.

24
00:01:02,800 --> 00:01:03,810
Good.

25
00:01:03,810 --> 00:01:08,161
So I will repeat this what
I said last time in here.

26
00:01:08,161 --> 00:01:10,790
We're going to have a design
review with your masters.

27
00:01:10,790 --> 00:01:14,840
So just look for us to send
you the information.

28
00:01:14,840 --> 00:01:16,670
That means when you
come back from

29
00:01:16,670 --> 00:01:18,520
Thanksgiving, schedule it early.

30
00:01:18,520 --> 00:01:23,790
So they can help if you have any
changes in design process.

31
00:01:23,790 --> 00:01:27,820
And then we have a competition
on December 9 in class here,

32
00:01:27,820 --> 00:01:28,840
trying to figure
out who has the

33
00:01:28,840 --> 00:01:33,650
fastest ray tracer created.

34
00:01:33,650 --> 00:01:37,660
And in fact, this year there
is an Akamai prize for the

35
00:01:37,660 --> 00:01:41,400
winning team, including they
have a kind of celebration and

36
00:01:41,400 --> 00:01:42,790
demonstration in their
headquarters.

37
00:01:42,790 --> 00:01:46,590
You get to go get a tour with
their NOC and stuff like that.

38
00:01:46,590 --> 00:01:52,440
Plus, every winning member is
going to get an iPod Nano.

39
00:01:52,440 --> 00:01:55,440
So there's a lot more motivation
now to get the

40
00:01:55,440 --> 00:02:00,280
fastest running ray tracer OK.

41
00:02:00,280 --> 00:02:04,770
So with that, let's switch
gears a little bit.

42
00:02:04,770 --> 00:02:10,039
So today, I'm going to talk
about distributed systems.

43
00:02:10,039 --> 00:02:13,580
Until now what we looked at was,
OK, given a box how to

44
00:02:13,580 --> 00:02:18,555
get something running as fast
as possible inside that box.

45
00:02:18,555 --> 00:02:23,170
And today we're going to look
at going outside the box.

46
00:02:23,170 --> 00:02:27,750
Basically, we want to scale up
to clusters of machines.

47
00:02:27,750 --> 00:02:30,900
That means the room can
have 10, 15 machines.

48
00:02:30,900 --> 00:02:35,300
In fact, for your class,
you guys are using--

49
00:02:35,300 --> 00:02:37,440
how many machines do we have?

50
00:02:37,440 --> 00:02:38,170
16 machines.

51
00:02:38,170 --> 00:02:39,510
So you are doing
independently.

52
00:02:39,510 --> 00:02:42,280
But you can use as one gigantic
machine if you can

53
00:02:42,280 --> 00:02:43,280
and run something.

54
00:02:43,280 --> 00:02:44,540
And data center scale.

55
00:02:44,540 --> 00:02:47,370
This is kind of people like
Google and Amazon have these

56
00:02:47,370 --> 00:02:48,250
kinds of things.

57
00:02:48,250 --> 00:02:51,100
And finally, Planet Scale.

58
00:02:51,100 --> 00:02:53,790
If you want to run something
even bigger, larger.

59
00:02:53,790 --> 00:02:55,420
What you have to deal with, and
what kind of issues you

60
00:02:55,420 --> 00:02:55,860
have to deal with.

61
00:02:55,860 --> 00:02:57,600
It's time to reboot
my machine.

62
00:02:57,600 --> 00:03:01,890
And I have to be pressing this
button probably four or five

63
00:03:01,890 --> 00:03:05,190
times during the day.

64
00:03:05,190 --> 00:03:07,560
So Cluster Scale.

65
00:03:07,560 --> 00:03:11,340
So you want to run a program
on multiple machines.

66
00:03:11,340 --> 00:03:12,910
And OK, Let me put it there.

67
00:03:12,910 --> 00:03:14,780
Why the heck do you
want to do that?

68
00:03:14,780 --> 00:03:16,930
What are the advantages of--

69
00:03:16,930 --> 00:03:20,720
instead of running on one nice
machine, running on a cluster

70
00:03:20,720 --> 00:03:21,180
of machines?

71
00:03:21,180 --> 00:03:22,430
What do you get?

72
00:03:27,982 --> 00:03:28,914
AUDIENCE: It's cheaper.

73
00:03:28,914 --> 00:03:29,850
PROFESSOR: It's cheaper.

74
00:03:29,850 --> 00:03:30,530
That's a good one.

75
00:03:30,530 --> 00:03:33,910
It's cheaper to get a bunch of
small machines than to buy a

76
00:03:33,910 --> 00:03:36,290
humongo mainframe
type machine.

77
00:03:36,290 --> 00:03:37,900
Yes, that's a very
good answer.

78
00:03:37,900 --> 00:03:39,150
What else?

79
00:03:41,798 --> 00:03:43,734
AUDIENCE: It's very slow.

80
00:03:43,734 --> 00:03:44,870
It's slower.

81
00:03:44,870 --> 00:03:48,290
PROFESSOR: So you would run
something because it's slower?

82
00:03:48,290 --> 00:03:50,170
AUDIENCE: But it
is a trade-off.

83
00:03:50,170 --> 00:03:54,635
PROFESSOR: Yes, so there's some
trade off between speed.

84
00:03:54,635 --> 00:03:57,570
But it might not be that much.

85
00:03:57,570 --> 00:03:59,820
Even when you get a gigantic
machine, there are

86
00:03:59,820 --> 00:04:00,630
bottlenecks in it.

87
00:04:00,630 --> 00:04:03,270
In a cluster kind of thing, you
can avoid the bottlenecks.

88
00:04:03,270 --> 00:04:05,080
But hopefully, you're trying
to do it to get some

89
00:04:05,080 --> 00:04:09,540
performance in scaling
to large number of

90
00:04:09,540 --> 00:04:11,680
users and what not.

91
00:04:11,680 --> 00:04:16,470
So basically, what you
want to get is--

92
00:04:16,470 --> 00:04:19,600
so get more parallelism.

93
00:04:19,600 --> 00:04:21,529
Because now we have more
machines, more cores.

94
00:04:21,529 --> 00:04:23,120
And hopefully get higher
throughput.

95
00:04:23,120 --> 00:04:24,810
Definitely, because
you are doing it.

96
00:04:24,810 --> 00:04:27,590
Hopefully, it's a little bit of
lower latency too, because

97
00:04:27,590 --> 00:04:32,580
if you have one gigantic system,
if everything has to

98
00:04:32,580 --> 00:04:37,150
go through bottlenecks, it might
be slower than basically

99
00:04:37,150 --> 00:04:38,390
having different systems.

100
00:04:38,390 --> 00:04:46,510
So assume, just an example, if
you are something like Verizon

101
00:04:46,510 --> 00:04:49,480
or Netflix trying to
serve your videos.

102
00:04:49,480 --> 00:04:52,230
It makes much more sense to have
a bunch of clusters of

103
00:04:52,230 --> 00:04:55,360
machines each doing a lot of
independent work than trying

104
00:04:55,360 --> 00:04:58,360
to send all the videos
to one machine.

105
00:04:58,360 --> 00:05:01,930
Another interesting fact
is robustness.

106
00:05:01,930 --> 00:05:04,720
So until now you guys didn't
care about robustness, because

107
00:05:04,720 --> 00:05:07,280
something went wrong, the
entire thing collapsed.

108
00:05:07,280 --> 00:05:09,420
There's no half baked machine.

109
00:05:09,420 --> 00:05:11,440
The machine crashed, your
program crashed.

110
00:05:11,440 --> 00:05:11,870
Everything died.

111
00:05:11,870 --> 00:05:14,530
So you just have this
fatalistic attitude.

112
00:05:14,530 --> 00:05:15,405
OK.

113
00:05:15,405 --> 00:05:16,880
It crashed.

114
00:05:16,880 --> 00:05:17,520
Everything is dead.

115
00:05:17,520 --> 00:05:19,000
So why bother?

116
00:05:19,000 --> 00:05:23,847
But in these clusters, if you
have a lot of machines, if one

117
00:05:23,847 --> 00:05:25,550
machine dies, there's many
others to pick up.

118
00:05:25,550 --> 00:05:29,650
So you can have a system that
probably has availability much

119
00:05:29,650 --> 00:05:32,150
higher than what you can get
on a single machine.

120
00:05:32,150 --> 00:05:34,360
And finally, cost savings.

121
00:05:34,360 --> 00:05:36,400
Because it's cheaper
to do this, have a

122
00:05:36,400 --> 00:05:37,700
bunch of small machines.

123
00:05:37,700 --> 00:05:41,340
And businesses like Google
have really taken

124
00:05:41,340 --> 00:05:42,590
advantage of that.

125
00:05:45,790 --> 00:05:48,520
So there are issues we have to
deal with in order to program

126
00:05:48,520 --> 00:05:50,162
this damn thing.

127
00:05:50,162 --> 00:05:52,240
And if you want to get
performance, you have to

128
00:05:52,240 --> 00:05:53,780
program in a way to get
good performance.

129
00:05:53,780 --> 00:05:56,820
You don't want to run much slower
and get less

130
00:05:56,820 --> 00:05:57,920
performance than one box.

131
00:05:57,920 --> 00:06:00,650
You'll get performance and also
performance scalability.

132
00:06:00,650 --> 00:06:03,240
That means if you get 10
machines, you want to get some

133
00:06:03,240 --> 00:06:05,370
performance as if you have 20.

134
00:06:05,370 --> 00:06:07,370
Hopefully, you want to get a lot
more performance than 20.

135
00:06:07,370 --> 00:06:09,630
So how do we keep things
scaling in there?

136
00:06:09,630 --> 00:06:12,510
And also the thing's
robustness.

137
00:06:12,510 --> 00:06:16,970
So the idea there is if you
have one machine, you're

138
00:06:16,970 --> 00:06:17,210
fatalistic.

139
00:06:17,210 --> 00:06:18,730
If the machine goes,
everything goes.

140
00:06:18,730 --> 00:06:20,080
You don't care.

141
00:06:20,080 --> 00:06:22,870
But if you have a lot more
machines, you want to make

142
00:06:22,870 --> 00:06:26,100
sure that application runs even
if the machines fail.

143
00:06:26,100 --> 00:06:28,170
Worse, if you have a lot of
machines, there's a lot more

144
00:06:28,170 --> 00:06:29,460
chance of failure.

145
00:06:29,460 --> 00:06:32,130
So if one goes down, everything
crashes still.

146
00:06:32,130 --> 00:06:34,470
Then your application will be a
lot less robust even than a

147
00:06:34,470 --> 00:06:35,990
single machine, because
there are too many

148
00:06:35,990 --> 00:06:37,845
moving parts to go wrong.
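
A minimal Python sketch of the robustness argument above. It assumes every machine fails independently with the same probability, which is an assumption made here for illustration and not something stated in the lecture. The function names and the 1% failure rate are hypothetical.

```python
def prob_all_up(n_machines: int, p_fail: float) -> float:
    """Probability that every one of n machines is up.

    This is the availability of an application that crashes
    as soon as any single machine fails.
    """
    return (1.0 - p_fail) ** n_machines


def prob_any_up(n_machines: int, p_fail: float) -> float:
    """Probability that at least one of n machines is up.

    This is the availability of an application that keeps running
    as long as some machine survives to pick up the work.
    """
    return 1.0 - p_fail ** n_machines
```

With a 1% failure rate, `prob_all_up(100, 0.01)` comes out at roughly 0.37, while `prob_any_up(2, 0.01)` is about 0.9999. More machines hurt if any failure is fatal and help if the others can pick up the work.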

149
00:06:37,845 --> 00:06:39,890
So you want to actually deal
with this robustness.

150
00:06:39,890 --> 00:06:44,780
So that adds an entire new
dimension in there.

151
00:06:44,780 --> 00:06:47,050
We are not going to go too much
deeper into robustness.

152
00:06:47,050 --> 00:06:50,340
But that is one big thing that
you have to really worry about

153
00:06:50,340 --> 00:06:53,570
when you go to distributed
systems.

154
00:06:53,570 --> 00:06:53,830
OK.

155
00:06:53,830 --> 00:06:55,700
What's a distributed system?

156
00:06:55,700 --> 00:07:00,400
So this is what we have
been working on so far.

157
00:07:00,400 --> 00:07:02,080
Can we see if we can
reduce the lights?

158
00:07:06,980 --> 00:07:08,480
I guess up there you can't--

159
00:07:08,480 --> 00:07:10,780
OK.

160
00:07:10,780 --> 00:07:12,030
We don't go fully
dark, we'll see.

161
00:07:16,450 --> 00:07:19,680
Oh, that's you guys.

162
00:07:19,680 --> 00:07:22,420
Don't go to sleep even
though light is--

163
00:07:22,420 --> 00:07:23,270
there.

164
00:07:23,270 --> 00:07:25,510
So this should be over there
and I don't have any way to

165
00:07:25,510 --> 00:07:26,700
darken this side.

166
00:07:26,700 --> 00:07:29,580
So these are the machines we
have been thinking about.

167
00:07:29,580 --> 00:07:31,020
We have a memory system.

168
00:07:31,020 --> 00:07:34,170
And more than just having
a shared memory,

169
00:07:34,170 --> 00:07:35,760
we have cache coherence.

170
00:07:35,760 --> 00:07:39,230
So that means if two people want
to communicate, one writes

171
00:07:39,230 --> 00:07:42,250
to a single memory location,
and a little later that

172
00:07:42,250 --> 00:07:46,000
data appears on all
the different cores.

173
00:07:46,000 --> 00:07:48,770
So we can use that information
to basically communicate to

174
00:07:48,770 --> 00:07:49,320
the processor.

175
00:07:49,320 --> 00:07:51,350
That's really nice.

176
00:07:51,350 --> 00:07:57,040
So a distributed memory machine
has no shared memory.

177
00:07:57,040 --> 00:07:58,050
So each memory is--

178
00:07:58,050 --> 00:07:59,995
Now, how are you going
to communicate?

179
00:08:03,130 --> 00:08:04,490
Message?

180
00:08:04,490 --> 00:08:06,000
Yeah, this is not software.

181
00:08:06,000 --> 00:08:08,100
You actually need something
additional.

182
00:08:08,100 --> 00:08:11,060
Something like a network, or
Internet, something behind

183
00:08:11,060 --> 00:08:13,360
sitting out that actually
lets you communicate

184
00:08:13,360 --> 00:08:15,830
between each other.

185
00:08:15,830 --> 00:08:19,690
So if you just really look at
the kind of cost, this is a

186
00:08:19,690 --> 00:08:21,740
back of the envelope
type calculation.

187
00:08:21,740 --> 00:08:23,710
Register is probably
one cycle.

188
00:08:23,710 --> 00:08:26,090
Cache is about 10 cycles.

189
00:08:26,090 --> 00:08:28,940
If you go to DRAM, you can
get about 1,000 cycles.

190
00:08:28,940 --> 00:08:32,520
Remote memory, going somewhere
across, is, again, another

191
00:08:32,520 --> 00:08:34,480
order of magnitude from that.

192
00:08:34,480 --> 00:08:36,090
So of course, you keep adding.

193
00:08:36,090 --> 00:08:37,789
And that's probably the
reason that sometimes

194
00:08:37,789 --> 00:08:38,370
things can be slow.

195
00:08:38,370 --> 00:08:41,370
Because now, we have another
layer that's even slower.

196
00:08:41,370 --> 00:08:46,610
So we have to think about it,
worry about it when you're

197
00:08:46,610 --> 00:08:48,880
writing code for these
types of machines.
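
A small Python sketch of the back-of-the-envelope cost table above. The register, cache, and DRAM figures are the ones quoted in the lecture. The remote-memory figure is an assumption: it takes "another order of magnitude" to mean ten times DRAM. The names are hypothetical.

```python
# Rough access costs in CPU cycles, as quoted in the lecture.
ACCESS_COST_CYCLES = {
    "register": 1,
    "cache": 10,
    "dram": 1_000,
    "remote_memory": 10_000,  # assumed: one order of magnitude above DRAM
}


def total_cycles(access_counts: dict) -> int:
    """Total cycle cost of a mix of accesses, e.g. {"cache": 90, "remote_memory": 10}."""
    return sum(ACCESS_COST_CYCLES[level] * count
               for level, count in access_counts.items())


def slowdown(level: str, baseline: str = "register") -> float:
    """How many times slower one level is than another."""
    return ACCESS_COST_CYCLES[level] / ACCESS_COST_CYCLES[baseline]
```

Under these numbers, 90 cache hits plus 10 remote accesses cost 100,900 cycles, almost all of it from the 10 remote accesses. That is why the extra layer has to be kept in mind when writing code for these machines.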

198
00:08:48,880 --> 00:08:53,190
So in shared memory machines,
we learn in

199
00:08:53,190 --> 00:08:54,750
languages like Cilk.

200
00:08:54,750 --> 00:08:57,300
It's very nice to communicate
because we

201
00:08:57,300 --> 00:08:58,930
synchronize via locks.

202
00:08:58,930 --> 00:09:01,780
And all communication
via memory.

203
00:09:01,780 --> 00:09:03,560
Because when you write
something, if you look at that

204
00:09:03,560 --> 00:09:05,990
memory location, everybody
else will see it.

205
00:09:05,990 --> 00:09:08,275
And if you put the right
synchronization, hopefully you

206
00:09:08,275 --> 00:09:10,600
will get the value you want.

207
00:09:10,600 --> 00:09:13,620
In distributed memory
machines, there's

208
00:09:13,620 --> 00:09:14,820
nothing like that.

209
00:09:14,820 --> 00:09:16,700
So what we see is we explicitly

210
00:09:16,700 --> 00:09:18,510
send some data across.

211
00:09:18,510 --> 00:09:20,240
So you have what we
call messages.

212
00:09:20,240 --> 00:09:23,350
And that means if you want to
send something to-- if another

213
00:09:23,350 --> 00:09:25,286
person needs to look at
something, we have to send it

214
00:09:25,286 --> 00:09:26,260
to that person.

215
00:09:26,260 --> 00:09:27,755
So you have to originate
yourself.

216
00:09:27,755 --> 00:09:28,470
Saying, I'm sending.

217
00:09:28,470 --> 00:09:30,100
That other person has
to receive it.

218
00:09:30,100 --> 00:09:32,950
And they have to put it
wherever you want.

219
00:09:32,950 --> 00:09:36,620
So everybody's address
space is separate.

220
00:09:36,620 --> 00:09:38,820
And if you want to synchronize,
you would also do

221
00:09:38,820 --> 00:09:39,540
it through the message.

222
00:09:39,540 --> 00:09:41,940
So you send a message, then the
other person waits for the

223
00:09:41,940 --> 00:09:43,730
message to come.

224
00:09:43,730 --> 00:09:50,550
And so this shows you what
normally happens in messages.

225
00:09:50,550 --> 00:09:53,250
In the shared memory, there's
nothing called message size.

226
00:09:53,250 --> 00:09:54,560
You write a cache line.

227
00:09:54,560 --> 00:09:56,070
The cache line moves.

228
00:09:56,070 --> 00:09:58,500
And you can't keep changing
the cache line size.

229
00:09:58,500 --> 00:10:01,220
Hopefully, prefetcher will be
good and do something nice.

230
00:10:01,220 --> 00:10:02,700
But you don't have
that much choice.

231
00:10:02,700 --> 00:10:05,430
In messages, you can compose any
size of message you want.

232
00:10:05,430 --> 00:10:10,930
So what this graph shows is the
minimum cost and average

233
00:10:10,930 --> 00:10:13,720
cost of different
size messages.

234
00:10:13,720 --> 00:10:16,030
So there's a couple of things
to get out of this graph.

235
00:10:16,030 --> 00:10:20,590
One is that if the message is
even 0 length, or very small,

236
00:10:20,590 --> 00:10:21,670
you still have overhead.

237
00:10:21,670 --> 00:10:23,240
You're going to send
the darn message.

238
00:10:23,240 --> 00:10:27,000
So even if you send nothing,
it costs you some amount.

239
00:10:27,000 --> 00:10:29,670
And the second thing is as the
message gets bigger and

240
00:10:29,670 --> 00:10:31,820
bigger, the cost keeps
increasing, because now you're

241
00:10:31,820 --> 00:10:33,640
sending more and more data.

242
00:10:33,640 --> 00:10:36,670
So if you really want to amortize the
overhead cost, you have to send

243
00:10:36,670 --> 00:10:38,320
large messages in there.

244
00:10:38,320 --> 00:10:41,600
Another thing this chart shows
is that as messages become

245
00:10:41,600 --> 00:10:45,870
bigger, the kind of the
distribution of overhead is

246
00:10:45,870 --> 00:10:47,440
all over the map.

247
00:10:47,440 --> 00:10:50,840
Because now we are sending large
things, a lot of other

248
00:10:50,840 --> 00:10:52,150
craziness happens
to these things.

249
00:10:52,150 --> 00:10:53,910
So sometimes it can go
fast, sometimes it

250
00:10:53,910 --> 00:10:54,730
can be pretty slow.

251
00:10:54,730 --> 00:10:57,560
I will get to why it might be
sometimes this kind of

252
00:10:57,560 --> 00:10:59,610
distribution shortly.

253
00:10:59,610 --> 00:11:02,790
So the main point is that
you don't send smaller

254
00:11:02,790 --> 00:11:06,880
messages if you can, because
the overhead is too high.
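
One common way to capture the two observations from the graph is a linear cost model: a fixed per-message overhead plus a per-byte cost. The Python sketch below assumes that model. The function names and the default numbers (50 microseconds of overhead, 0.01 microseconds per byte) are made up for illustration and are not taken from the lecture's graph.

```python
def message_cost_us(n_bytes: int, overhead_us: float = 50.0,
                    us_per_byte: float = 0.01) -> float:
    """Estimated time to send one message of n_bytes.

    overhead_us is the fixed cost, paid even for a 0-byte message.
    us_per_byte is the extra cost of each byte of payload.
    """
    return overhead_us + us_per_byte * n_bytes


def total_cost_us(total_bytes: int, message_size: int,
                  overhead_us: float = 50.0,
                  us_per_byte: float = 0.01) -> float:
    """Estimated time to move total_bytes using messages of message_size bytes."""
    n_messages = -(-total_bytes // message_size)  # ceiling division
    return n_messages * overhead_us + us_per_byte * total_bytes
```

With these illustrative numbers, moving 1,000 bytes as 100 messages of 10 bytes costs about 5,010 microseconds, while one 1,000-byte message costs about 60. The model deliberately ignores the run-to-run variation that the graph shows for large messages.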

255
00:11:06,880 --> 00:11:08,060
So why is this?

256
00:11:08,060 --> 00:11:11,310
Why is sending messages
complicated?

257
00:11:11,310 --> 00:11:13,940
Till now, there's
nobody sitting

258
00:11:13,940 --> 00:11:16,290
between you and hardware.

259
00:11:16,290 --> 00:11:19,550
Once you set the program running,
you own the entire hardware,

260
00:11:19,550 --> 00:11:24,060
and after figuring out all the
weirdness that's on x86

261
00:11:24,060 --> 00:11:25,670
there's nothing in
between you.

262
00:11:25,670 --> 00:11:27,920
You probably want to look at the
compiled code. If you look at

263
00:11:27,920 --> 00:11:29,940
what assembly is generated,
you have a full view of

264
00:11:29,940 --> 00:11:31,770
what's going on in here.

265
00:11:31,770 --> 00:11:34,870
Unfortunately, message passing,
a lot of other things

266
00:11:34,870 --> 00:11:35,480
come into play.

267
00:11:35,480 --> 00:11:37,320
So if you want to send a message,
the application

268
00:11:37,320 --> 00:11:39,700
says, aha, I'm sending
a message.

269
00:11:39,700 --> 00:11:41,960
And normally, it will
do a system call

270
00:11:41,960 --> 00:11:43,180
to operating system.

271
00:11:43,180 --> 00:11:45,460
And normally, this message
will get copied into the

272
00:11:45,460 --> 00:11:46,860
operating system.

273
00:11:46,860 --> 00:11:48,210
It's copying here.

274
00:11:48,210 --> 00:11:50,680
On this operating system call, the
operating system wakes up.

275
00:11:50,680 --> 00:11:53,500
This might be when this
scheduled, there's a lot of

276
00:11:53,500 --> 00:11:54,350
things going on.

277
00:11:54,350 --> 00:11:56,390
And then the operating system
has to send to the network

278
00:11:56,390 --> 00:11:57,620
interface card.

279
00:11:57,620 --> 00:12:00,230
And the network will say, OK,
I can't send long messages.

280
00:12:00,230 --> 00:12:03,930
I'm going to break into a
bunch of small messages.

281
00:12:03,930 --> 00:12:05,600
And put some hardware here.

282
00:12:05,600 --> 00:12:07,720
And it will end up in
the other side.

283
00:12:07,720 --> 00:12:11,600
In a bunch of fragmented small
pieces that the network

284
00:12:11,600 --> 00:12:14,440
interface unit has to reassemble
into one message

285
00:12:14,440 --> 00:12:15,570
and deliver up.

286
00:12:15,570 --> 00:12:16,840
And this will probably--

287
00:12:16,840 --> 00:12:19,660
it will copy back into
the application.
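
A minimal Python sketch of the fragment-and-reassemble step described above. The sequence-number scheme and the `max_packet_bytes` parameter are assumptions for illustration. Real network hardware and protocols handle this differently and in much more detail.

```python
def fragment(message: bytes, max_packet_bytes: int) -> list:
    """Split a message into (sequence_number, chunk) packets."""
    if max_packet_bytes <= 0:
        raise ValueError("max_packet_bytes must be positive")
    return [
        (seq, message[offset:offset + max_packet_bytes])
        for seq, offset in enumerate(range(0, len(message), max_packet_bytes))
    ]


def reassemble(packets) -> bytes:
    """Rebuild the message, even if the packets arrived out of order."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in ordered)
```

A 10-byte message with a 4-byte packet limit becomes three packets, and `reassemble` returns the original bytes regardless of arrival order.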

288
00:12:19,660 --> 00:12:22,120
So what that means is that a
lot of other things getting

289
00:12:22,120 --> 00:12:24,920
involved, each optimize
separately, doing a lot of

290
00:12:24,920 --> 00:12:26,090
different things.

291
00:12:26,090 --> 00:12:29,610
And so that is why you have this
big unpredictable mess

292
00:12:29,610 --> 00:12:32,300
happening in message passing.

293
00:12:32,300 --> 00:12:37,050
And so there you not only have
to worry about your code.

294
00:12:37,050 --> 00:12:38,730
You have to worry about what the
operating system is doing.

295
00:12:38,730 --> 00:12:40,590
You have to worry about what
the network is doing.

296
00:12:40,590 --> 00:12:43,060
You have to worry about what your
network card's doing.

297
00:12:43,060 --> 00:12:44,930
So there's a lot of moving
parts in this.

298
00:12:44,930 --> 00:12:47,180
If you want to get really,
really good performance,

299
00:12:47,180 --> 00:12:51,090
people have to worry about
all these things in here.

300
00:12:51,090 --> 00:12:54,650
So let's look at how
a message works.

301
00:12:54,650 --> 00:12:56,150
So I hope you can see
these diagrams.

302
00:12:56,150 --> 00:12:58,060
Can you see these?

303
00:12:58,060 --> 00:12:58,530
Barely?

304
00:12:58,530 --> 00:12:59,400
So let me say--

305
00:12:59,400 --> 00:13:01,675
So I have a sending process
and a receiving process.

306
00:13:01,675 --> 00:13:03,110
Oh, don't reboot please.

307
00:13:08,160 --> 00:13:11,320
And so what happens if-- this
is a message-- if we are

308
00:13:11,320 --> 00:13:14,280
sending without any buffering of
a message, that means I am

309
00:13:14,280 --> 00:13:16,250
not copying it anywhere,
so assume I

310
00:13:16,250 --> 00:13:17,170
want to send a message.

311
00:13:17,170 --> 00:13:19,650
I said I have a message
to send.

312
00:13:19,650 --> 00:13:23,260
And then what happens in this
model is, OK, until the

313
00:13:23,260 --> 00:13:26,440
receiver is ready,
you have to wait.

314
00:13:26,440 --> 00:13:29,860
Because there is no place
to send the message.

315
00:13:29,860 --> 00:13:33,620
So finally, when the other side
says I want to receive

316
00:13:33,620 --> 00:13:36,440
something, it will tell this
thing it's OK to send.

317
00:13:36,440 --> 00:13:37,770
And it will copy the data.

318
00:13:37,770 --> 00:13:42,410
And then after copying the data,
both parts can continue.

319
00:13:42,410 --> 00:13:48,510
So this is what happens if
sender wants to send early.

320
00:13:48,510 --> 00:13:50,830
If you're very lucky, the minute
you try to send, the

321
00:13:50,830 --> 00:13:52,250
receiver says I want it.

322
00:13:52,250 --> 00:13:54,170
And we have very little delay.

323
00:13:54,170 --> 00:13:55,660
And everything gets copied.

324
00:13:55,660 --> 00:13:57,410
And that's in your lucky case.

325
00:13:57,410 --> 00:13:59,880
In other cases, the receiver
wants some data.

326
00:14:02,760 --> 00:14:05,580
But the sender is not ready, so
your receiver has to wait

327
00:14:05,580 --> 00:14:07,010
until the sender wants
to send it.

328
00:14:07,010 --> 00:14:07,770
And when this message is

329
00:14:07,770 --> 00:14:10,220
sent, you copy
the data in here.

330
00:14:10,220 --> 00:14:13,430
So this is a very naive
simple way.
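
The unbuffered scheme just described can be sketched in Python with two threads standing in for the two processes. This is an illustration under assumptions, not how a real message layer is built: the class name is hypothetical, and it supports exactly one sender and one receiver.

```python
import threading


class RendezvousChannel:
    """Unbuffered channel: whichever side arrives first waits for the other."""

    def __init__(self):
        self._receiver_ready = threading.Semaphore(0)
        self._data_copied = threading.Semaphore(0)
        self._slot = None

    def send(self, data):
        # Wait until the receiver says it is OK to send.
        self._receiver_ready.acquire()
        self._slot = data              # the one and only copy
        self._data_copied.release()    # both sides may now continue

    def recv(self):
        # Tell the sender we are ready, then wait for the copy to finish.
        self._receiver_ready.release()
        self._data_copied.acquire()
        return self._slot
```

If `send` is called first, the sender sits blocked until `recv` is called. If `recv` is called first, the receiver blocks instead. Either way there is waiting, which is the overhead the next part of the lecture tries to remove.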

331
00:14:13,430 --> 00:14:16,570
What can we eliminate
out of this?

332
00:14:16,570 --> 00:14:20,132
How can we make it a
little bit faster?

333
00:14:20,132 --> 00:14:22,480
AUDIENCE: Buffer.

334
00:14:22,480 --> 00:14:22,840
PROFESSOR: Yeah.

335
00:14:22,840 --> 00:14:24,410
If you buffer, what
will it eliminate?

336
00:14:24,410 --> 00:14:26,090
What will go away?

337
00:14:26,090 --> 00:14:28,580
Out of-- we have this overhead,
this overhead, and

338
00:14:28,580 --> 00:14:30,390
this overhead.

339
00:14:30,390 --> 00:14:31,640
Which overheads can
get eliminated?

340
00:14:35,390 --> 00:14:38,110
Wait for send can go away.

341
00:14:38,110 --> 00:14:39,260
So what happens is--

342
00:14:39,260 --> 00:14:45,750
So here actually what they're
showing is buffering also with

343
00:14:45,750 --> 00:14:46,710
some hardware support.

344
00:14:46,710 --> 00:14:49,730
That means I am trying
to send something.

345
00:14:49,730 --> 00:14:52,570
And the minute I copied
it out there, I can

346
00:14:52,570 --> 00:14:53,540
keep working in there.

347
00:14:53,540 --> 00:14:54,870
And somewhere in the background
where you send the

348
00:14:54,870 --> 00:14:56,295
data, it will arrive here.

349
00:14:56,295 --> 00:14:59,740
And if it wants it,
the data is there.

350
00:14:59,740 --> 00:15:02,210
Of course, if the receiver comes
early and asks for data,

351
00:15:02,210 --> 00:15:02,980
you can't do that.

352
00:15:02,980 --> 00:15:07,100
Still you have to wait, because
the data is not there.

353
00:15:07,100 --> 00:15:09,600
However, if there's no hardware
support, both have to

354
00:15:09,600 --> 00:15:11,790
probably wait a little bit,
because you have to get the

355
00:15:11,790 --> 00:15:12,540
data copied.

356
00:15:12,540 --> 00:15:14,380
So if you have a lot of hardware
support, you don't

357
00:15:14,380 --> 00:15:15,690
see this copy time.

358
00:15:15,690 --> 00:15:18,960
But if there's no hardware
support, you see some copy

359
00:15:18,960 --> 00:15:21,580
time going in here.
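
A minimal Python sketch of buffered sending, using a queue as the buffer. It models the case without hardware support, where the sender pays for a copy and then carries on. The class name is hypothetical.

```python
import queue


class BufferedChannel:
    """Buffered channel: send copies the data out and returns right away."""

    def __init__(self):
        self._buffer = queue.Queue()

    def send(self, data: bytearray) -> None:
        # Copy into the channel's buffer so the sender can reuse its own data.
        self._buffer.put(bytes(data))

    def recv(self, timeout=None) -> bytes:
        # Still has to wait if it arrives before any data does.
        return self._buffer.get(timeout=timeout)
```

After `send` returns, the sender can overwrite its own data without affecting what the receiver will see. A receiver that arrives early still waits: `recv` with a timeout raises `queue.Empty` if nothing has been sent.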

360
00:15:21,580 --> 00:15:25,150
So what's the advantage
of this versus--

361
00:15:25,150 --> 00:15:27,070
OK, tell me one advantage
of this

362
00:15:27,070 --> 00:15:30,730
method versus this method.

363
00:15:30,730 --> 00:15:33,600
So of course, this one there
is a lot of wait time and

364
00:15:33,600 --> 00:15:34,180
stuff like that.

365
00:15:34,180 --> 00:15:35,120
We know that.

366
00:15:35,120 --> 00:15:38,370
But is there any advantage of
doing this one, this waiting

367
00:15:38,370 --> 00:15:40,990
until sending and sending it
there versus this kind of a

368
00:15:40,990 --> 00:15:42,944
nice sending it in
the background.

369
00:15:42,944 --> 00:15:43,912
AUDIENCE: They're
synchronized.

370
00:15:43,912 --> 00:15:44,396
PROFESSOR: Hmm?

371
00:15:44,396 --> 00:15:46,820
AUDIENCE: Sychronized.

372
00:15:46,820 --> 00:15:48,520
PROFESSOR: Synchronized
is one advantage.

373
00:15:48,520 --> 00:15:51,780
What else might happen?

374
00:15:51,780 --> 00:15:53,470
So what else are you going
to do to get this

375
00:15:53,470 --> 00:15:54,850
kind of thing working?

376
00:16:00,960 --> 00:16:03,200
So in order for this to make
progress, what do you have to

377
00:16:03,200 --> 00:16:05,970
do to data?

378
00:16:05,970 --> 00:16:07,020
It has to copy.

379
00:16:07,020 --> 00:16:08,910
So it has to get multiple
copies.

380
00:16:08,910 --> 00:16:10,330
So from application space.

381
00:16:10,330 --> 00:16:12,520
It has to get copied to
operating system space.

382
00:16:12,520 --> 00:16:14,980
It has to get copied into
the networking stack.

383
00:16:14,980 --> 00:16:16,920
So data keeps getting copied,
and copied,

384
00:16:16,920 --> 00:16:18,080
and copied in there.

385
00:16:18,080 --> 00:16:19,830
And in here you basically
don't copy.

386
00:16:19,830 --> 00:16:21,070
You just say, OK, wait.

387
00:16:21,070 --> 00:16:22,980
I'll keep the data and when
you're ready, I will send it

388
00:16:22,980 --> 00:16:24,020
directly in here.

389
00:16:24,020 --> 00:16:26,400
And you can directly probably
even send it to the network.

390
00:16:26,400 --> 00:16:27,550
And send it.

391
00:16:27,550 --> 00:16:32,670
So if you're sending a lot of
data, copying might be costly.

392
00:16:32,670 --> 00:16:35,580
So this might even be better
if you're sending a huge

393
00:16:35,580 --> 00:16:36,760
amount of data.

394
00:16:36,760 --> 00:16:40,130
So that's one advantage of
having system like that.

395
00:16:40,130 --> 00:16:42,080
And of course, hardware--

396
00:16:42,080 --> 00:16:44,805
if there's no hardware support,
basically still you

397
00:16:44,805 --> 00:16:46,220
have to do some copying
in here.

398
00:16:52,950 --> 00:16:55,264
So this is--

399
00:16:55,264 --> 00:16:57,580
what am I showing here?

400
00:16:57,580 --> 00:17:01,160
So what we are showing in
here is non-blocking.

401
00:17:01,160 --> 00:17:06,119
So one way to look at that is
when you're sending, when you

402
00:17:06,119 --> 00:17:11,079
request for send, what you can
say is, OK, I continue but I

403
00:17:11,079 --> 00:17:12,099
haven't copied the data.

404
00:17:12,099 --> 00:17:14,089
I have my data in here,
but I'm doing that.

405
00:17:14,089 --> 00:17:16,524
But what I must tell you, OK,
look, this data still hasn't

406
00:17:16,524 --> 00:17:18,150
moved out of my space yet.

407
00:17:18,150 --> 00:17:18,849
So I have to worry.

408
00:17:18,849 --> 00:17:19,880
I can't rewrite the data.

409
00:17:19,880 --> 00:17:23,030
And at some point, when you say
I want the data, it will

410
00:17:23,030 --> 00:17:25,730
go there and bring
the data for you.

411
00:17:25,730 --> 00:17:27,440
And catch you like that.

412
00:17:27,440 --> 00:17:30,440
So between this time since I
don't want to make too many

413
00:17:30,440 --> 00:17:32,840
copies, I have to make sure that
I don't touch that data.

414
00:17:32,840 --> 00:17:34,060
Or I have to copy it.

415
00:17:34,060 --> 00:17:36,900
So that's my request in here.

416
00:17:36,900 --> 00:17:43,590
And of course, if you have no
hardware support, you have to

417
00:17:43,590 --> 00:17:48,210
put some time into actually
doing the copying.
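
A Python sketch of the non-blocking idea above: the send returns immediately with a handle, nothing is copied until the receiver asks, and the sender must leave the data alone until the handle says it is done. The class and method names are hypothetical and chosen for illustration.

```python
import queue
import threading


class SendRequest:
    """Handle for a send that has been posted but maybe not yet delivered."""

    def __init__(self, buffer: bytearray):
        self.buffer = buffer  # a reference to the sender's data, not a copy
        self._delivered = threading.Event()

    def done(self) -> bool:
        return self._delivered.is_set()

    def wait(self, timeout=None) -> bool:
        """Block until the receiver has taken the data; True if it has."""
        return self._delivered.wait(timeout)


class NonBlockingChannel:
    def __init__(self):
        self._posted = queue.Queue()

    def isend(self, buffer: bytearray) -> SendRequest:
        """Post a send and return immediately, without copying the data."""
        request = SendRequest(buffer)
        self._posted.put(request)
        return request

    def recv(self, timeout=None) -> bytes:
        """Take the oldest posted send; the only copy happens here."""
        request = self._posted.get(timeout=timeout)
        data = bytes(request.buffer)
        request._delivered.set()
        return data
```

The hazard is visible in this sketch: if the sender changes the buffer between `isend` and the matching `recv`, the receiver gets the changed bytes. Calling `wait()` on the handle before touching the buffer avoids that.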

418
00:17:48,210 --> 00:17:52,600
So this is nice.

419
00:17:52,600 --> 00:17:55,370
But we want to have a little
bit of high level

420
00:17:55,370 --> 00:17:56,730
support to do this.

421
00:17:56,730 --> 00:18:01,840
So this is not as nice as things
like Cilk, because you

422
00:18:01,840 --> 00:18:03,180
don't have to worry about a
lot of other interesting

423
00:18:03,180 --> 00:18:04,360
things going on.

424
00:18:04,360 --> 00:18:08,130
So what people have developed
is called MPI language,

425
00:18:08,130 --> 00:18:10,040
Message Passing Interface
language.

426
00:18:10,040 --> 00:18:12,580
It is kind of a bit foggy.

427
00:18:12,580 --> 00:18:15,470
But that's the best people
have these days.

428
00:18:15,470 --> 00:18:19,880
A machine-independent way,
when you have distributed

429
00:18:19,880 --> 00:18:22,760
systems to communicate
with each other.

430
00:18:22,760 --> 00:18:23,360
So--

431
00:18:23,360 --> 00:18:24,812
[PHONE RINGING]

432
00:18:24,812 --> 00:18:25,780
Whoops.

433
00:18:25,780 --> 00:18:26,466
That's not good.

434
00:18:26,466 --> 00:18:27,716
My phone.

435
00:18:31,140 --> 00:18:32,900
Sorry about that.

436
00:18:32,900 --> 00:18:35,540
So what happens is each machine
has its own processor,

437
00:18:35,540 --> 00:18:37,220
its own memory.

438
00:18:37,220 --> 00:18:39,280
So there's no shared memory
on a thing like that.

439
00:18:39,280 --> 00:18:41,180
Its own thread of
control is run.

440
00:18:41,180 --> 00:18:45,280
And each process communicates
via messages.

441
00:18:45,280 --> 00:18:49,260
And data is sent
as needed.

442
00:18:49,260 --> 00:18:53,430
And that means you can't
send things like pointers, because

443
00:18:53,430 --> 00:18:54,800
there's no notion of pointers.

444
00:18:54,800 --> 00:18:56,360
You actually have a data
structure that's

445
00:18:56,360 --> 00:18:59,140
self-contained, sent to
the other side.

446
00:18:59,140 --> 00:19:00,630
So here's a small program.

447
00:19:00,630 --> 00:19:02,650
I'm going to walk
through that.

448
00:19:02,650 --> 00:19:03,750
So I have main.

449
00:19:03,750 --> 00:19:05,620
And I'm setting a bunch
of these variables.

450
00:19:05,620 --> 00:19:08,570
For now, those are not
that important.

451
00:19:08,570 --> 00:19:10,820
But for completeness,
I have that.

452
00:19:10,820 --> 00:19:12,920
And then, of course, if you use
something like MPI, there's a

453
00:19:12,920 --> 00:19:14,490
bunch of setup things
that you have.

454
00:19:14,490 --> 00:19:18,000
And so basically like cut and
paste with what people

455
00:19:18,000 --> 00:19:19,690
normally do as we set up.

456
00:19:19,690 --> 00:19:23,770
And then I have this
piece of code.

457
00:19:27,490 --> 00:19:32,210
This piece of code, what it
does is, this same program

458
00:19:32,210 --> 00:19:37,000
runs on multiple different
machines.

459
00:19:37,000 --> 00:19:38,860
So everyone has the
same program.

460
00:19:38,860 --> 00:19:41,160
But then at some point,
I want to know in my

461
00:19:41,160 --> 00:19:42,320
machine what to do.

462
00:19:42,320 --> 00:19:45,760
So what I do is I
check who am I?

463
00:19:45,760 --> 00:19:46,610
Am I machine zero?

464
00:19:46,610 --> 00:19:48,830
If I'm machine zero, do this.

465
00:19:48,830 --> 00:19:50,940
If I'm machine one, do this.

466
00:19:50,940 --> 00:19:52,830
So by doing that, I can
write a piece of code

467
00:19:52,830 --> 00:19:54,090
that everybody runs.

468
00:19:54,090 --> 00:19:56,140
And everybody figures
out who they are.

469
00:19:56,140 --> 00:19:59,310
And if they are the given
thing, what to do.

470
00:19:59,310 --> 00:20:03,920
So here what it says is, OK, if
I'm machine zero, my source

471
00:20:03,920 --> 00:20:05,420
and destination is
machine one.

472
00:20:05,420 --> 00:20:06,930
If I'm machine one,
my source and

473
00:20:06,930 --> 00:20:08,070
destination is machine zero.

474
00:20:08,070 --> 00:20:10,880
So I'm trying to communicate
between each other.

475
00:20:10,880 --> 00:20:16,110
So if you look at what happens
is, first, I am sending

476
00:20:16,110 --> 00:20:18,870
basically to this machine.

477
00:20:18,870 --> 00:20:22,670
So I'm sending something into
this machine, so the syntax--

478
00:20:22,670 --> 00:20:24,150
I'm not going to go
through that.

479
00:20:24,150 --> 00:20:25,240
You don't have to know that.

480
00:20:25,240 --> 00:20:26,760
But what you need to
know is that I'm

481
00:20:26,760 --> 00:20:27,780
trying to send something.

482
00:20:27,780 --> 00:20:29,686
I tell it explicitly who to send to.

483
00:20:29,686 --> 00:20:32,550
And there has to be a matching
receive for that data.

484
00:20:32,550 --> 00:20:33,670
Otherwise, sends go somewhere.

485
00:20:33,670 --> 00:20:36,400
And it just goes bad.

486
00:20:36,400 --> 00:20:37,990
Send here, you can send it.

487
00:20:37,990 --> 00:20:39,520
It can probably go bad.

488
00:20:39,520 --> 00:20:41,780
But receive you have to have
somebody who sends for that.

489
00:20:41,780 --> 00:20:43,680
So the receive basically
has to have matching.

490
00:20:43,680 --> 00:20:45,910
And then you send it
that direction.

491
00:20:45,910 --> 00:20:50,740
And then what I do is
I receive in here.

492
00:20:50,740 --> 00:20:53,640
And this gets sent
to me in here.

493
00:20:53,640 --> 00:20:56,320
So, question: I did send,
receive here.

494
00:20:56,320 --> 00:20:59,450
What would happen if I did
also send receive here?

495
00:20:59,450 --> 00:21:03,100
If I reorganized these two,
what would happen?

496
00:21:03,100 --> 00:21:06,240
If I used the same piece of
code, that two pieces of code.

497
00:21:06,240 --> 00:21:08,250
Then I don't even have
to bother to make this

498
00:21:08,250 --> 00:21:08,910
two separate pieces of code.

499
00:21:08,910 --> 00:21:11,910
I can basically factor
this out down here.

500
00:21:11,910 --> 00:21:14,070
I do a send, receive;
send, receive here.

501
00:21:14,070 --> 00:21:16,930
And then send the just IDs.

502
00:21:19,950 --> 00:21:21,200
What happens?

503
00:21:29,194 --> 00:21:30,444
AUDIENCE: It works
without a buffer.

504
00:21:33,110 --> 00:21:34,210
PROFESSOR: These things
are what you

505
00:21:34,210 --> 00:21:35,620
call blocking sends.

506
00:21:35,620 --> 00:21:38,040
If you have blocking sends, it
means that until the receiver

507
00:21:38,040 --> 00:21:40,060
receives it might be
blocked if you are

508
00:21:40,060 --> 00:21:41,710
doing a blocking send.

509
00:21:41,710 --> 00:21:42,200
OK.

510
00:21:42,200 --> 00:21:43,460
That means if two guys
are trying to

511
00:21:43,460 --> 00:21:45,520
send, nobody is receiving.

512
00:21:45,520 --> 00:21:46,570
You have what?

513
00:21:46,570 --> 00:21:47,362
AUDIENCE: Deadlock.

514
00:21:47,362 --> 00:21:48,420
PROFESSOR: You have deadlock.

515
00:21:48,420 --> 00:21:49,960
So that's why I actually
had to do this.

516
00:21:49,960 --> 00:21:51,630
This is called blocking send.
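
A minimal sketch of the ordering problem just described, written as a toy Python model rather than real MPI. It assumes fully synchronous sends, meaning a send completes only when the other rank is sitting at a matching receive. The function `runs_to_completion` and the step names `"send"` and `"recv"` are hypothetical, introduced only for illustration.

```python
def runs_to_completion(ops0, ops1):
    """Model two ranks that only talk to each other, with synchronous sends.

    Each argument is a list of "send"/"recv" steps. A pair of steps completes
    only when one rank is at "send" and the other is at "recv".
    Returns True if both ranks finish, False if they get stuck.
    """
    i = j = 0
    while i < len(ops0) and j < len(ops1):
        if {ops0[i], ops1[j]} == {"send", "recv"}:
            i += 1
            j += 1
        else:
            # Both are sending or both are receiving: nobody can progress.
            return False
    return i == len(ops0) and j == len(ops1)


# Rank 0 sends first, rank 1 receives first: the orders complement each other.
print(runs_to_completion(["send", "recv"], ["recv", "send"]))
# Both ranks send first: each waits for a receive that is never reached.
print(runs_to_completion(["send", "recv"], ["send", "recv"]))
```

Real MPI implementations may buffer small messages, so the second ordering can appear to work until the messages grow.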

517
00:21:51,630 --> 00:21:55,220
So instead of blocking sends--

518
00:21:55,220 --> 00:21:56,800
So of course, those are
finalized things

519
00:21:56,800 --> 00:21:58,590
and do that up there.

520
00:21:58,590 --> 00:21:59,960
I can do this one.

521
00:21:59,960 --> 00:22:03,040
What this says is--

522
00:22:03,040 --> 00:22:04,530
This is actually a little
more complicated.

523
00:22:04,530 --> 00:22:06,990
What I'm doing is I have a
bunch of buffers here.

524
00:22:06,990 --> 00:22:10,690
I have how many processors?

525
00:22:10,690 --> 00:22:12,500
I have a bunch of buffers
in here.

526
00:22:12,500 --> 00:22:17,450
I have, I guess, my ID number
of processors--

527
00:22:17,450 --> 00:22:19,560
no, numtask number
of processors.

528
00:22:19,560 --> 00:22:21,870
What I am sending, say I'm
sending a circular buffer.

529
00:22:21,870 --> 00:22:24,991
I'm sending around
to everybody.

530
00:22:24,991 --> 00:22:26,350
Both directions.

531
00:22:26,350 --> 00:22:28,060
So I am sending the
previous and next.

532
00:22:28,060 --> 00:22:30,520
So assume something is
sitting in numtasks.

533
00:22:30,520 --> 00:22:31,960
I am sending back and forth.

534
00:22:31,960 --> 00:22:35,300
So here what I am doing is
basically non-blocking sends

535
00:22:35,300 --> 00:22:35,980
and receives.

536
00:22:35,980 --> 00:22:38,700
So first I'm issuing
a receive.

537
00:22:38,700 --> 00:22:41,110
So when I issue a receive,
it says I have intent

538
00:22:41,110 --> 00:22:41,660
to receive.

539
00:22:41,660 --> 00:22:42,950
But I am not receiving
something.

540
00:22:42,950 --> 00:22:44,170
I am not waiting.

541
00:22:44,170 --> 00:22:45,500
So I can continue.

542
00:22:45,500 --> 00:22:46,940
And then I am doing the same.

543
00:22:46,940 --> 00:22:48,900
So otherwise, if I just do
receive and send, if you

544
00:22:48,900 --> 00:22:50,660
do blocking, it is going
to be deadlocked.

545
00:22:50,660 --> 00:22:51,970
But here I do that.

546
00:22:51,970 --> 00:22:55,030
And then there is this wait-for-all.

547
00:22:55,030 --> 00:22:57,590
What it says is, OK, now
I issued a receive.

548
00:22:57,590 --> 00:23:00,010
Now wait until that
receive is done.

549
00:23:00,010 --> 00:23:03,200
So before I use the data, I have
to wait for it in there.

550
00:23:03,200 --> 00:23:09,650
And also, when I do the send,
I do a wait-for-all in here.

551
00:23:09,650 --> 00:23:15,190
So why do you think it might
be advantageous to do

552
00:23:15,190 --> 00:23:20,020
non-blocking receives and
non-blocking sends?

553
00:23:20,020 --> 00:23:24,450
So sends, it makes perfect
sense, because once I have

554
00:23:24,450 --> 00:23:26,480
sent, I won't do anything
because I don't have to wait

555
00:23:26,480 --> 00:23:26,940
for anything.

556
00:23:26,940 --> 00:23:28,120
I am done.

557
00:23:28,120 --> 00:23:31,400
So blocking sends is
not that useful.

558
00:23:31,400 --> 00:23:34,950
But receives, why do you want
to do non-blocking receives,

559
00:23:34,950 --> 00:23:36,060
rather than a blocking receive?

560
00:23:36,060 --> 00:23:37,940
Because then you
won't receive.

561
00:23:37,940 --> 00:23:39,310
You have to wait till the data
comes to do anything.

562
00:23:39,310 --> 00:23:41,940
Because that's what non-blocking
receives means.

563
00:23:41,940 --> 00:23:45,930
I have to receive early and then
wait for the receives to

564
00:23:45,930 --> 00:23:48,270
happen at this point.

565
00:23:48,270 --> 00:23:50,840
What might be an advantage of
doing a non-blocking receive?

566
00:23:53,650 --> 00:23:54,900
Anybody can think
of an advantage?

567
00:23:59,800 --> 00:24:00,535
It's harder.

568
00:24:00,535 --> 00:24:04,300
Because now you have to remove
the receives from the

569
00:24:04,300 --> 00:24:06,410
synchronization point instead
of writing one receive.

570
00:24:09,612 --> 00:24:09,972
AUDIENCE: Because when sends
come from other machines, we

571
00:24:09,972 --> 00:24:11,222
can receive it.

572
00:24:17,330 --> 00:24:19,150
PROFESSOR: That might be one
interesting thing because you

573
00:24:19,150 --> 00:24:20,780
are expecting multiple
receives.

574
00:24:20,780 --> 00:24:22,360
You don't know what's
coming first.

575
00:24:22,360 --> 00:24:26,330
If you do non-blocking receive,
then you can

576
00:24:26,330 --> 00:24:29,710
pick up the first one that arrives
and basically work on it.

577
00:24:29,710 --> 00:24:30,190
That's a very good point.

578
00:24:30,190 --> 00:24:30,390
OK.

579
00:24:30,390 --> 00:24:31,640
What else?

580
00:24:39,860 --> 00:24:41,445
What other advantages
you might have?

581
00:24:41,445 --> 00:24:42,695
Having a non-blocking receive?

582
00:24:52,640 --> 00:24:54,909
AUDIENCE: If the receive fails,
we can just have them

583
00:24:54,909 --> 00:24:57,570
resend it again.

584
00:24:57,570 --> 00:24:58,940
PROFESSOR: Receive fails?

585
00:24:58,940 --> 00:24:59,200
OK.

586
00:24:59,200 --> 00:24:59,350
Receive fails.

587
00:24:59,350 --> 00:25:01,090
See, that's complicated.

588
00:25:01,090 --> 00:25:03,900
But another interesting
thing might be space.

589
00:25:03,900 --> 00:25:06,940
Because when I issue the
non-blocking receive, I know

590
00:25:06,940 --> 00:25:09,150
where the data has to be.

591
00:25:09,150 --> 00:25:12,060
So I already allocated
a buffer for that.

592
00:25:12,060 --> 00:25:14,900
So, if the data comes now, I
can directly copy into my

593
00:25:14,900 --> 00:25:16,900
local buffer if there's already
a receive issued, if

594
00:25:16,900 --> 00:25:18,790
there's already a
space allocated.

595
00:25:18,790 --> 00:25:21,935
Normally, the other way around,
until you issue the

596
00:25:21,935 --> 00:25:23,830
receive, you don't know where
the data has to be, so it has

597
00:25:23,830 --> 00:25:25,590
to get copied at that point.

598
00:25:25,590 --> 00:25:28,900
So here, you can keep the
buffer, and hopefully, if you

599
00:25:28,900 --> 00:25:32,680
are lucky, the send hasn't
happened yet, so you should

600
00:25:32,680 --> 00:25:34,870
set up the buffer, and then,
when the data comes, say, aha,

601
00:25:34,870 --> 00:25:36,140
here is the matching receive.

602
00:25:36,140 --> 00:25:38,580
Directly put it there,
bypassing any

603
00:25:38,580 --> 00:25:39,280
copying in the middle.

604
00:25:39,280 --> 00:25:42,110
So that's the advantage here.

605
00:25:42,110 --> 00:25:45,310
So, I am here.

606
00:25:45,310 --> 00:25:47,400
I did a wait for the
receives here.

607
00:25:47,400 --> 00:25:48,950
Did the work that
uses the data.

608
00:25:48,950 --> 00:25:52,760
And wait for sends
here, afterwards.

609
00:25:52,760 --> 00:25:53,760
OK.

610
00:25:53,760 --> 00:25:57,080
Could I have moved wait for
sends before the work?

611
00:26:02,370 --> 00:26:04,790
What happens if I have
moved wait for

612
00:26:04,790 --> 00:26:06,096
sends before the work?

613
00:26:06,096 --> 00:26:08,130
Is it incorrect?

614
00:26:08,130 --> 00:26:09,460
How many people think
this is incorrect?

615
00:26:12,450 --> 00:26:15,610
Is it incorrect to move
this wait for sends?

616
00:26:15,610 --> 00:26:18,480
All the sends before the work,
to move this item above?

617
00:26:18,480 --> 00:26:21,310
Because work is where all the
work happens, I assume, that

618
00:26:21,310 --> 00:26:23,060
uses this data.

619
00:26:23,060 --> 00:26:25,751
So wait for sends above,
what happens?

620
00:26:25,751 --> 00:26:27,001
AUDIENCE: Well, if that's
incorrect, then you lose...

621
00:26:28,980 --> 00:26:30,790
PROFESSOR: Yeah, you lose--
you're waiting for something

622
00:26:30,790 --> 00:26:32,060
that you don't have to wait.

623
00:26:32,060 --> 00:26:35,110
Of course, you can't move these
down, because that means you

624
00:26:35,110 --> 00:26:37,940
might start using, and try to
use data that's not there.

625
00:26:37,940 --> 00:26:39,910
So, this has to be here.

626
00:26:39,910 --> 00:26:42,360
And this, basically,
for performance

627
00:26:42,360 --> 00:26:44,205
purposes, has to be after.
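
The ordering discussed here (post the receives, post the sends, wait for the receives, do the work, then wait for the sends) can be sketched as a toy Python model. This is not MPI: threads stand in for machines, `queue.Queue` objects stand in for the network, and thread-pool futures stand in for non-blocking requests. All names (`NUM_TASKS`, `inbox`, `rank_main`, `results`) are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

NUM_TASKS = 4
# inbox[r][s] carries messages sent from rank s to rank r.
inbox = [[queue.Queue() for _ in range(NUM_TASKS)] for _ in range(NUM_TASKS)]
results = [None] * NUM_TASKS


def rank_main(rank):
    prev = (rank - 1) % NUM_TASKS
    nxt = (rank + 1) % NUM_TASKS
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Post both receives first: this declares intent and returns at once.
        recvs = [pool.submit(inbox[rank][src].get) for src in (prev, nxt)]
        # Post both sends: these also return at once.
        sends = [pool.submit(inbox[dst][rank].put, rank) for dst in (prev, nxt)]
        # Wait for the receives before touching the data.
        data = [f.result() for f in recvs]
        # The "work" that uses the received data.
        results[rank] = sum(data)
        # Wait for the sends only afterwards.
        for f in sends:
            f.result()


threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Each rank ends up with the sum of its two neighbours' IDs, and no rank blocks on a send before its receives are posted.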

628
00:26:48,900 --> 00:26:50,740
So, of course you have to
worry about a lot of

629
00:26:50,740 --> 00:26:51,580
correctness issues.

630
00:26:51,580 --> 00:26:53,880
One is deadlocks.

631
00:26:53,880 --> 00:26:55,370
So, there are two types
of deadlocks.

632
00:26:55,370 --> 00:26:57,480
That's blocking sends
and receives,

633
00:26:57,480 --> 00:26:58,650
what we talked about.

634
00:26:58,650 --> 00:27:01,180
But there's also other types
of deadlocks that happen

635
00:27:01,180 --> 00:27:04,060
because of resources.

636
00:27:04,060 --> 00:27:07,590
So, let me get to that
in the next slide.

637
00:27:07,590 --> 00:27:09,940
And the other interesting
thing that can

638
00:27:09,940 --> 00:27:11,450
happen is stale data.

639
00:27:11,450 --> 00:27:14,860
In your shared memory machine,
you just update the data.

640
00:27:14,860 --> 00:27:17,140
You know everybody's
going to see that.

641
00:27:17,140 --> 00:27:19,700
Because the hardware
takes care of that.

642
00:27:19,700 --> 00:27:23,270
But, in a message passing
machine, it's up to you to get

643
00:27:23,270 --> 00:27:25,410
the latest data when
it's needed.

644
00:27:25,410 --> 00:27:27,540
So, if you don't have the data,
you think, aha, I have

645
00:27:27,540 --> 00:27:29,760
the data, but it might not be
the right value because you

646
00:27:29,760 --> 00:27:31,200
haven't gotten something new.

647
00:27:31,200 --> 00:27:34,150
So, it's up to you to
basically send the

648
00:27:34,150 --> 00:27:36,710
data out and that.

649
00:27:36,710 --> 00:27:41,610
And, also, robustness is a big
issue because the fact that

650
00:27:41,610 --> 00:27:44,080
you have multiple machines means
you can make it robust,

651
00:27:44,080 --> 00:27:46,950
but the other flip side is up
to you to make it robust.

652
00:27:46,950 --> 00:27:49,880
So that means you have to figure
out if a machine fails,

653
00:27:49,880 --> 00:27:51,110
how to respond to that.

654
00:27:51,110 --> 00:27:54,480
So, if you're waiting for a
machine, and it fails.

655
00:27:54,480 --> 00:27:55,530
OK?

656
00:27:55,530 --> 00:27:57,700
There are a lot of issues, it
times out, and then you have to

657
00:27:57,700 --> 00:27:59,390
go and deal with that.

658
00:27:59,390 --> 00:28:01,580
So, that can make the
programming a lot more

659
00:28:01,580 --> 00:28:02,720
complicated.

660
00:28:02,720 --> 00:28:07,950
And if you just don't do that,
your overall program would be

661
00:28:07,950 --> 00:28:10,670
a lot less robust than a single
machine because there

662
00:28:10,670 --> 00:28:12,380
could be a lot more failures
in the large system.

663
00:28:17,080 --> 00:28:21,550
So, here's a kind of deadlock
that can happen.

664
00:28:21,550 --> 00:28:28,700
What I am doing is processor
zero is sending and processor

665
00:28:28,700 --> 00:28:31,460
one is sending data
to each other.

666
00:28:31,460 --> 00:28:35,400
It doesn't have a read write
deadlock because I am first

667
00:28:35,400 --> 00:28:37,830
sending, sending and then I
am receiving, receiving.

668
00:28:37,830 --> 00:28:40,680
The sends here, and the receives
from here, the sends

669
00:28:40,680 --> 00:28:42,470
here, and then receive.

670
00:28:42,470 --> 00:28:44,690
So normally it looks like I'm
sending two things, It should

671
00:28:44,690 --> 00:28:46,700
go, and I am receiving that.

672
00:28:46,700 --> 00:28:51,550
But, assume that I am sending
a huge amount of data.

673
00:28:51,550 --> 00:28:51,660
OK.

674
00:28:51,660 --> 00:28:55,930
So I start sending,
and there's not

675
00:28:55,930 --> 00:28:57,710
enough room to receive.

676
00:28:57,710 --> 00:29:00,630
Just say, OK, I don't have any
room to receive, I have to

677
00:29:00,630 --> 00:29:03,410
wait until the data, at least some
data, starts getting consumed

678
00:29:03,410 --> 00:29:04,942
to start receiving.

679
00:29:04,942 --> 00:29:05,350
OK.

680
00:29:05,350 --> 00:29:08,410
If you keep sending multiple of
send outs, you might do a

681
00:29:08,410 --> 00:29:10,270
multiple of send outs in more
than one, multiple of sends

682
00:29:10,270 --> 00:29:13,310
here, I might get deadlocked,
I might get blocked because

683
00:29:13,310 --> 00:29:15,550
they can have receive, receive,
and he's also trying

684
00:29:15,550 --> 00:29:17,470
to send multiple things, I might
get blocked in here.

685
00:29:17,470 --> 00:29:19,920
So I might get into this
criss-cross situation.

686
00:29:19,920 --> 00:29:22,200
If you are trying to send
something, but the other guy can't

687
00:29:22,200 --> 00:29:24,000
proceed to get to the receive.

688
00:29:24,000 --> 00:29:27,570
So even though the program
looks like there's a nice

689
00:29:27,570 --> 00:29:29,680
matching send and receive,
there's no cycles.

690
00:29:29,680 --> 00:29:33,280
There's a cycle created by the
resource usage in here.

691
00:29:33,280 --> 00:29:37,870
So, if you are doing a lot of sends
before a lot of receives, and

692
00:29:37,870 --> 00:29:39,670
vice versa, you have
to be careful.

693
00:29:39,670 --> 00:29:42,750
If you do too much of that,
it's nice to block things,

694
00:29:42,750 --> 00:29:43,610
move things up.

695
00:29:43,610 --> 00:29:46,120
But if you have too many things,
then you might end up

696
00:29:46,120 --> 00:29:47,230
in deadlock situation.

697
00:29:47,230 --> 00:29:49,402
Even though, traditionally,
it might not happen.
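
A minimal sketch of this resource-driven deadlock, again as a toy Python model and not real MPI. It assumes each direction has a fixed number of buffer slots and that a send blocks once those slots are full. The function `exchange_completes` and its parameters are hypothetical names for illustration.

```python
def exchange_completes(num_sends, buffer_slots):
    """Both ranks do `num_sends` sends and then `num_sends` receives.

    A send succeeds only if the destination has a free buffer slot.
    A receive succeeds only if a message is already buffered for this rank.
    Returns True if both ranks finish, False if both end up blocked.
    """
    ops = ["send"] * num_sends + ["recv"] * num_sends
    pc = [0, 0]          # next step of rank 0 and rank 1
    in_flight = [0, 0]   # in_flight[r] = messages buffered on the way to rank r
    progressed = True
    while progressed:
        progressed = False
        for me in (0, 1):
            other = 1 - me
            if pc[me] == len(ops):
                continue  # this rank is done
            if ops[pc[me]] == "send" and in_flight[other] < buffer_slots:
                in_flight[other] += 1
            elif ops[pc[me]] == "recv" and in_flight[me] > 0:
                in_flight[me] -= 1
            else:
                continue  # blocked on a full buffer or an empty one
            pc[me] += 1
            progressed = True
    return pc == [len(ops), len(ops)]


print(exchange_completes(num_sends=2, buffer_slots=2))  # sends fit in the buffers
print(exchange_completes(num_sends=3, buffer_slots=2))  # third send blocks on both sides
```

The program text is identical in both calls; only the amount of data relative to the buffer space decides whether the cycle appears.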

698
00:29:52,000 --> 00:29:55,620
So, you have a host of other
performance issues

699
00:29:55,620 --> 00:29:57,030
that you'll deal with.

700
00:29:57,030 --> 00:30:01,140
So let me address a couple
of them and where

701
00:30:01,140 --> 00:30:02,780
they might show up.

702
00:30:02,780 --> 00:30:06,620
So one big thing is
occupancy cost.

703
00:30:06,620 --> 00:30:14,320
Because, when you do shared
memory, you are basically

704
00:30:14,320 --> 00:30:15,240
issuing instructions.

705
00:30:15,240 --> 00:30:18,300
The instruction goes, executes, you
are done with it, and that

706
00:30:18,300 --> 00:30:19,550
operation is finished.

707
00:30:22,820 --> 00:30:26,250
When you are doing a message
passing, each

708
00:30:26,250 --> 00:30:27,070
message is very expensive.

709
00:30:27,070 --> 00:30:28,510
It has to do a lot of things.

710
00:30:28,510 --> 00:30:31,260
You have to do a context switch,
do a buffer copy, and

711
00:30:31,260 --> 00:30:34,580
a protocol stack processing,
and then might have to do

712
00:30:34,580 --> 00:30:36,220
another context switch
for that.

713
00:30:36,220 --> 00:30:39,180
And there's a lot of these
copying and stuff happening,

714
00:30:39,180 --> 00:30:43,880
and the network controller might
interrupt the operating

715
00:30:43,880 --> 00:30:46,640
system, because there's
data coming, copy the data in

716
00:30:46,640 --> 00:30:49,410
there, and send a signal
to the application.

717
00:30:49,410 --> 00:30:51,300
So there's this huge amount
of things happening

718
00:30:51,300 --> 00:30:52,790
just for one message.

719
00:30:52,790 --> 00:30:57,120
If you are sending like one
lousy byte, or even like one

720
00:30:57,120 --> 00:31:00,690
kilobyte, it is just doing
millions of instructions,

721
00:31:00,690 --> 00:31:03,180
just on behalf of small
amount of data.

722
00:31:03,180 --> 00:31:04,740
So that's a large
amount of cost

723
00:31:04,740 --> 00:31:05,990
associated with that overhead.

724
00:31:08,710 --> 00:31:10,920
So, setup overhead
is very high.

725
00:31:10,920 --> 00:31:14,370
And, so what you want to do is
you want to amortize the cost

726
00:31:14,370 --> 00:31:16,320
by sending large messages.

727
00:31:16,320 --> 00:31:18,190
So what you want to say is,
look, I'm not sending this

728
00:31:18,190 --> 00:31:20,390
small thing, I'm going to
accumulate a lot of things,

729
00:31:20,390 --> 00:31:22,820
I'm going to send everything
as bulk if you can.

730
00:31:22,820 --> 00:31:24,940
And then you can basically
amortize these

731
00:31:24,940 --> 00:31:26,190
costs of doing things.
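
A back-of-the-envelope sketch of the amortization argument. It assumes a simple cost model in which every message pays a fixed setup overhead and the payload then moves at a fixed bandwidth. The numbers (50 microseconds of overhead, 1 GB/s of bandwidth) are illustrative assumptions and do not come from the lecture.

```python
def transfer_time(total_bytes, message_bytes, overhead_s, bandwidth_bytes_per_s):
    """Time to move total_bytes in chunks of message_bytes.

    Every message pays overhead_s once; the payload moves at the given bandwidth.
    """
    n_messages = -(-total_bytes // message_bytes)  # ceiling division
    return n_messages * overhead_s + total_bytes / bandwidth_bytes_per_s


TOTAL = 1_000_000   # bytes to move (assumed)
OVERHEAD = 50e-6    # seconds of fixed cost per message (assumed)
BANDWIDTH = 1e9     # bytes per second (assumed)

small = transfer_time(TOTAL, 1_000, OVERHEAD, BANDWIDTH)  # 1000 small messages
bulk = transfer_time(TOTAL, TOTAL, OVERHEAD, BANDWIDTH)   # one bulk message
print(f"small messages: {small:.5f} s, one bulk message: {bulk:.5f} s")
```

Under these assumptions the payload time is the same in both cases; the whole difference is the per-message overhead paid a thousand times instead of once.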

732
00:31:28,700 --> 00:31:32,450
The other thing is communication
is excruciatingly slow.

733
00:31:32,450 --> 00:31:37,220
So even the memory system, it's
about probably a couple

734
00:31:37,220 --> 00:31:43,140
of hundred plus compared
to CPU communicating.

735
00:31:43,140 --> 00:31:47,190
In the cluster interconnect, you
can do tens of thousands

736
00:31:47,190 --> 00:31:53,880
of instructions in the CPU, by
the time it gets communicated.

737
00:31:53,880 --> 00:31:57,820
In a grid, or if you are going
through the internet, then

738
00:31:57,820 --> 00:31:58,910
it gets into the seconds now.

739
00:31:58,910 --> 00:31:59,790
You can actually feel it.

740
00:31:59,790 --> 00:32:01,895
And then your processor can run
millions of instructions

741
00:32:01,895 --> 00:32:03,220
in that time.

742
00:32:03,220 --> 00:32:06,440
And so if you are waiting for
something, you have to wait for

743
00:32:06,440 --> 00:32:07,460
a very long time.

744
00:32:07,460 --> 00:32:09,220
They might not have enough
things to stuff in the middle,

745
00:32:09,220 --> 00:32:12,420
to kind of amortize the cost.

746
00:32:12,420 --> 00:32:13,560
And then you have to
worry about that.

747
00:32:13,560 --> 00:32:17,180
So what that means is if you
start waiting for something to

748
00:32:17,180 --> 00:32:19,420
happen, you are waiting
for a long time.

749
00:32:19,420 --> 00:32:22,160
So you have to figure out
putting things in there.

750
00:32:22,160 --> 00:32:25,775
Not waiting, these non-blocking
things are very important

751
00:32:25,775 --> 00:32:28,310
because of that.

752
00:32:28,310 --> 00:32:37,300
And so normally what you want
to do is you want to have

753
00:32:37,300 --> 00:32:40,260
always split operations, that
means you want to kind of

754
00:32:40,260 --> 00:32:43,680
initiate something at some
point, very early on, and then

755
00:32:43,680 --> 00:32:45,870
use it later, especially if
you're looking for something

756
00:32:45,870 --> 00:32:46,770
like a receive.

757
00:32:46,770 --> 00:32:50,200
So if, I want to, especially
if I want to get some--

758
00:32:50,200 --> 00:32:53,180
normally in the shared memory,
just kind of doing a simple

759
00:32:53,180 --> 00:32:55,690
thing, asking something and
get replies, very simple.

760
00:32:55,690 --> 00:32:58,370
Here, because if you just do
that, there's a huge waiting

761
00:32:58,370 --> 00:33:01,230
time, so you want to kind
of do split operations.

762
00:33:01,230 --> 00:33:03,390
So, this code can be
very complicated,

763
00:33:03,390 --> 00:33:04,830
because you try to--

764
00:33:04,830 --> 00:33:07,240
if you want to ask something,
you ask very early, you do a

765
00:33:07,240 --> 00:33:10,585
lot of other things before
the reply comes.

766
00:33:10,585 --> 00:33:12,490
AUDIENCE: [INAUDIBLE].

767
00:33:12,490 --> 00:33:13,740
PROFESSOR: Oops.

768
00:33:16,750 --> 00:33:17,220
OK.

769
00:33:17,220 --> 00:33:20,330
Let's see how many times
I had to press that.

770
00:33:20,330 --> 00:33:21,580
Before I realize I
have to reboot.

771
00:33:24,810 --> 00:33:28,260
So, if you want to
rendezvous with--

772
00:33:28,260 --> 00:33:30,480
normally, what that means is
that two points have to kind of

773
00:33:30,480 --> 00:33:31,890
synchronize at the same point.

774
00:33:31,890 --> 00:33:34,520
So what you want to do is you
can do a three-way: sender sends

775
00:33:34,520 --> 00:33:38,055
a request, receiver acks with,
OK, it's OK to send, and the

776
00:33:38,055 --> 00:33:39,130
sender delivers the data.

777
00:33:39,130 --> 00:33:41,140
So that means I have to do a
three-way communication.

778
00:33:41,140 --> 00:33:43,620
Or, there's the alternative with
two-way: the sender doesn't

779
00:33:43,620 --> 00:33:46,560
send anything, the receiver
basically sends a request, and

780
00:33:46,560 --> 00:33:47,270
then you send the data.

781
00:33:47,270 --> 00:33:50,930
So this could be faster because
there's less things to

782
00:33:50,930 --> 00:33:53,780
do in here.
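
The two handshakes just described differ only in how many control messages come before the data. A small Python sketch that writes each one out as a message trace; the function names and the message labels (`"request-to-send"`, `"ok-to-send"`, `"request-data"`) are hypothetical and chosen only for illustration.

```python
def three_way(payload):
    """Sender asks first, receiver grants, sender delivers."""
    return [
        ("sender", "receiver", "request-to-send"),
        ("receiver", "sender", "ok-to-send"),
        ("sender", "receiver", payload),
    ]


def two_way(payload):
    """Receiver asks for the data, sender delivers."""
    return [
        ("receiver", "sender", "request-data"),
        ("sender", "receiver", payload),
    ]


for name, trace in (("three-way", three_way("data")), ("two-way", two_way("data"))):
    print(name, len(trace), "messages")
```

Since each message crosses a slow network, dropping one of the three trips is where the two-way version can save time.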

783
00:33:53,780 --> 00:33:57,480
There's another method called
RMA, or it's another name you

784
00:33:57,480 --> 00:33:59,180
might see, it's called
active messages.

785
00:33:59,180 --> 00:34:02,170
Where you don't ask
the receiver.

786
00:34:02,170 --> 00:34:04,995
When you send, you have some
pre-assigned place you can go

787
00:34:04,995 --> 00:34:05,750
and dump the data.

788
00:34:05,750 --> 00:34:07,120
So you don't wait for somebody
to ask for data.

789
00:34:07,120 --> 00:34:09,585
You say, OK, if I want to send,
I'll immediately send, and put

790
00:34:09,585 --> 00:34:10,260
it somewhere.

791
00:34:10,260 --> 00:34:14,840
And the data that I send
the place in here.

792
00:34:14,840 --> 00:34:19,900
So, the first slide I showed
you, all at that time, you saw

793
00:34:19,900 --> 00:34:23,330
all these big difference in
the time, either this can

794
00:34:23,330 --> 00:34:26,210
happen, the message can go very
fast, or sometimes it will

795
00:34:26,210 --> 00:34:27,080
be really slow.

796
00:34:27,080 --> 00:34:30,889
There's a big variation
in here.

797
00:34:30,889 --> 00:34:35,920
This happens basically because of
the network communications.

798
00:34:35,920 --> 00:34:38,454
How many of you know a
little bit about TCP?

799
00:34:41,179 --> 00:34:42,429
OK.

800
00:34:46,860 --> 00:34:49,840
Let me just explain about
five minutes of TCP

801
00:34:49,840 --> 00:34:50,889
before I move on.

802
00:34:50,889 --> 00:34:55,600
So TCP is one of the main
protocols that we use to

803
00:34:55,600 --> 00:34:56,639
communicate over the Internet.

804
00:34:56,639 --> 00:34:59,730
And two things, you want
to actually send data.

805
00:34:59,730 --> 00:35:03,100
But also, you want to
be a good citizen.

806
00:35:03,100 --> 00:35:07,700
You have to actually work in a
way that doesn't really

807
00:35:07,700 --> 00:35:10,320
take over the entire shared
bandwidth you have.

808
00:35:10,320 --> 00:35:13,120
So what TCP does,
it has a window.

809
00:35:13,120 --> 00:35:16,320
So what it says is, OK, I can
send a certain amount of data

810
00:35:16,320 --> 00:35:20,780
that's the size of the window,
but I can't move beyond that

811
00:35:20,780 --> 00:35:22,640
until I get some
acknowledgement.

812
00:35:22,640 --> 00:35:24,460
So I send the window
amount of data.

813
00:35:24,460 --> 00:35:26,650
And, on the other side, when
it has received that data, it

814
00:35:26,650 --> 00:35:28,600
says, I have seen this
much of the window.

815
00:35:28,600 --> 00:35:32,060
And once it's seen this much of
the window, it sends that

816
00:35:32,060 --> 00:35:33,260
acknowledgement back.

817
00:35:33,260 --> 00:35:34,450
And when that acknowledgement
comes, you

818
00:35:34,450 --> 00:35:35,670
say, aha, that's good.

819
00:35:35,670 --> 00:35:36,700
That means it has seen that.

820
00:35:36,700 --> 00:35:37,480
Then I can send more.

821
00:35:37,480 --> 00:35:38,630
I keep sending more.

822
00:35:38,630 --> 00:35:41,070
And then, the TCP has this very
interesting property.

823
00:35:41,070 --> 00:35:43,410
If you are doing a really good
communication, things are

824
00:35:43,410 --> 00:35:46,490
going very nicely, it starts
increasing the window size.

825
00:35:46,490 --> 00:35:49,300
It says, oh, OK, that means that
window size keeps going.

826
00:35:49,300 --> 00:35:51,000
I can do bigger and bigger
and bigger window size.

827
00:35:51,000 --> 00:35:52,810
You can keep increasing
the window size.

828
00:35:52,810 --> 00:35:57,130
And then, what happens at some
point, the system gets

829
00:35:57,130 --> 00:35:59,200
overloaded, because if
everyone starts to

830
00:35:59,200 --> 00:36:02,080
increase their window size, too
many packets start coming

831
00:36:02,080 --> 00:36:02,720
into the network.

832
00:36:02,720 --> 00:36:04,710
At some point, the network in
the middle, doesn't have

833
00:36:04,710 --> 00:36:05,530
enough room.

834
00:36:05,530 --> 00:36:06,780
It drops something.

835
00:36:06,780 --> 00:36:09,720
So, the nice thing about the
network is that it

836
00:36:09,720 --> 00:36:11,520
doesn't have to guarantee
delivery; it can just

837
00:36:11,520 --> 00:36:12,380
drop something.

838
00:36:12,380 --> 00:36:15,260
And when it drops, what happens
is the other guy is

839
00:36:15,260 --> 00:36:15,960
waiting for acknowledgement.

840
00:36:15,960 --> 00:36:18,670
So waiting for data to come,
it never shows up.

841
00:36:18,670 --> 00:36:21,840
So then it has this thing, an
ack, saying I never got it.

842
00:36:21,840 --> 00:36:25,960
And the problem with that is, so
you have a nice bandwidth,

843
00:36:25,960 --> 00:36:28,010
and you keep increasing the
bandwidth, you get faster and

844
00:36:28,010 --> 00:36:30,730
faster and faster, and suddenly
data gets missed.

845
00:36:30,730 --> 00:36:33,610
And suddenly you have this
big timeout delay.

846
00:36:33,610 --> 00:36:35,060
And then everybody freezes.

847
00:36:35,060 --> 00:36:37,980
And then you get the ack, and you
restart with a smaller window

848
00:36:37,980 --> 00:36:39,660
and slowly pick up, something
like that.

849
00:36:39,660 --> 00:36:42,500
So because of that, a lot
of times what happens is

850
00:36:42,500 --> 00:36:46,300
a packet gets dropped, a
retransmit happens, so you

851
00:36:46,300 --> 00:36:47,820
have this kind of sawtooth
pattern.

852
00:36:47,820 --> 00:36:51,400
Things get faster and faster,
things go down to nothing,

853
00:36:51,400 --> 00:36:56,430
for a little while, again
start again, in here.
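
A toy Python sketch of the sawtooth. It assumes a heavily simplified rule: the window grows by one each round, and once it exceeds what the network can carry it is cut in half. Real TCP congestion control has more phases than this. The function `window_history` and the `capacity` parameter are hypothetical.

```python
def window_history(capacity, rounds, start=1):
    """Return the window size used in each round.

    capacity: largest window the network carries without dropping (assumed).
    The window grows by 1 per round; after a round that overshoots capacity,
    it is halved and the climb starts again.
    """
    window = start
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            window = max(1, window // 2)  # a drop was detected: back off
        else:
            window += 1                   # things are going well: grow
    return history


print(window_history(capacity=8, rounds=12))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6]
```

Plotting the returned list gives the climb, drop, climb shape described above.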

854
00:36:56,430 --> 00:37:01,420
So the other way of
communicating is called UDP.

855
00:37:01,420 --> 00:37:04,030
UDP says, OK, if you don't
have any kind of

856
00:37:04,030 --> 00:37:06,450
acknowledgement, or something
like that, I'll just send.

857
00:37:06,450 --> 00:37:08,870
And you, on the other end,
figure out whether you

858
00:37:08,870 --> 00:37:10,050
got something or not.

859
00:37:10,050 --> 00:37:11,820
And send information back.

860
00:37:11,820 --> 00:37:15,190
So the network doesn't
participate in any kind of a

861
00:37:15,190 --> 00:37:16,750
balancing act of
communication.

862
00:37:16,750 --> 00:37:17,830
It's end-to-end.

863
00:37:17,830 --> 00:37:20,220
So of course, you can be a
really bad citizen, and say,

864
00:37:20,220 --> 00:37:20,880
OK, I don't care.

865
00:37:20,880 --> 00:37:23,210
I just keep sending huge
amount of data and then

866
00:37:23,210 --> 00:37:24,370
somebody would go well.

867
00:37:24,370 --> 00:37:31,300
But, what people have found is
for things like video, you can

868
00:37:31,300 --> 00:37:33,650
send UDP and kind of
manipulate yourself

869
00:37:33,650 --> 00:37:35,560
end-to-end, can get
much better than

870
00:37:35,560 --> 00:37:38,110
trying to do this TCP.

871
00:37:38,110 --> 00:37:40,900
So the kind of thing is,
even though there's no

872
00:37:40,900 --> 00:37:45,930
acknowledgment, or there's no
real attempt to make sure all

873
00:37:45,930 --> 00:37:49,420
the data goes, UDP sometimes
can get better bandwidth,

874
00:37:49,420 --> 00:37:52,540
because it doesn't drop
packets in here.

875
00:37:52,540 --> 00:37:55,140
So, there's a lot of great
stuff, I mean you guys can

876
00:37:55,140 --> 00:38:01,230
take the networking class and
learn all about the protocols

877
00:38:01,230 --> 00:38:01,740
and stuff like that.

878
00:38:01,740 --> 00:38:07,240
There's really, really cool
stuff in here, so some of you

879
00:38:07,240 --> 00:38:10,890
might actually, next couple of
semesters, learn all about how

880
00:38:10,890 --> 00:38:11,790
these things work.

881
00:38:11,790 --> 00:38:16,010
So I'm just giving you,
performance-wise, these are

882
00:38:16,010 --> 00:38:18,390
the issues that you have to worry
about when you are doing

883
00:38:18,390 --> 00:38:21,040
network level things.

884
00:38:21,040 --> 00:38:21,810
OK.

885
00:38:21,810 --> 00:38:27,170
So that talks
a little bit about the small

886
00:38:27,170 --> 00:38:31,456
scale, and then if you want to
go to next bigger scale--

887
00:38:31,456 --> 00:38:33,160
why would you want to go?

888
00:38:33,160 --> 00:38:35,030
There can be a lot more users.

889
00:38:35,030 --> 00:38:38,790
If you are on something like
Facebook, or Amazon, you have

890
00:38:38,790 --> 00:38:41,530
a lot more users to deal with.

891
00:38:41,530 --> 00:38:49,970
If you are, what's a good
one with a lot of data?

892
00:38:52,560 --> 00:38:53,510
Google Earth, or something
like that.

893
00:38:53,510 --> 00:38:55,040
You have a lot of data.

894
00:38:55,040 --> 00:38:57,050
And you have to deal with all
the data, and that's a good

895
00:38:57,050 --> 00:39:00,040
way to do the scale up.

896
00:39:00,040 --> 00:39:02,370
Or you might have a huge amount of
processing you want to do,

897
00:39:02,370 --> 00:39:05,970
for example, the one place a lot
of data and processing is

898
00:39:05,970 --> 00:39:15,230
things like these new
telescopes that are

899
00:39:15,230 --> 00:39:17,630
coming about, that have arrays
of hundreds of different

900
00:39:17,630 --> 00:39:19,880
things, so you have huge amount
of data coming from the

901
00:39:19,880 --> 00:39:20,530
telescopes.

902
00:39:20,530 --> 00:39:23,130
And then you could do a huge
amount of processing on that.

903
00:39:23,130 --> 00:39:27,140
So that, basically, has both
data and processing, and in

904
00:39:27,140 --> 00:39:30,450
things like webs, social
networks, and stuff like that,

905
00:39:30,450 --> 00:39:32,070
gives me a lot of data.

906
00:39:32,070 --> 00:39:34,860
So here are some examples of
some things like the airline

907
00:39:34,860 --> 00:39:36,050
reservation system.

908
00:39:36,050 --> 00:39:38,200
It's something, all the airlines
have to assign

909
00:39:38,200 --> 00:39:41,770
millions of planes, flights,
millions of seats

910
00:39:41,770 --> 00:39:42,790
that you deal with.

911
00:39:42,790 --> 00:39:45,850
Things like a stock trading
system where all the trades have

912
00:39:45,850 --> 00:39:50,150
to come there, and the
prices have to get calculated,

913
00:39:50,150 --> 00:39:52,030
and then trades have
to get validated.

914
00:39:52,030 --> 00:39:55,500
And, very big analysis, so you
form some kind of global

915
00:39:55,500 --> 00:39:57,250
understanding of what's
going on.

916
00:39:57,250 --> 00:40:00,290
And I'm going to talk about
these two, three things too.

917
00:40:00,290 --> 00:40:02,970
Scene completion and web
search, which probably

918
00:40:02,970 --> 00:40:04,260
everybody knows.

919
00:40:04,260 --> 00:40:09,220
So, yes, this kind of
data, now, kind of a

920
00:40:09,220 --> 00:40:10,880
web analysis example.

921
00:40:10,880 --> 00:40:16,000
So what these guys were trying
to do was, every week, crawl

922
00:40:16,000 --> 00:40:21,250
151 million web pages, and
get about a terabyte of

923
00:40:21,250 --> 00:40:24,960
information, and analyze
page statistics.

924
00:40:24,960 --> 00:40:25,840
So that's what they
are trying to do.

925
00:40:25,840 --> 00:40:28,430
So come up with some
idea about, OK, what

926
00:40:28,430 --> 00:40:29,310
is happening in the world?

927
00:40:29,310 --> 00:40:31,540
How did the pages change
the last week?

928
00:40:31,540 --> 00:40:34,620
And then try to get a
global view of that.

929
00:40:34,620 --> 00:40:37,990
At this point, you have both
huge amount of data and pretty

930
00:40:37,990 --> 00:40:40,050
large amount of computation
power, that you had to build a

931
00:40:40,050 --> 00:40:41,100
system to do that.

932
00:40:41,100 --> 00:40:43,120
This is where you need
a larger system.

933
00:40:43,120 --> 00:40:47,670
Here's another interesting
system that people built.

934
00:40:47,670 --> 00:40:51,660
So if you have an image here, and
the image has this nice

935
00:40:51,660 --> 00:40:53,960
background, there's
an unfortunate house

936
00:40:53,960 --> 00:40:55,190
sitting in the front.

937
00:40:55,190 --> 00:40:58,710
So what this says is, OK,
eliminate the house, search a

938
00:40:58,710 --> 00:41:04,180
very large database to find
similar images and plop

939
00:41:04,180 --> 00:41:05,430
something in there.

940
00:41:07,480 --> 00:41:08,650
OK.

941
00:41:08,650 --> 00:41:09,780
So OK.

942
00:41:09,780 --> 00:41:15,170
You can get your face with
some nice actual eyes, or

943
00:41:15,170 --> 00:41:15,730
something like that.

944
00:41:15,730 --> 00:41:18,420
Just eliminate all the bad
parts, and then and then get

945
00:41:18,420 --> 00:41:20,810
good parts and put
them in there.

946
00:41:20,810 --> 00:41:25,780
And so this one, basically, what
they did was, there's

947
00:41:25,780 --> 00:41:30,780
about 396 gigabytes of
images out there.

948
00:41:30,780 --> 00:41:33,640
And so we had to classify
images to get the scene

949
00:41:33,640 --> 00:41:37,500
detector, do color similarity,
and do context matching.

950
00:41:37,500 --> 00:41:41,470
So computation, what they're
doing is about 50 minutes

951
00:41:41,470 --> 00:41:45,300
doing scene matching, 20 minutes
of local matching

952
00:41:45,300 --> 00:41:49,340
trying to find right matching,
and four minutes composing

953
00:41:49,340 --> 00:41:52,020
there, and then you can
parallelize that and reduce

954
00:41:52,020 --> 00:41:54,160
this time to about
five minutes.

955
00:41:54,160 --> 00:41:55,890
So here's something that's
huge amount of data.

956
00:41:55,890 --> 00:41:58,470
You look at a lot of things, you
do a lot of processing to

957
00:41:58,470 --> 00:42:03,640
figure out we get the right
thing and these actually keep

958
00:42:03,640 --> 00:42:06,880
increasing these images as we
keep asking for more, more

959
00:42:06,880 --> 00:42:09,270
flexibility, and
more accuracy.

960
00:42:09,270 --> 00:42:11,040
Things can get higher
and higher.

961
00:42:11,040 --> 00:42:14,740
So a really cool application that
really requires large data

962
00:42:14,740 --> 00:42:15,970
and large processing.

963
00:42:15,970 --> 00:42:18,820
So, of course, the kind of
canonical application is

964
00:42:18,820 --> 00:42:19,940
probably Google.

965
00:42:19,940 --> 00:42:22,840
So in this search, you'll
get some nice results.

966
00:42:22,840 --> 00:42:25,460
So what people say is that
about two thousand processors are

967
00:42:25,460 --> 00:42:28,710
involved in getting this
query for you.

968
00:42:28,710 --> 00:42:33,070
It takes 200 plus terabytes of
data, but this is already old

969
00:42:33,070 --> 00:42:35,730
now, this could be
even higher now.

970
00:42:35,730 --> 00:42:38,680
And this takes ten to the ten
total clock cycles for

971
00:42:38,680 --> 00:42:40,730
everything that needs
to happen, for you

972
00:42:40,730 --> 00:42:43,460
to get to your query.

973
00:42:43,460 --> 00:42:45,430
And you only get one
cent for the query.

974
00:42:45,430 --> 00:42:47,870
So not only are you doing it
fast, you are doing a lot of

975
00:42:47,870 --> 00:42:49,820
processing, you are doing
it very cheap.

976
00:42:49,820 --> 00:42:53,910
And I think one of the biggest
things that Google did is

977
00:42:53,910 --> 00:42:57,190
figure how to get that
done fast and cheap.

978
00:42:57,190 --> 00:43:02,010
And that's why they are
so successful.

979
00:43:02,010 --> 00:43:03,670
Oops, sorry, I didn't say
this. So it's one second

980
00:43:03,670 --> 00:43:06,740
response time, and the cheapest
$0.05 average the

981
00:43:06,740 --> 00:43:08,750
cost, basically.

982
00:43:08,750 --> 00:43:11,740
If your compute time is
going to cost more than $0.05,

983
00:43:11,740 --> 00:43:12,850
it's not worth it.

984
00:43:12,850 --> 00:43:15,090
And you had to do it that.

985
00:43:15,090 --> 00:43:19,000
So Google spent a lot
of time figuring out

986
00:43:19,000 --> 00:43:20,250
how to do this cheaply.

987
00:43:23,110 --> 00:43:27,010
So, this is already validated,
but Google is very secretive

988
00:43:27,010 --> 00:43:30,720
of what they do, so this is the
closest I can figure out.

989
00:43:30,720 --> 00:43:35,590
They have three million plus
processors in clusters of 2000

990
00:43:35,590 --> 00:43:37,660
plus processors
in each cluster.

991
00:43:37,660 --> 00:43:39,610
And what they already did
was they went for

992
00:43:39,610 --> 00:43:41,070
the cheapest thing.

993
00:43:41,070 --> 00:43:43,940
They built the entire system
out of the cheapest

994
00:43:43,940 --> 00:43:45,095
parts they could get.

995
00:43:45,095 --> 00:43:48,040
x86 processors, the cheapest
disks, fairly cheap

996
00:43:48,040 --> 00:43:51,670
communication, and they
gain reliability and

997
00:43:51,670 --> 00:43:53,340
redundancy through software.

998
00:43:53,340 --> 00:43:56,940
So each part, I mean supposing
in Google, this data center,

999
00:43:56,940 --> 00:43:59,220
there's somebody who's
constantly going around and

1000
00:43:59,220 --> 00:44:00,980
changing machines and
changing disks.

1001
00:44:00,980 --> 00:44:03,700
Because there's so
much failure.

1002
00:44:03,700 --> 00:44:06,800
But that means we have to have
the software system to keep

1003
00:44:06,800 --> 00:44:08,880
the things running in there.

1004
00:44:08,880 --> 00:44:11,520
And what they have is a
partitioned workload, all

1005
00:44:11,520 --> 00:44:13,580
those things are nicely
partitioned and distributed

1006
00:44:13,580 --> 00:44:18,110
and Google has this nice file
system and stuff to do that.

1007
00:44:18,110 --> 00:44:20,160
And then you have to do
crawling, index generation,

1008
00:44:20,160 --> 00:44:22,850
index search, document
retrieval, ad placement, all

1009
00:44:22,850 --> 00:44:24,350
those things happen in there.

1010
00:44:24,350 --> 00:44:27,210
Of course, other things like
Microsoft and Yahoo, and all

1011
00:44:27,210 --> 00:44:29,690
those other people have
systems like that.

1012
00:44:29,690 --> 00:44:34,030
So this is kind of what, when
you go in here to scale up,

1013
00:44:34,030 --> 00:44:35,940
there's no other way, you have
to actually build this huge

1014
00:44:35,940 --> 00:44:37,490
system to do that.

1015
00:44:37,490 --> 00:44:40,575
So one thing Google does, going
a little bit technical,

1016
00:44:40,575 --> 00:44:43,040
is they have a system
called MapReduce.

1017
00:44:43,040 --> 00:44:45,150
How many of you have seen,
heard of MapReduce?

1018
00:44:45,150 --> 00:44:45,610
OK.

1019
00:44:45,610 --> 00:44:47,630
So there are all these people
who know MapReduce.

1020
00:44:47,630 --> 00:44:49,870
Probably more than I do.

1021
00:44:49,870 --> 00:44:53,980
So the idea there is you have a
bunch of data, a huge amount

1022
00:44:53,980 --> 00:44:56,660
of data in here.

1023
00:44:56,660 --> 00:44:59,220
And, normally, what you have
to do is find some

1024
00:44:59,220 --> 00:45:01,640
similarities in lot of
data, and do some

1025
00:45:01,640 --> 00:45:02,780
processing for that.

1026
00:45:02,780 --> 00:45:08,750
And this is a programming model
set up nicely to help with that.

1027
00:45:08,750 --> 00:45:11,800
So this borrows a lot from
functional programming.

1028
00:45:11,800 --> 00:45:15,710
What that means is I'm not
changing data, I'm always

1029
00:45:15,710 --> 00:45:18,730
taking some data values and
creating something new.

1030
00:45:18,730 --> 00:45:21,460
I'm never changing something
existing, that's basically

1031
00:45:21,460 --> 00:45:23,220
meaning of a functional
program.

1032
00:45:23,220 --> 00:45:24,900
So MapReduce has
two components.

1033
00:45:24,900 --> 00:45:26,240
First the map.

1034
00:45:26,240 --> 00:45:33,000
That means given some input
value and a key in there, what

1035
00:45:33,000 --> 00:45:36,690
you generate is some
intermediate results and

1036
00:45:36,690 --> 00:45:39,670
an output key.

1037
00:45:39,670 --> 00:45:43,180
You get a bunch of values coming
through, and you process

1038
00:45:43,180 --> 00:45:43,930
each one separately.

1039
00:45:43,930 --> 00:45:44,370
And say, OK.

1040
00:45:44,370 --> 00:45:46,370
So here is the output
key, and here's some

1041
00:45:46,370 --> 00:45:47,550
intermediate value.

1042
00:45:47,550 --> 00:45:51,700
And then what you do is things
with the same output key get

1043
00:45:51,700 --> 00:45:54,010
sorted into one list.

1044
00:45:54,010 --> 00:45:56,140
And then it goes to reduce.

1045
00:45:56,140 --> 00:45:59,480
And the reducer takes the output
key and this list, and

1046
00:45:59,480 --> 00:46:02,420
says, OK, look, I'm going to
process the entire list down

1047
00:46:02,420 --> 00:46:06,110
to one element or
small data item.

1048
00:46:06,110 --> 00:46:06,700
OK?

1049
00:46:06,700 --> 00:46:09,140
So let's go through
a little bit more,

1050
00:46:09,140 --> 00:46:10,550
digging deep into that.

1051
00:46:10,550 --> 00:46:18,200
And so the map basically gets a
huge amount of records from

1052
00:46:18,200 --> 00:46:25,360
the data source, and it feeds
into this map function, and it

1053
00:46:25,360 --> 00:46:28,250
produces intermediate results.

1054
00:46:28,250 --> 00:46:31,250
And the reduce function,
basically, combines the data,

1055
00:46:31,250 --> 00:46:35,060
and all the folding--

1056
00:46:35,060 --> 00:46:36,030
let me give you an example.

1057
00:46:36,030 --> 00:46:37,190
I think that will
show you better.

1058
00:46:37,190 --> 00:46:38,730
So here is the kind of
architecture.

1059
00:46:38,730 --> 00:46:40,590
So you have a huge number
of data sources.

1060
00:46:40,590 --> 00:46:43,380
You have many, many
sources in here.

1061
00:46:43,380 --> 00:46:46,010
And each piece of data comes in
to that, and the map will

1062
00:46:46,010 --> 00:46:48,430
basically distribute it by keys
and values, so there could be

1063
00:46:48,430 --> 00:46:49,960
millions of values.

1064
00:46:49,960 --> 00:46:53,480
And then, what you have to do
is, wait until all the data,

1065
00:46:53,480 --> 00:46:54,770
has done that.

1066
00:46:54,770 --> 00:46:57,560
And then create, for
the number of keys

1067
00:46:57,560 --> 00:46:59,620
here, a number of reducers.

1068
00:46:59,620 --> 00:47:01,360
So hopefully you will
have a lot of keys.

1069
00:47:01,360 --> 00:47:03,170
If you have only two
keys, you don't get that

1070
00:47:03,170 --> 00:47:05,940
parallelism because then you
would have two huge lists.

1071
00:47:05,940 --> 00:47:09,780
And then, again, what happens
is these keys get paired to

1072
00:47:09,780 --> 00:47:13,660
reducers to compute the final
value in here.

1073
00:47:13,660 --> 00:47:15,100
So what's the parallelism
here?

1074
00:47:15,100 --> 00:47:17,780
What makes the parallelism
go high?

1075
00:47:17,780 --> 00:47:20,090
Or, not have enough
parallelism?

1076
00:47:23,290 --> 00:47:25,810
Yeah, I mean, first of all,
you need to have enough,

1077
00:47:25,810 --> 00:47:28,670
hopefully, multiple data stores
so you get a lot of

1078
00:47:28,670 --> 00:47:29,710
parallelism coming in here.

1079
00:47:29,710 --> 00:47:33,660
Map is easily parallelizable,
because each one is separate in here.

1080
00:47:33,660 --> 00:47:35,820
Reducer is the problematic
one, I think.

1081
00:47:35,820 --> 00:47:38,280
Because if you have too many
keys, or too few keys, you are

1082
00:47:38,280 --> 00:47:39,820
in trouble.

1083
00:47:39,820 --> 00:47:42,000
The other interesting thing
in here is there's a big

1084
00:47:42,000 --> 00:47:43,990
shuffling between here.

1085
00:47:43,990 --> 00:47:46,470
So that means data has to
go all over the place.

1086
00:47:46,470 --> 00:47:49,590
So it's not something where you
get data and you process

1087
00:47:49,590 --> 00:47:52,040
it to the end;
every piece of data has

1088
00:47:52,040 --> 00:47:54,210
to kind of cross back, and so
that's a huge amount of

1089
00:47:54,210 --> 00:47:56,360
communication in here that could
be bottle-necked too.

1090
00:47:56,360 --> 00:48:00,200
That can be bottle-necked, keys
can be bottle-necked.

1091
00:48:00,200 --> 00:48:03,160
So map functions run in parallel,
creating different things.

1092
00:48:03,160 --> 00:48:06,040
Reduce functions also run
in parallel for each key.

1093
00:48:06,040 --> 00:48:08,470
And all values are basically
processed independently

1094
00:48:08,470 --> 00:48:10,860
because of that.

1095
00:48:10,860 --> 00:48:13,640
Also, the bottleneck is that the reduce
phase can't start until

1096
00:48:13,640 --> 00:48:15,360
all the map is done, and also
all the data gets shuffled

1097
00:48:15,360 --> 00:48:16,830
around with that.

1098
00:48:16,830 --> 00:48:18,050
So here's an interesting
example.

1099
00:48:18,050 --> 00:48:20,630
What I am trying to do is I am
trying to count the number of

1100
00:48:20,630 --> 00:48:25,450
words in, assume, a huge
number of web pages.

1101
00:48:25,450 --> 00:48:30,800
So what I can do is in the map,
I get each page in here--

1102
00:48:30,800 --> 00:48:35,660
I thread through the page
emitting each word as my key

1103
00:48:35,660 --> 00:48:37,230
and the count as one.

1104
00:48:37,230 --> 00:48:40,020
Because I only get one thing.

1105
00:48:40,020 --> 00:48:42,930
And then my reducer
is basically--

1106
00:48:42,930 --> 00:48:44,770
my key is each word.

1107
00:48:44,770 --> 00:48:48,050
So if I have a million words, I
can have a million reducers.

1108
00:48:48,050 --> 00:48:50,580
And the reducer basically takes
all those things-- it's

1109
00:48:50,580 --> 00:48:52,250
not that fun because
everything is

1110
00:48:52,250 --> 00:48:53,290
all at number one.

1111
00:48:53,290 --> 00:48:54,550
Because we count at one.

1112
00:48:54,550 --> 00:48:58,410
And then basically keep adding
up how many things for each

1113
00:48:58,410 --> 00:49:01,200
word came about and put
up the results.

1114
00:49:01,200 --> 00:49:04,310
So you can say, OK, look, for
the entire corpus of data, I

1115
00:49:04,310 --> 00:49:08,270
had this many words count, this
many word occurrences,

1116
00:49:08,270 --> 00:49:09,990
this many all for each word.

1117
00:49:09,990 --> 00:49:12,160
You get a word count.

1118
00:49:12,160 --> 00:49:16,960
So basically trying to create
a histogram here and

1119
00:49:16,960 --> 00:49:20,500
MapReduce provides a very nice
interface to do that.

1120
00:49:20,500 --> 00:49:24,560
And it's very nice, high level,
and it provides this

1121
00:49:24,560 --> 00:49:32,200
nice infrastructure to run this
in parallel in here and

1122
00:49:32,200 --> 00:49:37,230
do all the communication
necessary, figure out how many

1123
00:49:37,230 --> 00:49:40,040
reducers to run, look at
machines to run them, produce

1124
00:49:40,040 --> 00:49:41,780
the result, and give
you the result.

1125
00:49:41,780 --> 00:49:43,372
So this is a nice infrastructure
Google has

1126
00:49:43,372 --> 00:49:44,622
built in there.

1127
00:49:48,470 --> 00:49:51,390
So in this level, when
you go to this--

1128
00:49:51,390 --> 00:49:53,130
this is the data center level.

1129
00:49:53,130 --> 00:49:55,790
What do you have
to do to scale?

1130
00:49:55,790 --> 00:49:58,030
You need to distribute data.

1131
00:49:58,030 --> 00:50:01,290
And you need to parallelize
because if all the data is in

1132
00:50:01,290 --> 00:50:02,730
one machine, it doesn't help.

1133
00:50:02,730 --> 00:50:04,580
And you need to have parallelism
to scale

1134
00:50:04,580 --> 00:50:06,150
everything.

1135
00:50:06,150 --> 00:50:10,040
Another interesting thing you
can do is approximate.

1136
00:50:10,040 --> 00:50:15,090
So what that means is normally
when you calculate, when

1137
00:50:15,090 --> 00:50:19,260
everybody has exactly the same
data all the time-- because

1138
00:50:19,260 --> 00:50:21,520
when you write the memory,
everybody sees that memory--

1139
00:50:21,520 --> 00:50:24,140
you have the perfect knowledge
of the world.

1140
00:50:24,140 --> 00:50:26,810
And in a distributed system,
getting perfect knowledge is

1141
00:50:26,810 --> 00:50:27,340
very expensive.

1142
00:50:27,340 --> 00:50:29,425
That means every time something
changes, you have to

1143
00:50:29,425 --> 00:50:31,130
send everybody that data.

1144
00:50:31,130 --> 00:50:35,020
And one way that people really
make these systems run fast,

1145
00:50:35,020 --> 00:50:37,145
you say, wait a minute, if
somebody doesn't have the

1146
00:50:37,145 --> 00:50:39,930
perfect knowledge, if there's
a little bit of discrepancy

1147
00:50:39,930 --> 00:50:43,860
between something, I am OK.

1148
00:50:43,860 --> 00:50:45,772
Assume you have a new--

1149
00:50:45,772 --> 00:50:48,970
you changed your--

1150
00:50:48,970 --> 00:50:49,720
we'll say--

1151
00:50:49,720 --> 00:50:53,320
web page and added a couple
of new words in there.

1152
00:50:53,320 --> 00:50:55,970
Next second, the search
doesn't see it.

1153
00:50:55,970 --> 00:50:57,630
Nobody's going to complain.

1154
00:50:57,630 --> 00:51:00,980
And then you can deliver that.

1155
00:51:00,980 --> 00:51:02,690
If you do a search, you'll
find something.

1156
00:51:02,690 --> 00:51:05,030
But somebody else doesn't
because that data hasn't

1157
00:51:05,030 --> 00:51:07,350
propagated to both
of the things.

1158
00:51:07,350 --> 00:51:09,270
Nobody's going to complain and
say, wait a minute, I found

1159
00:51:09,270 --> 00:51:11,210
it, but he didn't.

1160
00:51:11,210 --> 00:51:13,140
It can have a little
bit of a lag.

1161
00:51:13,140 --> 00:51:15,900
And that can be really,
really useful in

1162
00:51:15,900 --> 00:51:16,880
these kind of systems.

1163
00:51:16,880 --> 00:51:19,492
Because every time something
happened, you don't have to

1164
00:51:19,492 --> 00:51:20,830
keep updating.

1165
00:51:20,830 --> 00:51:23,960
But tell me a system where you
can't actually do that,

1166
00:51:26,550 --> 00:51:31,972
play it a little bit
fast and loose.

1167
00:51:31,972 --> 00:51:33,210
AUDIENCE: Stock trading.

1168
00:51:33,210 --> 00:51:33,560
PROFESSOR: Stock trading.

1169
00:51:33,560 --> 00:51:36,940
Yeah, that's something
basically, if you say, yeah,

1170
00:51:36,940 --> 00:51:38,910
you might get it too,
you might get it--

1171
00:51:38,910 --> 00:51:40,680
and that doesn't work.

1172
00:51:40,680 --> 00:51:44,130
Basically, stock trading has
this very particular thing

1173
00:51:44,130 --> 00:51:50,195
because when you submit,
we'll say, a sell order,

1174
00:51:50,195 --> 00:51:53,410
within a certain amount of time,
it has to get matched up

1175
00:51:53,410 --> 00:51:56,860
and has to be announced
to both people.

1176
00:51:56,860 --> 00:51:59,800
And also, there are a lot of
other constraints like if the

1177
00:51:59,800 --> 00:52:01,600
machine goes down.

1178
00:52:01,600 --> 00:52:04,760
Either everybody
saw the trade or

1179
00:52:04,760 --> 00:52:06,280
nobody saw it.

1180
00:52:06,280 --> 00:52:08,050
You can't say-- somebody
says, I sold it.

1181
00:52:08,050 --> 00:52:11,520
And another guy says,
no, I didn't buy it.

1182
00:52:11,520 --> 00:52:13,460
And when you have millions or
billions of dollars going back and

1183
00:52:13,460 --> 00:52:15,400
forth, that doesn't
really work.

1184
00:52:15,400 --> 00:52:20,250
So for that, there's this thing
called transactions.

1185
00:52:20,250 --> 00:52:23,050
So transactions are an
interesting way-- a lot of

1186
00:52:23,050 --> 00:52:24,450
databases have this
transaction.

1187
00:52:24,450 --> 00:52:26,050
Transactions say, look,
I am doing this

1188
00:52:26,050 --> 00:52:27,300
very complicated thing.

1189
00:52:31,500 --> 00:52:34,940
And I cannot have this
intermediate state going on.

1190
00:52:34,940 --> 00:52:38,720
So what transactions say is,
first, tell me everything I

1191
00:52:38,720 --> 00:52:40,280
want to do in the transaction.

1192
00:52:40,280 --> 00:52:42,400
So it might be I want to sell
a stock, I want to buy a

1193
00:52:42,400 --> 00:52:43,660
stock, whatever.

1194
00:52:43,660 --> 00:52:48,250
And then at some point when you
commit the transaction,

1195
00:52:48,250 --> 00:52:51,250
you either say, OK, everything
worked, good, the entire thing

1196
00:52:51,250 --> 00:52:52,030
gets committed.

1197
00:52:52,030 --> 00:52:52,960
And you are done.

1198
00:52:52,960 --> 00:52:57,170
Or it can explicitly
reject it.

1199
00:52:57,170 --> 00:52:59,210
It can come back and say,
look, I can't do this

1200
00:52:59,210 --> 00:52:59,580
transaction.

1201
00:52:59,580 --> 00:53:01,230
Now sorry, you can restart it.

1202
00:53:01,230 --> 00:53:02,960
So you can accept and reject.

1203
00:53:02,960 --> 00:53:06,690
But then the nice thing about
that is then every one

1204
00:53:06,690 --> 00:53:08,020
single-- it's like atomicity.

1205
00:53:08,020 --> 00:53:12,210
Every single action doesn't have
to happen immediately; they

1206
00:53:12,210 --> 00:53:12,960
happen as a group.

1207
00:53:12,960 --> 00:53:14,590
You can say, OK, I'm doing
a bunch of actions in the

1208
00:53:14,590 --> 00:53:15,150
transaction.

1209
00:53:15,150 --> 00:53:19,400
And then finally, I can come
in and if it works, great.

1210
00:53:19,400 --> 00:53:21,950
So what might be a reason you
might not be able to commit a

1211
00:53:21,950 --> 00:53:23,245
transaction if you
do a transaction?

1212
00:53:30,695 --> 00:53:32,270
Anybody else want to answer?

1213
00:53:32,270 --> 00:53:34,140
When you say, I want a
transaction, I want to commit

1214
00:53:34,140 --> 00:53:35,710
something, what might say--

1215
00:53:38,250 --> 00:53:39,450
so here's an interesting
thing.

1216
00:53:39,450 --> 00:53:42,220
In the stock trading
type world--

1217
00:53:42,220 --> 00:53:43,990
so assume I want to
sell something.

1218
00:53:43,990 --> 00:53:45,590
Let's look at the airline
reservation.

1219
00:53:45,590 --> 00:53:48,690
So assume I have an airline
seat in here.

1220
00:53:48,690 --> 00:53:54,280
And if two people want to try to
reserve that seat, I can do

1221
00:53:54,280 --> 00:53:57,380
all this processing in parallel
for everybody until I

1222
00:53:57,380 --> 00:53:59,750
come to the commit point.

1223
00:53:59,750 --> 00:54:02,230
That means I can look at
everybody after you enter the

1224
00:54:02,230 --> 00:54:06,580
data, do the price, all those
things separately

1225
00:54:06,580 --> 00:54:08,460
for the same seat.

1226
00:54:08,460 --> 00:54:10,860
But then when you come to the
commit point, you say, can I

1227
00:54:10,860 --> 00:54:11,750
commit the transaction?

1228
00:54:11,750 --> 00:54:14,310
At that point, only at that
point, they have to figure out

1229
00:54:14,310 --> 00:54:16,590
whether there's a conflict
in here.

1230
00:54:16,590 --> 00:54:17,770
And at some point, if
there's a conflict,

1231
00:54:17,770 --> 00:54:19,310
it says, oops, can't.

1232
00:54:19,310 --> 00:54:21,110
One transaction has
to get aborted.

1233
00:54:21,110 --> 00:54:23,675
The nice thing about that is
most of the time people are

1234
00:54:23,675 --> 00:54:25,000
not going to fight for
the same seat.

1235
00:54:25,000 --> 00:54:27,380
And then things can proceed
in parallel.

1236
00:54:27,380 --> 00:54:28,310
You don't have to wait.

1237
00:54:28,310 --> 00:54:31,050
Otherwise, if you do that,
there might be only one seat

1238
00:54:31,050 --> 00:54:32,330
assignment at a time
you can do.

1239
00:54:32,330 --> 00:54:34,120
And that's really not
going to scale.

1240
00:54:34,120 --> 00:54:35,750
So everybody tries to
get their seat.

1241
00:54:35,750 --> 00:54:37,590
They go to the end, and they
say, can I proceed?

1242
00:54:37,590 --> 00:54:40,090
And at that point, you check
whether there's a conflict.

1243
00:54:40,090 --> 00:54:42,130
And if there's a conflict,
one guy backs out.

1244
00:54:42,130 --> 00:54:43,435
So that's the transaction.

1245
00:54:43,435 --> 00:54:45,610
Oops, I'm going to get
rebooted I guess.

1246
00:54:49,400 --> 00:54:56,050
So when you go to planet scale,
you can get even into

1247
00:54:56,050 --> 00:54:57,540
more issues, things like--

1248
00:54:57,540 --> 00:54:59,550
what could be a planet scale
thing out there?

1249
00:55:05,190 --> 00:55:06,870
What's an interesting planet
scale thing that

1250
00:55:06,870 --> 00:55:08,120
you can think of?

1251
00:55:10,690 --> 00:55:14,240
A single computation that has to
happen at planet scale.

1252
00:55:21,300 --> 00:55:23,920
Something like the Internet
naming system.

1253
00:55:23,920 --> 00:55:25,820
It has to work everywhere
in the entire planet.

1254
00:55:25,820 --> 00:55:27,640
Or something like Internet
routing.

1255
00:55:27,640 --> 00:55:30,050
There has to be an algorithm
that has to work.

1256
00:55:30,050 --> 00:55:33,260
The entire world has to
cooperate and then make sure

1257
00:55:33,260 --> 00:55:35,730
that all the traffic actually
goes to the right place.

1258
00:55:35,730 --> 00:55:37,880
So there's a lot more
issues, interesting

1259
00:55:37,880 --> 00:55:40,610
things show up in here.

1260
00:55:40,610 --> 00:55:44,850
So things like SETI@home
type stuff--

1261
00:55:44,850 --> 00:55:48,290
these are a little bit dated
these days, that happens--

1262
00:55:48,290 --> 00:55:50,160
distributed all across
the place.

1263
00:55:50,160 --> 00:55:55,520
So if you do planet scale, it
has to be truly distributed.

1264
00:55:55,520 --> 00:55:57,780
There cannot be any global
operations, no single

1265
00:55:57,780 --> 00:55:58,950
bottleneck.

1266
00:55:58,950 --> 00:56:00,795
And you have to have
distributed

1267
00:56:00,795 --> 00:56:02,510
view with stale data.

1268
00:56:02,510 --> 00:56:04,580
You cannot say, look,
everybody has to

1269
00:56:04,580 --> 00:56:05,850
have the same data.

1270
00:56:05,850 --> 00:56:07,870
You have to have everything
distributed.

1271
00:56:07,870 --> 00:56:10,880
And it has to adapt to load
distribution because things

1272
00:56:10,880 --> 00:56:13,800
can keep changing in there.

1273
00:56:13,800 --> 00:56:17,760
So what I'm going to do next is
try to give you a little

1274
00:56:17,760 --> 00:56:22,530
bit of a case study that shows
you some interesting

1275
00:56:22,530 --> 00:56:24,580
properties that show
up when you start

1276
00:56:24,580 --> 00:56:26,230
building at that scale.

1277
00:56:26,230 --> 00:56:30,590
And this has some planet scale
type properties, some cluster

1278
00:56:30,590 --> 00:56:31,820
properties, whatever.

1279
00:56:31,820 --> 00:56:33,750
And I will probably first
describe this interesting

1280
00:56:33,750 --> 00:56:37,810
problem and then show what kind
of solutions that came

1281
00:56:37,810 --> 00:56:41,740
through, so to give you
a perspective for

1282
00:56:41,740 --> 00:56:43,630
a problem in here.

1283
00:56:43,630 --> 00:56:45,860
Any questions so far
on distributed systems?

1284
00:56:48,490 --> 00:56:51,020
It's hard to do distributed
systems in one lecture.

1285
00:56:51,020 --> 00:56:52,860
There are entire classes on
distributed systems.

1286
00:56:52,860 --> 00:56:58,890
But this will give you a
feel for some of it.

1287
00:56:58,890 --> 00:57:03,180
So the case study here
is from VMware.

1288
00:57:03,180 --> 00:57:07,500
It's called deduplication
at global scale.

1289
00:57:07,500 --> 00:57:11,010
And the problem shows up when
you're trying to move virtual

1290
00:57:11,010 --> 00:57:13,330
machines across the world.

1291
00:57:13,330 --> 00:57:14,740
You have this virtual machine.

1292
00:57:14,740 --> 00:57:18,310
So what virtualization did was
it took a piece of hardware

1293
00:57:18,310 --> 00:57:20,690
and it converted
it into a file.

1294
00:57:20,690 --> 00:57:23,460
So each machine is now a file.

1295
00:57:23,460 --> 00:57:25,290
When you have a file, unlike
hardware, there are a lot of

1296
00:57:25,290 --> 00:57:27,390
cool things you can do then.

1297
00:57:27,390 --> 00:57:28,880
You can replicate those files.

1298
00:57:28,880 --> 00:57:31,640
So suddenly, instead of one
machine, you've got tens or

1299
00:57:31,640 --> 00:57:32,930
hundreds of machines.

1300
00:57:32,930 --> 00:57:34,600
You can move those
things in here.

1301
00:57:34,600 --> 00:57:38,340
And of course, you can start
other machines all over.

1302
00:57:38,340 --> 00:57:40,970
So once you are able to move
these things, the issue

1303
00:57:40,970 --> 00:57:43,150
becomes how to move those things
around and what's the

1304
00:57:43,150 --> 00:57:44,400
cost of moving something.

1305
00:57:46,830 --> 00:57:48,360
And also, you can
store it, store

1306
00:57:48,360 --> 00:57:50,070
those things in a database.

1307
00:57:54,800 --> 00:57:56,260
The interesting thing that's
happening these

1308
00:57:56,260 --> 00:57:58,180
days is cloud computing.

1309
00:57:58,180 --> 00:58:01,485
Cloud means there's all these
providers all over the place,

1310
00:58:01,485 --> 00:58:02,970
saying, I have processing
power, I can

1311
00:58:02,970 --> 00:58:03,700
give you some of them.

1312
00:58:03,700 --> 00:58:06,550
Amazon does something, EC2,
but Verizon is trying to

1313
00:58:06,550 --> 00:58:09,680
do, everybody is trying
to do that.

1314
00:58:09,680 --> 00:58:13,940
So if you want to have the best
market, what you want to

1315
00:58:13,940 --> 00:58:16,780
do is have the elasticity to
move from cloud to cloud for

1316
00:58:16,780 --> 00:58:18,570
many reasons.

1317
00:58:18,570 --> 00:58:20,600
So sometimes the cloud
might be too small.

1318
00:58:20,600 --> 00:58:22,510
You want to get to a bigger
cloud in there.

1319
00:58:22,510 --> 00:58:25,120
Or you want to be
near the users.

1320
00:58:25,120 --> 00:58:28,350
So in the daytime in the US, you
want probably to move the

1321
00:58:28,350 --> 00:58:29,410
machines to the US.

1322
00:58:29,410 --> 00:58:31,930
At night, there might be users
in China, so you want to move

1323
00:58:31,930 --> 00:58:32,940
your compute nearer China.

1324
00:58:32,940 --> 00:58:35,100
Because it will be closer to the
people who are using it.

1325
00:58:35,100 --> 00:58:36,960
So something like that, you
can move around there.

1326
00:58:36,960 --> 00:58:38,170
Or you want to find the
cheaper provider.

1327
00:58:38,170 --> 00:58:40,280
If somebody comes and says,
look, I can give you compute

1328
00:58:40,280 --> 00:58:45,210
power $0.10 cheaper than what
you are getting, OK, I want to

1329
00:58:45,210 --> 00:58:46,880
move to that guy.

1330
00:58:46,880 --> 00:58:49,890
And also, to amortize the risk
of catastrophic failure.

1331
00:58:49,890 --> 00:58:52,710
If there's a hurricane
approaching somewhere, I might

1332
00:58:52,710 --> 00:58:54,860
want to move to a data center
that might be out of the way.

1333
00:58:54,860 --> 00:58:56,370
And I want to do that.

1334
00:58:56,370 --> 00:59:00,956
And the interesting thing there
is a lot of things.

1335
00:59:00,956 --> 00:59:06,860
But when you say, application in
the cloud, it's a machine,

1336
00:59:06,860 --> 00:59:09,440
a machine basically in
a virtual machine.

1337
00:59:09,440 --> 00:59:11,160
At the same time, a virtual
machine has to get moved

1338
00:59:11,160 --> 00:59:13,110
around, not your small
application.

1339
00:59:13,110 --> 00:59:15,010
The entire thing has
to move around.

1340
00:59:15,010 --> 00:59:17,290
And virtual machines
are hefty.

1341
00:59:17,290 --> 00:59:18,630
Because it has an operating
system,

1342
00:59:18,630 --> 00:59:21,490
it has all the software.

1343
00:59:21,490 --> 00:59:22,520
There's so many things now.

1344
00:59:22,520 --> 00:59:24,160
Then your data and your state.

1345
00:59:24,160 --> 00:59:26,540
There are a lot of things
in the machine in here.

1346
00:59:26,540 --> 00:59:28,360
And all those things have to get
moved around, so that can

1347
00:59:28,360 --> 00:59:29,900
be expensive.

1348
00:59:29,900 --> 00:59:33,100
So yes, interesting experiment
in here.

1349
00:59:33,100 --> 00:59:36,530
So the idea here is to try to
move something from Boston to

1350
00:59:36,530 --> 00:59:38,870
Palo Alto on a 2 megabytes
network.

1351
00:59:38,870 --> 00:59:42,320
And there are a bunch
of different virtual

1352
00:59:42,320 --> 00:59:44,130
machines in here, VMs.

1353
00:59:44,130 --> 00:59:46,860
And it takes--

1354
00:59:46,860 --> 00:59:47,270
whatever--

1355
00:59:47,270 --> 00:59:52,670
3,000 minutes to move these
machines from--

1356
00:59:52,670 --> 00:59:55,050
this is 500 minutes.

1357
00:59:55,050 --> 00:59:56,300
That means what?

1358
01:00:01,050 --> 01:00:02,000
Some hours to move them.

1359
01:00:02,000 --> 01:00:04,820
Some hours to move these
machines around.

1360
01:00:04,820 --> 01:00:08,800
And then what you say, look, the
machines are heavy, big.

1361
01:00:08,800 --> 01:00:11,010
Why can't you first compress
the machine?

1362
01:00:11,010 --> 01:00:13,450
So you can use something like
normal compression.

1363
01:00:13,450 --> 01:00:17,250
So blue is basically compress,
move the machine, and

1364
01:00:17,250 --> 01:00:18,810
decompress.

1365
01:00:18,810 --> 01:00:19,950
So here is something
interesting.

1366
01:00:19,950 --> 01:00:24,380
This is a very fast
compression.

1367
01:00:24,380 --> 01:00:26,380
You did really well, you
are moving there.

1368
01:00:26,380 --> 01:00:28,290
If you want a better
compression, you say, I'm

1369
01:00:28,290 --> 01:00:31,760
going to do a full, best
compression I can do, it's

1370
01:00:31,760 --> 01:00:32,550
actually slower.

1371
01:00:32,550 --> 01:00:35,360
Because the trouble is the
compression time is so high,

1372
01:00:35,360 --> 01:00:38,570
the reduction is not
usable in here.

1373
01:00:38,570 --> 01:00:40,340
So you try to compress, you
spend most of the time

1374
01:00:40,340 --> 01:00:41,220
compressing.

1375
01:00:41,220 --> 01:00:43,390
So actually, this is even
slower than just sending

1376
01:00:43,390 --> 01:00:44,370
without compression.
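To make that trade-off concrete, here is a minimal sketch of the compress, send, decompress pipeline as a simple cost model. The function name `transfer_minutes` and every number in it (a 20 GB image, a 16 Mbit/s link, the compression ratios and throughputs) are illustrative assumptions, not the measurements shown on the slide.

```python
def transfer_minutes(size_gb, link_mbit_s, ratio=1.0,
                     compress_mb_s=None, decompress_mb_s=None):
    """Estimate minutes to (optionally) compress, send, and decompress an image.

    ratio is original size / compressed size; the throughputs are in MB/s.
    Passing None for a throughput means that step is skipped.
    """
    size_mb = size_gb * 1000.0
    compress_s = size_mb / compress_mb_s if compress_mb_s else 0.0
    send_s = (size_mb / ratio) * 8.0 / link_mbit_s  # MB -> Mbit, then / (Mbit/s)
    decompress_s = size_mb / decompress_mb_s if decompress_mb_s else 0.0
    return (compress_s + send_s + decompress_s) / 60.0


# Hypothetical numbers, chosen only to show the shape of the trade-off.
raw = transfer_minutes(20, 16)
fast = transfer_minutes(20, 16, ratio=2.0, compress_mb_s=50, decompress_mb_s=100)
best = transfer_minutes(20, 16, ratio=3.0, compress_mb_s=1, decompress_mb_s=100)

for name, minutes in [("no compression", raw), ("fast", fast), ("best", best)]:
    print(f"{name:>15}: {minutes:7.1f} min")
```

With these assumed numbers the fast compressor wins, while the slow, high-ratio compressor loses even to sending the raw image, because compression time dominates the total.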

1377
01:00:44,370 --> 01:00:46,310
So compression is important,
compression is useful.

1378
01:00:46,310 --> 01:00:48,750
So can you do better than a
normal compression in here?

1379
01:00:48,750 --> 01:00:51,070
How can you do better?

1380
01:00:51,070 --> 01:00:54,146
So some key observations
in here.

1381
01:00:54,146 --> 01:00:57,410
So a large part of these
files are executables.

1382
01:00:57,410 --> 01:00:59,660
You have your Linux kernel,
whatever, all those

1383
01:00:59,660 --> 01:01:01,230
executables sitting in there.

1384
01:01:01,230 --> 01:01:07,460
And basically, there's a
monoculture in the world.

1385
01:01:07,460 --> 01:01:11,000
There aren't a million different
executables.

1386
01:01:11,000 --> 01:01:13,030
You have a Linux kernel, there's
only a certain number

1387
01:01:13,030 --> 01:01:14,830
of versions.

1388
01:01:14,830 --> 01:01:17,360
Microsoft XP, there are certain
types of versions.

1389
01:01:17,360 --> 01:01:20,335
So even though there are
millions of machines, inside

1390
01:01:20,335 --> 01:01:22,270
the millions of machines,
there aren't millions of

1391
01:01:22,270 --> 01:01:23,020
different applications.

1392
01:01:23,020 --> 01:01:25,350
There's only hundreds of
different applications.

1393
01:01:25,350 --> 01:01:26,420
And so when your machine moves,

1394
01:01:26,420 --> 01:01:28,290
If you think about it, you are
moving the same thing again

1395
01:01:28,290 --> 01:01:29,170
and again and again.

1396
01:01:29,170 --> 01:01:31,310
Can you take advantage
of that?

1397
01:01:31,310 --> 01:01:35,720
And there's even substantial
redundancy in each of these.

1398
01:01:35,720 --> 01:01:37,190
So this is very interesting.

1399
01:01:37,190 --> 01:01:40,430
If you have a Windows machine,
each DLL has three copies.

1400
01:01:40,430 --> 01:01:43,990
So you have the copy, and then
the Installer has a copy.

1401
01:01:43,990 --> 01:01:46,650
And then there's another copy
in the next version to

1402
01:01:46,650 --> 01:01:49,050
basically back out,
so undo copy.

1403
01:01:49,050 --> 01:01:50,320
So each thing is kept
in three copies.

1404
01:01:50,320 --> 01:01:52,520
So every big thing,
you're seeing

1405
01:01:52,520 --> 01:01:53,410
multiple copies in there.

1406
01:01:53,410 --> 01:01:54,610
So that part is there also.

1407
01:01:54,610 --> 01:01:58,420
Even within a single disk, there
is redundancy in here.

1408
01:01:58,420 --> 01:02:03,250
And another interesting thing
is many of the disks have a

1409
01:02:03,250 --> 01:02:04,630
large amount of zero pages.

1410
01:02:04,630 --> 01:02:07,230
So if you send something
uncompressed, you send a huge

1411
01:02:07,230 --> 01:02:08,030
amount of zeros.

1412
01:02:08,030 --> 01:02:11,120
So you are waiting for zeros
to get in there.

1413
01:02:11,120 --> 01:02:13,875
Even easy compression can get to
those zeros, but this is a

1414
01:02:13,875 --> 01:02:15,700
large chunk of data in here.

1415
01:02:15,700 --> 01:02:19,560
And so the interesting thing
is if you take one virtual

1416
01:02:19,560 --> 01:02:22,680
machine, this is the number
of non-zero blocks.

1417
01:02:22,680 --> 01:02:24,290
And this is the number
of unique blocks.

1418
01:02:24,290 --> 01:02:27,150
So unique blocks are smaller
than non-zero blocks.

1419
01:02:27,150 --> 01:02:30,510
But if you keep adding more
and more virtual machines,

1420
01:02:30,510 --> 01:02:33,520
then of course, the number of
total blocks keeps going up.

1421
01:02:33,520 --> 01:02:37,570
But the unique blocks don't
keep increasing.

1422
01:02:37,570 --> 01:02:42,720
That means the second Linux box
you add, there's not much

1423
01:02:42,720 --> 01:02:43,600
new in there.

1424
01:02:43,600 --> 01:02:46,310
So if you look at that, what
happens is the first guy has

1425
01:02:46,310 --> 01:02:48,810
about 80% things are unique.

1426
01:02:48,810 --> 01:02:52,140
When you keep adding things,
it's about only 30% is unique

1427
01:02:52,140 --> 01:02:52,570
after you add.

1428
01:02:52,570 --> 01:02:53,770
Because it's the same program.

1429
01:02:53,770 --> 01:02:55,700
Only the data is different
as you keep adding.
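The effect described here, where the total keeps growing but the unique count flattens, can be reproduced on toy data. This is a sketch under stated assumptions: the 8-byte block size, the synthetic "VM images" built by `make_vm`, and the helper `block_stats` are all hypothetical and only mimic the shape of the graph, not its numbers.

```python
import hashlib

BLOCK = 8  # toy block size in bytes (assumption; real systems use KB-sized blocks)


def block_stats(images, block_size=BLOCK):
    """Return cumulative (non-zero blocks, unique blocks) after each image."""
    zero = bytes(block_size)
    seen = set()
    nonzero_total = 0
    rows = []
    for image in images:
        for off in range(0, len(image), block_size):
            blk = image[off:off + block_size]
            if blk == zero:
                continue  # zero blocks carry no information
            nonzero_total += 1
            seen.add(hashlib.sha256(blk).digest())
        rows.append((nonzero_total, len(seen)))
    return rows


# Eight shared "operating system" blocks, identical in every image.
os_part = b"".join(bytes([i]) * BLOCK for i in range(1, 9))


def make_vm(tag):
    """Hypothetical VM image: shared OS, four zero blocks, two private data blocks."""
    user_data = bytes([tag]) * BLOCK + bytes([tag + 1]) * BLOCK
    return os_part + bytes(BLOCK) * 4 + user_data


rows = block_stats([make_vm(100), make_vm(110), make_vm(120)])
for n, (nonzero, unique) in enumerate(rows, start=1):
    print(f"{n} VM(s): {nonzero} non-zero blocks, {unique} unique blocks")
```

Each added image contributes ten non-zero blocks but only two new unique ones, because the shared operating system blocks have already been seen.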

1430
01:02:55,700 --> 01:02:57,630
So can you really take
advantage of that?

1431
01:02:57,630 --> 01:03:01,350
So that is where deduplication
comes in.

1432
01:03:01,350 --> 01:03:04,420
So deduplication says, I have
this data, I have a lot of

1433
01:03:04,420 --> 01:03:05,120
redundant data.

1434
01:03:05,120 --> 01:03:07,380
So A B, A B, A B is redundant.

1435
01:03:07,380 --> 01:03:09,800
So what you want to do is break
it up to some kind of

1436
01:03:09,800 --> 01:03:11,830
blocks in here.

1437
01:03:11,830 --> 01:03:15,350
And then one easy way to
do it is calculate a hash.

1438
01:03:15,350 --> 01:03:16,740
Because you don't want
to compare blocks.

1439
01:03:16,740 --> 01:03:17,640
That's too much.

1440
01:03:17,640 --> 01:03:20,280
N squared comparison of blocks
is a lot of comparison.

1441
01:03:20,280 --> 01:03:22,180
You can have some kind of
hash calculated for

1442
01:03:22,180 --> 01:03:23,260
each of these blocks.

1443
01:03:23,260 --> 01:03:25,430
And then you can compare
the hashes.

1444
01:03:25,430 --> 01:03:26,980
And if the hashes are
the same, they

1445
01:03:26,980 --> 01:03:28,190
are the same blocks.

1446
01:03:28,190 --> 01:03:30,430
And then what you can do is
you can eliminate most of

1447
01:03:30,430 --> 01:03:35,790
these blocks in there and then
keep hashes for each block--

1448
01:03:35,790 --> 01:03:36,960
only the hash.

1449
01:03:36,960 --> 01:03:40,900
And then what you can do is you
can only keep the unique

1450
01:03:40,900 --> 01:03:41,440
blocks in here.

1451
01:03:41,440 --> 01:03:44,600
So even though you have nine in
here, only five different

1452
01:03:44,600 --> 01:03:45,910
unique blocks are there.

1453
01:03:45,910 --> 01:03:47,710
So that's a nice way
of deduplicating.

1454
01:03:47,710 --> 01:03:51,275
So you actually have what you
call a recipe and a common block

1455
01:03:51,275 --> 01:03:53,310
store in here.
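A minimal sketch of the recipe and common block store idea, assuming SHA-256 as the hash and a tiny 4-byte block size so the example stays readable. The function names `deduplicate` and `reconstruct` are made up for illustration; the lecture does not say which hash or block size the real system uses.

```python
import hashlib

BLOCK_SIZE = 4  # toy value for illustration only


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks; return (recipe, store).

    recipe: ordered list of block hashes, one per original block.
    store:  the common block store, mapping hash -> one copy of the block.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        key = hashlib.sha256(block).hexdigest()
        store.setdefault(key, block)  # keep only the first copy of each block
        recipe.append(key)
    return recipe, store


def reconstruct(recipe, store) -> bytes:
    """Rebuild the original data by following the recipe."""
    return b"".join(store[key] for key in recipe)


# Nine blocks, five of them distinct, as in the lecture's picture.
blocks = [b"AAAA", b"BBBB", b"AAAA", b"BBBB", b"CCCC",
          b"DDDD", b"AAAA", b"EEEE", b"BBBB"]
data = b"".join(blocks)
recipe, store = deduplicate(data)
print(len(recipe), "blocks in the recipe,", len(store), "blocks stored")
```

Comparing hashes through a dictionary lookup avoids the N squared block-to-block comparison: each block is hashed once and checked against the store in roughly constant time.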

1456
01:03:53,310 --> 01:03:56,130
So one way to do that is you
can have a recipe and common

1457
01:03:56,130 --> 01:03:59,040
block store for each of
the systems in here.

1458
01:03:59,040 --> 01:04:01,120
That's the traditional
deduplication.

1459
01:04:01,120 --> 01:04:04,930
Or what you can do is have
everybody keep a recipe and

1460
01:04:04,930 --> 01:04:07,150
only have one common
block store.

1461
01:04:07,150 --> 01:04:09,690
Just keep one, single common
block store, and everybody

1462
01:04:09,690 --> 01:04:11,910
have a recipe, or probably
cache of a recipe.

1463
01:04:11,910 --> 01:04:16,140
So by doing that, you can even
reduce a huge amount of the

1464
01:04:16,140 --> 01:04:19,480
things happening in here.

1465
01:04:19,480 --> 01:04:21,430
So the interesting thing is if
you are keeping one common

1466
01:04:21,430 --> 01:04:23,570
block store, who can keep,
who can manage?

1467
01:04:23,570 --> 01:04:26,180
That's the interesting
question in here.

1468
01:04:26,180 --> 01:04:31,140
So can you keep instead of
common block store for each

1469
01:04:31,140 --> 01:04:34,320
processor, each computer, can
you keep the common block

1470
01:04:34,320 --> 01:04:36,720
store for the entire world?

1471
01:04:36,720 --> 01:04:39,080
So if you find most of the
common blocks in the world,

1472
01:04:39,080 --> 01:04:40,130
keep one store.

1473
01:04:40,130 --> 01:04:44,020
And the nice thing about
that is then I can go

1474
01:04:44,020 --> 01:04:44,480
anywhere in the world.

1475
01:04:44,480 --> 01:04:46,880
I can ask for the common blocks
for the common things.

1476
01:04:46,880 --> 01:04:48,130
I can populate it myself.

1477
01:04:52,900 --> 01:04:56,070
So here's this interesting
system called the Bonsai.

1478
01:04:56,070 --> 01:04:57,930
What they did was--

1479
01:04:57,930 --> 01:05:00,250
so if you have a block in here,
you calculate a hash

1480
01:05:00,250 --> 01:05:02,970
function in here,
get a hash key.

1481
01:05:02,970 --> 01:05:04,670
So what that means,
the hash key can

1482
01:05:04,670 --> 01:05:06,450
uniquely access this block.

1483
01:05:06,450 --> 01:05:09,170
And then what you want to do
is you want to

1484
01:05:09,170 --> 01:05:11,930
compress this block.

1485
01:05:11,930 --> 01:05:14,700
This additional step, I'll
explain later why it's needed.

1486
01:05:14,700 --> 01:05:17,320
So the other thing you can do
is you can get a second hash

1487
01:05:17,320 --> 01:05:24,160
key and use that as a private
key to encrypt this block.

1488
01:05:24,160 --> 01:05:25,230
Because you calculate
two hash keys.

1489
01:05:25,230 --> 01:05:26,500
One is the hash key
to identify.

1490
01:05:26,500 --> 01:05:28,810
The other one is a private key
to encrypt this block.

1491
01:05:28,810 --> 01:05:31,440
And then what you can look at
is this global store to see

1492
01:05:31,440 --> 01:05:33,660
whether this hash key exists.

1493
01:05:33,660 --> 01:05:35,540
If the hash key exists,
then say, I got the

1494
01:05:35,540 --> 01:05:36,870
page, here is the page.

1495
01:05:36,870 --> 01:05:37,980
That's the encrypted page.

1496
01:05:37,980 --> 01:05:41,040
And each page will
have a unique ID.

1497
01:05:41,040 --> 01:05:42,490
And so here's my unique ID.

1498
01:05:42,490 --> 01:05:46,230
If you find the page in here,
what you can do is you can

1499
01:05:46,230 --> 01:05:49,230
only store UID and this
private key and

1500
01:05:49,230 --> 01:05:50,710
get rid of my page.

1501
01:05:50,710 --> 01:05:55,040
So storing UID and private key
is sufficient to get my page

1502
01:05:55,040 --> 01:05:56,070
and decrypt it.

1503
01:05:56,070 --> 01:05:59,540
Why do you think I have
to compress here?

1504
01:05:59,540 --> 01:06:03,170
Why do I have to basically
do encrypt here?

1505
01:06:03,170 --> 01:06:04,550
What's the interesting thing
about encryption?

1506
01:06:08,270 --> 01:06:10,150
Why encrypt?

1507
01:06:10,150 --> 01:06:11,400
Assume this is global.

1508
01:06:17,260 --> 01:06:23,520
Because if you don't encrypt, it
might be a common page, but

1509
01:06:23,520 --> 01:06:26,490
it might not be something you
want everybody to know.

1510
01:06:26,490 --> 01:06:34,640
So assume a large company
like Google.

1511
01:06:34,640 --> 01:06:37,620
President of Google, Larry Page,
sends everybody emails,

1512
01:06:37,620 --> 01:06:40,420
saying, this is very private,
but here is something that's

1513
01:06:40,420 --> 01:06:41,760
happening in the company.

1514
01:06:41,760 --> 01:06:43,355
And it will get into everybody's
mailbox.

1515
01:06:43,355 --> 01:06:44,560
And suddenly, it becomes--

1516
01:06:44,560 --> 01:06:46,050
aha-- a common page.

1517
01:06:46,050 --> 01:06:49,500
And it gets sucked in
the world because

1518
01:06:49,500 --> 01:06:50,050
of the common page.

1519
01:06:50,050 --> 01:06:52,560
And now everybody can see that,
and that's not good.

1520
01:06:52,560 --> 01:06:54,840
But now if you have
this private key--

1521
01:06:54,840 --> 01:06:57,620
if you don't have the private
key, you can't decrypt that.

1522
01:06:57,620 --> 01:06:59,030
So what happens is--

1523
01:06:59,030 --> 01:07:01,560
let me go through what
you can do in here.

1524
01:07:01,560 --> 01:07:05,760
And then what can happen is if
you had these two, UID and

1525
01:07:05,760 --> 01:07:11,460
private key, you can go to the
global storage and say, here's

1526
01:07:11,460 --> 01:07:12,430
the UID, give me the page.

1527
01:07:12,430 --> 01:07:13,200
It has a page.

1528
01:07:13,200 --> 01:07:14,690
It better have the page
for that UID.

1529
01:07:14,690 --> 01:07:16,030
Get the page out of that.

1530
01:07:16,030 --> 01:07:19,310
And then now I can use my
private key to decrypt it.

1531
01:07:19,310 --> 01:07:21,335
And then of course, I
can decompress it into

1532
01:07:21,335 --> 01:07:22,460
the original page.
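The round trip just described (hash to identify, compress, encrypt with a second content-derived key, store, then fetch by UID and reverse) can be sketched as follows. This is not the Bonsai implementation: the class `GlobalStore`, the functions `stash` and `fetch`, the use of SHA-256 with `id:` and `key:` prefixes to get two different hashes, and integer UIDs are all assumptions. The XOR keystream is a toy stand-in for a real cipher and is not secure.

```python
import hashlib
import zlib


def _keystream_xor(data: bytes, key: bytes) -> bytes:
    """Toy cipher: XOR data with SHA-256(key || offset). NOT secure; illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], pad))
    return bytes(out)


class GlobalStore:
    """Hypothetical global store: one encrypted copy per distinct page."""

    def __init__(self):
        self._uid_by_hash = {}
        self._page_by_uid = {}

    def put(self, hash_key: bytes, encrypted_page: bytes) -> int:
        uid = self._uid_by_hash.get(hash_key)
        if uid is None:  # first time anyone has stored this page
            uid = len(self._page_by_uid)
            self._uid_by_hash[hash_key] = uid
            self._page_by_uid[uid] = encrypted_page
        return uid

    def get(self, uid: int) -> bytes:
        return self._page_by_uid[uid]


def stash(page: bytes, store: GlobalStore):
    """Store a page globally; the caller keeps only (uid, private_key)."""
    hash_key = hashlib.sha256(b"id:" + page).digest()      # identifies the page
    private_key = hashlib.sha256(b"key:" + page).digest()  # derived from content
    encrypted = _keystream_xor(zlib.compress(page), private_key)
    return store.put(hash_key, encrypted), private_key


def fetch(uid: int, private_key: bytes, store: GlobalStore) -> bytes:
    """Fetch by UID, decrypt with the private key, then decompress."""
    return zlib.decompress(_keystream_xor(store.get(uid), private_key))


store = GlobalStore()
page = b"common kernel page " * 100
uid_a, key_a = stash(page, store)  # first client
uid_b, key_b = stash(page, store)  # second client holding the same page
print(uid_a == uid_b, fetch(uid_a, key_a, store) == page)
```

Because both keys are derived from the page contents, two clients holding the same page arrive at the same UID and the same private key independently, while someone who never had the page cannot derive the key from the stored ciphertext.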

1533
01:07:22,460 --> 01:07:27,550
So by doing that, I can have
this global system that keeps

1534
01:07:27,550 --> 01:07:31,010
common pages in there
and store them.

1535
01:07:31,010 --> 01:07:33,665
I can basically, really
eliminate the storage.

1536
01:07:33,665 --> 01:07:37,830
Because the nice thing is this
could be, we'll say, 2K pages.

1537
01:07:37,830 --> 01:07:41,800
And this is 64 bit, and this
is probably 256 bit.

1538
01:07:41,800 --> 01:07:44,652
So there's a huge compression
of data in there.
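As a quick worked check of that claim, assuming "2K" means a 2,048-byte page, the 64-bit value is the UID, and the 256-bit value is the private key:

```python
PAGE_BYTES = 2 * 1024   # assumed page size: 2 KiB
UID_BYTES = 64 // 8     # 64-bit unique ID
KEY_BYTES = 256 // 8    # 256-bit private key

reference_bytes = UID_BYTES + KEY_BYTES
reduction = PAGE_BYTES / reference_bytes

print(reference_bytes, "bytes kept per page instead of", PAGE_BYTES)
print(f"about {reduction:.1f}x smaller for every page found in the global store")
```

Under these assumptions each shared page shrinks from 2,048 bytes to a 40-byte reference, roughly a 51x reduction.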

1539
01:07:50,320 --> 01:07:52,110
So here are the kinds of
decisions you have to make.

1540
01:07:52,110 --> 01:07:54,030
For example, hash key.

1541
01:07:54,030 --> 01:07:55,760
So each page is represented
by a hash key.

1542
01:07:58,840 --> 01:08:04,940
But you can have two hash keys,
two pages mapping to the

1543
01:08:04,940 --> 01:08:05,960
same hash key.

1544
01:08:05,960 --> 01:08:08,120
So my god, you're not unique.

1545
01:08:08,120 --> 01:08:09,370
So why is it OK?

1546
01:08:12,692 --> 01:08:14,090
AUDIENCE: Low probability.

1547
01:08:14,090 --> 01:08:15,440
PROFESSOR: Very low
probability.

1548
01:08:15,440 --> 01:08:17,670
And in fact, what they looked
at was they calculated the

1549
01:08:17,670 --> 01:08:19,450
disk failure.

1550
01:08:19,450 --> 01:08:23,210
There's a higher probability of a disk
failing and losing data

1551
01:08:23,210 --> 01:08:25,080
than hash collision.

1552
01:08:25,080 --> 01:08:26,800
So you can say, look, you're
keeping that in the hard

1553
01:08:26,800 --> 01:08:29,540
drive, so there could be more
chance of disk failure than

1554
01:08:29,540 --> 01:08:30,140
hash collision.

1555
01:08:30,140 --> 01:08:32,830
So therefore, hash key is--

1556
01:08:32,830 --> 01:08:34,870
you can do that, low enough
probability, you can

1557
01:08:34,870 --> 01:08:36,386
get away with that.

1558
01:08:36,386 --> 01:08:40,840
Actually, I want to skip the
rest of this in there.

1559
01:08:40,840 --> 01:08:45,870
So here is the comparison of how
much compression you can get.

1560
01:08:45,870 --> 01:08:48,620
Some of them, you can
compress almost 100%.

1561
01:08:48,620 --> 01:08:52,220
Because if you have a newly
installed Linux box with very

1562
01:08:52,220 --> 01:08:54,899
little data, everything
is in the global store.

1563
01:08:54,899 --> 01:08:58,649
So by doing that, basically,
here is the communication.

1564
01:08:58,649 --> 01:09:00,290
The compression is cheap.

1565
01:09:00,290 --> 01:09:02,779
You don't have to do too much
because you just do this hash

1566
01:09:02,779 --> 01:09:04,220
comparison to do that.

1567
01:09:04,220 --> 01:09:09,100
And then now the communication
time is even reduced because

1568
01:09:09,100 --> 01:09:10,490
you communicate a lot less.

1569
01:09:10,490 --> 01:09:11,460
And then you expand.

1570
01:09:11,460 --> 01:09:16,899
So all these three are actually
now much faster to

1571
01:09:16,899 --> 01:09:19,140
send a machine across.

1572
01:09:19,140 --> 01:09:22,850
So here is a total size
of all the VMs.

1573
01:09:22,850 --> 01:09:27,500
The interesting thing is if you
compress within, if you

1574
01:09:27,500 --> 01:09:30,630
just do compression, most
of this is zero blocks.

1575
01:09:30,630 --> 01:09:32,500
You can eliminate the zeros
and do a simple

1576
01:09:32,500 --> 01:09:33,720
compression in here.

1577
01:09:33,720 --> 01:09:35,939
And if you do local
deduplication, if you

1578
01:09:35,939 --> 01:09:37,500
eliminate that, and
if you do global,

1579
01:09:37,500 --> 01:09:38,380
you'll get to this point.

1580
01:09:38,380 --> 01:09:44,060
So you get to about basically
30% of all your data.

1581
01:09:44,060 --> 01:09:46,526
And yeah, there's more data
in here, but just--

1582
01:09:50,109 --> 01:09:57,000
So that's an interesting global
level system that

1583
01:09:57,000 --> 01:09:58,670
people are building today.

1584
01:09:58,670 --> 01:10:02,590
And I think these kinds of
systems appear in many

1585
01:10:02,590 --> 01:10:05,360
different places.

1586
01:10:05,360 --> 01:10:07,265
These days, a lot of people are
building things like cell

1587
01:10:07,265 --> 01:10:09,760
phone games and things like
that, and they have these huge back

1588
01:10:09,760 --> 01:10:11,370
stores, back computation.

1589
01:10:11,370 --> 01:10:14,790
All those things have this large
scalability in there.

1590
01:10:14,790 --> 01:10:19,630
And so the nice thing about
performance engineering is, even if

1591
01:10:19,630 --> 01:10:21,700
your application doesn't require
a huge amount of

1592
01:10:21,700 --> 01:10:25,010
computation or very fast
processing, if you're going to

1593
01:10:25,010 --> 01:10:27,880
have millions and millions of
users, or if you're expecting

1594
01:10:27,880 --> 01:10:30,880
millions of users, then building
these systems, and

1595
01:10:30,880 --> 01:10:32,290
understanding, and building

1596
01:10:32,290 --> 01:10:35,460
scalability is really important.

1597
01:10:35,460 --> 01:10:38,360
And a lot of the things we
learned in this class are

1598
01:10:38,360 --> 01:10:40,760
directly applicable there.

1599
01:10:40,760 --> 01:10:42,760
So I have...

1600
01:10:42,760 --> 01:10:48,290
Any other questions before you
guys can go finish your

1601
01:10:48,290 --> 01:10:50,850
project report and go have a
nice Thanksgiving dinner?

1602
01:10:54,560 --> 01:10:58,830
So everybody is thinking that
they will be able to get a

1603
01:10:58,830 --> 01:11:02,660
good handle on what they are
doing for the final project?

1604
01:11:02,660 --> 01:11:04,870
Oh come on, there are so many
cool things you can do with

1605
01:11:04,870 --> 01:11:06,120
this project.

1606
01:11:09,214 --> 01:11:10,464
AUDIENCE: We're working on it.

1607
01:11:13,040 --> 01:11:14,410
PROFESSOR: And there are--

1608
01:11:14,410 --> 01:11:14,740
whatever--

1609
01:11:14,740 --> 01:11:18,410
three or four iPod
Nanos waiting.

1610
01:11:18,410 --> 01:11:20,910
So it makes a lot of
sense to actually

1611
01:11:20,910 --> 01:11:21,970
really focus on this one.

1612
01:11:21,970 --> 01:11:22,890
This is a fun project.

1613
01:11:22,890 --> 01:11:24,220
This is, in fact,
a fun project.

1614
01:11:24,220 --> 01:11:28,410
Talk to these guys about
what they did last year.

1615
01:11:28,410 --> 01:11:31,220
People did a lot of interesting
things.

1616
01:11:31,220 --> 01:11:33,030
Because we gave you freedom to
actually even change the

1617
01:11:33,030 --> 01:11:34,610
algorithms.

1618
01:11:34,610 --> 01:11:36,930
So you can actually look
at the algorithm.

1619
01:11:36,930 --> 01:11:39,720
If you know a little bit of
physics and graphics type

1620
01:11:39,720 --> 01:11:42,570
stuff, look at them and say,
look, can I even reduce

1621
01:11:42,570 --> 01:11:44,980
computation of how the
computation is being done?

1622
01:11:44,980 --> 01:11:47,590
So people got a lot of wins
by doing things like that.

1623
01:11:47,590 --> 01:11:50,346
And of course, parallelization
matters.

1624
01:11:50,346 --> 01:11:53,330
And there are a lot of
optimization possibilities in

1625
01:11:53,330 --> 01:11:55,340
this piece of code.

1626
01:11:55,340 --> 01:11:57,200
So take a look at that.

1627
01:11:57,200 --> 01:11:58,260
Just have a plan.

1628
01:11:58,260 --> 01:12:01,250
Just don't go blindly into
it, just have a plan.

1629
01:12:01,250 --> 01:12:03,440
Run it, profile it,
get some feedback.

1630
01:12:03,440 --> 01:12:06,450
You need to get this for
your presentations.

1631
01:12:06,450 --> 01:12:10,330
So run, profile, get some
feedback, have a good plan.

1632
01:12:10,330 --> 01:12:12,660
Go attack it.

1633
01:12:12,660 --> 01:12:14,840
So see you in a week.