The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So a couple of things I want to say about the final project. You guys should start thinking about it. So of course, you guys should think about the teams first, and submit team information. Say who you're going to team up with. By when? Team information, Josh, by when? Team information has to be in--

AUDIENCE: By tomorrow.

PROFESSOR: By tomorrow. OK, good. You should know your teams. Get them together, and use the same mechanism that we used before to submit the team information. So we are going to add one small thing this year that I think will be useful. Once you have submitted your design documents, we are going to get a design review done by your Masters. So what that means is next week we are going to schedule a design review with your Masters.
And you should send mail to your Masters, hopefully in the middle of next week, to schedule a design review. The design review will happen the week after Thanksgiving. So you'll submit your design doc next week, and then the week after Thanksgiving you'll have your design review. The earlier the better, because you can hopefully get some really good feedback before you go into implementation. So have an idea what you're doing. Of course you'll write it up for your design document. And then go to your Masters and say, here's what I'm planning to do. Get some good feedback, and hopefully doing that will make your life easier.

And then, before this, performance only mattered for your grade. We did this absolute grading. This year, we are actually going to have an in-class competition on the final day of class, to figure out who has the fastest ray tracer in the class. And for that we will actually give [? you ?] a little bit of a different [UNINTELLIGIBLE] than what I have given you. And so don't go too much into really hand-coding to that, because that might not work.
And so here's something hot off the press. For the winning team, there's going to be an Akamai prize. And this prize includes a celebration/demonstration at Akamai headquarters. You're going to go visit there [UNINTELLIGIBLE] and perhaps show off to their engineers the cool ray tracer you did. And also every member of the winning team is going to get an iPod Nano. Sorry, guys, last year it didn't happen. First time. So there's a lot at stake. So make sure your program is going to run as fast as it can.

OK, with that, let's get your slides on. So I'd like to introduce Bradley Kuszmaul. Bradley has been at MIT, in and out of MIT, for a long time, doing lots of cool stuff with high-performance-- yeah, make the screen bigger-- high-performance computing. He has done some really interesting data structure work, performance optimization work, and stuff like that. And today he's going to talk about an interesting data structure that goes all the way from theory to getting really, really high performance. Thank you, Bradley.
OK, you have the mic.

BRADLEY KUSZMAUL: So I'm going to talk about a data structure called fractal trees, which in the academic world are called streaming B-trees. But the marketing people didn't think very much of that, and a lot of these slides are borrowed from a company that I've started. So rather than redo that, I'm just going to stick to the terminology "fractal tree." I'm research faculty at MIT, and I'm a founder at Tokutek. And so that's sort of who I am. I'll do a little bit more introduction. So I have been around at MIT a long time. I have four MIT degrees. And I was one of the architects of the Connection Machine CM-5. And Charles was also one of the architects of that machine. So at the time, that was the fastest machine in the world, at least for some applications. And after getting my degrees and being an architect, I went and was a professor at Yale, and then later I was at Akamai. So I don't know what an Akamai prize is beyond an iPod, but maybe it's like all your content delivered free for a month or something.
And I'm now research faculty in the SuperTech Group, working with Charles. And I'm a founder of Tokutek, which is commercializing some work we did. A couple years ago, I started collaborating with Michael Bender and Martin Farach-Colton on data structures that are suited for storing data on disk. And we ended up a bit later starting a company to commercialize the research. And basically, I'll tell you sort of what the background is, and actually go into some technical detail on the data structure. So I don't know exactly what you've spent most of your time on, but a lot of high-performance work, especially in academia, focuses on the CPUs and using the CPUs efficiently, maybe getting lots of FLOPS or something. The Cilk work that Charles and I did, for example, is squarely in the category of how do you get more FLOPS, or more computrons, out of a particular machine. But it turns out often I/O is a big bottleneck, and so you see systems that look a little bit like this. You have a whole bunch of sensors somewhere, and the sensors might be something like a bunch of telescopes in an astronomy system.
They're sending millions of data items per second, and they have to be stored. And disk is where you have to store large amounts of data, because disk is orders of magnitude cheaper per byte than other storage systems. And then you want to do queries on that data, and you want to look at the data that's recent. So it's not good enough just to look at yesterday's data. You want to know what's going on right now. If your sensor array is a bunch of telescopes, and a supernova starts happening, you want to be able to find out quickly what's going on, so that you can broadcast the message to everybody in the world so they can all point their telescopes at the supernova while it's fresh. So that's the picture. Another example of a sensor system is the internet, where you have thousands or millions of people clicking away on Facebook, for example. You could view that collection of mice as, abstractly, a bunch of sensors. And so you see it in science. You see it on the internet. There's lots of applications.
For example, another one would be that you're looking for attacks on your internet infrastructure in a large corporation, or something. So trying to reduce this big sensor system to its fundamental problem: basically, we need to index the data. So the data indexing problem is this. Data is arriving in one order, and you want to ask about it in another order. So typically data is arriving in order by time. When an observation is made, the event is logged. When the next observation is made, the event is logged. And then you want to do a query: tell me everything that's happening in that particular area of the sky over the past month. So there's a big transposition that has to be done for these queries. Abstractly, the data's coming in in one order, and you want to sort it and get the data out in another order.

So one solution to this problem that a lot of people use is simply to sort the data. The data comes in. Sort it. Then you can query it in the order that makes sense. This is basically a simple-minded explanation of what a data warehouse is.
A data warehouse is all this data comes in-- Walmart runs one of these in Arkansas. All these events, which are people scanning cans of soup on bar codes all over the country out in Walmart stores-- all that data arrives in Arkansas, in one location. They sort the data overnight, and then the next morning they can answer questions like, what's the most popular food in the week before a hurricane strikes? Because this is the kind of request that Walmart might care about, because they get a forecast that a hurricane's coming, and it turns out they need to ship beer and blueberry Pop-Tarts to the local stores, which are the things that basically you can eat even if power has failed. The problem with sorting is that you have to wait overnight. And for Walmart, that might actually be good enough. But if you're the astronomer, that application is not so great. So this problem is called the indexing problem. We have to maintain indexes. And traditionally, the classical solution is to use a data structure called a B-tree. So do you all know what a B-tree is? [INAUDIBLE] data structures to algorithms?
A B-tree is like a search tree, except it's got some fan-out, and I'll talk about it in a second. They show up in virtually all storage systems today. They were invented about 40 years ago, and they show up in databases such as MyISAM or Oracle. They show up in file systems like XFS. You can think of what Unix file systems like ext do as being a variation of a B-tree. Basically, they're everywhere. Mike drew this picture of a B-tree. And I said, I don't get it. He said, well, there's a tree, and there's bees. And I said, but those are wasps. So anyway--

So a B-tree looks like this. It's a search tree, so that means everything is organized. It's got left children and right children, and there are actually many children. And like any other search tree, all the things to the left are before all the things to the right. That's sort of the property of trees that lets you do more than just a hash table. A hash table lets you do get and put. But a tree lets you do next. And that's the key observation of why you need something like a tree instead of a hash table.
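The get/put-versus-next distinction can be sketched in a few lines of Python (my own illustration, not from the lecture): a dict supports get and put, but has no ordered notion of "next", while any sorted-order structure does.

```python
import bisect

# A sorted list stands in for a search tree: keys are kept in order,
# so the successor ("next") of any key can be found cheaply.
keys = [10, 20, 30, 40, 50]

def next_key(sorted_keys, k):
    """Smallest key strictly greater than k, or None if there is none."""
    i = bisect.bisect_right(sorted_keys, k)
    return sorted_keys[i] if i < len(sorted_keys) else None

# A hash table (dict) supports get and put, but the only way to find
# the next key after 20 would be to scan every key it contains.
table = {k: "value" for k in keys}

print(next_key(keys, 20))   # 30
print(next_key(keys, 55))   # None
```

Scanning a range is then just repeated calls to next, which is exactly the operation a hash table cannot provide without examining everything.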
A lot of database queries-- if you go and click on Facebook on somebody's page, there are all these things that have been posted on somebody's wall. And what they've done when they organized that data is that they've organized it so that each of those items is a row in the database, and they're next to each other, so that you fetch the first one, which is like the home page of the person, and then next and next and next gives each of the messages that they want to display. And by making those things adjacent to each other, it means that they don't incur a disk I/O every time. If it were just a hash table, you'd be having to look all over the place to find those things. So B-trees are really fast if you do insertions sequentially. And the reason is you have a data structure that's too big to fit in main memory. If the data structure fits in main memory, this is just the wrong data structure, right? If it fits in main memory, what should you use to solve this problem? Any ideas? What data structure is like a B-tree except it doesn't have lots of fan-out? Does anybody know this stuff in this class?
Do you people know data structures at all? Maybe I'm in the wrong place. Because it's OK. Just a binary tree would be the data structure if you were doing this in memory, right? A binary tree would be fine. Or maybe you would try to minimize the number of cache misses or something. So for sequential inserts, if you're inserting at the end, basically all the stuff down the right spine of the tree is in main memory, and an insertion just inserts and inserts. You have no disk I/Os, and basically it runs extremely fast. The disk I/O is sequential. You get basically performance that's limited by the disk bandwidth, which is the rate at which the disk can write consecutive blocks. But B-trees are really slow if you're doing insertions that look random. The database world calls those high-entropy. And so basically the idea is I pick some leaf at random, and then I have to bring it into main memory, put the new record in there, and then eventually write it back out.
And because the data structure is spread all over the disk, each of those random blocks that I choose, when I bring it in, that's a random disk I/O, which is very expensive. So here, for this workload, unlike the previous workload, the performance of the system is limited by how fast you can move the disk head around, rather than how fast you can write having placed the disk head. And perhaps, on a disk drive, you can only do something like 100 disk head movements per second. And if you're writing small records that are like 100 bytes or something, you might find yourself using a thousandth of a percent of the disk's bandwidth performance. And so people hate that. They hate buying something and only being able to use a thousandth of a percent of its capacity. Right?

New B-trees. Something's wrong with that title. So B-trees are really fast at doing range queries, because basically once you've brought a block in and you want the next item, chances are the next item's also on the same page. So once in a while you go over a page boundary, but mostly you're just reading stuff very fast.
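The seek-limited argument can be checked with back-of-envelope arithmetic. The 100 head movements per second and 100-byte records are from the talk; the sequential bandwidth figure is my assumption, since the talk doesn't give one.

```python
# Back-of-envelope check of seek-limited random-insert throughput.
# Assumptions: 100 random head movements/sec and 100-byte records
# (from the talk); ~100 MB/s sequential bandwidth (my assumption).
seeks_per_sec = 100
record_bytes = 100
seq_bandwidth = 100e6          # bytes/sec, assumed

random_write_rate = seeks_per_sec * record_bytes   # 10,000 bytes/sec
utilization = random_write_rate / seq_bandwidth
print(f"{utilization:.6%}")    # about 0.01% of the assumed bandwidth
```

With these assumptions the utilization comes out around a hundredth of a percent; the exact fraction depends on the bandwidth you assume, but either way it is a tiny sliver of what the disk can deliver sequentially.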
Oh, I know what this is about. When a B-tree's new and it's been constructed sequentially, it's also very fast. When it gets old, what happens is the blocks themselves get moved around on disk. They're not next to each other. And this is a problem that people have spent a lot of time trying to solve: as B-trees get older, their performance degrades. This aging problem-- I saw one report that suggested that something like 2% of all the money spent by corporations on IT is spent dumping and reloading their B-trees to try to make this problem go away. So that's a lot of money or pain or something.

Well, B-trees are optimal for doing lookups. If you just want to look something up, there's an old argument that says, gee, if you're going to have a tree structure, which is what you need in order to do next operations, then you're going to have some path through the B-tree which is a certain depth, and you do it optimally by having the fan-out be the block size. And everything works. But that argument of optimality is not actually true for insertion workloads.
And this is where the data structures work that I've done with Mike and Martin sort of gets to be an advantage. To see that B-trees aren't optimal for insertions, here's a data structure that's really good at insertions. What is the data structure? I'm just going to append to the end of a file. Right? So it's great. Basically, it doesn't matter what the keys are. I can insert data into this data structure at disk bandwidth. What's the disadvantage of this data structure?

AUDIENCE: Lookups?

BRADLEY KUSZMAUL: Lookups. So what is the disadvantage? Lookups aren't so good. What is the cost of doing a lookup?

AUDIENCE: Order N?

BRADLEY KUSZMAUL: Order N. Yeah. You have to look at everything. It requires a scan of the entire table. And we'll get into what the cost model is in a second. But basically, you have to look at everything. So it's order N. It turns out the number of blocks you have to read in, which is the thing you care about-- it's order N over B.
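A minimal sketch of the append log (my own illustration, not code from the lecture): inserts are sequential appends, so they run at disk bandwidth, but a lookup has to scan every record ever written.

```python
class AppendLog:
    """Append-only table: great at inserts, terrible at lookups."""

    def __init__(self):
        self.log = []                 # stand-in for a file on disk

    def insert(self, key, value):
        # Sequential append: amortized 1/B of a block I/O per record.
        self.log.append((key, value))

    def lookup(self, key):
        # Must scan the whole table -- O(N/B) block reads; last write wins.
        result = None
        for k, v in self.log:
            if k == key:
                result = v
        return result

db = AppendLog()
db.insert(3, "c")
db.insert(1, "a")
db.insert(3, "c2")                    # later write shadows the earlier one
print(db.lookup(3))                   # "c2", found only by scanning everything
```

The insert path never reads anything back, which is exactly why it runs at bandwidth; the price is paid entirely on the query side.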
So we'll get into a performance model in just a second. So here we are. We have two data structures: a B-tree, which is not so great at insertions-- it's quite good at point queries and quite good at range queries, especially when it's young-- and this other data structure, the append log, which is wonderful for insertions and really bad for queries. So can you do something that's like the best of all possible worlds? You can also imagine a data structure that's the worst of all possible worlds, but it turns out that there are data structures that do well on both, and I'll show you how one works in a minute.

So to explain how it works and to do the analysis, we need to have a cost model. And we got into this just a minute ago, with: what is the cost model for a table scan? Is it order N? Well, if you're only counting the number of CPU cycles that you're using up, it's order N, because you have to look at every item. But if what you really care about is the number of disk I/Os, then you just count up the number of blocks.
And so in that model, the cost is order N over B. And that's the model that we're going to use to do this analysis. So in this model, we aren't going to care about CPU cost. We are going to care about disk I/O. And that's a pretty good place to design in if you're an engineer, because right now the number of CPU cycles that you get for a dollar is going up. It's been going up. It's continuing to go up. You have to write parallel programs today to get that, but you get a lot of cycles in a $100 package. But the number of disk I/Os per second that you're getting is essentially unchanged. It's maybe improved by a factor of two in 40 years. So that's the one to optimize for-- the one that's not changing. And use all those CPU cycles, if you can, to do something. So the model here is that we're going to have a memory and a disk. And there's some block size, B, which we may or may not know. And it's actually quite tricky on real disk systems to figure out what the right block size is. It's not 500 bytes, because that's not going to be a good block size.
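In this disk-access-machine style of model, only block transfers between memory and disk count. A tiny illustration (my own) of the scan cost just discussed:

```python
import math

# Disk-access-machine cost model: CPU work is free, and the cost of an
# operation is the number of B-sized blocks moved between disk and memory.

def scan_cost(n_items, block_size):
    """Blocks read to scan a table of n unit-sized items: ceil(N/B)."""
    return math.ceil(n_items / block_size)

# Scanning a million-item table with 1,000-item blocks costs 1,000 block
# reads, no matter how many CPU cycles the per-item comparisons burn.
print(scan_cost(10**6, 10**3))   # 1000
```

This is why the table scan is "order N over B" rather than order N: the unit of cost is the block, not the item.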
It might be more like a megabyte. And when we move stuff back and forth, we're going to move a block at a time. We're going to bring in a whole block from disk, and when we have to write a block out, we write the whole block out. So we're just going to count that up. There are two parameters: the block size, B, and the memory size, M. If the memory is as big as the entire disk, then the problem goes away, and if the memory's way too small-- like you can only have one block-- then it's very difficult to get anything done. So you need to be able to hold several blocks' worth of storage. The memory is treated as a cache for the disk. So once we've brought a block in, we can keep using it for a while until we get rid of it. So have you guys done any cache-oblivious data structures? OK. So you've seen this model. So the game here is to minimize the number of disk I/Os and not worry about the CPU cycles. So here are the theoretical results. We'll start with a B-tree.
435 00:20:45,660 --> 00:20:48,480 So a B-tree which has a block size B-- 436 00:20:48,480 --> 00:20:51,270 and here I'm going to assume that the things you're storing 437 00:20:51,270 --> 00:20:52,490 are unit sized. 438 00:20:52,490 --> 00:20:54,800 Because you can do the analysis, but it gets more 439 00:20:54,800 --> 00:20:55,680 complicated. 440 00:20:55,680 --> 00:21:02,540 So the cost of a lookup, which is the upper right side, is 441 00:21:02,540 --> 00:21:06,640 log N over log B. That's the same as-- 442 00:21:13,040 --> 00:21:17,260 you may not be used to manipulating these, but 443 00:21:17,260 --> 00:21:20,800 usually people write this as log base B of N. But that's 444 00:21:20,800 --> 00:21:26,920 the same as log N over log B. And I'm going to write it this 445 00:21:26,920 --> 00:21:30,380 way, because then it's easier to compare things. 446 00:21:30,380 --> 00:21:35,010 So if B is 1,000 or something, then 447 00:21:35,010 --> 00:21:36,410 basically instead of paying-- 448 00:21:40,590 --> 00:21:43,920 just as an example, if N is, say, 2 to the 40th-- 449 00:21:46,490 --> 00:21:48,670 let's take these all to be lg's, because it 450 00:21:48,670 --> 00:21:50,290 basically doesn't matter. 451 00:21:50,290 --> 00:21:54,880 So it's 40 over log base B, and if B is, say, 2 to the 452 00:21:54,880 --> 00:22:02,710 10th, then that means that if you have a trillion items and 453 00:22:02,710 --> 00:22:06,780 you have a fan-out of 1,000, it takes you at most four disk 454 00:22:06,780 --> 00:22:11,880 I/Os to find any particular item. 455 00:22:11,880 --> 00:22:15,590 An insertion cost is the same, because to do an insertion, we 456 00:22:15,590 --> 00:22:18,820 have to find the leaf that the item should have been in, and 457 00:22:18,820 --> 00:22:20,070 then put it there. 458 00:22:22,350 --> 00:22:26,370 So the append log-- well, what's the cost of insertion? 459 00:22:26,370 --> 00:22:29,130 Well, we're appending away, right?
460 00:22:29,130 --> 00:22:32,630 And once every B items, we actually have to do a disk 461 00:22:32,630 --> 00:22:35,660 I/O. So the cost of an insertion in the 462 00:22:35,660 --> 00:22:38,120 append log isn't 0. 463 00:22:38,120 --> 00:22:42,590 It's one Bth of a block I/O per object. 464 00:22:42,590 --> 00:22:44,640 And the point query cost looks really bad. 465 00:22:44,640 --> 00:22:47,970 It's N over B, which we already discussed. 466 00:22:47,970 --> 00:22:51,760 So the fractal tree has this kind of performance. 467 00:22:51,760 --> 00:22:54,750 It's log N over-- 468 00:22:54,750 --> 00:22:58,370 it's not B, which would be really great. 469 00:22:58,370 --> 00:23:01,150 It's something smaller. 470 00:23:01,150 --> 00:23:06,450 It's maybe square root of B for the insertion cost. 471 00:23:06,450 --> 00:23:11,110 And the lookup cost is log N over something, which I'm 472 00:23:11,110 --> 00:23:13,630 going to just hide. 473 00:23:13,630 --> 00:23:20,750 Let's set epsilon to 1/2 and work out what that is. 474 00:23:20,750 --> 00:23:23,700 Because epsilon = 1/2 is a good engineering point. 475 00:23:23,700 --> 00:23:32,810 So the insertion cost is log N over B to the 1/2, which is 476 00:23:32,810 --> 00:23:40,275 log N over root B. And the other one, the lookup cost-- 477 00:23:45,540 --> 00:23:47,780 there are big-O's all around here, but I'm not going to 478 00:23:47,780 --> 00:23:49,800 draw those again-- over 1/2-- 479 00:23:49,800 --> 00:23:51,540 so I'm going to maybe ignore that-- 480 00:23:51,540 --> 00:24:02,370 of the log of the square root of B. Did I do that right? 481 00:24:02,370 --> 00:24:05,700 B to the 1 minus 1/2. 482 00:24:05,700 --> 00:24:08,180 Put the 1/2 back in to make you happy. 483 00:24:08,180 --> 00:24:10,670 So big-O of that-- 484 00:24:10,670 --> 00:24:12,350 well, what's log of root B?
485 00:24:16,850 --> 00:24:18,350 AUDIENCE: [INAUDIBLE] 486 00:24:18,350 --> 00:24:19,780 BRADLEY KUSZMAUL: I can't quite hear you, 487 00:24:19,780 --> 00:24:23,020 but I know the answer. 488 00:24:23,020 --> 00:24:26,020 I can just say that's the same as log B, when I'm doing big- 489 00:24:26,020 --> 00:24:27,640 O's. Get rid of the halves. 490 00:24:27,640 --> 00:24:33,890 So it's log N over log B. So if you sort of choose block 491 00:24:33,890 --> 00:24:39,530 sizes, if you set this parameter to be something 492 00:24:39,530 --> 00:24:41,720 where you're doing something with the square root, you end 493 00:24:41,720 --> 00:24:45,320 up having lookups that cost asymptotically the 494 00:24:45,320 --> 00:24:48,450 same as for a B-tree. 495 00:24:48,450 --> 00:24:49,780 But there are these constants in there. 496 00:24:49,780 --> 00:24:51,480 There's a factor of 4 or something that 497 00:24:51,480 --> 00:24:53,900 I've glossed over. 498 00:24:53,900 --> 00:24:56,160 But asymptotically, it's the same. 499 00:24:56,160 --> 00:24:59,290 And insertions have this much better performance. 500 00:24:59,290 --> 00:25:03,360 What if B was 1,000? 501 00:25:03,360 --> 00:25:05,090 Then we're dividing by 30 here. 502 00:25:05,090 --> 00:25:07,140 But B isn't really 1,000. 503 00:25:07,140 --> 00:25:09,900 B's more like a million in a modern system. 504 00:25:09,900 --> 00:25:14,940 So you actually get to divide by something like 1,000 here. 505 00:25:14,940 --> 00:25:18,610 And that's a huge advantage, to basically make insertions 506 00:25:18,610 --> 00:25:22,852 asymptotically be 1,000 times faster, whatever that means. 507 00:25:25,670 --> 00:25:27,720 When you actually work out the constants, perhaps it's a 508 00:25:27,720 --> 00:25:32,120 factor of 100, is what we see in practice. 509 00:25:32,120 --> 00:25:39,070 So this is basically working out those details. 510 00:25:39,070 --> 00:25:40,160 So here's an example.
511 00:25:40,160 --> 00:25:42,060 Here is a data structure that can achieve this kind of 512 00:25:42,060 --> 00:25:44,380 performance. 513 00:25:44,380 --> 00:25:46,940 It's a simple version of a streaming B-tree 514 00:25:46,940 --> 00:25:48,200 or a fractal tree. 515 00:25:48,200 --> 00:25:50,180 And what this data structure is-- 516 00:25:50,180 --> 00:25:55,340 so first of all, we're kind of going to switch modes from 517 00:25:55,340 --> 00:25:58,770 marketoid, or at least explaining what it's good for, 518 00:25:58,770 --> 00:26:00,640 to talking about what a data structure is that actually 519 00:26:00,640 --> 00:26:02,070 solves the problem. 520 00:26:02,070 --> 00:26:04,500 So any questions before we dive down that path? 521 00:26:04,500 --> 00:26:05,640 OK. 522 00:26:05,640 --> 00:26:07,210 So if there are any questions, stop me. 523 00:26:07,210 --> 00:26:12,700 Because I like to race through this stuff if possible. 524 00:26:12,700 --> 00:26:15,890 So the deal here is that you're going to have log N 525 00:26:15,890 --> 00:26:20,230 arrays, and each one is a power of two in size. 526 00:26:20,230 --> 00:26:21,920 And you're going to have one for each power of 2. 527 00:26:21,920 --> 00:26:25,230 So there's going to be one array of size 1, one of size 528 00:26:25,230 --> 00:26:31,620 2, one of size 4 and 8 and 16, all the way up to a trillion, 529 00:26:31,620 --> 00:26:32,870 2 to the 40th. 530 00:26:35,130 --> 00:26:37,920 The second invariant of this data structure is each array 531 00:26:37,920 --> 00:26:42,660 is either completely full or completely empty. 532 00:26:42,660 --> 00:26:45,140 And the third one is that each array is sorted. 533 00:26:49,780 --> 00:26:53,390 So I'll do an example here. 534 00:26:53,390 --> 00:26:57,920 If I have four elements in the array, and these are the 535 00:26:57,920 --> 00:27:00,670 numbers, there's only one way for me to put those in that 536 00:27:00,670 --> 00:27:03,160 satisfy all those requirements. 
537 00:27:03,160 --> 00:27:05,720 Because there's four items, it has to go into the 538 00:27:05,720 --> 00:27:06,660 array of size four. 539 00:27:06,660 --> 00:27:07,480 It has to fill it up. 540 00:27:07,480 --> 00:27:09,670 I can't have any other way of doing that. 541 00:27:09,670 --> 00:27:12,900 And within that array of size four, they have to be sorted. 542 00:27:12,900 --> 00:27:16,330 So those four elements uniquely go there, and that's 543 00:27:16,330 --> 00:27:18,390 the end of the story for where four elements go. 544 00:27:21,210 --> 00:27:24,710 If there's 10 elements, you get a little freedom, because, 545 00:27:24,710 --> 00:27:27,000 well, we have to fill up the 2 array and we have to fill up 546 00:27:27,000 --> 00:27:29,880 the 8 array, because there's only one way to write in 547 00:27:29,880 --> 00:27:34,820 binary 10, which is 1010. 548 00:27:34,820 --> 00:27:35,870 But we get a little choice. 549 00:27:35,870 --> 00:27:39,070 It turns out that the bottom array has to be sorted and the 550 00:27:39,070 --> 00:27:41,950 top array, the array containing 5 and 551 00:27:41,950 --> 00:27:42,810 10, has to be sorted. 552 00:27:42,810 --> 00:27:46,850 But we could have put the five down here and, say, swapped 553 00:27:46,850 --> 00:27:49,210 the 5 and the 6, and that would've been a perfectly 554 00:27:49,210 --> 00:27:52,800 valid data structure for this set of data as well. 555 00:27:52,800 --> 00:27:55,080 So we get a little bit of freedom. 556 00:27:55,080 --> 00:27:57,180 OK? 557 00:27:57,180 --> 00:27:59,310 So that's the basic data structure. 558 00:27:59,310 --> 00:28:00,910 So now what do we do? 559 00:28:00,910 --> 00:28:02,360 How do you search this data structure? 560 00:28:02,360 --> 00:28:06,890 Well, the idea is just to perform a binary search in 561 00:28:06,890 --> 00:28:08,140 each of the arrays.
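To make those invariants concrete, here is a small sketch (my own illustration, not code from the lecture) that lays N sorted keys out into power-of-two arrays according to the binary representation of N, and then searches by binary-searching every non-empty array:

```python
from bisect import bisect_left

def build_levels(keys):
    """Level k holds either 0 or 2**k keys: it is full exactly when
    bit k of len(keys) is set, and every level is kept sorted."""
    keys = sorted(keys)
    n, pos = len(keys), 0
    levels = [[] for _ in range(n.bit_length())]
    for k in reversed(range(n.bit_length())):  # fill the big levels first
        if n >> k & 1:
            levels[k] = keys[pos:pos + (1 << k)]
            pos += 1 << k
    return levels

def search(levels, key):
    # One binary search per non-empty level: O(log^2 N) in the worst case.
    for arr in levels:
        i = bisect_left(arr, key)
        if i < len(arr) and arr[i] == key:
            return True
    return False

levels = build_levels(range(10))       # 10 = 1010 in binary
print([len(a) for a in levels])        # [0, 2, 0, 8]
print(search(levels, 7))               # True
```

Note the freedom mentioned above: any split of the keys into sorted runs of the right sizes would do; this sketch just hands the smallest keys to the largest array.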
562 00:28:13,060 --> 00:28:15,660 The advantage of this is it works, and it's a lot faster 563 00:28:15,660 --> 00:28:17,750 than a table scan. 564 00:28:17,750 --> 00:28:20,670 The disadvantage is it's actually quite a bit slower 565 00:28:20,670 --> 00:28:26,380 than a B-tree, because if you do the analysis here, which in 566 00:28:26,380 --> 00:28:28,290 this class, you probably-- you've done things like master 567 00:28:28,290 --> 00:28:29,740 theorem and stuff, right? 568 00:28:29,740 --> 00:28:33,550 So you know what the cost of doing the search in the 569 00:28:33,550 --> 00:28:35,440 biggest array is, right? 570 00:28:35,440 --> 00:28:38,552 How many disk I/Os is that in the worst case? 571 00:28:38,552 --> 00:28:40,025 AUDIENCE: Log N. 572 00:28:40,025 --> 00:28:45,170 BRADLEY KUSZMAUL: It's log N. It's going to be log base 2 of 573 00:28:45,170 --> 00:28:48,740 N, plus or minus a little bit. 574 00:28:48,740 --> 00:28:51,720 Just ignore all that stuff. 575 00:28:51,720 --> 00:28:55,670 I'll just do L-O-G. So what's the size of doing the 576 00:28:55,670 --> 00:28:56,810 second-biggest array? 577 00:28:56,810 --> 00:28:58,690 What's the cost of searching the second-biggest array? 578 00:29:02,010 --> 00:29:10,930 It's half as big, so it's log of N over 2, right? 579 00:29:10,930 --> 00:29:12,180 I can't write. 580 00:29:14,640 --> 00:29:23,590 So this is log N. This is equal to log of N minus 1. 581 00:29:23,590 --> 00:29:26,010 What's the next array? 582 00:29:26,010 --> 00:29:27,720 What's the cost of searching the next biggest array? 583 00:29:32,150 --> 00:29:33,645 Log of N minus 2-- 584 00:29:38,570 --> 00:29:42,700 you add that up, and what's the sum? 585 00:29:42,700 --> 00:29:44,640 We don't even need recurrences for this. 586 00:29:44,640 --> 00:29:47,200 We could have done it that way, but what's the sum? 
587 00:29:47,200 --> 00:29:50,310 When we finally get down to 1, and you search the bottom 588 00:29:50,310 --> 00:29:54,150 array, you have to do one disk I/O in the worst case. 589 00:29:54,150 --> 00:29:58,380 So this is an arithmetic sequence, right? 590 00:29:58,380 --> 00:29:59,610 So what's the answer? 591 00:29:59,610 --> 00:30:04,680 Big-O. I'm not even going to ask for the-- 592 00:30:04,680 --> 00:30:05,160 pardon? 593 00:30:05,160 --> 00:30:05,805 AUDIENCE: Log squared? 594 00:30:05,805 --> 00:30:08,710 BRADLEY KUSZMAUL: Yes, it's log squared, which is right 595 00:30:08,710 --> 00:30:09,960 there in green. 596 00:30:11,720 --> 00:30:14,230 So basically, this thing is really expensive. 597 00:30:14,230 --> 00:30:18,900 Log squared N, when we were trying to match a B-tree, 598 00:30:18,900 --> 00:30:23,920 which is log N over log B. So not only is it not log base B, 599 00:30:23,920 --> 00:30:26,260 it's log base 2 or something. 600 00:30:26,260 --> 00:30:28,380 But it's squaring it. 601 00:30:28,380 --> 00:30:31,710 So if you think of having a million items in your data 602 00:30:31,710 --> 00:30:35,750 structure, even a relatively small one, log base 2 603 00:30:35,750 --> 00:30:37,720 of 1 million is 20. 604 00:30:37,720 --> 00:30:40,092 If you square that, that's 400. 605 00:30:40,092 --> 00:30:41,660 Maybe you get to divide by 2. 606 00:30:41,660 --> 00:30:43,730 It's hundreds of disk I/Os just to do a 607 00:30:43,730 --> 00:30:46,250 lookup, instead of four. 608 00:30:46,250 --> 00:30:49,440 So this is just sucking at this point. 609 00:30:49,440 --> 00:30:53,740 So let's put that aside and see if we can do insertion, 610 00:30:53,740 --> 00:30:56,510 since we are doing so badly at [? lookups. ?] 611 00:30:56,510 --> 00:31:00,400 So to make this easier to think about, I'm going to add 612 00:31:00,400 --> 00:31:02,750 another set of temporary arrays. 613 00:31:02,750 --> 00:31:05,270 So I'm actually going to have two arrays of each size.
614 00:31:05,270 --> 00:31:08,490 And the idea is at the beginning of each step, after 615 00:31:08,490 --> 00:31:12,000 doing an insertion, all the temporary arrays are empty. 616 00:31:12,000 --> 00:31:14,810 I'm only going to have arrays on the left side that are 617 00:31:14,810 --> 00:31:17,670 going to have data in them. 618 00:31:17,670 --> 00:31:20,290 So to insert 15 into this data structure, there's only one 619 00:31:20,290 --> 00:31:21,230 place to put it. 620 00:31:21,230 --> 00:31:25,110 I put it in the one array, if I'm trying to be lazy about 621 00:31:25,110 --> 00:31:26,170 how much work I want to do. 622 00:31:26,170 --> 00:31:29,780 And it turns out, this is exactly what you want to do. 623 00:31:29,780 --> 00:31:32,210 You have an empty one array, a new element comes in, 624 00:31:32,210 --> 00:31:33,460 just put it in there. 625 00:31:36,010 --> 00:31:37,520 Now I want to insert a 7. 626 00:31:37,520 --> 00:31:39,450 There's no place in the one array, so I'm going to put it 627 00:31:39,450 --> 00:31:42,870 in the one array over on the temp side. 628 00:31:42,870 --> 00:31:44,860 And then I'm going to merge the two one arrays 629 00:31:44,860 --> 00:31:47,790 to make a two array. 630 00:31:47,790 --> 00:31:52,310 So the 15 and the 7 become 7 and 15 here. 631 00:31:52,310 --> 00:31:53,630 I couldn't put it there because that 632 00:31:53,630 --> 00:31:55,010 array already was full. 633 00:31:55,010 --> 00:32:01,110 And then I merge those two to make a new four array. 634 00:32:01,110 --> 00:32:03,140 So this is the final result after 635 00:32:03,140 --> 00:32:06,332 inserting those two items. 636 00:32:06,332 --> 00:32:07,582 Does that make sense? 637 00:32:10,440 --> 00:32:13,300 It's not a hard data structure. 638 00:32:13,300 --> 00:32:16,900 So one insertion can cause a whole bunch of merges. 639 00:32:16,900 --> 00:32:19,200 Here we have sort of an animation.
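The carry chain just described is exactly like adding 1 to a binary counter. Here is a minimal self-contained sketch of that merge-and-carry insertion (again my own illustration, not the lecture's code):

```python
from heapq import merge  # merges already-sorted iterables lazily

def cola_insert(levels, key):
    """Insert one key into a list of sorted arrays where levels[k]
    is either empty or holds exactly 2**k keys.  Carry a merged run
    upward until an empty level absorbs it."""
    carry = [key]
    k = 0
    while True:
        if k == len(levels):
            levels.append([])               # grow: a brand-new empty level
        if not levels[k]:
            levels[k] = carry               # empty slot: drop the carry here
            return
        carry = list(merge(levels[k], carry))  # merge two sorted 2**k runs
        levels[k] = []
        k += 1

levels = []
cola_insert(levels, 15)
cola_insert(levels, 7)
print(levels)  # [[], [7, 15]] -- the two singletons merged into the 2-array
```

This mirrors the 15-then-7 example above: the second insertion finds the one array full, so the two singletons merge into the array of size two.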
640 00:32:19,200 --> 00:32:24,100 So here I've laid out the one array across the top, and then 641 00:32:24,100 --> 00:32:27,200 the temporary array just under it, and then going down, we 642 00:32:27,200 --> 00:32:30,930 have a sequence of steps for the data structure over time. 643 00:32:30,930 --> 00:32:33,170 So we have the whole arrays. 644 00:32:33,170 --> 00:32:35,180 The one and the two and the four and the eight arrays 645 00:32:35,180 --> 00:32:38,100 are all full, and we insert one more item, which 646 00:32:38,100 --> 00:32:39,720 causes a big carry. 647 00:32:39,720 --> 00:32:43,450 So the one creates a two, the two twos create two fours, the 648 00:32:43,450 --> 00:32:46,660 two fours and the eight create two eights, and so forth. 649 00:32:46,660 --> 00:32:47,550 So here you are. 650 00:32:47,550 --> 00:32:49,190 You're running. 651 00:32:49,190 --> 00:32:52,310 You've built up a terabyte of data. 652 00:32:52,310 --> 00:32:53,970 You insert one more item, and now you have to 653 00:32:53,970 --> 00:32:56,380 rewrite all of disk. 654 00:32:56,380 --> 00:32:59,860 So that also sounds a little unappealing. 655 00:32:59,860 --> 00:33:03,880 But we'll build on this to make a data structure that 656 00:33:03,880 --> 00:33:05,620 actually works. 657 00:33:05,620 --> 00:33:08,980 So first let's analyze what the average cost for this data 658 00:33:08,980 --> 00:33:10,210 structure is. 659 00:33:10,210 --> 00:33:12,860 I've just sort of explained why-- there are some really 660 00:33:12,860 --> 00:33:14,210 bad cases where you're doing an 661 00:33:14,210 --> 00:33:16,010 insertion and it's expensive. 662 00:33:16,010 --> 00:33:19,040 But on average, it turns out it's really good. 663 00:33:19,040 --> 00:33:24,430 And the reason is that merging of sorted arrays is really I/O 664 00:33:24,430 --> 00:33:29,170 efficient, because the merge is essentially operating on 665 00:33:29,170 --> 00:33:30,480 that append data structure.
666 00:33:30,480 --> 00:33:33,710 We're reading two append data structures and then writing 667 00:33:33,710 --> 00:33:35,790 the answer into another append data structure. 668 00:33:35,790 --> 00:33:39,870 And that does hardly any I/O. 669 00:33:39,870 --> 00:33:44,960 So if you have two arrays of size X, the cost to merge them 670 00:33:44,960 --> 00:33:47,320 is you have to read the two arrays and you have to write 671 00:33:47,320 --> 00:33:48,070 the new array. 672 00:33:48,070 --> 00:33:54,070 And you add it all up, and that's order X over B I/Os. 673 00:33:54,070 --> 00:33:55,950 Maybe it's 4X over B or something. 674 00:33:55,950 --> 00:34:00,210 But big-O of X over B. So the merge is efficient. 675 00:34:00,210 --> 00:34:05,770 The cost per element for the merge is 1 over B, because 676 00:34:05,770 --> 00:34:08,420 order X elements were merged when we did that. 677 00:34:08,420 --> 00:34:10,210 And we get to spread the cost. 678 00:34:10,210 --> 00:34:14,650 Sure, we had to rewrite a trillion items when we filled 679 00:34:14,650 --> 00:34:19,940 up our disk, but actually, when you divide that out over 680 00:34:19,940 --> 00:34:23,409 the trillion items, it's not that much cost per item. 681 00:34:23,409 --> 00:34:29,659 And so the cost for each item of that big operation is only 682 00:34:29,659 --> 00:34:31,900 1 over B disk I/Os. 683 00:34:31,900 --> 00:34:36,030 And each item only has to be rewritten log N times. 684 00:34:36,030 --> 00:34:41,130 So the total average cost for an insertion of one element is 685 00:34:41,130 --> 00:34:46,010 log N over B, which is actually better than what I 686 00:34:46,010 --> 00:34:46,900 promised here. 687 00:34:46,900 --> 00:34:48,010 But this data structure's going to be 688 00:34:48,010 --> 00:34:48,940 worse somewhere else. 689 00:34:48,940 --> 00:34:51,530 So this is a simplified version.
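The amortized accounting just described can be summarized in one line (with big-O's throughout, as in the lecture):

```latex
\underbrace{O\!\left(\tfrac{1}{B}\right)}_{\text{I/Os per element per merge}}
\;\times\;
\underbrace{\log N}_{\text{times each element is merged}}
\;=\;
O\!\left(\frac{\log N}{B}\right)\ \text{I/Os per insertion, amortized.}
```

Each element climbs through at most log N levels before it lands in the largest array, and each climb charges it 1/B of a block I/O.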
690 00:34:51,530 --> 00:34:53,179 I'll get to within-- 691 00:34:53,179 --> 00:34:56,120 ignoring epsilons and things, it'll be good enough. 692 00:34:56,120 --> 00:34:59,280 So does that analysis make sense? 693 00:34:59,280 --> 00:35:03,310 It's not hard analysis, so if it doesn't make sense, it's 694 00:35:03,310 --> 00:35:04,840 not because of you. 695 00:35:04,840 --> 00:35:07,080 It's got to be because I didn't explain it, because 696 00:35:07,080 --> 00:35:10,410 it's too easy to not understand. 697 00:35:15,910 --> 00:35:19,800 So if you're going to build something like this, you can't 698 00:35:19,800 --> 00:35:23,940 just say, oh, well, your database is great except once 699 00:35:23,940 --> 00:35:29,100 every couple days it hangs for an hour while we resort 700 00:35:29,100 --> 00:35:30,030 everything. 701 00:35:30,030 --> 00:35:32,700 So the fix for this is that we're going to get rid of the 702 00:35:32,700 --> 00:35:33,800 worst case. 703 00:35:33,800 --> 00:35:36,310 And the idea is, well, let's just have a separate thread 704 00:35:36,310 --> 00:35:38,510 that does the merging of the arrays. 705 00:35:38,510 --> 00:35:41,140 So we insert something into a temporary array and just 706 00:35:41,140 --> 00:35:42,780 return immediately. 707 00:35:42,780 --> 00:35:48,640 And as long as the merge thread gets to do at least log 708 00:35:48,640 --> 00:35:53,830 N moves every time we insert something, it can keep up. 709 00:35:53,830 --> 00:35:57,190 You could actually do a very careful dance, where I insert 710 00:35:57,190 --> 00:35:59,400 something, and part of the insertion is I have to move 711 00:35:59,400 --> 00:36:01,600 something from this array and something from this array and 712 00:36:01,600 --> 00:36:04,400 something from this array, and I can keep everything up to 713 00:36:04,400 --> 00:36:05,590 date that way.
714 00:36:05,590 --> 00:36:09,660 So it's not very hard to de-amortize this algorithm-- 715 00:36:09,660 --> 00:36:14,400 that is, to turn the algorithm from good average-case 716 00:36:14,400 --> 00:36:16,840 behavior to good worst-case behavior. 717 00:36:16,840 --> 00:36:21,827 The worst-case behavior just becomes that it has to do log 718 00:36:21,827 --> 00:36:25,830 N work for an insertion, which isn't so bad. 719 00:36:28,450 --> 00:36:30,660 Does that make sense? 720 00:36:30,660 --> 00:36:31,968 Yeah. 721 00:36:31,968 --> 00:36:34,872 AUDIENCE: Does that work if these are [INAUDIBLE] items 722 00:36:34,872 --> 00:36:35,840 [INAUDIBLE]? 723 00:36:35,840 --> 00:36:40,680 What if somebody wants [INAUDIBLE]? 724 00:36:40,680 --> 00:36:41,430 BRADLEY KUSZMAUL: Ah. 725 00:36:41,430 --> 00:36:44,070 Well, OK, so the question-- 726 00:36:44,070 --> 00:36:45,480 let me repeat it and see if-- 727 00:36:45,480 --> 00:36:49,440 so you're in the middle of doing these merges and you 728 00:36:49,440 --> 00:36:51,690 have a background thread doing that, say, and somebody comes 729 00:36:51,690 --> 00:36:53,480 along and wants to do a query. 730 00:36:53,480 --> 00:36:54,354 AUDIENCE: Yeah. 731 00:36:54,354 --> 00:36:55,230 [INAUDIBLE] 732 00:36:55,230 --> 00:36:57,220 BRADLEY KUSZMAUL: So the trick there is that you put a bit 733 00:36:57,220 --> 00:36:58,940 on the array that says, the new array is 734 00:36:58,940 --> 00:37:00,050 not ready to query. 735 00:37:00,050 --> 00:37:03,670 Keep using the old arrays, which are still there. 736 00:37:03,670 --> 00:37:05,270 Just don't destroy the old ones until 737 00:37:05,270 --> 00:37:07,980 the new one's ready. 738 00:37:07,980 --> 00:37:12,250 So basically you have these two one-megabyte-sized things. 739 00:37:12,250 --> 00:37:14,430 You're trying to make a two-megabyte-sized one.
740 00:37:14,430 --> 00:37:18,520 You leave the one-megabyte ones lying around for a while 741 00:37:18,520 --> 00:37:21,650 while you're incrementally moving things down. 742 00:37:21,650 --> 00:37:24,820 And then suddenly, when the big one's done, you flip the 743 00:37:24,820 --> 00:37:28,460 bits, so in order-one operations, you can say, no, 744 00:37:28,460 --> 00:37:30,960 those two are no longer valid, and this one's valid. 745 00:37:30,960 --> 00:37:34,130 So queries should use this one instead of those. 746 00:37:34,130 --> 00:37:38,400 So that's basically the kind of trick you might do. 747 00:37:38,400 --> 00:37:41,360 Or you would just search the partially constructed arrays, 748 00:37:41,360 --> 00:37:42,005 if you have locks. 749 00:37:42,005 --> 00:37:43,350 There's lots of ways to do it. 750 00:37:47,440 --> 00:37:48,890 So that's a pretty good question. 751 00:37:48,890 --> 00:37:50,830 Yes. 752 00:37:50,830 --> 00:37:53,380 That's one that we had to think about a little. 753 00:37:53,380 --> 00:37:56,940 So it sounds glib, but it's like, how do we do this? 754 00:37:56,940 --> 00:37:59,850 Any other questions? 755 00:37:59,850 --> 00:38:01,750 OK. 756 00:38:01,750 --> 00:38:06,340 So now we've got to do something about the search, 757 00:38:06,340 --> 00:38:08,620 because the search is really bad. 758 00:38:08,620 --> 00:38:12,380 Well, it's not as bad as the insertion worst-case thing. 759 00:38:12,380 --> 00:38:15,000 I'm going to show you how to shave off a factor of log N, 760 00:38:15,000 --> 00:38:17,840 and I don't think I'm going to show you how to shave off the 761 00:38:17,840 --> 00:38:22,010 factor of 1 over log B. So we'll just get it down to log 762 00:38:22,010 --> 00:38:26,150 N instead of log squared N. Because if I actually want to 763 00:38:26,150 --> 00:38:28,330 get it down, then I have to give up-- 764 00:38:28,330 --> 00:38:31,410 remember, the performance that I had was log of N over B.
If 765 00:38:31,410 --> 00:38:33,960 I actually want to get rid of things, I have to 766 00:38:33,960 --> 00:38:34,650 do something else. 767 00:38:34,650 --> 00:38:37,250 There's a lower-bound argument. 768 00:38:37,250 --> 00:38:41,850 So the idea here is we're searching-- 769 00:38:41,850 --> 00:38:47,985 I'm going to flip those. 770 00:38:50,490 --> 00:38:52,505 We've got these arrays of various sizes. 771 00:38:56,100 --> 00:38:59,020 And I've just done a binary search on here and then here 772 00:38:59,020 --> 00:39:01,750 and then here, and I found out the thing I'm looking for 773 00:39:01,750 --> 00:39:04,020 wasn't here and it wasn't here and it wasn't here. 774 00:39:04,020 --> 00:39:07,800 That's where it would have been, if it had been there. 775 00:39:07,800 --> 00:39:09,730 It should have been there but it wasn't. 776 00:39:09,730 --> 00:39:12,000 It should have been here but it wasn't. 777 00:39:12,000 --> 00:39:14,910 And then I'm going to start searching in this array. 778 00:39:14,910 --> 00:39:18,850 And the intuition you might have is that, gee, it's kind 779 00:39:18,850 --> 00:39:21,890 of wasteful to start a whole new search on this array when 780 00:39:21,890 --> 00:39:23,700 we already knew where it wasn't in this array. 781 00:39:26,370 --> 00:39:27,630 Right? 782 00:39:27,630 --> 00:39:31,320 So for example, if the data were uniformly randomly 783 00:39:31,320 --> 00:39:35,270 distributed, and the thing was, say, 1/3 of the array 784 00:39:35,270 --> 00:39:39,230 here, I might gain some advantage by searching at the 785 00:39:39,230 --> 00:39:42,770 1/3 point over here to see if it's there. 786 00:39:42,770 --> 00:39:45,180 Now, that's kind of an intuition. 787 00:39:45,180 --> 00:39:46,610 I don't know how to make that work. 788 00:39:46,610 --> 00:39:50,290 But I do know how to make something work. 
789 00:39:50,290 --> 00:39:53,620 But the intuition is, having done some search here, I 790 00:39:53,620 --> 00:39:56,850 should in principle have information about where to 791 00:39:56,850 --> 00:39:58,680 limit the search so that I don't have to search the whole 792 00:39:58,680 --> 00:40:01,220 thing on the next array. 793 00:40:03,770 --> 00:40:04,120 OK? 794 00:40:04,120 --> 00:40:08,920 And here's basically what you do: every element gets 795 00:40:08,920 --> 00:40:11,730 a forward pointer to where that element would go in the 796 00:40:11,730 --> 00:40:13,430 next array. 797 00:40:13,430 --> 00:40:15,960 So for example, you have something here and something 798 00:40:15,960 --> 00:40:19,300 here, which are the two things that are less than and greater 799 00:40:19,300 --> 00:40:21,620 than the thing you're looking for. 800 00:40:21,620 --> 00:40:24,700 And it says, oh, those should have gone 801 00:40:24,700 --> 00:40:27,850 here in the next array. 802 00:40:27,850 --> 00:40:30,140 So if you maintain that 803 00:40:30,140 --> 00:40:34,150 information, it's almost enough. 804 00:40:34,150 --> 00:40:37,950 But let's gloss over the almost part. 805 00:40:37,950 --> 00:40:41,910 If the destinations of those two pointers are close 806 00:40:41,910 --> 00:40:45,240 together, then you've saved a lot of 807 00:40:45,240 --> 00:40:46,490 search in the next array. 808 00:40:49,840 --> 00:40:51,660 Does anybody see a bug in this? 809 00:40:51,660 --> 00:40:52,805 There is one. 810 00:40:52,805 --> 00:40:55,450 The almost part. 811 00:40:55,450 --> 00:40:57,590 You don't have to see it, because I've been thinking 812 00:40:57,590 --> 00:40:58,840 about this a lot. 813 00:41:01,740 --> 00:41:05,460 The problem is, what if all of these items are less than all 814 00:41:05,460 --> 00:41:08,718 of these items, for example?
815 00:41:08,718 --> 00:41:13,240 In which case, these pointers all point down to the 816 00:41:13,240 --> 00:41:16,220 beginning, and we've got nothing. 817 00:41:16,220 --> 00:41:17,530 That's a case where this fails. 818 00:41:17,530 --> 00:41:19,870 And that's allowed, right? 819 00:41:19,870 --> 00:41:23,330 In particular, if we were inserting things-- 820 00:41:23,330 --> 00:41:24,070 yeah. 821 00:41:24,070 --> 00:41:26,890 AUDIENCE: Then we know the element is in the biggest 822 00:41:26,890 --> 00:41:28,926 array, because the element was supposed to 823 00:41:28,926 --> 00:41:30,650 go between the two. 824 00:41:30,650 --> 00:41:33,060 BRADLEY KUSZMAUL: Ah, but in this array, we found out that 825 00:41:33,060 --> 00:41:37,700 it's above the last element, when we did our search, right? 826 00:41:37,700 --> 00:41:39,660 That's one of the possible ways-- 827 00:41:39,660 --> 00:41:42,250 the worst-case behavior is we've got something where this 828 00:41:42,250 --> 00:41:43,870 array is less than this array. 829 00:41:43,870 --> 00:41:44,890 We're looking for that item. 830 00:41:44,890 --> 00:41:49,030 So we do a binary search and find out, it's over here. 831 00:41:49,030 --> 00:41:51,220 And it doesn't help to special case this or something, 832 00:41:51,220 --> 00:41:53,840 because they could be all to the right or they could be all 833 00:41:53,840 --> 00:41:55,190 bunched up in funny ways. 834 00:41:55,190 --> 00:41:58,930 There's lots of screwy ways that this could go wrong. 835 00:41:58,930 --> 00:42:01,710 But the simple version, it's easy to come up with an 836 00:42:01,710 --> 00:42:03,580 example, which is everything's to the left. 837 00:42:03,580 --> 00:42:04,416 Yeah. 838 00:42:04,416 --> 00:42:10,248 AUDIENCE: Can you still save time by, when you do the 839 00:42:10,248 --> 00:42:12,678 binary search on the smallest array-- but I guess you'd want 840 00:42:12,678 --> 00:42:13,650 [INAUDIBLE] 841 00:42:13,650 --> 00:42:14,136 search. 
842 00:42:14,136 --> 00:42:15,870 It will help reduce the cost, which gives you the 843 00:42:15,870 --> 00:42:18,315 next one and so on? 844 00:42:18,315 --> 00:42:19,370 BRADLEY KUSZMAUL: Yeah. 845 00:42:19,370 --> 00:42:22,950 So there is a way to fix it so that the pointers in the 846 00:42:22,950 --> 00:42:26,480 smaller array do help you reduce the 847 00:42:26,480 --> 00:42:28,010 cost in the next array. 848 00:42:28,010 --> 00:42:32,050 And that is to seed the smaller array with some values 849 00:42:32,050 --> 00:42:32,850 from the next array. 850 00:42:32,850 --> 00:42:35,760 Like, suppose I put in every 20th item, and I stuck it in 851 00:42:35,760 --> 00:42:37,790 that array with a bit on it that says, oh, this is a 852 00:42:37,790 --> 00:42:41,100 repeat, it's going to be repeated again. 853 00:42:41,100 --> 00:42:44,630 So then I could guarantee that there's these dummies that I 854 00:42:44,630 --> 00:42:50,510 throw in here, which are evenly spaced, plus whatever 855 00:42:50,510 --> 00:42:51,150 else is in there. 856 00:42:51,150 --> 00:42:54,430 So put the other things in there, and they have forward 857 00:42:54,430 --> 00:42:55,900 pointers too. 858 00:42:55,900 --> 00:43:02,310 And now I'm guaranteed that the distance between two 859 00:43:02,310 --> 00:43:07,292 adjacent items is a constant. 860 00:43:07,292 --> 00:43:10,430 Does that make sense? 861 00:43:10,430 --> 00:43:16,040 The trick is to make it so that having found two adjacent 862 00:43:16,040 --> 00:43:17,820 items that bracket the thing you want-- 863 00:43:17,820 --> 00:43:21,570 then on the next array, the image of those two items is 864 00:43:21,570 --> 00:43:25,100 separated by at most 20 items. 865 00:43:25,100 --> 00:43:29,890 And so that gets you down to only log of N instead of log 866 00:43:29,890 --> 00:43:33,280 squared of N, because you're searching a constant number of items in 867 00:43:33,280 --> 00:43:35,535 this array, and there's only log N arrays.
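That seeding trick, a simplified flavor of fractional cascading, can be sketched as follows. This is my own illustration, not the lecture's code, and the names and the spacing constant (`every`) are made up for the example:

```python
from bisect import bisect_left

def seed_level(small, big, every=4):
    """Combine `small` with dummy copies of every `every`-th key of the
    bigger array; each entry keeps a forward pointer (an index into
    `big`) saying where its key would land in the next array."""
    keys = sorted(set(small) | set(big[::every]))
    return [(k, bisect_left(big, k)) for k in keys]

def cascade_search(level, big, key):
    """Follow the forward pointers of the two entries bracketing `key`,
    so only a short gap of `big` needs to be binary-searched."""
    ks = [k for k, _ in level]
    i = bisect_left(ks, key)
    lo = level[i - 1][1] if i > 0 else 0
    hi = level[i][1] if i < len(level) else len(big)
    j = bisect_left(big, key, lo, hi)   # the gap is O(every) keys wide
    return j < len(big) and big[j] == key

big = list(range(0, 100, 2))            # the next, bigger sorted array
level = seed_level([5, 33], big)
print(cascade_search(level, big, 40))   # True
print(cascade_search(level, big, 33))   # False
```

Because the dummies are evenly spaced, adjacent entries of `level` always point at nearby positions of `big`, which is exactly the guarantee that shrinks each per-array search to constant work.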
868 00:43:38,280 --> 00:43:39,110 Yeah. 869 00:43:39,110 --> 00:43:42,601 AUDIENCE: Doesn't that slow down the merging of the arrays? 870 00:43:42,601 --> 00:43:45,720 BRADLEY KUSZMAUL: Not asymptotically. 871 00:43:45,720 --> 00:43:48,390 Because asymptotically, what this means-- 872 00:43:48,390 --> 00:43:50,860 if I'm going to build that array, so I'm going to merge 873 00:43:50,860 --> 00:43:53,350 two arrays to make this array, I have to do an additional 874 00:43:53,350 --> 00:43:55,890 scan of this other array as I'm constructing this one. 875 00:43:55,890 --> 00:44:00,080 So the picture is I have two arrays, and I'm trying to 876 00:44:00,080 --> 00:44:02,390 merge them into this array. 877 00:44:02,390 --> 00:44:08,230 And I'm trying to also insert these dummy forward pointers 878 00:44:08,230 --> 00:44:11,770 from the next array, which is only twice as big. 879 00:44:11,770 --> 00:44:15,751 So the big O's are, if it's X, instead of it being 1, 2, 3, 880 00:44:15,751 --> 00:44:18,550 4X, it's 8X. 881 00:44:18,550 --> 00:44:19,800 So it's only a constant. 882 00:44:23,320 --> 00:44:25,960 So basically, I can read all three of these. 883 00:44:25,960 --> 00:44:29,660 I can read an array and the next one and the next array, 884 00:44:29,660 --> 00:44:31,460 which is twice as big, and the next array which is 885 00:44:31,460 --> 00:44:32,740 four times as big. 886 00:44:32,740 --> 00:44:36,510 It all adds up to 8 times the size of the original array. 887 00:44:36,510 --> 00:44:38,330 So at least the asymptotics aren't messed up. 888 00:44:38,330 --> 00:44:40,770 Maybe the engineer in you goes, bleh, I have to read the 889 00:44:40,770 --> 00:44:42,200 data eight times. 890 00:44:42,200 --> 00:44:48,320 But remember, the game here is not to get 100% of the disk's 891 00:44:48,320 --> 00:44:50,690 insertion capacity. 892 00:44:50,690 --> 00:44:55,200 That's not the game, going back to the marketing 893 00:44:55,200 --> 00:44:56,100 perspective. 
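The merge schedule under discussion is the binary-counter pattern: level i holds either nothing or a sorted run of 2^i items, and an insertion "carries" by merging equal-sized runs upward. A toy in-memory sketch-- illustrative, not the real on-disk code-- that also counts how often keys get rewritten:

```python
import math

def cola_insert(levels, key, moves):
    """levels[i] is either None or a sorted run of 2**i keys.  Like
    incrementing a binary counter, an insert 'carries': equal-sized runs
    are merged upward until a free slot is found.  moves[0] counts how
    many times keys are rewritten -- the analogue of merge I/O."""
    carry = [key]
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:
            levels[i] = carry          # the free (or cheap) insertion
            return
        run, levels[i] = levels[i], None
        merged = []
        while run or carry:            # plain two-way merge of sorted runs
            src = run if (run and (not carry or run[0] <= carry[0])) else carry
            merged.append(src.pop(0))
        moves[0] += len(merged)
        carry = merged
        i += 1

levels, moves, n = [], [0], 1024
for k in range(n):
    cola_insert(levels, k, moves)
print(moves[0] / n, math.log2(n))  # average rewrites per key vs log2(n)
```

Each key ends up rewritten about log2(n) times in total, which is exactly the constant-factor accounting being juggled above.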
894 00:44:56,100 --> 00:45:00,470 The competition is only getting 0.001% 895 00:45:00,470 --> 00:45:01,720 of the disk's capacity. 896 00:45:04,700 --> 00:45:07,640 That's what a B-tree gets in the worst case. 897 00:45:07,640 --> 00:45:11,730 And so we don't have to get 100% to be three orders of 898 00:45:11,730 --> 00:45:15,680 magnitude better, which is where we are. 899 00:45:15,680 --> 00:45:18,650 So it turns out that for this kind of thing, we end up 900 00:45:18,650 --> 00:45:23,370 getting 1% of the disk's capacity, and everybody's 901 00:45:23,370 --> 00:45:26,310 jumping around saying that's great, because it's 1,000 902 00:45:26,310 --> 00:45:27,560 times faster. 903 00:45:30,630 --> 00:45:34,070 And why do we only get 1%? 904 00:45:34,070 --> 00:45:39,100 Well, there's a factor of two here and there's a log N over 905 00:45:39,100 --> 00:45:46,050 there, and you divide all that, and it's a constant. 906 00:45:46,050 --> 00:45:50,100 It's a challenge, because the engineers at Tokutek are 907 00:45:50,100 --> 00:45:53,310 always having ideas for how to make it faster. 908 00:45:53,310 --> 00:45:57,260 And right now, making this data structure faster is not 909 00:45:57,260 --> 00:45:59,330 the thing that's going to make people buy it. 910 00:45:59,330 --> 00:46:01,390 Because it's already 1,000 times faster than the 911 00:46:01,390 --> 00:46:02,780 competition. 912 00:46:02,780 --> 00:46:06,140 What's going to make it sell is some other thing that adds 913 00:46:06,140 --> 00:46:08,180 features that make it so it's easy to use. 914 00:46:08,180 --> 00:46:09,980 So I keep having to say-- 915 00:46:09,980 --> 00:46:13,760 no, you really need to work on making it so that we can do 916 00:46:13,760 --> 00:46:17,370 backups, or something. 917 00:46:17,370 --> 00:46:20,820 It turns out, if you're selling a database, you need 918 00:46:20,820 --> 00:46:22,930 to do more than just queries and insertions. 
919 00:46:22,930 --> 00:46:24,290 You need to be able to do backups. 920 00:46:24,290 --> 00:46:26,360 You need to be able to recover from a crash. 921 00:46:26,360 --> 00:46:33,540 You need to be able to cope with the problem of some 922 00:46:33,540 --> 00:46:36,820 particularly heavy query that's going and starving all 923 00:46:36,820 --> 00:46:39,300 the other queries from getting their work done. 924 00:46:39,300 --> 00:46:42,460 All those problems turn out to be the problems that, if you 925 00:46:42,460 --> 00:46:44,460 do any of them badly, people won't buy you. 926 00:46:44,460 --> 00:46:49,710 And so I suspect that there's another factor of 10 to be 927 00:46:49,710 --> 00:46:54,650 gotten over this data structure, if you were to sit 928 00:46:54,650 --> 00:46:57,090 down and try to say, how could I make it be the fastest 929 00:46:57,090 --> 00:46:58,510 possible thing. 930 00:46:58,510 --> 00:47:02,110 And someday, that work will have to be done, because the 931 00:47:02,110 --> 00:47:03,900 competition will have it and we won't. 932 00:47:07,370 --> 00:47:09,270 So let's see. 933 00:47:09,270 --> 00:47:10,570 I mentioned some of these just now. 934 00:47:10,570 --> 00:47:14,010 So some of the things you have to do in order to have an 935 00:47:14,010 --> 00:47:17,630 industrial strength dictionary are you need to cope with 936 00:47:17,630 --> 00:47:20,240 variable-size rows. 937 00:47:20,240 --> 00:47:22,320 Now we assumed for the analysis that the rows were 938 00:47:22,320 --> 00:47:23,240 all unit size. 939 00:47:23,240 --> 00:47:25,630 In fact, database rows vary in size. 940 00:47:25,630 --> 00:47:26,670 And some of them are huge. 941 00:47:26,670 --> 00:47:28,550 Some of them are megabytes. 942 00:47:28,550 --> 00:47:31,670 Or sometimes people do things like they put satellite images 943 00:47:31,670 --> 00:47:33,280 into databases. 944 00:47:33,280 --> 00:47:36,310 So they end up having very large rows. 
945 00:47:36,310 --> 00:47:38,295 You have to do deletions as well as insertions. 946 00:47:41,000 --> 00:47:43,220 And it turns out we can do deletions just as fast as 947 00:47:43,220 --> 00:47:45,060 insertions. 948 00:47:45,060 --> 00:47:47,840 And the idea there is basically, if you want to do a 949 00:47:47,840 --> 00:47:53,310 delete, you just insert the thing with a bit on it 950 00:47:53,310 --> 00:47:55,240 that says, hey, this is really a deletion. 951 00:47:55,240 --> 00:47:57,660 And then, whenever you get a chance, when you're doing a 952 00:47:57,660 --> 00:48:01,490 merge, if you find something that has the same value, you 953 00:48:01,490 --> 00:48:03,210 just annihilate it. 954 00:48:03,210 --> 00:48:07,700 And the delete has to keep going down, because there 955 00:48:07,700 --> 00:48:09,640 might be more copies of it further 956 00:48:09,640 --> 00:48:11,600 down that were shadowed. 957 00:48:11,600 --> 00:48:15,480 And eventually, when you finally do the last merge, 958 00:48:15,480 --> 00:48:19,320 that tombstone goes away. 959 00:48:19,320 --> 00:48:21,460 You have to do transactions and logging. 960 00:48:21,460 --> 00:48:23,610 You have to do crash recovery. 961 00:48:23,610 --> 00:48:26,230 And it's a big pain to get that right, and a lot of 962 00:48:26,230 --> 00:48:29,980 companies have foundered when they tried to move from one 963 00:48:29,980 --> 00:48:31,320 mode to the other. 964 00:48:31,320 --> 00:48:33,850 How many of you have experienced the phenomenon that 965 00:48:33,850 --> 00:48:37,030 your file system didn't come back properly after a crash? 966 00:48:40,410 --> 00:48:42,050 You see the difference in age here. 967 00:48:42,050 --> 00:48:46,530 They're all using file systems that have transactional 968 00:48:46,530 --> 00:48:48,490 logging underneath them. 969 00:48:48,490 --> 00:48:50,714 When's the last time it happened? 970 00:48:50,714 --> 00:48:51,470 AUDIENCE: Tuesday. 
971 00:48:51,470 --> 00:48:53,430 BRADLEY KUSZMAUL: Tuesday. 972 00:48:53,430 --> 00:48:56,994 So the difference is you're paying attention and they're 973 00:48:56,994 --> 00:48:59,304 not, right? 974 00:48:59,304 --> 00:49:01,235 AUDIENCE: [INAUDIBLE] disk failure. 975 00:49:01,235 --> 00:49:02,660 BRADLEY KUSZMAUL: Disk failure. 976 00:49:02,660 --> 00:49:03,920 That's a different problem. 977 00:49:03,920 --> 00:49:06,160 AUDIENCE: Caching is not [? finalized. ?] 978 00:49:06,160 --> 00:49:07,360 BRADLEY KUSZMAUL: Yeah. 979 00:49:07,360 --> 00:49:09,650 You say everybody's running with their disk 980 00:49:09,650 --> 00:49:11,520 cache turned on. 981 00:49:11,520 --> 00:49:14,150 And on some file systems, that's a bad idea. 982 00:49:14,150 --> 00:49:18,520 So we're still suffering that it's been difficult to switch 983 00:49:18,520 --> 00:49:23,840 from the original Unix file system, which is 30 years old 984 00:49:23,840 --> 00:49:26,490 and wasn't designed to recover from a crash. 985 00:49:26,490 --> 00:49:30,870 You have to run fsck, and it doesn't always work. 986 00:49:30,870 --> 00:49:32,660 We still have file systems that don't 987 00:49:32,660 --> 00:49:33,800 recover from crashes. 988 00:49:33,800 --> 00:49:36,305 So you can see why that could be difficult. 989 00:49:40,540 --> 00:49:43,450 It turns out that one common use case is that the data is 990 00:49:43,450 --> 00:49:47,580 coming in sequentially, and this data structure just sucks 991 00:49:47,580 --> 00:49:51,060 compared to a B-tree in the case where you're inserting 992 00:49:51,060 --> 00:49:52,970 things and the data actually is already 993 00:49:52,970 --> 00:49:54,110 sorted as it's inserted. 994 00:49:54,110 --> 00:49:56,220 Because this is moving things all around and moving things 995 00:49:56,220 --> 00:49:56,710 all around. 996 00:49:56,710 --> 00:49:59,580 And it's like, why didn't you just notice that it's sorted 997 00:49:59,580 --> 00:50:02,000 and put it in? 
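The delete-as-insert scheme from a moment ago-- a "tombstone" bit that annihilates older copies as merges happen, and itself survives until the final merge-- can be sketched as a single merge step. This is an illustrative reconstruction, not the product's actual code; runs are sorted lists of (key, is_tombstone) pairs, newest run first, at most one entry per key in each run:

```python
def merge_with_tombstones(newer, older, is_last_level):
    """Merge two sorted runs of (key, is_tombstone) pairs, newest first.
    A tombstone annihilates the older copy of its key and keeps flowing
    down; only at the last merge does the tombstone itself go away."""
    out, i, j = [], 0, 0
    while i < len(newer) or j < len(older):
        if j == len(older) or (i < len(newer) and newer[i][0] <= older[j][0]):
            key, dead = newer[i]
            i += 1
            if j < len(older) and older[j][0] == key:
                j += 1                 # the shadowed older copy is annihilated
        else:
            key, dead = older[j]
            j += 1
        if dead and is_last_level:
            continue                   # the tombstone goes away at the bottom
        out.append((key, dead))
    return out

newer = [(2, False), (5, True), (9, False)]   # (key, is_tombstone)
older = [(1, False), (5, False), (7, True)]
print(merge_with_tombstones(newer, older, is_last_level=False))
print(merge_with_tombstones(newer, older, is_last_level=True))
```

Note that the tombstone for key 5 kills the older copy of 5 but is still emitted on intermediate merges, in case yet older copies are shadowed further down.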
998 00:50:02,000 --> 00:50:06,470 You have to get rid of the log base 2 to get it down to log 999 00:50:06,470 --> 00:50:10,620 base B of N instead of log base 2 of N for search costs. 1000 00:50:10,620 --> 00:50:16,060 Because people in fact do a lot more searches than-- 1001 00:50:16,060 --> 00:50:18,200 if you have to choose which to do better, you want to 1002 00:50:18,200 --> 00:50:20,260 generally do searches better. 1003 00:50:20,260 --> 00:50:22,550 And compression turns out to be important. 1004 00:50:22,550 --> 00:50:29,250 I had one customer who had a database 1005 00:50:29,250 --> 00:50:31,720 which was 300 gigabytes. 1006 00:50:31,720 --> 00:50:34,910 He had a whole bunch of servers, and on each server, 1007 00:50:34,910 --> 00:50:37,260 he had a 300 gigabyte database. 1008 00:50:37,260 --> 00:50:40,210 And with us, it was 70 1009 00:50:40,210 --> 00:50:42,780 gigabytes, because we compress. 1010 00:50:42,780 --> 00:50:45,380 And we just do simple compression of, basically, 1011 00:50:45,380 --> 00:50:46,200 large blocks. 1012 00:50:46,200 --> 00:50:48,885 When we do I/Os, we do I/Os of like a megabyte. 1013 00:50:51,480 --> 00:50:54,460 So when we take one of those megabytes, we compress it. 1014 00:50:54,460 --> 00:50:57,790 And it's a big advantage to compress a megabyte at a time, 1015 00:50:57,790 --> 00:50:59,020 instead of what-- 1016 00:50:59,020 --> 00:51:02,440 a lot of B-trees, they have maybe 16 kilobytes. 1017 00:51:02,440 --> 00:51:05,580 And gzip hardly gets a chance to get anywhere when you only 1018 00:51:05,580 --> 00:51:07,700 have 16 kilobytes. 1019 00:51:07,700 --> 00:51:09,960 And it gets down to 12 kilobytes. 1020 00:51:09,960 --> 00:51:13,000 But if you have a megabyte to work with and you compress it, 1021 00:51:13,000 --> 00:51:14,610 particularly if it's sorted-- 1022 00:51:14,610 --> 00:51:17,940 so this is a megabyte of data that's sorted, so compression 1023 00:51:17,940 --> 00:51:19,650 works pretty well on sorted data. 
1024 00:51:19,650 --> 00:51:22,900 So you get factors of 5 or 10 or something. 1025 00:51:22,900 --> 00:51:26,950 And so we asked him to dump the data without the indexes, 1026 00:51:26,950 --> 00:51:30,680 so just the primary table with no indexes, and then run that 1027 00:51:30,680 --> 00:51:31,780 through gzip. 1028 00:51:31,780 --> 00:51:33,210 And it was 50 gigabytes. 1029 00:51:33,210 --> 00:51:36,930 So the smallest he could store the raw data was 50 gigabytes, 1030 00:51:36,930 --> 00:51:39,140 and we were giving him a useful database that was 70 1031 00:51:39,140 --> 00:51:41,370 gigabytes that had a bunch of indexes. 1032 00:51:41,370 --> 00:51:42,620 So he was like, yeah. 1033 00:51:45,150 --> 00:51:47,360 And you have to deal with multithreading and lots of 1034 00:51:47,360 --> 00:51:48,560 clients and stuff. 1035 00:51:48,560 --> 00:51:50,990 So here's an example. 1036 00:51:50,990 --> 00:51:53,350 We worked with Mark Callahan, who was at Google at the 1037 00:51:53,350 --> 00:51:55,140 time-- he's now at Facebook-- 1038 00:51:55,140 --> 00:51:57,190 on trying to come up with some benchmarks, because none of 1039 00:51:57,190 --> 00:52:03,650 the benchmarks out in the world do a good job of 1040 00:52:03,650 --> 00:52:06,820 measuring this insertion performance problem. 1041 00:52:06,820 --> 00:52:09,860 So iiBench is an insertion benchmark. 1042 00:52:09,860 --> 00:52:14,170 And basically what it does is it sets up a database with 1043 00:52:14,170 --> 00:52:18,210 three indexes, and the indexes are random. 1044 00:52:18,210 --> 00:52:21,290 So it's actually harder than real workloads. 1045 00:52:21,290 --> 00:52:24,040 This workload, you basically have a row and then you create 1046 00:52:24,040 --> 00:52:26,930 a random key to point into that from 1047 00:52:26,930 --> 00:52:28,230 three different places. 
1048 00:52:28,230 --> 00:52:31,620 Real databases, it turns out, probably have more of a 1049 00:52:31,620 --> 00:52:32,620 Zipfian distribution. 1050 00:52:32,620 --> 00:52:36,660 Have you talked at all about Zipfian distributions of data? 1051 00:52:36,660 --> 00:52:37,850 So this is sort of an interesting thing. 1052 00:52:37,850 --> 00:52:49,810 If you're dealing with real-world caches, you should 1053 00:52:49,810 --> 00:52:54,310 know that data ain't uniformly randomly distributed. 1054 00:52:54,310 --> 00:52:55,970 That's a poor model. 1055 00:52:55,970 --> 00:53:03,150 So in particular, suppose I have memory and disk, and this 1056 00:53:03,150 --> 00:53:06,410 is 10% of the disk. 1057 00:53:06,410 --> 00:53:07,800 Very simple situation. 1058 00:53:07,800 --> 00:53:09,660 Very common ratio. 1059 00:53:09,660 --> 00:53:12,190 You'll see, this is how Facebook 1060 00:53:12,190 --> 00:53:13,440 sets up their databases. 1061 00:53:15,550 --> 00:53:19,810 They'll have a 300-gigabyte database and 30 gigs of RAM. 1062 00:53:19,810 --> 00:53:25,710 If the queries that you wanted to do were random, it wouldn't 1063 00:53:25,710 --> 00:53:27,790 matter what data structure you were using. 1064 00:53:27,790 --> 00:53:31,520 Let's suppose that God tells you where it is on disk, so 1065 00:53:31,520 --> 00:53:32,430 you don't have to find it. 1066 00:53:32,430 --> 00:53:34,680 You just have to move the disk head and move it. 1067 00:53:34,680 --> 00:53:37,310 If they're random, then basically, no matter what 1068 00:53:37,310 --> 00:53:41,510 you've done, 90% of the queries you have to do are 1069 00:53:41,510 --> 00:53:45,360 going to do a random disk I/O. 10% are going to already be 1070 00:53:45,360 --> 00:53:48,090 there, because you got lucky. 1071 00:53:48,090 --> 00:53:51,980 So that is not reflecting what's going on on any 1072 00:53:51,980 --> 00:53:53,350 workload that I know. 
1073 00:53:53,350 --> 00:53:57,570 What they'll see is more like 99% of the queries hit here, 1074 00:53:57,570 --> 00:54:03,150 and 1% go out here, or maybe 95% here and 5% go out there. 1075 00:54:03,150 --> 00:54:06,540 And it turns out that, for a lot of things, there is 1076 00:54:06,540 --> 00:54:10,710 a model of what's going on called a Zipfian distribution. 1077 00:54:10,710 --> 00:54:12,525 This would be a random uniform distribution. 1078 00:54:12,525 --> 00:54:15,860 It's like every item has equal probability of being chosen 1079 00:54:15,860 --> 00:54:17,460 for a query. 1080 00:54:17,460 --> 00:54:23,810 It turns out that for things like what's the popularity of 1081 00:54:23,810 --> 00:54:30,590 web pages, or if you have a library, what's the frequency 1082 00:54:30,590 --> 00:54:32,610 at which words appear in the library-- 1083 00:54:32,610 --> 00:54:36,880 so words like "the" appear frequently, and words like 1084 00:54:36,880 --> 00:54:39,940 "polymorphic" are less frequent. 1085 00:54:39,940 --> 00:54:44,520 So Zipf came up with this model, and there's a simple 1086 00:54:44,520 --> 00:54:47,650 version of the model, which says that the most popular 1087 00:54:47,650 --> 00:54:52,040 word has probability proportional to 1. 1088 00:54:52,040 --> 00:54:52,990 It's not going to be 1. 1089 00:54:52,990 --> 00:54:54,970 It's going to be proportional to 1. 1090 00:54:54,970 --> 00:54:58,330 The second most popular word is going to have 1/2. 1091 00:54:58,330 --> 00:55:02,580 The third most popular word is 1/3. 1092 00:55:02,580 --> 00:55:06,160 And the fourth most popular word is 1/4 the probability of 1093 00:55:06,160 --> 00:55:09,240 the first word, and so forth. 1094 00:55:09,240 --> 00:55:13,260 So if you plot this distribution, it kind of looks 1095 00:55:13,260 --> 00:55:14,440 like this, right? 1096 00:55:14,440 --> 00:55:17,590 It's like 1 over x. 
1097 00:55:17,590 --> 00:55:20,930 And what would you tell me if I told you that I had an 1098 00:55:20,930 --> 00:55:25,040 infinite universe of objects that had a probability 1099 00:55:25,040 --> 00:55:29,690 distribution like this? 1100 00:55:29,690 --> 00:55:32,740 Does that seem plausible? 1101 00:55:32,740 --> 00:55:33,300 Why? 1102 00:55:33,300 --> 00:55:34,796 You're saying no. 1103 00:55:34,796 --> 00:55:36,046 AUDIENCE: [INAUDIBLE PHRASE] 1104 00:55:40,508 --> 00:55:41,460 try adding them all together. 1105 00:55:41,460 --> 00:55:42,900 BRADLEY KUSZMAUL: If you add them all 1106 00:55:42,900 --> 00:55:45,520 together, it doesn't converge. 1107 00:55:45,520 --> 00:55:48,140 So it's a heavy-tailed distribution. 1108 00:55:48,140 --> 00:55:55,430 So it turns out that if you sum these up, the sum 1109 00:55:55,430 --> 00:56:05,540 from i equals 1 to n of 1 over i, it's the nth harmonic number. 1110 00:56:05,540 --> 00:56:07,030 And that grows with n. 1111 00:56:07,030 --> 00:56:09,870 It's basically like the integral under this curve, 1112 00:56:09,870 --> 00:56:14,410 from 1 to n. 1113 00:56:14,410 --> 00:56:15,910 It's close to that. 1114 00:56:15,910 --> 00:56:17,190 And what is the integral of that? 1115 00:56:20,900 --> 00:56:23,170 It's like something you learned seven years ago, and 1116 00:56:23,170 --> 00:56:24,535 now you've forgotten, right? 1117 00:56:24,535 --> 00:56:26,590 You learned it when you were sophomores in 1118 00:56:26,590 --> 00:56:28,510 high school, or something. 1119 00:56:28,510 --> 00:56:33,770 So it's approximately log of n. 1120 00:56:33,770 --> 00:56:39,410 Actually, log of n plus 0.57 is a very good 1121 00:56:39,410 --> 00:56:41,630 approximation for the nth harmonic number. 
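That harmonic-number arithmetic is easy to check numerically, and it also gives the cache-hit-rate story from before. A sketch assuming the textbook 1/i Zipf model; the exact hit-rate percentage depends on the real skew (pure 1/i gives a smaller number than the 95-99% quoted earlier, but the contrast with uniform is still dramatic):

```python
import math

GAMMA = 0.5772156649  # Euler-Mascheroni constant, the 0.57 above

def harmonic(n):
    """Exact nth harmonic number: 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def harmonic_approx(n):
    """The approximation from the board: H(n) ~ ln(n) + 0.57."""
    return math.log(n) + GAMMA

def zipf_hit_rate(n_items, cached_fraction):
    """Under Zipf, item i is queried with probability (1/i)/H(n), so a
    cache holding the hottest k items hits with probability H(k)/H(n)."""
    k = int(n_items * cached_fraction)
    return harmonic_approx(k) / harmonic_approx(n_items)

print(harmonic(1000), harmonic_approx(1000))  # agree to about 3 decimals
print(zipf_hit_rate(1000000, 0.10))           # ~0.84, versus 0.10 for uniform
```

With a million items and 10% of them in memory, a uniform workload hits memory 10% of the time, but a Zipfian one hits roughly H(100,000)/H(1,000,000), around 84%, which is why caching works at all.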
1122 00:56:41,630 --> 00:56:43,800 When you're doing this kind of analysis, boy, it depresses 1123 00:56:43,800 --> 00:56:47,000 people, because you say, oh, that's H of n, and if you have 1124 00:56:47,000 --> 00:56:50,510 a million items in your database, then the sum of all 1125 00:56:50,510 --> 00:56:52,270 those things is H of 1 million, and 1126 00:56:52,270 --> 00:56:53,540 what's H of 1 million? 1127 00:56:53,540 --> 00:56:56,930 Well, the log base 2 of 1 million, I know that, because 1128 00:56:56,930 --> 00:56:58,520 I'm a computer scientist. 1129 00:56:58,520 --> 00:57:00,560 So it's going to be like 20, and because we're doing log 1130 00:57:00,560 --> 00:57:05,370 base e in that formula, maybe it's 15 or something. 1131 00:57:05,370 --> 00:57:10,220 So if you have 1 million items, then the most popular 1132 00:57:10,220 --> 00:57:13,620 item is going to-- you have to divide by H of n here. 1133 00:57:13,620 --> 00:57:15,250 So the most popular item is going to 1134 00:57:15,250 --> 00:57:17,360 appear 1/15 of the time. 1135 00:57:17,360 --> 00:57:19,300 And the next most popular item is going to appear-- 1136 00:57:22,260 --> 00:57:23,620 emergency backup chalk. 1137 00:57:26,800 --> 00:57:30,530 Somebody's been burning both ends of this chalk. 1138 00:57:30,530 --> 00:57:34,810 1/30 of the time, and 1/45 of the time, and those add up. 1139 00:57:34,810 --> 00:57:37,430 When you go up to 1 over 1 million-- 1140 00:57:41,610 --> 00:57:43,250 another zero in there-- 1141 00:57:43,250 --> 00:57:49,060 times 15, that finite series will add up to 1, 1142 00:57:49,060 --> 00:57:50,060 approximately. 1143 00:57:50,060 --> 00:57:53,150 Except to the extent that I've approximated. 1144 00:57:53,150 --> 00:57:55,080 So this is what's going on. 1145 00:57:55,080 --> 00:58:01,270 So the most popular Facebook page-- 1146 00:58:01,270 --> 00:58:04,350 they might have 1 billion pages, so how 1147 00:58:04,350 --> 00:58:05,320 does that change things? 
1148 00:58:05,320 --> 00:58:08,120 Well, that means the most popular one has a probability 1149 00:58:08,120 --> 00:58:11,860 1 in 20, and the second most is 1 in 40. 1150 00:58:11,860 --> 00:58:14,540 And this explains why cache works for 1151 00:58:14,540 --> 00:58:15,820 this kind of workload. 1152 00:58:15,820 --> 00:58:20,370 Nobody really knows why Facebook pages and words in 1153 00:58:20,370 --> 00:58:25,300 libraries and everything else have this distribution, which 1154 00:58:25,300 --> 00:58:28,870 is named after a guy named Zipf. 1155 00:58:31,660 --> 00:58:32,510 But they do. 1156 00:58:32,510 --> 00:58:33,740 Everything has this property. 1157 00:58:33,740 --> 00:58:36,000 And so you can sort of predict what's happening. 1158 00:58:36,000 --> 00:58:38,400 So iiBench should have a Zipfian 1159 00:58:38,400 --> 00:58:39,470 distribution and it doesn't. 1160 00:58:39,470 --> 00:58:41,590 So this is painting a worse picture. 1161 00:58:41,590 --> 00:58:42,450 Or a better picture. 1162 00:58:42,450 --> 00:58:45,610 It's making us look better than we really are, because 1163 00:58:45,610 --> 00:58:49,840 the real world is going to have more hits on the stuff 1164 00:58:49,840 --> 00:58:53,700 that's in memory for a B-tree than this model, where 1165 00:58:53,700 --> 00:58:56,240 basically you're completely hosed all the time because 1166 00:58:56,240 --> 00:58:57,720 it's random. 1167 00:58:57,720 --> 00:59:00,960 So this is an example in the category of how to lie with 1168 00:59:00,960 --> 00:59:02,090 statistics. 1169 00:59:02,090 --> 00:59:05,650 And it's a pretty sophisticated lie. 1170 00:59:05,650 --> 00:59:07,255 If you're going to lie, be sophisticated. 1171 00:59:11,340 --> 00:59:14,050 So these measurements were taken in the top graph. 1172 00:59:14,050 --> 00:59:15,440 Up is good. 1173 00:59:15,440 --> 00:59:17,930 It's how many rows per second we could insert. 
1174 00:59:17,930 --> 00:59:22,800 And this axis is how many rows have been inserted so far. 1175 00:59:22,800 --> 00:59:27,200 And the green one is a B-tree. 1176 00:59:27,200 --> 00:59:29,750 According to Mark Callahan, who's essentially a 1177 00:59:29,750 --> 00:59:33,140 disinterested observer, it's the best implementation of a 1178 00:59:33,140 --> 00:59:36,130 B-tree ever. 1179 00:59:36,130 --> 00:59:38,840 And you can sort of see what happens, is that as you insert 1180 00:59:38,840 --> 00:59:41,450 stuff, the system falls out of main memory, and the 1181 00:59:41,450 --> 00:59:43,020 performance was really good at the beginning-- 1182 00:59:43,020 --> 00:59:46,150 40,000 per second-- and then boom, you're down to 200 down 1183 00:59:46,150 --> 00:59:47,410 here at the end, by the time you've 1184 00:59:47,410 --> 00:59:49,170 inserted a billion rows. 1185 00:59:49,170 --> 00:59:53,005 Whereas, for the fractal tree, you can 1186 00:59:53,005 --> 00:59:54,250 sort of see this noise. 1187 00:59:54,250 --> 00:59:57,390 That's because some insertions are a little cheaper than 1188 00:59:57,390 --> 00:59:59,690 other insertions. 1189 00:59:59,690 --> 01:00:01,760 Every other insertion's completely free, right? 1190 01:00:01,760 --> 01:00:02,630 You had a free spot. 1191 01:00:02,630 --> 01:00:03,880 You just put it in. 1192 01:00:07,610 --> 01:00:10,440 And of the insertions that weren't free, half of them-- 1193 01:00:10,440 --> 01:00:13,260 one out of four overall-- only had to do a little operation in memory. 1194 01:00:13,260 --> 01:00:17,100 So you see this high frequency noise, because some things are 1195 01:00:17,100 --> 01:00:18,540 cheaper than others. 1196 01:00:18,540 --> 01:00:21,730 And that's like a factor of 30 or something. 1197 01:00:24,640 --> 01:00:28,480 It turns out it even works on SSD, solid state disk. 1198 01:00:28,480 --> 01:00:29,300 You might think-- 1199 01:00:29,300 --> 01:00:31,500 all this time I've been talking about disk drives. 
1200 01:00:31,500 --> 01:00:34,440 Solid state disk has a complicated cache hierarchy 1201 01:00:34,440 --> 01:00:37,250 inside it, and we were surprised to see that 1202 01:00:37,250 --> 01:00:43,840 basically we're faster on this workload on a rotating disk 1203 01:00:43,840 --> 01:00:49,220 than a B-tree is on an SSD, which is orders of magnitude 1204 01:00:49,220 --> 01:00:52,500 faster in principle, but turns out that for various 1205 01:00:52,500 --> 01:00:53,750 reasons it's not. 1206 01:00:59,440 --> 01:01:03,190 One question I get often is, the world is moving away from 1207 01:01:03,190 --> 01:01:05,660 rotating disk to solid state disk. 1208 01:01:05,660 --> 01:01:06,760 A lot of applications-- 1209 01:01:06,760 --> 01:01:10,050 how many of you have solid state disks in your laptops? 1210 01:01:10,050 --> 01:01:12,430 That's a really good application for a solid state 1211 01:01:12,430 --> 01:01:15,070 disk, because it's not sensitive to 1212 01:01:15,070 --> 01:01:16,990 being knocked around. 1213 01:01:16,990 --> 01:01:20,550 So it's worth it to have a solid state disk even if it 1214 01:01:20,550 --> 01:01:22,810 were more expensive, which it is. 1215 01:01:22,810 --> 01:01:24,020 It turns out it's not that much more 1216 01:01:24,020 --> 01:01:24,950 expensive for a laptop. 1217 01:01:24,950 --> 01:01:27,990 It's a couple of hundred dollars more or something. 1218 01:01:27,990 --> 01:01:30,250 But the advantage of it is that if you go up in an 1219 01:01:30,250 --> 01:01:32,820 airplane and you're sitting and trying to type in the 1220 01:01:32,820 --> 01:01:35,140 middle of a thunderstorm, flying across-- 1221 01:01:35,140 --> 01:01:38,410 it doesn't care. 
1222 01:01:38,410 --> 01:01:40,415 Disk drives, if you do that-- 1223 01:01:40,415 --> 01:01:44,170 disk drives do not like flying at high altitude, because they 1224 01:01:44,170 --> 01:01:47,380 work by having a cushion of air that the head is flying 1225 01:01:47,380 --> 01:01:52,670 on, and in airplanes, which pressurize the cabin to the 1226 01:01:52,670 --> 01:01:56,450 equivalent of 8,000 feet, that's half an atmosphere. 1227 01:01:56,450 --> 01:01:59,340 So there's only half as much air keeping it off. 1228 01:01:59,340 --> 01:02:02,830 So if you travel a lot, that's when your disk drive will 1229 01:02:02,830 --> 01:02:06,100 fail-- when you're flying. 1230 01:02:06,100 --> 01:02:08,790 OK. 1231 01:02:08,790 --> 01:02:12,940 So it looks like, however, that rotating disk is getting 1232 01:02:12,940 --> 01:02:16,810 cheaper faster than solid state disk is. 1233 01:02:16,810 --> 01:02:21,200 So rotating disk is an order of magnitude cheaper per byte 1234 01:02:21,200 --> 01:02:23,250 than solid state disk today. 1235 01:02:23,250 --> 01:02:25,010 Maybe two orders of magnitude cheaper. 1236 01:02:25,010 --> 01:02:27,480 It's hard to measure fairly. 1237 01:02:27,480 --> 01:02:30,470 But rotating disk, according to Seagate-- 1238 01:02:30,470 --> 01:02:33,650 they're saying, by the end of the decade, we'll have 70 1239 01:02:33,650 --> 01:02:38,870 terabyte drives that are the same form factor. 1240 01:02:38,870 --> 01:02:41,870 And so you figure out what the Moore's Law is for that, and 1241 01:02:41,870 --> 01:02:45,330 it's better than for lithography. 1242 01:02:45,330 --> 01:02:48,670 Lithography is not going to be that much more 1243 01:02:48,670 --> 01:02:50,580 dense in that timeframe. 
1244 01:02:50,580 --> 01:02:55,440 So at least for the next 5 or 10 years, it looks like disk 1245 01:02:55,440 --> 01:02:59,460 drives are going to maintain their cost advantage over 1246 01:02:59,460 --> 01:03:01,450 solid state storage, and maybe even 1247 01:03:01,450 --> 01:03:02,770 spread that cost advantage. 1248 01:03:02,770 --> 01:03:05,920 So for any particular application, for storing your 1249 01:03:05,920 --> 01:03:09,390 music, SSD will be cheap enough, but for those people 1250 01:03:09,390 --> 01:03:11,820 that have really big data sets, like these new 1251 01:03:11,820 --> 01:03:13,510 telescopes they're putting up-- 1252 01:03:13,510 --> 01:03:15,040 these new telescopes are crazy. 1253 01:03:15,040 --> 01:03:17,480 These people are putting up these telescopes. 1254 01:03:17,480 --> 01:03:19,540 They're putting up 1,500 telescopes across the 1255 01:03:19,540 --> 01:03:21,300 Australian Outback. 1256 01:03:21,300 --> 01:03:24,370 And each of those telescopes in the first 15 minutes live 1257 01:03:24,370 --> 01:03:26,940 is going to produce more data than has come down from the 1258 01:03:26,940 --> 01:03:29,910 Hubble, total. 1259 01:03:29,910 --> 01:03:33,690 And there's just no way for them to-- 1260 01:03:33,690 --> 01:03:34,760 I don't know what they're going to do. 1261 01:03:34,760 --> 01:03:37,140 But it's a huge amount of data, and they're going to 1262 01:03:37,140 --> 01:03:39,040 have to use disks to store whatever it is that 1263 01:03:39,040 --> 01:03:40,010 they want to keep. 1264 01:03:40,010 --> 01:03:41,950 And they don't like throwing away data, because it's so 1265 01:03:41,950 --> 01:03:43,790 expensive to make. 1266 01:03:43,790 --> 01:03:46,720 So if I were a disk maker, I'd make sure that my salesmen had 1267 01:03:46,720 --> 01:03:48,322 an office somewhere out there. 
1268 01:03:51,930 --> 01:03:53,870 So the conclusion is you're not going to be able to, at 1269 01:03:53,870 --> 01:03:56,130 least for those applications, just have an 1270 01:03:56,130 --> 01:03:57,250 index in main memory. 1271 01:03:57,250 --> 01:03:59,430 You're going to have to have a data structure that 1272 01:03:59,430 --> 01:04:00,680 works well on disk. 1273 01:04:03,520 --> 01:04:05,740 The speed trends-- 1274 01:04:05,740 --> 01:04:08,530 well, seek time is not going to change. 1275 01:04:08,530 --> 01:04:09,350 It hasn't changed. 1276 01:04:09,350 --> 01:04:10,970 It's not going to change. 1277 01:04:10,970 --> 01:04:15,000 The bandwidth of a disk drive grows with the square root of 1278 01:04:15,000 --> 01:04:16,440 its capacity. 1279 01:04:16,440 --> 01:04:19,150 So if you quadruple the storage on the disk because 1280 01:04:19,150 --> 01:04:24,500 you've made the bits twice as dense in each dimension, then 1281 01:04:24,500 --> 01:04:27,800 one spin of the disk sees twice as many bits, not four 1282 01:04:27,800 --> 01:04:29,310 times as many bits. 1283 01:04:29,310 --> 01:04:32,110 So that projects out to something like disks that are 1284 01:04:32,110 --> 01:04:34,210 500 megabytes per second. 1285 01:04:34,210 --> 01:04:37,650 So how long is it going to take to back up a 67 terabyte 1286 01:04:37,650 --> 01:04:38,900 disk drive? 1287 01:04:42,860 --> 01:04:47,700 So there remain systems problems. 1288 01:04:47,700 --> 01:04:51,720 And I was explaining to my son that there's all these 1289 01:04:51,720 --> 01:04:54,020 problems in systems. 1290 01:04:54,020 --> 01:04:59,320 Data structures aren't well suited, and all these systems suck. 1291 01:04:59,320 --> 01:05:01,740 He said, well, isn't that horrible if 1292 01:05:01,740 --> 01:05:02,790 you're a computer scientist? 1293 01:05:02,790 --> 01:05:09,650 I said, no, because we make our living 1294 01:05:09,650 --> 01:05:11,060 off of these problems. 
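The backup question has a sobering back-of-the-envelope answer using the figures just given (roughly 67 terabytes at the projected ~500 megabytes per second of sequential bandwidth; both are the lecture's projections, not measurements):

```python
capacity = 67e12    # bytes: the ~67 TB drive projected above
bandwidth = 500e6   # bytes/second: sequential, from the sqrt-capacity scaling
seconds = capacity / bandwidth
days = seconds / 86400.0
print("full sequential read: %.0f seconds, about %.1f days" % (seconds, days))
```

That is about a day and a half of doing nothing but sequential reading, before writing the backup anywhere.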
1295 01:05:14,060 --> 01:05:15,270 So here are some problems. 1296 01:05:15,270 --> 01:05:17,320 There's plenty of living to be made yet. 1297 01:05:23,760 --> 01:05:26,630 Power consumption is also a big issue for these things. 1298 01:05:26,630 --> 01:05:32,300 If you fill up a room, like a Google data center, a room which is 1299 01:05:32,300 --> 01:05:36,960 probably bigger than this room, full of machines. 1300 01:05:36,960 --> 01:05:39,930 The Facebook data center is probably a room about this 1301 01:05:39,930 --> 01:05:41,690 size, full of machines. 1302 01:05:41,690 --> 01:05:46,420 And power and cooling is something like half the cost 1303 01:05:46,420 --> 01:05:47,440 of the machines. 1304 01:05:47,440 --> 01:05:52,940 The machines for something like Facebook, the hardware 1305 01:05:52,940 --> 01:05:55,160 might cost them $10 million or $20 million a year, and the 1306 01:05:55,160 --> 01:05:57,510 power and cooling is another $10 million or $20 million a 1307 01:05:57,510 --> 01:06:00,080 year, which is why they go off and they build these data 1308 01:06:00,080 --> 01:06:03,430 centers in places like North Carolina, where I guess 1309 01:06:03,430 --> 01:06:08,350 they're willing to give them power for free or something. 1310 01:06:08,350 --> 01:06:11,460 So making good use of disk bandwidth offers huge power 1311 01:06:11,460 --> 01:06:15,530 savings, because basically you can use disks, which are 1312 01:06:15,530 --> 01:06:17,450 cheaper than solid state for power. 1313 01:06:17,450 --> 01:06:24,000 And you want to use that well. 1314 01:06:24,000 --> 01:06:25,110 CPU trends. 1315 01:06:25,110 --> 01:06:26,900 Well, you've probably talked about this, right? 1316 01:06:26,900 --> 01:06:30,600 CPUs are going to get a lot more cores. 1317 01:06:30,600 --> 01:06:35,160 I actually have a 48-core machine that cost $10,000 that 1318 01:06:35,160 --> 01:06:37,340 I bought about a month ago. 
1319 01:06:37,340 --> 01:06:41,790 And our customers mostly use machines that 1320 01:06:41,790 --> 01:06:44,000 are like $5,000 machines. 1321 01:06:44,000 --> 01:06:46,610 So when I provisioned this machine, I said, well, I 1322 01:06:46,610 --> 01:06:49,370 should spend more and buy a machine that's twice as good as what 1323 01:06:49,370 --> 01:06:52,030 they're buying, because I'm developing software that 1324 01:06:52,030 --> 01:06:55,010 they're going to use next year. 1325 01:06:55,010 --> 01:06:58,440 So I bought a $10,000 machine, which is 48 cores. 1326 01:06:58,440 --> 01:07:03,160 And we're having all sorts of fun making a 1327 01:07:03,160 --> 01:07:06,110 living with that machine. 1328 01:07:06,110 --> 01:07:09,790 The memory bandwidth and the I/O bus bandwidth will grow. 1329 01:07:09,790 --> 01:07:14,410 And so I think it's going to get more and more exciting to 1330 01:07:14,410 --> 01:07:15,970 try to use all these cores. 1331 01:07:15,970 --> 01:07:21,180 Fractal trees have a lot of opportunity to use those cores 1332 01:07:21,180 --> 01:07:25,510 to improve performance and reduce the number of disk I/Os. 1333 01:07:25,510 --> 01:07:30,140 So the conclusion is, basically, these data 1334 01:07:30,140 --> 01:07:35,850 structures dominate B-trees asymptotically. 1335 01:07:35,850 --> 01:07:40,870 And then B-trees have 40 years of engineering advantage, but 1336 01:07:40,870 --> 01:07:42,120 that will evaporate eventually. 1337 01:07:45,300 --> 01:07:48,170 These data structures ride better technology curves than 1338 01:07:48,170 --> 01:07:52,810 B-trees do, and so I find it hard to believe that in 10 1339 01:07:52,810 --> 01:07:56,180 years anybody would design a system using a 1340 01:07:56,180 --> 01:08:00,700 B-tree, because how do you overcome those advantages? 1341 01:08:00,700 --> 01:08:03,660 So basically all storage systems are going to use data 1342 01:08:03,660 --> 01:08:06,940 structures that are like this, or something else. 
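[To make the "dominate asymptotically" claim concrete, here is a rough sketch comparing textbook amortized insert costs: O(log_B N) disk I/Os for a B-tree versus roughly O((log N)/B) for a buffered write-optimized structure such as a cache-oblivious lookahead array, used here as a simplified stand-in for the fractal-tree bound. The values of N and B are illustrative assumptions, not measurements from the talk:]

```python
import math

N = 10**9      # number of rows in the index (illustrative)
B = 1024       # keys that fit in one disk block (illustrative)

# Amortized disk I/Os per insert, constants dropped:
btree_ios = math.log(N, B)    # B-tree: O(log_B N), roughly 3 I/Os per insert
cola_ios = math.log2(N) / B   # buffered structure: O((log N)/B), far below 1

print(f"B-tree:   {btree_ios:.2f} I/Os per insert")
print(f"buffered: {cola_ios:.4f} I/Os per insert")
print(f"ratio:    {btree_ios / cola_ios:.0f}x")
```

[The gap is about two orders of magnitude at these sizes, and it widens as N grows, which is the sense in which B-trees lose asymptotically even with their engineering head start.]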
1343 01:08:06,940 --> 01:08:09,000 There's a whole bunch of other kinds of indexes that we 1344 01:08:09,000 --> 01:08:12,450 haven't attacked, things like indexing multi-dimensional 1345 01:08:12,450 --> 01:08:19,609 data or indexing data where you have very large keys, very 1346 01:08:19,609 --> 01:08:22,189 large rows. 1347 01:08:22,189 --> 01:08:25,729 Imagine that you're trying to index DNA sequences, which are 1348 01:08:25,729 --> 01:08:29,520 much bigger than a disk block. 1349 01:08:29,520 --> 01:08:33,160 So there's a whole bunch of interesting opportunities. 1350 01:08:33,160 --> 01:08:37,880 And that's what I'm working on. 1351 01:08:37,880 --> 01:08:41,660 So any questions or comments? 1352 01:08:41,660 --> 01:08:43,109 Arguments? 1353 01:08:43,109 --> 01:08:44,359 Fistfights? 1354 01:08:53,540 --> 01:08:54,232 OK. 1355 01:08:54,232 --> 01:08:55,359 AUDIENCE: Where's the mic? 1356 01:08:55,359 --> 01:08:57,138 BRADLEY KUSZMAUL: Where is the mic? 1357 01:08:57,138 --> 01:08:58,094 AUDIENCE: That's OK. 1358 01:08:58,094 --> 01:08:59,528 I can [INAUDIBLE]. 1359 01:08:59,528 --> 01:09:00,778 BRADLEY KUSZMAUL: It's on my coat. 1360 01:09:05,122 --> 01:09:07,540 PROFESSOR: So actually, this is a very interesting point, 1361 01:09:07,540 --> 01:09:11,029 because if you think about where the world is heading, I think that 1362 01:09:11,029 --> 01:09:14,109 big data is something that's very, very interesting, 1363 01:09:14,109 --> 01:09:16,819 because all these people are gathering huge amounts of 1364 01:09:16,819 --> 01:09:19,330 data, and they're storing huge amounts of data. 1365 01:09:19,330 --> 01:09:22,300 And what to do with the data, accessing it, is going to be 1366 01:09:22,300 --> 01:09:23,080 one big problem. 1367 01:09:23,080 --> 01:09:25,890 I mean, if you look at what people like Google are doing, 1368 01:09:25,890 --> 01:09:27,710 they're just collecting all of it. 1369 01:09:27,710 --> 01:09:29,250 Nobody's throwing anything out. 
1370 01:09:29,250 --> 01:09:34,800 And I believe if you can kind of look at them, analyze them, 1371 01:09:34,800 --> 01:09:36,720 do cool things with the data, it's going to 1372 01:09:36,720 --> 01:09:38,080 be very, very important. 1373 01:09:38,080 --> 01:09:40,729 So I think that would be a very interesting, 1374 01:09:40,729 --> 01:09:43,160 high-performance end. 1375 01:09:43,160 --> 01:09:46,220 It's not just doing number crunching. 1376 01:09:46,220 --> 01:09:47,569 Until now, when people look at 1377 01:09:47,569 --> 01:09:48,960 high performance, it's about CPU. 1378 01:09:48,960 --> 01:09:52,290 It's about how many floating-point operations per second can you do? 1379 01:09:52,290 --> 01:09:55,060 TeraFLOPS, petaFLOP machines and stuff like that. 1380 01:09:55,060 --> 01:09:56,140 But I think one thing that's really 1381 01:09:56,140 --> 01:09:58,250 interesting is it's not petaFLOPS. 1382 01:09:58,250 --> 01:10:00,690 It's how many terabytes of data can you process 1383 01:10:00,690 --> 01:10:03,150 to find something? 1384 01:10:03,150 --> 01:10:09,120 BRADLEY KUSZMAUL: So I was at a talk by Facebook, and they 1385 01:10:09,120 --> 01:10:14,410 serve 37 gigabytes of data per second out of their 1386 01:10:14,410 --> 01:10:17,310 database tier. 1387 01:10:17,310 --> 01:10:25,130 And that's a lot of serving. 1388 01:10:25,130 --> 01:10:27,130 Out of one little piece of whatever they're doing. 1389 01:10:29,780 --> 01:10:33,860 Those guys have three or five petabytes. 1390 01:10:33,860 --> 01:10:38,680 And in the petabyte club, they're small potatoes. 1391 01:10:38,680 --> 01:10:42,040 There's people who have hundreds of petabytes, people 1392 01:10:42,040 --> 01:10:43,290 with three-letter acronyms. 1393 01:10:46,448 --> 01:10:48,795 PROFESSOR: I mean, some of those three-letter acronym 1394 01:10:48,795 --> 01:10:52,270 places, the amount of data they are getting and they are 1395 01:10:52,270 --> 01:10:55,000 processing is just gigantic. 
1396 01:10:55,000 --> 01:11:00,170 And I think to a point that even some of the interesting 1397 01:11:00,170 --> 01:11:01,936 things about-- 1398 01:11:01,936 --> 01:11:06,040 if they keep growing their data centers at the rate they 1399 01:11:06,040 --> 01:11:09,490 keep growing in the next couple of decades, they will 1400 01:11:09,490 --> 01:11:11,290 need the entire power of the United States to power their 1401 01:11:11,290 --> 01:11:13,560 data centers, because they are at that kind of 1402 01:11:13,560 --> 01:11:14,810 scale at this point. 1403 01:11:16,940 --> 01:11:22,150 Even in these big national labs, the reason they can't 1404 01:11:22,150 --> 01:11:24,480 expand is not that they don't have money to buy the 1405 01:11:24,480 --> 01:11:26,690 machines, but that they don't have money to pay for the 1406 01:11:26,690 --> 01:11:28,750 electricity, and also they don't have electricity-- that 1407 01:11:28,750 --> 01:11:29,560 much electricity-- 1408 01:11:29,560 --> 01:11:30,820 [UNINTELLIGIBLE] 1409 01:11:30,820 --> 01:11:32,672 them to basically feed it. 1410 01:11:32,672 --> 01:11:35,740 BRADLEY KUSZMAUL: I've run into people for whom the power 1411 01:11:35,740 --> 01:11:37,010 issue was a big deal. 1412 01:11:37,010 --> 01:11:40,780 I look at it and say, eh, you bought a $5,000 machine. 1413 01:11:40,780 --> 01:11:44,230 You spend $5,000 in power over the lifetime of the machine. 1414 01:11:44,230 --> 01:11:47,390 It doesn't seem like it's that big a deal. 1415 01:11:47,390 --> 01:11:50,490 But they've filled up their data center, and 1416 01:11:50,490 --> 01:11:54,310 adding one more machine has a huge incremental cost, because 1417 01:11:54,310 --> 01:11:56,470 they can't fit one more in. 1418 01:11:56,470 --> 01:11:59,330 So that means they have to build another building. 1419 01:11:59,330 --> 01:12:06,260 And so almost everybody's facing that problem who's in 1420 01:12:06,260 --> 01:12:09,260 this business. 
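[Bradley's rough equivalence here, a $5,000 machine costing about $5,000 in power over its lifetime, can be sanity-checked with a back-of-the-envelope calculation. Every number below (wattage, lifetime, electricity rate, cooling overhead) is an illustrative assumption of mine, not a figure from the talk:]

```python
# Rough lifetime power cost for one server (all numbers illustrative).
watts = 400             # assumed average draw of a loaded server
years = 4               # assumed service lifetime
rate = 0.15             # assumed electricity price in $/kWh
cooling_overhead = 2.0  # assumed: each watt of computing needs ~1 more watt of cooling

kwh = watts / 1000 * 24 * 365 * years   # kilowatt-hours over the lifetime
cost = kwh * rate * cooling_overhead

print(f"lifetime power and cooling: ${cost:,.0f}")
```

[Under these assumptions the total lands in the low thousands of dollars, the same order of magnitude as the machine itself, which is consistent with the "$5,000 in power" remark.]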
1421 01:12:09,260 --> 01:12:10,980 And then they try to build a building somewhere where 1422 01:12:10,980 --> 01:12:12,350 there's natural cooling-- 1423 01:12:12,350 --> 01:12:14,870 Google's written these papers about, oh, it turns out if you 1424 01:12:14,870 --> 01:12:16,610 don't air condition your computers, most 1425 01:12:16,610 --> 01:12:17,860 of them work anyway. 1426 01:12:24,180 --> 01:12:27,710 So, well, air conditioning is a quarter of the cost over the 1427 01:12:27,710 --> 01:12:28,880 lifetime of the computer. 1428 01:12:28,880 --> 01:12:33,490 So if you can make more than 3/4 of them give you service, 1429 01:12:33,490 --> 01:12:36,815 you come out ahead. 1430 01:12:36,815 --> 01:12:40,660 GUEST SPEAKER: On that note, MIT is part of a consortium 1431 01:12:40,660 --> 01:12:45,590 that includes Harvard, Northeastern, Boston 1432 01:12:45,590 --> 01:12:50,130 University, and University of Massachusetts Amherst, to 1433 01:12:50,130 --> 01:12:54,920 relocate all of our high-performance computing 1434 01:12:54,920 --> 01:13:00,280 into a new green data center in Holyoke, Massachusetts. 1435 01:13:00,280 --> 01:13:04,410 So the idea is that rather than us locating things here 1436 01:13:04,410 --> 01:13:08,130 on campus, where the energy costs are high and we get a 1437 01:13:08,130 --> 01:13:16,060 lot of our energy from fuels that have a big carbon 1438 01:13:16,060 --> 01:13:19,660 footprint, locating it in Holyoke-- 1439 01:13:19,660 --> 01:13:27,080 they have a lot of hydro power and nuclear power there. 1440 01:13:27,080 --> 01:13:31,660 And they're able to build a building that is extremely 1441 01:13:31,660 --> 01:13:33,140 energy-efficient. 1442 01:13:33,140 --> 01:13:37,810 And it turns out that a bunch of years ago when they were 1443 01:13:37,810 --> 01:13:42,780 digging up Route 90, the Mass Pike, they laid a lot of 1444 01:13:42,780 --> 01:13:45,220 dark fiber down its length. 
1445 01:13:45,220 --> 01:13:48,080 And so what they're going to do is light up that fiber, 1446 01:13:48,080 --> 01:13:51,220 which comes right back here to the Boston area. 1447 01:13:51,220 --> 01:13:53,940 And so for most people who are using these very 1448 01:13:53,940 --> 01:13:56,210 high-performance things, it doesn't really matter where 1449 01:13:56,210 --> 01:13:59,590 it's located anymore, at that level. 1450 01:13:59,590 --> 01:14:03,830 So instead of just locating some piece of equipment here, 1451 01:14:03,830 --> 01:14:08,490 we just will locate it out there, and the price will drop 1452 01:14:08,490 --> 01:14:09,240 dramatically. 1453 01:14:09,240 --> 01:14:13,490 And it'll be a much greener way for us to be doing our 1454 01:14:13,490 --> 01:14:15,450 high-end computing. 1455 01:14:15,450 --> 01:14:16,170 Yeah, question? 1456 01:14:16,170 --> 01:14:18,510 AUDIENCE: Isn't someone talking about water-cooled 1457 01:14:18,510 --> 01:14:19,914 offshore floating data centers? 1458 01:14:19,914 --> 01:14:21,050 GUEST SPEAKER: Sure. 1459 01:14:21,050 --> 01:14:22,080 Sure. 1460 01:14:22,080 --> 01:14:25,930 So the question is, are people talking about water-cooled 1461 01:14:25,930 --> 01:14:27,630 offshore floating data centers? 1462 01:14:27,630 --> 01:14:27,850 Yeah. 1463 01:14:27,850 --> 01:14:34,500 I mean, locating things in some area where you can cool 1464 01:14:34,500 --> 01:14:36,860 things easily makes a lot of sense. 1465 01:14:36,860 --> 01:14:41,490 Usually, they tend to want those near rivers rather than 1466 01:14:41,490 --> 01:14:44,990 in the middle of the ocean, just because you get the 1467 01:14:44,990 --> 01:14:46,340 hydropower. 1468 01:14:46,340 --> 01:14:49,210 But even in the ocean, you can use currents to do very much 1469 01:14:49,210 --> 01:14:50,200 the same kind of thing. 
1470 01:14:50,200 --> 01:14:52,620 So for some of these things, people are looking very 1471 01:14:52,620 --> 01:15:00,830 seriously at a whole bunch of different strategies for 1472 01:15:00,830 --> 01:15:03,625 containing large-scale equipment. 1473 01:15:03,625 --> 01:15:08,060 PROFESSOR: So one that's very counterintuitive is people are 1474 01:15:08,060 --> 01:15:11,680 trying to build data centers in the middle of deserts, where 1475 01:15:11,680 --> 01:15:13,130 it's very hot. 1476 01:15:13,130 --> 01:15:15,715 I mean, why do you think people want to build a data 1477 01:15:15,715 --> 01:15:17,530 center in the middle of the desert? 1478 01:15:17,530 --> 01:15:19,310 AUDIENCE: Solar power? 1479 01:15:19,310 --> 01:15:21,110 PROFESSOR: Solar power is one thing. 1480 01:15:21,110 --> 01:15:22,626 No, it's not solar power. 1481 01:15:22,626 --> 01:15:23,964 AUDIENCE: It gets really cold at night. 1482 01:15:23,964 --> 01:15:25,760 PROFESSOR: No, it's not really cold at night. 1483 01:15:25,760 --> 01:15:26,990 That's not it. 1484 01:15:26,990 --> 01:15:29,740 GUEST SPEAKER: Cheap property. 1485 01:15:29,740 --> 01:15:33,120 PROFESSOR: No, the biggest thing about cooling is either 1486 01:15:33,120 --> 01:15:35,140 you can do air conditioning, where you're using power to 1487 01:15:35,140 --> 01:15:39,920 pull heat out, or you can use just water to cool. 1488 01:15:39,920 --> 01:15:43,750 And what happens is, in most other places the humidity 1489 01:15:43,750 --> 01:15:44,710 is too high. 1490 01:15:44,710 --> 01:15:47,470 And when you go to the desert, humidity is low enough that 1491 01:15:47,470 --> 01:15:49,590 you can just pump water through the thing and get the 1492 01:15:49,590 --> 01:15:52,280 water evaporating, and then use 1493 01:15:52,280 --> 01:15:53,750 that to cool the system. 
1494 01:15:53,750 --> 01:15:56,970 So sometimes they're looking at data centers in places 1495 01:15:56,970 --> 01:16:00,480 where it could be 120 degrees, but very low humidity. 1496 01:16:00,480 --> 01:16:02,940 And they think that is a lot more efficient 1497 01:16:02,940 --> 01:16:04,490 to cool. 1498 01:16:04,490 --> 01:16:06,570 So there are a lot of these interesting nonintuitive 1499 01:16:06,570 --> 01:16:08,280 things people are looking at. 1500 01:16:08,280 --> 01:16:10,435 So what [UNINTELLIGIBLE] they say is that humidity's the 1501 01:16:10,435 --> 01:16:11,988 killer, not the temperature. 1502 01:16:11,988 --> 01:16:15,480 AUDIENCE: What if you're located on the South-- 1503 01:16:15,480 --> 01:16:19,910 if you're located on the South Pole, then that's both cold 1504 01:16:19,910 --> 01:16:24,370 and really low humidity. 1505 01:16:24,370 --> 01:16:25,540 GUEST SPEAKER: Yeah. 1506 01:16:25,540 --> 01:16:28,270 I mean it'll be interesting to see how these things develop. 1507 01:16:28,270 --> 01:16:35,300 It's a very so-called hot topic these days, energy 1508 01:16:35,300 --> 01:16:36,090 for computing. 1509 01:16:36,090 --> 01:16:39,820 And the energy for computing of course matters also not 1510 01:16:39,820 --> 01:16:42,000 only at the large scale but also at the small scale, 1511 01:16:42,000 --> 01:16:49,410 because you want your favorite handheld to 1512 01:16:49,410 --> 01:16:51,110 use very little battery, 1513 01:16:51,110 --> 01:16:52,540 so your batteries last longer. 1514 01:16:52,540 --> 01:16:56,620 So the issue of energy, using that as a measure-- 1515 01:16:56,620 --> 01:16:59,830 we've mostly been looking at how fast we can make things 1516 01:16:59,830 --> 01:17:03,260 run in this class, but many of the lessons you can use to 1517 01:17:03,260 --> 01:17:06,550 say, well, how can I make this run as 1518 01:17:06,550 --> 01:17:09,180 energy-efficient as possible? 
1519 01:17:09,180 --> 01:17:13,410 And what you'll learn is that many of the lessons we've had 1520 01:17:13,410 --> 01:17:17,310 in the class apply. During the term we focused, as I say, on 1521 01:17:17,310 --> 01:17:18,180 performance. 1522 01:17:18,180 --> 01:17:21,070 But there are many resources in any given situation that 1523 01:17:21,070 --> 01:17:22,390 you might want to optimize. 1524 01:17:22,390 --> 01:17:25,370 And so understanding something about how do I minimize 1525 01:17:25,370 --> 01:17:29,200 energy, how do I minimize disk I/Os, how do I minimize clock 1526 01:17:29,200 --> 01:17:32,760 cycles, how do I minimize off-chip accesses-- 1527 01:17:32,760 --> 01:17:34,430 which tend to be much more energy 1528 01:17:34,430 --> 01:17:36,070 intensive than on-chip-- 1529 01:17:36,070 --> 01:17:40,740 all those different kinds of measures end up being part of 1530 01:17:40,740 --> 01:17:42,700 the mix of what you have to do when you're really engineering 1531 01:17:42,700 --> 01:17:43,950 these systems. 1532 01:17:46,142 --> 01:17:49,130 PROFESSOR: So I think another interesting thing is, because 1533 01:17:49,130 --> 01:17:54,170 we are in this time where some stuff grows at exponential 1534 01:17:54,170 --> 01:17:57,630 rates and stuff like that, some of those ratios that made 1535 01:17:57,630 --> 01:18:02,330 sense at some point just suddenly start causing really 1536 01:18:02,330 --> 01:18:02,960 bad problems. 1537 01:18:02,960 --> 01:18:06,770 Like, for example, in this, at some point the seek times were 1538 01:18:06,770 --> 01:18:08,870 normal enough that you didn't care. 1539 01:18:08,870 --> 01:18:13,620 And at some point, because the rest of the things took off so 1540 01:18:13,620 --> 01:18:16,610 fast, suddenly it becomes this really, really big 1541 01:18:16,610 --> 01:18:17,150 bottleneck. 1542 01:18:17,150 --> 01:18:19,660 BRADLEY KUSZMAUL: B-trees were a really good data 1543 01:18:19,660 --> 01:18:22,750 structure in 1972. 
1544 01:18:22,750 --> 01:18:26,590 Because, well, the seek time and the transfer 1545 01:18:26,590 --> 01:18:28,660 time and the CPU-- 1546 01:18:28,660 --> 01:18:33,040 the CPUs actually couldn't read in the data in one 1547 01:18:33,040 --> 01:18:36,530 rotation, so people didn't even read consecutive blocks, 1548 01:18:36,530 --> 01:18:38,860 because the CPU just couldn't handle data 1549 01:18:38,860 --> 01:18:40,040 coming in that fast. 1550 01:18:40,040 --> 01:18:42,200 You would stagger blocks around the disk, so that when 1551 01:18:42,200 --> 01:18:44,640 you did sequential reads, you'd get this one and then 1552 01:18:44,640 --> 01:18:46,330 this one and this one. 1553 01:18:46,330 --> 01:18:48,730 There was this whole thing about tuning your file system. 1554 01:18:48,730 --> 01:18:50,190 It's like-- 1555 01:18:50,190 --> 01:18:52,640 AUDIENCE: By the way, back when disks were-- 1556 01:18:52,640 --> 01:18:54,310 BRADLEY KUSZMAUL: Yeah. 1557 01:18:54,310 --> 01:18:55,180 Washing machine. 1558 01:18:55,180 --> 01:18:57,335 AUDIENCE: Washing machine size. 1559 01:18:57,335 --> 01:18:59,565 BRADLEY KUSZMAUL: For 20 megabytes. 1560 01:18:59,565 --> 01:19:01,040 PROFESSOR: Oh, yeah, that's the big disk. 1561 01:19:05,540 --> 01:19:07,870 So hopefully you guys got a feel for-- 1562 01:19:07,870 --> 01:19:09,840 we have been looking at this performance on a small 1563 01:19:09,840 --> 01:19:12,440 multi-core and stuff like that, how it can scale in 1564 01:19:12,440 --> 01:19:15,030 different directions and the kind of impact 1565 01:19:15,030 --> 01:19:16,365 performance can have. 
1566 01:19:16,365 --> 01:19:22,050 And in fact, if anybody has read books on why Google is 1567 01:19:22,050 --> 01:19:25,530 successful, one of the biggest things for their success is 1568 01:19:25,530 --> 01:19:29,090 they managed to do a huge amount of work very cheaply, 1569 01:19:29,090 --> 01:19:32,790 because if anybody did the amount of work they do in the 1570 01:19:32,790 --> 01:19:36,400 traditional way, they couldn't afford that model, to give it 1571 01:19:36,400 --> 01:19:39,820 away for free, supported by advertising. 1572 01:19:39,820 --> 01:19:42,360 They can get it done because it's about 1573 01:19:42,360 --> 01:19:43,980 optimization. 1574 01:19:43,980 --> 01:19:47,770 Performance basically relates to cost. 1575 01:19:47,770 --> 01:19:50,380 And if the cost is low enough, then they don't have to keep 1576 01:19:50,380 --> 01:19:53,370 charging a huge amount of money for each search. 1577 01:19:53,370 --> 01:19:57,790 GUEST SPEAKER: So let's thank Dr. Kuszmaul for 1578 01:19:57,790 --> 01:20:01,438 an excellent talk. 1579 01:20:01,438 --> 01:20:04,245 And can you hang out for just a little bit, if people want 1580 01:20:04,245 --> 01:20:04,710 to come down? 1581 01:20:04,710 --> 01:20:06,200 OK. 1582 01:20:06,200 --> 01:20:07,450 Thanks.