1 00:00:00,090 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,330 To make a donation or view additional materials 6 00:00:13,330 --> 00:00:17,215 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,215 --> 00:00:17,840 at ocw.mit.edu. 8 00:00:26,225 --> 00:00:27,600 PROFESSOR: Welcome to Foundations 9 00:00:27,600 --> 00:00:30,330 of Computational and Systems Biology. 10 00:00:30,330 --> 00:00:32,030 This course has many numbers. 11 00:00:32,030 --> 00:00:35,410 We'll explain all the differences and similarities. 12 00:00:35,410 --> 00:00:39,540 But briefly, there are three undergrad course numbers, 13 00:00:39,540 --> 00:00:44,770 which are all similar in content, 7.36, 20.390, 6.802. 14 00:00:44,770 --> 00:00:49,470 And then, there are four graduate versions. 15 00:00:49,470 --> 00:00:55,410 The 7.91, 20.490, and HST versions are all very similar, 16 00:00:55,410 --> 00:00:57,030 basically identical. 17 00:00:57,030 --> 00:01:01,245 But the 6.874 has some additional AI content 18 00:01:01,245 --> 00:01:04,420 that we'll discuss in a moment. 19 00:01:04,420 --> 00:01:06,720 So make sure that you are registered 20 00:01:06,720 --> 00:01:11,160 for the appropriate version of this course. 21 00:01:11,160 --> 00:01:15,710 And please interrupt with questions at any time. 22 00:01:15,710 --> 00:01:22,260 The main goal today is to give an overview of the course, both 23 00:01:22,260 --> 00:01:24,420 the content as well as the mechanics of how 24 00:01:24,420 --> 00:01:25,503 the course will be taught. 25 00:01:25,503 --> 00:01:28,310 And we want to make sure that everything is clear. 
26 00:01:28,310 --> 00:01:31,070 So this course is taught by myself 27 00:01:31,070 --> 00:01:35,135 and Chris Burge from Biology, Professor Fraenkel from BE, 28 00:01:35,135 --> 00:01:38,960 and Professor Gifford from EECS. 29 00:01:38,960 --> 00:01:45,450 We have three TAs, Peter Freese and Collette Picard, 30 00:01:45,450 --> 00:01:51,570 from Computational and Systems Biology, and Tahin, from EECS. 31 00:01:51,570 --> 00:01:54,620 All the TAs have expertise in computational biology 32 00:01:54,620 --> 00:01:59,480 as well as other quantitative areas 33 00:01:59,480 --> 00:02:02,429 like math, statistics, computer science. 34 00:02:02,429 --> 00:02:04,845 So in addition to the lectures by the regular instructors, 35 00:02:04,845 --> 00:02:09,320 we will also have guest lectures by George Church, 36 00:02:09,320 --> 00:02:12,640 from Harvard, toward the end of the semester. 37 00:02:12,640 --> 00:02:14,640 Doug Lauffenburger will give a lecture 38 00:02:14,640 --> 00:02:18,530 in the regulatory network section of the course. 39 00:02:18,530 --> 00:02:21,350 And we'll have a guest lecture from Ron Weiss 40 00:02:21,350 --> 00:02:23,220 on synthetic biology. 41 00:02:23,220 --> 00:02:25,770 And just a note that today's lecture 42 00:02:25,770 --> 00:02:27,580 and all the lectures this semester 43 00:02:27,580 --> 00:02:32,950 are being recorded by AMPS, by MIT's OpenCourseWare. 44 00:02:32,950 --> 00:02:35,850 So the videos, after a little bit of editing, 45 00:02:35,850 --> 00:02:38,972 will eventually end up on OpenCourseWare. 46 00:02:38,972 --> 00:02:39,930 What are these courses? 47 00:02:39,930 --> 00:02:45,860 So these course numbers are the graduate level versions, 48 00:02:45,860 --> 00:02:49,170 which are survey courses in computational biology. 
49 00:02:49,170 --> 00:02:51,310 Our target audience is graduate students 50 00:02:51,310 --> 00:02:54,935 who have a solid background and comfort-- 51 00:02:54,935 --> 00:02:56,740 or, a solid background in biology, 52 00:02:56,740 --> 00:03:01,130 and also, a comfort with quantitative approaches. 53 00:03:01,130 --> 00:03:03,230 We don't assume that you've programmed before, 54 00:03:03,230 --> 00:03:05,200 but there will be some programming 55 00:03:05,200 --> 00:03:06,970 content on the homeworks. 56 00:03:06,970 --> 00:03:12,430 And you will therefore need to learn some Python programming. 57 00:03:12,430 --> 00:03:16,180 And the TAs will help with that component of the course. 58 00:03:16,180 --> 00:03:20,240 We also have some online tutorials on Python programming 59 00:03:20,240 --> 00:03:22,210 that are available. 60 00:03:22,210 --> 00:03:24,830 The undergrad course numbers-- this 61 00:03:24,830 --> 00:03:26,940 is an upper level undergraduate survey 62 00:03:26,940 --> 00:03:28,730 course in computational biology. 63 00:03:28,730 --> 00:03:33,200 And our target audience are upper level undergraduates 64 00:03:33,200 --> 00:03:35,182 with solid biology background and comfort 65 00:03:35,182 --> 00:03:36,390 with quantitative approaches. 66 00:03:36,390 --> 00:03:39,350 So there's one key difference between the graduate 67 00:03:39,350 --> 00:03:42,540 and undergraduate versions, which I'll come to in a moment. 68 00:03:42,540 --> 00:03:45,600 So the goal of this course is to develop understanding 69 00:03:45,600 --> 00:03:49,070 of foundational methods in computational biology that 70 00:03:49,070 --> 00:03:53,490 will enable you to contextualize and understand 71 00:03:53,490 --> 00:03:55,990 a good portion of research literature in a growing field. 
72 00:03:55,990 --> 00:03:58,922 So if you pick up Science, or Nature, or PLOS Computational 73 00:03:58,922 --> 00:04:03,780 Biology and you want to read those papers 74 00:04:03,780 --> 00:04:05,500 and understand them, after this course, 75 00:04:05,500 --> 00:04:06,960 you will have a better chance. 76 00:04:06,960 --> 00:04:09,770 We're not guaranteeing you'll be able to understand all of them. 77 00:04:09,770 --> 00:04:15,480 But you'll be able to recognize, perhaps, the category of paper, 78 00:04:15,480 --> 00:04:18,850 the class, and perhaps, some of the algorithms 79 00:04:18,850 --> 00:04:20,579 that are involved. 80 00:04:20,579 --> 00:04:23,940 And for the graduate version, another goal 81 00:04:23,940 --> 00:04:27,350 is to help you gain exposure to research in this field. 82 00:04:27,350 --> 00:04:30,440 So it's actually possible to do a smaller 83 00:04:30,440 --> 00:04:32,820 scale computational biology research project on your own, 84 00:04:32,820 --> 00:04:36,700 on your laptop-- perhaps, on ATHENA-- with relatively 85 00:04:36,700 --> 00:04:39,840 limited computational resources and potentially 86 00:04:39,840 --> 00:04:42,190 even discover something new. 87 00:04:42,190 --> 00:04:44,720 And so we want to give you that experience. 88 00:04:44,720 --> 00:04:47,110 And that's through the project component 89 00:04:47,110 --> 00:04:49,400 that we'll say more about in a moment. 90 00:04:49,400 --> 00:04:52,860 So just to make sure that everyone's in the right class-- 91 00:04:52,860 --> 00:04:55,840 this is not a systems biology class. 92 00:04:55,840 --> 00:04:57,340 There are some more focused systems 93 00:04:57,340 --> 00:04:58,900 biology classes offered on campus. 94 00:04:58,900 --> 00:05:01,380 But we will cover some topics that 95 00:05:01,380 --> 00:05:04,390 are important for analyzing complex systems. 96 00:05:04,390 --> 00:05:08,170 This is also certainly not a synthetic biology course. 
97 00:05:08,170 --> 00:05:12,050 Some of the systems methods are also used in synthetic biology. 98 00:05:12,050 --> 00:05:14,430 And there will be this guest lecture I mentioned, 99 00:05:14,430 --> 00:05:18,030 by Ron Weiss, which will cover synthetic biology. 100 00:05:18,030 --> 00:05:20,140 It's also not an algorithms class. 101 00:05:20,140 --> 00:05:22,880 We don't assume that you have experience 102 00:05:22,880 --> 00:05:25,290 in designing or analyzing algorithms. 103 00:05:25,290 --> 00:05:28,220 We'll discuss various bioinformatics algorithms. 104 00:05:28,220 --> 00:05:30,830 And you'll have the opportunity to implement 105 00:05:30,830 --> 00:05:36,050 at least one bioinformatics algorithm on your homework. 106 00:05:36,050 --> 00:05:39,910 But algorithms are not really the center of the course. 107 00:05:39,910 --> 00:05:41,620 And there's one exception to this, 108 00:05:41,620 --> 00:05:44,670 which is, those of you who are taking 6.874 109 00:05:44,670 --> 00:05:46,480 will go to a special recitation that 110 00:05:46,480 --> 00:05:50,350 will cover more advanced algorithm content. 111 00:05:50,350 --> 00:05:53,190 And there will be special homework problems for you 112 00:05:53,190 --> 00:05:53,690 as well. 113 00:05:53,690 --> 00:05:58,240 So that course really does have more algorithm content. 114 00:05:58,240 --> 00:06:00,280 So the plan for today is that I will just 115 00:06:00,280 --> 00:06:03,750 do a brief, anecdotal history of computational and systems 116 00:06:03,750 --> 00:06:04,250 biology. 117 00:06:04,250 --> 00:06:07,770 This is to set the stage and the context for the class. 118 00:06:07,770 --> 00:06:10,620 And then we'll spend a significant amount 119 00:06:10,620 --> 00:06:14,600 of time reviewing the course mechanics, organization, 120 00:06:14,600 --> 00:06:15,130 and content. 121 00:06:15,130 --> 00:06:17,730 Because as you'll see, it's a little bit complicated. 
122 00:06:17,730 --> 00:06:22,560 But it'll make sense once we go over it, hopefully. 123 00:06:22,560 --> 00:06:26,660 So this is my take on computational 124 00:06:26,660 --> 00:06:27,650 and systems biology. 125 00:06:27,650 --> 00:06:29,910 Again, it's not a scholarly overview. 126 00:06:29,910 --> 00:06:32,770 It doesn't hit everything important that happened. 127 00:06:32,770 --> 00:06:35,870 It just gives you a flavor of what 128 00:06:35,870 --> 00:06:41,860 was happening in computational biology decade by decade. 129 00:06:41,860 --> 00:06:46,060 So first of all, where does this field 130 00:06:46,060 --> 00:06:49,470 fall in the academic scheme of things? 131 00:06:49,470 --> 00:06:53,110 So I consider computational biology 132 00:06:53,110 --> 00:06:55,550 to be actually part of biology. 133 00:06:55,550 --> 00:07:01,110 So in the way that genetics or biochemistry are disciplines 134 00:07:01,110 --> 00:07:03,920 that have strategies for understanding 135 00:07:03,920 --> 00:07:07,080 biological questions, so does computational biology. 136 00:07:07,080 --> 00:07:09,210 You can use it to understand a variety 137 00:07:09,210 --> 00:07:10,790 of computational questions in gene 138 00:07:10,790 --> 00:07:15,000 regulation, many other areas. 139 00:07:15,000 --> 00:07:18,430 There is also, some people make a distinction 140 00:07:18,430 --> 00:07:21,500 that bioinformatics is more about building tools 141 00:07:21,500 --> 00:07:25,811 whereas computational biology is more about using tools, 142 00:07:25,811 --> 00:07:26,310 for example. 143 00:07:26,310 --> 00:07:29,130 Although many people don't-- it's very blurry-- 144 00:07:29,130 --> 00:07:33,199 and don't try to-- people use them in various ways. 
145 00:07:33,199 --> 00:07:34,990 And then, you could think of bioinformatics 146 00:07:34,990 --> 00:07:39,280 as being embedded in the larger field of informatics, where you 147 00:07:39,280 --> 00:07:43,340 include tools for management and analysis of data in general. 148 00:07:43,340 --> 00:07:47,570 And it's certainly true that many of the core concepts 149 00:07:47,570 --> 00:07:51,170 and algorithms in bioinformatics come from the field, 150 00:07:51,170 --> 00:07:53,980 come from computer science, come from other branches 151 00:07:53,980 --> 00:07:57,710 of engineering, from statistics, mathematics, and so forth. 152 00:07:57,710 --> 00:07:59,710 So it's really a cross disciplinary field. 153 00:07:59,710 --> 00:08:03,120 And then, synthetic biology cuts across in the sense 154 00:08:03,120 --> 00:08:05,404 that it's really an engineering discipline. 155 00:08:05,404 --> 00:08:07,070 Because you're designing and engineering 156 00:08:07,070 --> 00:08:09,820 synthetic molecular cellular systems. 157 00:08:09,820 --> 00:08:11,700 But you can also use synthetic biology 158 00:08:11,700 --> 00:08:16,220 to help understand natural biological systems, of course. 159 00:08:16,220 --> 00:08:19,030 All right, so what was happening, decade by decade? 160 00:08:19,030 --> 00:08:22,910 So in the '70s, there were not genome sequences 161 00:08:22,910 --> 00:08:26,027 available or large sequence databases of any sort. 162 00:08:26,027 --> 00:08:28,360 Except there were starting to be some protein sequences. 163 00:08:28,360 --> 00:08:31,030 And early computational biologists 164 00:08:31,030 --> 00:08:36,440 focused on comparing proteins, understanding their function, 165 00:08:36,440 --> 00:08:38,419 structure, and evolution. 
166 00:08:38,419 --> 00:08:40,840 And so in order to compare proteins, 167 00:08:40,840 --> 00:08:44,810 you need a protein-- an amino acid substitution matrix-- 168 00:08:44,810 --> 00:08:47,730 a matrix that describes how often one amino acid is 169 00:08:47,730 --> 00:08:48,750 substituted for another. 170 00:08:48,750 --> 00:08:50,990 And Margaret Dayhoff was a pioneer 171 00:08:50,990 --> 00:08:52,630 in developing these sorts of matrices. 172 00:08:52,630 --> 00:08:55,980 And some of the matrices she developed, the PAM series, 173 00:08:55,980 --> 00:08:56,870 are still used today. 174 00:08:56,870 --> 00:09:00,720 And we'll discuss those matrices early next week. 175 00:09:00,720 --> 00:09:04,310 So in terms of asking evolutionary questions, 176 00:09:04,310 --> 00:09:09,400 two big thinkers were Russ Doolittle and Carl Woese, 177 00:09:09,400 --> 00:09:14,930 analyzing both ribosomal RNA sequences to study evolution. 178 00:09:14,930 --> 00:09:18,780 And Carl Woese realized, looking at these RNA alignments, 179 00:09:18,780 --> 00:09:21,777 that actually, the prokaryotes, which had been-- there 180 00:09:21,777 --> 00:09:23,360 was this big split between prokaryotes 181 00:09:23,360 --> 00:09:25,530 and eukaryotes was sort of a false split-- 182 00:09:25,530 --> 00:09:27,430 that actually, there was a subgroup 183 00:09:27,430 --> 00:09:30,750 of single-celled anuclear organisms 184 00:09:30,750 --> 00:09:32,520 that were closer to the eukaryotes-- 185 00:09:32,520 --> 00:09:33,740 and named them the Archaea. 186 00:09:33,740 --> 00:09:36,890 So a whole kingdom of life was recognized, really, 187 00:09:36,890 --> 00:09:38,350 by sequence analysis. 
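The idea of a substitution matrix described above, a table of how often one amino acid is substituted for another, can be illustrated with a small Python sketch. This is not Dayhoff's PAM procedure, which the lecture does not spell out; it is a simplified log-odds calculation over gap-free aligned pairs, and the function name `log_odds_matrix` and the toy alignment are assumptions made for illustration only.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Build a toy log-odds substitution matrix from gap-free aligned pairs.

    Hypothetical helper for illustration; not the PAM construction.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        for a, b in zip(seq_a, seq_b):
            # Count each aligned column in both orders so the matrix is symmetric.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1

    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        # Expected frequency if the two residues were paired by chance.
        expected = (residue_counts[a] / total_residues) * (
            residue_counts[b] / total_residues
        )
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Toy alignment: alanine (A) and leucine (L) only.
toy_scores = log_odds_matrix([("AALL", "AALA")])
```

Pairs that appear more often than chance would predict get positive scores, and rarer pairs get negative scores, which is the general shape of the scoring matrices used when comparing proteins.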
188 00:09:38,350 --> 00:09:42,040 And Russ Doolittle also did a lot of analysis 189 00:09:42,040 --> 00:09:45,030 of protein sequences and came up with this molecular clock 190 00:09:45,030 --> 00:09:48,860 idea, or contributed to that idea, 191 00:09:48,860 --> 00:09:52,390 to actually build-- instead of systematics being based 192 00:09:52,390 --> 00:09:57,770 on phenotypic characteristics, do it on a molecular level. 193 00:09:57,770 --> 00:10:00,780 So in the '80s, the databases started to expand. 194 00:10:00,780 --> 00:10:04,000 Sequence alignment and search became more important. 195 00:10:04,000 --> 00:10:07,830 And various people developed fast algorithms 196 00:10:07,830 --> 00:10:12,530 to compare protein and DNA sequences and align them. 197 00:10:12,530 --> 00:10:15,430 So the FASTA program was widely used. 198 00:10:15,430 --> 00:10:17,820 BLAST-- several of the authors of BLAST are shown here-- 199 00:10:17,820 --> 00:10:21,580 David Lipman, Pearson, Webb Miller, Stephen Altschul. 200 00:10:21,580 --> 00:10:25,610 The statistics for knowing when a BLAST search result is 201 00:10:25,610 --> 00:10:28,110 significant were developed by Karlin and Altschul. 202 00:10:28,110 --> 00:10:31,505 And there was also progress in gapped alignment, 203 00:10:31,505 --> 00:10:35,340 in particular, Smith-Waterman, shown above. 204 00:10:35,340 --> 00:10:37,700 Also progress in RNA secondary structure prediction 205 00:10:37,700 --> 00:10:40,140 from Nussinov and Zuker. 206 00:10:40,140 --> 00:10:44,730 We'll talk about all of these algorithms during the course. 207 00:10:44,730 --> 00:10:48,100 And there was also development of literature databases. 208 00:10:48,100 --> 00:10:49,665 I always liked this picture. 209 00:10:49,665 --> 00:10:51,040 Many of you probably used PubMed. 210 00:10:51,040 --> 00:10:56,400 But Al Gore was well coached here, by these experts, 211 00:10:56,400 --> 00:10:57,530 in how to use it. 
212 00:10:57,530 --> 00:11:01,480 And then, in the '90s, computational biology 213 00:11:01,480 --> 00:11:02,736 really started to expand. 214 00:11:02,736 --> 00:11:04,360 It was driven partly by the development 215 00:11:04,360 --> 00:11:09,540 of microarrays, the first genome sequences, and questions 216 00:11:09,540 --> 00:11:12,030 like how to identify domains in a protein, 217 00:11:12,030 --> 00:11:15,520 how to identify genes in the genome. 218 00:11:15,520 --> 00:11:18,340 It was recognized that this family of models 219 00:11:18,340 --> 00:11:22,260 from electrical engineering, the hidden Markov models, 220 00:11:22,260 --> 00:11:25,780 were quite useful for these sorts of sequence labeling problems. 221 00:11:25,780 --> 00:11:28,800 That was really pioneered by Anders Krogh here, 222 00:11:28,800 --> 00:11:30,220 and David Haussler. 223 00:11:30,220 --> 00:11:32,700 And a variety of algorithms were developed 224 00:11:32,700 --> 00:11:36,610 that performed these useful tasks. 225 00:11:36,610 --> 00:11:39,150 There was also important progress 226 00:11:39,150 --> 00:11:43,630 in the earliest comparative genomic approaches, 227 00:11:43,630 --> 00:11:45,475 since you have-- the first genomes 228 00:11:45,475 --> 00:11:47,870 were sequenced in the mid '90s, of free living organisms. 229 00:11:47,870 --> 00:11:50,510 And so you could then start to compare these genomes 230 00:11:50,510 --> 00:11:51,350 and learn a lot. 231 00:11:51,350 --> 00:11:54,040 We'll talk a little bit about comparative genomics. 232 00:11:54,040 --> 00:11:59,180 And there was important progress on predicting protein structure 233 00:11:59,180 --> 00:12:00,950 from primary sequence. 234 00:12:00,950 --> 00:12:03,220 Particularly, David Baker made notable progress 235 00:12:03,220 --> 00:12:05,190 on this Rosetta algorithm. 236 00:12:05,190 --> 00:12:07,750 So it's a biophysics field, but it's 237 00:12:07,750 --> 00:12:12,310 very much part of computational biology as well. 
238 00:12:12,310 --> 00:12:17,380 So in the 2000s, definitely, genome sequencing 239 00:12:17,380 --> 00:12:21,370 became very fashionable, as you can see here. 240 00:12:21,370 --> 00:12:28,030 And the genomes of now larger organisms, 241 00:12:28,030 --> 00:12:31,450 including human, it became possible to sequence them. 242 00:12:31,450 --> 00:12:33,540 And then, this introduced a huge host 243 00:12:33,540 --> 00:12:36,849 of computational challenges in assembling the genomes, 244 00:12:36,849 --> 00:12:38,390 annotating the genomes, and so forth. 245 00:12:38,390 --> 00:12:41,510 And we'll hear from Professor Gifford 246 00:12:41,510 --> 00:12:44,510 about some genome assembly topics. 247 00:12:44,510 --> 00:12:47,750 And annotation will come up throughout. 248 00:12:51,014 --> 00:12:52,430 Actually, let's just mention, this 249 00:12:52,430 --> 00:12:56,779 is Jim Kent, who's the guru who did the first human genome 250 00:12:56,779 --> 00:12:58,570 assembly-- at least, that was widely used-- 251 00:12:58,570 --> 00:13:00,950 and also was involved in UCSC. 252 00:13:00,950 --> 00:13:05,810 And here, Ewan Birney has started Ensembl 253 00:13:05,810 --> 00:13:08,904 and continues to run it today. 254 00:13:08,904 --> 00:13:10,820 You know who these other people are, probably. 255 00:13:10,820 --> 00:13:11,830 OK. 256 00:13:11,830 --> 00:13:12,330 All right. 257 00:13:12,330 --> 00:13:19,200 So in another phase of the last decade, 258 00:13:19,200 --> 00:13:24,070 I would say that much of biological research 259 00:13:24,070 --> 00:13:27,570 became more high throughput than it was before. 260 00:13:27,570 --> 00:13:29,270 So molecular biology had traditionally, 261 00:13:29,270 --> 00:13:31,670 in the '80s and '90s, mostly focused 262 00:13:31,670 --> 00:13:35,420 on analysis of individual gene or protein products. 
263 00:13:35,420 --> 00:13:38,970 But now it became possible, and in widespread use, 264 00:13:38,970 --> 00:13:42,380 that you could measure the expression of all the genes, 265 00:13:42,380 --> 00:13:45,270 in theory-- using microarrays, for example-- 266 00:13:45,270 --> 00:13:48,330 and you could start to profile all 267 00:13:48,330 --> 00:13:49,900 of the transcripts in the cell, all 268 00:13:49,900 --> 00:13:53,670 of the proteins in the cell, and so forth. 269 00:13:53,670 --> 00:13:55,690 And then, a variety of groups started 270 00:13:55,690 --> 00:13:57,510 to use some of these high throughput data 271 00:13:57,510 --> 00:14:00,850 to study various challenges in gene expression 272 00:14:00,850 --> 00:14:04,710 to understand how transcription works, how splicing works, 273 00:14:04,710 --> 00:14:07,940 how microRNAs work, translation, epigenetics, and so forth. 274 00:14:07,940 --> 00:14:14,920 And you'll hear updates on some of that work in this course. 275 00:14:14,920 --> 00:14:17,000 Bioimage informatics, particularly 276 00:14:17,000 --> 00:14:21,000 for developmental biology, became popular. 277 00:14:21,000 --> 00:14:25,230 It continues to be a new emerging area. 278 00:14:25,230 --> 00:14:31,980 Systems biology was also really born around 2000, roughly. 279 00:14:31,980 --> 00:14:35,610 A very prominent example would be the development 280 00:14:35,610 --> 00:14:38,890 of the first gene regulatory network 281 00:14:38,890 --> 00:14:43,020 models that describe sea urchin development, here, by Eric 282 00:14:43,020 --> 00:14:48,565 Davidson, as well as a whole variety of models of other gene 283 00:14:48,565 --> 00:14:50,510 networks in the cell that control things like 284 00:14:50,510 --> 00:14:53,990 cell proliferation, apoptosis, et cetera. 
285 00:14:56,660 --> 00:15:01,450 At the same time, a new field of synthetic biology 286 00:15:01,450 --> 00:15:05,470 was born with the development of some of the first 287 00:15:05,470 --> 00:15:09,610 completely artificial gene networks that would then 288 00:15:09,610 --> 00:15:14,220 program cells to perform desired behavior. 289 00:15:14,220 --> 00:15:16,810 So an example would be this so-called repressilator, 290 00:15:16,810 --> 00:15:20,050 where you have a network of three transcription factors. 291 00:15:20,050 --> 00:15:22,100 Each represses the other. 292 00:15:22,100 --> 00:15:24,127 And then, one of them represses GFP. 293 00:15:24,127 --> 00:15:25,460 And you put these into bacteria. 294 00:15:25,460 --> 00:15:29,740 And it causes oscillations in GFP expression 295 00:15:29,740 --> 00:15:33,400 that are described by these differential equations here. 296 00:15:33,400 --> 00:15:36,856 And some of the modeling approaches 297 00:15:36,856 --> 00:15:38,230 used in synthetic biology will be 298 00:15:38,230 --> 00:15:43,910 covered by Professors Fraenkel and Lauffenburger later. 299 00:15:43,910 --> 00:15:44,410 All right. 300 00:15:44,410 --> 00:15:48,840 So late 2000s, early 2010s, it's still too early 301 00:15:48,840 --> 00:15:53,130 to say, for sure, what the most important developments will be. 302 00:15:53,130 --> 00:15:57,020 But certainly, in the late 2000s, 303 00:15:57,020 --> 00:15:59,020 next gen sequencing-- which now probably 304 00:15:59,020 --> 00:16:00,936 should be called second generation sequencing, 305 00:16:00,936 --> 00:16:04,280 since there may be future generations-- 306 00:16:04,280 --> 00:16:08,980 really started to transform a whole wide variety 307 00:16:08,980 --> 00:16:12,927 of applications in biology, from making genome sequencing-- 308 00:16:12,927 --> 00:16:15,010 instead of having to be done in the genome center, 309 00:16:15,010 --> 00:16:18,640 now an individual lab can easily do microbial genome sequencing. 
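The repressilator described above, a ring of three repressors with one of them also repressing GFP, can be sketched as a small simulation. The equations shown on the lecture slide are not in this transcript, so the following is a deliberately simplified, protein-only model with Hill-type repression, integrated with a plain Euler step. The function name `simulate_repressilator` and all parameter values (`alpha`, `hill`, `dt`, `steps`, the initial concentrations) are illustrative assumptions, not values from the lecture or from the original publication.

```python
def simulate_repressilator(alpha=50.0, hill=4.0, dt=0.01, steps=5000,
                           initial=(1.0, 2.0, 3.0)):
    """Euler-integrate a simplified three-repressor ring plus a GFP reporter.

    Assumed model (dimensionless, illustration only):
        d p_i / dt = alpha / (1 + p_{i-1}^hill) - p_i
        d gfp / dt = alpha / (1 + p_0^hill)     - gfp
    """
    proteins = list(initial)
    gfp = 0.0
    trajectory = []
    for _ in range(steps):
        # Repressor i is made more slowly as the previous repressor in the ring rises.
        rates = [alpha / (1.0 + proteins[i - 1] ** hill) - proteins[i]
                 for i in range(3)]
        # GFP is repressed by the first repressor and decays at unit rate.
        gfp_rate = alpha / (1.0 + proteins[0] ** hill) - gfp
        proteins = [p + dt * r for p, r in zip(proteins, rates)]
        gfp += dt * gfp_rate
        trajectory.append((proteins[0], proteins[1], proteins[2], gfp))
    return trajectory


trajectory = simulate_repressilator()
gfp_late = [row[3] for row in trajectory[len(trajectory) // 2:]]
print("late GFP range:", min(gfp_late), max(gfp_late))
```

With these assumed parameters the symmetric steady state is unstable, so the GFP level keeps swinging between low and high values rather than settling. A more faithful model would also track mRNA and use measured rates.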
310 00:16:18,640 --> 00:16:21,560 And when needed, it's possible, also, 311 00:16:21,560 --> 00:16:25,140 to do genome sequencing of larger organisms. 312 00:16:25,140 --> 00:16:28,850 Transcriptome sequencing is now routine. 313 00:16:28,850 --> 00:16:31,290 We'll hear about that. 314 00:16:31,290 --> 00:16:34,550 There are applications for mapping protein-DNA 315 00:16:34,550 --> 00:16:36,510 interactions genome wide, including 316 00:16:36,510 --> 00:16:38,750 both sequence specific transcription factors 317 00:16:38,750 --> 00:16:41,710 as well as more general factors like histones, 318 00:16:41,710 --> 00:16:43,930 protein-RNA interactions-- a method called 319 00:16:43,930 --> 00:16:47,860 CLIP-Seq-- methods for mapping all the translated messages, 320 00:16:47,860 --> 00:16:51,320 the methylated sites in the genome, open chromatin, 321 00:16:51,320 --> 00:16:52,730 and so forth. 322 00:16:52,730 --> 00:16:55,250 So many people contributed to this, obviously. 323 00:16:55,250 --> 00:16:57,530 I'm just mentioning, Barbara Wold 324 00:16:57,530 --> 00:17:03,240 was a pioneer in both RNA-Seq as well as ChIP-Seq. 325 00:17:03,240 --> 00:17:07,380 And some of the sequencing technologies that came out here 326 00:17:07,380 --> 00:17:09,390 are shown here. 327 00:17:09,390 --> 00:17:12,800 And we will discuss those at the beginning of lecture 328 00:17:12,800 --> 00:17:13,960 on Thursday. 329 00:17:13,960 --> 00:17:16,089 So I encourage you to read this review here, 330 00:17:16,089 --> 00:17:19,990 by Metzker, which covers many of the newer sequencing 331 00:17:19,990 --> 00:17:21,219 technologies. 332 00:17:21,219 --> 00:17:22,510 And they're pretty interesting. 333 00:17:22,510 --> 00:17:24,800 As you'll see, there's some interesting tricks, 334 00:17:24,800 --> 00:17:29,600 interesting chemistry and image analysis tricks. 335 00:17:29,600 --> 00:17:30,100 All right. 336 00:17:30,100 --> 00:17:33,540 So that was not very scholarly. 
337 00:17:33,540 --> 00:17:37,960 But if you want a proper history, then this guy, 338 00:17:37,960 --> 00:17:41,120 Hallam Stevens, who was a History of Science PhD student 339 00:17:41,120 --> 00:17:43,150 at Harvard and recently graduated, 340 00:17:43,150 --> 00:17:46,250 wrote this history of bioinformatics. 341 00:17:49,000 --> 00:17:49,910 OK. 342 00:17:49,910 --> 00:17:51,290 So let's look at the syllabus. 343 00:17:51,290 --> 00:17:55,665 So also posted on the [INAUDIBLE] site is a syllabus. 344 00:17:55,665 --> 00:17:57,430 It looks like this. 345 00:17:57,430 --> 00:18:00,430 This is quite an information-rich document. 346 00:18:00,430 --> 00:18:03,280 It has all the lecture titles, all the due dates 347 00:18:03,280 --> 00:18:04,940 of all the problem sets, and so forth. 348 00:18:04,940 --> 00:18:08,710 So please print yourself a copy and familiarize yourself 349 00:18:08,710 --> 00:18:09,510 with it. 350 00:18:09,510 --> 00:18:13,270 So we'll just try to look at a high level 351 00:18:13,270 --> 00:18:15,920 first, and then zoom in to the details. 352 00:18:15,920 --> 00:18:20,600 So at a high level, if you look in this column here, 353 00:18:20,600 --> 00:18:25,681 we've broken the course into six different topics. 354 00:18:25,681 --> 00:18:26,180 OK? 355 00:18:26,180 --> 00:18:29,430 So there's Genomic Analysis I, that I'll 356 00:18:29,430 --> 00:18:34,080 be teaching, which is more classical computational 357 00:18:34,080 --> 00:18:38,980 biology, you could say-- local alignment, global alignment, 358 00:18:38,980 --> 00:18:39,910 and so forth. 
359 00:18:39,910 --> 00:18:42,295 Then, Genomic Analysis II, which Professor Gifford will 360 00:18:42,295 --> 00:18:44,770 be teaching, covers some newer methods 361 00:18:44,770 --> 00:18:46,390 that are required when you're doing 362 00:18:46,390 --> 00:18:49,580 a lot of second generation sequencing-- 363 00:18:49,580 --> 00:18:51,550 the standard algorithms are not fast enough, 364 00:18:51,550 --> 00:18:55,960 you need better algorithms, and so forth. 365 00:18:55,960 --> 00:19:02,180 And then, I will come back and give a few lectures on modeling 366 00:19:02,180 --> 00:19:03,300 biological function. 367 00:19:03,300 --> 00:19:06,630 This will have to do with sequence motifs, hidden Markov 368 00:19:06,630 --> 00:19:09,217 models, and RNA secondary structure. 369 00:19:09,217 --> 00:19:10,800 Professor Fraenkel will then do a unit 370 00:19:10,800 --> 00:19:13,190 on proteomics and protein structure. 371 00:19:13,190 --> 00:19:16,080 And then, there will be an extended unit 372 00:19:16,080 --> 00:19:17,777 on regulatory networks. 373 00:19:17,777 --> 00:19:19,360 Different types of regulatory networks 374 00:19:19,360 --> 00:19:24,480 will be covered, with most of the lectures by Ernest, 375 00:19:24,480 --> 00:19:27,020 one by David, and one by Doug. 376 00:19:27,020 --> 00:19:30,660 And then, we'll finish up with computational genetics, 377 00:19:30,660 --> 00:19:32,440 by David. 378 00:19:32,440 --> 00:19:37,380 And there will also be some guest lecturers, one of them 379 00:19:37,380 --> 00:19:39,180 interspersed in regulatory networks, 380 00:19:39,180 --> 00:19:43,780 and then two at the end, from Ron Weiss and George Church. 381 00:19:43,780 --> 00:19:49,090 So I just wanted to point out that in all of these topics, 382 00:19:49,090 --> 00:19:52,299 we will include some discussion of motivating questions. 
383 00:19:52,299 --> 00:19:53,840 So, what are the biological questions 384 00:19:53,840 --> 00:19:56,760 that we're seeking to address with these approaches. 385 00:19:56,760 --> 00:19:59,590 And there will also be some discussion 386 00:19:59,590 --> 00:20:00,990 of the experimental method. 387 00:20:00,990 --> 00:20:02,940 So for example, in the first unit, 388 00:20:02,940 --> 00:20:04,640 it's heavy on sequence analysis. 389 00:20:04,640 --> 00:20:08,240 So we'll talk about how sequencing is done, 390 00:20:08,240 --> 00:20:12,440 and then, quite a bit about the interaction 391 00:20:12,440 --> 00:20:14,810 between the experimental technology 392 00:20:14,810 --> 00:20:18,630 and the computational analysis, which often involves 393 00:20:18,630 --> 00:20:20,470 statistical methods for estimating the error 394 00:20:20,470 --> 00:20:23,740 rate of the experimental method, and things like that. 395 00:20:23,740 --> 00:20:25,690 So the emphasis is on the computational part, 396 00:20:25,690 --> 00:20:31,751 but we'll have some discussion about experiments. 397 00:20:31,751 --> 00:20:32,750 Everyone with me so far? 398 00:20:32,750 --> 00:20:33,590 Any questions? 399 00:20:36,440 --> 00:20:39,290 OK. 400 00:20:39,290 --> 00:20:40,090 All right. 401 00:20:40,090 --> 00:20:42,050 So what are some of these motivating questions 402 00:20:42,050 --> 00:20:43,216 that we'll be talking about? 403 00:20:43,216 --> 00:20:48,560 So what are the instructions encoded in our genomes? 404 00:20:48,560 --> 00:20:50,760 You can think of the genome as a book. 405 00:20:50,760 --> 00:20:52,610 But it's in this very strange language. 406 00:20:52,610 --> 00:20:57,450 And we need to understand the rules, the code that underlies 407 00:20:57,450 --> 00:21:01,730 a lot of research in gene expression. 408 00:21:01,730 --> 00:21:03,640 How are chromosomes organized? 409 00:21:03,640 --> 00:21:07,840 What genes are present-- so tools for annotating genomes. 
410 00:21:07,840 --> 00:21:10,220 What regulatory circuitry is encoded? 411 00:21:10,220 --> 00:21:13,470 You'd like to be able to eventually look at a genome, 412 00:21:13,470 --> 00:21:15,590 understand all the regulatory elements, 413 00:21:15,590 --> 00:21:18,960 and be able to predict that there's some feedback 414 00:21:18,960 --> 00:21:23,650 circuit there that responds to-- a particular stimulation that 415 00:21:23,650 --> 00:21:26,360 responds to light, or nutrient deprivation, 416 00:21:26,360 --> 00:21:28,195 or whatever it might be. 417 00:21:28,195 --> 00:21:30,320 Can the transcriptome be predicted from the genome? 418 00:21:30,320 --> 00:21:33,040 This is a longstanding question. 419 00:21:33,040 --> 00:21:37,220 The translatome, if you will-- well, let's say, 420 00:21:37,220 --> 00:21:40,670 the proteome can be predicted from the transcriptome 421 00:21:40,670 --> 00:21:43,570 in the sense that we have a genetic code 422 00:21:43,570 --> 00:21:46,850 and we can look up those triplets. 423 00:21:46,850 --> 00:21:49,340 So there's a dream that we would be 424 00:21:49,340 --> 00:21:53,550 able to model other steps in gene expression 425 00:21:53,550 --> 00:21:57,780 with the precision with which the genetic code predicts 426 00:21:57,780 --> 00:22:00,340 translation-- that we'd be able to predict where 427 00:22:00,340 --> 00:22:02,590 the polymerase would start transcribing, where it will 428 00:22:02,590 --> 00:22:08,100 finish transcribing, how a transcript will be spliced, 429 00:22:08,100 --> 00:22:13,800 et cetera-- all the other steps in gene expression. 430 00:22:13,800 --> 00:22:17,470 And that motivates a lot of work in the field. 431 00:22:17,470 --> 00:22:19,530 Can protein function be predicted from sequence? 432 00:22:19,530 --> 00:22:23,240 So this is a very classical problem. 
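The remark above that the proteome can be predicted from the transcriptome by looking up triplets in the genetic code can be made concrete with a short Python sketch. It uses the standard genetic code and assumes the input is an RNA string that is already in frame; the names `CODON_TABLE` and `translate` are illustrative, and real tools handle details this sketch ignores (start-site selection, alternative codes, ambiguous bases).

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an in-frame mRNA string until the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCUGGUAA"))  # prints MAW
```

The lookup is a fixed table of 64 entries, which is why this step is so much easier to predict than transcription start sites or splicing, where no comparably simple code is known.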
433 00:22:23,240 --> 00:22:26,680 But there are a number of new and interesting developments 434 00:22:26,680 --> 00:22:30,900 as a result of a lot of this high throughput data 435 00:22:30,900 --> 00:22:34,060 generation, both in nucleic acid sequencing 436 00:22:34,060 --> 00:22:36,830 as well as in proteomics. 437 00:22:36,830 --> 00:22:39,520 Can evolutionary history be reconstructed from sequence? 438 00:22:39,520 --> 00:22:43,292 Again, this has been a longstanding goal of the field. 439 00:22:43,292 --> 00:22:45,000 And a lot of progress has been made here. 440 00:22:45,000 --> 00:22:48,400 And now most evolutionary classifications 441 00:22:48,400 --> 00:22:52,070 are actually based on molecular sequence at some level. 442 00:22:52,070 --> 00:22:56,440 And new species are often defined based on sequence. 443 00:22:58,990 --> 00:22:59,800 OK. 444 00:22:59,800 --> 00:23:01,210 Other motivating questions. 445 00:23:01,210 --> 00:23:03,717 So what would you need to measure 446 00:23:03,717 --> 00:23:05,800 if you wanted to discover the causes of a disease, 447 00:23:05,800 --> 00:23:07,866 the mechanisms of existing drugs, 448 00:23:07,866 --> 00:23:09,490 metabolic pathways in a micro-organism? 449 00:23:09,490 --> 00:23:11,980 So this is a systems biology question. 450 00:23:11,980 --> 00:23:13,350 You've got a new bug. 451 00:23:13,350 --> 00:23:16,030 It causes some disease. 452 00:23:16,030 --> 00:23:17,030 What should you measure? 453 00:23:17,030 --> 00:23:18,110 Should you sequence its genome? 454 00:23:18,110 --> 00:23:19,693 Should you sequence its transcriptome? 455 00:23:19,693 --> 00:23:21,540 Should you do proteomics? 456 00:23:21,540 --> 00:23:23,260 What type of proteomics? 457 00:23:23,260 --> 00:23:28,330 Should you perturb the system in some way and do a time series? 458 00:23:28,330 --> 00:23:31,050 What are the most efficient ways?
459 00:23:31,050 --> 00:23:33,530 What information should be gathered, and in what 460 00:23:33,530 --> 00:23:34,150 quantities? 461 00:23:34,150 --> 00:23:37,810 And how should that information be integrated in order 462 00:23:37,810 --> 00:23:40,895 to come up with an understanding of the physiology 463 00:23:40,895 --> 00:23:42,640 of that organism so that, then, you 464 00:23:42,640 --> 00:23:44,900 can know where to intervene, what 465 00:23:44,900 --> 00:23:46,470 would be suitable drug targets? 466 00:23:50,200 --> 00:23:50,700 Yeah. 467 00:23:50,700 --> 00:23:51,770 What kind of modeling would help you 468 00:23:51,770 --> 00:23:53,540 to use the data to design new therapies, 469 00:23:53,540 --> 00:23:56,920 or even, in a synthetic biology context, 470 00:23:56,920 --> 00:24:01,435 to re-engineer organisms for new purposes? 471 00:24:01,435 --> 00:24:04,890 So microbes to generate-- to produce 472 00:24:04,890 --> 00:24:09,820 fuel, for example, or other useful products. 473 00:24:09,820 --> 00:24:11,670 What can we currently measure? 474 00:24:11,670 --> 00:24:15,119 What does each type of data mean individually? 475 00:24:15,119 --> 00:24:16,660 What are the strengths and weaknesses 476 00:24:16,660 --> 00:24:20,010 of each of the types of high throughput approaches 477 00:24:20,010 --> 00:24:21,580 that we have? 478 00:24:21,580 --> 00:24:26,136 And how do we integrate all the data we have on a system 479 00:24:26,136 --> 00:24:28,010 to understand the functioning of that system? 480 00:24:28,010 --> 00:24:30,600 So these are some of the questions that 481 00:24:30,600 --> 00:24:35,920 motivate the latter topics on regulatory networks. 482 00:24:35,920 --> 00:24:36,840 OK. 483 00:24:36,840 --> 00:24:40,862 So let's now zoom in and look more closely 484 00:24:40,862 --> 00:24:41,820 at the course syllabus. 485 00:24:41,820 --> 00:24:44,440 I've just broken it into two halves, 486 00:24:44,440 --> 00:24:48,500 just so it's more readable.
487 00:24:48,500 --> 00:24:52,010 So today we're going over, obviously, course mechanics 488 00:24:52,010 --> 00:24:52,510 mostly. 489 00:24:57,160 --> 00:25:01,120 On Thursday, we'll cover both some DNA sequencing 490 00:25:01,120 --> 00:25:03,970 technologies and we'll talk about local alignment and BLAST. 491 00:25:03,970 --> 00:25:05,100 More on that in a bit. 492 00:25:10,060 --> 00:25:16,790 The 6.8047 recitation-- 6.874, thank you-- recitation 493 00:25:16,790 --> 00:25:19,250 will be on Friday. 494 00:25:19,250 --> 00:25:21,460 The other recitations will start next week. 495 00:25:21,460 --> 00:25:27,022 And then, as you can see, we'll move through the other topics. 496 00:25:27,022 --> 00:25:28,480 So each of the instructors is going 497 00:25:28,480 --> 00:25:31,520 to briefly review their topic. 498 00:25:31,520 --> 00:25:34,210 So I won't go through all the titles here. 499 00:25:34,210 --> 00:25:37,490 But please note, on the left side 500 00:25:37,490 --> 00:25:41,270 here, that the assignment due dates are marked. 501 00:25:41,270 --> 00:25:41,950 OK? 502 00:25:41,950 --> 00:25:45,530 And they're all due at noon on the indicated day. 503 00:25:45,530 --> 00:25:48,210 And so some of these are problem sets. 504 00:25:48,210 --> 00:25:49,910 So for example, problem set 1 will 505 00:25:49,910 --> 00:25:53,410 be due on Thursday, February 20 at noon. 506 00:25:53,410 --> 00:25:56,770 And some of the other assignments 507 00:25:56,770 --> 00:26:00,200 relate to the project component of the course, which we're 508 00:26:00,200 --> 00:26:02,610 going to talk more about in a moment. 509 00:26:02,610 --> 00:26:04,970 In particular, we're going to ask 510 00:26:04,970 --> 00:26:10,860 you to submit a brief statement of your background 511 00:26:10,860 --> 00:26:14,630 and your research interests related to forming teams. 512 00:26:14,630 --> 00:26:16,700 So the projects are going to be done 513 00:26:16,700 --> 00:26:19,290 in teams of one to five students.
514 00:26:19,290 --> 00:26:23,910 And in order to facilitate especially 515 00:26:23,910 --> 00:26:26,100 cross disciplinary teams-- we'd love 516 00:26:26,100 --> 00:26:28,340 if you interact with, maybe, students 517 00:26:28,340 --> 00:26:31,440 in a different grad program, or whatever-- 518 00:26:31,440 --> 00:26:33,820 you'll post your background. 519 00:26:33,820 --> 00:26:36,350 You know, I'm a first year BE student 520 00:26:36,350 --> 00:26:39,200 and I have a background in Perl programming, 521 00:26:39,200 --> 00:26:42,470 but never done Python, or whatever-- something like that. 522 00:26:42,470 --> 00:26:47,310 And then, I'm interested in doing systems biology 523 00:26:47,310 --> 00:26:50,290 modeling in microbial systems, or something like that. 524 00:26:50,290 --> 00:26:52,460 And then you can match up your interests 525 00:26:52,460 --> 00:26:54,000 with others and form teams. 526 00:26:54,000 --> 00:26:57,870 And then you'll come up with your own project ideas 527 00:26:57,870 --> 00:26:59,710 so that the team and initial idea 528 00:26:59,710 --> 00:27:04,090 will be due here, February 25th. 529 00:27:04,090 --> 00:27:08,010 Then you'll need to do some aims and so forth. 530 00:27:08,010 --> 00:27:11,220 So the project components here, these are only for those 531 00:27:11,220 --> 00:27:13,870 taking the grad version of the course. 532 00:27:13,870 --> 00:27:15,770 We'll make that clear later. 533 00:27:15,770 --> 00:27:16,310 OK. 534 00:27:16,310 --> 00:27:20,710 So after the first three topics here, 535 00:27:20,710 --> 00:27:26,170 taught by myself and David, there will be an exam. 536 00:27:26,170 --> 00:27:27,140 More on that later. 537 00:27:27,140 --> 00:27:30,990 And then there will be three more topics, mostly taught 538 00:27:30,990 --> 00:27:34,110 by Ernest. 
539 00:27:34,110 --> 00:27:38,529 And notice there are additional assignments here related 540 00:27:38,529 --> 00:27:40,320 to the project-- so, to research strategy-- 541 00:27:40,320 --> 00:27:45,490 and the final written report, additional problem sets. 542 00:27:45,490 --> 00:27:48,350 We'll have a guest lecturer here. 543 00:27:48,350 --> 00:27:50,750 This will be Ron Weiss. 544 00:27:50,750 --> 00:27:53,840 Then, there will be the second exam. 545 00:27:53,840 --> 00:27:56,105 Exams are non-cumulative, so the second exam 546 00:27:56,105 --> 00:28:02,370 will just cover these three topics here predominantly. 547 00:28:02,370 --> 00:28:06,300 And then there will be another guest lecturer. 548 00:28:06,300 --> 00:28:08,290 This would be George Church here. 549 00:28:08,290 --> 00:28:10,480 And then notice, here, presentation. 550 00:28:10,480 --> 00:28:15,170 So those who are doing the project component, 551 00:28:15,170 --> 00:28:19,820 those teams will be given-- assigned a time to present, 552 00:28:19,820 --> 00:28:23,490 to the class, the results of their research. 553 00:28:23,490 --> 00:28:28,600 And you'll be graded-- the presentation will 554 00:28:28,600 --> 00:28:31,439 be part of the overall project grade assigned 555 00:28:31,439 --> 00:28:32,230 by the instructors. 556 00:28:32,230 --> 00:28:35,640 But you'll also-- we'll also ask all the students 557 00:28:35,640 --> 00:28:38,720 in the class to send comments on the presentations. 558 00:28:38,720 --> 00:28:41,850 So you may find that you get helpful suggestions 559 00:28:41,850 --> 00:28:44,410 about interpreting your data from other people and so forth. 560 00:28:44,410 --> 00:28:48,170 So that'll be a required component 561 00:28:48,170 --> 00:28:51,800 of the course for all students, to attend the presentations 562 00:28:51,800 --> 00:28:53,210 and comment on them. 563 00:28:53,210 --> 00:28:56,820 And we hope that will be a lot of fun. 564 00:28:56,820 --> 00:28:57,320 OK.
565 00:28:57,320 --> 00:28:59,220 So is this the right course for me? 566 00:28:59,220 --> 00:29:02,350 So I just wanted to let you know you're 567 00:29:02,350 --> 00:29:05,880 fortunate to have a rich selection of courses 568 00:29:05,880 --> 00:29:09,080 in computational, systems, and synthetic biology here at MIT. 569 00:29:09,080 --> 00:29:10,760 I've listed many of them. 570 00:29:10,760 --> 00:29:14,510 Probably not all, but the ones that I'm 571 00:29:14,510 --> 00:29:16,820 aware of that are available on-campus. 572 00:29:19,860 --> 00:29:22,750 7.57 is really only for biology grad students. 573 00:29:22,750 --> 00:29:26,030 But the other courses listed here are generally open. 574 00:29:26,030 --> 00:29:29,180 Some are more geared for graduate students, some more 575 00:29:29,180 --> 00:29:29,990 undergrads. 576 00:29:29,990 --> 00:29:31,210 Some are more specialized. 577 00:29:31,210 --> 00:29:35,080 So for example, Jeff Gore's systems biology course, 578 00:29:35,080 --> 00:29:37,030 it's more focused on systems biology 579 00:29:37,030 --> 00:29:41,307 whereas our course covers both computational and systems. 580 00:29:41,307 --> 00:29:43,770 So keep that in mind. 581 00:29:43,770 --> 00:29:45,270 Make sure you're in the right place, 582 00:29:45,270 --> 00:29:48,050 that this is what you want. 583 00:29:48,050 --> 00:29:48,910 OK. 584 00:29:48,910 --> 00:29:54,000 A few notes on the textbook-- so there is a textbook. 585 00:29:54,000 --> 00:29:56,400 It's not required. 586 00:29:56,400 --> 00:29:58,080 It's called Understanding Bioinformatics 587 00:29:58,080 --> 00:30:00,000 by Zvelebil and Baum. 588 00:30:00,000 --> 00:30:03,270 It's quite good on certain topics. 589 00:30:03,270 --> 00:30:07,590 But it really only covers about, maybe, a third of what 590 00:30:07,590 --> 00:30:09,410 we cover in the course.
591 00:30:09,410 --> 00:30:12,890 So there is good content on local alignment, 592 00:30:12,890 --> 00:30:14,690 global alignment, scoring matrices-- 593 00:30:14,690 --> 00:30:17,070 the topics of the next couple lectures. 594 00:30:17,070 --> 00:30:20,220 And I'll point you to those chapters. 595 00:30:20,220 --> 00:30:23,130 But it's very important to emphasize 596 00:30:23,130 --> 00:30:26,740 that the content of the course is really 597 00:30:26,740 --> 00:30:30,410 what happens in lecture, and on the homeworks, 598 00:30:30,410 --> 00:30:34,290 and to some extent, what happens in recitation. 599 00:30:34,290 --> 00:30:39,270 And the textbook is just there as a backup, if you will, 600 00:30:39,270 --> 00:30:41,990 or for those who would like to get more 601 00:30:41,990 --> 00:30:46,860 background on the topic or want to read 602 00:30:46,860 --> 00:30:50,610 a different description of that topic. 603 00:30:50,610 --> 00:30:55,870 So you decide whether you want to purchase the textbook 604 00:30:55,870 --> 00:30:56,370 or not. 605 00:30:56,370 --> 00:30:59,330 It's available at the Coop or through Amazon. 606 00:30:59,330 --> 00:31:00,630 Shop around. 607 00:31:00,630 --> 00:31:01,895 You can find it. 608 00:31:01,895 --> 00:31:02,520 It's paperback. 609 00:31:05,030 --> 00:31:08,500 Pretty good general reference on a variety of topics, 610 00:31:08,500 --> 00:31:11,881 but it doesn't really have much on systems biology. 611 00:31:11,881 --> 00:31:12,380 All right. 612 00:31:12,380 --> 00:31:15,210 Another important reference that was developed specifically 613 00:31:15,210 --> 00:31:19,230 for this course a few years ago is the probability 614 00:31:19,230 --> 00:31:20,610 and statistics primer. 615 00:31:20,610 --> 00:31:24,350 So you'll notice that some of the homeworks, 616 00:31:24,350 --> 00:31:27,410 particularly in the earlier parts of the course, 617 00:31:27,410 --> 00:31:30,630 will have significant probability and statistics. 
618 00:31:30,630 --> 00:31:35,800 And we assume that you have some background in this area. 619 00:31:35,800 --> 00:31:37,120 Many of you do. 620 00:31:37,120 --> 00:31:40,840 If you don't, you'll need to pick that up. 621 00:31:40,840 --> 00:31:44,710 And this primer was written to provide those topics, 622 00:31:44,710 --> 00:31:50,270 in probability especially, that are foundational 623 00:31:50,270 --> 00:31:52,910 and most relevant to computational biology. 624 00:31:52,910 --> 00:31:57,540 So for example, there are some concepts 625 00:31:57,540 --> 00:32:02,600 like p-value, probability density function, probability 626 00:32:02,600 --> 00:32:07,010 mass function, cumulative distribution function, 627 00:32:07,010 --> 00:32:10,180 and then, common distributions, exponential distribution, 628 00:32:10,180 --> 00:32:13,905 Poisson distribution, extreme value distribution. 629 00:32:16,940 --> 00:32:20,800 If those are mostly sounding familiar to you, that's good. 630 00:32:20,800 --> 00:32:24,600 If they're familiar, but you couldn't-- you really 631 00:32:24,600 --> 00:32:27,920 don't-- you get binomial and Poisson confused or something, 632 00:32:27,920 --> 00:32:33,100 then, definitely, you want to consult this primer. 633 00:32:33,100 --> 00:32:39,860 So I think, looking at the lectures and the homeworks, 634 00:32:39,860 --> 00:32:42,550 it should probably become pretty clear 635 00:32:42,550 --> 00:32:45,500 which aspects are going to be relevant. 636 00:32:45,500 --> 00:32:48,410 And I'll try to point those out when possible. 637 00:32:48,410 --> 00:32:51,490 And you can also consult your TAs 638 00:32:51,490 --> 00:32:54,631 if you're having trouble with the probability and statistics 639 00:32:54,631 --> 00:32:55,130 content. 640 00:32:55,130 --> 00:32:57,280 So we are going to focus, here, on, really, 641 00:32:57,280 --> 00:33:00,340 the computational biology, bioinformatics content. 
642 00:33:00,340 --> 00:33:04,792 And we might briefly review a concept from probability, 643 00:33:04,792 --> 00:33:06,500 like, maybe, conditional probability when 644 00:33:06,500 --> 00:33:07,666 we talk about Markov chains. 645 00:33:07,666 --> 00:33:09,620 But we're not going to spend a lot of time. 646 00:33:09,620 --> 00:33:10,560 So if that's the first time you've 647 00:33:10,560 --> 00:33:13,200 seen conditional probability, you might be a little bit lost. 648 00:33:13,200 --> 00:33:18,760 So you'd be better off reading about it in advance. 649 00:33:18,760 --> 00:33:20,270 OK? 650 00:33:20,270 --> 00:33:22,329 Questions? 651 00:33:22,329 --> 00:33:22,870 No questions? 652 00:33:22,870 --> 00:33:23,370 All right. 653 00:33:23,370 --> 00:33:26,900 Maybe it's that the video is intimidating people. 654 00:33:26,900 --> 00:33:28,420 OK. 655 00:33:28,420 --> 00:33:31,120 All right. 656 00:33:31,120 --> 00:33:33,920 The TAs know a lot about probability and statistics 657 00:33:33,920 --> 00:33:36,260 and will be able to help you. 658 00:33:36,260 --> 00:33:36,910 OK. 659 00:33:36,910 --> 00:33:38,820 So homework-- so I apologize, the font 660 00:33:38,820 --> 00:33:40,270 is a little bit small here. 661 00:33:40,270 --> 00:33:42,040 So I'll try to state it clearly. 662 00:33:42,040 --> 00:33:46,080 So there are going to be five problem 663 00:33:46,080 --> 00:33:51,640 sets that are roughly one per topic. 664 00:33:51,640 --> 00:33:54,962 Except you'll see p set two covers topics two and three. 665 00:33:54,962 --> 00:33:56,420 So it might be a little bit longer. 666 00:34:01,450 --> 00:34:08,090 The way we handle students who have to travel-- so many of you 667 00:34:08,090 --> 00:34:08,810 might be seniors. 668 00:34:08,810 --> 00:34:10,768 You might be interviewing for graduate schools. 669 00:34:10,768 --> 00:34:13,949 Or you might have other conflicts with the course. 
670 00:34:13,949 --> 00:34:15,820 So rather than doing that on a case 671 00:34:15,820 --> 00:34:20,050 by case basis, which, we've found, gets very complicated 672 00:34:20,050 --> 00:34:23,000 and is not necessarily fair, the way we've set it up 673 00:34:23,000 --> 00:34:25,620 is that the total number of points available on the five 674 00:34:25,620 --> 00:34:28,040 homeworks is 120. 675 00:34:28,040 --> 00:34:28,639 OK? 676 00:34:28,639 --> 00:34:33,870 But the maximum score that you can get is 100. 677 00:34:33,870 --> 00:34:40,929 So if you, for example, were to get 90% on all five 678 00:34:40,929 --> 00:34:43,350 of the homeworks, that would be 90% 679 00:34:43,350 --> 00:34:46,380 of 120, which would be 108 points. 680 00:34:46,380 --> 00:34:48,980 You would get the full 100-- you'd 681 00:34:48,980 --> 00:34:50,800 get 100% on your homework. 682 00:34:50,800 --> 00:34:53,070 That would be a perfect score on the homework. 683 00:34:53,070 --> 00:34:53,570 OK? 684 00:34:53,570 --> 00:34:56,130 But because of that-- because there's more points available 685 00:34:56,130 --> 00:35:02,060 than you need-- we don't allow you to drop homeworks, 686 00:35:02,060 --> 00:35:04,730 or to do an alternate assignment, or something 687 00:35:04,730 --> 00:35:06,690 like that. 688 00:35:06,690 --> 00:35:08,510 So the way it works is you can basically 689 00:35:08,510 --> 00:35:12,441 miss-- as long as you do well on, say, 690 00:35:12,441 --> 00:35:14,190 four of the homeworks-- you could actually 691 00:35:14,190 --> 00:35:16,540 miss one without much of a penalty. 692 00:35:16,540 --> 00:35:22,090 For example, if each of the homeworks were worth 24 points 693 00:35:22,090 --> 00:35:24,770 and you got a perfect score on four of them, 694 00:35:24,770 --> 00:35:26,800 that would be 96 points. 695 00:35:26,800 --> 00:35:29,300 You would have an almost perfect score on your homework 696 00:35:29,300 --> 00:35:31,380 and you could miss that fifth homework. 
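The homework arithmetic just described can be checked with a short sketch. This is only an illustration, not course software: the function name `homework_score` is made up, and the equal split of 24 points per problem set is the hypothetical example from the lecture (the actual point values differ by problem set).

```python
def homework_score(points_earned, cap=100):
    """Add up the points earned on all problem sets and cap the total.

    120 points are available across the five problem sets, but the
    homework score cannot exceed `cap` (100).
    """
    return min(sum(points_earned), cap)


# 90% of the 120 available points is 108 raw points, which is capped at 100.
ninety_percent_everywhere = homework_score([108])

# Hypothetical equal split of 24 points each: four perfect sets, one missed.
four_perfect_one_missed = homework_score([24, 24, 24, 24, 0])
```

Under these assumptions the first case comes out to the full 100 and the second to 96, matching the two examples given in lecture.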
697 00:35:31,380 --> 00:35:32,920 Now of course, we don't encourage 698 00:35:32,920 --> 00:35:35,430 you to skip that homework. 699 00:35:35,430 --> 00:35:38,620 We think the homeworks are useful and are a good way 700 00:35:38,620 --> 00:35:42,627 to solidify the information you've gotten from lecture, 701 00:35:42,627 --> 00:35:43,710 and reading, and so forth. 702 00:35:43,710 --> 00:35:45,780 So it's good to do them. 703 00:35:45,780 --> 00:35:48,850 And doing the homeworks will help you and perhaps 704 00:35:48,850 --> 00:35:50,160 prepare you for the exams. 705 00:35:50,160 --> 00:35:53,555 But that's the way we handle the homework policy. 706 00:35:53,555 --> 00:35:55,930 Now I should also mention that not all the homeworks will 707 00:35:55,930 --> 00:35:57,820 be the same number of points. 708 00:35:57,820 --> 00:36:00,770 We'll apportion the points in proportion 709 00:36:00,770 --> 00:36:03,836 to the difficulty and length of the homework assignment. 710 00:36:03,836 --> 00:36:05,710 So for example, the first homework assignment 711 00:36:05,710 --> 00:36:07,790 is a little bit easier than the others. 712 00:36:07,790 --> 00:36:11,140 So it's going to have somewhat fewer points. 713 00:36:11,140 --> 00:36:13,990 So late assignments-- so all the homeworks 714 00:36:13,990 --> 00:36:16,410 are due at noon on the indicated day. 715 00:36:16,410 --> 00:36:19,670 And if it's within 24 hours after that, 716 00:36:19,670 --> 00:36:21,850 you'll be eligible for 50% credit. 717 00:36:21,850 --> 00:36:27,330 And beyond that, you don't get any points, in part 718 00:36:27,330 --> 00:36:30,976 because the TAs will be posting the answers to the homeworks. 719 00:36:30,976 --> 00:36:32,260 OK? 720 00:36:32,260 --> 00:36:36,290 And we want to be able to post them promptly so that you'll 721 00:36:36,290 --> 00:36:38,480 get the answers while those problems are 722 00:36:38,480 --> 00:36:39,667 fresh in your mind.
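The late policy just described amounts to a three-way rule, sketched below. The function name `late_credit_fraction` is made up for illustration, the year in the example due date is an assumption (the lecture only says Thursday, February 20, at noon), and treating a submission exactly 24 hours late as still eligible for half credit is also an assumption.

```python
from datetime import datetime, timedelta


def late_credit_fraction(due, submitted):
    """Return the fraction of credit a submission is eligible for."""
    if submitted <= due:
        return 1.0  # on time: full credit
    if submitted - due <= timedelta(hours=24):
        return 0.5  # within 24 hours after the deadline: 50% credit
    return 0.0      # later than that: no points


# Problem set 1 deadline; the year 2014 is an assumption for this example.
pset1_due = datetime(2014, 2, 20, 12, 0)

on_time = late_credit_fraction(pset1_due, datetime(2014, 2, 20, 11, 59))
a_few_hours_late = late_credit_fraction(pset1_due, datetime(2014, 2, 20, 18, 0))
two_days_late = late_credit_fraction(pset1_due, datetime(2014, 2, 22, 12, 0))
```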
723 00:36:39,667 --> 00:36:40,750 Questions about homeworks? 724 00:36:44,150 --> 00:36:44,940 OK. 725 00:36:44,940 --> 00:36:45,560 Good. 726 00:36:45,560 --> 00:36:46,976 So collaboration on problem sets-- 727 00:36:46,976 --> 00:36:52,101 so we want you to do the problem sets. 728 00:36:52,101 --> 00:36:53,350 You can do them independently. 729 00:36:53,350 --> 00:36:57,490 You can work with a friend on them, or even in a group, 730 00:36:57,490 --> 00:36:58,820 discuss them together. 731 00:36:58,820 --> 00:37:01,065 But write up your solutions independently. 732 00:37:01,065 --> 00:37:03,480 You don't learn anything by copying 733 00:37:03,480 --> 00:37:05,190 someone else's solution. 734 00:37:05,190 --> 00:37:12,890 And if the TAs see duplicate or near identical solutions, 735 00:37:12,890 --> 00:37:15,050 both of those homeworks will get a 0. 736 00:37:15,050 --> 00:37:15,550 OK? 737 00:37:15,550 --> 00:37:16,800 And this occasionally happens. 738 00:37:16,800 --> 00:37:19,910 We don't want this to happen to you. 739 00:37:19,910 --> 00:37:21,790 So just avoid that. 740 00:37:21,790 --> 00:37:26,860 So discuss together, but write up your solutions separately. 741 00:37:26,860 --> 00:37:29,340 Similarly, with programming, if you 742 00:37:29,340 --> 00:37:31,590 have a friend who's a more experienced programmer 743 00:37:31,590 --> 00:37:34,990 than you are, by all means, ask them 744 00:37:34,990 --> 00:37:39,270 for advice, general things, how should I structure my program, 745 00:37:39,270 --> 00:37:43,070 do you know of a function that generates a loop, 746 00:37:43,070 --> 00:37:45,200 or whatever it is that you need. 747 00:37:45,200 --> 00:37:50,350 But don't share code with anyone else. 748 00:37:50,350 --> 00:37:51,434 OK? 749 00:37:51,434 --> 00:37:52,350 That would be a no-no. 750 00:37:52,350 --> 00:37:55,240 So write up your code independently. 751 00:37:55,240 --> 00:38:00,430 And again, the graders will be looking for identical code. 
752 00:38:00,430 --> 00:38:02,280 And that will be thrown out. 753 00:38:02,280 --> 00:38:05,270 And so we don't want to have any misconduct 754 00:38:05,270 --> 00:38:07,630 of that type occurring. 755 00:38:07,630 --> 00:38:08,130 All right. 756 00:38:08,130 --> 00:38:11,030 So recitations-- there are three recitation sessions 757 00:38:11,030 --> 00:38:14,080 offered each week, Wednesday at 4:00 by Peter, 758 00:38:14,080 --> 00:38:17,920 Thursday at 4:00 by Colette, Friday at 4:00 by Tahin. 759 00:38:17,920 --> 00:38:21,990 And that is a special recitation that's 760 00:38:21,990 --> 00:38:25,550 required for the 6.874 students-- David? 761 00:38:25,550 --> 00:38:29,090 Yes-- and has additional AI content. 762 00:38:29,090 --> 00:38:31,320 So anyone is welcome to go to that recitation. 763 00:38:31,320 --> 00:38:36,450 But those who are taking 6.874 must go. 764 00:38:36,450 --> 00:38:39,180 For students registered for the other versions of the course, 765 00:38:39,180 --> 00:38:42,660 going to recitation is optional but strongly recommended. 766 00:38:42,660 --> 00:38:46,780 Because the TAs will go over material from the lectures, 767 00:38:46,780 --> 00:38:49,040 material that's helpful for the homeworks 768 00:38:49,040 --> 00:38:53,090 or for studying for exams-- in the first weeks, Python, 769 00:38:53,090 --> 00:38:55,020 probability as well. 770 00:38:55,020 --> 00:38:59,260 So go to the recitations, particularly 771 00:38:59,260 --> 00:39:01,890 if you're having trouble in the course. 772 00:39:01,890 --> 00:39:04,750 So Tahin's recitation starts this week. 773 00:39:04,750 --> 00:39:08,020 And Peter and Colette's will start next week. 774 00:39:08,020 --> 00:39:08,550 Question. 775 00:39:08,550 --> 00:39:08,880 Yes. 776 00:39:08,880 --> 00:39:11,046 AUDIENCE: Are they covering the same material, Peter 777 00:39:11,046 --> 00:39:12,310 and Colette? 
778 00:39:12,310 --> 00:39:14,840 PROFESSOR: Peter and Colette will cover similar material 779 00:39:14,840 --> 00:39:16,830 and Tahin will cover different material. 780 00:39:24,730 --> 00:39:28,900 Python instruction-- so the first problem set, 781 00:39:28,900 --> 00:39:30,900 which will be posted this evening, 782 00:39:30,900 --> 00:39:32,860 doesn't have any programming on it. 783 00:39:32,860 --> 00:39:35,440 But if you don't have programming, 784 00:39:35,440 --> 00:39:38,010 you need to start learning it very soon. 785 00:39:38,010 --> 00:39:42,600 And so there will be a significant programming problem 786 00:39:42,600 --> 00:39:44,010 on p set two. 787 00:39:44,010 --> 00:39:46,720 And we'll be posting that problem soon-- this week, 788 00:39:46,720 --> 00:39:47,700 some time. 789 00:39:47,700 --> 00:39:50,800 And you'll want to look at that and gauge, 790 00:39:50,800 --> 00:39:53,560 how much Python do I need to learn 791 00:39:53,560 --> 00:39:55,690 to at least do that problem. 792 00:39:55,690 --> 00:39:57,210 So what is this project component 793 00:39:57,210 --> 00:39:59,950 that we've been hearing about? 794 00:39:59,950 --> 00:40:02,580 So here's a more concrete description. 795 00:40:02,580 --> 00:40:04,740 So again, this is only for the graduate versions 796 00:40:04,740 --> 00:40:05,780 of the course. 797 00:40:05,780 --> 00:40:09,840 So students will-- basically, we've 798 00:40:09,840 --> 00:40:12,940 structured it so you work incrementally 799 00:40:12,940 --> 00:40:16,240 toward the final research project 800 00:40:16,240 --> 00:40:19,720 and so that we can offer feedback and help along the way 801 00:40:19,720 --> 00:40:20,700 if needed. 802 00:40:20,700 --> 00:40:24,650 So the first assignment will be next week. 803 00:40:24,650 --> 00:40:27,407 I think it's on Tuesday. 804 00:40:27,407 --> 00:40:29,490 I'll have more instructions on Thursday's lecture. 
805 00:40:29,490 --> 00:40:33,130 But all the students registered for the grad version 806 00:40:33,130 --> 00:40:34,994 will submit their background and interests 807 00:40:34,994 --> 00:40:36,410 for posting on the course website. 808 00:40:36,410 --> 00:40:38,250 And then you can look at those and try 809 00:40:38,250 --> 00:40:42,020 to find other students who have, ideally, similar interests 810 00:40:42,020 --> 00:40:44,620 but perhaps somewhat different backgrounds. 811 00:40:44,620 --> 00:40:47,949 It's particularly helpful to have 812 00:40:47,949 --> 00:40:49,240 a strong biologist on the team. 813 00:40:49,240 --> 00:40:53,260 And perhaps, a strong programmer would help as well. 814 00:40:53,260 --> 00:40:57,710 So then, you will choose your teams 815 00:40:57,710 --> 00:41:02,320 and submit a project title and one-paragraph summary, 816 00:41:02,320 --> 00:41:04,660 the basic idea of your project. 817 00:41:04,660 --> 00:41:08,720 Now we are not providing a menu of research projects. 818 00:41:08,720 --> 00:41:11,690 It's your choice-- whatever you want 819 00:41:11,690 --> 00:41:14,511 to do, as long as it's related to computational and systems 820 00:41:14,511 --> 00:41:15,010 biology. 821 00:41:15,010 --> 00:41:20,060 So it could be analysis of some publicly available data. 822 00:41:20,060 --> 00:41:22,600 Could be analysis of some data that you 823 00:41:22,600 --> 00:41:24,620 got during your rotation. 824 00:41:24,620 --> 00:41:27,560 Or for those of you who are already in labs that you're 825 00:41:27,560 --> 00:41:29,510 actually working on, it's totally 826 00:41:29,510 --> 00:41:32,580 fine and encouraged that the project be something that's 827 00:41:32,580 --> 00:41:38,980 related to your main PhD work if you've started on that. 828 00:41:38,980 --> 00:41:44,280 And it could also be more in the modeling, some modeling 829 00:41:44,280 --> 00:41:47,780 with MATLAB or something, if you're familiar with that. 
830 00:41:47,780 --> 00:41:49,840 But, a variety of possibilities. 831 00:41:49,840 --> 00:41:51,980 We'll have more information on this later. 832 00:41:51,980 --> 00:41:53,900 But we want you to form teams. 833 00:41:53,900 --> 00:41:58,920 The teams can work independently or with up to four friends 834 00:41:58,920 --> 00:42:00,220 in teams of five. 835 00:42:00,220 --> 00:42:02,807 And if people want to have a giant team 836 00:42:02,807 --> 00:42:04,390 to do some really challenging project, 837 00:42:04,390 --> 00:42:07,110 then you can come and discuss with us. 838 00:42:07,110 --> 00:42:11,010 And we'll see if that would work. 839 00:42:11,010 --> 00:42:14,830 So then there's the initial title 840 00:42:14,830 --> 00:42:16,372 and one-paragraph summary. 841 00:42:16,372 --> 00:42:18,080 We'll give you a little feedback on that. 842 00:42:18,080 --> 00:42:21,900 And then you'll submit an actual specific aims document-- 843 00:42:21,900 --> 00:42:25,060 so with actual, NIH-style, specific aims-- 844 00:42:25,060 --> 00:42:29,400 the goal is to understand whether this organism has 845 00:42:29,400 --> 00:42:36,280 operons or not, or some actual scientific question-- and a bit 846 00:42:36,280 --> 00:42:38,180 about how you will undertake that. 847 00:42:38,180 --> 00:42:43,730 And then you'll submit a longer two-page research strategy, 848 00:42:43,730 --> 00:42:45,540 which will include, specifically, 849 00:42:45,540 --> 00:42:47,120 we will use these data. 850 00:42:47,120 --> 00:42:49,720 We will use this software, these statistical 851 00:42:49,720 --> 00:42:52,010 approaches-- that sort of thing. 852 00:42:52,010 --> 00:42:54,960 And then, toward the end of the semester, a final written 853 00:42:54,960 --> 00:42:58,080 report will be due that'll be five pages. 854 00:42:58,080 --> 00:42:59,740 You'll work on it together, but it'll 855 00:42:59,740 --> 00:43:01,750 need to be clear who did what. 
856 00:43:01,750 --> 00:43:04,130 You'll need an author contribution statement. 857 00:43:04,130 --> 00:43:05,480 So and so did this analysis. 858 00:43:05,480 --> 00:43:08,687 So and so wrote this section-- that sort of thing. 859 00:43:08,687 --> 00:43:10,270 And then, as I mentioned before, there 860 00:43:10,270 --> 00:43:12,930 will be oral presentations, by each team, 861 00:43:12,930 --> 00:43:16,335 on the last two course sessions. 862 00:43:16,335 --> 00:43:18,610 OK? 863 00:43:18,610 --> 00:43:20,390 Questions about the projects? 864 00:43:20,390 --> 00:43:23,787 There's more information on the course info document online. 865 00:43:26,821 --> 00:43:27,320 OK? 866 00:43:27,320 --> 00:43:29,850 Good. 867 00:43:29,850 --> 00:43:36,560 Yes, so for those taking 6.874, in addition to the project, 868 00:43:36,560 --> 00:43:39,700 there will also be additional AI problems 869 00:43:39,700 --> 00:43:42,620 on both the p sets-- and the exam? 870 00:43:42,620 --> 00:43:43,120 David? 871 00:43:43,120 --> 00:43:43,350 Yes. 872 00:43:43,350 --> 00:43:43,849 OK. 873 00:43:47,090 --> 00:43:49,964 And they're optional for others. 874 00:43:49,964 --> 00:43:50,840 All right. 875 00:43:50,840 --> 00:43:51,870 Good. 876 00:43:51,870 --> 00:43:52,370 OK. 877 00:43:52,370 --> 00:43:55,470 So how are we going to do the exam? 878 00:43:55,470 --> 00:43:58,282 So as I mentioned, there's two 80-minute exams. 879 00:43:58,282 --> 00:43:59,240 They're non-cumulative. 880 00:43:59,240 --> 00:44:00,490 So the first exam covers, basically, 881 00:44:00,490 --> 00:44:01,448 the first three topics. 882 00:44:01,448 --> 00:44:04,369 The second exam is on the last three topics. 883 00:44:04,369 --> 00:44:05,160 They're 80 minutes. 884 00:44:05,160 --> 00:44:08,880 They're during normal class time. 885 00:44:08,880 --> 00:44:11,960 There's no final exam. 
886 00:44:11,960 --> 00:44:15,450 The grading-- so for those taking the undergrad version, 887 00:44:15,450 --> 00:44:20,390 the homeworks will count 36% out of the maximum 100 points. 888 00:44:20,390 --> 00:44:24,000 And the exams will count 62%. 889 00:44:24,000 --> 00:44:25,880 And then, this peer review, where 890 00:44:25,880 --> 00:44:27,650 there's two days where you go, and you 891 00:44:27,650 --> 00:44:29,340 listen to presentations, and you submit 892 00:44:29,340 --> 00:44:32,300 comments online counts 2%. 893 00:44:32,300 --> 00:44:35,970 For the graduate Bio BE HST versions, 894 00:44:35,970 --> 00:44:43,240 it's 30% homeworks, 48% exams, 20% project, 2% peer review. 895 00:44:43,240 --> 00:44:49,800 For the EECS version, 6.874, 25% homework, 48% exams, 20% 896 00:44:49,800 --> 00:44:52,870 project, and then, 5% for these extra AI related 897 00:44:52,870 --> 00:44:55,830 problems, and 2% peer review. 898 00:44:55,830 --> 00:44:57,440 And those should add up to 100. 899 00:44:57,440 --> 00:45:02,020 And then, in addition, we will reward 1% extra credit 900 00:45:02,020 --> 00:45:05,320 for outstanding class participation-- 901 00:45:05,320 --> 00:45:10,405 so questions, comments during class. 902 00:45:10,405 --> 00:45:12,385 OK. 903 00:45:12,385 --> 00:45:13,870 All right. 904 00:45:13,870 --> 00:45:15,690 So a few announcements about topic one. 905 00:45:15,690 --> 00:45:21,180 And then, each of us will review the topics that are coming up. 906 00:45:21,180 --> 00:45:22,855 So p set one will be posted tonight. 907 00:45:22,855 --> 00:45:25,040 It's due February 20th, at noon. 908 00:45:25,040 --> 00:45:28,112 It involves basic microbiology, probability, and statistics. 909 00:45:28,112 --> 00:45:29,820 It'll give you some experience with BLAST 910 00:45:29,820 --> 00:45:31,760 and some of the statistics associated. 911 00:45:31,760 --> 00:45:35,390 P set two will be posted later this week. 
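The grading breakdown quoted above can be checked with a few lines of arithmetic. This is only a minimal sketch: the percentages are the ones stated in the lecture, while the dictionary keys and function names are made up here for illustration.

```python
# Grading weights in percent, as stated in the lecture, one entry per course version.
# The key names are illustrative labels, not official course terminology.
WEIGHTS = {
    "undergrad": {"homework": 36, "exams": 62, "peer_review": 2},
    "graduate_bio_be_hst": {"homework": 30, "exams": 48, "project": 20, "peer_review": 2},
    "eecs_6_874": {"homework": 25, "exams": 48, "project": 20,
                   "ai_problems": 5, "peer_review": 2},
}


def total_weight(version):
    """Sum of all components for one version, excluding the 1% extra credit."""
    return sum(WEIGHTS[version].values())


def per_exam_weight(version):
    """Weight of a single exam, given two equally weighted exams."""
    return WEIGHTS[version]["exams"] / 2


for name in WEIGHTS:
    print(name, total_weight(name), per_exam_weight(name))
```

Each version sums to 100 before the 1% participation extra credit, which is counted on top.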
912 00:45:35,390 --> 00:45:38,590 So you don't need to, obviously, start on p set two yet. 913 00:45:38,590 --> 00:45:40,210 It's not due for several weeks. 914 00:45:40,210 --> 00:45:42,670 But definitely, look at the programming problem 915 00:45:42,670 --> 00:45:45,020 to give you an idea of what's involved 916 00:45:45,020 --> 00:45:49,750 and what to focus on when you're reviewing your Python. 917 00:45:49,750 --> 00:45:52,110 Mentioned the probability/stats primer. 918 00:45:52,110 --> 00:45:55,110 Sequencing technology-- so for Thursday's lecture, 919 00:45:55,110 --> 00:45:59,260 it will be very helpful if you read the review, by Metzker, 920 00:45:59,260 --> 00:46:01,240 on next gen sequencing technologies. 921 00:46:01,240 --> 00:46:02,770 It's pretty well written. 922 00:46:02,770 --> 00:46:06,190 Covers Illumina, 454, PacBio, and 923 00:46:06,190 --> 00:46:11,000 a few other interesting sequencing technologies. 924 00:46:11,000 --> 00:46:12,680 Other background reading-- so we'll 925 00:46:12,680 --> 00:46:15,200 be talking about local alignment, global alignment, 926 00:46:15,200 --> 00:46:18,430 statistics, and similarity matrices for the next two 927 00:46:18,430 --> 00:46:19,310 lectures. 928 00:46:19,310 --> 00:46:21,340 And chapters four and five of the textbook 929 00:46:21,340 --> 00:46:23,760 provide a pretty good background on these topics. 930 00:46:23,760 --> 00:46:27,332 I encourage you to take a look. 931 00:46:27,332 --> 00:46:28,220 OK. 932 00:46:28,220 --> 00:46:30,390 So I'm just going to briefly review my lectures. 933 00:46:30,390 --> 00:46:33,260 And then we'll have David and Ernest do the same. 934 00:46:33,260 --> 00:46:35,990 So sequencing technologies will be 935 00:46:35,990 --> 00:46:37,300 the beginning of lecture two. 936 00:46:37,300 --> 00:46:43,170 And then we'll talk about local ungapped sequence 937 00:46:43,170 --> 00:46:45,670 alignment-- in particular, BLAST.
938 00:46:45,670 --> 00:46:49,140 So BLAST is something like the Google search 939 00:46:49,140 --> 00:46:51,150 engine of bioinformatics, if you will. 940 00:46:51,150 --> 00:46:53,860 It's one of the most widely used tools. 941 00:46:53,860 --> 00:46:56,940 And it's important to understand something about how it works, 942 00:46:56,940 --> 00:46:58,440 and in particular, how to evaluate 943 00:46:58,440 --> 00:47:01,260 the significance of BLAST hits, which are described 944 00:47:01,260 --> 00:47:07,110 by this extreme value distribution here. 945 00:47:07,110 --> 00:47:11,900 And then, lecture three, we'll talk about global alignment 946 00:47:11,900 --> 00:47:14,750 and introducing gaps into sequence alignments. 947 00:47:14,750 --> 00:47:18,680 We'll talk about some dynamic programming 948 00:47:18,680 --> 00:47:21,530 algorithms-- Needleman-Wunsch, Smith-Waterman. 949 00:47:21,530 --> 00:47:23,930 And in lecture four, we'll talk about 950 00:47:23,930 --> 00:47:27,840 comparative genomic analysis of gene regulation-- 951 00:47:27,840 --> 00:47:33,440 so using sequence similarity across genomes to infer 952 00:47:33,440 --> 00:47:36,060 location of regulatory elements such as microRNA target 953 00:47:36,060 --> 00:47:39,190 sites, other things like that. 954 00:47:39,190 --> 00:47:39,690 All right. 955 00:47:39,690 --> 00:47:44,000 So I think this is now-- oh sorry, a few more lectures. 956 00:47:44,000 --> 00:47:48,570 Then, in the next unit, modeling biological function, 957 00:47:48,570 --> 00:47:51,160 I'll talk about the problem of motif finding-- 958 00:47:51,160 --> 00:47:56,640 so searching a set of sequences for a common subsequence, 959 00:47:56,640 --> 00:48:00,756 or similar subsequences, that possess 960 00:48:00,756 --> 00:48:02,130 a particular biological function, 961 00:48:02,130 --> 00:48:04,124 like binding to a protein. 962 00:48:04,124 --> 00:48:05,540 It's often a complex search space.
963 00:48:05,540 --> 00:48:06,998 We'll talk about the Gibbs sampling 964 00:48:06,998 --> 00:48:08,760 algorithm and some alternatives. 965 00:48:08,760 --> 00:48:11,410 And then, in lecture 10, I'll talk 966 00:48:11,410 --> 00:48:13,960 about Markov and hidden Markov models, which 967 00:48:13,960 --> 00:48:18,720 have been called the Legos of bioinformatics, which 968 00:48:18,720 --> 00:48:23,400 can be used to model a variety of linear sequence 969 00:48:23,400 --> 00:48:25,290 labeling problems. 970 00:48:25,290 --> 00:48:30,200 And then, in the last lecture of that unit, 971 00:48:30,200 --> 00:48:32,600 I'll talk a little bit about protein-- 972 00:48:32,600 --> 00:48:34,610 I'm sorry, about RNA-- secondary structure-- 973 00:48:34,610 --> 00:48:39,970 so the base pairing of RNAs-- predicting it 974 00:48:39,970 --> 00:48:44,170 from thermodynamic tools, as well as 975 00:48:44,170 --> 00:48:46,620 comparative genomic approaches. 976 00:48:46,620 --> 00:48:50,190 And you'll learn about the mfold tool 977 00:48:50,190 --> 00:48:52,390 and how you can use a diagram like that 978 00:48:52,390 --> 00:48:56,350 to infer that this RNA may have different possible structures 979 00:48:56,350 --> 00:48:59,360 that it can fold into, like those shown. 980 00:48:59,360 --> 00:49:00,060 All right. 981 00:49:00,060 --> 00:49:02,820 So I'm going to pass it off to David here. 982 00:49:11,270 --> 00:49:14,492 PROFESSOR: Thanks very much, Chris. 983 00:49:14,492 --> 00:49:15,170 All Right. 984 00:49:15,170 --> 00:49:16,480 So I'm David Gifford. 985 00:49:16,480 --> 00:49:18,220 And I'm delighted to be here. 986 00:49:18,220 --> 00:49:20,560 It's really a wonderfully exciting time 987 00:49:20,560 --> 00:49:22,920 in computational biology. 988 00:49:22,920 --> 00:49:24,620 And one of the reasons it's so exciting 989 00:49:24,620 --> 00:49:29,450 is shown on this slide, which is the production of DNA base 990 00:49:29,450 --> 00:49:31,620 sequence per instrument over time.
991 00:49:31,620 --> 00:49:36,290 And as you can see, it's just amazingly more efficient 992 00:49:36,290 --> 00:49:37,620 as time goes forward. 993 00:49:37,620 --> 00:49:40,570 And if you think about the reciprocal of this curve, 994 00:49:40,570 --> 00:49:44,960 the cost per base is basically becoming extraordinarily low. 995 00:49:44,960 --> 00:49:47,730 And this kind of instrument allows 996 00:49:47,730 --> 00:49:49,910 us to produce hundreds of millions 997 00:49:49,910 --> 00:49:53,751 of sequence reads for a single experiment. 998 00:49:53,751 --> 00:49:56,000 And thus do we not only need new computational methods 999 00:49:56,000 --> 00:49:59,086 to handle this kind of a big data problem, 1000 00:49:59,086 --> 00:50:00,460 but we need computational methods 1001 00:50:00,460 --> 00:50:03,730 to represent the results in computational models. 1002 00:50:03,730 --> 00:50:06,760 And so we have multiple challenges computationally. 1003 00:50:06,760 --> 00:50:09,710 Because modern biology really can't be done 1004 00:50:09,710 --> 00:50:12,990 outside of a computational framework. 1005 00:50:12,990 --> 00:50:16,690 And to summarise the way people have adapted 1006 00:50:16,690 --> 00:50:18,950 these high throughput DNA sequencing instruments, 1007 00:50:18,950 --> 00:50:21,010 I built this small figure for you. 1008 00:50:21,010 --> 00:50:23,325 And you can see that-- Professor Burge will 1009 00:50:23,325 --> 00:50:26,490 be talking about DNA sequencing next time. 1010 00:50:26,490 --> 00:50:28,310 And obviously, you can use DNA sequencing 1011 00:50:28,310 --> 00:50:30,800 to sequence your own genomes, or the genomes 1012 00:50:30,800 --> 00:50:33,850 of your favorite pet, or whatever you like. 1013 00:50:33,850 --> 00:50:36,060 And so we'll talk about, in lecture six, how 1014 00:50:36,060 --> 00:50:38,300 to actually do genome sequencing. 
1015 00:50:38,300 --> 00:50:41,590 And one of the challenges in doing genome sequencing 1016 00:50:41,590 --> 00:50:44,380 is how to actually find what you have sequenced. 1017 00:50:44,380 --> 00:50:48,700 And we'll be talking about how to map sequence reads as well. 1018 00:50:48,700 --> 00:50:50,170 Another way to use DNA sequencing 1019 00:50:50,170 --> 00:50:53,130 is to take the RNA species present in a single cell, 1020 00:50:53,130 --> 00:50:55,590 or in a population of cells, and convert them 1021 00:50:55,590 --> 00:50:58,680 into DNA using reverse transcriptase. 1022 00:50:58,680 --> 00:51:02,140 Then we can sequence the DNA and understand the RNA component 1023 00:51:02,140 --> 00:51:03,810 of the cell, which, of course, either 1024 00:51:03,810 --> 00:51:07,260 can be used as messenger RNA to code for protein, 1025 00:51:07,260 --> 00:51:09,917 or for structural RNAs, or for non-coding RNAs that 1026 00:51:09,917 --> 00:51:12,250 have other kinds of functions associated with chromatin. 1027 00:51:14,780 --> 00:51:16,700 In lecture seven, we'll be talking 1028 00:51:16,700 --> 00:51:19,440 about protein/DNA interactions, which Professor Burge already 1029 00:51:19,440 --> 00:51:21,620 mentioned-- the idea that we can actually 1030 00:51:21,620 --> 00:51:24,010 locate all the regulatory factors associated 1031 00:51:24,010 --> 00:51:27,910 with the genome using a single high throughput experiment. 1032 00:51:27,910 --> 00:51:30,750 We do this by isolating the proteins and their associated 1033 00:51:30,750 --> 00:51:33,400 DNA fragments and sequencing the DNA fragments 1034 00:51:33,400 --> 00:51:37,010 using this DNA sequencing technology. 
1035 00:51:37,010 --> 00:51:40,820 So briefly, the first thing that we'll 1036 00:51:40,820 --> 00:51:44,160 look at in lecture five is, given a reference genome 1037 00:51:44,160 --> 00:51:47,250 sequence and a basket of DNA sequence reads, 1038 00:51:47,250 --> 00:51:49,710 how do we build an efficient index 1039 00:51:49,710 --> 00:51:51,970 so that we can either map or align those 1040 00:51:51,970 --> 00:51:54,420 reads back to the reference genome. 1041 00:51:54,420 --> 00:51:56,840 That's a very important and fundamental problem. 1042 00:51:56,840 --> 00:51:59,510 Because if I give you a basket of 200 million reads, 1043 00:51:59,510 --> 00:52:03,280 we need to build its alignment very, very rapidly, 1044 00:52:03,280 --> 00:52:05,510 and quickly, and accurately, especially 1045 00:52:05,510 --> 00:52:08,380 in the context of repetitive elements. 1046 00:52:08,380 --> 00:52:11,300 Because a genome obviously has many repeats in it. 1047 00:52:11,300 --> 00:52:13,520 We need to consider how our indexing and searching 1048 00:52:13,520 --> 00:52:16,550 algorithms are going to handle those sorts of elements. 1049 00:52:16,550 --> 00:52:18,910 In the next lecture, we'll talk about how 1050 00:52:18,910 --> 00:52:22,075 to actually sequence a genome and assemble it. 1051 00:52:22,075 --> 00:52:24,450 So the fundamental way that we approach genome sequencing 1052 00:52:24,450 --> 00:52:26,370 given today's sequencing instruments 1053 00:52:26,370 --> 00:52:29,450 is that we take intact chromosomes at the top, which, 1054 00:52:29,450 --> 00:52:31,770 of course, are hundreds of millions of bases long, 1055 00:52:31,770 --> 00:52:33,980 and we shatter them into pieces. 1056 00:52:33,980 --> 00:52:37,160 And then we sequence size selected pieces 1057 00:52:37,160 --> 00:52:39,340 in a sequencing instrument. 1058 00:52:39,340 --> 00:52:41,690 And then we need to put the jigsaw puzzle back together 1059 00:52:41,690 --> 00:52:43,870 with a computational assembler. 
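The read-mapping problem described above for lecture five (index a reference genome, then look reads up in it) can be illustrated with a toy example. This is a sketch under stated assumptions, not the method the lecture covers: it uses a plain hash-based k-mer index with a seed-and-verify step, and the function names, the seed length `k`, and the `max_mismatches` parameter are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Seed with the read's first k bases, then verify each candidate position
    by counting mismatches over the full read length."""
    positions = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            positions.append(start)
    return positions


reference = "ACGTTTACGTGG"  # toy reference containing the repeat "ACGT"
index = build_kmer_index(reference, k=4)
print(map_read("ACGT", reference, index, k=4))    # short read inside the repeat
print(map_read("ACGTGA", reference, index, k=4))  # longer read with one mismatch
```

The short read falls entirely inside the repeated "ACGT" and so maps to two positions, which is the ambiguity with repetitive elements that the lecture raises. The longer read extends past the repeat and maps to a single position.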
1060 00:52:43,870 --> 00:52:46,740 So we'll be talking about assembly algorithms, how 1061 00:52:46,740 --> 00:52:50,390 they work, and furthermore, how to resolve ambiguities 1062 00:52:50,390 --> 00:52:53,310 as we put that puzzle together, which often arise 1063 00:52:53,310 --> 00:52:56,280 in the context of repetitive sequence. 1064 00:52:56,280 --> 00:52:58,480 And in the next lecture, in lecture seven, 1065 00:52:58,480 --> 00:53:02,140 we'll be looking at how to actually take those little DNA 1066 00:53:02,140 --> 00:53:04,460 molecules that are associated with proteins 1067 00:53:04,460 --> 00:53:08,810 and analyze them to figure out where particular proteins are 1068 00:53:08,810 --> 00:53:11,610 bound to the genome and how they might regulate target genes. 1069 00:53:11,610 --> 00:53:13,970 Here we see two different occurrences 1070 00:53:13,970 --> 00:53:16,762 of OCT4 binding events binding proximally 1071 00:53:16,762 --> 00:53:18,595 to the SOX2 gene, which they are regulating. 1072 00:53:21,250 --> 00:53:29,730 So that's the beginning of our analysis of DNA sequencing. 1073 00:53:29,730 --> 00:53:32,940 And then we'll look at RNA sequencing-- once again, 1074 00:53:32,940 --> 00:53:34,610 with going through a DNA intermediate-- 1075 00:53:34,610 --> 00:53:36,770 and ask the question, how can we look 1076 00:53:36,770 --> 00:53:39,830 at the expression of particular genes 1077 00:53:39,830 --> 00:53:43,620 by mapping RNA sequence reads back onto the genome. 1078 00:53:43,620 --> 00:53:46,910 And there are two fundamental questions we can address here, 1079 00:53:46,910 --> 00:53:51,860 which is, what is the level of expression of a given gene, 1080 00:53:51,860 --> 00:53:55,860 and secondarily, what isoforms are being expressed. 1081 00:53:55,860 --> 00:53:58,210 The second sets of reads, you see up on the screen, 1082 00:53:58,210 --> 00:54:01,930 are split reads that cross splice junctions.
1083 00:54:01,930 --> 00:54:05,760 And so by looking at how reads align to the genome, 1084 00:54:05,760 --> 00:54:09,170 we can figure out which particular exons are included 1085 00:54:09,170 --> 00:54:13,750 or excluded from a particular transcript. 1086 00:54:13,750 --> 00:54:17,400 So that's the beginning of the high throughput biology 1087 00:54:17,400 --> 00:54:19,970 genomic analysis module. 1088 00:54:19,970 --> 00:54:22,870 And I'll be returning to talk, later in the term, 1089 00:54:22,870 --> 00:54:26,130 about computational genetics, which, really, is a way 1090 00:54:26,130 --> 00:54:27,790 to summarize everything we're learning 1091 00:54:27,790 --> 00:54:30,980 in the course into an applicable way 1092 00:54:30,980 --> 00:54:34,250 to ask fundamental questions about genome function, which 1093 00:54:34,250 --> 00:54:38,000 Professor Burge talked about earlier. 1094 00:54:38,000 --> 00:54:43,900 Now we all have about 3 billion bases in our genomes. 1095 00:54:43,900 --> 00:54:46,350 And as you know, you differ from the individual 1096 00:54:46,350 --> 00:54:49,950 sitting next to you in about one in every 1,000 base pairs, 1097 00:54:49,950 --> 00:54:51,770 on average. 1098 00:54:51,770 --> 00:54:53,670 And so one question is, how do we actually 1099 00:54:53,670 --> 00:54:57,880 interpret these differences between genomes. 1100 00:54:57,880 --> 00:55:00,900 And how can we build accurate computational models 1101 00:55:00,900 --> 00:55:05,410 that allow us to infer function from genome sequence? 1102 00:55:05,410 --> 00:55:08,170 And that's a pretty big challenge. 1103 00:55:08,170 --> 00:55:13,100 So we'll start by asking questions about, 1104 00:55:13,100 --> 00:55:14,740 what parts of the genome are active 1105 00:55:14,740 --> 00:55:17,440 and how could we annotate them.
1106 00:55:17,440 --> 00:55:21,030 So we can use, once again, different kinds of sequencing 1107 00:55:21,030 --> 00:55:24,325 based assays to identify the regions of the genome that 1108 00:55:24,325 --> 00:55:27,260 are active in any given cellular state. 1109 00:55:27,260 --> 00:55:29,680 And furthermore, if we look at different cells, 1110 00:55:29,680 --> 00:55:31,420 we can tell which parts of the genome 1111 00:55:31,420 --> 00:55:33,250 are differentially active. 1112 00:55:33,250 --> 00:55:35,670 Here you can see the active chromatin 1113 00:55:35,670 --> 00:55:38,240 during the differentiation of an ES cell 1114 00:55:38,240 --> 00:55:41,500 into a terminal type over a 50 kilobase window. 1115 00:55:41,500 --> 00:55:44,090 And the regions of the genome that are shaded in yellow 1116 00:55:44,090 --> 00:55:48,000 represent regions that are differentially active. 1117 00:55:48,000 --> 00:55:52,090 And so, using this kind of DNase-seq data and other data, 1118 00:55:52,090 --> 00:55:54,860 we can automatically annotate the genome 1119 00:55:54,860 --> 00:55:56,840 with where the regulatory elements are 1120 00:55:56,840 --> 00:56:00,430 and begin to understand what the regulatory code of the genome 1121 00:56:00,430 --> 00:56:02,560 is. 1122 00:56:02,560 --> 00:56:05,600 So once we understand what parts of the genome are active, 1123 00:56:05,600 --> 00:56:07,350 we can ask questions about, how do they 1124 00:56:07,350 --> 00:56:10,720 contribute to some overall phenotype. 1125 00:56:10,720 --> 00:56:14,240 And our next lecture, lecture 19, 1126 00:56:14,240 --> 00:56:16,410 we'll be looking at how we can build 1127 00:56:16,410 --> 00:56:19,530 a model of a quantitative trait based 1128 00:56:19,530 --> 00:56:22,145 upon multiple loci and the particular alleles that 1129 00:56:22,145 --> 00:56:23,870 are present at those loci.
1130 00:56:23,870 --> 00:56:27,040 So here you see an example of a bunch 1131 00:56:27,040 --> 00:56:29,090 of different quantitative trait loci 1132 00:56:29,090 --> 00:56:31,240 that are contributing to the growth rate of yeast 1133 00:56:31,240 --> 00:56:33,410 in a given condition. 1134 00:56:33,410 --> 00:56:37,750 So part of our exploration during this term 1135 00:56:37,750 --> 00:56:40,070 will be to develop computational methods 1136 00:56:40,070 --> 00:56:43,130 to automatically identify regions of the genome that 1137 00:56:43,130 --> 00:56:47,760 control such traits and to assign them significance. 1138 00:56:47,760 --> 00:56:51,010 And finally, we'd like to put all this together and ask 1139 00:56:51,010 --> 00:56:52,970 a very fundamental question, which 1140 00:56:52,970 --> 00:56:57,290 is, how do we assign variations in the human genome 1141 00:56:57,290 --> 00:57:00,500 to differential risk for human disease. 1142 00:57:00,500 --> 00:57:02,880 And associated questions are, how could we 1143 00:57:02,880 --> 00:57:06,230 assign those variants to what best therapy would be applied 1144 00:57:06,230 --> 00:57:08,120 to the disease-- what therapeutics 1145 00:57:08,120 --> 00:57:10,370 might be used, for example. 1146 00:57:10,370 --> 00:57:13,890 And here we have a bunch of results 1147 00:57:13,890 --> 00:57:15,976 from genome wide association studies, 1148 00:57:15,976 --> 00:57:17,725 starting at the top with bipolar disorder. 1149 00:57:17,725 --> 00:57:21,410 And at the bottom is type 2 diabetes. 1150 00:57:21,410 --> 00:57:23,450 And looking along the chromosomes, 1151 00:57:23,450 --> 00:57:27,100 we're asking which locations along the genome 1152 00:57:27,100 --> 00:57:29,300 have variants that are highly associated 1153 00:57:29,300 --> 00:57:32,640 with these particular diseases in these so-called Manhattan 1154 00:57:32,640 --> 00:57:33,140 plots. 1155 00:57:33,140 --> 00:57:36,030 Because the things that stick up look like buildings. 
1156 00:57:36,030 --> 00:57:39,920 And so these sorts of studies are yielding very interesting 1157 00:57:39,920 --> 00:57:45,444 insights into variants that are associated with human disease. 1158 00:57:45,444 --> 00:57:47,360 And the next step, of course, is to figure out 1159 00:57:47,360 --> 00:57:51,720 how to actually prove that these variants are causal, 1160 00:57:51,720 --> 00:57:54,020 and also, to look at mechanisms where 1161 00:57:54,020 --> 00:57:56,550 we might be able to address what kinds of therapeutics 1162 00:57:56,550 --> 00:58:00,370 might be applied to deal with these diseases. 1163 00:58:00,370 --> 00:58:01,930 So those are my two units. 1164 00:58:01,930 --> 00:58:05,090 Once again, high throughput genomic analysis, 1165 00:58:05,090 --> 00:58:07,870 and secondarily, computational genetics. 1166 00:58:07,870 --> 00:58:11,026 And finally, if you have any questions about 6.874, 1167 00:58:11,026 --> 00:58:12,150 I'll be here after lecture. 1168 00:58:12,150 --> 00:58:13,850 Feel free to ask me. 1169 00:58:13,850 --> 00:58:14,350 Ernest. 1170 00:58:24,294 --> 00:58:25,970 PROFESSOR: Thank you very much, Dave. 1171 00:58:25,970 --> 00:58:26,800 All right. 1172 00:58:26,800 --> 00:58:28,500 So in the preceding lectures, you'll 1173 00:58:28,500 --> 00:58:30,550 have heard a lot about the amazing things 1174 00:58:30,550 --> 00:58:32,842 we can learn from nucleic acid sequencing. 1175 00:58:32,842 --> 00:58:35,300 And what we're going to do in the latter part of the course 1176 00:58:35,300 --> 00:58:37,034 is look at other modalities in the cell. 1177 00:58:37,034 --> 00:58:38,450 Obviously, there's a lot that goes 1178 00:58:38,450 --> 00:58:41,100 on inside of cells that's not taking place 1179 00:58:41,100 --> 00:58:42,830 at the level of DNA/RNA. 1180 00:58:42,830 --> 00:58:45,520 And so we're going to start to look at proteins, protein 1181 00:58:45,520 --> 00:58:49,680 interactions, and ultimately, protein interaction networks. 
1182 00:58:49,680 --> 00:58:51,890 So we'll start with the small scale, 1183 00:58:51,890 --> 00:58:54,180 looking at intermolecular interactions 1184 00:58:54,180 --> 00:58:56,530 of the biophysics-- the fundamental biophysics 1185 00:58:56,530 --> 00:58:57,866 of a protein structure. 1186 00:58:57,866 --> 00:59:00,240 Then we'll start to look at protein-protein interactions. 1187 00:59:00,240 --> 00:59:03,560 And as I said, the final level will be networks. 1188 00:59:03,560 --> 00:59:05,850 There's been an amazing advance in our ability 1189 00:59:05,850 --> 00:59:07,830 to predict protein structure. 1190 00:59:07,830 --> 00:59:10,260 So it's always been the dream of computational biology 1191 00:59:10,260 --> 00:59:12,750 to be able to go from the sequence of a gene 1192 00:59:12,750 --> 00:59:15,400 to the structure of the corresponding protein. 1193 00:59:15,400 --> 00:59:17,400 And ever since Anfinsen, we know that, at least, 1194 00:59:17,400 --> 00:59:20,035 that should be theoretically possible for a lot of proteins. 1195 00:59:20,035 --> 00:59:22,160 But it's been computationally virtually intractable 1196 00:59:22,160 --> 00:59:23,351 until quite recently. 1197 00:59:23,351 --> 00:59:24,850 And a number of different approaches 1198 00:59:24,850 --> 00:59:27,150 have allowed us to predict protein structure. 1199 00:59:27,150 --> 00:59:28,680 So this slide shows, for example-- 1200 00:59:28,680 --> 00:59:31,120 I don't know which one's which-- but in blue, perhaps, 1201 00:59:31,120 --> 00:59:33,810 prediction, and in red, the true structure, or the other way 1202 00:59:33,810 --> 00:59:34,120 around. 1203 00:59:34,120 --> 00:59:35,995 But you can see, it doesn't matter very much. 1204 00:59:35,995 --> 00:59:39,320 We're getting extremely accurate predictions of small protein 1205 00:59:39,320 --> 00:59:40,637 structure. 1206 00:59:40,637 --> 00:59:42,220 There are a variety of approaches here 1207 00:59:42,220 --> 00:59:43,800 that live on a spectrum. 
1208 00:59:43,800 --> 00:59:47,270 On the one hand, there's the computational approach 1209 00:59:47,270 --> 00:59:50,870 that tries to make special purpose hardware to carry out 1210 00:59:50,870 --> 00:59:52,890 the calculations for protein structure. 1211 00:59:52,890 --> 00:59:54,380 And that's been wildly successful. 1212 00:59:54,380 --> 00:59:56,390 On the other end of the extreme, there's 1213 00:59:56,390 --> 00:59:58,450 been the crowd sourcing approach to have gamers 1214 00:59:58,450 --> 00:59:59,926 try to predict protein structure. 1215 00:59:59,926 --> 01:00:01,425 And that turns out to be successful. 1216 01:00:01,425 --> 01:00:04,244 And there's a lot of interesting computational approaches 1217 01:00:04,244 --> 01:00:05,160 in the middle as well. 1218 01:00:05,160 --> 01:00:09,080 So we'll explore some of these different strategies. 1219 01:00:09,080 --> 01:00:11,790 Then, the ability to go from just seeing a single protein 1220 01:00:11,790 --> 01:00:13,970 structure, how these proteins interact. 1221 01:00:13,970 --> 01:00:16,150 So now there are pretty good algorithms 1222 01:00:16,150 --> 01:00:18,220 predicting protein-protein interactions as well. 1223 01:00:18,220 --> 01:00:20,220 And that allows us, then, to figure out not only 1224 01:00:20,220 --> 01:00:22,170 how these proteins function individually, 1225 01:00:22,170 --> 01:00:25,460 but how they begin to function as a network. 1226 01:00:25,460 --> 01:00:27,250 Now one of the things that we've already-- 1227 01:00:27,250 --> 01:00:29,541 going to be touched on in the early parts of the course 1228 01:00:29,541 --> 01:00:32,767 are protein-DNA interactions through sequencing approaches. 1229 01:00:32,767 --> 01:00:34,850 We want to now look at them at a regulatory level. 1230 01:00:34,850 --> 01:00:38,910 And could you reconstruct the regulation of a genome 1231 01:00:38,910 --> 01:00:41,010 by predicting the protein-DNA? 
1232 01:00:41,010 --> 01:00:42,790 And there's been a lot of work here. 1233 01:00:42,790 --> 01:00:45,410 We talked earlier about Eric Davidson's 1234 01:00:45,410 --> 01:00:48,230 pioneering approaches, a lot of interesting computational 1235 01:00:48,230 --> 01:00:51,790 approaches as well, that go from the relatively simple models we 1236 01:00:51,790 --> 01:00:54,710 saw on those earlier slides to these very complicated networks 1237 01:00:54,710 --> 01:00:55,750 that you see here. 1238 01:00:55,750 --> 01:00:57,860 We'll look at what these networks actually 1239 01:00:57,860 --> 01:01:01,670 tell us, how much information is really encoded in them. 1240 01:01:01,670 --> 01:01:04,280 They are certainly pretty, whatever they are. 1241 01:01:04,280 --> 01:01:06,405 We'll look at other kinds of interaction networks. 1242 01:01:06,405 --> 01:01:09,830 We'll look at genetic interaction networks as well, 1243 01:01:09,830 --> 01:01:11,350 and perhaps, some other kinds. 1244 01:01:11,350 --> 01:01:13,244 And finally, we'll look at computable models. 1245 01:01:13,244 --> 01:01:14,410 And what do we mean by that? 1246 01:01:14,410 --> 01:01:16,590 We mean models that make some kind of very 1247 01:01:16,590 --> 01:01:18,770 specific prediction, whether it's 1248 01:01:18,770 --> 01:01:21,470 a Boolean prediction-- this gene will be on or off-- or perhaps, 1249 01:01:21,470 --> 01:01:23,340 an even more quantitative prediction. 1250 01:01:23,340 --> 01:01:25,140 So we'll look at logic based modeling, 1251 01:01:25,140 --> 01:01:28,460 and probably, Bayesian networks as well. 1252 01:01:28,460 --> 01:01:32,190 So that's been a very whirlwind tour of the course. 1253 01:01:32,190 --> 01:01:33,580 Just to go back to the mechanics, 1254 01:01:33,580 --> 01:01:35,990 make sure you're signed up for the right course. 1255 01:01:35,990 --> 01:01:37,970 The undergraduate versions of the course 1256 01:01:37,970 --> 01:01:40,080 do not have a project.
1257 01:01:40,080 --> 01:01:42,810 And they certainly do not have the artificial intelligence 1258 01:01:42,810 --> 01:01:43,601 problems. 1259 01:01:43,601 --> 01:01:46,100 Then, there are the graduate versions that have the project, 1260 01:01:46,100 --> 01:01:48,580 but do not have the AI, and finally, the 6.874, 1261 01:01:48,580 --> 01:01:49,480 which has both. 1262 01:01:49,480 --> 01:01:52,740 So please be sure you're in the right class 1263 01:01:52,740 --> 01:01:56,440 so you get credit for the right assignments. 1264 01:01:56,440 --> 01:01:59,105 And finally, if you've got any questions about the course 1265 01:01:59,105 --> 01:02:00,480 mechanics, we have a few minutes. 1266 01:02:00,480 --> 01:02:02,690 We can talk about it here, in front of everybody. 1267 01:02:02,690 --> 01:02:05,340 And then, we'll all be available after class, a little bit, 1268 01:02:05,340 --> 01:02:06,780 to answer questions. 1269 01:02:06,780 --> 01:02:09,110 So please, any questions? 1270 01:02:09,110 --> 01:02:09,610 Yes. 1271 01:02:09,610 --> 01:02:11,290 AUDIENCE: Can undergrads do a project? 1272 01:02:11,290 --> 01:02:12,915 PROFESSOR: Can undergrads do a project? 1273 01:02:12,915 --> 01:02:15,790 Well, they can sign up for the graduate version, absolutely. 1274 01:02:15,790 --> 01:02:16,931 Other questions? 1275 01:02:16,931 --> 01:02:17,430 Yes. 1276 01:02:17,430 --> 01:02:22,100 AUDIENCE: Yeah, I actually can't make either of those sessions. 1277 01:02:22,100 --> 01:02:24,100 PROFESSOR: Well, send an email to the staff list 1278 01:02:24,100 --> 01:02:26,088 and we'll see what we can do. 1279 01:02:28,932 --> 01:02:29,880 Other questions? 1280 01:02:33,210 --> 01:02:34,133 Yes. 1281 01:02:34,133 --> 01:02:36,971 AUDIENCE: 6.877 doesn't exist anymore-- 1282 01:02:36,971 --> 01:02:39,514 Computational Evolutionary Biology, the class. 1283 01:02:39,514 --> 01:02:41,930 PROFESSOR: Oh, the thing we listed as alternative classes? 1284 01:02:41,930 --> 01:02:42,430 Sorry. 
1285 01:02:42,430 --> 01:02:43,570 PROFESSOR: We'll fix that. 1286 01:02:43,570 --> 01:02:44,278 AUDIENCE: Really? 1287 01:02:44,278 --> 01:02:46,206 I would be so happy. 1288 01:02:46,206 --> 01:02:48,140 [LAUGHTER] 1289 01:02:48,140 --> 01:02:50,260 PROFESSOR: We'll delete it from our slides, yes. 1290 01:02:54,890 --> 01:02:58,450 We are powerful, but not that powerful. 1291 01:02:58,450 --> 01:03:01,810 Other questions, comments, critiques? 1292 01:03:01,810 --> 01:03:02,740 Yes, in the back. 1293 01:03:02,740 --> 01:03:05,210 AUDIENCE: Are each of the exams equally weighted? 1294 01:03:05,210 --> 01:03:07,293 PROFESSOR: Are each of the exams equally weighted? 1295 01:03:07,293 --> 01:03:09,745 Yes, they are. 1296 01:03:09,745 --> 01:03:10,245 Yes. 1297 01:03:10,245 --> 01:03:13,944 AUDIENCE: If we're in 6.874, we have the additional AI 1298 01:03:13,944 --> 01:03:14,908 problems. 1299 01:03:14,908 --> 01:03:17,318 Does that mean that we have more questions on the exam, 1300 01:03:17,318 --> 01:03:20,922 but just as much time to do them? 1301 01:03:20,922 --> 01:03:22,880 PROFESSOR: The question was, if you're in 6.874 1302 01:03:22,880 --> 01:03:25,250 and you have the additional AI questions on the exam, 1303 01:03:25,250 --> 01:03:27,190 does that mean you have more questions, 1304 01:03:27,190 --> 01:03:28,420 but the same amount of time. 1305 01:03:28,420 --> 01:03:32,900 And I believe the answer to that is yes. 1306 01:03:32,900 --> 01:03:34,600 PROFESSOR: We may revisit that question. 1307 01:03:34,600 --> 01:03:36,975 PROFESSOR: We will revisit that question at a later date. 1308 01:03:40,530 --> 01:03:43,180 But course six students are just so much smarter 1309 01:03:43,180 --> 01:03:45,576 than everyone else, right? 1310 01:03:45,576 --> 01:03:46,076 Yes. 1311 01:03:46,076 --> 01:03:47,814 AUDIENCE: Can we switch between different versions of the class 1312 01:03:47,814 --> 01:03:48,446 by the add deadline? 
1313 01:03:48,446 --> 01:03:49,720 PROFESSOR: I'm sorry, could you say that again? 1314 01:03:49,720 --> 01:03:51,928 AUDIENCE: Can we switch between versions of the class 1315 01:03:51,928 --> 01:03:52,920 by the add deadline? 1316 01:03:52,920 --> 01:03:55,086 PROFESSOR: Can you switch between different versions 1317 01:03:55,086 --> 01:03:56,780 of class by the add/drop deadline? 1318 01:03:56,780 --> 01:03:58,400 In principle, yes. 1319 01:03:58,400 --> 01:04:00,810 But if you haven't done the work for that, 1320 01:04:00,810 --> 01:04:03,110 then you will be in trouble. 1321 01:04:03,110 --> 01:04:04,910 And there may not be a smooth mechanism 1322 01:04:04,910 --> 01:04:06,690 for making up missed work that late. 1323 01:04:06,690 --> 01:04:08,780 Because the add/drop deadline is rather late. 1324 01:04:08,780 --> 01:04:11,200 So I'd encourage you, if you're considering 1325 01:04:11,200 --> 01:04:13,300 doing the more intensive version or not, 1326 01:04:13,300 --> 01:04:15,290 sign up for the more intensive version. 1327 01:04:15,290 --> 01:04:17,380 You can always drop back. 1328 01:04:17,380 --> 01:04:20,080 Once we form groups though, for the projects, 1329 01:04:20,080 --> 01:04:21,869 then, obviously, there's an aspect 1330 01:04:21,869 --> 01:04:23,160 of letting down your teammates. 1331 01:04:23,160 --> 01:04:26,210 So think carefully, now, about which one you want to join. 1332 01:04:26,210 --> 01:04:28,457 It'll be hard to switch between them. 1333 01:04:28,457 --> 01:04:30,290 But we'll do whatever the registrar requires 1334 01:04:30,290 --> 01:04:33,010 us to do in terms of allowing it. 1335 01:04:33,010 --> 01:04:35,776 Other questions? 1336 01:04:35,776 --> 01:04:36,275 Yes. 1337 01:04:36,275 --> 01:04:40,443 AUDIENCE: If the course sites remain separate, can we undergrads access the AI 1338 01:04:40,443 --> 01:04:42,910 problems just for fun, to look at them? 1339 01:04:42,910 --> 01:04:43,960 PROFESSOR: Yes.
1340 01:04:43,960 --> 01:04:45,952 If the course sites remain separate, 1341 01:04:45,952 --> 01:04:48,035 will undergrads be able to access the AI problems? 1342 01:04:48,035 --> 01:04:49,150 The answer is yes. 1343 01:04:51,781 --> 01:04:52,280 Yes. 1344 01:04:52,280 --> 01:04:54,272 AUDIENCE: Should students in 6.874 1345 01:04:54,272 --> 01:04:56,270 attend both presentations? 1346 01:04:56,270 --> 01:04:57,755 PROFESSOR: Should students in 6.874 1347 01:04:57,755 --> 01:04:58,850 attend both presentations? 1348 01:04:58,850 --> 01:04:59,750 Not necessary. 1349 01:04:59,750 --> 01:05:01,270 You're welcome to both if you want. 1350 01:05:01,270 --> 01:05:04,410 But the course six recitation should be sufficient. 1351 01:05:07,049 --> 01:05:07,715 Other questions? 1352 01:05:10,600 --> 01:05:11,100 Yes. 1353 01:05:11,100 --> 01:05:14,214 AUDIENCE: So what's covered in the normal recitations? 1354 01:05:14,214 --> 01:05:16,380 PROFESSOR: What's covered in the normal recitations? 1355 01:05:16,380 --> 01:05:18,710 It'll be reinforcing material that's in the lectures. 1356 01:05:18,710 --> 01:05:23,750 So the goal is not to introduce any new material in the course 20, 1357 01:05:23,750 --> 01:05:27,120 course seven recitations. 1358 01:05:27,120 --> 01:05:30,439 Just to clarify, not to introduce any new material. 1359 01:05:30,439 --> 01:05:31,105 Other questions? 1360 01:05:34,501 --> 01:05:35,000 OK. 1361 01:05:35,000 --> 01:05:35,350 Great. 1362 01:05:35,350 --> 01:05:36,766 And we'll hang around a little bit 1363 01:05:36,766 --> 01:05:40,550 to answer any remaining questions when they come up.