Lecture 1: Introduction to Computational and Systems Biology

Description: In this lecture, Professors Burge, Gifford, and Fraenkel give an historical overview of the field of computational and systems biology, as well as outline the material they plan to cover throughout the semester.

Instructor: Christopher Burge, David Gifford and Ernest Fraenkel

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Welcome to Foundations of Computational and Systems Biology. This course has many numbers. We'll explain all the differences and similarities. But briefly, there are three undergrad course numbers, which are all similar in content, 7.36, 20.390, 6.802. And then, there are four graduate versions.

The 7.91, 20.490, and HST versions are all very similar, basically identical. But the 6.874 has some additional AI content that we'll discuss in a moment. So make sure that you are registered for the appropriate version of this course.

And please interrupt with questions at any time. The main goal today is to give an overview of the course, both the content as well as the mechanics of how the course will be taught. And we want to make sure that everything is clear.

So this course is taught by myself, Chris Burge from Biology, Professor Fraenkel from BE, and Professor Gifford from EECS. We have three TAs, Peter Freese and Colette Picard, from Computational and Systems Biology, and Tahin, from EECS. All the TAs have expertise in computational biology as well as other quantitative areas like math, statistics, computer science.

So in addition to the lectures by the regular instructors, we will also have guest lectures by George Church, from Harvard, toward the end of the semester. Doug Lauffenburger will give a lecture in the regulatory network section of the course. And we'll have a guest lecture from Ron Weiss on synthetic biology. And just a note that today's lecture and all the lectures this semester are being recorded by AMPS, by MIT's OpenCourseWare. So the videos, after a little bit of editing, will eventually end up on OpenCourseWare.

What are these courses? So these course numbers are the graduate level versions, which are survey courses in computational biology. Our target audience is graduate students who have a solid background and comfort-- or, a solid background in biology, and also, a comfort with quantitative approaches.

We don't assume that you've programmed before, but there will be some programming content on the homeworks. And you will therefore need to learn some Python programming. And the TAs will help with that component of the course. We also have some online tutorials on Python programming that are available.

The undergrad course numbers-- this is an upper level undergraduate survey course in computational biology. And our target audience is upper level undergraduates with a solid biology background and comfort with quantitative approaches. So there's one key difference between the graduate and undergraduate versions, which I'll come to in a moment.

So the goal of this course is to develop understanding of foundational methods in computational biology that will enable you to contextualize and understand a good portion of research literature in a growing field. So if you pick up Science, or Nature, or PLOS Computational Biology and you want to read those papers and understand them, after this course, you will have a better chance. We're not guaranteeing you'll be able to understand all of them. But you'll be able to recognize, perhaps, the category of paper, the class, and perhaps, some of the algorithms that are involved.

And for the graduate version, another goal is to help you gain exposure to research in this field. So it's actually possible to do a smaller scale computational biology research project on your own, on your laptop-- perhaps, on ATHENA-- with relatively limited computational resources and potentially even discover something new. And so we want to give you that experience. And that's through the project component that we'll say more about in a moment.

So just to make sure that everyone's in the right class-- this is not a systems biology class. There are some more focused systems biology classes offered on campus. But we will cover some topics that are important for analyzing complex systems. This is also certainly not a synthetic biology course. Some of the systems methods are also used in synthetic biology. And there will be this guest lecture I mentioned, by Ron Weiss, which will cover synthetic biology.

It's also not an algorithms class. We don't assume that you have experience in designing or analyzing algorithms. We'll discuss various bioinformatics algorithms. And you'll have the opportunity to implement at least one bioinformatics algorithm on your homework. But algorithms are not really the center of the course.

And there's one exception to this, which is, those of you who are taking 6.874 will go to a special recitation that will cover more advanced algorithm content. And there will be special homework problems for you as well. So that course really does have more algorithm content.

So the plan for today is that I will just do a brief, anecdotal history of computational and systems biology. This is to set the stage and the context for the class. And then we'll spend a significant amount of time reviewing the course mechanics, organization, and content. Because as you'll see, it's a little bit complicated. But it'll make sense once we go over it, hopefully.

So this is my take on computational and systems biology. Again, it's not a scholarly overview. It doesn't hit everything important that happened. It just gives you a flavor of what was happening in computational biology decade by decade.

So first of all, where does this field fall in the academic scheme of things? So I consider computational biology to be actually part of biology. So in the way that genetics or biochemistry are disciplines that have strategies for understanding biological questions, so does computational biology. You can use it to understand a variety of computational questions in gene regulation, many other areas.

There is also, some people make a distinction that bioinformatics is more about building tools whereas computational biology is more about using tools, for example. Although many people don't-- it's very blurry-- and don't try to-- people use them in various ways. And then, you could think of bioinformatics as being embedded in the larger field of informatics, where you include tools for management and analysis of data in general.

And it's certainly true that many of the core concepts and algorithms in bioinformatics come from the field, come from computer science, come from other branches of engineering, from statistics, mathematics, and so forth. So it's really a cross disciplinary field. And then, synthetic biology cuts across in the sense that it's really an engineering discipline. Because you're designing and engineering synthetic molecular cellular systems. But you can also use synthetic biology to help understand natural biological systems, of course.

All right, so what was happening, decade by decade? So in the '70s, there were not genome sequences available or large sequence databases of any sort. Except there were starting to be some protein sequences. And early computational biologists focused on comparing proteins, understanding their function, structure, and evolution. And so in order to compare proteins, you need a protein-- an amino acid substitution matrix-- a matrix that describes how often one amino acid is substituted for another.

And Margaret Dayhoff was a pioneer in developing these sorts of matrices. And some of the matrices she developed, the PAM series, are still used today. And we'll discuss those matrices early next week.
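As a rough illustration of what "how often one amino acid is substituted for another" means in practice, here is a minimal Python sketch that tallies aligned residue pairs. The function name and the toy aligned sequences are made up for illustration. This is only the raw counting step, not Dayhoff's actual PAM procedure.

```python
from collections import Counter


def substitution_counts(aligned_pairs):
    """Count how often each unordered pair of residues is aligned.

    aligned_pairs: iterable of (seq_a, seq_b) strings of equal length,
    with '-' marking gaps. Gap columns are skipped.
    """
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(seq_a, seq_b):
            if a == "-" or b == "-":
                continue  # ignore gapped columns
            # Sort so that (K, R) and (R, K) land in the same cell.
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical toy alignments, not real protein data.
toy_pairs = [("MKVL", "MRVL"), ("MKIL", "MKVL")]
counts = substitution_counts(toy_pairs)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

A real substitution matrix would go further and turn counts like these into scores, but the table of pair counts is the starting point.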

So in terms of asking evolutionary questions, two big thinkers were Russ Doolittle and Carl Woese, analyzing ribosomal RNA sequences to study evolution. And Carl Woese realized, looking at these RNA alignments, that the big split between prokaryotes and eukaryotes was sort of a false split-- that actually, there was a subgroup of single-celled anuclear organisms that were closer to the eukaryotes-- and named them the Archaea. So a whole kingdom of life was recognized, really, by sequence analysis.

And Russ Doolittle also did a lot of analysis of protein sequences and came up with this molecular clock idea, or contributed to that idea, to actually build-- instead of systematics being based on phenotypic characteristics, do it on a molecular level.

So in the '80s, the databases started to expand. Sequence alignment and search became more important. And various people developed fast algorithms to compare protein and DNA sequences and align them. So the FASTA program was widely used. BLAST-- several of the authors of BLAST are shown here-- David Lipman, Pearson, Webb Miller, Stephen Altschul.

The statistics for knowing when a BLAST search result is significant were developed by Karlin and Altschul. And there was also progress in gapped alignment, in particular, Smith-Waterman, shown above. Also progress in RNA secondary structure prediction from Nussinov and Zuker. We'll talk about all of these algorithms during the course.

And there was also development of literature databases. I always liked this picture. Many of you probably used PubMed. But Al Gore was well coached here, by these experts, in how to use it.

And then, in the '90s, computational biology really started to expand. It was driven partly by the development of microarrays, the first genome sequences, and questions like how to identify domains in a protein, how to identify genes in the genome. It was recognized that this family of models from electrical engineering, hidden Markov models, was quite useful for these sorts of sequence labeling problems. That was really pioneered by Anders Krogh here, and David Haussler. And a variety of algorithms were developed that performed these useful tasks.

There was also important progress in the earliest comparative genomic approaches, since you have-- the first genomes were sequenced in the mid '90s, of free living organisms. And so you could then start to compare these genomes and learn a lot. We'll talk a little bit about comparative genomics.

And there was important progress on predicting protein structure from primary sequence. Particularly, David Baker made notable progress on this Rosetta algorithm. So it's a biophysics field, but it's very much part of computational biology as well.

So in the 2000s, definitely, genome sequencing became very fashionable, as you can see here. And the genomes of now larger organisms, including human, it became possible to sequence them. And then, this introduced a huge host of computational challenges in assembling the genomes, annotating the genomes, and so forth. And we'll hear from Professor Gifford about some genome assembly topics. And annotation will come up throughout.

Actually, let's just mention, this is Jim Kent, who's the guru who did the first human genome assembly-- at least, that was widely used-- and also was involved in UCSC. And here, Ewan Birney has started Ensembl and continues to run it today. You know who these other people are, probably.

OK. All right. So in another phase of the last decade, I would say that much of biological research became more high throughput than it was before. So molecular biology had traditionally, in the '80s and '90s, mostly focused on analysis of individual gene or protein products. But now it became possible, and in widespread use, that you could measure the expression of all the genes, in theory-- using microarrays, for example-- and you could start to profile all of the transcripts in the cell, all of the proteins in the cell, and so forth.

And then, a variety of groups started to use some of these high throughput data to study various challenges in gene expression to understand how transcription works, how splicing works, how microRNAs work, translation, epigenetics, and so forth. And you'll hear updates on some of that work in this course. Bioimage informatics, particularly for developmental biology, became popular. It continues to be a new emerging area.

Systems biology was also really born around 2000, roughly. A very prominent example would be the development of the first gene regulatory network models that describe sea urchin development, here, by Eric Davidson, as well as a whole variety of models of other gene networks in the cell that control things like cell proliferation, apoptosis, et cetera.

At the same time, a new field of synthetic biology was born with the development of some of the first completely artificial gene networks that would then program cells to perform desired behavior. So an example would be this so-called repressilator, where you have a network of three transcription factors. Each represses the other. And then, one of them represses GFP.

And you put these into bacteria. And it causes oscillations in GFP expression that are described by these differential equations here. And some of the modeling approaches used in synthetic biology will be covered by Professors Fraenkel and Lauffenburger later.
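The equations on the slide are not reproduced in the transcript, so the following is only a sketch under stated assumptions. It uses the commonly cited dimensionless form of the repressilator model (three mRNA and three protein variables, each gene repressed by the previous protein in the cycle) and integrates it with a simple forward Euler step. The parameter values, initial conditions, and function name are illustrative choices, and the GFP reporter is left out.

```python
def simulate_repressilator(alpha=216.0, alpha0=0.216, beta=5.0, n=2.0,
                           dt=0.01, steps=5000):
    """Forward-Euler sketch of a three-gene ring of repressors.

    Assumed model (dimensionless form, not taken from the lecture slide):
        dm_i/dt = -m_i + alpha / (1 + p_j**n) + alpha0
        dp_i/dt = -beta * (p_i - m_i)
    where protein j = (i - 1) mod 3 represses gene i.
    Returns a list of (p_0, p_1, p_2) protein levels, one per step.
    """
    m = [1.0, 0.0, 0.0]  # asymmetric start so the three genes differ
    p = [0.0, 0.0, 0.0]
    trajectory = []
    for _ in range(steps):
        dm = [-m[i] + alpha / (1.0 + p[(i - 1) % 3] ** n) + alpha0
              for i in range(3)]
        dp = [-beta * (p[i] - m[i]) for i in range(3)]
        m = [m[i] + dt * dm[i] for i in range(3)]
        p = [p[i] + dt * dp[i] for i in range(3)]
        trajectory.append(tuple(p))
    return trajectory


trajectory = simulate_repressilator()
# Print a coarse view of the first protein over time.
for step in range(0, len(trajectory), 500):
    print(round(step * 0.01, 1), round(trajectory[step][0], 2))
```

Forward Euler is used here only because it needs no libraries. A proper ODE solver would be the safer choice for anything beyond a quick look at the dynamics.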

All right. So late 2000s, early 2010s, it's still too early to say, for sure, what the most important developments will be. But certainly, in the late 2000s, next gen sequencing-- which now probably should be called second generation sequencing, since there may be future generations-- really started to transform a whole wide variety of applications in biology, from making genome sequencing-- instead of having to be done in the genome center, now an individual lab can easily do microbial genome sequencing. And when needed, it's possible, also, to do genome sequencing of larger organisms.

Transcriptome sequencing is now routine. We'll hear about that. There are applications for mapping protein-DNA interactions genome wide, including both sequence specific transcription factors as well as more general factors like histones, protein-RNA interactions-- a method called CLIP-Seq-- methods for mapping all the translated messages, the methylated sites in the genome, open chromatin, and so forth.

So many people contributed to this, obviously. I'm just mentioning, Barbara Wold was a pioneer in both RNA-Seq as well as ChIP-Seq. And some of the sequencing technologies that came out here are shown here. And we will discuss those at the beginning of lecture on Thursday.

So I encourage you to read this review here, by Metzker, which covers many of the newer sequencing technologies. And they're pretty interesting. As you'll see, there's some interesting tricks, interesting chemistry and image analysis tricks.

All right. So that was not very scholarly. But if you want a proper history, then this guy, Hallam Stevens, who was a History of Science PhD student at Harvard and recently graduated, wrote this history of bioinformatics.

OK. So let's look at the syllabus. So also posted on the [INAUDIBLE] site is a syllabus. It looks like this. This is quite an information-rich document. It has all the lecture titles, all the due dates of all the problem sets, and so forth. So please print yourself a copy and familiarize yourself with it. So we'll just try to look at a high level first, and then zoom in to the details.

So at a high level, if you look in this column here, we've broken the course into six different topics. OK? So there's Genomic Analysis I, that I'll be teaching, which is more classical computational biology, you could say-- local alignment, global alignment, and so forth. Then, Genomic Analysis II, which Professor Gifford will be teaching, covers some newer methods that are required when you're doing a lot of second generation sequencing-- the standard algorithms are not fast enough, you need better algorithms, and so forth.

And then, I will come back and give a few lectures on modeling biological function. This will have to do with sequence motifs, hidden Markov models, and RNA secondary structure. Professor Fraenkel will then do a unit on proteomics and protein structure. And then, there will be an extended unit on regulatory networks. Different types of regulatory networks will be covered, with most of the lectures by Ernest, one by David, and one by Doug.

And then, we'll finish up with computational genetics, by David. And there will also be some guest lecturers, one of them interspersed in regulatory networks, and then two at the end, from Ron Weiss and George Church.

So I just wanted to point out that in all of these topics, we will include some discussion of motivating questions. So, what are the biological questions that we're seeking to address with these approaches? And there will also be some discussion of the experimental method.

So for example, in the first unit, it's heavy on sequence analysis. So we'll talk about how sequencing is done, and then, quite a bit about the interaction between the experimental technology and the computational analysis, which often involves statistical methods for estimating the error rate of the experimental method, and things like that. So the emphasis is on the computational part, but we'll have some discussion about experiments.

Everyone with me so far? Any questions? OK.

All right. So what are some of these motivating questions that we'll be talking about? So what are the instructions encoded in our genomes? You can think of the genome as a book. But it's in this very strange language. And we need to understand the rules, the code that underlies a lot of research in gene expression.

How are chromosomes organized? What genes are present-- so tools for annotating genomes. What regulatory circuitry is encoded? You'd like to be able to eventually look at a genome, understand all the regulatory elements, and be able to predict that there's some feedback circuit there that responds to-- a particular stimulation that responds to light, or nutrient deprivation, or whatever it might be.

Can the transcriptome be predicted from the genome? This is a longstanding question. The translatome, if you will-- well, let's say, the proteome can be predicted from the transcriptome in the sense that we have a genetic code and we can look up those triplets. So there's a dream that we would be able to model other steps in gene expression with the precision with which the genetic code predicts translation-- that we'd be able to predict where the polymerase would start transcribing, where it will finish transcribing, how a transcript will be spliced, et cetera-- all the other steps in gene expression. And that motivates a lot of work in the field.
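The triplet lookup mentioned above can be sketched in a few lines of Python. This assumes the standard genetic code and a transcript read in frame from its first base. The function name and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G
# (third position varying fastest). '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string in frame, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("AUGGCCUAA"))
```

No comparably simple table exists for where transcription starts or how a transcript is spliced, which is the gap the passage above describes.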

Can protein function be predicted from sequence? So this is a very classical problem. But there are a number of new and interesting developments as a result of a lot of this high throughput data generation, both in nucleic acid sequencing as well as in proteomics.

Can evolutionary history be reconstructed from sequence? Again, this has been a longstanding goal of the field. And a lot of progress has been made here. And now most evolutionary classifications are actually based on molecular sequence at some level. And new species are often defined based on sequence.

OK. Other motivating questions. So what would you need to measure if you wanted to discover the causes of a disease, the mechanisms of existing drugs, metabolic pathways in a micro-organism? So this is a systems biology question.

You've got a new bug. It causes some disease. What should you measure? Should you sequence its genome? Should you sequence its transcriptome? Should you do proteomics? What type of proteomics? Should you perturb the system in some way and do a time series? What are the most efficient ways? What information should be gathered, and in what quantities? And how should that information be integrated in order to come up with an understanding of the physiology of that organism so that, then, you can know where to intervene, what would be suitable drug targets?

Yeah. What kind of modeling would help you to use the data to design new therapies, or even, in a synthetic biology context, to re-engineer organisms for new purposes? So microbes to generate-- to produce fuel, for example, or other useful products. What can we currently measure?

What does each type of data mean individually? What are the strengths and weaknesses of each of the types of high throughput approaches that we have? And how do we integrate all the data we have on a system to understand the functioning of that system? So these are some of the questions that motivate the latter topics on regulatory networks. OK.

So let's now zoom in and look more closely at the course syllabus. I've just broken it into two halves, just so it's more readable. So today we're going over, obviously, course mechanics mostly. On Thursday, we'll cover both some DNA sequencing technologies and we'll talk about local alignment on BLAST. More on that in a bit.

The 6.8047 recitation-- 6.874, thank you-- recitation will be on Friday. The other recitations will start next week. And then, as you can see, we'll move through the other topics. So each of the instructors is going to briefly review their topic. So I won't go through all the titles here.

But please note, on the left side here, that the assignment due dates are marked. OK? And they're all due at noon on the indicated day. And so some of these are problem sets. So for example, problem set 1 will be due on Thursday, February 20 at noon.

And some of the other assignments relate to the project component of the course, which we're going to talk more about in a moment. In particular, we're going to ask you to submit a brief statement of your background and your research interests related to forming teams. So the projects are going to be done in teams of one to five students.

And in order to facilitate especially cross disciplinary teams-- we'd love if you interact with, maybe, students in a different grad program, or whatever-- you'll post your background. You know, I'm a first year BE student and I have a background in Perl programming, but never done Python, or whatever-- something like that. And then, I'm interested in doing systems biology modeling in microbial systems, or something like that.

And then you can match up your interests with others and form teams. And then you'll come up with your own project ideas so that the team and initial idea will be due here, February 25th. Then you'll need to do some aims and so forth. So the project components here, these are only for those taking the grad version of the course. We'll make that clear later. OK.

So after the first three topics here, taught by myself and David, there will be an exam. More on that later. And then there will be three more topics, mostly taught by Ernest. And notice there are additional assignments here related to the project-- so, to research strategy-- and the final written report, additional problem sets.

We'll have a guest lecturer here. This will be Ron Weiss. Then, there will be the second exam. Exams are non-cumulative, so the second exam will just cover these three topics here predominantly. And then there will be another guest lecturer. This would be George Church here.

And then notice, here, presentation. So those who are doing the project component, those teams will be given-- assigned a time to present, to the class, the results of their research. And you'll be graded-- the presentation will be part of the overall project grade assigned by the instructors. But you'll also-- we'll also ask all the students in the class to send comments on the presentations.

So you may find that you get helpful suggestions about interpreting your data from other people and so forth. So that'll be a required component of the course for all students, to attend the presentations and comment on them. And we hope that will be a lot of fun.

OK. So is this the right course for me? So I just wanted to let you know you're fortunate to have a rich selection of courses in computational, systems, and synthetic biology here at MIT. I've listed many of them. Probably not all, but the ones that I'm aware of that are available on-campus.

7.57 is really only for biology grad students. But the other courses listed here are generally open. Some are more geared for graduate students, some more undergrads. Some are more specialized. So for example, Jeff Gore's systems biology course, it's more focused on systems biology whereas our course covers both computational and systems. So keep that in mind. Make sure you're in the right place, that this is what you want.

OK. A few notes on the textbook-- so there is a textbook. It's not required. It's called Understanding Bioinformatics by Zvelebil and Baum. It's quite good on certain topics. But it really only covers about, maybe, a third of what we cover in the course. So there is good content on local alignment, global alignment, scoring matrices-- the topics of the next couple lectures. And I'll point you to those chapters.

But it's very important to emphasize that the content of the course is really what happens in lecture, and on the homeworks, and to some extent, what happens in recitation. And the textbook is just there as a backup, if you will, or for those who would like to get more background on the topic or want to read a different description of that topic. So you decide whether you want to purchase the textbook or not.

It's available at the Coop or through Amazon. Shop around. You can find it. It's paperback. Pretty good general reference on a variety of topics, but it doesn't really have much on systems biology.

All right. Another important reference that was developed specifically for this course a few years ago is the probability and statistics primer. So you'll notice that some of the homeworks, particularly in the earlier parts of the course, will have significant probability and statistics. And we assume that you have some background in this area. Many of you do. If you don't, you'll need to pick that up. And this primer was written to provide those topics, in probability especially, that are foundational and most relevant to computational biology.

So for example, there are some concepts like p-value, probability density function, probability mass function, cumulative distribution function, and then, common distributions, exponential distribution, Poisson distribution, extreme value distribution. If those are mostly sounding familiar to you, that's good. If they're familiar, but you couldn't-- you really don't-- you get binomial and Poisson confused or something, then, definitely, you want to consult this primer.

So I think, looking at the lectures and the homeworks, it should probably become pretty clear which aspects are going to be relevant. And I'll try to point those out when possible. And you can also consult your TAs if you're having trouble with the probability and statistics content.

So we are going to focus, here, on, really, the computational biology, bioinformatics content. And we might briefly review a concept from probability, like, maybe, conditional probability when we talk about Markov chains. But we're not going to spend a lot of time. So if that's the first time you've seen conditional probability, you might be a little bit lost. So you'd be better off reading about it in advance. OK?

Questions? No questions? All right. Maybe it's that the video is intimidating people. OK. All right. The TAs know a lot about probability and statistics and will be able to help you.

OK. So homework-- so I apologize, the font is a little bit small here. So I'll try to state it clearly. So there are going to be five problem sets that are roughly one per topic. Except you'll see p set two covers topics two and three. So it might be a little bit longer.

The way we handle students who have to travel-- so many of you might be seniors. You might be interviewing for graduate schools. Or you might have other conflicts with the course. So rather than doing that on a case by case basis, which, we've found, gets very complicated and is not necessarily fair, the way we've set it up is that the total number of points available on the five homeworks is 120. OK? But the maximum score that you can get is 100.

So if you, for example, were to get 90% on all five of the homeworks, that would be 90% of 120, which would be 108 points. You would get the full 100-- you'd get 100% on your homework. That would be a perfect score on the homework. OK? But because of that-- because there's more points available than you need-- we don't allow you to drop homeworks, or to do an alternate assignment, or something like that.

So the way it works is you can basically miss-- as long as you do well on, say, four of the homeworks-- you could actually miss one without much of a penalty. For example, if each of the homeworks were worth 24 points and you got a perfect score on four of them, that would be 96 points. You would have an almost perfect score on your homework and you could miss that fifth homework.
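The capping rule can be written down as a short Python sketch. The 120 available points and the 100-point cap come from the lecture. The function name is made up, and the equal 24-point split is the lecture's own hypothetical (the actual problem sets will not all carry the same number of points).

```python
MAX_HOMEWORK_SCORE = 100  # cap stated in lecture; 120 points are available


def homework_score(points_earned):
    """Sum the points from all problem sets, capped at 100."""
    return min(sum(points_earned), MAX_HOMEWORK_SCORE)


# 90% on each of five hypothetical 24-point sets: 108 points, capped to 100.
ninety_percent = homework_score([0.9 * 24] * 5)

# Perfect scores on four sets and a missed fifth: 96 points.
missed_one = homework_score([24, 24, 24, 24, 0])

print(ninety_percent, missed_one)
```

Because the cap absorbs the 20 extra points, there is no separate rule for dropping a problem set.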

Now of course, we don't encourage you to skip that homework. We think the homeworks are useful and are a good way to solidify the information you've gotten from lecture, and reading, and so forth. So it's good to do them. And doing the homeworks will help you and perhaps prepare you for the exams. But that's the way we handle the homework policy.

Now I should also mention that not all the homeworks will be the same number of points. We'll apportion the points in proportion to the difficulty and length of the homework assignment. So for example, the first homework assignment is a little bit easier than the others. So it's going to have somewhat fewer points.

So late assignments-- so all the homeworks are due at noon on the indicated day. And if it's within 24 hours after that, you'll be eligible for 50% credit. And beyond that, you don't get any points, in part because the TAs will be posting the answers to the homeworks. OK? And we want to be able to post them promptly so that you'll get the answers while those problems are fresh in your mind. Questions about homeworks? OK. Good.
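The late policy amounts to a three-way rule, sketched here in Python. The function name is hypothetical, the year in the example date is an assumption (the lecture gives only "Thursday, February 20 at noon"), and treating a submission at exactly 24 hours late as still eligible for half credit is also an assumption.

```python
from datetime import datetime, timedelta


def credit_fraction(due, submitted):
    """Fraction of credit a submission is eligible for under the late policy."""
    if submitted <= due:
        return 1.0  # on time
    if submitted - due <= timedelta(hours=24):
        return 0.5  # up to 24 hours late (boundary handling is assumed)
    return 0.0      # later than that: no points


# Problem set 1 deadline; the year 2014 is assumed for illustration.
due = datetime(2014, 2, 20, 12, 0)
print(credit_fraction(due, datetime(2014, 2, 20, 11, 59)))
print(credit_fraction(due, datetime(2014, 2, 20, 18, 0)))
print(credit_fraction(due, datetime(2014, 2, 22, 9, 0)))
```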

So collaboration on problem sets-- so we want you to do the problem sets. You can do them independently. You can work with a friend on them, or even in a group, discuss them together. But write up your solutions independently. You don't learn anything by copying someone else's solution.

And if the TAs see duplicate or near identical solutions, both of those homeworks will get a 0. OK? And this occasionally happens. We don't want this to happen to you. So just avoid that. So discuss together, but write up your solutions separately.

Similarly, with programming, if you have a friend who's a more experienced programmer than you are, by all means, ask them for advice, general things, how should I structure my program, do you know of a function that generates a loop, or whatever it is that you need. But don't share code with anyone else. OK? That would be a no-no. So write up your code independently.

And again, the graders will be looking for identical code. And that will be thrown out. And so we don't want to have any misconduct of that type occurring.

All right. So recitations-- there are three recitation sessions offered each week, Wednesday at 4:00 by Peter, Thursday at 4:00 by Colette, Friday at 4:00 by Tahin. And that is a special recitation that's required for the 6.874 students-- David? Yes-- and has additional AI content. So anyone is welcome to go to that recitation. But those who are taking 6.874 must go.

For students registered for the other versions of the course, going to recitation is optional but strongly recommended. Because the TAs will go over material from the lectures, material that's helpful for the homeworks or for studying for exams-- in the first weeks, Python, probability as well. So go to the recitations, particularly if you're having trouble in the course. So Tahin's recitation starts this week. And Peter and Colette's will start next week.

Question. Yes.

AUDIENCE: Are they covering the same material, Peter and Colette?

PROFESSOR: Peter and Colette will cover similar material and Tahin will cover different material.

Python instruction-- so the first problem set, which will be posted this evening, doesn't have any programming on it. But if you don't have programming experience, you need to start learning it very soon. And so there will be a significant programming problem on p set two. And we'll be posting that problem soon-- this week, some time. And you'll want to look at that and gauge, how much Python do I need to learn to at least do that problem.

So what is this project component that we've been hearing about? So here's a more concrete description. So again, this is only for the graduate versions of the course. So students will-- basically, we've structured it so you work incrementally toward the final research project and so that we can offer feedback and help along the way if needed.

So the first assignment will be next week. I think it's on Tuesday. I'll have more instructions on Thursday's lecture. But all the students registered for the grad version will submit their background and interests for posting on the course website. And then you can look at those and try to find other students who have, ideally, similar interests but perhaps somewhat different backgrounds. It's particularly helpful to have a strong biologist on the team. And perhaps, a strong programmer would help as well.

So then, you will choose your teams and submit a project title and one-paragraph summary, the basic idea of your project. Now we are not providing a menu of research projects. It's your choice-- whatever you want to do, as long as it's related to computational and systems biology.

So it could be analysis of some publicly available data. Could be analysis of some data that you got during your rotation. Or for those of you who are already in labs that you're actually working on, it's totally fine and encouraged that the project be something that's related to your main PhD work if you've started on that. And it could also be more in the modeling, some modeling with MATLAB or something, if you're familiar with that.

But, a variety of possibilities. We'll have more information on this later. But we want you to form teams. You can work independently or with up to four friends in teams of five. And if people want to have a giant team to do some really challenging project, then you can come and discuss with us. And we'll see if that would work.

So then there's the initial title and one-paragraph summary. We'll give you a little feedback on that. And then you'll submit an actual specific aims document-- so with actual, NIH-style, specific aims-- the goal is to understand whether this organism has operons or not, or some actual scientific question-- and a bit about how you will undertake that.

And then you'll submit a longer two-page research strategy, which will include, specifically, we will use these data. We will use this software, these statistical approaches-- that sort of thing. And then, toward the end of the semester, a final written report will be due that'll be five pages.

You'll work on it together, but it'll need to be clear who did what. You'll need an author contribution statement. So and so did this analysis. So and so wrote this section-- that sort of thing. And then, as I mentioned before, there will be oral presentations, by each team, on the last two course sessions. OK? Questions about the projects? There's more information on the course info document online. OK? Good.

Yes, so for those taking 6.874, in addition to the project, there will also be additional AI problems on both the p sets-- and the exam? David? Yes. OK. And they're optional for others. All right. Good.

OK. So how are we going to do the exam? So as I mentioned, there's two 80-minute exams. They're non-cumulative. So the first exam covers, basically, the first three topics. The second exam is on the last three topics. They're 80 minutes. They're during normal class time. There's no final exam.

The grading-- so for those taking the undergrad version, the homeworks will count 36% out of the maximum 100 points. And the exams will count 62%. And then, this peer review, where there's two days where you go, and you listen to presentations, and you submit comments online counts 2%. For the graduate Bio BE HST versions, it's 30% homeworks, 48% exams, 20% project, 2% peer review. For the EECS version, 6.874, 25% homework, 48% exams, 20% project, and then, 5% for these extra AI related problems, and 2% peer review.

And those should add up to 100. And then, in addition, we will reward 1% extra credit for outstanding class participation-- so questions, comments during class. OK. All right.
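The arithmetic behind those weights can be checked with a short sketch. The percentages are the ones stated above; the dictionary keys and function names are illustrative labels, not anything from the course materials, and the 1% extra credit is left out.

```python
# Grading weights (percent) as stated in the lecture.
# The version and component names are illustrative labels only.
GRADING = {
    "undergrad": {"homework": 36, "exams": 62, "peer_review": 2},
    "grad_bio_be_hst": {"homework": 30, "exams": 48, "project": 20, "peer_review": 2},
    "grad_6874": {"homework": 25, "exams": 48, "project": 20,
                  "ai_problems": 5, "peer_review": 2},
}


def total_weight(version):
    """Sum of all component weights for one version of the course."""
    return sum(GRADING[version].values())


def course_score(version, fractions):
    """Weighted score out of 100.

    `fractions` maps a component name to the fraction of its points earned
    (0.0 to 1.0); components that are missing count as zero.
    """
    weights = GRADING[version]
    return sum(weights[c] * fractions.get(c, 0.0) for c in weights)


if __name__ == "__main__":
    for version in GRADING:
        print(version, total_weight(version))
```

Each version sums to 100, which matches the statement that the components should add up.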

So a few announcements about topic one. And then, each of us will review the topics that are coming up. So p set one will be posted tonight. It's due February 20th, at noon. It involves basic microbiology, probability, and statistics. It'll give you some experience with BLAST and some of the statistics associated.

P set two will be posted later this week. So you don't need to, obviously, start on p set two yet. It's not due for several weeks. But definitely, look at the programming problem to give you an idea of what's involved and what to focus on when you're reviewing your Python. Mentioned the probability/stats primer.

Sequencing technology-- so for Thursday's lecture, it will be very helpful if you read the review, by Metzker, on next gen sequencing technologies. It's pretty well written. Covers Illumina, 454, PacBio, and a few other interesting sequence technologies.

Other background reading-- so we'll be talking about local alignment, global alignment, statistics, and similarity matrices for the next two lectures. And chapters four and five of the textbook provide a pretty good background on these topics. I encourage you to take a look.

OK. So I'm just going to briefly review my lectures. And then we'll have David and Ernest do the same. So sequencing technologies will be the beginning of lecture two. And then we'll talk about local ungapped sequence alignment-- in particular, BLAST.

So BLAST is something like the Google search engine of bioinformatics, if you will. It's one of the most widely used tools. And it's important to understand something about how it works, and in particular, how to evaluate the significance of BLAST hits, which are described by this extreme value distribution here.
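To make the idea of hit significance concrete, here is a minimal sketch of the standard Karlin-Altschul form of the extreme value statistics for ungapped local alignments: the expected number of chance hits with score at least S is E = K·m·n·exp(-λS), and the probability of at least one such hit is 1 - exp(-E). The function names are illustrative, and the values of λ and K used in the example are placeholders, since the real values depend on the scoring matrix and sequence composition.

```python
import math


def expected_hits(score, m, n, lam, K):
    """Expected number of chance local alignments scoring >= `score`.

    m, n : lengths of the query and the database (search space is m * n)
    lam, K : scoring-system parameters (placeholders in this sketch)
    """
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, K):
    """Probability of at least one chance hit scoring >= `score`."""
    # expm1 keeps precision when the expected count is very small.
    return -math.expm1(-expected_hits(score, m, n, lam, K))


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    lam, K = 0.3, 0.1
    for s in (30, 50, 70):
        print(s, expected_hits(s, 300, 1_000_000, lam, K),
              p_value(s, 300, 1_000_000, lam, K))
```

The expected count falls off exponentially with score, which is why a modest increase in alignment score can turn a hit from unremarkable to highly significant.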

And then, lecture three, we'll talk about global alignment and introducing gaps into sequence alignments. We'll talk about some dynamic programming algorithms-- Needleman-Wunsch, Smith-Waterman. And in lecture four, we'll talk about comparative genomic analysis of gene regulation-- so using sequence similarity across genomes to infer the location of regulatory elements such as microRNA target sites, other things like that.
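As a preview of the dynamic programming idea, here is a minimal sketch of Needleman-Wunsch global alignment scoring. The match, mismatch, and gap values are assumptions chosen for illustration (a simple linear gap penalty), not the scoring scheme used in the course, and the function returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b.

    Uses a linear gap penalty; scoring values are illustrative defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)

    return score[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("GATTACA", "GATCACA"))
```

Smith-Waterman local alignment uses the same recurrence with two changes: every cell is floored at zero, and the answer is the maximum over the whole matrix instead of the bottom-right cell.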

All right. So I think this is now-- oh sorry, a few more lectures. Then, in the next unit, modeling biological function, I'll talk about the problem of motif finding-- so searching a set of sequences for a common subsequence, or similar subsequences, that possess a particular biological function, like binding to a protein. It's often a complex search space. We'll talk about the Gibbs sampling algorithm and some alternatives.

And then, in lecture 10, I'll talk about Markov and hidden Markov models, which have been called the Legos of bioinformatics, which can be used to model a variety of linear sequence labeling problems. And then, in the last lecture of that unit, I'll talk a little bit about protein-- I'm sorry, about RNA-- secondary structure-- so the base pairing of RNAs-- predicting it from thermodynamic tools, as well as comparative genomic approaches.

And you'll learn about the mfold tool and how you can use a diagram like that to infer that this RNA may have different possible structures that it can fold into, like those shown. All right. So I'm going to pass it off to David here.
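For the hidden Markov models mentioned for lecture 10, a linear sequence labeling problem can be sketched with the Viterbi algorithm, which finds the most probable state path. The two-state model below (AT-rich versus GC-rich regions) and all of its probabilities are invented for illustration and do not come from the lecture.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for a non-empty observation sequence.

    Works in log space to avoid underflow on long sequences.
    """
    # Best log-probability of any path ending in each state after the first symbol.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1

    for x in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            col[s] = best[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][x])
            ptr[s] = prev
        best.append(col)
        back.append(ptr)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: two compositional states over the DNA alphabet.
STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
    "GC": {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45},
}

if __name__ == "__main__":
    print(viterbi("AATTAGGCGC", STATES, START, TRANS, EMIT))
```

The same recurrence applies to any labeling task where each position gets one hidden label and the model only looks one step back.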

PROFESSOR: Thanks very much, Chris. All Right. So I'm David Gifford. And I'm delighted to be here.

It's really a wonderfully exciting time in computational biology. And one of the reasons it's so exciting is shown on this slide, which is the production of DNA base sequence per instrument over time. And as you can see, it's just amazingly more efficient as time goes forward.

And if you think about the reciprocal of this curve, the cost per base is basically becoming extraordinarily low. And this kind of instrument allows us to produce hundreds of millions of sequence reads for a single experiment. And thus, not only do we need new computational methods to handle this kind of big data problem, but we also need computational methods to represent the results in computational models.

And so we have multiple challenges computationally. Because modern biology really can't be done outside of a computational framework. And to summarize the way people have adapted these high throughput DNA sequencing instruments, I built this small figure for you.

And you can see that-- Professor Burge will be talking about DNA sequencing next time. And obviously, you can use DNA sequencing to sequence your own genomes, or the genomes of your favorite pet, or whatever you like. And so we'll talk about, in lecture six, how to actually do genome sequencing. And one of the challenges in doing genome sequencing is how to actually find what you have sequenced. And we'll be talking about how to map sequence reads as well.

Another way to use DNA sequencing is to take the RNA species present in a single cell, or in a population of cells, and convert them into DNA using reverse transcriptase. Then we can sequence the DNA and understand the RNA component of the cell, which, of course, either can be used as messenger RNA to code for protein, or for structural RNAs, or for non-coding RNAs that have other kinds of functions associated with chromatin.

In lecture seven, we'll be talking about protein/DNA interactions, which Professor Burge already mentioned-- the idea that we can actually locate all the regulatory factors associated with the genome using a single high throughput experiment. We do this by isolating the proteins and their associated DNA fragments and sequencing the DNA fragments using this DNA sequencing technology.

So briefly, the first thing that we'll look at in lecture five is, given a reference genome sequence and a basket of DNA sequence reads, how do we build an efficient index so that we can either map or align those reads back to the reference genome. That's a very important and fundamental problem. Because if I give you a basket of 200 million reads, we need to build its alignment very, very rapidly, and quickly, and accurately, especially in the context of repetitive elements. Because a genome obviously has many repeats in it. We need to consider how our indexing and searching algorithms are going to handle those sorts of elements.
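The indexing idea can be illustrated with a deliberately simple stand-in: a hash table from every k-mer in the reference to its positions, used to seed exact-match lookups. This is an assumption made for illustration, not the index structure the lecture will cover, and production read mappers typically use more compact indexes and tolerate mismatches. All names below are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k):
    """Return all start positions where `read` matches the reference exactly.

    The first k bases of the read are used as a seed; each seed hit is then
    verified against the full read. Assumes len(read) >= k.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    ref = "ACGTACGTTAGC"
    idx = build_kmer_index(ref, 4)
    print(map_read("ACGTT", ref, idx, 4))  # unique placement
    print(map_read("ACGT", ref, idx, 4))   # repeated sequence, several placements
```

A read that falls entirely inside a repeat returns several candidate positions, which is the ambiguity that mapping algorithms have to handle.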

In the next lecture, we'll talk about how to actually sequence a genome and assemble it. So the fundamental way that we approach genome sequencing given today's sequencing instruments is that we take intact chromosomes at the top, which, of course, are hundreds of millions of bases long, and we shatter them into pieces. And then we sequence size selected pieces in a sequencing instrument. And then we need to put the jigsaw puzzle back together with a computational assembler.
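The jigsaw-puzzle step can be sketched with greedy overlap merging: repeatedly join the two fragments with the longest suffix-prefix overlap. This is one simple strategy chosen for illustration, not necessarily the assembly algorithm covered in the lecture, and it assumes error-free reads all taken from the same strand. The function names and the minimum overlap length are hypothetical.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by longest overlap until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; return the remaining contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))
```

Greedy merging can make wrong joins when a repeat is longer than the reads, because two different fragments then overlap equally well.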

So we'll be talking about assembly algorithms, how they work, and furthermore, how to resolve ambiguities as we put that puzzle together, which often arise in the context of repetitive sequence. And in the next lecture, in lecture seven, we'll be looking at how to actually take those little DNA molecules that are associated with proteins and analyze them to figure out where particular proteins are bound to the genome and how they might regulate target genes. Here we see two different occurrences of OCT4 binding events binding proximally to the SOX2 gene, which they are regulating.

So that's the beginning of our analysis of DNA sequencing. And then we'll look at RNA sequencing-- once again, with going through a DNA intermediate-- and ask the question, how can we look at the expression of particular genes by mapping RNA sequence reads back onto the genome. And there are two fundamental questions we can address here, which is, what is the level of expression of a given gene, and secondarily, what isoforms are being expressed.

The second set of reads you see up on the screen are split reads that cross splice junctions. And so by looking at how reads align to the genome, we can figure out which particular exons are included or excluded from a particular transcript. So that's the beginning of the high throughput biology genomic analysis module. And I'll be returning to talk, later in the term, about computational genetics, which, really, is a way to summarize everything we're learning in the course into an applicable way to ask fundamental questions about genome function, which Professor Burge talked about earlier.

Now we all have about 3 billion bases in our genomes. And as you know, you differ from the individual sitting next to you in about one in every 1,000 base pairs, on average. And so one question is, how do we actually interpret these differences between genomes. And how can we build accurate computational models that allow us to infer function from genome sequence? And that's a pretty big challenge.

So we'll start by asking questions about, what parts of the genome are active and how could we annotate them. So we can use, once again, different kinds of sequencing based assays to identify the regions of the genome that are active in any given cellular state. And furthermore, if we look at different cells, we can tell which parts of the genome are differentially active.

Here you can see the active chromatin during the differentiation of an ES cell into a terminal type over a 50 kilobase window. And the regions of the genome that are shaded in yellow represent regions that are differentially active. And so, using this kind of DNase-seq data and other data, we can automatically annotate the genome with where the regulatory elements are and begin to understand what the regulatory code of the genome is.

So once we understand what parts of the genome are active, we can ask questions about, how do they contribute to some overall phenotype. In our next lecture, lecture 19, we'll be looking at how we can build a model of a quantitative trait based upon multiple loci and the particular alleles that are present at those loci. So here you see an example of a bunch of different quantitative trait loci that are contributing to the growth rate of yeast in a given condition.
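One simple way to picture a multi-locus model of a quantitative trait is an additive one, where each locus contributes a fixed effect depending on the allele present. This is an assumed model form for illustration only; the locus names, effect sizes, and baseline below are invented and are not the yeast data shown on the slide.

```python
def predict_trait(genotype, effects, baseline=0.0):
    """Additive prediction of a quantitative trait.

    genotype : maps locus name -> allele code (here 0 or 1)
    effects  : maps locus name -> change in the trait when the allele code is 1
    """
    return baseline + sum(effects[locus] * genotype[locus] for locus in effects)


if __name__ == "__main__":
    # Hypothetical effect sizes on growth rate, in arbitrary units.
    effects = {"locus_a": 0.5, "locus_b": -0.25}
    for genotype in ({"locus_a": 0, "locus_b": 0},
                     {"locus_a": 1, "locus_b": 0},
                     {"locus_a": 1, "locus_b": 1}):
        print(genotype, predict_trait(genotype, effects, baseline=1.0))
```

In practice the effect sizes are unknown and have to be estimated from many individuals with measured genotypes and phenotypes, which is where the significance question comes in.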

So part of our exploration during this term will be to develop computational methods to automatically identify regions of the genome that control such traits and to assign them significance. And finally, we'd like to put all this together and ask a very fundamental question, which is, how do we assign variations in the human genome to differential risk for human disease. And associated questions are, how could we assign those variants to what best therapy would be applied to the disease-- what therapeutics might be used, for example.

And here we have a bunch of results from genome wide association studies, starting at the top with bipolar disorder. And at the bottom is type 2 diabetes. And looking along the chromosomes, we're asking which locations along the genome have variants that are highly associated with these particular diseases in these so-called Manhattan plots. Because the things that stick up look like buildings.

And so these sorts of studies are yielding very interesting insights into variants that are associated with human disease. And the next step, of course, is to figure out how to actually prove that these variants are causal, and also, to look at mechanisms where we might be able to address what kinds of therapeutics might be applied to deal with these diseases.

So those are my two units. Once again, high throughput genomic analysis, and secondarily, computational genetics. And finally, if you have any questions about 6.874, I'll be here after lecture. Feel free to ask me. Ernest.

PROFESSOR: Thank you very much, Dave. All right. So in the preceding lectures, you'll have heard a lot about the amazing things we can learn from nucleic acid sequencing. And what we're going to do in the latter part of the course is look at other modalities in the cell. Obviously, there's a lot that goes on inside of cells that's not taking place at the level of DNA/RNA. And so we're going to start to look at proteins, protein interactions, and ultimately, protein interaction networks.

So we'll start with the small scale, looking at intermolecular interactions of the biophysics-- the fundamental biophysics of a protein structure. Then we'll start to look at protein-protein interactions. And as I said, the final level will be networks.

There's been an amazing advance in our ability to predict protein structure. So it's always been the dream of computational biology to be able to go from the sequence of a gene to the structure of the corresponding protein. And ever since Anfinsen, we know that, at least, that should be theoretically possible for a lot of proteins. But it's been computationally virtually intractable until quite recently. And a number of different approaches have allowed us to predict protein structure.

So this slide shows, for example-- I don't know which one's which-- but in blue, perhaps, prediction, and in red, the true structure, or the other way around. But you can see, it doesn't matter very much. We're getting extremely accurate predictions of small protein structure.

There are a variety of approaches here that live on a spectrum. On the one hand, there's the computational approach that tries to make special purpose hardware to carry out the calculations for protein structure. And that's been wildly successful. On the other end of the extreme, there's been the crowd sourcing approach to have gamers try to predict protein structure. And that turns out to be successful. And there's a lot of interesting computational approaches in the middle as well. So we'll explore some of these different strategies.

Then, there's the ability to go from just seeing a single protein structure to seeing how these proteins interact. So now there are pretty good algorithms predicting protein-protein interactions as well. And that allows us, then, to figure out not only how these proteins function individually, but how they begin to function as a network.

Now one of the things that's going to be touched on in the early parts of the course is protein-DNA interactions through sequencing approaches. We want to now look at them at a regulatory level. And could you reconstruct the regulation of a genome by predicting the protein-DNA interactions? And there's been a lot of work here.

We talked earlier about Eric Davidson's pioneering approaches, a lot of interesting computational approaches as well, that go from the relatively simple models we saw on those earlier slides to these very complicated networks that you see here. We'll look at what these networks actually tell us, how much information is really encoded in them. They are certainly pretty, whatever they are.

We'll look at other kinds of interaction networks. We'll look at genetic interaction networks as well, and perhaps, some other kinds. And finally, we'll look at computable models. And what do we mean by that? We mean models that make some kind of very specific prediction, whether it's a Boolean prediction-- this gene will be on or off-- or perhaps, an even more quantitative prediction. So we'll look at logic based modeling, and probably, Bayesian networks as well.
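A Boolean prediction of the kind described here can be sketched as a small logic-based network in which every gene is on or off and all genes update together at each step. The three-gene circuit, its rules, and the gene names are hypothetical, chosen only to show what an on/off prediction looks like.

```python
def step(state, rules):
    """Synchronous update: every gene's next value is computed from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(state, rules, n_steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        trajectory.append(state)
    return trajectory


# Hypothetical circuit: a signal turns on an activator, which turns on both
# the target and, one step later, a repressor that shuts the target off.
RULES = {
    "signal": lambda s: s["signal"],
    "activator": lambda s: s["signal"],
    "repressor": lambda s: s["activator"],
    "target": lambda s: s["activator"] and not s["repressor"],
}

if __name__ == "__main__":
    start = {"signal": True, "activator": False, "repressor": False, "target": False}
    for t, state in enumerate(simulate(start, RULES, 4)):
        print(t, state)
```

With the signal held on, this model predicts that the target switches on briefly and then off again, which is a specific, testable claim of the sort a computable model is meant to make.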

So that's been a very whirlwind tour of the course. Just to go back to the mechanics, make sure you're signed up for the right course. The undergraduate versions of the course do not have a project. And they certainly do not have the artificial intelligence problems.

Then, there are the graduate versions that have the project, but do not have the AI, and finally, the 6.874, which has both. So please be sure you're in the right class so you get credit for the right assignments. And finally, if you've got any questions about the course mechanics, we have a few minutes. We can talk about it here, in front of everybody. And then, we'll all be available after class, a little bit, to answer questions.

So please, any questions? Yes.

AUDIENCE: Can undergrads do a project?

PROFESSOR: Can undergrads do a project? Well, they can sign up for the graduate version, absolutely. Other questions? Yes.

AUDIENCE: Yeah, I actually can't make either of those sessions.

PROFESSOR: Well, send an email to the staff list and we'll see what we can do. Other questions? Yes.

AUDIENCE: 6.877 doesn't exist anymore-- Computational Evolutionary Biology, the class.

PROFESSOR: Oh, the thing we listed as alternative classes? Sorry.

PROFESSOR: We'll fix that.

AUDIENCE: Really? I would be so happy.

[LAUGHTER]

PROFESSOR: We'll delete it from our slides, yes. We are powerful, but not that powerful. Other questions, comments, critiques? Yes, in the back.

AUDIENCE: Are each of the exams equally weighted?

PROFESSOR: Are each of the exams equally weighted? Yes, they are. Yes.

AUDIENCE: If we're in 6.874, we have the additional AI problems. Does that mean that we have more questions on the exam, but just as much time to do them?

PROFESSOR: The question was, if you're in 6.874 and you have the additional AI questions on the exam, does that mean you have more questions, but the same amount of time. And I believe the answer to that is yes.

PROFESSOR: We may revisit that question.

PROFESSOR: We will revisit that question at a later date. But course six students are just so much smarter than everyone else, right? Yes.

AUDIENCE: Can we switch between different versions of the class by the add deadline?

PROFESSOR: I'm sorry, could you say that again?

AUDIENCE: Can we switch between versions of the class by the add deadline?

PROFESSOR: Can you switch between different versions of class by the add/drop deadline? In principle, yes. But if you haven't done the work for that, then you will be in trouble. And there may not be a smooth mechanism for making up missed work that late. Because the add/drop deadline is rather late.

So I'd encourage you, if you're considering doing the more intensive version or not, sign up for the more intensive version. You can always drop back. Once we form groups though, for the projects, then, obviously, there's an aspect of letting down your teammates. So think carefully, now, about which one you want to join. It'll be hard to switch between them. But we'll do whatever the registrar requires us to do in terms of allowing it. Other questions? Yes.

AUDIENCE: If the course sites remain separate, can we undergrads access the AI problems just for fun, to look at them?

PROFESSOR: Yes. If the course sites remain separate, will undergrads be able to access the AI problems? The answer is yes. Yes.

AUDIENCE: Should students in 6.874 attend both recitations?

PROFESSOR: Should students in 6.874 attend both recitations? Not necessary. You're welcome to both if you want. But the course six recitation should be sufficient. Other questions? Yes.

AUDIENCE: So what's covered in the normal recitations?

PROFESSOR: What's covered in the normal recitations? It'll be reinforcing material that's in the lectures. So the goal is not to introduce any new material in the course 20, course seven recitations. Just to clarify, not to introduce any new material. Other questions?

OK. Great. And we'll hang around a little bit to answer any remaining questions when they come up.