Computational Personal Genomics - Week 2 Lab

1 Introduction

In this lab we will look at the effects of sample size, trait penetrance, relative risk, and noise on GWAS results. We'll do this by generating simulated GWAS studies under various models and seeing what kind of results we can get (i.e., can we beat the genome-wide significance threshold and get published?).

This is meant to be a thought-provoking exercise only; the simulated data do a fairly poor job of representing real GWAS data. Most notably, they have no LD or population substructure. Additionally, the method for generating SNP profiles from risk profiles is a little ad hoc. Better methods exist, but they don't interface well with R (or with non-Linux machines, for that matter); for those of you bold enough to take on an extra challenge, see the HapGen2 software at https://mathgen.stats.ox.ac.uk/genetics_software/hapgen/hapgen2.html to try out a real GWAS simulation algorithm and generate more realistic data.

At the end of the lab, we will pose some questions about how these parameters affect our model, and see if you can design a realistic but underpowered study.

2 Function definitions

2.1 General setup

options(stringsAsFactors=FALSE)


2.2 Function to create individuals

This function creates a set of individuals.

n
Number of individuals
n.case
How many have the phenotype
penetrance
The trait's penetrance, i.e. the probability that a carrier actually shows the phenotype.
background.proportion.phenotype
What's the chance of a random person having this trait?
gen_individuals <- function(n,n.case=floor(n/2),penetrance=1,
                            background.proportion.phenotype=.05) {
  ## all cases are carriers
  ## (1 - penetrance) * background.proportion.phenotype * n.controls...
  ## are controls that are carriers
  control.carriers <- round((1-penetrance) *
                            background.proportion.phenotype * (n-n.case))
  cat("Control carriers: ",control.carriers,"\n")
  ind.df <- data.frame(
                       subj.id=paste("subj_",1:n,sep=""),
                       case=FALSE,
                       carrier=FALSE
                       )
  rownames(ind.df) <- ind.df[,"subj.id"]
  ind.df[1:n.case,'case'] <- TRUE
  ind.df[which(ind.df[,'case']==TRUE),'carrier'] <- TRUE
  ## which ones are control carriers?
  if(control.carriers > 0) {
    rn <- sample(rownames(ind.df[which(ind.df[,'case']==FALSE),]),
                 control.carriers,replace=FALSE)
    ind.df[rn,'carrier'] <- TRUE
  }
  return(ind.df)
}
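
As a quick check (purely illustrative, not part of the lab), you can call the function with made-up numbers and confirm that the carrier counts match the formula in the comments above: with 200 subjects, 100 cases, penetrance 0.5, and a background phenotype rate of 0.1, we expect round(0.5 * 0.1 * 100) = 5 control carriers, and all 100 cases should be carriers.

## illustrative check of gen_individuals (hypothetical parameter values)
check.df <- gen_individuals(200,n.case=100,penetrance=0.5,
                            background.proportion.phenotype=0.1)
table(check.df[,'case'],check.df[,'carrier'])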


2.3 Function to create SNPs

n.snps
How many SNPs to simulate
n.risk.snps
How many of them are actually associated with the disease
risk
What's the risk associated with the allele? This should be a vector n.risk.snps long giving the risks.
MAF
The minor allele frequency. This can be a scalar or a vector n.snps long.
gen_snps <- function(n.snps,n.risk.snps,risk=NULL,MAF=NULL,
                     chr=10,start=10^6,stop=10^7,noise=1) {
  if(is.null(MAF)) {
    MAF <- runif(n.snps,0,.25)
  }
  if(length(MAF)==1) {
    MAF <- rep(MAF,n.snps)
  }
  if(is.null(risk)) {
    risk <- sample(4,n.risk.snps)
  }
  if(length(risk)==1) {
    risk <- rep(risk,n.risk.snps)
  }
  snp.df <- data.frame(
                       snp.id=paste("snp_",1:n.snps,sep=''),
                       MAF=MAF,
                       risk=1,
                       chr=chr,
                       pos=ceiling(seq(start,stop,length=n.snps))
                       )
  rownames(snp.df) <- snp.df[,'snp.id']
  risk.snps <- sample(rownames(snp.df),n.risk.snps)
  for(i in 1:n.risk.snps) {
    risk.snp <- risk.snps[i]
    snp.df[risk.snp,'risk'] <- risk[i]
  }
  ## throw in a little noise to make things interesting
  snp.df[,'risk'] <- jitter(snp.df[,'risk'],factor=noise)

  return(snp.df)
}
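
To see what the function produces, you can generate a small set of SNPs and look at the risk column; this is just an illustration with arbitrary parameter values.

## illustrative check of gen_snps (arbitrary parameter values)
check.snps <- gen_snps(10,2,risk=3,MAF=.1)
check.snps[order(check.snps[,'risk'],decreasing=TRUE),c('MAF','risk','pos')]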


2.4 Function to create simulated data set

Create the matrix of genotypes. This basically proceeds by creating the haplotypes and then combining them.

snps.df
Data frame of SNPs generated by gen_snps.
subjects.df
Data frame of subjects generated by gen_individuals.
## Just a helper function for IDs
parse_subj_id_from_haplotype_id <- function(haplotype_ids) {
  sapply(strsplit(haplotype_ids,".",fixed=TRUE),function(x){x[1]})
}

## generate the matrix of genotypes
gen_genotypes <- function(snps.df,subjects.df) {
  n.snps <- nrow(snps.df)
  n.subjects <- nrow(subjects.df)

  ## generate the matrix of haplotypes that underlies the genotypes
  haplotype.matrix <- matrix(data=0,nrow=n.subjects*2,ncol=n.snps)
  rownames(haplotype.matrix) <- unlist(lapply(rownames(subjects.df),paste,c(1,2),sep="."))
  subjid.haprows.vector <- parse_subj_id_from_haplotype_id(rownames(haplotype.matrix))
  colnames(haplotype.matrix) <- rownames(snps.df)

  ## determine which of rows in the haplotype matrix belongs to a carrier
  carrier.subjid <- subjects.df[which(subjects.df[,'carrier']),'subj.id']
  noncarrier.subjid <- subjects.df[which(!(subjects.df[,'carrier'])),'subj.id']
  hap.carrier.idx <- which(subjid.haprows.vector %in% carrier.subjid)
  hap.noncarrier.idx <- which(subjid.haprows.vector %in% noncarrier.subjid)

  ## initially all loci are set to the major allele (0)
  ## decide which ones to set to 1 based on MAF (minor allele
  ## frequency) and risk
  ## This next block loops over each SNP setting the minor allele
  ## in the matrix
  for(i in 1:n.snps) {
    MAF <- snps.df[i,'MAF']
    risk <- snps.df[i,'risk']
    ## weight the selection of the minor allele by its overall
    ## frequency and by its risk
    minor.weight.carrier <- MAF * risk
    if(minor.weight.carrier > 1) {
      warn("Given this minor allele frequency, the risk factor seems too high")
      minor.weight.carrier <- 1
    }
    minor.weight.noncarrier <- MAF
    ## sample from the major and minor alleles given these weights
    res.carrier <- rbinom(length(hap.carrier.idx),1,minor.weight.carrier)
    res.noncarrier <- rbinom(length(hap.noncarrier.idx),1,minor.weight.noncarrier)
    haplotype.matrix[hap.carrier.idx,i] <- res.carrier
    haplotype.matrix[hap.noncarrier.idx,i] <- res.noncarrier
  }

  ## now we have a matrix of haplotypes, combine them into the genotypes
  genotype.matrix <- matrix(data=0,nrow=n.subjects,ncol=n.snps)
  rownames(genotype.matrix) <- rownames(subjects.df)
  colnames(genotype.matrix) <- rownames(snps.df)
  for(subj.id in rownames(genotype.matrix)) {
    mat.sub <- haplotype.matrix[which(subjid.haprows.vector==subj.id),]
    genotype.matrix[subj.id,] <- apply(mat.sub,2,sum)
  }

  return(genotype.matrix)
}

3 Creating the simulated data set

First, we generate the subjects. In this example we'll use 500 subjects with 250 cases and assume full penetrance of the trait.

n.subjects <- 500
n.cases <- 250
penetrance <- 1
subjects.df <- gen_individuals(n.subjects,n.cases,penetrance)

Next, we generate the SNPs. We'll use 5000 SNPs, 5 of which will be risk-associated, with relative risks of 2, 2.5, 3, 3.5, and 4. On its own, the function will fill in random minor allele frequencies (MAFs) between 0 and 0.25. By default the function will put all of the SNPs on chr 10 between 1 and 10 Mb. We'll go in after the fact and give the SNPs with an elevated risk a special color so we can find them more easily later on.

n.snps <- 5000
n.risk.snps <- 5
risk <- seq(2,4,by=.5)
snps.df <- gen_snps(n.snps,n.risk.snps,risk=risk,MAF=NULL)
snps.df[,'col'] <- 'black'
snps.df[which(snps.df[,'risk'] > 1.5),'col'] <- 'red'

Finally, we'll create the genotype matrix. This will be a matrix of integers, with 0 meaning homozygous for the major allele, 1 meaning heterozygous, and 2 meaning homozygous for the minor allele.

genotype.matrix <- gen_genotypes(snps.df,subjects.df)

## Take a look at the allele distribution:
table(as.vector(genotype.matrix))



Note that since we assigned variants to haplotypes independently at random, the genotype distribution of each SNP should be approximately in HWE (Hardy-Weinberg equilibrium).
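
If you want to convince yourself of this, a rough sketch (not required for the lab) is to compare the observed genotype counts at one SNP against the counts expected under HWE given its observed allele frequency; for one of the few risk SNPs the pooled counts may deviate a bit, since carriers and non-carriers have different allele frequencies.

## rough HWE check at a single SNP (illustrative)
g <- genotype.matrix[,'snp_1']
q <- mean(g)/2                                  # observed minor allele frequency
observed <- table(factor(g,levels=0:2))
expected <- length(g) * c((1-q)^2, 2*q*(1-q), q^2)
rbind(observed=as.vector(observed),expected=round(expected,1))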

4 Running the GWAS

Next, we'll run the GWAS by testing each SNP individually. We could do this ourselves, but the snpStats package provides a very nice framework to automate the process. Replicating its results yourself would be one of those "good for your soul" exercises.

4.1 Install snpStats

source("http://www.bioconductor.org/biocLite.R")
biocLite("snpStats")

4.2 Do the single SNP statistical tests

This will carry out the Cochran-Armitage test on each SNP individually. For more info see http://en.wikipedia.org/wiki/Cochran-Armitage_test_for_trend.

library("snpStats")
genotype.sm <- as(genotype.matrix,"SnpMatrix")
results.sst <- single.snp.tests(subjects.df[,'case'],snp.data=genotype.sm)

## let's take a look:
summary(results.sst)
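
Before plotting anything, it can be reassuring to peek at the p-values of the five SNPs we deliberately made risk-associated; a quick sketch (p.value is used the same way in the next section):

## p-values of the designated risk SNPs (illustrative check)
risk.snp.ids <- rownames(snps.df)[which(snps.df[,'risk'] > 1.5)]
p.value(results.sst,df=1)[risk.snp.ids]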

5 Examining the results

Now that we have our results, we can look across our region of interest (in this case chr10:1-10 Mb) to find out where any risk loci might be, using what is known as a Manhattan plot. We can also check the distribution of the p-values we obtained against what we'd expect under a null model using a Q-Q plot. Often the points that deviate from the diagonal in the Q-Q plot indicate our putative hits, although too much deviation indicates that we didn't do an appropriate job modeling the null hypothesis (e.g., we allowed some population stratification).

5.1 Manhattan plot

A Manhattan plot is generally a scatter plot with genomic coordinates along the x axis and -log10(p-value) along the y axis. Often the points are colored according to chromosome, but since we only simulated one chromosome we'll instead use color to pick out the risk alleles that the gen_snps function designated.

p1 <- p.value(results.sst,df=1)
plot(snps.df[,'pos'],-log10(p1),pch=19,cex=.5,col=snps.df[,'col'],
     main="Manhattan plot",xlab="Chr 10",ylab="-log10(p-value)",
     sub="Color indicates the alleles we created with elevated risk")
abline(h=8,lty=2)


One thing that's important to note about this plot is that because my simulation functions don't simulate LD (you could think of them as only generating one SNP per LD block), there's no local inflation of p-values around the risk-associated SNPs. If this were a real GWAS, this lack of linkage between the risk alleles and nearby SNPs would force us to wonder about the quality of the data, and perhaps about artifactual associations, since we know any strong effect should be in LD with other strong, if somewhat muted, effects.

The horizontal line at 10^-8 shows an often-used, highly conservative cutoff for genome-wide significance.
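
To see how many SNPs clear that cutoff in your particular simulation (the count will vary from run to run), you can tally them directly; a minimal sketch:

## which SNPs beat genome-wide significance, and are they the planted ones?
hits <- which(p1 < 1e-8)
length(hits)
snps.df[names(p1)[hits],c('risk','pos','col')]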

5.2 Q-Q plot:

A Q-Q plot shows the distribution of the p-values we obtained above relative to the distribution of p-values expected under the null hypothesis. This is a popular way to gauge at what point in the ranked list of SNPs the associations begin to become meaningful. It is also a good way to check for underlying population problems in your sample: a Q-Q plot that shows broadly inflated p-values tends to indicate that there's some population substructure throwing off your results (since we only expect a small fraction of the SNPs to be associated with any given disease).

chi2 <- chi.squared(results.sst,df=1)
qq.chisq(chi2,df=1,pch=19,cex=.5)
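
A common single-number summary of a Q-Q plot is the genomic inflation factor lambda: the median of the observed chi-squared statistics divided by the median expected under the null (about 0.455 for 1 df). Values much above 1 suggest stratification or other confounding; here it should come out close to 1. A minimal sketch:

## genomic inflation factor; ~1 is expected for a well-behaved study
lambda <- median(chi2,na.rm=TRUE)/qchisq(.5,df=1)
lambda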

Here's how to make a similar plot de novo with our risk alleles picked out in red:

p1.o <- sort(p1)
snp.o <- snps.df[names(p1.o),]

plot(-log10((1:length(p1.o))/length(p1.o)),-log10(p1.o),
     pch=19,cex=.5,col=snp.o[,'col'],main="QQ plot",
     xlab="Expected",ylab="Observed")
abline(a=0,b=1,lty=2)

6 Assignment

For this week's assignment, get the above code to run, and then play around with varying the parameters such as noise, number of disease-associated SNPs, number of subjects, and risk factors. See what you would have to do to this data set to make the SNP associations look very strong (for example, try adding more people). Additionally, send us a plot of an underpowered GWAS and tell us what parameters you used to make it. Please try to make the simulation reasonable (1 case and 1 control doesn't count, for example). For inspiration you can take a look at some published GWAS studies which uncovered only weak effects and try to replicate their parameters. Does this alter the way you think about GWAS studies? If you looked at an existing study, what would you recommend they do to make their results stronger?
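
To make that parameter exploration less tedious, you may want to wrap the whole pipeline in a single function and rerun it with different settings; here is a rough sketch (the parameter values shown are arbitrary placeholders, not a suggested answer):

## wrap the simulation + test pipeline so parameters are easy to vary
run_gwas_sim <- function(n.subjects,n.cases,penetrance,n.snps,n.risk.snps,risk) {
  subjects.df <- gen_individuals(n.subjects,n.cases,penetrance)
  snps.df <- gen_snps(n.snps,n.risk.snps,risk=risk)
  genotype.matrix <- gen_genotypes(snps.df,subjects.df)
  results.sst <- single.snp.tests(subjects.df[,'case'],
                                  snp.data=as(genotype.matrix,"SnpMatrix"))
  list(snps=snps.df,p=p.value(results.sst,df=1))
}

## e.g. a smaller study with weaker effects
sim <- run_gwas_sim(200,100,.8,5000,5,risk=rep(1.2,5))
min(sim$p,na.rm=TRUE)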

Simulating this data turned out to be a little bit more complicated than we had imagined. One idea for a potential final project would be to take this method or HAPGEN and compare it to a simulation where you begin with a small pool of haplotypes and attempt to simulate human evolution by simulating recombination and bottleneck events, and to see which method gets you closer to what is actually observed in today's human population.

For extra credit, critique the method for generating the risk alleles in the carrier population.
